tsfresh

functime has rewritten most of the time-series feature extractors from tsfresh in Polars. Approximately 80% of the implementations are optimized lazy queries.

The rest are eager implementations. The overall performance improvement compared to tsfresh ranges from 5x to 50x. Speed-ups depend on the size of the input, the feature, and whether common subplan elimination is invoked (i.e. multiple lazy features are collected together). Moreover, windowed / grouped features in functime can be a further 100x faster than tsfresh.

Usage Example

import numpy as np
import polars as pl

from functime.feature_extraction.tsfresh import (
    approximate_entropy,
    benford_correlation,
    binned_entropy,
    c3,
)

sin_x = np.sin(np.arange(120))

# Pass series directly
entropy = approximate_entropy(
    x=pl.Series("ts", sin_x),
    run_length=5,
    filtering_level=0.0
)

# Lazy operations
features = (
    pl.LazyFrame({"ts": sin_x})
    .select(
        approximate_entropy=approximate_entropy(
            pl.col("ts"),
            run_length=5,
            filtering_level=0.0
        ),
        benford_correlation=benford_correlation(pl.col("ts")),
        binned_entropy=binned_entropy(pl.col("ts"), bin_count=10),
        c3=c3(pl.col("ts"), n_lags=3),
    )
    .collect()
)

`absolute_energy(x)`

Compute the absolute energy of a time series.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`float \| Expr`

`absolute_maximum(x)`

Compute the absolute maximum of a time series.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`float \| Expr`

`absolute_sum_of_changes(x)`

Compute the absolute sum of changes of a time series.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`float \| Expr`

`approximate_entropy(x, run_length, filtering_level, scale_by_std=True)`

Approximate sample entropies of a time series given the filtering level. This only works for Series input right now.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`run_length`	`int`	Length of compared run of data. This is `m` in the Wikipedia article.	required
`filtering_level`	`float`	Filtering level, must be positive. This is `r` in the Wikipedia article.	required
`scale_by_std`	`bool`	Whether to scale the filtering level by the standard deviation of the data. Most applications do this, so it is the default, but some do not.	`True`

Returns:

Type	Description
`float`

`augmented_dickey_fuller(x, n_lags)`

Calculates the Augmented Dickey-Fuller (ADF) test statistic. This only works for Series input right now.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`n_lags`	`int`	The number of lags to include in the test.	required

Returns:

Type	Description
`float`

`autocorrelation(x, n_lags)`

Calculate the autocorrelation for a specified lag.

The autocorrelation measures the linear dependence between a time-series and a lagged version of itself.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`n_lags`	`int`	The lag at which to calculate the autocorrelation. Must be a non-negative integer.	required

Returns:

Type	Description
`float \| Expr`	Autocorrelation at the given lag. Returns None if the lag is less than 0.
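
As a rough illustration of what this feature measures, here is a plain-Python sketch of the sample autocorrelation at a single lag. It is not functime's Polars implementation: the helper name `autocorrelation_ref` is hypothetical, and the normalization (population variance, dividing by `n - n_lags`) follows the tsfresh convention as an assumption.

```python
from statistics import fmean


def autocorrelation_ref(x, n_lags):
    """Sample autocorrelation of x at lag n_lags (hypothetical reference helper)."""
    if n_lags < 0:
        return None  # mirrors the documented behaviour for negative lags
    if n_lags == 0:
        return 1.0
    n = len(x)
    mean = fmean(x)
    # Population variance (ddof=0); this choice is an assumption.
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        return None  # undefined for a constant series
    # Sum of products of centred values that are n_lags steps apart.
    cov = sum((x[i] - mean) * (x[i + n_lags] - mean) for i in range(n - n_lags))
    return cov / ((n - n_lags) * var)
```

An alternating series such as `[1, -1, 1, -1, 1, -1]` is perfectly anti-correlated at lag 1 and perfectly correlated at lag 2.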

`autoregressive_coefficients(x, n_lags)`

Computes coefficients for an AR(n_lags) process. This only works for Series input right now.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`n_lags`	`int`	The number of lags in the autoregressive process.	required

Returns:

Type	Description
`list of float`

`benford_correlation(x)`

Returns the correlation between the first-digit distribution of the input time series and the Newcomb-Benford law distribution.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`float \| Expr`
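
To make the definition concrete, the sketch below computes the same kind of statistic in plain Python: it tallies leading digits, builds the Benford distribution `log10(1 + 1/d)`, and takes the Pearson correlation between the two. The helper names (`first_digit`, `pearson`, `benford_correlation_ref`) are hypothetical, and skipping zeros is an assumption; functime's Polars implementation may treat edge cases differently.

```python
import math


def first_digit(value):
    """Leading non-zero decimal digit of a non-zero number."""
    # Scientific notation always starts with the leading digit, e.g. '1.23e-03'.
    return int(f"{abs(float(value)):.17e}"[0])


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    norm = math.sqrt(
        sum((p - mean_a) ** 2 for p in a) * sum((q - mean_b) ** 2 for q in b)
    )
    return cov / norm


def benford_correlation_ref(x):
    """Correlation between observed first-digit frequencies and Benford's law."""
    digits = [first_digit(v) for v in x if v != 0]  # zeros have no leading digit
    observed = [digits.count(d) / len(digits) for d in range(1, 10)]
    expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
    return pearson(observed, expected)
```

Powers of two are a classic Benford-like sequence, so `benford_correlation_ref([2**k for k in range(100)])` should come out close to 1.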

`benford_correlation2(x)`

Returns the correlation between the first-digit distribution of the input time series and the Newcomb-Benford law distribution. This version may hit floating-point precision issues for some rare numbers.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`Expr`	An expression for benford_correlation representing a float.

`binned_entropy(x, bin_count=10)`

Calculates the entropy of a binned histogram for a given time series.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`bin_count`	`int`	The number of bins to use in the histogram. Default is 10.	`10`

Returns:

Type	Description
`float \| Expr`
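
A minimal plain-Python sketch of the idea, assuming equal-width bins spanning the minimum to the maximum of the series and the natural logarithm. The name `binned_entropy_ref` is hypothetical, and the binning rule is an assumption rather than a statement about functime's internals.

```python
import math


def binned_entropy_ref(x, bin_count=10):
    """Shannon entropy (natural log) of an equal-width histogram of x."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return 0.0  # every value falls in one bin, so there is no uncertainty
    width = (hi - lo) / bin_count
    counts = [0] * bin_count
    for v in x:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bin_count - 1)
        counts[idx] += 1
    n = len(x)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)
```

A series spread evenly over all bins reaches the maximum entropy `log(bin_count)`, while a constant series has entropy 0.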

`c3(x, n_lags)`

Measure of non-linearity in the time series using c3 statistics.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`n_lags`	`int`	The lag that should be used in the calculation of the feature.	required

Returns:

Type	Description
`float \| Expr`
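
For reference, tsfresh defines the c3 statistic as the mean of `x[i + 2*lag] * x[i + lag] * x[i]` over all valid `i`. The plain-Python sketch below follows that definition; the name `c3_ref` is hypothetical, and returning 0.0 when the series is too short is an assumption borrowed from tsfresh.

```python
def c3_ref(x, n_lags):
    """Mean of x[i + 2*lag] * x[i + lag] * x[i] over all valid i."""
    n_terms = len(x) - 2 * n_lags
    if n_terms <= 0:
        return 0.0  # series too short for this lag (assumed convention)
    total = sum(x[i + 2 * n_lags] * x[i + n_lags] * x[i] for i in range(n_terms))
    return total / n_terms
```

For `x = [1, 2, 3, 4, 5]` and `n_lags=1` the three products are 6, 24 and 60, giving a mean of 30.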

`change_quantiles(x, q_low, q_high, is_abs)`

First fixes a corridor given by the quantiles q_low and q_high of the distribution of x. It then returns a list of the changes between consecutive values that both lie within that corridor. The user may optionally take the absolute value of the changes and compute statistics from them. If q_low >= q_high, it returns null.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	A single time-series.	required
`q_low`	`float`	The lower quantile of the corridor. Must be less than `q_high`.	required
`q_high`	`float`	The upper quantile of the corridor. Must be greater than `q_low`.	required
`is_abs`	`bool`	If True, takes absolute difference.	required

Returns:

Type	Description
`list of float \| Expr`
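
The corridor logic can be sketched in plain Python as follows. This is an illustration under stated assumptions, not functime's implementation: the names `quantile_linear` and `change_quantiles_ref` are hypothetical, quantiles use linear interpolation, and both corridor bounds are treated as inclusive.

```python
def quantile_linear(sorted_x, q):
    """Quantile of an already sorted list using linear interpolation."""
    pos = q * (len(sorted_x) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_x) - 1)
    return sorted_x[lower] + (pos - lower) * (sorted_x[upper] - sorted_x[lower])


def change_quantiles_ref(x, q_low, q_high, is_abs):
    """Changes between consecutive values that both lie inside the corridor."""
    if q_low >= q_high:
        return None
    ordered = sorted(x)
    low = quantile_linear(ordered, q_low)
    high = quantile_linear(ordered, q_high)
    inside = [low <= v <= high for v in x]
    changes = []
    for i in range(1, len(x)):
        # Keep a change only when both of its endpoints are in the corridor.
        if inside[i - 1] and inside[i]:
            diff = x[i] - x[i - 1]
            changes.append(abs(diff) if is_abs else diff)
    return changes
```

With `x = [0, 1, 2, 3, 10, 3, 2, 1, 0]` and the corridor `(0.0, 0.8)`, the spike to 10 falls outside the corridor, so the two changes touching it are dropped.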

`cid_ce(x, normalize=False)`

Computes estimate of time-series complexity[^1].

A more complex time series has more peaks and valleys. This feature is calculated as sqrt(sum((x[i+1] - x[i])^2)), where the sum runs over all pairs of consecutive values.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	A single time-series.	required
`normalize`	`bool`	If True, z-normalizes the time-series before computing the feature. Default is False.	`False`

Returns:

Type	Description
`float \| Expr`

`count_above(x, threshold=0.0)`

Calculate the percentage of values above or equal to a threshold.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`threshold`	`float`	The threshold value for comparison.	`0.0`

Returns:

Type	Description
`float \| Expr`

`count_above_mean(x)`

Count the number of values that are above the mean.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`int \| Expr`

`count_below(x, threshold=0.0)`

Calculate the percentage of values below or equal to a threshold.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`threshold`	`float`	The threshold value for comparison.	`0.0`

Returns:

Type	Description
`float \| Expr`

`count_below_mean(x)`

Count the number of values that are below the mean.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`int \| Expr`

`cwt_coefficients(x, widths=(2, 5, 10, 20), n_coefficients=14)`

Calculates a continuous wavelet transform using the Ricker wavelet.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`widths`	`Sequence[int]`	The widths of the Ricker wavelet to use for the CWT. Default is (2, 5, 10, 20).	`(2, 5, 10, 20)`
`n_coefficients`	`int`	The number of CWT coefficients to return. Default is 14.	`14`

Returns:

Type	Description
`list of float`

`energy_ratios(x, n_chunks=10)`

Calculates the sum of squares of each of n_chunks equally segmented parts of the time-series, expressed as a ratio of the sum of squares over the whole series.

Parameters:

Name	Type	Description	Default
`x`	`list of float`	The time-series to be segmented and analyzed.	required
`n_chunks`	`int`	The number of equally segmented parts to divide the time-series into. Default is 10.	`10`

Returns:

Type	Description
`list of float \| Expr`
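
A plain-Python sketch of this feature, reading it as each chunk's share of the total sum of squares (an assumption based on the function name and on the corresponding tsfresh feature). The name `energy_ratios_ref` is hypothetical, and the rule for splitting a series whose length is not divisible by `n_chunks` (earlier chunks get one extra element) is also an assumption.

```python
def energy_ratios_ref(x, n_chunks=10):
    """Share of the total sum of squares contributed by each chunk."""
    total = sum(v * v for v in x)
    if total == 0:
        return None  # ratios are undefined for an all-zero series
    base, extra = divmod(len(x), n_chunks)
    ratios = []
    start = 0
    for k in range(n_chunks):
        # The first `extra` chunks take one additional element.
        size = base + (1 if k < extra else 0)
        chunk = x[start:start + size]
        ratios.append(sum(v * v for v in chunk) / total)
        start += size
    return ratios
```

The ratios always sum to 1, so they describe how the energy of the series is distributed over time.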

`fft_coefficients(x)`

Calculates Fourier coefficients and phase angles of the 1-D discrete Fourier Transform. This only works for Series input right now.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time series.	required
`n_threads`	`int`	Number of threads to use. If None, uses all threads available. Defaults to None.	required

Returns:

Type	Description
`dict of list of floats \| Expr`

`first_location_of_maximum(x)`

Returns the first location of the maximum value of x. The position is calculated relatively to the length of x.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`float \| Expr`

`first_location_of_minimum(x)`

Returns the first location of the minimum value of x. The position is calculated relatively to the length of x.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`float \| Expr`

`fourier_entropy(x, n_bins=10)`

Calculate the Fourier entropy of a time series. This only works for Series input right now.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`n_bins`	`int`	The number of bins to use for the entropy calculation. Default is 10.	`10`

Returns:

Type	Description
`float`

`friedrich_coefficients(x, polynomial_order=3, n_quantiles=30)`

Calculate the Friedrich coefficients of a time series.

Parameters:

Name	Type	Description	Default
`x`	`TIME_SERIES_T`	The time series to calculate the Friedrich coefficients of.	required
`polynomial_order`	`int`	The order of the polynomial to fit to the quantile means. Default is 3.	`3`
`n_quantiles`	`int`	The number of quantiles to use for the calculation. Default is 30.	`30`

Returns:

Type	Description
`list of float`

`harmonic_mean(x)`

Returns the harmonic mean of the expression.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time series.	required

Returns:

Type	Description
`float \| Expr`

`has_duplicate(x)`

Check if the time-series contains any duplicate values.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`bool \| Expr`

`has_duplicate_max(x)`

Check if the time-series contains any duplicate values equal to its maximum value.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`bool \| Expr`

`has_duplicate_min(x)`

Check if the time-series contains duplicate values equal to its minimum value.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`bool \| Expr`

`index_mass_quantile(x, q)`

Calculates the relative index i of time series x where q% of the mass of x lies to the left of i. For example, for q = 50% this feature calculator returns the mass center of the time series.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`q`	`float`	The quantile.	required

Returns:

Type	Description
`float \| Expr`
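
The following plain-Python sketch shows one way to read this definition: accumulate the absolute values of the series and report the first relative position at which the running share of the total reaches `q`. The name `index_mass_quantile_ref` is hypothetical; using absolute values and a 1-based position divided by the length follows tsfresh and is an assumption here.

```python
def index_mass_quantile_ref(x, q):
    """Relative index at which a fraction q of the total mass of |x| is reached."""
    masses = [abs(v) for v in x]
    total = sum(masses)
    if total == 0:
        return None  # no mass to distribute
    running = 0.0
    for i, mass in enumerate(masses):
        running += mass
        if running / total >= q:
            return (i + 1) / len(x)
    return 1.0
```

A flat series reaches half of its mass halfway through, while a series whose mass sits entirely in the last point only reaches it at the end.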

`large_standard_deviation(x, ratio=0.25)`

Checks if the time-series has a large standard deviation: std(x) > ratio * (max(x) - min(x)).

As a heuristic, the standard deviation should be a fourth of the range of the values.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`ratio`	`float`	The ratio of the interval to compare with.	`0.25`

Returns:

Type	Description
`bool \| Expr`

`last_location_of_maximum(x)`

Returns the last location of the maximum value of x. The position is calculated relatively to the length of x.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`float \| Expr`

`last_location_of_minimum(x)`

Returns the last location of the minimum value of x. The position is calculated relatively to the length of x.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`float \| Expr`

`lempel_ziv_complexity(x, n_bins)`

Calculate a complexity estimate based on the Lempel-Ziv compression algorithm.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`n_bins`	`int`	An integer specifying the number of bins to use for discretizing the time series.	required

Returns:

Type	Description
`list of float`

`linear_trend(x)`

Compute the slope, intercept, and RSS of the linear trend.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`Mapping[str, float] \| Expr`

`longest_strike_above_mean(x)`

Returns the length of the longest consecutive subsequence in x that is greater than the mean of x. If all values in x are null, 0 will be returned.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`int \| Expr`
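
A short plain-Python sketch of the run-length logic, assuming an input without nulls. The name `longest_strike_above_mean_ref` is hypothetical, and this is an illustration of the definition rather than functime's Polars query.

```python
from itertools import groupby


def longest_strike_above_mean_ref(x):
    """Length of the longest run of consecutive values strictly above the mean."""
    if not x:
        return 0
    mean = sum(x) / len(x)
    # groupby collapses consecutive equal flags into runs; keep only 'above' runs.
    runs = (
        sum(1 for _ in group)
        for above, group in groupby(v > mean for v in x)
        if above
    )
    return max(runs, default=0)
```

For `[1, 5, 5, 5, 1, 5, 5]` the mean is about 3.86, so the runs above the mean have lengths 3 and 2 and the feature is 3.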

`longest_strike_below_mean(x)`

Returns the length of the longest consecutive subsequence in x that is smaller than the mean of x. If all values in x are null, 0 will be returned.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`int \| Expr`

`mean_abs_change(x)`

Compute mean absolute change.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	A single time-series.	required

Returns:

Type	Description
`float \| Expr`

`mean_change(x)`

Compute mean change.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	A single time-series.	required

Returns:

Type	Description
`float \| Expr`

`mean_n_absolute_max(x, n_maxima)`

Calculates the arithmetic mean of the n absolute maximum values of the time series.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`n_maxima`	`int`	The number of maxima to consider.	required

Returns:

Type	Description
`float \| Expr`

`mean_second_derivative_central(x)`

Returns the mean value of a central approximation of the second derivative.

Parameters:

Name	Type	Description	Default
`x`	`Series`	A time series to calculate the feature of.	required

Returns:

Type	Description
`Series`

`number_crossings(x, crossing_value=0.0)`

Calculates the number of crossings of x on m, where m is the crossing value.

A crossing is defined as two sequential values where the first value is lower than m and the next is greater, or vice-versa. If you set m to zero, you will get the number of zero crossings.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	A single time-series.	required
`crossing_value`	`float`	The crossing value. Defaults to 0.0.	`0.0`

Returns:

Type	Description
`float \| Expr`
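
The crossing definition above translates directly into a comparison of neighbouring values. In this plain-Python sketch the name `number_crossings_ref` is hypothetical, and values exactly equal to the crossing value are treated as being on neither side, which is an assumption drawn from the strict wording of the definition.

```python
def number_crossings_ref(x, crossing_value=0.0):
    """Count neighbouring pairs that lie strictly on opposite sides of crossing_value."""
    return sum(
        1
        for a, b in zip(x, x[1:])
        # The product is negative only when a and b are on opposite sides.
        if (a - crossing_value) * (b - crossing_value) < 0
    )
```

An alternating series such as `[1, -1, 1, -1]` has three zero crossings, while a monotone positive series has none.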

`number_cwt_peaks(x, max_width=5)`

Number of different peaks in x.

To estimate the number of peaks, x is smoothed by a Ricker wavelet for widths ranging from 1 to max_width. This feature calculator returns the number of peaks that occur at enough width scales and with a sufficiently high signal-to-noise ratio (SNR).

Parameters:

Name	Type	Description	Default
`x`	`Series`	A single time-series.	required
`max_width`	`int`	Maximum width to consider.	`5`

Returns:

Type	Description
`float`

`number_peaks(x, support)`

Calculates the number of peaks of at least support n in the time series x. A peak of support n is defined as a subsequence of x where a value occurs, which is bigger than its n neighbours to the left and to the right.

Hence in the sequence

x = [3, 0, 0, 4, 0, 0, 13]

4 is a peak of support 1 and 2 because in the subsequences

[0, 4, 0] [0, 0, 4, 0, 0]

4 is still the highest value. Here, 4 is not a peak of support 3 because 13 is the 3rd neighbour to the right of 4 and it is bigger than 4.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`support`	`int`	Support of the peak	required

Returns:

Type	Description
`int \| Expr`
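
The worked example above can be checked with a small plain-Python sketch. The name `number_peaks_ref` is hypothetical, and only positions with a full set of `support` neighbours on both sides are considered, which is an assumption consistent with the example (the trailing 13 is never counted).

```python
def number_peaks_ref(x, support):
    """Count values strictly greater than their `support` neighbours on each side."""
    count = 0
    # Only positions with `support` neighbours on both sides can be peaks.
    for i in range(support, len(x) - support):
        left = x[i - support:i]
        right = x[i + 1:i + support + 1]
        if all(x[i] > v for v in left) and all(x[i] > v for v in right):
            count += 1
    return count
```

For `x = [3, 0, 0, 4, 0, 0, 13]` this gives one peak for support 1 and 2, and none for support 3.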

`percent_reoccuring_values(x)`

Returns the percentage of values that are present in the time series more than once.

The percentage is calculated as follows:

len(different values occurring more than once) / len(different values)

This means the percentage is normalized to the number of unique values in the time series, in contrast to the percent_reocurring_points function.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`float \| Expr`

`percent_reocurring_points(x)`

Returns the percentage of non-unique data points in the time series. Non-unique data points are those that occur more than once in the time series.

The percentage is calculated as follows:

# of data points occurring more than once / # of all data points

This means the ratio is normalized to the number of data points in the time series, in contrast to the percent_reoccuring_values function.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`float`
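
The difference between the two normalizations is easiest to see side by side. The plain-Python sketch below implements both formulas as written above; the helper names are hypothetical and this is not functime's Polars implementation.

```python
from collections import Counter


def percent_reoccurring_values_ref(x):
    """Distinct values seen more than once, divided by the number of distinct values."""
    counts = Counter(x)
    return sum(1 for c in counts.values() if c > 1) / len(counts)


def percent_reoccurring_points_ref(x):
    """Data points whose value is seen more than once, divided by all data points."""
    counts = Counter(x)
    return sum(c for c in counts.values() if c > 1) / len(x)
```

For `[2, 2, 2, 2, 1]`, one of the two distinct values reoccurs (0.5), while four of the five data points carry a reoccurring value (0.8).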

`permutation_entropy(x, tau=1, n_dims=3, base=math.e)`

Computes permutation entropy.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`tau`	`int`	The embedding time delay which controls the number of time periods between elements of each of the new column vectors.	`1`
`n_dims`	`int, > 1`	The embedding dimension which controls the length of each of the new column vectors	`3`
`base`	`float`	The base for log in the entropy computation	`e`

Returns:

Type	Description
`float \| Expr`
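
To show how `tau` and `n_dims` shape the computation, here is a plain-Python sketch of permutation entropy: each embedded vector is reduced to the ordering of its elements, and the entropy of the resulting pattern frequencies is returned. The name `permutation_entropy_ref` is hypothetical, ties are broken by position (an assumption), and no normalization is applied.

```python
import math
from collections import Counter


def permutation_entropy_ref(x, tau=1, n_dims=3, base=math.e):
    """Entropy of the ordinal patterns of delay-embedded vectors of x."""
    n_vectors = len(x) - (n_dims - 1) * tau
    if n_vectors <= 0:
        return None  # series too short to build a single embedded vector
    patterns = Counter()
    for i in range(n_vectors):
        # Embedded vector: n_dims values spaced tau steps apart.
        window = [x[i + k * tau] for k in range(n_dims)]
        # Ordinal pattern: the indices that would sort the window.
        pattern = tuple(sorted(range(n_dims), key=window.__getitem__))
        patterns[pattern] += 1
    return -sum(
        (c / n_vectors) * math.log(c / n_vectors, base) for c in patterns.values()
    )
```

A strictly increasing series produces a single pattern and therefore zero entropy, while a series that alternates between two values with `n_dims=2` splits evenly between two patterns, giving one bit when `base=2`.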

`range_count(x, lower, upper, closed='left')`

Counts the values of the input that lie between lower and upper. With the default closed='left', lower is inclusive and upper is exclusive.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`lower`	`float`	The lower bound, inclusive	required
`upper`	`float`	The upper bound, exclusive	required
`closed`	`ClosedInterval`	Whether or not the boundaries should be included/excluded	`'left'`

Returns:

Type	Description
`int \| Expr`

`ratio_beyond_r_sigma(x, ratio=0.25)`

Returns the ratio of values in the series that lie more than ratio * std away from the mean, on either side.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required
`ratio`	`float`	The scaling factor for std	`0.25`

Returns:

Type	Description
`float \| Expr`

`ratio_n_unique_to_length(x)`

Calculate the ratio of the number of unique values to the length of the time-series.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`float \| Expr`

`root_mean_square(x)`

Calculate the root mean square.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`float \| Expr`

`sample_entropy(x, ratio=0.2)`

Calculate the sample entropy of a time series. This only works for Series input right now.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	The input time series.	required
`ratio`	`float`	The tolerance parameter. Default is 0.2.	`0.2`

Returns:

Type	Description
`float \| Expr`

`spkt_welch_density(x, n_coeffs=None)`

This estimates the cross power spectral density of the time series x at different frequencies. This only works for Series input right now.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	The input time series.	required
`n_coeffs`	`Optional[int]`	The number of coefficients you want to take. If none, will take all, which will be a list as long as the input time series.	`None`

Returns:

Type	Description
`list of floats`

`sum_reocurring_points(x)`

Returns the sum of all data points that are present in the time series more than once.

For example, sum_reocurring_points(pl.Series([2, 2, 2, 2, 1])) returns 8, as 2 is a reoccurring value, so all 2's are summed up.

This is in contrast to the sum_reocurring_values function, where each reoccurring value is only counted once.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`float \| Expr`

`sum_reocurring_values(x)`

Returns the sum of all values that are present in the time series more than once.

For example, sum_reocurring_values(pl.Series([2, 2, 2, 2, 1])) returns 2, as 2 is a reoccurring value, so it is summed up with all other reoccurring values (there are none), so the result is 2.

This is in contrast to the sum_reocurring_points function, where each reoccurring value is counted as often as it is present in the data.

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time-series.	required

Returns:

Type	Description
`float \| Expr`
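
The two examples above (8 for points, 2 for values) can be reproduced with a small plain-Python sketch. The helper names are hypothetical and the code only illustrates the definitions; it is not functime's Polars implementation.

```python
from collections import Counter


def sum_reoccurring_points_ref(x):
    """Sum every data point whose value appears more than once."""
    counts = Counter(x)
    return sum(value * c for value, c in counts.items() if c > 1)


def sum_reoccurring_values_ref(x):
    """Sum each distinct value that appears more than once, counting it once."""
    counts = Counter(x)
    return sum(value for value, c in counts.items() if c > 1)
```

For `[2, 2, 2, 2, 1]` the first function adds all four 2's, and the second adds the value 2 a single time.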

`symmetry_looking(x, ratio=0.25)`

Check if the distribution of x looks symmetric.

A distribution is considered symmetric if: | mean(X)-median(X) | < ratio * (max(X)-min(X))

Parameters:

Name	Type	Description	Default
`x`	`Series`	Input time-series.	required
`ratio`	`float`	Multiplier on distance between max and min.	`0.25`

Returns:

Type	Description
`bool \| Expr`
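
The symmetry criterion above is a single inequality, sketched here in plain Python. The name `symmetry_looking_ref` is hypothetical; the code mirrors the stated formula and is not functime's Polars implementation.

```python
from statistics import fmean, median


def symmetry_looking_ref(x, ratio=0.25):
    """True when |mean - median| is small relative to the range of x."""
    return abs(fmean(x) - median(x)) < ratio * (max(x) - min(x))
```

A skewed series such as `[0, 0, 0, 0, 100]` has mean 20 and median 0, so it passes with the default `ratio=0.25` (threshold 25) but fails with `ratio=0.1` (threshold 10).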

`time_reversal_asymmetry_statistic(x, n_lags)`

Returns the time reversal asymmetry statistic.

Parameters:

Name	Type	Description	Default
`x`	`Series`	Input time-series.	required
`n_lags`	`int`	The lag that should be used in the calculation of the feature.	required

Returns:

Type	Description
`float \| Expr`
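
For reference, tsfresh defines this statistic as the mean of `x[i + 2*lag]^2 * x[i + lag] - x[i + lag] * x[i]^2` over all valid `i`. The plain-Python sketch below follows that definition; the name `time_reversal_asymmetry_ref` is hypothetical, and returning 0.0 for a series that is too short is an assumption borrowed from tsfresh.

```python
def time_reversal_asymmetry_ref(x, n_lags):
    """Mean of x[i+2l]^2 * x[i+l] - x[i+l] * x[i]^2 over all valid i."""
    n_terms = len(x) - 2 * n_lags
    if n_terms <= 0:
        return 0.0  # series too short for this lag (assumed convention)
    total = sum(
        x[i + 2 * n_lags] ** 2 * x[i + n_lags] - x[i + n_lags] * x[i] ** 2
        for i in range(n_terms)
    )
    return total / n_terms
```

For `x = [1, 2, 3, 4, 5]` and `n_lags=1` the three terms are 16, 36 and 64, so the statistic is 116 / 3. A constant series gives 0.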

`var_gt_std(x, ddof=1)`

Is the variance >= std? In other words, is var >= 1?

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time series.	required
`ddof`	`int`	Delta Degrees of Freedom used when computing var/std.	`1`

Returns:

Type	Description
`bool \| Expr`

`variation_coefficient(x)`

Calculate the coefficient of variation (CV).

Parameters:

Name	Type	Description	Default
`x`	`Expr \| Series`	Input time series.	required

Returns:

Type	Description
`float \| Expr`