A few questions about time series and the delay function
• Hello all,

I am new to Eureqa, so I apologize if the questions I'm about to ask are trivial. First, let me describe what I am working on. I have measured inputs and outputs in 1 Hz time series format: each row is the measurement for that second and each column is a different parameter. As the output of my data is most certainly correlated with the history of at least some of the input parameters, I have chosen to use the 'Delayed Variable' building block. I am also defining a maximum time window of 20 seconds. Since the 'Maximum History Data' option under 'Data Settings' is given in %, I simply computed (20/(total number of data points))*100 = %. I'd just like to verify that this is an appropriate way of setting the maximum time window.
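
For reference, the arithmetic described above can be sketched in Python. This is only an illustration of the percentage calculation, not anything from Eureqa itself; the function name `history_percent` and the dataset length of 3600 rows are made-up assumptions.

```python
def history_percent(window_rows: int, total_rows: int) -> float:
    """Express a history window of `window_rows` rows as a percentage of the dataset.

    Hypothetical helper: it only mirrors (window / total) * 100 from the post.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return window_rows / total_rows * 100.0


# At 1 Hz, one row is one second, so a 20-second window is 20 rows.
# The dataset length (3600 rows, i.e. one hour) is an assumed example value.
print(history_percent(20, 3600))
```

Because the setting is a percentage of the total row count, the value would need to be recomputed whenever rows are added to or removed from the dataset.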

Additionally, in the 'Data Settings' window, there is a check box labeled 'Shuffle rows before splitting'. I haven't been able to find any documentation on it and am interpreting it as 'if checked, Eureqa will randomize the row order, completely compromising the ordinal nature of the data'. Should this check box be unchecked for time series data?

Once Eureqa has run and formulas are generated with terms such as 'delay((X), 11)', does this represent X at (t-11) or does it represent X at (t, t-1, t-2,...,t-11)?

Lastly, as I run my data, Eureqa will output solutions with good r^2 values (~0.749) and then will be 'bested' by a new solution with an r^2 value of -117250 which doesn't quite make sense.  The plot of the predicted value also does not align with the measured output.  I'm assuming this is a software issue but maybe I am setting the problem up incorrectly.

Questions:
-Is (20/(total number of data points))*100 the right method for defining the maximum time window?
-What is 'Shuffle rows before splitting', and should it be unchecked for time series analysis?
-Does 'delay((X), 11)' mean X at (t-11) or X at (t,t-1,t-2,...,t-11)?
-Are the erroneous r^2 values a software issue or a user setup issue?

Tyler

• Hi Tyler,

As long as your time series data is not being shuffled, your use of the maximum history data setting sounds correct: http://formulize.nutonian.com/documentation/eureqa/tutorials/modeling-time-series/#history-limits.

Time series data should not be shuffled, because in a time series problem you want past data to predict future data. There is a data splitting option especially for time series problems ("Split data points for extracting future values"), or you can simply turn off data shuffling in the custom data splitting settings.

If you have 3 rows of data of y:
1
2
3
At the third row, delay(y,1) would be 2 and delay(y,2) would be 1. At the second row, delay(y,1) would be 1.
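
The three-row example above can be reproduced with a small Python sketch. This is not Eureqa's implementation; the function below is a stand-in written for illustration, and returning `None` for rows that lack enough history is an assumption, since the thread does not say how such rows are handled.

```python
from typing import List, Optional


def delay(series: List[float], lag: int) -> List[Optional[float]]:
    """Return, for each row t, the single value of `series` at row t - lag.

    Rows with fewer than `lag` earlier rows get None (an assumption made
    for this sketch only).
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    return [series[t - lag] if t >= lag else None for t in range(len(series))]


y = [1, 2, 3]
print(delay(y, 1))  # third row -> 2, second row -> 1
print(delay(y, 2))  # third row -> 1
```

Read this way, the example suggests that a term such as delay(X, 11) refers to the single value of X at t-11 rather than to the whole range from t back to t-11.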

Can you clarify what you are seeing with the R^2 values? At each level of complexity, Eureqa will replace a solution if a new solution at the same level of complexity is found with a higher accuracy. Are you comparing solutions at the same level of complexity? The solution fit plot tracks the predicted values for the training and validation data sets separately.
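
As background on the negative value reported earlier in the thread: under the standard definition of the coefficient of determination, R^2 = 1 - SS_res/SS_tot, the result is negative whenever a model fits worse than simply predicting the mean of the data. The sketch below uses that textbook definition with made-up numbers; whether Eureqa reports exactly this quantity is an assumption.

```python
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot (textbook definition)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the actual values are constant")
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0]                       # made-up measurements
print(r_squared(actual, [1.0, 2.0, 3.0]))      # perfect fit
print(r_squared(actual, [2.0, 2.0, 2.0]))      # same as predicting the mean
print(r_squared(actual, [10.0, 20.0, 30.0]))   # far worse than the mean: negative
```

A large negative value therefore indicates predictions that are very far from the data on the rows being scored, which could happen, for example, if a solution is scored on a different subset of rows than the one it was fitted to.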

Thanks,
Jess
• Hi Jess,

I am not trying to predict the future with my time series data; I am simply trying to model transient data. In this instance, my data is consistently low in the beginning and has high output events at the end. Training with just the time series data from the beginning would result in a poor fit for the higher output events at the end of the time series. In this case, it would be preferable to train the program using a random selection of the available data, to make sure it is sampling from both low and high output events, and then to validate on the remaining data, which would also contain both low and high output events.

Thanks,

Tyler
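
One generic way to get the random train/validation split described above without losing each row's history is to attach the delayed values to every row first and only then shuffle the rows. The Python sketch below illustrates that idea outside of Eureqa; the function names, the single input `x`, and the 75% training fraction are all assumptions made for the example, and nothing here describes how Eureqa orders these steps internally.

```python
import random
from typing import List, Sequence, Tuple

Row = Tuple[List[float], float]


def lagged_rows(x: Sequence[float], y: Sequence[float], max_lag: int) -> List[Row]:
    """Build rows ([x[t], x[t-1], ..., x[t-max_lag]], y[t]) for every t with full history."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rows: List[Row] = []
    for t in range(max_lag, len(x)):
        features = [x[t - k] for k in range(max_lag + 1)]
        rows.append((features, y[t]))
    return rows


def random_split(rows: List[Row], train_fraction: float, seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Shuffle a copy of the rows and split it into training and validation parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up data: each row already carries its own history, so shuffling is safe.
x = [float(v) for v in range(10)]
y = [2.0 * v for v in x]
train, validation = random_split(lagged_rows(x, y, max_lag=2), train_fraction=0.75)
print(len(train), len(validation))
```

Since neighbouring rows share most of their history, randomly chosen validation rows overlap heavily with training rows, so a validation score obtained this way can look better than the model's accuracy on a genuinely separate recording.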
