Eureqa Desktop User Guide

Preparing Data in Eureqa Desktop

In this view, Eureqa provides several options for the preprocessing of data: you can smooth the data, handle missing values, remove outliers, normalize scale and offset, and apply a filter.

These processes can be applied to a single variable or simultaneously to any combination of variables. To select a variable for processing, click on that variable in the "Variables" window in the upper left. To select multiple variables, drag across or Ctrl-click on the desired variables.

Check the box next to any of the following preprocessing options and the necessary controls will appear.

Smoothing Data

Smoothing can greatly improve both the speed of the search and the likelihood of finding accurate solutions. However, you should smooth your data only if you have reason to believe that the source of the data is somewhat continuous.

To smooth one or more variables, do the following:

  1. Select the variable or variables you want to smooth. (The data will be plotted in the lower window.)
  2. Select an independent variable to smooth along, or just smooth across rows.
  3. (Optional) If you want non-uniform smoothing, where some data points are given more weight than others, you can select (or type in) a variable or expression, and that variable or expression will determine the weight given to each row. Details are in the Row Weight section of the Set Target page.
  4. Set the desired smoothing level or let Eureqa choose it automatically. (Eureqa will choose the setting that gives the best smooth, as determined by generalized cross-validation among cubic B-splines.)

Handling Missing Values

When a row contains values for one or more variables but has an empty cell in the column for one or more other variables, Eureqa can handle the situation in a variety of ways. Choose from among the following options in the drop-down menu labeled "method":

  • Ignore the entire row.
  • Copy value from the previous row.
  • Copy value from the most similar row.
  • Interpolate between rows. (Inserts the mean of the value in the previous row and the value in the next row.)
  • Estimate using other variables. (Linear regression is used to model the variable in terms of the other variables. Missing values are then filled in based on that model. For example, in a data set with variables x and y where y has missing values, y will be modeled as a*x + b, and, in each row with a missing value, the missing value will be filled in by evaluating that expression using the values in that row.)
  • Set to the mean value. (Inserts the mean of all values of that variable.)
  • Set to the median value. (Inserts the median of all values of that variable.)
  • Set to zero.
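
As a rough illustration of what several of these options do to a single column, here is a minimal Python sketch. It is not Eureqa's implementation: the function names fill_missing and fill_by_regression, the use of None to mark an empty cell, and the handling of edge cases (such as a missing value in the first row) are all assumptions made for this example.

```python
from statistics import mean, median

def fill_missing(values, method):
    """Return a copy of `values` with None entries filled by the chosen method."""
    known = [v for v in values if v is not None]
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        if method == "previous":
            # Copy from the previous row (which may itself have been filled).
            filled[i] = filled[i - 1] if i > 0 else None
        elif method == "interpolate":
            # Mean of the previous and next rows; left empty if either is missing.
            prev_v = values[i - 1] if i > 0 else None
            next_v = values[i + 1] if i + 1 < len(values) else None
            if prev_v is not None and next_v is not None:
                filled[i] = (prev_v + next_v) / 2
        elif method == "mean":
            filled[i] = mean(known)
        elif method == "median":
            filled[i] = median(known)
        elif method == "zero":
            filled[i] = 0.0
        else:
            raise ValueError(f"unknown method: {method}")
    return filled

def fill_by_regression(x, y):
    """Fit y = a*x + b on the complete rows, then fill missing y values from x."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = mean(xi for xi, _ in pairs)
    my = mean(yi for _, yi in pairs)
    a = (sum((xi - mx) * (yi - my) for xi, yi in pairs)
         / sum((xi - mx) ** 2 for xi, _ in pairs))
    b = my - a * mx
    return [a * xi + b if yi is None else yi for xi, yi in zip(x, y)]

column = [1.0, None, 3.0, 10.0]
print(fill_missing(column, "interpolate"))
print(fill_missing(column, "median"))
print(fill_by_regression([0.0, 1.0, 2.0, 3.0], [1.0, None, 5.0, 7.0]))
```

The sketch covers only a single predictor for the regression case, matching the a*x + b example above; with more variables the same idea applies with a multivariate linear fit.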

Normalizing Scale and Offset

While normalizing your variables (rescaling the numeric values) is completely optional, it can greatly improve the performance of Eureqa. You can normalize data yourself by entering an expression of your own, but several common normalizing options are built into Eureqa. The next few sections describe when and how to normalize your variables and how normalization works in Eureqa.

When to Normalize

Eureqa works best when all variables in your data have small to medium magnitudes, on the order of 1 to 100. For example, if you have variables whose values range into the millions, it would be best to rescale the values to larger units.

Additionally, the magnitude of a variable's variation should be similar to its mean or offset. For example, if you have a variable that only varies between 100.0 and 100.5, it would be best to subtract off 100 so that it ranges between 0 and 0.5.

For example, consider the following two variables in some data set:

Notice that both variables look rather flat. You can't see any interesting variation because the variable b has such a large offset. Let's try subtracting off an offset of 15,000 from b:

Now we can see some interesting variation in the variable b, but the variable a still looks flat because the variable still has a small magnitude relative to b. Next, let's try dividing the values of b by 1,000:

Now we can see the interesting variation in both variables, as they now have the same relative scale and offset. This is ideally how we want data to look before being processed by Eureqa. When the variables are reasonably scaled, Eureqa is most likely to use their variation to build simple and accurate solutions.
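
The same two steps can be reproduced numerically. The short Python sketch below uses made-up values for a and b (the original data set is not given here, so these numbers are purely hypothetical) and applies the offset of 15,000 followed by the division by 1,000.

```python
# Hypothetical values standing in for the two variables in the example.
a = [0.2, 1.1, 0.7, 1.9, 1.4]
b = [15200.0, 16100.0, 15700.0, 16900.0, 16400.0]

def value_range(values):
    """Spread of a variable: largest value minus smallest value."""
    return max(values) - min(values)

# Step 1: subtract the large offset so b varies around zero instead of 15,000.
b_shifted = [v - 15000 for v in b]

# Step 2: divide by 1,000 so b's variation is on the same scale as a's.
b_scaled = [v / 1000 for v in b_shifted]

print("range of a:        ", value_range(a))
print("range of b:        ", value_range(b))
print("range of b, scaled:", value_range(b_scaled))
```

With these made-up numbers, b starts out with a spread about a thousand times larger than a's and ends up with a comparable one.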

How to Normalize a Variable

Eureqa comes with a number of built-in normalization options. To normalize a variable in Eureqa, select the variable in the "Variables" pane and then select the check box to "Normalize the scale and offset" of the variable. Multiple variables may be selected and normalized at once.

The general formula for normalizing a variable y is:

        y_normalized = (y - offset)/scale

where offset and scale are the normalization parameters. When you select the option to normalize a variable in Eureqa, you will see options to select the values for "Subtract by", which corresponds to the offset normalization parameter, and "Divide by", which corresponds to the scale normalization parameter. Eureqa provides options to normalize by commonly used values, for example subtracting the mean, median, or interquartile mean, and dividing by the standard deviation, interquartile range, or a power of 10 (e.g. 10^3, 10^6, 10^9).
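
To make the formula concrete, the following Python sketch applies it with a few of the offset and scale choices listed above. The normalize function is a hypothetical helper written for this example, not part of Eureqa. Whether Eureqa uses the sample or the population standard deviation, and which quartile convention it uses for the interquartile range, is not stated here; the sketch assumes the sample standard deviation and the default convention of Python's statistics.quantiles.

```python
from statistics import mean, median, quantiles, stdev

def normalize(values, offset, scale):
    """Apply y_normalized = (y - offset) / scale to every value."""
    if scale == 0:
        raise ValueError("scale must be non-zero")
    return [(v - offset) / scale for v in values]

# Made-up variable with a large offset and a small spread.
y = [100.0, 100.1, 100.2, 100.3, 100.5]

# Subtract the mean, divide by the (sample) standard deviation.
z = normalize(y, mean(y), stdev(y))

# Subtract the median, divide by the interquartile range.
q1, _, q3 = quantiles(y, n=4)
robust = normalize(y, median(y), q3 - q1)

# No offset, divide by a power of 10 (here 10^6).
millions = normalize([2_000_000, 5_000_000], 0, 10**6)

print(z)
print(robust)
print(millions)
```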

Note that the normalized versions of the input variables are used for the model search only. All models, error metric values, etc. displayed in the "View Results" tab are in terms of the raw (unnormalized) variable values.

If you choose not to use Eureqa's built-in normalization options, you may instead consider these normalization options:

  1. Consider changing the units of the data you enter into Eureqa. Could you measure in meters instead of centimeters? Could you measure currency in millions-of-dollars instead of dollars? Pick units such that the numerical values have a range of approximately 1 to 100. One way to perform this conversion is using an expression to enter a new column in the "Enter Data" tab.
  2. Consider measuring values from an offset. Could you measure time since the time of your first data point, instead of since the beginning of the year or century?
  3. Check your data; look for outliers. Are there any values that are drastically out of proportion with the rest of the values? If so, consider removing this entire row in your data set or giving it a very low weight.

Automatic Normalization Checks

By default, Eureqa will check your data for extreme cases of variables that need to be normalized. When viewing data in the "Prepare Data" tab, you may see variables tagged with the red "unnormalized" indicator:

This is an indication that the variable may have a large scale or offset and that normalization is recommended before starting a search in order to get the best search performance.

Filtering Data

To ignore rows that don't meet certain requirements, enter the requirements in the box. Here are some examples based on a data set containing variables x and y:

  • x > 0 keeps only rows in which x is positive, filtering out rows in which x is zero or negative.
  • (x > 0) & (y > 0) keeps only rows in which both x and y are positive, filtering out rows in which either is zero or negative.
  • (x = 0) | (abs(x-y) > 42) keeps only rows in which the value of x is 0 or the absolute difference between x and y is greater than 42; all other rows are filtered out.
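
The Python sketch below mimics these three filters on a small made-up table. It reads each expression as the requirement a row must meet in order to be kept, following the sentence that introduces the examples; that reading, the apply_filter helper, and the sample rows are all assumptions for illustration and do not reflect how Eureqa evaluates filters internally.

```python
# Made-up rows for a data set with variables x and y.
rows = [
    {"x": 3.0, "y": 4.0},
    {"x": -1.0, "y": 2.0},
    {"x": 0.0, "y": 1.0},
    {"x": 5.0, "y": -60.0},
]

def apply_filter(rows, requirement):
    """Keep only the rows for which the requirement holds; ignore the rest."""
    return [row for row in rows if requirement(row)]

# x > 0
positive_x = apply_filter(rows, lambda r: r["x"] > 0)

# (x > 0) & (y > 0)
both_positive = apply_filter(rows, lambda r: r["x"] > 0 and r["y"] > 0)

# (x = 0) | (abs(x-y) > 42)
zero_or_far = apply_filter(
    rows, lambda r: r["x"] == 0 or abs(r["x"] - r["y"]) > 42
)

print(positive_x)
print(both_positive)
print(zero_or_far)
```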

Need More?

If your data could benefit from more sophisticated preprocessing, you may want to process it in another application and then transfer the resulting data into Eureqa.
