Eureqa Desktop User Guide

Setting a Search Target in Eureqa Desktop

In this view, you can choose the type of formula to search for and how to search for it.

Target Expression

The "Target Expression" in the field at the top of the "Set Target" tab tells Eureqa what type of model to search for. By default, the target expression is an equation where y (or, if there's no y, whatever variable is in the final column of your data) is modeled as a function of all other variables.

Edit the formula to specify the type of relationship you want to model. To edit the target expression, click on it, then make the desired alterations. For example, if you want to model the variable z as a function of x and y, enter z=f(x,y). If you want to ignore y, you could enter just z=f(x). Use the special function f(...) to specify the part of the equation that Eureqa will attempt to fill in. Eureqa will search for the formula f(...) using the variables you put inside the parentheses.

More complex expressions are also possible and give you a lot of power to search for complex relationships including the modeling of differential equations, polynomial equations, and binary classification. The following are some examples of advanced target expressions. The examples assume that the data entered into the Eureqa spreadsheet has four variables named w, x, y, and z.

Basic Examples

Model the variable y as a function of variable x:

        y = f(x)

Model the variable z as a function of two variables x and y:

        z = f(x, y)

Model the variable z as a function of y and an expression, sin(x):

        z = f(y, sin(x))

Model arbitrary expressions with target expressions like the following:

        sin(z + 3) = 100*w + x*f(y, y/(x+1))

Multiple Functions

To incorporate multiple functions into your target expression, use numbered functions starting with f0(). For example:

        y = f0(x) + f1(w, z)

Fitting Coefficients

You can represent an unknown constant or coefficient as a function with no arguments, f(). You can use multiple no-argument functions, f0(), f1(), etc., to fit the coefficients of arbitrary nonlinear equations. For example, if you're looking for a polynomial of the form:

        y = a*x + b*x^2 + c*x^3

use the following target expression:

        y = f0()*x + f1()*x^2 + f2()*x^3
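
To make the role of the no-argument functions concrete, the following sketch estimates the three coefficients of that polynomial by ordinary least squares in plain Python. It only illustrates what f0(), f1(), and f2() stand for; Eureqa's own fitting procedure is not described in this guide, and the helper names (solve_linear_system, fit_cubic_coefficients) and the sample data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_cubic_coefficients(xs, ys):
    """Least-squares estimates of a, b, c in y = a*x + b*x^2 + c*x^3."""
    basis = [[x, x**2, x**3] for x in xs]
    # Normal equations: (B^T B) * coeffs = B^T y.
    btb = [[sum(row[i] * row[j] for row in basis) for j in range(3)]
           for i in range(3)]
    bty = [sum(row[i] * y for row, y in zip(basis, ys)) for i in range(3)]
    return solve_linear_system(btb, bty)


# Hypothetical noise-free data generated with a=2, b=-0.5, c=0.25.
xs = [-2.0, -1.0, 0.5, 1.0, 2.0, 3.0]
ys = [2.0 * x - 0.5 * x**2 + 0.25 * x**3 for x in xs]
a, b, c = fit_cubic_coefficients(xs, ys)
```

Here a, b, and c play the roles of f0(), f1(), and f2() in the target expression.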

Binary Classification

If y is a binary variable filled with 0s and 1s, model it using a squashing function, such as the logistic function:

        y = logistic(f0() + f1()*f2(x))

If the output variable has three or more class values, represent the data as a binary variable for each class value. For example, if y can have values a, b, or c, you might transform the data into binary variables y_a, y_b, and y_c:

x    y    y_a  y_b  y_c
1    a    1    0    0
2    c    0    0    1
3    b    0    1    0
4    c    0    0    1

For more information, see Labels and Categorical Variables on the Entering Data page.

You can then model each of these binary variables individually in three different searches using the following target expressions:

        y_a = logistic(f(x))

        y_b = logistic(f(x))

        y_c = logistic(f(x))

You then have three equations, one for each output.
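
If you prepare the data outside Eureqa, the binary columns can be produced with a few lines of code. The sketch below is one possible way to do it in Python; the function name one_vs_rest_columns is hypothetical, and the labels are the ones from the table above.

```python
def one_vs_rest_columns(labels):
    """Return one 0/1 column per class value, keyed as "y_<class>"."""
    classes = sorted(set(labels))
    return {
        "y_" + cls: [1 if label == cls else 0 for label in labels]
        for cls in classes
    }


y = ["a", "c", "b", "c"]
columns = one_vs_rest_columns(y)
```

Each entry of columns can then be entered as its own spreadsheet column and modeled in a separate search.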

Differential Equations

Model the derivative of y with respect to the spreadsheet row number as a function of x and y:

        D(y) = f(x, y)

Model the derivative of y with respect to x as a function of x and y:

        D(y, x) = f(x, y)

Model the second derivative of y with respect to x as a function of x, y, and the first derivative of y with respect to x:

        D(y, x, 2) = f(x, y, D(y, x))
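
For intuition about what the D(...) operator represents, the sketch below approximates derivatives from tabulated data with finite differences. This is an assumption made for illustration only: this guide does not describe how Eureqa computes numerical derivatives (for example, whether any smoothing is applied), and the function name derivative is hypothetical.

```python
def derivative(ys, xs=None):
    """Approximate dy/dx at each row.

    With xs=None the derivative is taken with respect to the row number.
    Interior rows use a central difference; the first and last rows fall
    back to a one-sided difference.
    """
    if xs is None:
        xs = list(range(1, len(ys) + 1))
    n = len(ys)
    result = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        result.append((ys[hi] - ys[lo]) / (xs[hi] - xs[lo]))
    return result


x = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [v * v for v in x]            # y = x^2, so dy/dx = 2x

dy_drow = derivative(y)           # analogous to D(y)
dy_dx = derivative(y, x)          # analogous to D(y, x)
d2y_dx2 = derivative(dy_dx, x)    # analogous to D(y, x, 2)
```

Because numerical differentiation amplifies noise, derivative targets are generally more sensitive to noisy data than plain y = f(x) targets.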

Custom Error Metrics

You can design your target expression so that it functions as a custom error metric. For example, to minimize the 4th-power error:

        (y - f(x))^4 = 0

More details can be found in the section on Custom Error Metrics.

Nested Functions

Model the output of a recursive or iterated function to a depth of 3:

        y = f(f(f(x)))

Tips and Tricks

Entering all columns as input variables

Entering an ellipsis, "...", will automatically fill in all of your variable names (except for the target variable), separated by commas. For example:

        y = f(...)

after pressing enter, will become:

        y = f(w, x, z)
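
The expansion can be pictured as a simple text substitution. The sketch below is a hypothetical illustration (expand_ellipsis is not part of Eureqa) and assumes that the left-hand side of the expression is a single variable name.

```python
def expand_ellipsis(expression, variables):
    """Replace "..." with every variable except the one on the left-hand side."""
    target = expression.split("=", 1)[0].strip()
    inputs = [name for name in variables if name != target]
    return expression.replace("...", ", ".join(inputs))


expanded = expand_ellipsis("y = f(...)", ["w", "x", "y", "z"])
```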

Forcing a variable or term to appear in solutions

You can use a special set of building blocks to force a particular variable or term to appear in your solutions. To do this, add a term that nests the require() and contains() building blocks, as in this example, which requires solutions to contain an x^2 term:

        y = f(x) + 0*require(contains(f(x),x^2))

For this to work, the first argument of the contains operator must exactly match the functional term you are trying to fit (f(x) in this case). Multiplying the require(...) term by 0 guarantees that it won't affect the value produced by a particular solution f(x). Note that these operators make it harder to find solutions and may significantly slow down your search.

For more information on the require and contains building blocks, see the Advanced Building Blocks section on the Model Building Blocks reference page.

Building Blocks

Addition, subtraction, modulus, floor, factorial, Gaussian, If-Then-Else: these are a few of the 46 (and counting) building blocks that Eureqa will be happy to combine in a few trillion ways as it seeks out good solutions. Check the boxes next to the building blocks you want included in the mix. (See the Building Blocks List.)

Which building blocks should you choose? Expert knowledge will help you here. Which of the building blocks tend to show up in your field? Which ones are found in solutions to problems related to yours? Which ones are suggested by graphs of your data? Which ones just seem like good candidates based on your intuition (expert or otherwise)?

The trade-off to keep in mind is this: Limiting the number of building blocks will speed up your search and may increase the likelihood that Eureqa will find an exact solution; on the other hand, disabling too many building blocks could preclude the discovery of an exact solution if a necessary operation is disabled.

Building Block Complexity

Complexity values are configurable: click the complexity value you want to change and enter a new value. The defaults typically work well, but if you have prior knowledge of the system you are trying to model, you may want to modify them. If you know or expect particular building blocks to be part of a solution that accurately captures the core dynamics of the system, you might lower their complexity values to make it more likely that they will appear in Eureqa's solutions. Similarly, if there are building blocks that you don't want to appear unless they significantly improve the fit of a solution, you might raise their complexity values.

Error Metrics

This drop-down box allows you to choose how potential solutions are assessed. The default setting, where absolute error is minimized, works well in most cases, but you can also choose to minimize squared error, worst-case error, logarithm error, median error, interquartile absolute error, or signed difference. Additionally, you can choose to maximize the correlation coefficient or the R-squared goodness of fit, or you can try our experimental hybrid that considers both absolute error and correlation. See the Error Metrics reference page for additional details on the full list of error metrics supported by Eureqa.

Custom Error Metrics

In addition to using the built-in error metrics, you can use the target expression to specify custom error metrics for the search to optimize; or more specifically, arbitrary custom loss functions for the fitness calculation.

Custom Fitness Using Minimize Difference

Eureqa has a built-in fitness metric named "signed difference". This fitness minimizes the signed difference between the left- and right-hand sides of the search relationship. For example, specifying:

        y = f(x)

with the minimize difference fitness metric selected tells Eureqa to find an f(x) that minimizes y - f(x). A trivial solution to this relationship would be f(x) = positive infinity, which makes y - f(x) arbitrarily negative; however, you can enter other expressions that are more useful. Consider the following target expression:

        (y - f(x))^4 = 0

Here, the minimize difference fitness metric would minimize the 4th-power error.

In fact, you can enter any such expression and the f(x) can appear multiple times. For example:

        max(abs(y - f(x)), (y - f(x))^2) = 0

would minimize the maximum of the absolute error and squared error, at each point in the data set.
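
The following sketch shows how such target expressions turn into a loss value for one candidate model. It assumes that the signed difference is averaged over the rows; the aggregation Eureqa actually uses is not stated in this guide, and the names mean_signed_difference and candidate, as well as the data, are hypothetical.

```python
def mean_signed_difference(lhs_values, rhs_values):
    """Average of (left - right) over all rows (assumed aggregation)."""
    pairs = list(zip(lhs_values, rhs_values))
    return sum(left - right for left, right in pairs) / len(pairs)


def candidate(x):
    """A hypothetical candidate model f(x)."""
    return 2.0 * x


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.5, 5.0]
residuals = [y - candidate(x) for x, y in zip(xs, ys)]
zeros = [0.0] * len(xs)

# Target "(y - f(x))^4 = 0": the left side is the 4th-power error.
fourth_power_loss = mean_signed_difference([r ** 4 for r in residuals], zeros)

# Target "max(abs(y - f(x)), (y - f(x))^2) = 0".
max_abs_sq_loss = mean_signed_difference(
    [max(abs(r), r ** 2) for r in residuals], zeros)
```

Because the left-hand side is never negative in either target, minimizing the signed difference against 0 is the same as minimizing the custom error itself.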

Other Methods

There are many other possible ways to alter the fitness metric using the target expression.

Example: Using a sigmoid function to effectively cap large error

One example is that you could use a normal fitness metric (e.g. mean absolute error or mean squared error) but scale both sides of the relation. For example, you could wrap each side of the search relation with a sigmoid function like tanh:

        tanh(y) = tanh(f(x))

Now both the left and right sides are squashed by the tanh function (an s-shaped curve that ranges from -1 to 1) before being compared. This effectively caps large errors, reducing their impact on the fitness.
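
A small numeric sketch of the capping effect follows. The function names and the sample values are hypothetical; only the tanh wrapping itself comes from the target expression above.

```python
import math


def abs_error(y, prediction):
    """Ordinary absolute error for the target y = f(x)."""
    return abs(y - prediction)


def squashed_abs_error(y, prediction):
    """Absolute error for the target tanh(y) = tanh(f(x))."""
    return abs(math.tanh(y) - math.tanh(prediction))


y_true = 1.0
outlier_prediction = 1000.0   # hypothetical wildly wrong prediction

raw = abs_error(y_true, outlier_prediction)
capped = squashed_abs_error(y_true, outlier_prediction)
```

Since tanh stays between -1 and 1, the squashed error for a single row can never exceed 2, however far off the prediction is. The same squashing also compresses genuine differences between large values of y, so this approach fits best when the values of interest lie roughly within the range where tanh is not saturated.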

Example: Leveraging NaN and infinite error to constrain solutions

Another example would be to use the target expression to forbid certain values by exploiting NaN values (NaN = Not a Number). For example, consider the following search relation, which forbids models with negative values:

        y = f(x) + 0*log(f(x))

Notice the unusual 0*log(f(x)) term. Whenever f(x) is positive, the log is real-valued and the multiplication by zero reduces the expression to y = f(x). However, whenever f(x) is negative, log(f(x)) is undefined and produces a NaN value. Whenever NaN appears in the fitness calculation, Eureqa automatically assigns the solution infinite error. Therefore, this search relationship tells Eureqa to find an f(x) that models y, but f(x) must be positive at each point in the data set.

This behavior can be used in other ways as well. Any operation that would produce an IEEE floating point NaN, undefined, or infinity will trigger Eureqa to assign infinite error. You can also add multiple terms like this to place multiple constraints on solutions.
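
The mechanism can be reproduced with ordinary floating-point arithmetic. In the sketch below, safe_log and row_error are hypothetical helpers: safe_log returns IEEE-style results instead of raising an exception (as Python's math.log would), and row_error applies the "NaN means infinite error" rule described above to a single row.

```python
import math


def safe_log(value):
    """Natural log returning -inf at 0 and NaN for negative input."""
    if value > 0:
        return math.log(value)
    if value == 0:
        return -math.inf
    return math.nan


def row_error(y, prediction):
    """Absolute error for the target y = f(x) + 0*log(f(x))."""
    rhs = prediction + 0 * safe_log(prediction)
    error = abs(y - rhs)
    # Assumed to mirror the rule above: a NaN result counts as infinite error.
    return math.inf if math.isnan(error) else error


ok = row_error(3.0, 2.5)          # positive prediction: ordinary error
rejected = row_error(3.0, -2.5)   # negative prediction: NaN, so infinite error
at_zero = row_error(3.0, 0.0)     # 0 * -inf is also NaN
```

Note that in this sketch a prediction of exactly zero is rejected too, because multiplying zero by negative infinity yields NaN.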

Row Weight

You can designate one of your variables as an indicator of how much relative weight (i.e., importance) you want Eureqa to give to the data in each row. For example, if the designated row-weight variable has a value of 10 in the first row and 20 in the second row, data in the second row will be given twice the weight of the data in the first row. Row weight can be specified either by using a row-weight variable or by using a row-weight expression.

When to Use a Row Weight

The following are some common scenarios for which using a row weight may help to improve performance:

  • Suppose that for each data point you have a confidence value - perhaps you determined it while collecting the data or computed it in some other program. Create a variable (i.e., a column) containing those values and designate it as the row weight variable. Eureqa will then weight the data accordingly, giving more weight to rows with higher confidence.
  • Suppose you want to give extra weight to a few important data points - the zero points of some variable, for example. You could give those points more weight by first filling a row weight column with ones (by entering =1), then manually changing the values in the important rows to 10, or 100, or 1000, or whatever value represents the relative weight you want to bestow on that row.
  • Suppose you want to balance your data by giving more weight to rare events than to common ones. More specifically, suppose you want to model credit card fraud, and 99.99% of the data points are legitimate transactions while 0.01% are fraudulent. You could create a variable whose value is 1 in rows representing legitimate transactions and 9999 (i.e., 99.99% / 0.01%) in rows representing fraud, thereby creating equal pressure to model both legitimate and fraudulent cases.
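
The balancing arithmetic in the last scenario can be sketched as follows. The function balancing_weights is a hypothetical helper for preparing a row weight column outside Eureqa; the data set mirrors the 99.99% / 0.01% split from the example.

```python
def balancing_weights(labels):
    """Weight rows so that every class carries the same total weight.

    The most common class gets weight 1; rarer classes are scaled up.
    """
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    largest = max(counts.values())
    return [largest / counts[label] for label in labels]


# Hypothetical data: 9999 legitimate transactions (0) and 1 fraudulent one (1).
is_fraud = [0] * 9999 + [1]
weights = balancing_weights(is_fraud)

legit_total = sum(w for w, label in zip(weights, is_fraud) if label == 0)
fraud_total = sum(w for w, label in zip(weights, is_fraud) if label == 1)
```

With these weights the single fraudulent row counts as much as all 9999 legitimate rows combined.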

Using a Row Weight Variable

Once a row weight variable has been established, the rest of the data in each row will be weighted in proportion to the value of the row weight variable in that row. To establish a row weight variable, enter the name and values of the variable in a column of the spreadsheet in the "Enter Data" tab, then select that variable name in the "Row Weight" drop-down box on the "Set Target" tab.

Using a Row Weight Expression

Some row weighting schemes can be more easily achieved with a row weight expression than with a row weight variable. It works like this: the row weight expression is evaluated using the values in that row, and the row is weighted with the result. Several row weight expressions are automatically built in to the row weight drop-down menu, or you can use your own.

1/occurrences(variable)

You'll find expressions of this form, one for each of your variables, in the "Row Weight" drop-down box. These expressions provide a quick way to balance data. To illustrate, let's imagine a toy data set containing just three values of one variable:

var x
1 99
2 99
3 86

The value returned by occurrences(x) is the number of times the value of x in a given row occurs in the data set, so in this case it would return 2 in the first row, 2 in the second row, and 1 in the third row. Selecting 1/occurrences(x) as your row weight would therefore give the first row a weight of 1/2, the second row a weight of 1/2, and the third row a weight of 1.

Returning to the credit card fraud example, we could simplify things by creating a variable z with a value of 0 in rows representing legitimate transactions and 1 in rows representing fraudulent ones. Selecting a row weight of 1/occurrences(z) would then automatically create equal pressure to model legitimate and fraudulent transactions. If new data is added, the weights are automatically adjusted to maintain the balance.
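
The weights produced by a 1/occurrences(variable) expression can be reproduced with a short sketch. The function name inverse_occurrence_weights is hypothetical; the data is the toy data set from above.

```python
from collections import Counter


def inverse_occurrence_weights(values):
    """Weight each row by 1 / (number of rows sharing its value)."""
    counts = Counter(values)
    return [1 / counts[value] for value in values]


x = [99, 99, 86]
weights = inverse_occurrence_weights(x)
```

The weights of all rows that share a value always sum to 1, which is why this expression gives each distinct value equal total influence.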

<row>

Another expression built into the "Row Weight" drop-down menu is the special variable <row>, which takes on the value of the row number. Using this as the row weight will therefore give the first row a weight of 1, the second row a weight of 2, and so on.

Other Expressions

If you want to use a row weight expression that isn't automatically built into the "Row Weight" drop-down, the best option is to add a new column for row weight on the Enter Data tab and use an expression to automatically populate the column with the desired row weights.

Example expressions, assuming a data set containing variables x and y, may include:

  • =abs(x) gives row weights in proportion to the absolute value of x.
  • =1/abs(x-y) gives row weights in inverse proportion to the absolute difference between x and y.
  • =1/<row> gives row 1 a weight of 1, row 2 a weight of 1/2, row 3 a weight of 1/3, ...
  • =0.5 + 0.5*(<row> <= 100) gives rows 1 through 100 a weight of 1 and the remaining rows a weight of 0.5. (Note: <= returns 1 if satisfied, 0 if unsatisfied.)
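
To check what these expressions produce before entering them, you can evaluate them by hand or with a short script. The sketch below mirrors the four expressions in Python; row_weights is a hypothetical helper, rows are numbered from 1 as with <row>, and the sample data deliberately avoids rows where x equals y, since 1/abs(x-y) would then divide by zero.

```python
def row_weights(xs, ys):
    """Evaluate the four example expressions for every row."""
    table = []
    for row, (x, y) in enumerate(zip(xs, ys), start=1):
        table.append({
            "abs(x)": abs(x),
            "1/abs(x-y)": 1 / abs(x - y),
            "1/<row>": 1 / row,
            # A comparison counts as 1 when true and 0 when false.
            "0.5 + 0.5*(<row> <= 100)": 0.5 + 0.5 * (row <= 100),
        })
    return table


xs = [-2.0, 4.0, 1.0]
ys = [0.0, 3.0, 5.0]
table = row_weights(xs, ys)
```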

Data Splitting

Eureqa automatically splits your data into a training data set and a validation data set. The training set is used to generate and optimize solutions, and the validation set is used to test how well those models generalize to new data. Eureqa also uses the validation data to select the best models to display in the Eureqa interface.

The drop-down box called "Data Splitting" gives you three options for data splitting:

  • Treat all data points equally (default)
  • Split data points for extrapolating future values
  • Use custom data settings

All reported model metrics (for example, those in the "Solution Details" pane in the "View Results" tab) are computed against the validation data.

Default Splitting

By default, Eureqa will randomly shuffle your data and then split it into training and validation data sets based on the total size of your data. Training data will be taken from the start of the data set and validation data will be taken from the end (after shuffling).

Note that with the default shuffling, under some circumstances such as very small input data sets, there may be overlap between the training and validation data sets.

Using Validation Data to Test Extrapolating Future Values

If you are using time series data and are trying to predict future time series values, you may want to create a validation data split that emphasizes the ability of the models to predict future values that were not used for optimizing the model directly. To do this, choose the "Split data points for extrapolating future values" option.

Assuming that your data has been entered in increasing time order, this will split the data so that the training data consists of the data points that occurred earlier in time and the validation data consists of those that occurred later. After starting the search, you will see your data split in this way.

Now the list of best solutions will be filtered by their ability to predict only future values - the last rows in the data set which were not used to optimize the models directly.

Custom Data Splitting

You can modify how Eureqa chooses training and validation data sets by selecting the drop-down option to "use custom data settings" or selecting the "Set Custom..." option to the right of the "Data Splitting" drop-down menu. Either will open the "Data Settings" pane.

Here you can change the portion of the data that is used for the training data and the portion that is used for the validation data. The two sets are allowed to overlap but can also be mutually exclusive if the percentages specified for training and validation data sum to 100%.

For very small data sets (under a few hundred points), it is usually best to use almost all of the data for both training and validation. When the data set is extremely small or has little or no noise, you may want both sets to include 100% of the data. Model selection can be done using the model complexity alone in these cases.

For very large data sets (over 1,000 points), it is usually best to use a smaller fraction of data for training. It is recommended to choose a fraction such that the size of the training data is approximately 10,000 rows or fewer. Then use all remaining data for validation.

Finally, the option to "shuffle rows before splitting" will let you choose whether to randomly shuffle the data before splitting it. One reason to disable the shuffling is if you want to choose specific rows at the start of the data set to use for training and specific rows at the end of the data set to use for validation. If you have time series data that has been ordered by time, not selecting this option will have the effect of keeping the most recent data aside for validation.
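
The interaction between the percentages and the shuffle option can be sketched as follows. This is an illustration under stated assumptions, not Eureqa's actual implementation: the function split_rows, its rounding behavior, and the seed parameter are hypothetical. Training rows are taken from the start and validation rows from the end, as described above.

```python
import random


def split_rows(rows, train_fraction, validation_fraction, shuffle=True, seed=0):
    """Take training rows from the start and validation rows from the end."""
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_fraction)
    n_val = round(len(rows) * validation_fraction)
    training = rows[:n_train]
    validation = rows[len(rows) - n_val:]
    return training, validation


rows = list(range(1, 11))   # ten rows, assumed to be in time order

# No shuffling, 70% / 30%: the most recent rows are held out for validation.
train, val = split_rows(rows, 0.7, 0.3, shuffle=False)

# Fractions that sum to more than 100% make the two sets overlap.
train_overlap, val_overlap = split_rows(rows, 0.8, 0.5, shuffle=False)
shared = sorted(set(train_overlap) & set(val_overlap))
```

In the first call the sets are mutually exclusive because the fractions sum to 100%; in the second, the rows in shared belong to both sets.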

Base and Prior Solutions

You can start Eureqa off on the right track by entering equations that give partial solutions or that express relationships you believe will play some role in an eventual solution. To enter a prior solution, find the "Base or prior solutions" box at the bottom of the "Set Target" tab and type or copy and paste the expression into the prior solutions text box. Enter each prior solution on a separate line.

For example, to enter two prior solutions y = (x - 1) and y = sin(2*x), enter the following text:

        (x - 1)
        sin(2*x)

When starting or resuming a search, seeded solutions are shown in the project log in the Start Search View. The confirmation message will look like:

6:24:53PM| Search started: y = f(x, w)
6:24:53PM| Seeding solution: y = x - 1
6:24:53PM| Seeding solution: y = sin(2*x)

Complex Search Relations

Complex search relations that specify more than one formula for each solution must be entered differently. Each line must contain an expression for each fi() that appears in the Target Expression.

For example, if the search relation contains two formulas, such as y = f0(x)*f1(x), and the prior model is y = (x-1)*sin(2*x), one would enter:

        f0 = (x-1), f1 = sin(2*x)

Each fi() is listed with its sub-expression, and the entries are separated by commas on the same line. To specify multiple prior models, enter each set of f's on a new line.
