Alternative Data For Trading — Statistical Arbitrage

Abhinav Unnam
4 min read · Feb 17, 2021

There is no single market secret to discover, no single correct way to trade the markets. Those seeking the one true answer to the markets haven’t even gotten as far as asking the right question, let alone getting the right answer.

Jack Schwager, author of Market Wizards

There are lots of ways to trade, but only a few simple ways to gain an edge. In recent times, the edge comes either from getting information quickly and acting on it as fast as possible (think HFTs) or from being able to process data and extract more from it.

The second approach is the focus of this article: either you process the existing data better, or you gain access to alternative sources, i.e. data for trading that is not related to price action.

Over the years it has gotten harder to squeeze more juice out of the same price action data, and at the same time digitisation has grown. As a result, a bunch of really interesting alternative data sets are now available. An awesome collection of vendors selling alternative data sources is available here.

So, you have received a signal and need to check its efficacy against a well-known instrument. How do you go about it? This is an experimental case study and does not use any data I worked on commercially.

This is what our sample data set looks like

We just need to determine whether the signal is in any way useful. There are a couple of steps and key things to remember before checking the signal quality. After a quick data verification, we should be able to move ahead.

Data Quality Checks

The data looks fine except for a couple of obvious errors, such as the -150 closing price. We need an automated way to catch and fix such obvious errors, along with any others.

For the data quality checks, I computed a few extra metrics:

  • The rolling mean of the last 5–20 data points.
  • The rolling standard deviation of the last 5–20 data points.
  • Lower and upper barriers set 6 sigma (six standard deviations) away from the rolling mean.
  • A flag for any point lying beyond these 6-sigma barriers.
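The checks listed above can be sketched in pandas roughly as follows. This is a minimal sketch, not the exact code used for the article: the function name `flag_outliers`, the column names, and the synthetic price series are all made up for illustration. One assumption is worth calling out: the rolling statistics are computed from the preceding points only. If the point under test is included in its own window of n points, it can never sit more than (n-1)/√n sample standard deviations from the window mean (about 4.2 for n = 20), so a 6-sigma rule would never fire.

```python
import numpy as np
import pandas as pd


def flag_outliers(series: pd.Series, window: int = 20, n_sigma: float = 6.0) -> pd.DataFrame:
    """Flag points lying more than n_sigma rolling standard deviations from the rolling mean."""
    # Assumption: statistics come from the *preceding* points only (shift(1)),
    # so a bad tick cannot inflate the mean and std it is tested against.
    prior = series.shift(1)
    min_periods = min(5, window)
    mean = prior.rolling(window, min_periods=min_periods).mean()
    std = prior.rolling(window, min_periods=min_periods).std()

    out = pd.DataFrame({"value": series, "mean": mean, "std": std})
    out["lower"] = mean - n_sigma * std
    out["upper"] = mean + n_sigma * std
    # Comparisons against NaN barriers (the first few rows) evaluate to False.
    out["is_outlier"] = (series < out["lower"]) | (series > out["upper"])
    return out


# Synthetic closing prices with one planted error, mimicking the -150 close.
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 60).cumsum(), name="close")
close.iloc[40] = -150.0

checked = flag_outliers(close)
print(checked.loc[checked["is_outlier"], ["value", "lower", "upper"]])
```

Note that for the rows right after a bad tick, the error itself sits inside the window and widens the barriers, so in practice it helps to repair flagged points before checking the ones that follow.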

Reassuringly, this approach catches these errors correctly. We then fix them by replacing the erroneous data points with the rolling average values.
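A minimal sketch of the repair step, assuming the outliers have already been flagged in a boolean mask. The helper name `replace_with_rolling_mean` and the toy numbers are hypothetical, and using the mean of the preceding clean points as "the average" is an assumption about what the replacement value should be.

```python
import pandas as pd


def replace_with_rolling_mean(series: pd.Series, is_outlier: pd.Series, window: int = 20) -> pd.Series:
    """Replace flagged points with the mean of the preceding clean points."""
    # Blank out the flagged points so they do not contaminate the average.
    clean = series.mask(is_outlier)
    # Rolling mean over the previous `window` rows; NaNs are skipped.
    prior_mean = clean.shift(1).rolling(window, min_periods=1).mean()
    # Keep the original value where it is fine, use the average where it is not.
    return series.where(~is_outlier, prior_mean)


close = pd.Series([100.0, 101.0, 99.5, -150.0, 100.5, 101.5])
bad = pd.Series([False, False, False, True, False, False])

fixed = replace_with_rolling_mean(close, bad, window=3)
print(fixed)
```

Only the flagged row changes; every other price is passed through untouched.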

We repeat the same process for the Signal column, since extreme outliers are possible there too. After cleaning, the data is much more compressed, and we can move on to the next steps.

Efficacy Check

Next, we want to understand whether the signal has any predictive power. To do that, we need to define a metric and measure it. We take the percentage change in the signal and test whether it has any predictive power over the 1-day forward-looking price change, i.e. the returns.

It’s necessary to work with something like returns rather than raw prices, since returns are approximately stationary. Stationarity, along with independence between data points, is an assumption behind many statistical methods.

We map X (the signal change) to Y (the forward-looking 1-day returns). This alignment is necessary so that the signal change today is used to forecast the returns by the end of tomorrow.
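The alignment between X and Y is easy to get wrong by one day, so here is a minimal sketch of it. The function name `build_xy` and the column names `close` and `signal` are assumptions, as the article does not show the actual schema.

```python
import pandas as pd


def build_xy(df: pd.DataFrame, price_col: str = "close", signal_col: str = "signal") -> pd.DataFrame:
    """Pair today's signal change (x) with tomorrow's return (y)."""
    # x: percentage change in the signal from yesterday to today.
    x = df[signal_col] / df[signal_col].shift(1) - 1
    # Return realised on each day, from the previous close to that day's close.
    daily_return = df[price_col] / df[price_col].shift(1) - 1
    # y: shift the returns back one row so each row holds the *next* day's return.
    y = daily_return.shift(-1)
    # The first row has no x and the last row has no y, so both are dropped.
    return pd.DataFrame({"x": x, "y": y}).dropna()


df = pd.DataFrame({
    "close": [100.0, 102.0, 101.0, 103.0],
    "signal": [10.0, 11.0, 10.5, 12.0],
})
xy = build_xy(df)
print(xy)
```

The `shift(-1)` is what makes the returns forward-looking: without it, the signal change would be compared with a return that had already happened.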

A quick peek at the distributions of X and Y is a good visual to have. The returns range from about -6% to around 4%.

On the other hand, the signal change is very polarised. It can therefore be passed through a gate with a +1 or -1 output.
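Such a gate can be sketched with a sign function. The name `gate` is made up, and mapping zero or missing changes to 0 (no view) is an assumption, since the article only mentions the +1 and -1 outputs.

```python
import numpy as np
import pandas as pd


def gate(signal_change: pd.Series) -> pd.Series:
    """Collapse the signal change to its direction: +1 for a rise, -1 for a fall."""
    # Assumption: zero or missing changes map to 0, meaning "no view".
    return np.sign(signal_change).fillna(0).astype(int)


signal_change = pd.Series([0.40, -0.75, 0.0, 2.5, np.nan])
print(gate(signal_change).tolist())
```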

Now, to measure the efficacy, I computed the RMSE (Root Mean Square Error) between the signal change and the actual returns. This number should be as low as possible.
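For reference, RMSE is the square root of the mean of the squared differences between two series. A minimal sketch with toy numbers (these are not the article's data, and the function name `rmse` is just illustrative):

```python
import numpy as np


def rmse(predicted, actual) -> float:
    """Root mean square error between two equally long sequences."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


# Toy values, expressed as fractions (0.01 means 1%).
signal_change = [0.01, -0.02, 0.03]
forward_returns = [0.03, -0.02, 0.01]

score = rmse(signal_change, forward_returns)
print(round(score, 4))
```

Because both inputs are expressed as fractional changes, the result is in the same units, which is why an RMSE can be read as a percentage error.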

On the sample data, the RMSE turns out to be ~0.055, which roughly translates to a 5.5% error and is quite significant. Such a large error wipes out any trading possibility, and this is confirmed by the cumulative performance graph.

As the graph shows, we don’t have a profitable strategy, and this is before accounting for any kind of trading costs.
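The article does not say how the cumulative performance curve was built. One plausible construction, shown here purely as an assumption, is to go long when the signal rises and short when it falls, earn the next day's return, and compound the results with no trading costs. The name `cumulative_performance` and the toy numbers are hypothetical.

```python
import numpy as np
import pandas as pd


def cumulative_performance(signal_change: pd.Series, forward_returns: pd.Series) -> pd.Series:
    """Compounded return of trading in the direction of the signal change."""
    # Position: +1 (long) if the signal rose, -1 (short) if it fell, 0 if flat.
    position = np.sign(signal_change)
    # Each position earns the next day's return; costs are ignored.
    strategy_returns = position * forward_returns
    # Compound the daily results into a cumulative curve.
    return (1 + strategy_returns).cumprod() - 1


x = pd.Series([0.5, -0.2, 0.3])      # signal changes
y = pd.Series([0.01, 0.02, -0.01])   # forward-looking 1-day returns
curve = cumulative_performance(x, y)
print(curve)
```

In this toy example the signal gets the direction right only on the first day, so the curve ends below zero.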

Alternative data, in this case, was used as a direct input for trading the instrument. Beyond a simple model with the one-day change as input, we could have explored building a more involved model using sequential inputs of this signal.

This signal could also have been combined with previous price changes for predictive purposes. Some of these concepts and the use of sequential neural networks are explored here.
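To make "sequential inputs" concrete, here is a minimal sketch of turning the one-day signal changes into fixed-length windows that a sequence model could consume. The function name `make_sequences` and the `lookback` parameter are made up for illustration, and no model is trained here.

```python
import numpy as np


def make_sequences(signal_change, forward_returns, lookback: int = 5):
    """Build (window of past signal changes, forward return) training pairs."""
    x = np.asarray(signal_change, dtype=float)
    y = np.asarray(forward_returns, dtype=float)
    if len(x) != len(y):
        raise ValueError("signal_change and forward_returns must have equal length")

    n_windows = len(x) - lookback + 1
    if n_windows <= 0:
        return np.empty((0, lookback)), np.empty(0)

    # Row i holds the signal changes for days i .. i + lookback - 1.
    windows = np.stack([x[i:i + lookback] for i in range(n_windows)])
    # The target is the forward return aligned with the last day of each window.
    targets = y[lookback - 1:]
    return windows, targets


windows, targets = make_sequences([1, 2, 3, 4], [10, 20, 30, 40], lookback=3)
print(windows)
print(targets)
```

Previous price changes could be stacked alongside the signal windows as a second feature in the same way.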


This is one way to measure the efficacy of a signal. It looks at using alternative data for trading, instead of taking the same data points and processing them better, i.e. with more complex models or data transformations.

While LSTMs and RNNs have been very useful across language/text problems, their usage has been a challenge in the market space. They are more useful as a means to extract signals from alternative data sources than trying to use the raw data to build complex models.

Originally published at