Quant Finance (Machine Learning Trading) — Statistical Arbitrage

Abhinav Unnam
5 min read · Apr 3, 2021

Warren Buffett

Quant Finance is one of the toughest areas in which to apply Machine Learning. In recent times, the way it is practised has gone through considerable change. Alongside this, there has been a constant attempt to democratize the space, with several quantitative platforms now in the market.

I have been involved in this space for some time through my association with WorldQuant as a research consultant. The program, built around the idea of democratising alpha generation, was an eye-opening experience. As I’m getting back into the fold of Quant Finance with a full-time role, I wanted to quickly recap my understanding of the basics.

To get a good overall picture, I collected and collated feedback and inputs from multiple sources: Reddit, personal bookmarks, books and plain old Google search.

This is a basic compendium of the major resources I found. I will keep updating it as I go.


This provides a nice little refresher on the possibilities of using ML in Quant Finance. It uses Python (Pandas) for processing and gives a neat introduction.


A couple of books on Quant Finance were recommended by multiple sources, especially by several handles on Reddit. The first one is a nice hands-on book that lets you build on the Udacity course.

These are some of the blogs with interesting commentary; a couple of them are by the same authors as the books.

Hard technical topics are best absorbed by taking notes. Alongside the already available resources, notes help to retain and recall things fast.

Competitions/ Platforms

  • Numerai: I participated on this portal back when they had just got started, and even received some bitcoin as a reward. Since then they have created their own cryptocurrency and made several changes to the overall platform. Think ML for finance, Kaggle style.
  • Quantopian: Another popular platform, but more traditional in nature, with a much simpler scheme. It has plenty of data sources, a nice research environment and a thriving community on its forums.
  • Websim: Another simulation platform, by WorldQuant. It has a nice payout scheme if you do well, with revenue sharing and fixed stipends.

The entire trade cycle can be divided into four parts, based on the different aspects that go into researching, building, deploying and then evaluating an alpha. Each part provides ample opportunities to employ data science and machine learning. I have also collated replies and comments from Reddit threads.

When it comes to data, the edge lies in the size, latency and novelty of the data being used. Price and order book data are heavily used, with Alternate Data emerging.

  • Access to quality data is the biggest entry barrier.
  • Cost is extremely prohibitive.
  • Openly available data is ubiquitous and has low signal power.
  • Price data is non-stationary, non-IID and non-normal, which violates the assumptions of several Machine Learning algorithms.
  • Instead of sampling data at fixed time intervals, we can sample it every time a fixed amount of volume has traded. These samples are called volume bars. This has a dual advantage:
      • Returns sampled on volume bars have better statistical properties (closer to IID and Gaussian).
      • The volume aspect of information is taken into account. We capture more information because we sample more often during periods of higher activity.
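
The volume-bar idea above can be sketched in a few lines. This is a minimal illustration, not a production sampler: the `Bar` class, the `volume_bars` function, the `(price, volume)` tick format and the threshold of 100 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float
    volume: float


def volume_bars(ticks: Iterable[Tuple[float, float]], threshold: float) -> List[Bar]:
    """Group (price, volume) ticks into bars holding at least `threshold` volume each."""
    bars: List[Bar] = []
    prices: List[float] = []
    cum_volume = 0.0
    for price, volume in ticks:
        prices.append(price)
        cum_volume += volume
        # Close the bar once enough volume has traded, regardless of elapsed time.
        if cum_volume >= threshold:
            bars.append(Bar(prices[0], max(prices), min(prices), prices[-1], cum_volume))
            prices, cum_volume = [], 0.0
    # Assumption: a trailing partial bar is dropped rather than emitted.
    return bars


# Hypothetical tick stream: (price, traded volume).
ticks = [(100.0, 30), (100.5, 50), (101.0, 40), (100.8, 100), (100.2, 10), (100.4, 20)]
bars = volume_bars(ticks, threshold=100)
```

Busy periods produce many bars and quiet periods produce few, which is where the extra sampling during higher activity comes from.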

Signal Generation & Processing

This is the area where the majority of ML applications are being explored. Everything around converting data sets into useful signals belongs here.

  • Overfitting is the biggest challenge with ML models. Coupled with relentless backtesting, it can produce lots of spurious results.
  • The edge is in feature selection, not in back-testing. Use simple models to understand and interpret the top features/predictors.
  • Model ML problems as classification rather than regression, and prefer simple models over complex approaches (Occam’s Razor).
  • Follow a research-driven approach (EDA, summaries) rather than a back-testing-heavy one.
  • Split the data into Train, Validation & Test sets.
  • Normalise the values.
  • Plot the training and validation errors after each epoch. This helps in understanding whether we are overfitting, how well the model generalises, etc.
  • Order book “pictures” can be used with transfer learning to train models that predict the next set of movements.
  • LSTMs: Any paper on time series will have some relevant material for financial data sets. The pre-processing steps can be replicated for financial data sets as well.
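
The split and normalise steps from the list above can be sketched as follows. The sketch makes two assumptions that the list does not state: the split is chronological (no shuffling, so later data never leaks into training), and the normalisation statistics are fitted on the training set only. The function names and the 60/20/20 proportions are hypothetical.

```python
from statistics import mean, pstdev


def chronological_split(rows, train_frac=0.6, val_frac=0.2):
    """Split time-ordered rows into train, validation and test sets without shuffling."""
    n = len(rows)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


def fit_zscore(values):
    """Estimate the mean and standard deviation on the training data only."""
    mu = mean(values)
    sigma = pstdev(values)
    # Guard against a constant feature, which would give a zero divisor.
    return mu, sigma if sigma > 0 else 1.0


def apply_zscore(values, mu, sigma):
    """Normalise any split using the statistics fitted on the training set."""
    return [(v - mu) / sigma for v in values]


# Hypothetical feature series, ordered in time.
feature = [float(x) for x in range(1, 11)]
train, val, test = chronological_split(feature)
mu, sigma = fit_zscore(train)
train_z = apply_zscore(train, mu, sigma)
val_z = apply_zscore(val, mu, sigma)
test_z = apply_zscore(test, mu, sigma)
```

Reusing `mu` and `sigma` from the training set keeps information from the validation and test periods out of the model inputs.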

Portfolio Allocation & Risk Management

Beyond signal generation, the portfolio allocation process also involves several risk management principles. I have seen firms employ strict risk management constraints, such as not more than 2% liquidity in one stock.

  • Kelly Criterion & Portfolio Allocation theory are used to distribute funds across signals. Both assume a known mean and variance of returns, which is a huge assumption.
  • Extreme returns (primarily negative) have a higher probability than a normal distribution would suggest. This is referred to as fat tails in returns and is important from a risk management point of view.
      • This can result in greater drawdowns in live markets compared to backtesting.
  • A large & diverse portfolio can bring the excess kurtosis close to zero, but only if the asset returns are independent.
  • This assumption of independence can be dangerous. Fat tails coupled with highly correlated asset movements resulted in a failure of risk management during the 2008 crisis.
  • The covariance between assets is constantly changing. Correlations tend to rise during negative stock movements.
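
Two of the quantities mentioned above can be made concrete with a short sketch. It assumes the continuous-time approximation of the Kelly fraction (mean excess return divided by variance of returns) and the moment-based definition of excess kurtosis (fourth standardised moment minus 3). The function names and sample returns are hypothetical.

```python
from statistics import mean, pvariance


def kelly_fraction(returns):
    """Approximate Kelly fraction: mean return divided by variance of returns.

    Assumes the returns are already in excess of the risk-free rate and that
    their mean and variance are known, which is the huge assumption noted above.
    """
    var = pvariance(returns)
    if var == 0:
        raise ValueError("variance of returns is zero")
    return mean(returns) / var


def excess_kurtosis(returns):
    """Fourth standardised moment minus 3; positive values indicate fat tails."""
    mu = mean(returns)
    n = len(returns)
    m2 = sum((r - mu) ** 2 for r in returns) / n
    m4 = sum((r - mu) ** 4 for r in returns) / n
    return m4 / m2 ** 2 - 3.0


# Hypothetical per-period returns of a signal.
signal_returns = [0.02, -0.01, 0.03, -0.02, 0.01]
kelly = kelly_fraction(signal_returns)

# Hypothetical series with one large negative return.
fat_tailed = [0.01] * 9 + [-0.09]
tail_measure = excess_kurtosis(fat_tailed)
```

Because the estimate divides a small, noisy mean by a small variance, short samples can suggest very high leverage, which is one reason the known-mean-and-variance assumption matters so much in practice.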

After the models have been simulated under risk constraints, they still need to be evaluated. Several metrics and evaluation techniques have been developed, but the most prominent one is the historical Sharpe Ratio over a pre-defined data period.

  • Sharpe Ratio: The best single criterion to evaluate stocks, under the assumption that returns have a normal distribution.
  • Returns are typically known to have high kurtosis and a long negative tail, so the probability of high drawdowns is greater in real scenarios than in back-tests.
  • Bonferroni Test: The p-value threshold for significance is adjusted as we carry out more backtests. This ensures that a new max Sharpe has to be much greater than the old max Sharpe (new max Sharpe >> old max Sharpe) to count as an actual signal.
  • Transaction costs are absolutely critical to check for. They are often not accounted for in back-tests and modelling.
  • Execution Strategy:
      • You can’t execute at midpoint prices, so the price you would actually trade at needs to be included.
      • Trade execution needs to include the volume aspect as well.
      • At the minimum, you need the last trade price with volume. The best is to have order book depth.
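
The evaluation points above can be sketched together. This is a minimal illustration under stated assumptions: 252 trading periods per year for annualisation, a proportional cost charged on turnover, and the plain Bonferroni correction (significance level divided by the number of backtests). All function names and numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(returns, periods_per_year=252, risk_free=0.0):
    """Annualised Sharpe ratio of per-period returns."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)


def net_returns(gross_returns, turnover, cost_per_unit_turnover):
    """Subtract a proportional transaction cost from each period's gross return."""
    return [r - t * cost_per_unit_turnover for r, t in zip(gross_returns, turnover)]


def bonferroni_alpha(alpha, n_backtests):
    """Per-test significance level after running `n_backtests` trials."""
    return alpha / n_backtests


# Hypothetical daily gross returns and daily turnover of a strategy.
gross = [0.01, -0.005, 0.007, 0.002, -0.003, 0.006]
turnover = [0.5] * len(gross)
net = net_returns(gross, turnover, cost_per_unit_turnover=0.001)

gross_sharpe = sharpe_ratio(gross)
net_sharpe = sharpe_ratio(net)
adjusted_alpha = bonferroni_alpha(0.05, n_backtests=20)
```

Comparing `gross_sharpe` with `net_sharpe` shows how much of the back-tested performance survives costs, and `adjusted_alpha` shows how much stricter the significance bar becomes after 20 backtests.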


Originally published at https://statarb.in.