If past history was all there was to the game, the richest people would be librarians. –Warren Buffett
- Time bars
- Tick bars
- Volume bars
- Dollar bars
- Tick Imbalance Bars
- Volume/Dollar Imbalance Bars
- Tick Runs Bars
- Volume/Dollar Runs Bars
- Sampling for Reduction
- Event-Based Sampling
- The CUSUM Filter
- Event Study
- Paper Trading
- Survivorship Bias
- Adverse Selection
- Instantaneous Communication
- Transaction Costs
- Unrealistic Backtesting
- Backtesting with cross validation
- Bet sizing
- Backtesting on synthetic data
- Backtest statistics
- Extra: Michael Dubno
Simulation is the number one tool of the quant trader. Simulating a strategy on historic data shows whether it is likely to make money.
Financial data comes in different types, including fundamental data, market data, analytics, and alternative data.
- Fundamental data is accounting data reported quarterly and contains information that can be found in regulatory filings and business analytics. It is important to confirm the exact time each data point was released to use the information correctly.
- Market data includes all trading activities on an exchange or trading venue, providing an abundant dataset for strategy research.
- Analytics are derivative data based on an original source, processed in a specific way to extract signal for you.
- Alternative data is primary information that has not made it to other sources, which can be expensive and raise privacy concerns, but offers the opportunity to work with unique and hard-to-process datasets.
Periodic “bar” data is the easiest to get. It is sampled at regular intervals (minutely, hourly, daily, weekly) and looks like this:
Date,Open,High,Low,Close,Volume
11-May-09,33.78,34.72,33.68,34.35,142775884
12-May-09,34.42,34.48,33.52,33.93,147842117
13-May-09,33.63,33.65,32.96,33.02,175548207
14-May-09,33.10,33.70,33.08,33.39,140021617
15-May-09,33.36,33.82,33.23,33.37,121326308
18-May-09,33.59,34.28,33.39,34.24,114333401
19-May-09,34.14,34.74,33.95,34.40,129086394
20-May-09,34.54,35.04,34.18,34.28,131873676
21-May-09,34.02,34.26,33.31,33.65,139253125
“Quote” data shows the best bid and offer:
SYMBOL,DATE,TIME,BID,OFR,BIDSIZ,OFRSIZ
QQQQ,20080509,9:36:26,47.94,47.95,931,964
QQQQ,20080509,9:36:26,47.94,47.95,931,949
QQQQ,20080509,9:36:26,47.94,47.95,485,616
QQQQ,20080509,9:36:26,47.94,47.95,485,566
QQQQ,20080509,9:36:26,47.94,47.95,485,576
QQQQ,20080509,9:36:26,47.94,47.95,931,944
QQQQ,20080509,9:36:26,47.94,47.95,849,944
QQQQ,20080509,9:36:26,47.94,47.95,837,944
QQQQ,20080509,9:36:26,47.94,47.95,837,956
“Tick” or “trade” data shows the most recent trades:
SYMBOL,DATE,TIME,PRICE,SIZE
QQQQ,20080509,8:01:29,47.97,1000
QQQQ,20080509,8:01:56,47.97,500
QQQQ,20080509,8:01:56,47.97,237
QQQQ,20080509,8:02:20,47.98,160
QQQQ,20080509,8:02:50,47.98,100
QQQQ,20080509,8:02:50,47.98,200
QQQQ,20080509,8:02:50,47.98,1700
QQQQ,20080509,8:02:50,47.98,500
QQQQ,20080509,8:02:53,47.98,100
“Order book” data shows every order submitted and canceled. If you do not know what it looks like, then odds are you do not have the resources to be fast enough to trade on it, so do not worry about it.
Volume bars circumvent the problems of tick bars by sampling every time a pre-defined amount of the security’s units (shares, futures contracts, etc.) has been exchanged. For example, we could sample prices every time a futures contract exchanges 1,000 units, regardless of the number of ticks involved.
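For concreteness, here is a toy sketch of volume-bar construction from a stream of (price, size) ticks. This is an illustrative implementation, not production code; any trailing ticks that don't complete a bar are dropped.

```python
def volume_bars(ticks, bar_volume):
    """Group (price, size) ticks into bars that close every time the
    cumulative traded size reaches `bar_volume` units."""
    bars, prices, cum = [], [], 0
    for price, size in ticks:
        prices.append(price)
        cum += size
        if cum >= bar_volume:
            bars.append({"open": prices[0], "high": max(prices),
                         "low": min(prices), "close": prices[-1],
                         "volume": cum})
            prices, cum = [], 0  # start a fresh bar
    return bars

# toy usage: four ticks, bars close at 1,000 units
bars = volume_bars([(47.97, 600), (47.98, 500), (47.96, 400), (47.99, 700)],
                   bar_volume=1000)
```

Note that a bar's actual volume can overshoot the threshold slightly, since the last tick is not split.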
Tick bars can introduce arbitrariness in the number of ticks and are distorted by order fragmentation. Volume bars, by sampling every time a pre-defined amount of the security's units has been exchanged, have better statistical properties and provide a convenient artifact for analyzing the interaction between prices and volume in market microstructure theories.
Dollar bars are formed by sampling an observation every time a pre-defined market value is exchanged, which is more reasonable than sampling by tick or volume when the analysis involves significant price fluctuations. The number of outstanding shares often changes multiple times over the course of a security’s life, as a result of corporate actions, and dollar bars tend to be robust in the face of those actions. Thus, dollar bars are more interesting than time, tick, or volume bars.
Tick Imbalance Bars
The tick rule defines a sequence of tick imbalances in a given sequence of ticks, where the imbalance is defined as the accumulation of signed ticks exceeding a given threshold. Tick imbalance bars (TIBs) are then defined as T-contiguous subsets of ticks that meet a certain condition. TIBs are more likely to occur when there is informed trading and can be thought of as containing equal amounts of information.
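The tick rule that generates the signed ticks can be sketched in a few lines. This is an illustrative implementation: upticks get +1, downticks get -1, and zero ticks (no price change) carry the previous sign forward.

```python
def tick_rule(prices):
    """Sign each tick: +1 on an uptick, -1 on a downtick, and carry the
    previous sign forward when the price is unchanged."""
    signs, last = [], 1  # conventional +1 start before any move is seen
    for prev, curr in zip(prices, prices[1:]):
        if curr != prev:
            last = 1 if curr > prev else -1
        signs.append(last)
    return signs
```

The running sum of these signs is the tick imbalance; a TIB closes whenever that imbalance exceeds its expected threshold.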
Volume/Dollar Imbalance Bars
The concept of tick imbalance bars (TIBs) is extended to volume imbalance bars (VIBs) and dollar imbalance bars (DIBs) to sample bars when volume or dollar imbalances diverge from expectations. The procedure to determine the index of the next sample involves computing the expected value of the imbalance at the beginning of the bar and defining VIB or DIB as a T-contiguous subset of ticks such that the expected imbalance exceeds a given threshold. The bar size is adjusted dynamically, making it more robust to corporate actions.
Tick Runs Bars
TIBs, VIBs, and DIBs are methods to monitor order flow imbalance in terms of ticks, volumes, and dollar values exchanged. Tick runs bars (TRBs) measure the sequence of buys in the overall volume and take samples when that sequence diverges from our expectations. TRBs count the number of ticks of each side without offsetting them and are useful in forming bars. The expected value of the current run can be estimated using exponentially weighted moving averages.
Volume/Dollar Runs Bars
Volume runs bars (VRBs) and dollar runs bars (DRBs) are based on the same concept as tick runs bars (TRBs), but they monitor volume or dollar imbalances instead of tick imbalances. The procedure to determine the index T of the last observation in the bar is similar to TRBs, and it involves defining the volumes or dollars associated with a run, and defining a VRB or DRB as a T-contiguous subset of ticks such that the expected volume or dollar from runs exceeds our expectation for a bar.
Applying an ML algorithm on an unstructured financial dataset is generally not a good idea due to scalability issues and lack of relevant examples. Sampling bars after certain catalytic conditions can help to achieve a more accurate prediction.
Sampling for Reduction
Downsampling is a way to reduce the amount of data used to fit the ML algorithm by sequentially sampling at a constant step size (linspace sampling) or by random uniform sampling. Linspace sampling has the advantage of being simple, but it suffers from arbitrary step size and outcomes that depend on the seed bar. Uniform sampling is better as it draws samples uniformly from the entire set of bars, but it may still not include the most relevant observations in terms of predictive power or informational content.
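Both schemes are one-liners with numpy. The sizes below are made up for illustration.

```python
import numpy as np

n_bars, n_samples = 1_000, 100  # hypothetical: 1,000 bars, keep 100

# linspace sampling: constant step size; the outcome depends on the seed bar
linspace_idx = np.linspace(0, n_bars - 1, n_samples).astype(int)

# uniform sampling: draw bar indices uniformly at random, without replacement
rng = np.random.default_rng(seed=42)
uniform_idx = np.sort(rng.choice(n_bars, size=n_samples, replace=False))
```

Neither scheme knows anything about informational content, which is the motivation for event-based sampling below.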
Portfolio managers make investment decisions after significant events occur, such as changes in market conditions or economic indicators. These events are used to train machine learning algorithms to predict future outcomes. If the algorithm is not accurate, the definition of a significant event may need to be adjusted or alternative features considered.
The CUSUM Filter
In financial time series, the signal-to-noise ratio is usually low. Using the entire dataset can cause the model to focus too much on noisy samples and not enough on informative ones. Downsampling can improve the signal-to-noise ratio, but randomly doing so is not effective as it doesn't change the ratio of noisy to informative samples. A better solution is to apply a CUSUM filter, which only creates a sample when the next values deviate sufficiently from the previous value.
Consider a locally stationary process generating IID observations y_1, ..., y_t. The cumulative sums can then be defined as

S_t = max{0, S_{t-1} + y_t - E_{t-1}[y_t]}

with boundary condition S_0 = 0. A sample is only created when S_t >= h, for some threshold h. This can be further extended to a symmetric CUSUM filter to include run-ups and run-downs such that

S_t^+ = max{0, S_{t-1}^+ + y_t - E_{t-1}[y_t]},  S_0^+ = 0
S_t^- = min{0, S_{t-1}^- + y_t - E_{t-1}[y_t]},  S_0^- = 0
S_t = max{S_t^+, -S_t^-}

and a sample is created whenever S_t >= h.
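The symmetric filter is only a few lines of code. This sketch uses first differences of the series as the deviation term, i.e. it assumes E_{t-1}[y_t] = y_{t-1}:

```python
def symmetric_cusum_filter(values, threshold):
    """Return the indices where the cumulative up-move or down-move since
    the last event exceeds `threshold` (symmetric CUSUM filter sketch)."""
    events, s_pos, s_neg = [], 0.0, 0.0
    for i in range(1, len(values)):
        diff = values[i] - values[i - 1]
        s_pos = max(0.0, s_pos + diff)  # run-up accumulator
        s_neg = min(0.0, s_neg + diff)  # run-down accumulator
        if s_pos >= threshold:
            events.append(i)
            s_pos = 0.0
        elif s_neg <= -threshold:
            events.append(i)
            s_neg = 0.0
    return events
```

The threshold controls the sampling rate: a larger h yields fewer, more significant events.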
Here are the most common simulation tools. These are too well known to spend space on here, so I just give pointers to each and recommend you google them.
Backtesting is simulating a strategy on historic data and looking at the PnL curve at the end. Basically you run the strategy like normal, but the data comes from a historical file and time goes as fast as the computer can process.
In an event study, you find all the points in time at which you have a signal and then average the preceding and following return paths. This shows you on average what happens before and after. You can see how alpha accumulates over time and if there is information leakage before the event.
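A minimal sketch of that averaging, assuming a simple array of returns and a list of event indices (events too close to the edges of the sample are skipped):

```python
import numpy as np

def event_study(returns, event_indices, window):
    """Average the return paths in a +/- `window` region around each event."""
    paths = [returns[i - window : i + window + 1]
             for i in event_indices
             if i >= window and i + window < len(returns)]
    return np.mean(paths, axis=0)  # length 2*window + 1, event at the center
```

Plotting the result shows the average pre-event drift and post-event alpha decay in one picture.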
The correlation of a signal with future returns is a quick measure of how accurately it predicts. It is better than a backtest when you need just a single number to compare strategies, such as for plotting an information horizon. You can configure a lagged correlation test in Excel in under a minute. However it doesn’t take into account transaction costs, and it doesn’t output trading-relevant metrics like Sharpe ratio or drawdown.
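The test itself is one line with numpy; the key detail is aligning the signal at time T with the return at time T + lag, not at T itself:

```python
import numpy as np

def lagged_correlation(signal, returns, lag=1):
    """Correlate the signal at time T with the return at time T + lag."""
    return float(np.corrcoef(signal[:-lag], returns[lag:])[0, 1])
```

Computing this for a range of lags gives the data for an information horizon plot.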
The information horizon diagram was published in Grinold and Kahn’s Active Portfolio Management. It is a principled way of determining how long to hold positions.
Paper trading refers to trading a strategy through your broker’s API, but using a demo account with fake money. It’s good because you can see that the API works, and see if it crashes or if there are data errors. And it also gives you a better feel for the number of trades it will make and how it will feel emotionally. However it takes a long time since you have to wait for the markets. It is more practical for high frequency systems which can generate a statistically significant number of data points in a short amount of time.
When you are testing a strategy on historical data, you have the ability to give it access to more knowledge than it could possibly have had at that point, like predicting sports with an almanac from the future. Of course it would be dumb to do that, since it is unrealistic, but usually it is accidental. It is not always easy to notice. If you have a strategy that gives a Sharpe over 3 or so on the first test, you should check very carefully that you are not accidentally introducing omniscience. For example, if you use lagged correlation to test a 6-period moving average crossover trend-following strategy and use the moving average through time T to predict time T itself, then you will get a high, but not unreasonable correlation, like .32. It is a hard error to catch and will make your results during live trading quite disappointing.
If you test a strategy on all the companies in the S&P500 using their prices over the last 10 years, your backtest results will be biased. Long-only strategies will have higher returns and short-only will have worse returns, because the companies that went bankrupt have disappeared from the data set. Even companies whose market capitalizations decreased because their prices fell will be missing. You need a data source that includes companies’ records even after they have gone bankrupt or delisted. Or you need to think very carefully about the strategy you are testing to make sure it is not susceptible to this bias. For an independent trader, the latter is more practical.
When executing using limit orders, you will only get filled when someone thinks it will be profitable to trade at your price. That means every position is slightly more likely to move against you than you may have assumed in simulation.
It is impossible to trade on a price right when you see it, although this is a common implementation for a backtester. This is mainly a problem for high-frequency systems where communication latency is not negligible relative to holding periods.
It is also easy to inaccurately estimate transaction costs. Transaction costs change over time, for example increasing when volatility rises and market makers get scared. Market impact also depends on the liquidity of the stock, with microcaps being the least liquid.
There are other issues that sometimes cause problems but are less universal. For example, your data might not be high enough resolution to show precisely how the strategy will perform. Another problem is if the exchange does not provide enough data to perfectly reconstruct the book (eg the CME).
Backtesting with cross validation
In machine learning, cross-validation is used to assess how well models generalize over unseen data by splitting the training dataset into parts: a train part to train the model, and a validation part to assess the model on unseen data. Time-series models are prone to overfitting, making cross-validation crucial in estimating the degree of overfitting.
However, standard k-fold cross-validation assumes that data is drawn from an independent and identically distributed stochastic process, which is violated in time-series modeling due to serial correlation. This leads to information leakage between the folds, where information from the training set leaks into the validation set, resulting in an overestimation of the model's performance on unseen data.
To address this, Prado (2018) developed a more robust cross-validation scheme called Combinatorial Purged Cross Validation, which is similar to k-fold but accounts for information leakage. This method purges and embargoes (removes) samples from the train set that are close to the validation set to prevent information leakage, resulting in better estimates of the model's performance on unseen data.
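A simplified sketch of the purging/embargo idea: drop training samples that fall within an embargo region around the test fold. This is not the full combinatorial scheme, just the leakage-control step, and `embargo` is measured in bars.

```python
import numpy as np

def purged_kfold(n_samples, n_splits=5, embargo=5):
    """Yield (train_idx, test_idx) pairs where training samples within
    `embargo` bars of the test fold are removed to limit leakage."""
    for test_idx in np.array_split(np.arange(n_samples), n_splits):
        lo, hi = test_idx[0], test_idx[-1]
        keep = np.ones(n_samples, dtype=bool)
        # purge the test fold itself plus an embargo region on both sides
        keep[max(0, lo - embargo) : min(n_samples, hi + embargo + 1)] = False
        yield np.arange(n_samples)[keep], test_idx
```

The combinatorial version additionally forms test sets from multiple fold combinations, producing many backtest paths instead of one.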
After developing a strategy and backtesting it, most people try to optimize it. This usually consists of tweaking a parameter, re-running the backtest, keeping the result if the results are better, and repeating multiple times.
If you think about backtesting a strategy, you can see how it could be represented as a function:
profit = backtest_strategy(parameter_1, parameter_2, ...)
Looking at it this way, it is easy to see how to make the computer test all the possible parameter values for you. Usually it is best to list all the values you want to test for a parameter and try them all (called brute force or grid search) rather than to try to save computation time using hill-climbing, simulated annealing, or a genetic algo optimizer.
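A grid search over a two-parameter strategy is just a loop. Here `backtest_strategy` is a hypothetical profit surface standing in for a real backtest, peaked at (-10, 0) like the toy example that follows:

```python
import itertools

def backtest_strategy(p1, p2):
    # hypothetical smooth profit surface, best at (-10, 0); a real
    # implementation would run the full historical simulation here
    return 6 - (p1 + 10) ** 2 / 20 - p2 ** 2 / 20

grid = [-10, 0, 10]  # 3 candidate values per parameter
best = max(itertools.product(grid, grid),
           key=lambda params: backtest_strategy(*params))
```

With k parameters and n values each, this costs n^k backtests, which is exactly the combinatorial blow-up discussed below.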
With hill climbing, you start with a certain parameter value. Then you tweak it by adding a little or subtracting a little. If it’s an improvement, keep adding until it stops improving. If it’s worsened, then try subtracting a little and keep subtracting until it stops improving.
Simulated annealing adds one feature to hill climbing. After finding a point where no parameter changes will help, in simulated annealing you add some random amount to the parameter value and see if it improves from there. This can help “jump” out of a locally optimal parameter setting, which hill climbing would get stuck in.
Genetic algo optimizers add three features. First they run multiple simulated annealings at once with different initial parameter values. The second feature only applies to strategies with multiple parameters. The genetic algo periodically stops the set of simulated annealing algos, and creates new ones by combining subsets of different algo’s parameters. Then it throws away the ones with the least profitable parameter values and starts extra copies of the most profitable ones. Finally, it creates copies that take a little bit from each of two profitable ones. (note: the hill climbing feature is renamed “natural selection,” the simulated annealing feature is renamed “mutation,” the parameter sets are renamed “DNA,” and the sharing feature is called “crossover”).
Let’s see how each of these optimizers looks in a diagram. First, think of the arguments to:
profit = backtest_strategy(parameter_1, parameter_2, ...)
as decimal numbers written in a vector:
|X_1|X_2|X_3|X_4|X_5|X_6|... | ==> $Y
Let’s say each X variable is in the range -10 to 10. Let’s say there are only two X variables. Then the brute force approach would be (assuming you cut X’s range up into 3 points):
| X_1 | X_2 | ==> $Y
| -10 | -10 | ==> $5
| -10 |   0 | ==> $6   best
| -10 |  10 | ==> $5
|   0 | -10 | ==> $4
|   0 |   0 | ==> $5
|   0 |  10 | ==> $3
|  10 | -10 | ==> $3
|  10 |   0 | ==> $3
|  10 |  10 | ==> $3
The best point is at (-10,0) so you would take those parameter values. This had 3^2 = 9 combinations we had to backtest. But let’s say you have 10 parameters. Then there are 3^10 = 59,049 combinations and you will waste a lot of time waiting for it to finish.
Hill climbing will proceed like the following. With hill climbing you have to pick a starting point, which we will say is (0,0), right in the middle of the parameter space. Also we will now change the step size to 5 instead of 10 since we know this optimizer is more efficient.
|   0 |  0 | ==> $5
|   5 |  0 | ==> $4    worse, try other direction
|  -5 |  0 | ==> $5.5  better, try another step
| -10 |  0 | ==> $6    better, cannot try another step since at -10, try other variable
| -10 |  5 | ==> $5.5  worse, try other direction
| -10 | -5 | ==> $5.5  worse, stop since no other steps to try
This only took 6 backtests to find the peak.
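The same logic can be sketched as greedy coordinate ascent. The profit function below is a hypothetical surface peaked at (-10, 0), standing in for a real backtest:

```python
def hill_climb(f, start, step=5, lo=-10, hi=10):
    """Greedy coordinate ascent: nudge one parameter at a time by `step`,
    keep any move that improves f, stop when no move helps."""
    x = list(start)
    best = f(*x)
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                candidate = list(x)
                candidate[i] = max(lo, min(hi, candidate[i] + delta))
                value = f(*candidate)
                if value > best:
                    x, best = candidate, value
                    improved = True
    return tuple(x), best

# hypothetical profit surface, best at (-10, 0)
profit = lambda p1, p2: 6 - (p1 + 10) ** 2 / 20 - p2 ** 2 / 20
```

Starting from (0, 0) this walks to (-10, 0) in a handful of evaluations, but on a multi-modal surface it would stop at whichever local peak it reached first.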
Simulated annealing is similar, but adds random jumps. We will choose to do up to 2 random jumps without profit improvement before stopping.
|   0 |  0 | ==> $5
|   5 |  0 | ==> $4    worse, try other direction
|  -5 |  0 | ==> $5.5  better, try another step
| -10 |  0 | ==> $6    better, cannot try another step since at -10, try other variable
| -10 |  5 | ==> $5.5  worse, try other direction
(above here is the same as before)
| -10 | -5 | ==> $5.5  worse, try random jump
|  -9 |  0 | ==> $6.1  better, try another random jump
|  -9 |  2 | ==> $6    worse, try another random jump
|  -9 |  3 | ==> $6    worse, stop after 2 jumps without improvement
It came up with a slightly higher optimal point and took 9 runs of the backtester. Although this is as slow as the brute force approach on this toy example, when you have more parameters, the gains increase.
Now for the genetic optimizer. Use 3 initial members of the population at (-10,-10), (-10,10), and (10,0) to start them as far apart in the space as possible. Run for 3 generations.
Generation 1:
1 | -10 | -10 | ==> $5    keep
2 | -10 |  10 | ==> $5    crossover X1 on 1 with X2 on 3
3 |  10 |   0 | ==> $3    keep but mutate

Generation 2:
1 | -10 | -10 | ==> $5    keep
2 | -10 |   0 | ==> $6    keep but mutate
3 | -10 |   8 | ==> $5.1  keep but mutate

Generation 3:
1 | -10 |   0 | ==> $6    <- best
2 | -10 |   2 | ==> $5.7
3 | -10 |   9 | ==> $5
By pure luck, crossover placed us right at the optimal parameters in this example. This also took 9 backtests. It also required the overhead of crossover and mutation. Genetic algorithms are often significantly slower than simulated annealing, though in theory they are faster than brute force.
In general these fancy optimizers are overkill and can create hard to detect problems. In trading it’s better to have fewer parameters than many. It is best to test all the possible values of a parameter and plot the graph of profits vs parameter value. If the optimizer has found a parameter value that is at an isolated peak on the curve, you are probably overfitting.
Here is an example of a profitability curve for an imaginary strategy with one parameter, an exponential moving average’s length:
Here there aren’t any potential overfitting points, however there is a local maximum at an EMA length of 1 which could mess up the hill climbing optimizers. Overfitting is often presented as a curse that one must carefully ward off with rituals and sacrificial stress testing simulations. However typically (at least when not using fancy machine learning algos like SVMs or neural nets) it appears in intuitive places. For example with a moving average convergence-divergence momentum strategy, overfitting will typically rear its ugly head when the two MA lengths are almost equal, or when one or both is very low (eg 1, 2, 3, or 4ish bars). This is because you can catch a lot of quick little market turns if the market happened to be trading at the perfect speed during a perfect little time period [in the past].
Building a trading strategy using a model that predicts the side of the trade is useful, but opening the position with a fixed size can be inefficient. A better way to size the position would be to use the model's confidence in its prediction. This concept is called bet sizing, and the following sections will explain it further.
When metalabeling is applied on top of a primary model that predicts the side of the trade, the metalabeling model outputs a predicted probability that can be used to determine the position size. The metalabeling model is a binary classifier that predicts whether to take or not take the trade.
The goal is to test the null hypothesis H_0: p[x = 1] = 1/2, where p[x = 1] is the probability that the predicted side x is correct and x takes values in {-1, 1}, with -1 and 1 representing short and long respectively. The test statistic is defined as

z = (p[x = 1] - 1/2) / sqrt(p[x = 1](1 - p[x = 1])) ~ Z

where Z is the standard normal distribution.
When the metalabeling model predicts a probability, we compute the test statistic, which we can put into the CDF of the standard normal distribution to get a bet size. However, this can lead to excessive trading with small amounts when the predicted probability is close to 0.5. To filter out these noisy trades, we can discretise the CDF in steps such that probabilities close to 0.5 are floored to a size of 0.
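A sketch of that mapping, assuming a binary metalabeling model and a predicted probability p strictly between 0 and 1 (`bet_size` and its `step` parameter are illustrative names):

```python
from math import erf, sqrt

def bet_size(p, step=0.2):
    """Map a predicted probability p into a position size in [-1, 1],
    discretised in multiples of `step` so noisy sizes near 0 vanish."""
    z = (p - 0.5) / sqrt(p * (1.0 - p))        # test statistic for H0: p = 1/2
    cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))     # standard normal CDF of z
    size = 2.0 * cdf - 1.0                     # signed size in [-1, 1]
    return round(size / step) * step           # discretise; near 0.5 -> 0
```

With step = 0.2, probabilities only slightly above 0.5 round to a size of 0, so the model stays flat instead of churning on marginal signals.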
Backtesting on synthetic data
Each strategy has an implementation tactic called "trading rules," which provide the algorithm to enter and exit a position. These rules rely on historical simulations, which can lead to the problem of backtest overfitting, making the investment strategy unfit for the future.
The goal is to exit a position optimally, which is a dilemma often faced by execution traders, and it should not be confused with the determination of entry and exit thresholds for investing in a security. To avoid the risk of overfitting, a good practice is deriving the optimal parameters for the trading rule directly from the stochastic process that generates the data, rather than engaging in historical simulations.
- construct a mesh of stop-loss and profit-taking pairs,
- generate paths for prices,
- apply the stop-loss and profit-taking logic,
and determine the optimal trading rule within the mesh. This approach avoids the problem of backtest overfitting and is applicable to various specifications, such as the discrete Ornstein-Uhlenbeck process on prices. Examples of financial data that can be generated include stock prices, stock returns, correlation matrices, retail banking data, and all kinds of market microstructure data.
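The three steps above can be sketched on a simulated Ornstein-Uhlenbeck process. All parameter values here are made up for illustration; a real application would calibrate theta and sigma to the traded instrument.

```python
import numpy as np

def ou_paths(n_paths, n_steps, theta=0.05, sigma=0.01, seed=0):
    """Simulate discrete Ornstein-Uhlenbeck paths mean-reverting to 0."""
    rng = np.random.default_rng(seed)
    p = np.zeros((n_paths, n_steps))
    for t in range(1, n_steps):
        p[:, t] = (1 - theta) * p[:, t - 1] + sigma * rng.standard_normal(n_paths)
    return p

def mean_exit_pnl(paths, profit_take, stop_loss):
    """Exit each path at the first barrier touch, else at the horizon."""
    pnl = []
    for path in paths:
        hit = next((x for x in path if x >= profit_take or x <= -stop_loss),
                   path[-1])
        pnl.append(hit)
    return float(np.mean(pnl))

# step 1: the mesh; step 2: the paths; step 3: apply the logic and pick the best
paths = ou_paths(500, 100)
mesh = [(pt, sl) for pt in (0.01, 0.02, 0.04) for sl in (0.01, 0.02, 0.04)]
best_rule = max(mesh, key=lambda m: mean_exit_pnl(paths, *m))
```

Because the optimal (profit-taking, stop-loss) pair comes from the generating process rather than from one historical path, it cannot be overfit to that path.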
Assessing the performance of your trading strategy is important, and backtesting the model on a past period is a common way to do this. Metrics such as the Sharpe ratio and maximum drawdown are often used to evaluate financial machine learning models.
However, a common mistake in backtesting is repeating the backtest multiple times on the same period until a desired result is achieved. Prado (2018) even shows that it is possible to achieve any Sharpe ratio as long as the backtest is repeated enough times.
To avoid artificially inflating the Sharpe ratio, it's important to consider the number of backtests performed and correct for this factor. Additionally, to prevent overfitting to a specific past period, the user could generate multiple paths using combinatorial purged cross-validation. More information on generating these paths can be found in Prado (2018).
Additionally, the statistics below provide information on the key features of the backtest:
- Time range: This refers to the start and end dates of the testing period, which should be long enough to cover various market conditions.
- Average AUM: This is the average value of assets under management, taking into account both long and short positions.
- Capacity: This measures the maximum AUM required to achieve the desired risk-adjusted performance, with a minimum AUM necessary for proper bet sizing and risk diversification.
- Leverage: This measures the borrowing required to achieve the reported performance, and costs must be assigned to it. Leverage is often calculated as the ratio of average dollar position size to average AUM.
- Maximum dollar position size: This indicates whether the strategy at times took dollar positions that greatly exceeded the average AUM, with preference given to strategies that rely less on extreme events or outliers.
- Ratio of longs: This shows the proportion of bets that involve long positions, with a ratio close to 0.5 preferred for long-short, market neutral strategies.
- Frequency of bets: This is the number of bets per year in the backtest, with a bet defined as a sequence of positions on the same side.
- Average holding period: This is the average number of days a bet is held, which can vary significantly depending on the strategy's frequency.
- Annualized turnover: This measures the ratio of the average dollar amount traded per year to the average annual AUM, and can be high even with a low number of bets.
- Correlation to underlying: This is the correlation between strategy returns and the returns of the underlying investment universe. A significantly positive or negative correlation indicates that the strategy is essentially holding or short-selling the investment universe without adding much value.
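A couple of these statistics are easy to compute from a position series. This sketch treats a "bet" as a maximal run of same-sign positions, matching the definition above:

```python
import numpy as np

def bet_statistics(positions):
    """Return (number of bets, ratio of longs) for a position series,
    where a bet is a maximal run of positions on the same side."""
    signs = np.sign(positions)
    n_bets = 1 + int(np.sum(np.diff(signs) != 0))  # each sign flip starts a bet
    ratio_longs = float(np.mean(signs > 0))
    return n_bets, ratio_longs
```

Dividing the bet count by the number of years in the backtest gives the frequency of bets; the ratio of longs should sit near 0.5 for a market-neutral strategy.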
Extra: Michael Dubno
Mike was CTO of Goldman Sachs. I sat down with him for a few hours in the summer of 2009 while he was advising the company I worked for. I asked him for his thoughts on backtesting. Here’s what I took away from the conversation:
- Start by making a backtester that is slow but works, then optimize and verify that the results exactly match with the slow version.
- Control 100 percent of the program’s state - and all inputs including the clock and random seed - in order to get reproducible results.
- Backtesting, paper trading, and live trading are three worlds to place the trading system in. The system should not be able to tell which world it is in. Backtesting on recorded data should produce the same results as the live run.
- There are always 2 parameters - the versioned configuration file and the world.
- A system should have 4 external sources - Database, Market, Portfolio Accountant, Execution Trader - and they should be thought of as external servers with their own config files - this cuts the system’s config file down.
- System should have a state matrix that is updated with each new point, adding the most recent and forgetting the last. Even in backtesting, data is only fed to the system one point at a time. No function can query external stats. [This makes interpreted languages like Matlab and Python much less suitable, since loops are generally very slow. The temptation to vectorize in Matlab leads to very poorly designed backtesters]
- Maximum and average drawdown are better objectives for a startup fund than Sharpe ratio, because a new fund cannot ride out 3 years waiting for a high Sharpe to play out.
More general tips:
- On combining signals. Signals should be probability surfaces in price and time.
- On software development. Use a standard comment to mark code you suspect is broken e.g. “FIX:”
- On auto-trading architecture. The “system” keeps generating portfolios and then the “trader” tries to close the delta with the current positions. This way it is no problem if you miss trades, they will go through in the long run.