A man always has two reasons for doing anything: a good reason and the real reason. – J.P. Morgan
- Examples of alphas
  - Earnings announcement drift
  - Satellite image analysis
  - Social Arbitrage
  - First Day of the Month
  - Biotech announcement front-running
  - Weighted Midpoint
- Signal identification method
- Toolbox
  - Exploration
    - Regression
    - Machine Learning
  - Normalization
  - EMA Smoothing
  - Trim Outliers
  - Debug and Check Data
  - Labeling
    - The fixed-time horizon method
    - The triple barrier method
    - Trend Scanning Method
    - Meta-labeling
    - Dropping unnecessary labels
  - Sample weights
    - Return Attribution
    - Time Decay
  - Average uniqueness of labels
  - Sequentially Bootstrapped Bagging Classifier
  - Fractionally differentiated features
    - Fixed-width Window Fracdiff
    - Stationarity With Maximum Memory Representation
  - Structural breaks
    - CUSUM Tests
    - Explosiveness Tests
  - Entropy features
    - Encoding schemes
    - A few financial applications of entropy
  - Microstructural features
- Modeling
  - Fast Online Algorithms
  - Ensemble methods
  - Estimation of proper dimensions
    - Cross validation
    - Feature importance
    - Hyperparameter tuning
Alpha is the ability to predict the future. Alpha is defined as the additional return over a naive forecast. Finding alpha is the job of a quant research analyst.
Alpha comes from four sources:
- Information
- Processing
- Modeling
- Speed
Speed is a source of alpha since acting before other traders can react is, from their point of view, equivalent to predicting the future (think of the uncertainty principle).
A toy example of predicting a person's weight from height illustrates the first three. The naive prediction is that weight will equal the mean weight of the overall population. An improved prediction is weight = β · height³, where β is a regression coefficient. The informational alpha is the use of height. The preprocessing alpha is cubing height. The modeling alpha is the use of the linear model. Modeling and processing are only subtly different in this case.
There are six common trading strategies. These are frameworks for coming up with new models and jargon for communicating concisely; they are not exhaustive. Picture a vertical axis of price: the pairs of stacked horizontal bars are the bid and offer. Remember that a market order buys at the offer or sells at the bid, and a limit order sells at the offer or buys at the bid.
Examples of alphas
Finding alpha is creative work, so I can only provide examples rather than a formula for finding new alpha. Most examples of alpha I know are owned by companies or individuals I have worked for and cannot be included here. Alpha is often nothing more than taking commonly available data and mathematically encoding it in a signal correctly. Correct often means something as simple as using a rate of change instead of a difference, normalizing a value, smoothing a chaotic signal with an EMA, using a heteroscedasticity-weighted linear regression instead of a simple regression, or handling all numerical errors and edge cases.
As history goes on, hedge funds and other large players have been absorbing the alpha, starting with the easiest opportunities. Having squeezed the pure arbs (ADR vs. underlying, ETF vs. components, mergers, currency triangles, etc.), they became hungry again and moved to stat arb (momentum, correlated pairs, regression analysis, news sentiment, etc.). But now even the big stat arb strategies are running dry, so people go further, and sometimes end up chasing mirages (nonlinear regression, causality inference in large data sets, etc.).
The following are a few creative ideas. Their variety also hints at the many different ways of coming up with and researching new alpha.
Earnings announcement drift
In Unknown Market Wizards, we learn about the unique trading strategy of Pavel Krejci, a Czech independent trader who has generated impressive annual returns of 35% over the past 13 years.
Pavel's trading strategy is characterized by a focus on earnings announcements, particularly for US stocks. He trades during an intense four-month period around these announcements, followed by three months of lighter trading and five months dedicated to research. Pavel keeps track of a small universe of 200-300 liquid stocks, with the anticipation that he may manage larger positions in the future.
His approach involves taking long positions only, with 80% of his trades and 90% of his profits coming from stocks that are trending upward and performing better than expected.
In order to open a trade, Pavel relies on two main signals: when the market is flat or declining but the stock is trending up with good earnings, or when the latest results have not been good but the trend is up. He typically places his orders within the first few minutes after the trading session opens, but never directly at the opening. At the end of the day, Pavel usually unwinds his positions and does not hold any overnight.
Statistically, about 65% of his trades are profitable, with an average gain of approximately 1.5 times the average loss. The maximum loss per trade is about 4.5%.
Satellite image analysis
In 2011, RS Metrics pioneered satellite image analysis of parking lots for hedge funds, with companies like Orbital Insight following. The data is expensive and requires skilled analysts to yield results.
Hedge funds use satellite images to predict retail performance, with parking lot volume being a reliable indicator. Parking lot volume can reveal errors in analysts' forecasts after a quarter ends but before results are publicly announced.
Market reaction to earnings announcements showed no difference between retailers covered by satellite-image providers and those that were not. As satellite imagery data becomes more accessible, the advantage gained from parking lot intelligence will likely diminish.
Social Arbitrage
Chris Camillo has successfully turned an initial capital of $83,000 into a $21 million portfolio. His trading strategy, described in Unknown Market Wizards, is based on observing people's lifestyles and analyzing consumers' reactions to different products and services. This approach, which he calls "social arbitrage", relies on social network data to detect underlying trends by focusing on the adoption of products or services by the community.
Chris Camillo starts by formulating a hypothesis, such as the popularity of a video game or a movie. Then, he collects data to validate or invalidate this hypothesis.
He collected a list of a million hashtags, keywords, etc., linked to 2,000 companies, in order to automatically see everything that is trending, everything that stands out, and all the conversations that start at the very beginning of a trend.
Chris Camillo invests in a company when his non-financial research tells him that it is on the verge of a major success that has not yet been detected by the market. He sells his shares when the market begins to incorporate this data, or when the information his trades are based on becomes public, such as through news articles or bank research.
First Day of the Month
The first day of the month is probably the most important trading day of the month: inflows arrive from 401(k) plans, IRAs, etc., and mutual funds have to go out and put this new money into stocks. Over the past 16 years, buying the close on SPY (the S&P 500 ETF) on the last day of the month and selling one day later would have been a successful trade 63% of the time, with an average return of 0.37% (as opposed to a 0.03% return and a 50/50 success rate if you buy on any random day during this period). Various conditions improve this result significantly.
For instance, James Altucher tells this story: one time I was visiting Victor's office on the first day of a month and one of his traders showed me a system and said, "If you show this to anyone we will have to kill you." Basically, the system was: if the last half of the last day of the month was negative and the first half of the first day of the month was negative, buy at 11 a.m. and hold for the rest of the day. "This is an ATM machine," the trader told me.
I leave it to the reader to test this system.
Biotech announcement front-running
Michael Keane is a successful trader who made 29% a year ... for 10 years.
He would buy biotech stocks several months before clinical trial results were announced, then sell them before the official announcement of the results. This strategy allowed him to make significant gains, but it became less effective due to ethical issues related to drug costs and the declining popularity of biotechs.
Michael also bets on drug approval announcements, especially when the odds are not in the company's favor (according to his statistics, no company valued under $300 million has received Phase 3 approval for a cancer drug). He then short sells these overvalued stocks, taking advantage of market anomalies caused by individual investors reacting emotionally.
It is important to note that risk management is crucial to the success of these trading strategies. To limit his losses, he uses stops and risks no more than 0.3% of his total capital per position.
Weighted Midpoint
Citadel vs Teza was a legal case in 2010 where Citadel, one of the largest hedge funds, sued Teza, a startup trading firm founded by an ex-Citadel executive, for stealing trading strategies.
The case filings mention the "weighted midpoint" indicator used in high-frequency trading. In market making, one would like to know the true price of an instrument. The problem is that there is no true price. There is a price where you can buy a certain size, and a different one where you can sell. How one combines these (possibly also using the prices at which the last trades occurred, or the prices of other instruments) is a source of alpha.
One sophisticated approach is the "micro-price", which weights the bid price by the proportion of size on the ask, and vice versa. This makes sense because if the ask is bigger, the bid is likely to be knocked out first, since fewer trades have to occur there to take all the liquidity and move the price to the next level. So an ask bigger than the bid implies the true price is closer to the bid.
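As a minimal Python sketch of this weighted-midpoint idea (the function name and the example quote sizes are illustrative, not taken from the case filings):

def micro_price(bid_px, bid_sz, ask_px, ask_sz):
    # Weight the bid price by the ask size and vice versa: a larger ask
    # pulls the estimated "true" price towards the bid.
    return (bid_px * ask_sz + ask_px * bid_sz) / (bid_sz + ask_sz)

# Example: bid 99 x 500 vs. ask 100 x 100 -> roughly 99.83, close to the ask,
# because the small ask is likely to be knocked out first.
print(micro_price(99.0, 500, 100.0, 100))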
For more alpha ideas, go to https://www.edarchimbaud.com/become-a-quant-trader.html.
Signal identification method
In modeling the market, it's best to start with as much structure as possible before moving on to more flexible statistical strategies. If you have to use statistical machine learning, encode as much trading domain knowledge as possible with specific distance/neighbourhood metrics, linearity, variable importance weightings, hierarchy, low-dimensional factors, etc.
Main principles:
- Focus on two main types of variables:
  - the variables to be explained (Y), which represent market characteristics useful for trading,
  - the potentially explanatory variables (X).
- Consistently use a multi-scale approach when examining these variables. The more independent the scales identified by the multi-scale analysis, the easier it will be to recombine the signals.
- Label the variables to be explained. This labeling is a pre-computed multi-scale target signal, which is the one we will then try to predict. A very important point is that we are allowed to use future information for this.
- Decompose the explanatory variables on multiple scales and make them independent.
- Use machine learning methods (either regression methods or joint density estimation methods) to predict the characteristics of Y from observations X.
- Solve the problem of estimating the dimension of the optimal representation space. For example, for a regression method, we need to determine the number of components K to retain.
Here is a checklist for documenting an alpha strategy, to make sure nothing is forgotten.
- Abstract
- Preliminary Evidence
- Formal Strategy
- Experimental Setup
- Results
- Risks
- Further Work
Toolbox
These are a set of mathematical tools that I have found are very useful in many different strategies.
Exploration
Regression
Regression is an automatic way to find the relationship between a signal and returns. It will find how well the signal works, how strongly the signal corresponds to big returns, and whether to flip the sign of the signal's predictions.
Let us say you are using an analyst's estimates to predict a stock's returns. You have a bunch of confusing, contradictory prior beliefs about what effect they will have:
- The analyst knows the company the best so the predictions are good
- Everyone follows the analyst's predictions so the trade is overcrowded and we should trade the opposite way
- The prediction is accurate in the near-term, but it is unclear how far into the future it might help
- Analysts are biased by other banking relationships with the company and so produce biased reports
- It is unclear if the analyst is smart or dumb since the predictions can never be perfect
In this case, you can take a time series of the analyst's predictions and a time series of the stock's returns, run a regression of the returns on the predictions, and the computer will instantly tell you the relationship.
Furthermore, you can plug in multiple inputs, such as the lagged returns of other securities, and it will output estimates of how accurate the predictions are.
Although it is called linear regression, if you just do separate regressions on subsets of the data, it will find "non-linear" patterns.
I will not list the formula here because there are highly optimized libraries for almost every language.
If you are not sure of the connection between a numerical signal and future returns, then you should introduce a regression step.
One of the best ways to think about regression is to think of it as a conversion between units. When you train a regression model, it is equivalent to finding a formula for converting units of signal or data into units of predicted returns. Even if you think your signal is already in units of predicted returns, adding a regression step will ensure that it is well-calibrated.
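As a minimal sketch of this unit-conversion view, here is a no-intercept regression in Python that converts a hypothetical analyst signal into units of predicted returns (the toy data generation is an assumption for illustration):

import numpy as np

# Toy data: a signal with a weak linear link to next-period returns.
rng = np.random.default_rng(0)
signal = rng.normal(size=1000)                   # e.g., analyst estimate revisions
returns = 0.1 * signal + rng.normal(size=1000)

# No-intercept least squares: beta converts signal units into predicted-return units.
beta = signal @ returns / (signal @ signal)
predicted_returns = beta * signal                # now calibrated in units of returns
print(beta)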
Machine Learning
Machine learning is like linear regression but more general. It can predict categories or words or pretty much anything else, rather than just numerical values. It can also find more complicated patterns in data, such as V-shapes or curves rather than just single straight lines.
If you give a machine learning algorithm too much freedom, it will pick a worse model. The model will work well on the historical data set but poorly in the future. However the whole point of using machine learning is to allow some flexibility. This balance can be studied as a science, but in practice finding the right balance is an art.
Normalization
Consider the signal that is the size on the bid minus the size on the ask. If the bid size is much larger, it is likely the price will go up. Therefore the signal should be positively correlated with the future return over a few-millisecond horizon. (Note: this is untradable, since if you see the bid is much larger than the ask you would get in the back of a long queue on the bid and never get filled before the ask gets knocked out. Assume you are using this signal calculated on one instrument to get information about a correlated instrument, to make it tradable.)
signal = best_bid_size - best_ask_size
This is a good core signal. But it can be improved by normalizing the value.
The problem is that when the overall size on the bid and ask is larger, the difference between them will also likely be larger, even though we would probably not like the signal value to be larger. We are more interested in the proportional difference between the bid and ask. Or to be more clever, we are interested in the size difference between the bid and ask relative to the average volume traded per interval, since that should give a number which accurately encodes the "strength" of the signal.
So two ways of making a better signal would be to use:
signal_2 = (best_bid_size - best_ask_size) / (best_bid_size + best_ask_size)
or
signal_3 = (best_bid_size - best_ask_size) / (avg volume per x milliseconds)
With one signal I worked on which was very similar to this, normalizing as in the first formula above increased the information coefficient from 0.02907 to 0.03893.
EMA Smoothing
Consider the same signal as above. There is another problem with it: the bid and the ask change so rapidly that the signal values are all over the place. Signals that change this quickly can have good alpha, but the information horizon peaks in the extremely near future. This means the prediction is strong but lasts only a brief time, sometimes only seconds. This is bad because it can result in a lot more turnover and transaction costs than you can afford, and the horizon can be so short that you do not have time to act on it given your network latency.
Applying an exponential moving average will often maintain the same good predictive performance since it still weights recent values the most, but it will also help push the peak of the information horizon out further so you have more time to act on it.
As an example, for one strategy I backtested I got the following information horizons. It is obviously an improvement and a shift to a longer horizon:
Horizon    IC before EMA    IC after EMA
1 sec      0.01711          0.01040
5 sec      0.01150          0.01732
1 min      0.00624          0.02574
5 min      0.03917          0.04492
Let's say the original signal is something like this:
double signal(Bar new_bar) {
... // do some calculations and compute new signal
return signal;
}
Then all you have to do is add a global variable to store the old EMA value and a parameter for the EMA's weight. The code becomes:
double signal_2(Bar new_bar) {
... // do some calculations and compute new signal
ema = ema*(1 - weight_ema) + signal*weight_ema;
// that could be simplified to, ema += weight_ema * (signal - ema)
return ema;
}
This can be reparameterized for aperiodic data, where the data do not arrive at fixed intervals (the new parameter is the inverse of the old one, and is renamed time_scale):
ema += ((time_now - time_previous)/time_scale) * (signal - ema);
Trim Outliers
For strategies that are based on training a regression model on a training set and then using it to predict forward, trimming the outliers of the historical returns in the training set often makes out-of-sample predictions more accurate.
Any model that is sensitive to single points (such as regression based on a squared criterion) is a good candidate for trimming outliers as a preprocessing step.
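A minimal sketch of this preprocessing step, assuming the training-set returns are in a numpy array (the 4-standard-deviation cutoff matches the backtest output below):

import numpy as np

def trim_outliers(returns, n_sd=4.0):
    # Clip (winsorise) returns beyond n_sd standard deviations so that single
    # extreme points do not dominate a squared-error regression fit.
    mu, sd = returns.mean(), returns.std()
    return np.clip(returns, mu - n_sd * sd, mu + n_sd * sd)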
Here is an example from the output of one backtest I ran:
% BEFORE TRUNCATION
-----
Sector 10
'CTL'
'LUK'
'Q'
'T'
'VZ'
pd =
5
training_set_size =
700
test_set_size =
14466
ic = 0.1079
% AFTER TRUNCATING RETURN TAILS BEYOND 4 SD'S
-----
Sector 10
'CTL'
'LUK'
'Q'
'T'
'VZ'
pd =
5
training_set_size =
700
test_set_size =
14466
ic = 0.1248
It shows which symbols were traded, that I was using 1-minute bars aggregated up to 5 minutes, that a lagging window of 700 points was used to train each regression, and that the IC (the correlation between the signals and the returns) increased. A higher IC means the predictions are more accurate.
Debug and Check Data
This is less specific and more just another reminder. Look at the signal values and data one-by-one, side-by-side, even if you have hundreds of thousands of points. You will often find little situations or values you did not anticipate and are handling badly.
You might have one crossed bid-ask in a million quotes, a few prices or OHLC values equal to zero, a data point from outside regular trading hours, a gap in the data, or a holiday that shortened the trading day. There are innumerable problems and you will see them all.
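Here is a minimal sanity-check sketch in Python, assuming a pandas DataFrame of bars with a timestamp index and hypothetical columns bid, ask, open, high, low, close:

import pandas as pd

def check_bars(bars: pd.DataFrame) -> pd.DataFrame:
    # Flag a few of the situations mentioned above; column names are assumptions.
    issues = pd.DataFrame(index=bars.index)
    issues["crossed_book"] = bars["bid"] > bars["ask"]
    issues["zero_price"] = (bars[["open", "high", "low", "close"]] <= 0).any(axis=1)
    issues["bad_ohlc"] = bars["high"] < bars["low"]
    issues["big_gap"] = bars.index.to_series().diff() > pd.Timedelta("1D")
    return issues[issues.any(axis=1)]   # rows worth inspecting by hand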
Labeling
In financial machine learning, creating target labels is essential for training a supervised learning model. Target labels can be created in different ways depending on the problem at hand.
The fixed-time horizon method
The most common way to construct target labels for a trading strategy is to use a fixed time horizon: we take the price p_t at time t and look ahead a fixed horizon h at the price p_{t+h}. To calculate the target label we take the difference between the two prices; if the difference is positive we label the sample with 1, and otherwise with -1. Note how this method does not take into account the intermediate prices, i.e. the target ignores the path taken from t to t+h, while in practice, if you went long and the price dropped or rose sharply within this period, you would have hit your stop-loss or profit-taking level respectively.
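A minimal sketch, assuming a pandas Series of prices and a horizon measured in bars:

import numpy as np
import pandas as pd

def fixed_horizon_labels(close: pd.Series, horizon: int) -> pd.Series:
    # Sign of the return between t and t+horizon: +1 up, -1 down (0 if unchanged).
    fwd_return = close.shift(-horizon) / close - 1.0
    return np.sign(fwd_return).dropna()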
The triple barrier method
The triple barrier labeling method by Prado (2018) takes into account the path taken after a position is opened at time t. To start, we need to set the stop-loss and profit-taking levels, adjusted for the level of volatility, which we refer to as the horizontal barriers. Additionally, we must decide when to close out the position after a certain number of time periods, known as the vertical barrier.
The triple barrier labeling method assigns labels to a long position based on which of the horizontal and vertical barriers is hit first. If the profit-taking level is hit first, then the label is 1, if the stop loss level is hit, then the label is -1. If the price moves between the two barriers and hits the vertical barrier, then we can either label it with 0 or we determine whether we have made a profit or a loss at that point and label it with 1 or -1 accordingly. This labeling method is more realistic than the fixed time horizon method and is therefore preferred.
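Below is a simplified sketch for long positions, assuming a pandas price series and a volatility estimate used to scale symmetric horizontal barriers; it is not Prado's full implementation:

import numpy as np
import pandas as pd

def triple_barrier_labels(close: pd.Series, vol: pd.Series,
                          pt_mult=1.0, sl_mult=1.0, max_holding=10) -> pd.Series:
    # Label +1 if the profit-taking barrier is hit first, -1 if the stop-loss is
    # hit first, otherwise the sign of the return at the vertical barrier.
    labels = {}
    for i, t in enumerate(close.index[:-1]):
        path = close.iloc[i + 1:i + 1 + max_holding] / close.iloc[i] - 1.0
        if path.empty:
            continue
        hit_pt = path[path >= pt_mult * vol.iloc[i]].index.min()
        hit_sl = path[path <= -sl_mult * vol.iloc[i]].index.min()
        if pd.notna(hit_pt) and (pd.isna(hit_sl) or hit_pt < hit_sl):
            labels[t] = 1.0
        elif pd.notna(hit_sl):
            labels[t] = -1.0
        else:
            labels[t] = float(np.sign(path.iloc[-1]))   # vertical barrier
    return pd.Series(labels)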
Trend Scanning Method
The barrier levels in triple barrier labeling are set by the user at their own discretion. The trend scanning method by Prado (2020) determines labels without having to set such barriers. The goal of trend scanning is to determine the dominant trend over a certain period and assign that period a label in {-1, 0, 1}, corresponding to a down-trend, no trend, or an up-trend respectively. Consider a time series of prices p_t. To determine the label of p_t we look ahead L periods and fit the linear regression p_{t+i} = β0 + β1·i + ε_i for i = 0, 1, ..., L, and compute the t-value of the estimated slope β1 (the estimate divided by its standard error).
By taking different values for the look-ahead span L we get different t-values. The label assigned to p_t is determined by the L for which the absolute value of the t-value is maximised: the sign of that t-value gives the trend direction.
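A simplified sketch of trend scanning for a single observation, assuming a pandas price series and a hypothetical set of look-ahead spans:

import numpy as np
import pandas as pd

def trend_scanning_label(close: pd.Series, t_index: int, spans=range(5, 21)) -> float:
    # For each look-ahead span, regress prices on time and keep the t-value of the
    # slope; the label is the sign of the t-value with the largest magnitude.
    best_t = 0.0
    for span in spans:
        y = close.iloc[t_index:t_index + span].to_numpy()
        if len(y) < span:
            break
        x = np.arange(span, dtype=float)
        X = np.column_stack([np.ones(span), x])
        beta, res, *_ = np.linalg.lstsq(X, y, rcond=None)
        sigma2 = res[0] / (span - 2) if res.size else 0.0
        se_slope = np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))
        t_val = beta[1] / se_slope if se_slope > 0 else 0.0
        if abs(t_val) > abs(best_t):
            best_t = t_val
    return float(np.sign(best_t))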
Meta-labeling
When building a trading strategy model, it's important not only to predict the side of the trade but also to predict the size of the position. This is called metalabeling. It's better to open a smaller position when the model predicts a lower probability of going long than when it predicts a higher probability. The metalabeling model focuses on predicting the size, while the primary model focuses on predicting the side. Stacking these two models can help increase the performance of the trading strategy.
When building a binary classifier for the primary model, there is a trade-off between false positives and false negatives. This trade-off can be visualized with a confusion matrix that categorizes the predictions into true positives, false negatives, false positives, and true negatives. The goal is to maximize the number of true positives while minimizing the number of false positives and false negatives.
From this confusion matrix, several metrics can be computed to assess the performance of the model.
Metalabeling is helpful in achieving a higher F1 score in a trading strategy model. First, a primary model is built to predict the side of the trade (go long or short). In this stage, it is sufficient to achieve high recall but not high precision. To increase precision, the metalabeling model is used to filter out false positives generated by the primary model. This is done by taking the predictions of the primary model and labeling which ones are predicted correctly or incorrectly to generate metalabels. The metalabel model is then fit using the existing features (and potentially other features as well) combined with the metalabels.
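A hedged sketch of this stacking, on toy data and with illustrative model choices (in practice the metalabels should be built from out-of-sample primary predictions):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = np.sign(X[:, 0] + rng.normal(size=2000))      # realised side of the move

primary = LogisticRegression().fit(X, y)          # predicts the side (long/short)
side = primary.predict(X)

meta_y = (side == y).astype(int)                  # metalabel: was the primary call correct?
meta_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, meta_y)

# Position size ~ predicted probability that the primary model's side is correct.
size = meta_model.predict_proba(X)[:, 1]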
Dropping unnecessary labels
When classes are imbalanced, some ML classifiers may not perform well. In such situations, it is better to drop extremely rare labels and focus on the more common outcomes.
Sample weights
In financial time series, the samples in the training set do not contain equal amounts of information; ideally the model would focus on significant events. For example, samples where a large absolute return is realised in the subsequent period are more interesting for the model than periods where only small returns are made. In addition, it makes intuitive sense that recent information is more valuable than dated information in financial markets. It is therefore desirable for the model to place more emphasis on recent information. These two concepts can be formalised by calculating a sample weight for each sample. These weights are then used by the model to place more emphasis on samples with a high weight.
Return Attribution
To calculate the sample weight based on the sample's return, we work with log prices, so that the sum of log-returns over a period equals the log-return over that period. The weight for sample i with a lifespan between t_{i,0} and t_{i,1} can be calculated as follows:

w_i = | sum_{t = t_{i,0}}^{t_{i,1}} r_t / c_t |,

where r_t is the log-return at time t and c_t is the number of concurrent events, i.e. the number of samples that (partially) overlap period t. The weights are then normalised such that they sum to one.
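A minimal sketch of these weights, assuming a pandas Series of log-returns and a list of (start, end) index positions describing each sample's lifespan:

import numpy as np
import pandas as pd

def return_attribution_weights(log_returns: pd.Series, spans: list) -> np.ndarray:
    # c[t] = number of samples whose lifespan covers period t (concurrent events).
    c = np.zeros(len(log_returns))
    for start, end in spans:
        c[start:end + 1] += 1
    w = np.array([abs(np.sum(log_returns.iloc[s:e + 1].to_numpy() / c[s:e + 1]))
                  for s, e in spans])
    return w / w.sum()          # normalise so the weights sum to one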
Time Decay
It is possible to assign more weight to recent samples than to older samples by calculating a time decay factor for each sample. To calculate these decay factors, we use the cumulative sum of the average uniqueness values ū_i, where the most recent sample always receives a decay factor of 1. The user controls the amount of time decay with a parameter c in (-1, 1]. For c in [0, 1], the oldest sample receives a decay factor of c (c = 1 means no decay at all). For c in (-1, 0), the decay factor is 0 for the oldest portion of the samples, which implies the model will fully ignore them. For the remaining samples, the decay factor is computed with a piecewise-linear function of the cumulative average uniqueness, chosen so that the newest sample gets a factor of 1 and the boundary conditions above are met.
Average uniqueness of labels
When creating labels for samples without a fixed horizon (e.g., with the triple barrier labeling method), they each span a different period, and some samples may overlap with other samples to varying degrees. Samples that don't overlap much with other samples are more unique and therefore more interesting for the model to look at. This becomes more relevant for machine learning models that bootstrap the training data by random sampling from the dataset. However, standard bootstrapping draws samples from a uniform distribution, which means that highly overlapping samples are as likely to be drawn as more unique samples that don't overlap much. Ideally, we would like to bootstrap the samples according to their uniqueness to get a more diverse bootstrapped dataset.
Number of Concurrent Events
To calculate the average uniqueness, we first have to calculate the number of concurrent events. This can be done with an indicator matrix 1_{t,i}, whose rows represent the time periods t and whose columns represent the samples i: 1_{t,i} = 1 if sample i is active during period t, and 0 otherwise. For example, with four periods and three samples:

          sample 1  sample 2  sample 3
    t1        1         0         1
    t2        1         1         0
    t3        0         1         0
    t4        0         1         0

To calculate the number of concurrent events we sum over the columns, c_t = sum_i 1_{t,i}; in this example c = (2, 2, 1, 1).
Average Uniqueness
The uniqueness of sample i at time t can be calculated as u_{t,i} = 1_{t,i} / c_t. In this example, the uniqueness matrix is

          sample 1  sample 2  sample 3
    t1       1/2        0        1/2
    t2       1/2       1/2        0
    t3        0         1         0
    t4        0         1         0

To calculate the average uniqueness of a sample we take the average of its uniqueness values over the periods in which it is active: ū_i = (sum_t u_{t,i}) / (sum_t 1_{t,i}).
In our example, the average uniquenesses of the samples are 0.5, 0.83, and 0.5 respectively.
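A short sketch reproducing this calculation with numpy, using the indicator matrix from the example above:

import numpy as np

ind = np.array([[1, 0, 1],
                [1, 1, 0],
                [0, 1, 0],
                [0, 1, 0]], dtype=float)   # rows: t1..t4, columns: samples 1..3

c = ind.sum(axis=1)                        # concurrent events per period: [2, 2, 1, 1]
u = ind / c[:, None]                       # uniqueness u_{t,i} = 1_{t,i} / c_t
avg_u = u.sum(axis=0) / ind.sum(axis=0)    # average over each sample's active periods
print(avg_u)                               # -> [0.5, 0.833..., 0.5]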
Sequentially Bootstrapped Bagging Classifier
When developing a machine learning model for a trading strategy, classifiers are often used to predict whether to buy, sell, or do nothing. Nonparametric models like random forests are preferred because financial time series have complicated nonlinear relationships that are unknown, and nonparametric models can discover these relationships by themselves.
Machine learning models generally suffer from three errors: bias, variance, and noise. Ensemble methods help reduce these errors by combining a set of weak learners (e.g., single decision trees) to develop a stronger learner that performs better than the individual ones. Examples of ensemble methods are random forests and bagging classifiers, which bootstrap the training data to reduce variance in the forecasts.
To get more diverse datasets, it's better to sample according to the average uniqueness of the samples. The sequentially bootstrapped bagging classifier is an ensemble model that samples according to the average uniqueness of the samples. This bootstrapping scheme reduces the probability of sampling neighboring samples when a sample is added to the bootstrapped dataset, which increases the likelihood of more diverse data in the dataset and helps reduce the variance in forecasts.
Fractionally differentiated features
In finance, time series data often have trends or non-constant means, making them non-stationary. However, for inferential analyses, it's necessary to work with stationary processes such as returns or changes in yield or volatility. Unfortunately, these transformations remove all memory from the original series, which is the basis for the model's predictive power. The question is, how much differentiation can we use to make a time series stationary while preserving as much memory as possible? Fractional differentiation allows us to explore the region between fully differentiated and undifferentiated series, which can be useful in developing highly predictive machine learning models. While supervised learning algorithms typically require stationary features, stationarity alone doesn't ensure predictive power. There's a trade-off between stationarity and memory, and too much differentiation can erase too much memory, which defeats the forecasting purpose of the ML algorithm.
Fixed-width Window Fracdiff
Using a positive non-integer coefficient d, part of the memory can be preserved:

X̃_t = sum_{k=0}^{∞} w_k · X_{t-k},

where X is the original series, X̃ is the fractionally differentiated one, and the weights w_k are defined as follows:

w_0 = 1,   w_k = -w_{k-1} · (d - k + 1) / k.

When d is a positive integer, w_k = 0 for all k > d, and memory beyond that point is cancelled. The fixed-width window method truncates the weights once |w_k| falls below a small threshold, so that every observation of X̃ is computed from the same number of past values.
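A minimal sketch of the weights and of the fixed-width window transformation, with the truncation threshold as an assumed parameter:

import numpy as np
import pandas as pd

def ffd_weights(d: float, threshold: float = 1e-4) -> np.ndarray:
    # w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k, truncated once |w_k| < threshold.
    w, k = [1.0], 1
    while abs(w[-1]) >= threshold:
        w.append(-w[-1] * (d - k + 1) / k)
        k += 1
    return np.array(w[:-1])

def frac_diff_ffd(series: pd.Series, d: float, threshold: float = 1e-4) -> pd.Series:
    w = ffd_weights(d, threshold)[::-1]          # oldest weight first
    width = len(w)
    values = series.to_numpy()
    out = [w @ values[i - width + 1:i + 1] for i in range(width - 1, len(values))]
    return pd.Series(out, index=series.index[width - 1:])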
Stationarity With Maximum Memory Representation
Virtually all finance papers attempt to recover stationarity by applying an integer differentiation d = 1, which means that most studies have over-differentiated the series; that is, they have removed much more memory than was necessary to satisfy standard econometric assumptions.
Applying the fixed-width window fracdiff (FFD) method to a series, we can compute the minimum coefficient d* for which the resulting fractionally differentiated series is stationary. This d* quantifies the amount of memory that needs to be removed to achieve stationarity.
If the input series:
- is already stationary, then d* = 0.
- contains a unit root, then d* < 1.
- exhibits explosive behavior (as in a bubble), then d* > 1.
A case of particular interest is 0 < d* << 1, when the original series is "mildly non-stationary." In this case, although differentiation is needed, a full integer differentiation removes excessive memory (and predictive power).
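A hedged sketch of the search for d*, reusing the frac_diff_ffd sketch above and the ADF test from statsmodels; the grid and the 5% significance level are assumptions:

import numpy as np
from statsmodels.tsa.stattools import adfuller

def find_min_ffd_d(series, d_grid=np.linspace(0.0, 1.0, 11), threshold=1e-4):
    # Return the smallest d in the grid whose FFD series passes the ADF test at 5%.
    for d in d_grid:
        ffd = frac_diff_ffd(series, d, threshold)
        if len(ffd) > 20 and adfuller(ffd, maxlag=1, regression="c")[1] < 0.05:
            return d
    return None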
Structural breaks
Structural breaks occur in time series when the underlying dynamics change significantly, such as during the COVID-19 pandemic which disrupted financial markets. These events can be black swan events, which invalidate past data and can lead to losses if not detected by trading strategies. Statistical tests can be used to detect structural breaks, which fall into two categories:
- CUSUM tests that check for deviations in forecasting errors from white noise,
- and explosiveness tests that also test for exponential growth and collapse.
CUSUM Tests
The Chu-Stinchcombe-White CUSUM test on levels was developed by Homm and Breitung (2012) to test for departures of the log price y_t from a reference log price y_n, where n < t. The test statistic is

S_{n,t} = (y_t - y_n) / (σ̂_t · sqrt(t - n)),

where σ̂_t is the estimated standard deviation of the one-period log-price changes up to time t. The test statistic can be compared against the time-dependent critical value

c_α[n, t] = sqrt(b_α + log(t - n)),

for a one-sided test of significance, where b_α is a constant that depends on the significance level α. To avoid choosing the reference level y_n arbitrarily, we calculate the test statistic over all n in [1, t] and take the supremum:

S_t = sup_{n in [1, t]} S_{n,t}.
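A minimal numpy sketch of the supremum statistic for a single evaluation time t, assuming log_price is an array of log prices indexed by bar number:

import numpy as np

def csw_cusum_stat(log_price: np.ndarray, t: int) -> float:
    # S_t = sup over n < t of (y_t - y_n) / (sigma_t * sqrt(t - n)); compare with
    # the time-dependent critical value sqrt(b_alpha + log(t - n)).
    diffs = np.diff(log_price[:t + 1])
    sigma = np.sqrt(np.mean(diffs ** 2))     # one-step volatility estimate
    n = np.arange(t)
    s_nt = (log_price[t] - log_price[n]) / (sigma * np.sqrt(t - n))
    return float(s_nt.max())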
Explosiveness Tests
The Chow-type Dickey-Fuller test, whose supremum version is denoted SDFC, is based on the structural-break test of Gregory Chow (1960). Consider an autoregressive process

y_t = ρ · y_{t-1} + ε_t,

where ε_t is white noise. Under the null hypothesis, y_t follows a random walk, i.e. ρ = 1, while under the alternative hypothesis y_t starts off as a random walk but, at time τ*T (where T is the sample length and τ* lies in (0, 1)),
the process switches into an explosive process.
To calculate the test statistic, we fit the following specification with ordinary least squares:

Δy_t = δ · y_{t-1} · D_t[τ*] + ε_t,

where the dummy variable D_t[τ*] is 0 when t < τ*T and 1 when t ≥ τ*T. Under the null hypothesis δ = 0 and under the alternative hypothesis δ > 0; we define the test statistic DFC_{τ*} as the t-ratio of the estimated δ.
In practice τ* is not known, so we try all possible τ* and take the maximum, such that

SDFC = sup_{τ* in [τ_0, 1 - τ_0]} DFC_{τ*},

where τ_0 is taken sufficiently large that the specification always has sufficient data to fit with at the start and at the end of the sample. The main drawback of this test is that a single bubble is assumed, while in reality there can be multiple.
The Supremum Augmented Dickey-Fuller (SADF) test by Phillips, Wu and Yu (2011) tests for periodically collapsing bubbles with the following specification:

Δy_t = α + β · y_{t-1} + sum_{l=1}^{L} γ_l · Δy_{t-l} + ε_t,

where we test β ≤ 0 under the null hypothesis and β > 0 under the alternative hypothesis. For each end point t, the above specification is fitted with backwards-expanding start points t_0, and for each (t_0, t) we calculate the ADF statistic ADF_{t_0,t}. We then define

SADF_t = sup_{t_0 in [1, t - τ]} ADF_{t_0,t},

where τ is taken sufficiently large to ensure that the specification at the end can be fitted with sufficient data.
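A simplified sketch of the SADF statistic at the final observation, using an ADF regression without lagged difference terms; min_window stands in for the minimum sample length and is an assumption:

import numpy as np

def adf_t_stat(y: np.ndarray) -> float:
    # t-statistic of beta in: diff(y)_t = alpha + beta * y_{t-1} + e_t.
    dy, y_lag = np.diff(y), y[:-1]
    X = np.column_stack([np.ones(len(y_lag)), y_lag])
    beta, res, *_ = np.linalg.lstsq(X, dy, rcond=None)
    sigma2 = res[0] / (len(dy) - 2)
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)
    return float(beta[1] / np.sqrt(cov_beta[1, 1]))

def sadf_stat(y: np.ndarray, min_window: int = 50) -> float:
    # Supremum of ADF statistics over backwards-expanding start points.
    return max(adf_t_stat(y[start:]) for start in range(0, len(y) - min_window + 1))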
Entropy features
Price series reflect the balance between supply and demand. Perfect markets have unpredictable prices because all information is immediately incorporated. Imperfect markets, on the other hand, result in partial information and asymmetrical knowledge, which can be exploited by some agents. To estimate the information content of price series, we can use machine learning (ML) algorithms to identify profitable patterns, such as momentum or mean-reversion bets. This chapter explains how to measure the information content of a price series using Shannon's entropy and other information theory concepts.
Shannon defined entropy as the average amount of information produced by a stationary data source, measured in bits per character. The entropy of a random variable is the probability-weighted average of its informational content. Low-probability outcomes reveal more information than high-probability outcomes.
Redundancy is defined as the difference between the entropy of the variable and its mutual information with another variable.
Mutual information is a natural measure of the association between variables, regardless of whether they are linear or nonlinear. The normalized variation of information is a metric derived from mutual information.
Encoding schemes
To estimate entropy, we need to encode a message. There are different encoding schemes that can be used, such as binary encoding, quantile encoding, and sigma encoding.
- Binary encoding is assigning a code based on the sign of a return.
- Quantile encoding assigns a code to a return based on the quantile it belongs to.
- Sigma encoding assigns a code based on a fixed discretization step.
Each encoding scheme has its advantages and disadvantages, but they all aim to discretize a continuous variable so that entropy can be estimated.
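A minimal sketch of two of these encodings and a plug-in entropy estimate; the bin count and word length are assumptions:

import numpy as np
from collections import Counter

def binary_encode(returns: np.ndarray) -> str:
    # Binary encoding: '1' for a non-negative return, '0' otherwise.
    return "".join("1" if r >= 0 else "0" for r in returns)

def quantile_encode(returns: np.ndarray, n_bins: int = 10) -> str:
    # Quantile encoding: each return mapped to the code of its quantile bin
    # (single-character codes, so keep n_bins <= 10 in this sketch).
    edges = np.quantile(returns, np.linspace(0, 1, n_bins + 1)[1:-1])
    return "".join(str(b) for b in np.digitize(returns, edges))

def plug_in_entropy(message: str, word_len: int = 1) -> float:
    # Maximum-likelihood (plug-in) Shannon entropy, in bits per character.
    words = [message[i:i + word_len] for i in range(len(message) - word_len + 1)]
    p = np.array(list(Counter(words).values()), dtype=float) / len(words)
    return float(-(p * np.log2(p)).sum() / word_len)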
A few financial applications of entropy
Entropy can be used to measure market efficiency, with high entropy indicating efficiency and low entropy indicating inefficiency, such as in the case of a bubble.
Maximum entropy generation can also be used to determine the most profitable future outcome.
Portfolio concentration can be defined using entropy and the generalized mean.
In microstructure theory, the probability of informed trading (PIN) can be derived from the order flow imbalance, which can be measured using entropy.
The unpredictability of the order flow imbalance can then be used to determine the probability of adverse selection. To do this, a sequence of volume bars can be quantized and its entropy estimated using Kontoyiannis' Lempel-Ziv algorithm, and the resulting feature used to predict adverse selection.
Microstructural features
Quants have analyzed the microstructure of financial markets and identified various predictive features. With more data becoming available over the years, these features have become increasingly sophisticated. While it's not possible to discuss all the microstructural features in this whitepaper, interested users can find more information in the referenced papers.
- First Generation
- These features were developed when there was limited data available, and they focus on estimating the bid-ask spread and volatility as proxies for illiquidity.
- Roll Measure
- Roll Impact
- Parkinson Volatility
- Corwin Schultz Spread
- Beckers Parkinson Volatility
- Second Generation
- These features focus on understanding and measuring illiquidity and have a stronger theoretical foundation.
- Kyle's Lambda
- Amihud's Lambda
- Hasbrouck's Lambda
- Third Generation
- Sequential trade models were developed to model randomly selected traders arriving at the market sequentially and independently. These models have become increasingly popular because they account for various sources of uncertainty.
- Volume-Synchronised Probability of Informed Trading (VPIN)
Modeling
Fast Online Algorithms
Online algorithms, like the EMA above, are very efficient when run in real time. Batch algorithms, in contrast, encourage the terrible practice of fitting the model on the entire data set, which causes overfitting (see the next section). The right way to simulate is to always use sliding-window validation, and online algorithms make this easy. They also speed up high-frequency strategies.
As you may know from statistics, the exponential distribution is called memoryless. This same property is what makes exponentially weighted algorithms efficient.
Here is a list of useful online algorithms:
Covariance
Assume the mean is zero which is a good assumption for high and medium frequency returns. Fixing the mean greatly simplifies the computation. Also assume data points from the two series arrive periodically at the same time. Assume you have a global variable named cov and a constant parameter weight:
double covariance(double x, double y) {
cov += weight*(x*y - cov);
return cov;
}
From this you can get a formula for online variance by setting y = x.
Variance
If the data are not periodic then you cannot calculate the covariance this way, since the two series do not arrive at the same time (interpolation would be required). Let us also drop the assumption of zero mean and include an EMA as the mean:
double variance(double x, double time_now, double time_previous, double time_scale) {
    double time_decay = (time_now - time_previous)/time_scale;
    ema += time_decay * (x - ema);                       // update the EMA mean first
    var += time_decay * ((x - ema)*(x - ema) - var);     // EWMA of the squared deviations
    return var;
}
Note that you may use two different values for the EMA's and the variance's time decay constants.
Linear Regression
With Variance and Covariance, we can do linear regression. Assume no intercept, as is the case when predicting short-term returns. The formula is:
y_predicted = x * cov / var_x;
And we can calculate the residual variance of the regression, var(y|x). It comes from combining the two expressions var(y|x) = var(y)·(1 - ρ²) and ρ = cov(x, y) / sqrt(var(x)·var(y)). The formula is:
var_y_given_x = var_y * (1 - cov*cov / (var_x*var_y));
Ensemble methods
ML models generally suffer from three errors:
- Bias: This error is caused by unrealistic assumptions. When bias is high, the ML algorithm has failed to recognize important relations between features and outcomes. In this situation, the algorithm is said to be βunderfit.β
- Variance: This error is caused by sensitivity to small changes in the training set. When variance is high, the algorithm has overfit the training set, and that is why even minimal changes in the training set can produce wildly different predictions. Rather than modelling the general patterns in the training set, the algorithm has mistaken noise for signal.
- Noise: This error is caused by the variance of the observed values, like unpredictable changes or measurement errors. This is the irreducible error, which cannot be explained by any model.
Bagging
Bootstrap aggregation, also known as bagging, is a technique used to reduce the variance in forecasts. The process involves generating N training datasets by randomly sampling with replacement. Then, N estimators are fit, one on each training set. These estimators are fit independently from each other and can be fit in parallel. The ensemble forecast is then calculated by taking the simple average of the individual forecasts from the N models. In the case of categorical variables, the probability that an observation belongs to a class is given by the proportion of estimators that classify that observation as a member of that class, also known as majority voting. If the base estimator can output prediction probabilities, the bagging classifier may instead average those probabilities.
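A hedged scikit-learn sketch of bagging shallow trees; X and y are assumed to be the feature matrix and labels built in the earlier sections:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging = BaggingClassifier(
    DecisionTreeClassifier(max_depth=3),   # weak learner
    n_estimators=100,
    bootstrap=True,    # sample the training set with replacement
    n_jobs=-1,         # the estimators are independent, so fit them in parallel
)
# bagging.fit(X, y); bagging.predict_proba(X_new) averages the trees' probabilities.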
Random forest
Random Forest (RF) is a machine learning method used to reduce the variance of forecasts by creating an ensemble of independently trained decision tree models over bootstrapped subsets of data. However, decision trees are prone to overfitting, which increases variance. To address this concern, RF introduces a second level of randomness where only a random subsample of attributes is evaluated when optimizing each node split. This helps to further decorrelate the estimators and reduce overfitting.
RF provides several advantages, including reduced variance, evaluation of feature importance, and out-of-bag accuracy estimates. However, if a large number of samples are redundant, overfitting may still occur, leading to essentially identical and overfit decision trees.
To address overfitting in RF, we can adjust hyperparameters such as the number of estimators, the maximum depth of the decision trees, and the minimum number of samples required to split a node. We can also use techniques such as cross-validation and grid search to tune these hyperparameters.
Boosting
Boosting is a method for combining weak estimators to achieve high accuracy. It starts by generating a training set using sample weights, which are initially set to be uniform. An estimator is then fit to the training set. If the estimator's accuracy is greater than a threshold (e.g., 50%, i.e. better than chance), it is kept; otherwise it is discarded. Misclassified observations are then given more weight, while correctly classified observations are given less weight. This process is repeated until a specified number of estimators has been produced. The ensemble forecast is the weighted average of the individual forecasts from the N models, where the weights are determined by the accuracy of the individual estimators. AdaBoost is one of the most popular boosting algorithms.
Boosting improves forecast accuracy by reducing both variance and bias, but it comes with the risk of overfitting. In financial applications, bagging may be a better option as it addresses overfitting, which is a greater concern than underfitting. Overfitting can happen easily due to the low signal-to-noise ratio in financial data. Additionally, bagging can be parallelized, while boosting generally requires sequential running.
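For comparison, a hedged AdaBoost sketch with decision stumps; unlike bagging, the weak learners are fit sequentially, reweighting misclassified observations each round:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100)
# boosting.fit(X, y)   # X, y assumed as above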
Estimation of proper dimensions
Cross validation
Cross-validation (CV) is used to determine the generalization error of an ML algorithm, which helps prevent overfitting. However, standard CV techniques fail when applied to financial problems because they assume that observations are drawn from an independent and identically distributed (IID) process, which is not true in finance. Additionally, CV contributes to overfitting through hyper-parameter tuning. To address these issues, purging and embargo techniques can be used in a modified version of CV called purged k-fold CV. Purging involves removing observations from the training set whose labels overlap in time with those in the testing set, while embargo involves removing observations that immediately follow those in the testing set. These techniques help reduce the likelihood of leakage and improve the accuracy of the ML algorithm in financial applications.
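A simplified sketch of a purged k-fold splitter with an embargo, where sample indices are taken as time order and label_end_times[i] is the (assumed) index position at which sample i's label is resolved:

import numpy as np

def purged_kfold_splits(label_end_times, n_splits=5, embargo_frac=0.01):
    n = len(label_end_times)
    embargo = int(n * embargo_frac)
    folds = [(f[0], f[-1] + 1) for f in np.array_split(np.arange(n), n_splits)]
    for test_start, test_end in folds:
        test_idx = np.arange(test_start, test_end)
        max_end = max(label_end_times[i] for i in test_idx)
        train_idx = [i for i in range(n)
                     # purge: drop samples whose labels overlap the test fold
                     if not (i < test_end and label_end_times[i] >= test_start)
                     # embargo: drop samples that start right after the test fold
                     and not (test_end <= i < min(n, max_end + embargo))]
        yield np.array(train_idx), test_idx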
Feature importance
Many financial researchers and even large hedge funds make the mistake of repeatedly running data through ML algorithms and backtesting until they get a positive result, which can lead to false discoveries. Instead, researchers should focus on understanding the importance of features in their data, which can offer valuable insights into the patterns identified by the classifier.
One approach is to use the mean decrease impurity (MDI) method, which is specific to tree-based classifiers, to rank the importance of features.
Another approach is to use mean decrease accuracy (MDA), which is a slower method that can be applied to any classifier. Both methods are susceptible to substitution effects, which can lead to the wrong conclusions when trying to understand, improve, or simplify a model.
Therefore, the single feature importance (SFI) method can also be used to compute the out-of-sample performance score of each feature in isolation without substitution effects.
Additionally, researchers can use orthogonal features, which are derived from principal component analysis (PCA), to alleviate the impact of linear substitution effects. By understanding the importance of features, researchers can gain valuable insights into their data and avoid the pitfall of repeatedly backtesting until a positive result is achieved.
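A hedged sketch contrasting MDI with a permutation-based MDA on toy data; in practice X, y and the scoring scheme would come from the labeling and purged cross-validation steps above:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
mdi = forest.feature_importances_          # mean decrease impurity (in-sample)
mda = permutation_importance(forest, X, y, n_repeats=10, random_state=0).importances_mean
print(np.round(mdi, 3), np.round(mda, 3))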
Hyperparameter tuning
Hyper-parameter tuning is a crucial step in fitting a machine learning algorithm. If not done correctly, the algorithm may overfit and not perform well in live scenarios. The ML literature emphasizes the importance of cross-validating any tuned hyper-parameters. However, cross-validation in finance can be challenging, and solutions from other fields may not work. In this chapter, we discuss the purged k-fold cross-validation method to tune hyper-parameters. Alternative methods proposed in various studies are also listed in the references section.
Grid search cross-validation is a popular method that performs an exhaustive search for the combination of parameters that maximizes the cross-validation performance. Scikit-learn provides GridSearchCV, which accepts a cross-validation generator through its cv argument.
Computational tractability
In some cases, a grid search becomes computationally intractable for ML algorithms with many parameters. In such cases, it is better to sample each parameter from a distribution. RandomizedSearchCV provides an option for this purpose.
Non-negative hyper-parameters
It is common for some ML algorithms to accept non-negative hyper-parameters only. For such parameters, drawing random numbers from a log-uniform distribution is more effective than drawing from a uniform distribution bounded between 0 and a large value.
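A hedged sketch of drawing non-negative hyper-parameters from a log-uniform distribution, using an SVC purely as an example model; X, y and the CV splitter are assumed to come from the earlier steps:

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

param_distributions = {
    "C": loguniform(1e-2, 1e2),       # covers several orders of magnitude evenly in log space
    "gamma": loguniform(1e-4, 1e0),
}
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=25, random_state=0)
# search.fit(X, y); search.best_params_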