The most powerful force in the universe is compound interest. –Einstein (unconfirmed)

- Understanding strategy risk
- Single Strategy Allocation
- Skewed die roll
- Backtesting
- Optimizing Kelly
- Multiple Strategies
- Covariance as coin flipping from 2 pools
- Simple two strategy mean-variance optimization
- More mean-variance special cases
- Spurious Correlation
- Portfolio Allocation Summary
- Extra: Empirical Kelly Code
- Extra: Kelly ≈ Markowitz
- Markowitz
- Kelly
- Where “Approximately” Breaks Down

This section is a straightforward, holistic, practical guide to risky portfolio capital allocation. It combines the best parts of the Kelly Criterion, Barra’s factors, Modern Portfolio Theory, scenario analysis, and practical heuristics, in a principled engineering approach. It stands alone without the rest of this script and I believe it is perhaps the best explanation of position sizing available. Rather than trying to fix academics’ theories or old traders’ heuristics, let’s start from the beginning.

# Understanding strategy risk

# Single Strategy Allocation

Assume you have a strategy that is profitable; otherwise, bet $0. How do you make as much money as possible? Assume you have $100 to trade with, your total net worth including everything anyone would loan you. Put all your money into it? No, then on the first losing trade you go to $0. Put 99% of your money in it? No, then on the first losing trade you go to $1, and even a 100% winner after that only brings you to $2, down a net of $98. Put 98% in it? No, then you still go down too much on losers. Put 1% in it? No, then you barely make any money. Clearly there is an optimization problem here, with a curve we are starting to uncover. It has minima around 0% and 100% and a peak somewhere in between.

## Skewed die roll

Consider the case where you are trying to find the best amount to bet on a game where you roll a die and win 1 on any number but 6, and lose 4 on a 6. This is a good strategy because on average you make money $5/6 ∗ 1 − 1/6 ∗ 4 = 1/6 > 0$. Let’s look at a graph of the PnL from betting on this game 100 times with different amounts of your capital. In R, type:

```
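# 100 rolls: win 1 by default, lose 4 on a six (probability 1/6)
X = rep(1, 100)
X[which(rbinom(100, 1, 1/6) == 1)] = -4
# compounded wealth after all rolls when betting fraction f each time
profit = function(f) prod(1 + f * X)
# candidate fractions from 0 to 1/12 in steps of 1/240
fs = seq(0, 2/24, 1/24/10)
profits = sapply(fs, profit)
plot(fs, profits, main='Optimum Profit')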
```

We have plotted all levels of leverage between 0 and 1/12:

## Backtesting

You need to optimize your position size with a principled, consistent method. First you need more detailed assumptions. Assume the strategy’s profits and losses are governed by probability, rather than being deterministic or determined adversarially. Assume this distribution is the same on each trade (we’ll get rid of this obviously impossible to satisfy assumption later). Now optimize based on the results of a backtest. Assume the same fraction of your total capital was placed into each trade. Take the sequence of profits and losses and calculate what fraction would have resulted in the most profit. Think of the backtest PnL as the sequence of die rolls you get playing the previous game, and position sizing as optimizing the curve above.

You are right if you think solely looking at the past is silly. You can freely add in imaginary PnLs which you think are likely to happen in the future. Add them in proportion to the frequency with which you think they will occur so the math doesn’t become biased. These imaginary PnLs could be based on scenario analysis (what if a Black Swan event occurs?) or your views on anything (did you make improvements to the system that you think will improve it?). There is a closed form solution to this fractional capital optimization method, which means the computer can calculate it instantly. It is called the Kelly Criterion. You have probably seen it in special cases of biased coin flips, or horse racing, but it applies to any sequence of probabilistic bets and is simple to compute.
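The backtest-plus-scenarios idea can be sketched as a grid search over the compounded wealth. This is only an illustration: the trade list, the crash size, and the helper name `best_fraction` are all made up for the example, and returns are expressed per unit of capital bet, so the "fraction" can exceed 1 (leverage).

```python
import numpy as np

def best_fraction(pnls, fractions):
    """Return the candidate fraction with the highest compounded wealth."""
    pnls = np.asarray(pnls, dtype=float)
    # Sum of logs equals the log of the compounded product, but is numerically safer.
    log_wealth = [np.sum(np.log1p(f * pnls)) for f in fractions]
    return float(fractions[int(np.argmax(log_wealth))])

# Hypothetical backtest: 55 winners of +2% and 45 losers of -2% per unit bet.
backtest = np.array([0.02] * 55 + [-0.02] * 45)
# Hypothetical scenario: a -10% crash believed to hit about once per 100 trades,
# so one imaginary crash is appended to the 100 backtest trades.
with_crash = np.append(backtest, -0.10)

fractions = np.linspace(0.0, 8.0, 801)   # candidate multiples 0.00, 0.01, ..., 8.00
f_backtest = best_fraction(backtest, fractions)
f_with_crash = best_fraction(with_crash, fractions)
print(f_backtest, f_with_crash)
```

In this toy case a single imaginary crash cuts the best multiple from 5 to roughly 1.9, which shows how sensitive the optimum is to rare large losses.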

## Optimizing Kelly

Assume you are playing a game with even odds but a greater than 50% chance of winning. Call the probability of winning $p$. We want to maximize our compound return, $G = \prod_n (1 + f ∗ r_n)$, where $f ∗ r_n$ is the gain or loss each round when betting a fraction $f$. For example this might be $(1+f)∗(1+f)∗(1−f)∗(1+f)∗(1−f)∗... = G$. In the long run of $N$ rounds, we should get about $pN$ wins and $(1 − p)N$ losses. So we can collapse $G$ as $G = (1 + f)^{pN} (1 − f)^{(1−p)N}$. We want to find the maximum value of final capital we can get by tweaking $f$. To do this, we will maximize $\log(G)$, to simplify the math, and then take a derivative and set it equal to 0: $\frac{\partial \log(G)}{\partial f} = 0 = \frac{\partial}{\partial f} \left[ pN \log(1+f)+(1−p)N \log(1−f) \right] = \frac{pN}{1 + f} + \frac{(1−p)N}{1 − f} (−1) \implies p(1−f) = (1−p)(1+f) \implies f = 2p−1$. Therefore if you win every time, $p = 1$ so you should bet $2 ∗ 1 − 1 = 1$, all your money. Try some other values of $p$, 0, .25, .5, .75, etc, to see it makes sense (a negative $f$ means the edge is on the other side of the bet).
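As a quick numerical sanity check of $f = 2p − 1$, the sketch below maximizes the per-round expected log growth on a grid. The function names and the grid are illustrative choices, not part of the original derivation.

```python
import numpy as np

def log_growth(f, p):
    """Expected log growth per round for an even-odds bet won with probability p."""
    return p * np.log(1 + f) + (1 - p) * np.log(1 - f)

def grid_kelly(p):
    """Best fraction on a 0.00, 0.01, ..., 0.99 grid."""
    fs = np.linspace(0.0, 0.99, 100)
    return float(fs[int(np.argmax(log_growth(fs, p)))])

for p in (0.55, 0.6, 0.75, 0.9):
    # The grid optimum should sit next to the closed-form answer 2p - 1.
    print(p, grid_kelly(p), 2 * p - 1)
```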

By some very elegant mathematical underpinnings and assuming Gaussian returns, the fraction that maximizes risky profits is (average PnL)/(variance of PnL), i.e. $f = \mathbb{E}(\text{return})/\mathrm{Var}(\text{return})$. We will target this expression in the next section.

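For the sake of completeness, here is the proof. We will first derive the Kelly betting fraction for the case where you have a $\frac{1}{2}$ probability of getting either return $a$ or $b$, where $b < 0$. You will bet a fraction $f$. We want to find the $f$ which maximizes (log) utility, so we take a derivative of the utility function and set it equal to zero. The value of $f$ that satisfies the equation is the optimum. Later we will plug in the infinitesimal outcomes of a Gaussian, $a = \mu \Delta t + \sigma \sqrt{\Delta t}$ and $b = \mu \Delta t − \sigma \sqrt{\Delta t}$, where $\mu < \sigma$, and take its limit as we move from discrete to continuous time.

$0 = \frac{\partial}{\partial f} \left[ \frac{1}{2} \log(1 + fa) + \frac{1}{2} \log(1 + fb) \right]$

$0 = \frac{\partial}{\partial f} \left[ \log(1 + fa) + \log(1 + fb) \right]$ (multiply both sides by 2)

$0 = \frac{a}{1 + fa} + \frac{b}{1 + fb}$ (derivative, chain rule)

$0 = \frac{a(1 + fb) + b(1 + fa)}{(1 + fa)(1 + fb)}$ (common denominator)

$0 = a(1 + fb) + b(1 + fa)$ (multiply by the denominator)

$0 = a + b + 2fab$ (distribute)

$−2fab = a + b$ (subtract $2fab$)

$f = −\frac{a + b}{2ab}$ (divide by $−2ab$)

$f = −\frac{2 \mu \Delta t}{2 (\mu \Delta t + \sigma \sqrt{\Delta t})(\mu \Delta t − \sigma \sqrt{\Delta t})}$ (plugging in $a, b$)

$f = −\frac{\mu \Delta t}{\mu^2 \Delta t^2 − \sigma^2 \Delta t}$ (simplify with algebra)

$f = \frac{\mu \Delta t}{\sigma^2 \Delta t − \mu^2 \Delta t^2}$ (simplify more with algebra)

$f = \frac{\mu}{\sigma^2 − \mu^2 \Delta t}$ (cancelling $\Delta t$)

$f = \frac{\mu}{\sigma^2}$ (limit as $\Delta t → 0$)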
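A small numeric check of this limit, as a sketch under stated assumptions: setting the derivative of $\frac{1}{2}\log(1+fa) + \frac{1}{2}\log(1+fb)$ to zero gives $f = −(a+b)/(2ab)$, and with $a, b = \mu \Delta t ± \sigma \sqrt{\Delta t}$ this should approach $\mu/\sigma^2$ as $\Delta t$ shrinks. The values of `mu` and `sigma` below are hypothetical.

```python
import numpy as np

def two_outcome_kelly(a, b):
    """Optimal fraction for a 50/50 bet returning a or b (with b < 0)."""
    return -(a + b) / (2 * a * b)

def gaussian_step_kelly(mu, sigma, dt):
    """Plug the up/down moves of a discretized Gaussian into the two-outcome formula."""
    a = mu * dt + sigma * np.sqrt(dt)
    b = mu * dt - sigma * np.sqrt(dt)
    return two_outcome_kelly(a, b)

mu, sigma = 0.05, 0.2   # hypothetical drift and volatility
for dt in (1.0, 0.1, 0.001, 1e-6):
    # The left value should approach mu / sigma**2 as dt goes to 0.
    print(dt, gaussian_step_kelly(mu, sigma, dt), mu / sigma**2)
```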

# Multiple Strategies

Assume now that you have multiple profitable strategies. You can decrease the risk of your net position by allocating parts of your capital to strategies with risks that cancel each other out. Even putting money in a strategy with lower expected profit can help out because the benefits of canceling risk are initially more powerful than sacrificing some profit. With less risk, you can invest a higher fraction of your capital and make profits off of the extra part you invest rather than sidelining it.

To balance position sizes in multiple strategies in a principled fashion requires more mathematical sophistication. Let $\sum w_i X_i$ be the portfolio represented as a random variable. The uncertain numbers (random variables) $X_i$ are the PnL distributions of each strategy. The numbers $w_i$ are the fractions of your invested capital in each strategy. The sum of all $w_i$ is 1. The expected return is then $\mathbb{E}(\sum w_i X_i) = \sum w_i \mathbb{E}(X_i)$, which is easy to maximize. The risk is then $\mathrm{Var}(\sum w_i X_i) = \sum_i w_i^2 \mathrm{Var}(X_i) + \sum_i \sum_{j \neq i} w_i w_j \mathrm{Cov}(X_i, X_j)$. Covariance is the shared risk between two strategies. Covariance can be broken down into standard deviation and correlation since $\mathrm{Cov}(X_i, X_j) = \mathrm{SD}(X_i) \mathrm{SD}(X_j) \mathrm{Corr}(X_i, X_j)$. The portfolio variance is more interesting than the expected profit, since by changing the weights we can decrease it substantially.
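In matrix form the portfolio mean is $w^T \mu$ and the variance is $w^T \Sigma w$. The sketch below builds $\Sigma$ from standard deviations and correlations and checks that the matrix form agrees with the variance-plus-covariance sums. All numbers are made-up inputs for three hypothetical strategies.

```python
import numpy as np

# Hypothetical inputs for three strategies.
mean = np.array([0.10, 0.08, 0.05])      # expected PnL of each strategy
sd = np.array([0.20, 0.15, 0.10])        # standard deviation of each strategy
corr = np.array([[1.0, 0.3, -0.2],
                 [0.3, 1.0, 0.1],
                 [-0.2, 0.1, 1.0]])
cov = np.outer(sd, sd) * corr            # Cov_ij = SD_i * SD_j * Corr_ij
w = np.array([0.5, 0.3, 0.2])            # weights summing to 1

port_mean = w @ mean
port_var = w @ cov @ w                   # matrix form of the portfolio variance

# The same variance written as the two sums: own variances plus all i != j pairs.
var_terms = np.sum(w**2 * sd**2)
cov_terms = sum(w[i] * w[j] * cov[i, j]
                for i in range(3) for j in range(3) if i != j)
print(port_mean, port_var, var_terms + cov_terms)
```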

## Covariance as coin flipping from 2 pools

The intuition behind correlation is simple. Say two random variables $X_1$ and $X_2$ are generated by flipping 10 coins each. If they are uncorrelated, then we flip the 10 coins separately for $X_1$ and $X_2$. If we flip the 10 coins to first generate $X_1$, and then leave 1 coin lying down and only flip the other 9 when generating $X_2$, then $\mathrm{Corr}(X_1, X_2 ) = 1/10 = .1$. If we don’t touch 2 coins and only reflip 8 of them for $X_2$, then the correlation will be 2/10. In terms of trading strategies, you can imagine that the coins in common represent common risk factor exposures. For example stocks typically have correlation to the overall market of about .5 − .8. This is equivalent to saying that after generating the overall market returns by flipping 10 coins, you only pick up and reflip 2 to 5 of them for a given stock. Sector correlations might represent another coin in common. The challenge is that you can’t see the individual ten coins, only their sum, so you can’t easily predict one thing using another. For strategies with $\mathrm{Var}(X) = 1 \implies \mathrm{SD}(X) = 1$, the correlation is the covariance.
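The coin picture is easy to simulate. The following sketch (function name, trial count, and seed are arbitrary choices) keeps `shared` coins in common between the two sums and reflips the rest; the sample correlation should land near `shared / 10`.

```python
import numpy as np

def shared_coin_corr(shared, n_coins=10, n_trials=100_000, seed=0):
    """Sample correlation of two coin sums that have `shared` coins in common."""
    rng = np.random.default_rng(seed)
    common = rng.integers(0, 2, size=(n_trials, shared))          # coins left lying down
    own1 = rng.integers(0, 2, size=(n_trials, n_coins - shared))  # flipped only for X1
    own2 = rng.integers(0, 2, size=(n_trials, n_coins - shared))  # reflipped for X2
    x1 = common.sum(axis=1) + own1.sum(axis=1)
    x2 = common.sum(axis=1) + own2.sum(axis=1)
    return float(np.corrcoef(x1, x2)[0, 1])

for shared in (1, 2, 5):
    print(shared, shared_coin_corr(shared))
```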

## Simple two strategy mean-variance optimization

How about the simplest 2-strategy case? Assume they have the same return $α$, risk $σ^2 = 1$ and correlation $ρ = .5$. Covariance $= ρσ_iσ_j = ρ ∗ 1 ∗ 1 = ρ$.

$\mathbb{E}(\sum w_i X_i) = w_1 α + w_2 α = (w_1 + w_2) α = 1 α = α$. So the weights don’t matter for the expected profit.

$\mathrm{Var}(\sum w_i X_i) = \sum_i w_i^2 \mathrm{Var}(X_i) + \sum_i \sum_{j \neq i} w_i w_j \mathrm{Cov}(X_i, X_j) = w_1^2 \mathrm{Var}(X_1) + w_2^2 \mathrm{Var}(X_2) + 2 w_1 w_2 \mathrm{Cov}(X_1, X_2) = w_1^2 1 + w_2^2 1 + 2 w_1 w_2 σ_1 σ_2 ρ = w_1^2 + w_2^2 + w_1 w_2$. The factor of 2 appears because the double sum counts the pair twice, once as $(1, 2)$ and once as $(2, 1)$.

Since $w_i$ is less than 1, squaring it pushes it even closer to 0.

Substituting $1 − w_1$ for $w_2$, taking a derivative, and setting it equal to zero allows us to find the minimum, by calculus. The derivative is $0 = 2 w_1 + 2 (1 − w_1) (−1) + (1 − w_1) − w_1 = 4 w_1 − 2 + 1 − 2 w_1 = 2 w_1 − 1 \implies w_1 = .5$.

Good, we expect that each strategy will get the same allocation since they have the same risk and reward.
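The same minimum can be found by brute force. This sketch evaluates the portfolio variance in matrix form, $w^T \Sigma w$, on a grid of $w_1$ values for the equal-risk, $ρ = .5$ case; the function name and grid are illustrative.

```python
import numpy as np

def two_strategy_variance(w1, sd1=1.0, sd2=1.0, rho=0.5):
    """Variance of a fully invested two-strategy portfolio with weight w1 in strategy 1."""
    w = np.array([w1, 1.0 - w1])
    cov = np.array([[sd1**2, rho * sd1 * sd2],
                    [rho * sd1 * sd2, sd2**2]])
    return float(w @ cov @ w)

grid = np.linspace(0.0, 1.0, 101)                 # w1 = 0.00, 0.01, ..., 1.00
variances = [two_strategy_variance(w1) for w1 in grid]
w1_best = float(grid[int(np.argmin(variances))])  # weight with the lowest variance
print(w1_best, min(variances))
```

With identical risk and return the grid minimum sits at an even split, matching the calculus.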

## More mean-variance special cases

How about more special cases? In the case of $N$ strategies with identical risk $\mathrm{Var}(X)$ and correlations $ρ$, the risk with an equal weight portfolio will be $\mathrm{Var}(X)(1 + ρ(N − 1))/N$. When the strategies are uncorrelated ($ρ = 0$), the risk is $\mathrm{Var}(X)/N$. When you have an infinite number of strategies, the risk goes to $ρ \mathrm{Var}(X)$. You can see how the risk varies with each attribute of the strategies’ PnLs.
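The equal-weight formula can be checked against a direct $w^T \Sigma w$ computation. A minimal sketch, with illustrative function names:

```python
import numpy as np

def equal_weight_variance(n, var=1.0, rho=0.0):
    """Variance of a 1/n portfolio computed directly from the covariance matrix."""
    cov = np.full((n, n), rho * var)   # every pair shares the same covariance
    np.fill_diagonal(cov, var)         # identical variance on the diagonal
    w = np.full(n, 1.0 / n)
    return float(w @ cov @ w)

def equal_weight_formula(n, var=1.0, rho=0.0):
    """Closed form Var(X) * (1 + rho * (N - 1)) / N."""
    return var * (1 + rho * (n - 1)) / n

for n in (2, 10, 100):
    print(n, equal_weight_variance(n, rho=0.3), equal_weight_formula(n, rho=0.3))
```

As `n` grows with `rho=0.3` both values flatten out near `0.3`, the part of the risk that diversification cannot remove.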

But before, with one strategy, we had to estimate the profit and risk from a backtest. How do we estimate Variance and Covariance? You estimate it from a backtest also. But what about the high number of things we’re trying to estimate? There are $N × N$ Covariance numbers ($N(N − 1)/2$ distinct pairs plus $N$ variances). If we haven’t backtested over a long enough time then it will be hard to accurately estimate these numbers.

## Spurious Correlation

When you only have a few observations of $X_1$ and $X_2$, there is a possibility of spurious [fake] correlation. This means that just by chance the two strategies had winning trades at the same time, and losers at the same time, although they are actually unrelated. You can simulate correlations between random unrelated strategies $X_1$ and $X_2$ to find levels that are probably authentic for different backtest lengths. Here is a graph showing the minimum correlation coefficient that is significantly different from zero at the 5% level, for given sample sizes:

```
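# Source: http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Inference
# sample sizes to evaluate
N = seq(10, 200, 25)
# 5% two-sided critical value on the Fisher z scale
Z = 1.96/sqrt(N-3)
# transform back to a correlation coefficient (equivalent to tanh(Z))
R = (exp(2*Z)-1)/(exp(2*Z)+1)
plot(N, R, pch=20)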
```

If you don’t know the statistical hypothesis testing terminology like “5% level”, then just take from this example the shape of the curve - more data points mean you can distinguish fake from authentic correlations better, but you need a lot of points.
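The simulation mentioned above can be sketched as follows: draw many pairs of unrelated Gaussian series, compute each pair's sample correlation, and take the 95th percentile of its absolute value. The function names, simulation count, and seed are arbitrary choices; the Fisher-transform threshold is the same approximation used for the plot.

```python
import numpy as np

def spurious_corr_threshold(n_obs, n_sims=10_000, seed=0):
    """95th percentile of |sample correlation| between two independent series."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_sims, n_obs))
    y = rng.standard_normal((n_sims, n_obs))
    xc = x - x.mean(axis=1, keepdims=True)   # center each simulated series
    yc = y - y.mean(axis=1, keepdims=True)
    r = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))
    return float(np.quantile(np.abs(r), 0.95))

def fisher_threshold(n_obs):
    """Approximate 5% critical correlation from the Fisher z transform."""
    return float(np.tanh(1.96 / np.sqrt(n_obs - 3)))

for n_obs in (10, 50, 100):
    print(n_obs, spurious_corr_threshold(n_obs), fisher_threshold(n_obs))
```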

You need to decrease the number of parameters. Academics and traders have come up with tons of different ways of doing this. There are problems with most methods. Before you get lost in the trees of the forest, remember that you are trying to balance strategies’ PnLs so their risks cancel out, so that you can safely put a bigger fraction of your capital to work. Estimating which strategies’ risks cancel out is hard to do because the number of pairings of $N$ strategies is $N × N$ and you only have a limited amount of backtest data. Here are the key ideas behind each approach:

- **Buy it from Barra to pass on the blame** BARRA was started in 1974 to improve portfolio construction. They were the first to predict covariances rather than just base them on historical data. They also explained covariance (beta) in terms of fundamental risk factors to make the quantitative approach more palatable to portfolio managers.
- **Factor Model** A matrix like the covariance matrix can be broken up into columns or rows. Rather than having the matrix derived from the strategies themselves, you can have it based on a smaller number of intuitive components such as a stock sector, commodity, or interest rate. So instead of having to estimate $N × N$ numbers, you will have $N × F$ where $F$ is the number of factors and hopefully less than $N$. The important thing is that the factors are independent in the linear algebra sense.
- **Regularization** The covariance matrix has variance terms along the diagonal and covariance terms off the diagonal. It is the covariance terms that will mess up position sizing. This is because there are a lot more of them and because having the wrong covariance number affects two strategies, not just 1, and it can affect them a lot if the covariance comes out negative. In regularization you take a weighted sum of the sample covariance matrix and the identity matrix. This decreases the influence of the covariance terms and balances the position sizes to be more even across all strategies.
- **Bootstrap** Bootstrapping is a way of estimating the error in your covariance number. In this method, you take B random subsamples of the strategies’ PnL curves and calculate the covariance of these samples. After doing this you will get B covariance numbers, all a little bit different depending on which points were excluded. Typically then one will take the number which is closest to 0. Choosing covariances closer to 0 balances out the strategies’ allocations.
- **Universal Portfolio** A product of game theory and information theory (which was also the origin of the Kelly Criterion), the universal portfolio approach attempts to minimize the difference between the performance of the portfolio of strategies and the best strategy in the portfolio from the very first trade (including simulated trades). It does not give the kind of performance we want since it seeks minimax optimality rather than in-expectation optimality.
- **Scenario Analysis** The covariance term numerically represents how the two strategies will move together in the future. Future movements are driven by human events. Listing the most probable scenarios along with their probabilities and anticipated effects on each strategy is another principled way to calculate covariance. This is most appropriate for long-term fundamental investors.
- **Random Matrix Theory** Random matrix theory has two components. The first contribution is a method to estimate the typical level of spurious correlations by measuring the average correlation between uncorrelated random sequences. Making this precise in the case of $N × N$ correlations is the interesting case. The second contribution is a method to extract the true correlations by building the covariance matrix from the eigenvectors of the sample covariance with the largest eigenvalues (and throwing away the rest).
- **Realized Variance** By using higher frequency data to construct the covariance matrix, you can get more datapoints so there’s less risk of spurious correlation. The relationships between strategies that exist at high frequencies typically also hold at lower frequencies. This has been termed “realized variance.” When you go to very high frequency data, problems arise because trades actually occur asynchronously, not at fixed periods like OHLC-bar data. As long as you stay above 5-min bars for large caps and 30-min bars for small caps, the realized variance approach to high frequency covariance estimation will work well. However you need access to higher frequency bars and the infrastructure to trade at multiple horizons.
- **Equal Weight Portfolio** If you become pessimistic on all this complicated estimation nonsense and just want to get on with trading, then the 1/N portfolio is for you. 1/N means you just allocate equal amounts to each strategy. A lot of academic research shows that this allocation beats fancier methods when you go out of the training data.
- **Decompose and Forecast** Covariance is really some part of each of the approaches above. The right thing to do is fuse them all. Covariance = Variance ∗ Correlation.
  - *Variance* is predictable, and even very short term exponentially weighted dynamic linear forecasting can perform very well, predicting about 30% of the hourly changes using a 10 minute window. Incorporating changes in spreads and implied volatilities provides another marginal improvement. News will improve it too.
  - *Correlation* is a matrix decomposable into factor vectors corresponding to noise and to fundamental factors, which will result from the future’s scenario, but it is affected by sampling error. Statistical and scenario-based forecasts of the factors’ correlations will improve portfolio risk estimates. Isolating the noise in the matrix and smoothing globally prevents overfitting to the risk estimates. Correlation should be statistically predicted using exponentially weighted windows of 1 week of 1-minute data. A simple linear factor model on related assets’ prices is sufficient for another significant fundamentals-based improvement.
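To make the regularization idea concrete, here is a minimal sketch of blending a sample covariance matrix with the identity. Two choices here are assumptions of this example rather than anything prescribed above: the identity is scaled by the average variance so the blend keeps the overall risk level, and the effect is shown on fully invested minimum-variance weights. The 2 × 2 matrix is made up.

```python
import numpy as np

def shrink_covariance(sample_cov, shrinkage):
    """Weighted sum of the sample covariance and a scaled identity matrix."""
    n = sample_cov.shape[0]
    # Assumption: scale the identity by the average variance to preserve the trace.
    target = np.eye(n) * np.trace(sample_cov) / n
    return (1 - shrinkage) * sample_cov + shrinkage * target

def min_variance_weights(cov):
    """Fully invested minimum-variance weights, proportional to inv(cov) @ ones."""
    raw = np.linalg.solve(cov, np.ones(cov.shape[0]))
    return raw / raw.sum()

# Hypothetical sample covariance with a large off-diagonal term.
sample_cov = np.array([[1.0, 0.9],
                       [0.9, 2.0]])
w_raw = min_variance_weights(sample_cov)
w_shrunk = min_variance_weights(shrink_covariance(sample_cov, 0.5))
print(w_raw, w_shrunk)
```

In this toy case the unregularized weights are about 92% / 8%, while shrinking halfway toward the identity gives about 62% / 38%, so the position sizes become more even.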

# Portfolio Allocation Summary

After you get the expected returns, variances, and covariances, you are ready to allocate capital to multiple strategies. After getting the right weights for each strategy, treat the strategy portfolio as a single strategy and apply the Kelly Criterion optimization to the net strategies’ PnL. If you don’t like any of the numbers that pop out from the formulas or backtest, feel free to modify them. If you see a correlation at 0.2 and you think 0.5 is more reasonable, change it. The numbers are only estimates based on the past. If you have some insight into the future, feel free to apply it.
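A sketch of this two-step recipe, assuming Gaussian returns so that the leverage on the net portfolio is mean divided by variance. The means, standard deviations, weights, and the helper names are hypothetical; the point is only to show how overriding a correlation estimate from 0.2 to 0.5 feeds through to the final leverage.

```python
import numpy as np

def corr_matrix(rho):
    """2 x 2 correlation matrix with off-diagonal rho."""
    return np.array([[1.0, rho], [rho, 1.0]])

def gaussian_kelly_leverage(mean, sd, corr, w):
    """Treat the weighted portfolio as one strategy and return mean / variance."""
    cov = np.outer(sd, sd) * corr
    port_mean = w @ mean
    port_var = w @ cov @ w
    return float(port_mean / port_var)

# Hypothetical per-period estimates for two strategies.
mean = np.array([0.001, 0.001])
sd = np.array([0.01, 0.02])
w = np.array([0.5, 0.5])

lev_estimated = gaussian_kelly_leverage(mean, sd, corr_matrix(0.2), w)  # backtest correlation
lev_adjusted = gaussian_kelly_leverage(mean, sd, corr_matrix(0.5), w)   # your own view
print(lev_estimated, lev_adjusted)
```

Raising the correlation raises the portfolio variance, so the adjusted leverage comes out lower (about 5.7 instead of about 6.9 with these inputs).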

There are still unanswered questions which I will briefly answer here.

If you have strategies that trade over different horizons, what horizon should the strategy portfolio use? Calculate the risk at the longer horizon but use separate estimators optimized for each horizon.

How can you tell if your principled estimation recipe worked? Compare it to an alternative method. How do you compare them? The general approach is to calculate the 1-period realized risk at the horizon for multiple sets of random weights; a more specific approach is appropriate for other objectives - such as constructing minimum variance portfolios.

# Extra: Empirical Kelly Code

The following Kelly optimization code is very general and useful for the case when your strategy has very skewed or fat tailed returns. I have not seen it published anywhere else so I include it here.

```
"""
These python functions are used to calculate the growth-optimal
leverage for a strategy based on its backtest performance. Each
trade's profit or loss is treated as one sample from an underlying
probability distribution which is generating the PnLs. This function
optimizes the amount of leverage one would use when betting on this
distribution, assuming that the strategy will have the same performance
in the near future.

Example:
    # Load some data from yahoo
    # (matplotlib.finance has been removed from later matplotlib
    # releases; any source of closing prices can be substituted)
    import matplotlib.finance as fin
    import datetime as dt
    start = dt.datetime(2006,1,1)
    end = dt.datetime(2007,1,1)
    d = fin.quotes_historical_yahoo('SPY', start, end)
    import numpy as np
    close = np.array([bar[4] for bar in d])
    close = close[::5] # make weekly
    returns = np.diff(close)/close[:-1]
    # Empirical Kelly
    kelly(returns)
    # Continuous/Gaussian Analytic Kelly
    np.mean(returns)/np.var(returns)
    import pylab as pl
    pl.hist(returns)
    pl.show()
    # Good: heavy left tail caused empirical Kelly
    # to be less than continuous/Gaussian Kelly
"""
import numpy as np
import scipy.optimize


def kelly(hist_returns, binned_optimization=False,
          num_bins=100, stop_loss=-np.inf):
    """
    Compute the optimal multiplier to leverage
    historical returns

    Parameters
    ----------
    hist_returns : ndarray
        arithmetic 1-pd returns
    binned_optimization : boolean
        see empirical distribution. improves runtime
    num_bins : int
        see empirical distribution. fewer bins
        improves runtime
    stop_loss : double
        experimental. simulate the effect of a stop
        loss at stop_loss percent return

    Returns
    -------
    f : float
        the optimal leverage factor.
    """
    # work on a float copy so the caller's array is not modified
    hist_returns = np.array(hist_returns, dtype=float)
    if stop_loss > -np.inf:
        stopped_out = hist_returns < stop_loss
        hist_returns[stopped_out] = stop_loss
    probabilities, returns = empirical_distribution(hist_returns,
                                                    binned_optimization,
                                                    num_bins)

    # the optimizer passes f as a 1-element array
    def expected_log_return(f):
        return expectation(probabilities, np.log(1 + f * returns))

    def objective(f):
        # minimize the negative of the expected log return
        return -expected_log_return(f)

    def derivative(f):
        return np.atleast_1d(-expectation(probabilities,
                                          returns / (1. + f * returns)))

    result = scipy.optimize.fmin_cg(f=objective, x0=1.0, fprime=derivative,
                                    disp=1, full_output=1, maxiter=5000,
                                    callback=mycall)
    # with full_output=1 the first element of the result is the optimum
    return float(result[0][0])


def empirical_distribution(hist_returns, binned_optimization=True,
                           num_bins=100):
    """
    Aggregate observations and generate an empirical probability
    distribution

    Parameters
    ----------
    hist_returns : ndarray
        observations, assumed uniform probability ie point masses
    binned_optimization : boolean
        whether to aggregate point masses to speed computations
        using the distribution
    num_bins : int
        number of bins for histogram. fewer bins improves
        runtime but hides granular details

    Returns
    -------
    probabilities : ndarray
        probabilities of respective events
    returns : ndarray
        events/aggregated observations.
    """
    if binned_optimization:
        frequencies, return_bins = np.histogram(hist_returns,
                                                bins=num_bins)
        probabilities = np.double(frequencies) / len(hist_returns)
        # represent each bin by its midpoint
        returns = (return_bins[:-1] + return_bins[1:]) / 2
    else:
        # uniform point masses at each return observation
        probabilities = np.double(np.ones_like(hist_returns)) / len(hist_returns)
        returns = hist_returns
    return probabilities, returns


def expectation(probabilities, returns):
    """
    Compute the expected value of a discrete set of events given
    their probabilities."""
    return np.sum(probabilities * returns)


def mycall(xk):
    # print the current leverage estimate after each optimizer iteration
    print(xk)
```

# Extra: Kelly ≈ Markowitz

Not a lot of people are aware of this, and it creates some unnecessary holy wars between people with different views on position sizing. I have not seen it elsewhere so I have included it here.

There are a couple of different approaches to determining the best leverage to use for an overall portfolio. In finance 101 they teach Markowitz’s mean-variance optimization where the efficient portfolios are along an optimal frontier and the best one among those is at the point where a line drawn from the risk free portfolio/asset is tangent to the frontier. In the 1950s Kelly derived a new optimal leverage criterion inspired by information theory (which had been established by Shannon a few years earlier). The criterion being optimized in these two cases is typically called an “objective function”: a function of possible asset weights/allocations that outputs a number giving the relative estimated quality of the portfolio weights. However, the two objective functions look quite different (peek ahead if you’re unfamiliar). In this note I show the two are approximately equivalent, with the approximation being very close in realistic risk-return scenarios.

I’m ignoring the various other naive/heuristic position sizing approaches which float around the tradersphere: “half-Kelly”, risk-“multiple” based approaches, 130/30 strategies, beginners’ basic 100% allocation, etc.

The following are sets of synonyms in the literature:

{ Mean variance, modern portfolio theory, Markowitz optimization }

{ Kelly criterion, maximize log utility, maximize geometric growth }

## Markowitz

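Mean variance portfolio optimization maximizes the objective function:

$\mathbb{E}[w ∗ r] − λ \mathrm{Var}(w ∗ r)$

$= \mathbb{E}[w ∗ r] − λ \left( \mathbb{E}[(w ∗ r)^2] − \mathbb{E}[w ∗ r]^2 \right)$ (computational formula for variance)

$≈ \mathbb{E}[w ∗ r] − λ \mathbb{E}[(w ∗ r)^2]$ (justification below)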

Where $λ$ represents risk aversion- supposedly a known constant like 1. We are allowed to do the last step because in the financial setting, $\mathbb{E}[w ∗ r]$ is typically less than 1 i.e. returns are expected to be less than 100%. This causes $\mathbb{E}[w∗r]^2 <<\mathbb{E}[w∗r]$ so we can ignore it as a round-off error. The fact that $w∗r$ is so much less than 1 (since asset weights sum to 1 and returns are less than 1) will be useful later too.

## Kelly

This Taylor series formula from Wikipedia will be useful below when we work with Kelly’s objective:

$\log(1 + x) = \sum_{n=1}^{∞} \frac{(−1)^{n+1}}{n} x^n$ for $|x| < 1$

Taylor series can be used to approximate functions such as ln(1 + x) by calculating the first couple terms i.e. $n = \{ 1, ..., m \}, \; m < ∞$. They were discovered in the 1300s in India but took another 300 years for European “mathematicians” to figure out.

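Moving on, in contrast to Markowitz above, Kelly maximizes the log growth rate:

$\mathbb{E}[\log(1 + w ∗ r)]$

$= \mathbb{E}\left[ \sum_{n=1}^{∞} \frac{(−1)^{n+1}}{n} (w ∗ r)^n \right]$ (as above, assume $w ∗ r < 1$)

$≈ \mathbb{E}\left[ \sum_{n=1}^{2} \frac{(−1)^{n+1}}{n} (w ∗ r)^n \right]$ (throwing away terms corresponding to $n > 2$)

$= \mathbb{E}\left[ (−1)^2 (w ∗ r) + \frac{(−1)^3}{2} (w ∗ r)^2 \right]$ (expanding the sum)

$= \mathbb{E}\left[ w ∗ r − \frac{1}{2} (w ∗ r)^2 \right]$ (evaluating the two $(−1)^{n+1}$ factors)

$= \mathbb{E}[w ∗ r] − \frac{1}{2} \mathbb{E}[(w ∗ r)^2]$ (linearity of expectation)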

This is the same as the final result of Markowitz above except here $λ = \frac{1}{2}$.
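A quick numerical illustration of how close the two objectives are for small returns. The return series is simulated (Gaussian, with a hypothetical daily mean and volatility), so this is a sketch of the claim, not evidence about any real strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical daily portfolio returns w*r: mean 0.05%, volatility 1%.
r = rng.normal(0.0005, 0.01, size=100_000)

kelly_objective = np.mean(np.log1p(r))                 # E[log(1 + w*r)]
markowitz_objective = np.mean(r) - 0.5 * np.mean(r**2) # E[w*r] - (1/2) E[(w*r)^2]
gap = abs(kelly_objective - markowitz_objective)
print(kelly_objective, markowitz_objective, gap)
```

The gap comes from the discarded $n > 2$ terms and is tiny relative to either objective at this scale of returns.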

## Where “Approximately” Breaks Down

First of all the user must obviously have a “risk aversion” which seeks only to maximize wealth ($λ = \frac{1}{2}$). Now let’s look at the places where approximately equals ≈ was used in each derivation. Throwing away $−\mathbb{E}[w∗r]^2$ in Markowitz will be violated if the expected return is near 100% or higher. Since we’re looking at daily returns $\mathbb{E}[w ∗ r]$ is almost certainly $< 1\% \implies \mathbb{E}[w ∗ r]^2 < .01\%$ so we can basically accept this one as a reasonable approximation.

The looser approximation was in the derivation of Kelly where we threw away terms corresponding to $n > 2$ in the summation. If your strategy has high alpha ($\mathbb{E}[w ∗ r] > 0$) and is positively skewed (i.e. fat right tail returns $\mathbb{E}[(w ∗ r)^3] > 0$) then this is a bad approximation. This is the case if, let’s say, you’re Peter Lynch or Warren Buffett and every now and then you pick a ‘ten-bagger’ while even your unsuccessful picks don’t lose much or stay flat. Alternatively you could just have an option strategy which loses money most of the time but every now and then makes a huge win, for example.

Keep in mind that when you use Markowitz instead of Kelly you lose this sensitivity to skewness (and higher moments).
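The skewness effect can be seen in a toy two-outcome strategy. This sketch compares the leverage that maximizes the exact log objective with the leverage that maximizes the second-order (Markowitz-style, $λ = \frac{1}{2}$) approximation. The payoff numbers are invented to mimic an option-like strategy that loses a little most of the time and occasionally wins big.

```python
import numpy as np

# Hypothetical option-like strategy: lose 1% with probability 0.95, gain 30% with probability 0.05.
outcomes = np.array([-0.01, 0.30])
probs = np.array([0.95, 0.05])

def exact_log_growth(f):
    """Kelly objective E[log(1 + f*r)] at leverage f."""
    return float(np.sum(probs * np.log1p(f * outcomes)))

def quadratic_growth(f):
    """Second-order approximation E[f*r] - (1/2) E[(f*r)^2]."""
    x = f * outcomes
    return float(np.sum(probs * (x - 0.5 * x**2)))

fs = np.linspace(0.0, 5.0, 5001)   # candidate leverage 0.000, 0.001, ..., 5.000
f_exact = float(fs[int(np.argmax([exact_log_growth(f) for f in fs]))])
f_quadratic = float(fs[int(np.argmax([quadratic_growth(f) for f in fs]))])
print(f_exact, f_quadratic)
```

With these payoffs the exact objective supports leverage of about 1.83, while the quadratic approximation stops at about 1.20, because it treats the large right-tail win as if it were risk.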