The main goal of this work is to understand and overcome the different stages of building an index fund
Given a target population of n stocks, select q stocks and their weights in the index fund (a portfolio), to represent the target population as closely as possible.
The idea here is to consider the following ILP model aiming at aggregating a broad market index of n securities into a representative index fund:
\begin{array}{lll} max \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{ij} x_{ij} \hspace{7cm} (1) \\ \\ \sum_{j=1}^{n} y_{j} = q \hspace{8cm} (2) \\ \\ \sum_{j=1}^{n} x_{ij} = 1 \hspace{1cm} i = 1,...,n \hspace{5cm} (3) \\ \\ x_{ij} \le y_{j} \hspace{1.5cm} i= 1,...,n \hspace{0.5cm} j= 1,...,n \hspace{3cm} (4)\\ \\ x_{ij}, y_{j} \in \left\{ 0,1 \right\} \hspace{0.5cm} i= 1,...,n \hspace{0.5cm} j= 1,...,n \hspace{3cm} (5) \end{array}
Where the decision variables represent :
$ y_{j} = \left\{ \begin{array}{ll} 1 \; \; if \; stock \; j \; is \; in \; the \; index\; fund \\ 0\; \; otherwise \end{array} \right. $
$ x_{ij} = \left\{ \begin{array}{ll} 1 \; \; if \; j \; is \; the \; most \; similar \; stock \; to\; stock \; i \; in \; the \; index \; fund \\ 0\; \; otherwise \end{array} \right. $
q represents the number of securities in the index fund and $\alpha_{ij} $ the similarity between stock i and stock j.
We can use the covariance matrix or the correlation matrix as a measure of similarity between the fund assets.
The objective function (1) maximizes total similarity between securities selected.
The cardinality constraint (2) enforces that the tracking portfolio will have exactly q securities.
Constraint (3) ensures that each securities has exactly one representative in the portfolio.
Constraint (4) prohibits a security to be a representative of any security if it is not selected to be part of the tracking portfolio.
Constraint (4) stipulates that x and y are binary variables
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pulp import *
import seaborn as sns
import re
import time
plt.style.use('seaborn')
def getData(n, l):
'''
n : number of assets in the fund
l : number of days considered
returns a data set of dimension (l,n)
'''
data=pd.read_csv('H:/Desktop/Dauphine/Optimisation en Finance/CAC40.csv')
data = data.set_index('Date')
data = data[data.columns[0:n]][0:l]
data_return = data.pct_change()
data_return.fillna(0, inplace=True)
return data_return
print(getData(4,100).head())
def solver(n, q, df):
'''
The solver function takes the number of stocks within the fund (n),
the number of securities we should select to represent the index fund as closely as possible (q),
and the measure of similarity between the stocks which is the correlation or covariance matrix (df)
It returns the ILP problem previously expressed
'''
df = np.ravel(df) # transforms the matrix into a vector
# Define the decision variables
x_name=[]
for i in range(n*n):
x_name.append("x{0}".format(i+1))
x = [LpVariable(x_name[i], 0 , 1, LpBinary) for i in range(n*n) ]
y_name=[]
for i in range(n):
y_name.append("y{0}".format(i+1))
y = [LpVariable(y_name[i], 0 , 1, LpBinary) for i in range(n) ]
# defines the problem
prob = LpProblem("problem", LpMaximize)
# Objective function
prob += sum(df * x)
# Constraints
prob += LpConstraint(sum(y),sense=0,rhs = q) # y1 + y2 + y3 + ... = q
for i in range(n):
prob += LpConstraint(sum(x[(i*n):((i+1)*n)]),sense=0,rhs = 1) # sum(x_ij) = 1
j = 0
for y in y:
for i in range(n):
prob += LpConstraint(x[i*n+j]-y,sense=-1,rhs = 0) # x_ij <= y_j
j = j + 1
return prob
Let's try to solve the problem for q in range 1 to 3 and a fund of 4 assets using the covariance and the correlation matrix as a measure of similarity
data_return = getData(4,100)
data_cov = data_return.cov()
data_corr = data_return.corr()
indexFund = data_return.mean(axis = 1)
data_return['indexFund'] = indexFund
print(data_return.describe()[1:3])
# If we use the covariance matrix
df = data_cov
print("\n",data_return.cov())
for q in range(1,4):
prob = solver(n=4, q=q, df=df)
prob.solve()
print("\n status : ", LpStatus[prob.solve()], ", q = ", q)
for v in prob.variables():
if 'y' in v.name:
print(v.name, "=", v.varValue)
# If we use the correlation matrix
df = data_corr
print("\n",data_return.corr())
for q in range(1,4):
prob = solver(n=4, q=q, df=df)
prob.solve()
print("\n status : ", LpStatus[prob.solve()], ", q = ", q)
for v in prob.variables():
if 'y' in v.name:
print(v.name, "=", v.varValue)
In each case it seems that the assets selected are those whose performance is the closest to the performance of the whole fund.
We use 1000 days and q=10 assets to select the stocks within our index fund, we assumes that we invest equally in the 10 assets selected.
After that we observe the expected return of the fund versus the expected return of ou index fund for a period of 100 days to see if our approach is relevant of not.
data_return = getData(30,2000)
def GetCompareExpectedReturns(data, q, nbDaysUsed, nbDaysObserved):
nbDays = [] # list to stock the number of days used to construct the index fund
ERIndexFund = [] # list to stock expected returns of the index fund
ERFund = [] # list to stock expected returns of the fund
dataExp = data[data.columns[:]][0:nbDaysUsed]
df = dataExp.corr()
prob = solver(n=dataExp.shape[1], q=q, df=df)
prob.solve()
for j in range(nbDaysUsed, nbDaysUsed+nbDaysObserved-1):
a = [] # list to stock expected returns of asset selected
for v in prob.variables():
if 'y' in v.name and v.varValue==1:
ide = int(re.search(r'\d+', v.name).group()) - 1
a.append(data[data.columns[ide]][j:j+1].mean())
nbDays.append(j)
ERIndexFund.append( sum(a) / float(len(a)))
ERFund.append(data[data.columns[:]][j:j+1].mean().mean())
return ERIndexFund, ERFund, nbDays
ERIndexFund, ERFund, nbDays = GetCompareExpectedReturns(data=data_return, q=10, nbDaysUsed=1000, nbDaysObserved=100)
port = pd.DataFrame({'nbDays': nbDays, 'ERFund': ERFund, 'ERIndexFund': ERIndexFund})
# Plot of the comparison between the behaviour of the fund and our index fund
plt.scatter(port['nbDays'], port['ERFund'], color='blue', alpha=0.9, s=10, label='')
plt.scatter(port['nbDays'], port['ERIndexFund'], color='red', alpha=0.9, s=10, label='')
plt.plot(port['nbDays'], port['ERFund'], color='blue', label='Fund')
plt.plot(port['nbDays'], port['ERIndexFund'], color='red', label='Index Fund')
plt.xlabel('Days observed')
plt.ylabel('Expected Returns')
plt.title('Expected Returns of the fund versus the index fund according to different periods of time considered')
plt.legend(loc='upper right')
plt.plot
The behaviour of our index fund seems to replicate correctly the fluctuation of the fund.
However this test is not enough to conclude about the performance of this approach.
Let's compare the portfolio performance constructed in period t (based on historical data up to period t) with the market performances by observing the portfolio performances in period t + 1.
To do that, we construct an index fund for different periods of time and look at the difference of the expected return of the fund versus the expected return of ou index fund the day after the period considered.
data_return = getData(30,1000)
def GetExpectedReturns(data, q):
nbDays = []
ERIndexFund = []
ERFund = []
for j in range(7, data.shape[0]):
a = []
dataExp = data[data.columns[:]][0:j]
df = dataExp.corr()
prob = solver(n=dataExp.shape[1], q=q, df=df)
prob.solve()
for v in prob.variables():
if 'y' in v.name and v.varValue==1:
ide = int(re.search(r'\d+', v.name).group()) - 1
a.append(dataExp[dataExp.columns[ide]][0:dataExp.shape[0]].mean())
nbDays.append(j)
ERIndexFund.append( sum(a) / float(len(a)))
ERFund.append(dataExp.mean().mean())
return ERIndexFund, ERFund, nbDays
ERIndexFund, ERFund, nbDays = GetExpectedReturns(data=data_return, q=10)
port = pd.DataFrame({'nbDays': nbDays, 'ERFund': ERFund, 'ERIndexFund': ERIndexFund})
plt.scatter(port['nbDays'], port['ERFund'], color='blue', alpha=0.9, s=10, label='')
plt.scatter(port['nbDays'], port['ERIndexFund'], color='red', alpha=0.9, s=10, label='')
plt.xlabel('Number of days considered')
plt.ylabel('Expected Returns')
plt.title('Expected Returns of the fund versus the index fund according to different periods of time considered')
plt.legend(loc='upper left')
plt.plot
The difference of the expected return of the fund versus the expected return of ou index fund decrease while we increase the period considered.
data_return = getData(30,100)
data_cov = data_return.cov()
data_corr = data_return.corr()
def Getqperf(data, df):
lq = []
perf = []
range_q = data.shape[1]
for q in range(1, range_q):
prob = solver(n=data.shape[1], q=q, df=df)
prob.solve()
lq.append(q)
perf.append(value(prob.objective))
return lq, perf
df = data_cov
lq, perf_cov = Getqperf(data_return, df)
df = data_corr
lq, perf_corr = Getqperf(data_return, df)
port = pd.DataFrame({'q': lq, 'ObjFuncValue_cov': perf_cov, 'ObjFuncValue_corr': perf_corr})
print(port.head())
plt.scatter(port['q'], port['ObjFuncValue_cov'], color='blue', alpha=0.9, s=15, label='')
plt.scatter(port['q'], port['ObjFuncValue_corr'], color='red', alpha=0.9, s=15, label='')
plt.plot(port['q'], port['ObjFuncValue_cov'], color='blue', label='Covariance')
plt.plot(port['q'], port['ObjFuncValue_corr'], color='red', label='Correlation')
plt.xlabel('Number of assets selected')
plt.ylabel('Objective function value')
plt.title('Performance according to the number of assets selected')
plt.legend(loc='upper left')
plt.plot
According to the following graphic, it seems that the best value for q, in other words, the value of q which maximize the objective function is the highest and that the correlation matrix as a measure of similarity between assest is hugely more efficient than the covariance matrix.
Therefore, if we want to determine a good portfolio analyzing the trade off between q and the performances, we would choose the highest q possible which would leads us to a full replication strategy. Hence, to achieve the same performance as the index, the tracking portfolio would be built by purchasing all the assets available in the funds with the same fixed quantities. Otherwise we could select q assets arbitrarily.
Both option are not optimal and they are a lot of drawbacks associated with them :
$\bullet$ There are no constraints about minimum or maximum amount of capital invested in each asset
$\bullet$ Transactions costs are not taken into account
$\bullet$ The portfolio constructed in this manner can track well the benchmark in terms of return but may be insufficiently diversified as there is no constraints that limits portfolio risk
$ w_{j}$ is the total market value of the stocks represented by stock j in the fund
$$ w_{j} = \sum_{i=1}^{n} V_{i} x_{ij} $$
Where $V_{i}$ is the market value of stock i. In the following implementation $V_{i}$ corresponds to the mean of market value taken by stock i.
The fraction of the index fund to be invested in stock j is proportional to the stocks weight $w_{j}$
$$ \frac{w_{j}}{\sum_{f=1}^{q} w_{f}} $$
def getMarketValue(data, n):
'''
n : number of assets in the fund
returns the market value of each assets in the fund
'''
v = []
for i in range(n):
v.append(data[data.columns[i]][:].mean())
return v
def getSorted(var):
'''
sort the variable linked to the problem through their name (x1 x2 x3 ...)
'''
x_ = []
x_value = []
for v in var:
if 'x' in v.name:
x_.append(int(re.search(r'\d+', v.name).group()))
x_value.append(v.varValue)
df_x = pd.DataFrame({'x_': x_, 'x_value': x_value})
df_x = df_x.sort_values(by = 'x_')
df_x = df_x.reset_index(drop=True)
return df_x
def getTotalMarketValue(data, n, q, df):
'''
returns the weight wj calculated for each assets selected in the fund
'''
prob = solver(n=n, q=q, df=df)
prob.solve()
w=[]
vi = getMarketValue(data, n)
df_x = getSorted(prob.variables())
for j in range(n):
ie = 0
for i in range(n):
ie += vi[i] * df_x['x_value'][i*n+j]
w.append(ie)
return(w)
def getWeight(data, n, df):
'''
for each q possible the function returns the weight
normalised for each assets selected in the fund
'''
listWeight = []
for j in range(1,n-1):
w = []
w = getTotalMarketValue(data=data, n=n, q=j, df=df)
wf = sum(w)
for i in range(len(w)):
w[i]=w[i]/wf
listWeight.append(w)
df_weight = pd.DataFrame(listWeight, columns=data.columns)
return df_weight
n=5
l = 200
data = pd.read_csv('H:/Desktop/Dauphine/Optimisation en Finance/CAC40.csv')
data = data.set_index('Date')
data = data[data.columns[0:n]][0:l]
data_return = getData(n,l)
data_cov = data_return.cov()
data_corr = data_return.corr()
df = data_corr
df_weight = getWeight(data, n, df)
print(df_weight.head())
This test shows the weight of selected assets for a sub-dataset and different value of q.
Indeed for q = 1 (first line), the capital is logically invested in one asset.
For q = 2, the capital is invested in two assets and the sum of the weight is eqals to 1 and so on ..
Considering the S&P 500 index, let use the previous ILP model to construct an index fund.
As shown previously the correlation matrix is more efficient than the covariance one, therefore we will use the correlation matrix as a proxy of similarity between assets.
In addition, as the choice of q is very subjective for the moment, we propose a table with weight of each index selected in function of the value of q.
It is still interesting to remind that the higher the value of q is, the higher the objective function value will be. Moreover, it will be the same scenario for the number of days used to built the index fund.
def getDataSP500(n,l):
'''
n : number of assets in the fund
l : number of days considered
returns a data set of dimension (l,n)
'''
data = pd.read_csv('H:/Desktop/Dauphine/Optimisation en Finance/SP500.csv')
data['date'] = pd.to_datetime(data['date'])
data.set_index('date', inplace=True)
data = data.pivot(columns='Name', values='close')
columns_na = data.isna().any()
sum(columns_na)
data = data.dropna(1)
data = data[data.columns[:n]][0:l]
data_return = data.pct_change()
data_return.fillna(0, inplace=True)
return data_return
n = 20
data_return = getDataSP500(n,100)
data_corr = data_return.corr()
df = data_corr
q = 10
prob = solver(n=n, q=q, df=df)
prob.solve()
print("\n status : ", LpStatus[prob.solve()], ", q = ", q)
It is possible to use the functions previously implemented to get the assets in which we should invest and th repartition of the weights.
plt.style.use('seaborn')
from time import time
data_return = getDataSP500(40,1000)
data_cov = data_return.cov()
data_corr = data_return.corr()
df = data_corr
def GetTimeNbreAssets(data, q, df):
timee = []
nbrAsset = []
for i in range(q+1, data.shape[1]):
dataExp = data[data.columns[0:i]][:]
df = dataExp.corr()
start_time = time()
prob = solver(n=dataExp.shape[1], q=q, df=df)
timee.append(time() - start_time)
nbrAsset.append(i)
return timee, nbrAsset
timee, nbrAsset = GetTimeNbreAssets(data_return, 5, df)
port = pd.DataFrame({'timee': timee, 'nbrAsset':nbrAsset })
plt.scatter(port['nbrAsset'], port['timee'], alpha=0.9, s=10, label='')
plt.plot(port['nbrAsset'], port['timee'], color='blue')
plt.ylabel('time in seconds')
plt.xlabel('Number of stocks considered ')
plt.title('Time of solver execution for different number of stocks used and q=5')
plt.plot
Unfortunately, our computer is not powerful enough to execute the solver with high quantity of data.
However, it seems that the relation between the time of execution and the number of stocks can be represented by an exponential law or a maybe a cubic function.
Once a predictive model has been set, it should be easy to Determine the maximum of instances dimension which can be solved to proven optimality within an acceptable computing time of 10 minutes.
plt.style.use('seaborn')
from time import time
data_return = getDataSP500(40,1000)
data_cov = data_return.cov()
data_corr = data_return.corr()
df = data_corr
def GetTimeq(data, df):
timee = []
q = []
df = data.corr()
for i in range(1, data.shape[1]):
start_time = time()
prob = solver(n=data.shape[1], q=i, df=df)
timee.append(time() - start_time)
q.append(i)
return timee, q
timee, q = GetTimeq(data_return, df)
port = pd.DataFrame({'timee': timee, 'q': q})
plt.scatter(port['q'], port['timee'], color='blue', alpha=0.9, s=10, label='')
plt.plot(port['q'], port['timee'], color='blue')
plt.ylabel('time in seconds')
plt.xlabel('Number of assets selected (q)')
plt.title('Time of solver execution for different value of q')
plt.plot
It would be interesting to add constraints in order to overcome the issue evoked previously concerning the transactions costs. Indeed the realocation of the capital in order to adjust the current portfolio can be very expensive if the existing portfolio is different from the targeted one.
In addition, it might be pertinent to add minimum and maximum transaction levels while adding the weight of assets directly in the model
Here is a model incorporating those ideas :
\begin{array}{lll} max \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{ij} x_{ij} \\ \\ \sum_{j=1}^{n} x_{ij} = 1 \hspace{3cm} i = 1,...,n \\ \\ x_{ij} \le y_{j} \hspace{1cm} i = 1,...,n \hspace{0.2cm} j= 1,...,n \\ \\ \sum_{j=1}^{n} y_{j} = q \\ \\ l_{j} y_{j} \le \frac{ \sum_{i=1}^{n} V_{i} x_{ij}}{ \sum_{i=1}^{n} V_{i}} \le u_{j} y_{j} \hspace{0.5cm} j= 1,...,n \\ \\ w_{j} = \frac{ \sum_{i=1}^{n} V_{i} x_{ij}}{ \sum_{i=1}^{n} V_{i}} \hspace{0.5cm} j= 1,...,n \\ \\ \sum_{j=1}^{n} |w_{j}^0 - w_{j}| \alpha \le \lambda \\ \\ x_{ij}, y_{j} \in \left\{ 0,1 \right\} \end{array}
where :
$V_{i}$ is the market value of stock i at current time
$w_{j}$ denotes the proportion of wealth invested in stock j (set to 0 if asset j is not selected )
$w_{j}^0$ corresponds to the proportion of stock j in current portfolio
$|w_{j}^0 - w_{j}|$ denotes the turnover of stock j from buying or selling.
To overcome the lack of diversification within the portfolio constructed this way, we can classify the companies of the CAC40 by sector and then rebalance the portfolio at minimum cost.
Indeed, we might introduce different sectors of activity which best represents the companies of the CAC40. Such sectors could be Insurance, bank, Chemistry, real-estate or others.
$\cdot $ Let us assume that there are m such characteristics that we would like our index fund to track as well as possible
$\cdot $ The $f_{i}$ correspond to the different sectors evoked previously
$\cdot $ Let $a_{ij} = 1$ if company j belongs to sector i and 0 if it does not
$\cdot $ Let $x_{j} $ denote the optimum weight of asset j in the portfolio
$\cdot $ Assume that initially, the portfolio has weights $x_{j}^{0}$
$\cdot $ Let $ y_{j}$ denote the fraction of asset j bought and $z_{j} $ the fraction sold
The linear programming model corresponding to this approach is the following one :
$$min \sum_{i=1}^{n} (y_{j} + z_{j}) \hspace{5cm}$$
$$\sum_{j=1}^{n} a_{ij} x_{j} = f_{i} \hspace{3cm} i= 1,...,m$$
$$\sum_{j=1}^{n} x_{j} = 1 \hspace{5cm}$$
$$ x_{j} -x_{j}^{0} \le y_{j} \hspace{3cm} j= 1,...,n $$
$$ x_{j}^{0} -x_{j} \le z_{j} \hspace{3cm} j= 1,...,n $$
$$ x_{j} \ge 0 \hspace{3cm} j= 1,...,n $$
$$ y_{j} \ge 0 \hspace{3cm} j= 1,...,n $$
$$ z_{j} \ge 0 \hspace{3cm} j= 1,...,n $$
The use of the lagrangian relaxation would be interesting in order to reduce the time needed to execute the solver.