Louis JEANPIERRE
MIAGE IF

Optimization in Finance

Part 1 : Investigating the performance of the model

The main goal of this work is to understand and work through the different stages of building an index fund.

Index Fund - Definition :

Given a target population of n stocks, select q stocks and their weights in the index fund (a portfolio), to represent the target population as closely as possible.

First approach

The idea here is to consider the following ILP model, which aggregates a broad market index of n securities into a representative index fund:

\begin{array}{lll} max \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{ij} x_{ij} \hspace{7cm} (1) \\ \\ \sum_{j=1}^{n} y_{j} = q \hspace{8cm} (2) \\ \\ \sum_{j=1}^{n} x_{ij} = 1 \hspace{1cm} i = 1,...,n \hspace{5cm} (3) \\ \\ x_{ij} \le y_{j} \hspace{1.5cm} i= 1,...,n \hspace{0.5cm} j= 1,...,n \hspace{3cm} (4)\\ \\ x_{ij}, y_{j} \in \left\{ 0,1 \right\} \hspace{0.5cm} i= 1,...,n \hspace{0.5cm} j= 1,...,n \hspace{3cm} (5) \end{array}

Where the decision variables represent :

$ y_{j} = \left\{ \begin{array}{ll} 1 \; \; if \; stock \; j \; is \; in \; the \; index\; fund \\ 0\; \; otherwise \end{array} \right. $

$ x_{ij} = \left\{ \begin{array}{ll} 1 \; \; if \; j \; is \; the \; most \; similar \; stock \; to\; stock \; i \; in \; the \; index \; fund \\ 0\; \; otherwise \end{array} \right. $

q represents the number of securities in the index fund and $\alpha_{ij} $ the similarity between stock i and stock j.

We can use the covariance matrix or the correlation matrix as a measure of similarity between the fund assets.
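As a minimal sketch (with hypothetical return values, not the actual data used below), either similarity matrix can be obtained directly from a returns DataFrame with pandas:

In [ ]:
import pandas as pd

# Hypothetical daily returns: rows = days, columns = stocks
returns = pd.DataFrame({
    'AC':  [0.010, -0.020, 0.005],
    'ACA': [0.020, -0.010, 0.000],
    'AI':  [0.015, -0.015, 0.002],
})

alpha_corr = returns.corr()  # correlation matrix as similarity measure
alpha_cov = returns.cov()    # covariance matrix as similarity measure
print(alpha_corr)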

Explanation of the model

The objective function (1) maximizes the total similarity between each security and its representative in the index fund.

The cardinality constraint (2) enforces that the tracking portfolio will have exactly q securities.

Constraint (3) ensures that each security has exactly one representative in the portfolio.

Constraint (4) prevents a security from representing any other security unless it is selected to be part of the tracking portfolio.

Constraint (5) stipulates that the x and y variables are binary.

Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pulp import *
import seaborn as sns
import re
import time
plt.style.use('seaborn')

Get the data

In [ ]:
def getData(n, l):
    '''
    n : number of assets in the fund
    l : number of days considered 
    returns a DataFrame of daily returns of dimension (l, n)
    '''
    
    data=pd.read_csv('H:/Desktop/Dauphine/Optimisation en Finance/CAC40.csv')
    data = data.set_index('Date')
    data = data[data.columns[0:n]][0:l]

    data_return = data.pct_change()
    data_return.fillna(0, inplace=True)
    return data_return 

print(getData(4,100).head())

Setting up the solver for this ILP model

In [3]:
def solver(n, q, df):
    '''
    The solver function takes the number of stocks within the fund (n), 
    the number of securities we should select to represent the index fund as closely as possible (q),
    and the measure of similarity between the stocks which is the correlation or covariance matrix (df)
    It returns the ILP problem previously expressed 
    '''
    
    df = np.ravel(df) # transforms the matrix into a vector 
    
    # Define the decision variables
    x_name=[]
    for i in range(n*n):
        x_name.append("x{0}".format(i+1))
    x = [LpVariable(x_name[i], 0 , 1, LpBinary) for i in range(n*n) ]

    y_name=[]
    for i in range(n):
        y_name.append("y{0}".format(i+1))
    y = [LpVariable(y_name[i], 0 , 1, LpBinary) for i in range(n) ]

    # defines the problem
    prob = LpProblem("problem", LpMaximize)

    # Objective function
    prob += sum(df * x)

    # Constraints
    prob += LpConstraint(sum(y),sense=0,rhs = q) # y1 + y2 + y3 + ... = q

    for i in range(n):
        prob += LpConstraint(sum(x[(i*n):((i+1)*n)]),sense=0,rhs = 1) # sum(x_ij) = 1

    for j in range(n):
        for i in range(n):
            prob += LpConstraint(x[i*n+j]-y[j],sense=-1,rhs = 0) # x_ij <= y_j

    return prob
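As a side note (not used in the rest of the notebook), the same model can also be written more compactly with PuLP's dictionary-indexed variables and lpSum, which maps more directly onto the mathematical formulation. The sketch below assumes alpha is an n x n numpy array such as df.values:

In [ ]:
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

def solver_dict(n, q, alpha):
    '''
    Equivalent formulation of the ILP with doubly indexed variables.
    alpha : n x n similarity matrix as a numpy array (e.g. df.values)
    '''
    idx = range(n)
    x = LpVariable.dicts("x", (idx, idx), cat=LpBinary)  # x[i][j]
    y = LpVariable.dicts("y", idx, cat=LpBinary)         # y[j]

    prob = LpProblem("index_fund", LpMaximize)

    # (1) maximize the total similarity between each stock and its representative
    prob += lpSum(alpha[i][j] * x[i][j] for i in idx for j in idx)
    # (2) exactly q stocks in the index fund
    prob += lpSum(y[j] for j in idx) == q
    # (3) each stock has exactly one representative
    for i in idx:
        prob += lpSum(x[i][j] for j in idx) == 1
    # (4) a stock can only be a representative if it is selected
    for i in idx:
        for j in idx:
            prob += x[i][j] <= y[j]

    return prob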

Testing the solver

Let's solve the problem for q ranging from 1 to 3 and a fund of 4 assets, using both the covariance and the correlation matrix as the measure of similarity.

In [11]:
data_return = getData(4,100)

data_cov = data_return.cov()
data_corr = data_return.corr()

indexFund = data_return.mean(axis = 1)
data_return['indexFund'] = indexFund
print(data_return.describe()[1:3])

# If we use the covariance matrix
df = data_cov
print("\n",data_return.cov())

for q in range(1,4):
    prob = solver(n=4, q=q, df=df)
    prob.solve()
    print("\n status : ", LpStatus[prob.solve()], ", q = ", q)
    for v in prob.variables():
        if 'y' in v.name:
            print(v.name, "=", v.varValue)

# If we use the correlation matrix
df = data_corr
print("\n",data_return.corr())

for q in range(1,4):
    prob = solver(n=4, q=q, df=df)
    prob.solve()
    print("\n status : ", LpStatus[prob.solve()], ", q = ", q)
    for v in prob.variables():
        if 'y' in v.name:
            print(v.name, "=", v.varValue)         
            AC       ACA        AI       AIR  indexFund
mean  0.002045 -0.000558  0.001720  0.001209   0.001104
std   0.011626  0.011321  0.010233  0.014344   0.006650

                  AC       ACA        AI       AIR  indexFund
AC         0.000135  0.000015  0.000044 -0.000023   0.000043
ACA        0.000015  0.000128  0.000027  0.000007   0.000045
AI         0.000044  0.000027  0.000105 -0.000004   0.000043
AIR       -0.000023  0.000007 -0.000004  0.000206   0.000047
indexFund  0.000043  0.000045  0.000043  0.000047   0.000044

 status :  Optimal , q =  1
y1 = 0.0
y2 = 0.0
y3 = 0.0
y4 = 1.0

 status :  Optimal , q =  2
y1 = 1.0
y2 = 0.0
y3 = 0.0
y4 = 1.0

 status :  Optimal , q =  3
y1 = 1.0
y2 = 1.0
y3 = 0.0
y4 = 1.0

                  AC       ACA        AI       AIR  indexFund
AC         1.000000  0.116151  0.366061 -0.138636   0.552571
ACA        0.116151  1.000000  0.237129  0.044708   0.591687
AI         0.366061  0.237129  1.000000 -0.024226   0.632537
AIR       -0.138636  0.044708 -0.024226  1.000000   0.488353
indexFund  0.552571  0.591687  0.632537  0.488353   1.000000

 status :  Optimal , q =  1
y1 = 0.0
y2 = 0.0
y3 = 1.0
y4 = 0.0

 status :  Optimal , q =  2
y1 = 0.0
y2 = 0.0
y3 = 1.0
y4 = 1.0

 status :  Optimal , q =  3
y1 = 1.0
y2 = 1.0
y3 = 0.0
y4 = 1.0

In each case it seems that the assets selected are those whose performance is the closest to the performance of the whole fund.

Let's look at the performance of the portfolio built in this way

We use 1000 days and q = 10 assets to select the stocks of our index fund, and we assume that we invest equally in the 10 assets selected.

We then observe the expected return of the fund versus the expected return of our index fund over a period of 100 days, to see whether our approach is relevant or not.

In [14]:
data_return = getData(30,2000)

def GetCompareExpectedReturns(data, q, nbDaysUsed, nbDaysObserved):
    nbDays = [] # list to store the indices of the observed days
    ERIndexFund = [] # list to store the expected returns of the index fund 
    ERFund = [] # list to store the expected returns of the fund
    dataExp = data[data.columns[:]][0:nbDaysUsed]
    df = dataExp.corr()
    prob = solver(n=dataExp.shape[1], q=q, df=df)
    prob.solve()
    for j in range(nbDaysUsed, nbDaysUsed+nbDaysObserved-1):
        a = [] # list to store the returns of the selected assets
        for v in prob.variables():
            if 'y' in v.name and v.varValue==1:
                ide = int(re.search(r'\d+', v.name).group()) - 1
                a.append(data[data.columns[ide]][j:j+1].mean())
        nbDays.append(j) 
        ERIndexFund.append(  sum(a) / float(len(a)))     
        ERFund.append(data[data.columns[:]][j:j+1].mean().mean())
    return ERIndexFund, ERFund, nbDays 
    
ERIndexFund, ERFund, nbDays = GetCompareExpectedReturns(data=data_return, q=10, nbDaysUsed=1000, nbDaysObserved=100)
port = pd.DataFrame({'nbDays': nbDays, 'ERFund': ERFund, 'ERIndexFund': ERIndexFund})

# Plot of the comparison between the behaviour of the fund and our index fund 
plt.scatter(port['nbDays'], port['ERFund'], color='blue', alpha=0.9, s=10, label='')
plt.scatter(port['nbDays'], port['ERIndexFund'], color='red', alpha=0.9, s=10, label='')
plt.plot(port['nbDays'], port['ERFund'], color='blue', label='Fund')
plt.plot(port['nbDays'], port['ERIndexFund'], color='red', label='Index Fund')

plt.xlabel('Days observed')
plt.ylabel('Expected Returns')
plt.title('Expected Returns of the fund versus the index fund according to different periods of time considered')
plt.legend(loc='upper right')
plt.show()

The behaviour of our index fund seems to replicate the fluctuations of the fund quite well.

However, this test alone is not enough to draw conclusions about the performance of this approach.

Performance of our index fund according to different periods of time

Let's compare the performance of the portfolio constructed in period t (based on historical data up to period t) with the market performance by observing the portfolio performance in period t + 1.

To do that, we construct an index fund for different periods of time and look at the difference between the expected return of the fund and the expected return of our index fund the day after the period considered.

In [15]:
data_return = getData(30,1000)

def GetExpectedReturns(data, q):
    nbDays = []
    ERIndexFund = []
    ERFund = []
    for j in range(7, data.shape[0]):
        a = []
        dataExp = data[data.columns[:]][0:j]
        df = dataExp.corr()
        prob = solver(n=dataExp.shape[1], q=q, df=df)
        prob.solve()
        for v in prob.variables():
            if 'y' in v.name and v.varValue==1:
                ide = int(re.search(r'\d+', v.name).group()) - 1
                a.append(dataExp[dataExp.columns[ide]][0:dataExp.shape[0]].mean())
        nbDays.append(j)        
        ERIndexFund.append(  sum(a) / float(len(a)))     
        ERFund.append(dataExp.mean().mean())
    return ERIndexFund, ERFund, nbDays 
    
ERIndexFund, ERFund, nbDays = GetExpectedReturns(data=data_return, q=10)
port = pd.DataFrame({'nbDays': nbDays, 'ERFund': ERFund, 'ERIndexFund': ERIndexFund})

plt.scatter(port['nbDays'], port['ERFund'], color='blue', alpha=0.9, s=10, label='Fund')
plt.scatter(port['nbDays'], port['ERIndexFund'], color='red', alpha=0.9, s=10, label='Index Fund')

plt.xlabel('Number of days considered')
plt.ylabel('Expected Returns')
plt.title('Expected Returns of the fund versus the index fund according to different periods of time considered')
plt.legend(loc='upper left')
plt.show()

The difference between the expected return of the fund and the expected return of our index fund decreases as the period considered increases.

Comparison between the covariance and correlation matrix

In [24]:
data_return = getData(30,100)

data_cov = data_return.cov()
data_corr = data_return.corr()

def Getqperf(data, df):
    lq = []
    perf = []
    range_q = data.shape[1]
    for q in range(1, range_q):
        prob = solver(n=data.shape[1], q=q, df=df)
        prob.solve()
        lq.append(q)
        perf.append(value(prob.objective))
    return lq, perf

df = data_cov
lq, perf_cov = Getqperf(data_return, df) 

df = data_corr
lq, perf_corr = Getqperf(data_return, df) 

port = pd.DataFrame({'q': lq, 'ObjFuncValue_cov': perf_cov, 'ObjFuncValue_corr': perf_corr})
print(port.head())

plt.scatter(port['q'], port['ObjFuncValue_cov'], color='blue', alpha=0.9, s=15, label='')
plt.scatter(port['q'], port['ObjFuncValue_corr'], color='red', alpha=0.9, s=15, label='')
plt.plot(port['q'], port['ObjFuncValue_cov'], color='blue', label='Covariance')
plt.plot(port['q'], port['ObjFuncValue_corr'], color='red', label='Correlation')
plt.xlabel('Number of assets selected')
plt.ylabel('Objective function value')
plt.title('Performance according to the number of assets selected')
plt.legend(loc='upper left')
plt.show()
   q  ObjFuncValue_cov  ObjFuncValue_corr
0  1          0.096052          10.947693
1  2          0.099464          12.689154
2  3          0.099767          13.750641
3  4          0.100012          14.745031
4  5          0.100220          15.669743

According to the graph above, the objective function increases with q, so the value of q which maximizes it is simply the largest one; moreover, the correlation matrix appears to yield much higher objective values than the covariance matrix as a measure of similarity between assets.

Therefore, if we wanted to choose a good portfolio by analysing only the trade-off between q and the objective value, we would choose the highest q possible, which would lead us to a full replication strategy. In other words, to achieve the same performance as the index, the tracking portfolio would be built by purchasing all the assets available in the fund in the same fixed quantities. Otherwise we could select q assets arbitrarily.

Both options are suboptimal and there are a lot of drawbacks associated with them :

$\bullet$ There are no constraints on the minimum or maximum amount of capital invested in each asset

$\bullet$ Transaction costs are not taken into account

$\bullet$ The portfolio constructed in this manner can track the benchmark well in terms of return, but it may be insufficiently diversified as there are no constraints limiting portfolio risk

Calculation of the weight associated with each selected asset

$ w_{j}$ is the total market value of the stocks represented by stock j in the fund

$$ w_{j} = \sum_{i=1}^{n} V_{i} x_{ij} $$

Where $V_{i}$ is the market value of stock i. In the following implementation, $V_{i}$ corresponds to the mean market value of stock i over the period considered.

The fraction of the index fund to be invested in stock j is proportional to the stock's weight $w_{j}$

$$ \frac{w_{j}}{\sum_{f=1}^{q} w_{f}} $$
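For instance (with made-up figures), if q = 2 and the two selected stocks carry represented market values $w_{1} = 3$ and $w_{2} = 1$, the index fund invests

$$ \frac{w_{1}}{w_{1}+w_{2}} = \frac{3}{4} = 0.75 \hspace{1cm} and \hspace{1cm} \frac{w_{2}}{w_{1}+w_{2}} = \frac{1}{4} = 0.25 $$

of its capital in these two stocks respectively.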

In [4]:
def getMarketValue(data, n):
    '''
    n : number of assets in the fund
    returns the (average) market value of each asset in the fund 
    '''
    v = []
    for i in range(n):
        v.append(data[data.columns[i]][:].mean())
    return v 

def getSorted(var):
    '''
    sorts the x variables of the problem by the number in their name (x1, x2, x3, ...)
    '''
    x_ = []
    x_value = []
    for v in var:
        if 'x' in v.name:
            x_.append(int(re.search(r'\d+', v.name).group()))
            x_value.append(v.varValue)
    df_x = pd.DataFrame({'x_': x_, 'x_value': x_value})   
    df_x = df_x.sort_values(by = 'x_')
    df_x = df_x.reset_index(drop=True)
    
    return df_x


def getTotalMarketValue(data, n, q, df):
    '''
    returns the weight wj calculated for each asset of the fund 
    '''
    prob = solver(n=n, q=q, df=df)
    prob.solve()
    w=[]
    vi = getMarketValue(data, n)
    df_x = getSorted(prob.variables())
    
    for j in range(n):
        ie = 0
        for i in range(n):
            ie += vi[i] * df_x['x_value'][i*n+j] 
        w.append(ie)

    return(w)

def getWeight(data, n, df):
    '''
    for each possible q, the function returns the normalised weight 
    of each asset selected in the fund 
    '''
    listWeight = []
    
    for j in range(1,n-1):
        w = []
        w = getTotalMarketValue(data=data, n=n, q=j, df=df)
        wf = sum(w)

        for i in range(len(w)):
            w[i]=w[i]/wf
        listWeight.append(w)
        
    df_weight = pd.DataFrame(listWeight, columns=data.columns)
    
    return df_weight


n=5
l = 200

data = pd.read_csv('H:/Desktop/Dauphine/Optimisation en Finance/CAC40.csv')
data = data.set_index('Date')
data = data[data.columns[0:n]][0:l]

data_return = getData(n,l)
data_cov = data_return.cov()
data_corr = data_return.corr()
df = data_corr

df_weight = getWeight(data, n, df)
print(df_weight.head())
    AC  ACA        AI       AIR        BN
0  0.0  0.0  1.000000  0.000000  0.000000
1  0.0  0.0  0.795118  0.204882  0.000000
2  0.0  0.0  0.541142  0.204882  0.253975

This test shows the weights of the selected assets for a sub-dataset and different values of q.

Indeed, for q = 1 (first line), the capital is logically invested in a single asset.
For q = 2, the capital is invested in two assets and the weights sum to 1, and so on.

Part 2 : Business game

Considering the S&P 500 index, let us use the previous ILP model to construct an index fund.

As shown previously, the correlation matrix yields higher objective values than the covariance matrix, so we will use the correlation matrix as a proxy for the similarity between assets.

In addition, as the choice of q is still rather subjective, we propose a table with the weight of each selected asset as a function of the value of q.

It is worth recalling that the higher the value of q, the higher the objective function value will be. The same holds for the number of days used to build the index fund.

In [7]:
def getDataSP500(n,l):
    '''
    n : number of assets in the fund
    l : number of days considered 
    returns a DataFrame of daily returns of dimension (l, n) 
    '''
    
    data = pd.read_csv('H:/Desktop/Dauphine/Optimisation en Finance/SP500.csv')
    data['date'] = pd.to_datetime(data['date'])
    data.set_index('date', inplace=True)
    data = data.pivot(columns='Name', values='close')
    data = data.dropna(axis=1)  # drop tickers with missing prices
    data = data[data.columns[:n]][0:l]

    data_return = data.pct_change()
    data_return.fillna(0, inplace=True)
    return data_return 

n = 20

data_return = getDataSP500(n,100)
data_corr = data_return.corr()
df = data_corr

q = 10

prob = solver(n=n, q=q, df=df)
prob.solve()

print("\n status : ", LpStatus[prob.solve()], ", q = ", q)
 status :  Optimal , q =  10

It is possible to use the functions previously implemented to get the assets in which we should invest and the corresponding allocation of the weights.
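As an illustrative sketch (hypothetical code, assuming we rebuild the pivoted close-price DataFrame from SP500.csv in the same way as in getDataSP500, but without the pct_change step), the weight table can be obtained with the functions of Part 1:

In [ ]:
# Hypothetical sketch : close prices of the first n S&P 500 tickers
sp_prices = pd.read_csv('H:/Desktop/Dauphine/Optimisation en Finance/SP500.csv')
sp_prices['date'] = pd.to_datetime(sp_prices['date'])
sp_prices.set_index('date', inplace=True)
sp_prices = sp_prices.pivot(columns='Name', values='close').dropna(axis=1)
sp_prices = sp_prices[sp_prices.columns[:n]][0:100]

# weights of the selected assets for every possible q, as in Part 1
df_weight_sp = getWeight(sp_prices, n, data_corr)
print(df_weight_sp.head())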

Analysis of the variation in the computational time according to the total number of stocks

In [8]:
plt.style.use('seaborn')
from time import time

data_return = getDataSP500(40,1000)
data_cov = data_return.cov()
data_corr = data_return.corr()
df = data_corr

def GetTimeNbreAssets(data, q, df):
    timee = []
    nbrAsset = []
    for i in range(q+1, data.shape[1]):
        dataExp = data[data.columns[0:i]][:]
        df = dataExp.corr()
        start_time = time()
        prob = solver(n=dataExp.shape[1], q=q, df=df)
        prob.solve()  # include the actual solve in the measured time
        timee.append(time() - start_time)
        nbrAsset.append(i)
    return timee, nbrAsset

timee, nbrAsset = GetTimeNbreAssets(data_return, 5, df)
port = pd.DataFrame({'timee': timee, 'nbrAsset':nbrAsset })
 
plt.scatter(port['nbrAsset'], port['timee'], alpha=0.9, s=10, label='')
plt.plot(port['nbrAsset'], port['timee'], color='blue')

plt.ylabel('time in seconds')
plt.xlabel('Number of stocks considered ')
plt.title('Time of solver execution for different number of stocks used and q=5')
plt.show()

Unfortunately, our computer is not powerful enough to run the solver on a large quantity of data.

However, it seems that the relation between the execution time and the number of stocks could be represented by an exponential law or maybe a cubic function.

Once a predictive model has been fitted, it should be easy to determine the maximum instance dimension which can be solved to proven optimality within an acceptable computing time of 10 minutes.
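As a rough sketch of that idea (reusing the timee and nbrAsset lists produced above and assuming a cubic growth law), one could fit a polynomial with numpy and extrapolate it up to the 10-minute budget:

In [ ]:
import numpy as np

# fit t(n) ~ a*n^3 + b*n^2 + c*n + d on the measured timings
coeffs = np.polyfit(nbrAsset, timee, deg=3)
predict = np.poly1d(coeffs)

# extrapolate until the predicted time exceeds 10 minutes (600 seconds)
n_stocks = max(nbrAsset)
while predict(n_stocks) < 600 and n_stocks < 5000:
    n_stocks += 1
print("Estimated maximum number of stocks solvable within 10 minutes:", n_stocks)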

Analysis of the computational time according to different values of q

In [9]:
plt.style.use('seaborn')
from time import time

data_return = getDataSP500(40,1000)
data_cov = data_return.cov()
data_corr = data_return.corr()
df = data_corr

def GetTimeq(data, df):
    timee = []
    q = []
    df = data.corr()
    for i in range(1, data.shape[1]):
        start_time = time()
        prob = solver(n=data.shape[1], q=i, df=df)
        prob.solve()  # include the actual solve in the measured time
        timee.append(time() - start_time)
        q.append(i)
    return timee, q

timee, q = GetTimeq(data_return, df)
port = pd.DataFrame({'timee': timee, 'q': q})
 
plt.scatter(port['q'], port['timee'], color='blue', alpha=0.9, s=10, label='')
plt.plot(port['q'], port['timee'], color='blue')

plt.ylabel('time in seconds')
plt.xlabel('Number of assets selected (q)')
plt.title('Time of solver execution for different value of q')
plt.show()

Discussion of the results and possible improvements

Model with buy-in threshold and turnover constraints

It would be interesting to add constraints in order to overcome the issue raised previously concerning transaction costs. Indeed, the reallocation of capital needed to adjust the current portfolio can be very expensive if the existing portfolio is very different from the targeted one.

In addition, it might be pertinent to add minimum and maximum transaction levels while introducing the asset weights directly into the model.

Here is a model incorporating those ideas :

\begin{array}{lll} max \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{ij} x_{ij} \\ \\ \sum_{j=1}^{n} x_{ij} = 1 \hspace{3cm} i = 1,...,n \\ \\ x_{ij} \le y_{j} \hspace{1cm} i = 1,...,n \hspace{0.2cm} j= 1,...,n \\ \\ \sum_{j=1}^{n} y_{j} = q \\ \\ l_{j} y_{j} \le \frac{ \sum_{i=1}^{n} V_{i} x_{ij}}{ \sum_{i=1}^{n} V_{i}} \le u_{j} y_{j} \hspace{0.5cm} j= 1,...,n \\ \\ w_{j} = \frac{ \sum_{i=1}^{n} V_{i} x_{ij}}{ \sum_{i=1}^{n} V_{i}} \hspace{0.5cm} j= 1,...,n \\ \\ \sum_{j=1}^{n} |w_{j}^0 - w_{j}| \alpha \le \lambda \\ \\ x_{ij}, y_{j} \in \left\{ 0,1 \right\} \end{array}

where :

$V_{i}$ is the market value of stock i at current time

$w_{j}$ denotes the proportion of wealth invested in stock j (set to 0 if asset j is not selected )

$w_{j}^0$ corresponds to the proportion of stock j in current portfolio

$|w_{j}^0 - w_{j}|$ denotes the turnover of stock j from buying or selling.
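A minimal PuLP sketch of this extended model is given below. It is only a sketch under explicit assumptions: alpha is the n x n similarity matrix as a numpy array; V, w0, l and u are lists of market values, current weights and buy-in bounds; the turnover term $|w_{j}^0 - w_{j}|$ is linearised with auxiliary variables d_j; and the coefficient $\alpha$ of the turnover constraint is treated as a single scalar cost factor alpha_cost.

In [ ]:
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

def solver_extended(n, q, alpha, V, w0, l, u, alpha_cost, lam):
    '''
    Sketch of the model with buy-in thresholds and a turnover constraint.
    alpha           : n x n similarity matrix (numpy array)
    V               : market values V_i of the n stocks
    w0              : current portfolio weights w_j^0
    l, u            : lower / upper buy-in thresholds for each stock
    alpha_cost, lam : turnover cost coefficient and turnover budget
    '''
    idx = range(n)
    x = LpVariable.dicts("x", (idx, idx), cat=LpBinary)
    y = LpVariable.dicts("y", idx, cat=LpBinary)
    w = LpVariable.dicts("w", idx, lowBound=0)   # proportion invested in stock j
    d = LpVariable.dicts("d", idx, lowBound=0)   # d_j >= |w0_j - w_j|

    prob = LpProblem("index_fund_extended", LpMaximize)
    prob += lpSum(alpha[i][j] * x[i][j] for i in idx for j in idx)

    prob += lpSum(y[j] for j in idx) == q                     # cardinality
    for i in idx:
        prob += lpSum(x[i][j] for j in idx) == 1              # one representative per stock

    V_total = sum(V)
    for j in idx:
        for i in idx:
            prob += x[i][j] <= y[j]                           # selection constraint
        # weight definition and buy-in thresholds
        prob += w[j] == lpSum(V[i] * x[i][j] for i in idx) / V_total
        prob += w[j] >= l[j] * y[j]
        prob += w[j] <= u[j] * y[j]
        # linearisation of |w0_j - w_j|
        prob += d[j] >= w0[j] - w[j]
        prob += d[j] >= w[j] - w0[j]

    prob += alpha_cost * lpSum(d[j] for j in idx) <= lam      # turnover constraint

    return prob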

Diversification of the portfolio

To overcome the lack of diversification within the portfolio constructed this way, we can classify the companies of the CAC40 by sector and then rebalance the portfolio at minimum cost.

Indeed, we might introduce different sectors of activity which best represent the companies of the CAC40, such as insurance, banking, chemistry, real estate and others.

$\cdot $ Let us assume that there are m such characteristics that we would like our index fund to track as well as possible

$\cdot $ Let $f_{i}$ denote the target exposure of the index fund to sector i (one of the sectors evoked previously)

$\cdot $ Let $a_{ij} = 1$ if company j belongs to sector i and 0 if it does not

$\cdot $ Let $x_{j} $ denote the optimum weight of asset j in the portfolio

$\cdot $ Assume that initially, the portfolio has weights $x_{j}^{0}$

$\cdot $ Let $ y_{j}$ denote the fraction of asset j bought and $z_{j} $ the fraction sold

The linear programming model corresponding to this approach is the following one :

$$min \sum_{j=1}^{n} (y_{j} + z_{j}) \hspace{5cm}$$

$$\sum_{j=1}^{n} a_{ij} x_{j} = f_{i} \hspace{3cm} i= 1,...,m$$

$$\sum_{j=1}^{n} x_{j} = 1 \hspace{5cm}$$

$$ x_{j} -x_{j}^{0} \le y_{j} \hspace{3cm} j= 1,...,n $$

$$ x_{j}^{0} -x_{j} \le z_{j} \hspace{3cm} j= 1,...,n $$

$$ x_{j} \ge 0 \hspace{3cm} j= 1,...,n $$

$$ y_{j} \ge 0 \hspace{3cm} j= 1,...,n $$

$$ z_{j} \ge 0 \hspace{3cm} j= 1,...,n $$
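A minimal PuLP sketch of this rebalancing LP (assuming the sector membership matrix a, the target sector exposures f and the current weights x0 are given as Python lists) could look like this:

In [ ]:
from pulp import LpProblem, LpMinimize, LpVariable, lpSum

def rebalance_lp(a, f, x0):
    '''
    a  : m x n sector membership matrix (a[i][j] = 1 if company j belongs to sector i)
    f  : target exposure f_i of the index fund to each sector
    x0 : current portfolio weights x_j^0
    returns the LP that rebalances the portfolio at minimum turnover
    '''
    m, n = len(a), len(x0)
    J = range(n)
    x = LpVariable.dicts("x", J, lowBound=0)   # new weights
    y = LpVariable.dicts("y", J, lowBound=0)   # fractions bought
    z = LpVariable.dicts("z", J, lowBound=0)   # fractions sold

    prob = LpProblem("rebalancing", LpMinimize)
    prob += lpSum(y[j] + z[j] for j in J)                     # minimise total turnover

    for i in range(m):
        prob += lpSum(a[i][j] * x[j] for j in J) == f[i]      # sector exposures
    prob += lpSum(x[j] for j in J) == 1                       # fully invested

    for j in J:
        prob += x[j] - x0[j] <= y[j]
        prob += x0[j] - x[j] <= z[j]

    return prob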

Time of execution

The use of Lagrangian relaxation would be interesting in order to reduce the time needed to run the solver.
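As a sketch of what this could look like (not part of the notebook's implementation): relaxing the assignment constraints (3) with multipliers $u_{i}$ makes the remaining problem easy to solve column by column, which gives an upper bound that can then be tightened by a subgradient method. Assuming alpha is the n x n similarity matrix as a numpy array:

In [ ]:
import numpy as np

def lagrangian_bound(alpha, q, u):
    '''
    For fixed multipliers u, the relaxed problem
        max sum_ij (alpha_ij - u_i) x_ij + sum_i u_i
        s.t. sum_j y_j = q, x_ij <= y_j, x and y binary
    decomposes by column: each candidate j contributes the sum of its positive
    reduced similarities, and the q most valuable columns are selected.
    Returns the upper bound and the corresponding x (used for the subgradient).
    '''
    n = alpha.shape[0]
    reduced = alpha - u[:, None]                   # alpha_ij - u_i
    gains = np.where(reduced > 0, reduced, 0.0)    # keep only profitable assignments
    col_value = gains.sum(axis=0)                  # value of selecting column j
    selected = np.argsort(col_value)[-q:]          # q most valuable columns
    x = np.zeros((n, n))
    x[:, selected] = (reduced[:, selected] > 0).astype(float)
    return u.sum() + col_value[selected].sum(), x

def subgradient(alpha, q, n_iter=100, step=0.1):
    '''Simple subgradient loop that tightens the Lagrangian upper bound.'''
    n = alpha.shape[0]
    u = np.zeros(n)
    best = np.inf
    for _ in range(n_iter):
        bound, x = lagrangian_bound(alpha, q, u)
        best = min(best, bound)
        g = 1.0 - x.sum(axis=1)    # violation of the relaxed constraints (3)
        u = u - step * g           # move the multipliers to reduce the bound
    return best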