TIØ4317: Empirical and Quantitative Methods in Finance
This compendium was written for the 2025 version of the course.
Learning objective of course
There are three parts:
- Linear regression and diagnostic tests
- Autoregressive models
- Risk analysis and portfolio optimization
Statistical background
It is advantageous to be acquainted with concepts from statistics such as probability distributions, correlation/covariance, moments, estimator theory, confidence intervals, hypothesis testing, and to have some familiarity with linear regression.
These concepts are explained in the compendium for TMA4240: Statistics.
Linear regression
Classical linear regression model
These are the assumptions of a classical linear regression model (CLRM).
For n independent variables per individual i, the model is
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_n x_{ni} + u_i,$$
and the following assumptions hold:
- $E[u_i \mid X_i] = 0$ (zero conditional mean of the error term)
- $\mathrm{Var}(u_i \mid X_i) = \sigma^2$, i.e. homoscedasticity (constant variance of the error term)
- $\mathrm{Cov}(u_i, u_j) = 0$ for all $i \ne j$, i.e. no autocorrelation (error terms are uncorrelated across individuals)
- $u_i \sim N(0, \sigma^2)$ (the error term is normally distributed)
- No perfect collinearity (the independent variables are not perfectly correlated)
Assumption violation (important)
If any of the assumptions are violated, the coefficient estimates, their standard errors, and the distributions we use to evaluate the estimators may all be wrong. Therefore we often try to remedy violations with different techniques, such as non-linear variables (explained below), until the assumptions are satisfied.
Below is a list of assumption violations. In the sections following this one, some of the terms and procedures mentioned in the list are elaborated upon, and a short code sketch running a few of these diagnostics follows the list.
If $E[u] \ne 0$
- Detection: There will be model bias
- Effect: Residual test statistics (e.g. $R^2$) will be wrong because the residuals are affected by the bias
- Solution: Add an intercept $\beta_0$ to the regression model
If $\mathrm{Var}(u_i)$ is not constant (heteroscedasticity)
- Detection: GQ test, White's test, ARCH test
- Effect: OLS estimators will not be BLUE
- Solution: Find some $z_t$ such that $\mathrm{Var}(u_t) = \sigma^2 z_t^2$, then divide every variable in the regression model by $z_t$ (generalized/weighted least squares)
If there is autocorrelation ($\mathrm{Cov}(u_i, u_j) \ne 0$ for some $i \ne j$)
- Detection: Durbin-Watson test, Breusch-Godfrey, Box-Pierce, Ljung-Box
- Effect: OLS will not be BLUE and $R^2$ will be inflated
- Solution: Adding lags (see ARMA models), but this is debated and can be dangerous. The autocorrelation may also be caused by omitted variables.
If there is endogeneity ($E[u \mid X] \ne 0$)
- Detection: By checking the exogeneity assumption $E[u \mid X] = 0$. Endogeneity may be caused by omitted variables, measurement error and sample selection.
- Effect: The coefficients on all other variables will be biased and inconsistent if an important variable is omitted. Standard errors are also biased.
- Solution: Control for omitted variables, use an instrumental variable (see below), Heckman correction (see below), ...
A related misspecification is the inclusion of an irrelevant variable. In that case the coefficients remain unbiased, but they are inefficient.
If the error term is not normally distributed
- Detection: The Jarque-Bera (Bera-Jarque) test tests the skewness and kurtosis of the residual distribution for normality,
$$W = T\left[\frac{b_1^2}{6} + \frac{(b_2 - 3)^2}{24}\right] \sim \chi^2(2),$$
where $b_1$ is the skewness and $b_2$ the kurtosis of the residuals.
- Effect: Hypothesis testing becomes faulty, as the tests assume particular distributions.
- Solution: No clear solution
In case of multicollinearity
- Detection: Check for correlation between explanatory variables. If they are highly correlated, there is multicollinearity.
- Effect: $R^2$ will be high, and the standard errors of the coefficients will be high. The regression is too sensitive to modification (not stable), and therefore the confidence intervals of the coefficients will be too wide.
- Solution: Change the regression model to remove the collinearity. Another solution is PCA.
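As a complement to the registry above, here is a minimal sketch (with simulated, well-behaved data and made-up variable names) of how a few of these diagnostics can be run in Python with statsmodels.

```python
# Sketch: fit an OLS model and run a few of the diagnostics listed above.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white
from statsmodels.stats.stattools import durbin_watson, jarque_bera

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))  # intercept column included
res = sm.OLS(y, X).fit()

# Heteroscedasticity: White's test (LM stat, LM p-value, F stat, F p-value)
lm_stat, lm_pval, f_stat, f_pval = het_white(res.resid, X)
print("White test p-value :", lm_pval)

# Autocorrelation: Durbin-Watson (values near 2 mean no first-order autocorrelation)
print("Durbin-Watson      :", durbin_watson(res.resid))

# Normality of residuals: Jarque-Bera (chi-squared with 2 df under the null)
jb_stat, jb_pval, skew, kurt = jarque_bera(res.resid)
print("Jarque-Bera p-value:", jb_pval)
```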
Instrumental variables (IV)
If an independent variable X is correlated with the error term u, we try to split the variable into two parts: one part correlated with the error and one part uncorrelated with it.
The technique starts by identifying some other variable (the IV) which is correlated with the endogenous variable, but uncorrelated with the unobserved effects on the dependent variable (the error term). Then we use the two-stage least squares method (2SLS).
The first stage predicts the endogenous variable from the IV with a linear regression. This generates predictions for the endogenous variable in which the unobserved effects on the dependent variable have been removed, and we use these predictions in the final regression model instead of the endogenous variable itself. Mathematically it looks like this:
If we have a regression model
$$y_i = \beta_0 + \beta_1 x_i + u_i,$$
but $\mathrm{Cov}(x_i, u_i) \ne 0$, we find an instrument $z_i$ and run the first-stage regression
$$x_i = \pi_0 + \pi_1 z_i + v_i,$$
where all of the unobserved effects of $u_i$ end up in the first-stage error $v_i$ rather than in the fitted part. Now the fitted values $\hat{x}_i = \hat{\pi}_0 + \hat{\pi}_1 z_i$ replace $x_i$ in the second-stage regression,
$$y_i = \beta_0 + \beta_1 \hat{x}_i + u_i.$$
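A minimal sketch of the two stages on simulated data is given below. The naive OLS estimate is biased because the regressor is correlated with the error, while the two-stage estimate is consistent. Note that the standard errors from the manual second stage are not the proper 2SLS standard errors; dedicated IV routines correct for this.

```python
# Sketch: manual two-stage least squares on simulated data.
# x is endogenous (correlated with the error e); z is the instrument.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)
e = rng.normal(size=n)                        # unobserved error
x = 0.8 * z + 0.6 * e + rng.normal(size=n)    # endogenous regressor
y = 1.0 + 2.0 * x + e                         # true beta1 = 2

ols = sm.OLS(y, sm.add_constant(x)).fit()     # biased because Cov(x, e) != 0

# Stage 1: regress the endogenous variable on the instrument
stage1 = sm.OLS(x, sm.add_constant(z)).fit()
x_hat = stage1.fittedvalues

# Stage 2: regress y on the fitted values from stage 1
stage2 = sm.OLS(y, sm.add_constant(x_hat)).fit()

print("OLS estimate of beta1 :", ols.params[1])     # biased
print("2SLS estimate of beta1:", stage2.params[1])  # close to 2
# Caveat: stage2's printed standard errors are not the correct 2SLS standard errors.
```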
Sample selection and Heckman correction
Sample selection occurs when the observations are not drawn randomly, so that a specific subset of the population is missing from the sample. Heckman correction attempts to remedy this. It is done in two steps:
The first step consists of estimating a selection model that predicts the probability of an observation being included in the sample. This is a limited dependent variable regression (typically a probit).
The second step estimates the main model with a correction variable, the inverse Mills ratio (IMR), computed from the first-step model. The purpose of the IMR is to correct for the selection.
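Below is a minimal sketch of the two steps on simulated data: a probit selection equation, the inverse Mills ratio built from its fitted index, and an outcome regression on the selected sample with the IMR as an extra regressor. The variable names are made up, and the second-step standard errors would need further correction in practice.

```python
# Sketch: Heckman two-step correction on simulated data.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)                              # regressor in the outcome equation
z = rng.normal(size=n)                              # exclusion restriction: drives selection only
errs = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=n)
eps_sel, eps_out = errs[:, 0], errs[:, 1]           # correlated errors -> selection bias

selected = (0.5 + 1.0 * z - 1.0 * x + eps_sel) > 0  # selection rule
y = 1.0 + 2.0 * x + eps_out                         # outcome, observed only if selected

# Step 1: probit selection model on the full sample
Zsel = sm.add_constant(np.column_stack([z, x]))
probit = sm.Probit(selected.astype(float), Zsel).fit(disp=0)
index = Zsel @ probit.params
imr = norm.pdf(index) / norm.cdf(index)             # inverse Mills ratio

# Step 2: OLS on the selected sample with the IMR as a correction term
naive = sm.OLS(y[selected], sm.add_constant(x[selected])).fit()
X2 = sm.add_constant(np.column_stack([x[selected], imr[selected]]))
heck = sm.OLS(y[selected], X2).fit()

print("naive OLS slope   :", naive.params[1])       # biased
print("Heckman-corrected :", heck.params[1])        # closer to the true value 2
```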
Difference-in-Difference (DiD)
Difference-in-Differences tries to estimate a causal relationship by comparing a group with a property (the experiment/treatment group) to a group without it (the control group), before and after the property takes effect.
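A common way to implement this is a regression with a group dummy, a post-period dummy and their interaction; the interaction coefficient is the DiD estimate. The sketch below uses simulated data and made-up names.

```python
# Sketch: difference-in-differences as a regression with an interaction term.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 4000
treated = rng.integers(0, 2, size=n)          # 1 = experiment group, 0 = control
post = rng.integers(0, 2, size=n)             # 1 = after the event/policy
effect = 1.5                                  # true causal effect

y = (2.0 + 0.5 * treated + 0.8 * post
     + effect * treated * post + rng.normal(size=n))

X = sm.add_constant(np.column_stack([treated, post, treated * post]))
res = sm.OLS(y, X).fit()
print("DiD estimate (interaction coefficient):", res.params[3])  # close to 1.5
```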
PCA (Principal component analysis)
Principal component analysis in the context of regression takes the explanatory variables as vectors and finds a new basis of orthogonal vectors. These new vectors can be used as explanatory variables, and they are uncorrelated with each other. They can be ranked by how much of the variance in the explanatory variables they explain, so we can pick the k most "important" components and use them in the regression instead of using them all.
Mathematically, PCA is done by finding the eigenvectors of the covariance (or correlation) matrix of the explanatory variables; the corresponding eigenvalues give the variance explained by each component.
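A minimal numpy sketch of this eigendecomposition, on simulated collinear regressors, might look as follows (centering only; in practice the variables are often standardized first).

```python
# Sketch: PCA of the explanatory variables via the covariance matrix.
import numpy as np

rng = np.random.default_rng(4)
n = 300
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)      # strongly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

Xc = X - X.mean(axis=0)                       # center (optionally also scale)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: symmetric matrix

order = np.argsort(eigvals)[::-1]             # sort by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print("share of variance explained:", eigvals / eigvals.sum())

k = 2
scores = Xc @ eigvecs[:, :k]                  # the k principal components, which can
                                              # replace the original regressors
```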
OLS estimator
Ordinary least squares (OLS) is an estimation method for the linear regression model. The method works by minimizing the sum of squared residuals, $\sum_i \hat{u}_i^2 = \hat{u}^T\hat{u}$.
We often write the LR model in its matrix form when discussing OLS,
$$Y = X\beta + u,$$
where Y and u are n-dimensional vectors, X is an $n \times k$ matrix of regressors (including a column of ones for the intercept), and $\beta$ is a k-dimensional vector of coefficients.
We define the OLS estimator by
$$\hat{\beta} = (X^T X)^{-1} X^T Y.$$
These are the properties of the OLS estimator:
- $E[\hat{\beta}] = \beta$, i.e. it is unbiased
- It is efficient (minimal variance among linear unbiased estimators)
Together these properties make it the best linear unbiased estimator (BLUE). In other words, this is the best estimator when the assumptions above are satisfied.
For the CLRM with one independent variable the OLS estimator is given in the formula sheet for the exam by
$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
Now we define the estimator for the variance of the parameters of the model. This is needed for hypothesis testing in the next section about the t-test. The estimated variance of each parameter is given by the diagonal of
$$\widehat{\mathrm{Var}}(\hat{\beta}) = s^2 (X^T X)^{-1},$$
where $s^2 = \frac{\hat{u}^T \hat{u}}{T - k}$ is the estimated residual variance (T observations, k estimated parameters).
Note on the estimated variance of an estimator
The equation given above is not completely correct, but it is the one we use in practice. The real variance is $\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$, but since we do not know $\sigma^2$ we use its estimator $s^2$. Therefore, the equation above is an estimator for the variance of the estimator $\hat{\beta}$.
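A short numpy sketch of the matrix formulas above (the data is simulated; the t-ratios anticipate the next section):

```python
# Sketch: OLS and the estimated coefficient covariance from the matrix formulas.
import numpy as np

rng = np.random.default_rng(5)
T = 200
x = rng.normal(size=T)
y = 1.0 + 2.0 * x + rng.normal(size=T)

X = np.column_stack([np.ones(T), x])          # include the intercept column
k = X.shape[1]

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X'X)^{-1} X'y
u_hat = y - X @ beta_hat
s2 = (u_hat @ u_hat) / (T - k)                # estimator of sigma^2
cov_beta = s2 * np.linalg.inv(X.T @ X)        # estimated Var(beta_hat)
se = np.sqrt(np.diag(cov_beta))

print("beta_hat        :", beta_hat)
print("standard errors :", se)
print("t-ratios (H0: beta = 0):", beta_hat / se)
```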
Statistics and hypothesis testing for LR
The t-ratio and t-test
The t-ratio is a test statistic for hypothesis testing. It is defined as the deviation of an estimated parameter from its hypothesized value divided by its standard error,
$$t = \frac{\hat{\beta}_i - \beta_i^*}{SE(\hat{\beta}_i)}.$$
Under the null hypothesis this statistic follows a t-distribution with T - k degrees of freedom, so we follow the usual t-distribution procedure for hypothesis testing.
The special case with null hypothesis $H_0: \beta_i = 0$ is the significance test of the i-th regressor.
Also note the F-statistic which is not discussed in this text yet. It is used to test regressions under restrictions.
Goodness of fit
We need a statistic that determines how well a model fits the data (goodness of fit). This is the purpose of the $R^2$ statistic.
It is defined from the total sum of squares (TSS), the explained sum of squares (ESS) and the residual sum of squares (RSS),
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}.$$
The purposes of the t-ratio and the $R^2$ statistic are different: the former measures the relevancy of a regressor, while the latter measures the goodness of fit of the whole model.
Hypothesis testing by maximum likelihood methods
For now we only concern ourselves with the distributions of the following three statistics, as the distribution is important for interpretation (relevant for the exam).
- Wald
Let q be the number of restrictions being tested, T the sample size and k the number of estimated parameters. Then the Wald statistic (in its finite-sample form) approximately follows an F-distribution, F(q, T - k).
- Lagrange multiplier (LM)
The LM statistic asymptotically follows a $\chi^2$ distribution with q degrees of freedom.
- Likelihood ratio (LR)
The LR statistic also asymptotically follows a $\chi^2$ distribution with q degrees of freedom.
Generally, the F-form of the Wald statistic is better for small sample sizes, as it accounts for the sample size T.
Nonlinear regression
Any independent variable might enter with a polynomial degree higher than 1 (e.g. squared), and the model is still a valid regression model; the coefficients are then interpreted differently. The dependent variable might also vary logarithmically with the independent variables. In these cases it can be worth considering how a change in an independent variable affects the dependent variable (i.e. taking a derivative).
The most "important" nonlinear feature in regression in this course is dummy variables. Dummy variables can be considered binary variables that check whether a condition is satisfied. You can add dummy variables up to a certain limit: with an intercept in the model, at most one fewer dummy than there are categories (see the note and sketch below).
Note on the risk of dummy variables
You should be careful not to add as many dummy variables as there are outcomes while still having an intercept. For example, if we add a dummy variable for every day of the week, one of them will always be true on every single day. In other words, the intercept (which may be interpreted as an independent variable that is always set to 1) will be related to the dummy variables as
$$X_0 = 1 = D_1 + D_2 + \dots + D_7,$$
and there is perfect multicollinearity, which is assumed not to be present. This is fixed by not having an intercept in the regression, or equivalently by dropping one of the dummies.
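A small sketch of how this is typically handled in practice with pandas: one category is dropped so the intercept can stay. The data and names are made up.

```python
# Sketch: day-of-week dummies; dropping one category avoids the dummy variable trap.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)
days = pd.Series(rng.choice(["Mon", "Tue", "Wed", "Thu", "Fri"], size=500))
x = rng.normal(size=500)
y = 0.3 * x + (days == "Mon") * 0.5 + rng.normal(size=500)

dummies = pd.get_dummies(days, prefix="day", drop_first=True, dtype=float)
X = sm.add_constant(pd.concat([pd.Series(x, name="x"), dummies], axis=1))
res = sm.OLS(y, X).fit()
print(res.params)   # the intercept absorbs the dropped baseline category
```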
Time series models
Characteristics of financial returns
We need to be acquainted with some terms: stationarity, autocorrelation, volatility clustering and leverage effects.
Volatility clustering is present in a time series when large volatility tends to be followed by large volatility, so absolute returns tend to be autocorrelated at small lags.
Leverage effects are effects where lower-than-average returns lead to higher volatility, while higher-than-average returns do not result in increased volatility.
Autocorrelation
We've already discussed autocorrelation, but for the following material it is essential to be very familiar with the term. When autocorrelation is present there is a correlation between the value of a time series at some time $t$ and its value $s$ periods earlier; under stationarity this correlation depends only on the lag $s$.
The function mapping each integer lag $s$ to the autocovariance of the time series at that lag is the autocovariance function, $\gamma_s = \mathrm{Cov}(y_t, y_{t-s})$; dividing by $\gamma_0$ gives the autocorrelation function (ACF), $\tau_s = \gamma_s / \gamma_0$.
Besides the usual ACF we also have the Partial Autocorrelation Function (PACF). This function measures the correlation between an observation and one made $s$ periods earlier, after removing the effect of the observations at the intermediate lags.
Stationarity
Stationarity generally means that a time series has no trend apart from seasonality: it has constant variance and constant mean. Mathematically we have two definitions of stationarity, a weak and a strong one.
A time series is strictly (strongly) stationary if
$$F_{y_{t_1}, \dots, y_{t_n}}(y_1, \dots, y_n) = F_{y_{t_1 + s}, \dots, y_{t_n + s}}(y_1, \dots, y_n) \quad \text{for all } t_1, \dots, t_n \text{ and all } s,$$
i.e. if the joint probability distribution of the time series is unchanged by a shift in time (lag).
A process is a weakly stationary time series if all of the following conditions are satisfied:
- $E[y_t] = \mu$ (constant mean)
- $E[(y_t - \mu)^2] = \sigma^2 < \infty$ (constant, finite variance)
- $E[(y_{t_1} - \mu)(y_{t_2} - \mu)] = \gamma_{t_2 - t_1}$ (the autocovariance depends only on the lag)
Stationarity example: White noise
White noise processes are examples of stationary processes: they have constant mean, constant variance and no non-trivial autocorrelation (meaning that $\tau_0 = 1$ and $\tau_s = 0$ for $s \ne 0$). We test for white noise by assuming that the sample autocorrelations are normally distributed with variance $\sigma^2 = 1/T$, so that we can do a significance test.
Non-stationarity example: Random walk
Random walks are discussed more in this course. They are defined by $y_t = y_{t-1} + u_t$, where $u_t$ is white noise. They are not stationary, as the probability distribution of the time series is not constant across time: the variance grows with the number of accumulated shocks. The ACF of a random walk, however, decays only slowly from values close to one, so a random walk is more predictable than white noise in many ways.
Seasonality
Seasonality is the property where some value fluctuates predictably within some timeframe.
There are many ways of correcting for seasonality. Seasonal differencing is the most relevant preprocessing technique for this course: if a time series has lag-k seasonality, we difference at lag k to find the corrected series. In other words, $y'_t = y_t - y_{t-k}$ (see the sketch below).
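A one-line sketch of seasonal differencing with pandas, on a simulated monthly series with a yearly (lag-12) season:

```python
# Sketch: removing lag-k seasonality by seasonal differencing with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
t = np.arange(240)
k = 12                                        # e.g. monthly data with a yearly season
y = pd.Series(10 + 3 * np.sin(2 * np.pi * t / k) + rng.normal(scale=0.5, size=t.size))

y_deseasonalized = y.diff(k).dropna()         # y_t - y_{t-k}
print(y_deseasonalized.head())
```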
Other methods of correcting for seasonality involve different types of preprocessing, feature engineering, machine learning, or using seasonal models like SARIMA.
Characteristics
The following characteristics of financial returns are discussed numerous times in this course:
- Non-stationarity of prices, but stationarity of returns.
- No autocorrelations of returns
- There is autocorrelation of squared returns.
- Volatility clustering
- Fat-tailed distributions (non-normal, with large kurtosis)
- Leverage effects
We would also like to define the notion of unit roots. If a non-stationary process $y_t$ must be differenced $d$ times before it becomes stationary, we say that it contains $d$ unit roots and is integrated of order $d$, written $y_t \sim I(d)$.
Tests for time series models
There are many things to test for in financial time series.
- We test for normality with the Jarque-Bera test, as discussed above. The JB test statistic is distributed as $\chi^2(2)$.
- We test for autocorrelation with either the Durbin-Watson, Box-Pierce or Ljung-Box tests.
- To detect heteroscedasticity we use the ARCH test.
- We test for stationarity and unit roots with the Dickey-Fuller (or augmented Dickey-Fuller) test, or the Phillips-Perron test (see the sketch after this list).
- We use the Newey-West estimator to obtain heteroscedasticity- and autocorrelation-consistent (HAC) standard errors.
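A minimal sketch of the Augmented Dickey-Fuller test with statsmodels, applied to a simulated random walk (which has a unit root) and to its first difference (which does not):

```python
# Sketch: the Augmented Dickey-Fuller test. The null hypothesis is that the
# series contains a unit root (is non-stationary).
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(8)
random_walk = np.cumsum(rng.standard_normal(1000))    # y_t = y_{t-1} + u_t

stat, pval, *_ = adfuller(random_walk)
print("levels          : p-value =", round(pval, 3))  # large -> cannot reject unit root

stat, pval, *_ = adfuller(np.diff(random_walk))
print("first difference: p-value =", round(pval, 3))  # small -> stationary
```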
Univariate time series models
Univariate time series models make predictions about a time series based on the history of the series itself. In other words, the models use only past values of $y_t$ (and past error terms) as explanatory variables.
Moving average models (MA)
A q-th order moving average model is defined by
$$y_t = \mu + u_t + \theta_1 u_{t-1} + \theta_2 u_{t-2} + \dots + \theta_q u_{t-q},$$
where $u_t$ is a white noise error term.
Now we study some properties of MA models. They all have expected value $E[y_t] = \mu$.
The autocovariances can be calculated as
$$\gamma_s = \left(\theta_s + \theta_{s+1}\theta_1 + \theta_{s+2}\theta_2 + \dots + \theta_q\theta_{q-s}\right)\sigma^2$$
for $s = 1, \dots, q$, while $\gamma_s = 0$ for $s > q$; the ACF of an MA(q) process therefore cuts off after lag q.
Autoregressive models (AR)
An autoregressive model of order p is defined as
$$y_t = \mu + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + u_t.$$
Note on notation: Lag operator
Sometimes the lag operator is used. It is defined by $L^i y_t = y_{t-i}$. We may also define the function
$$\phi(L) = 1 - (\phi_1 L + \phi_2 L^2 + \dots + \phi_p L^p).$$
Then $\phi(L) y_t = \mu + u_t$ encapsulates the AR model as defined above.
Stationarity of AR model
The stationarity condition of an AR model is that all roots of the characteristic polynomial
$$\phi(z) = 1 - \phi_1 z - \phi_2 z^2 - \dots - \phi_p z^p = 0$$
lie outside the unit circle (see the sketch below).
From algebra we know that this polynomial will have at most p unique roots.
A stationary AR model can also be expressed as an MA($\infty$) process, by repeatedly substituting for the lagged values (equivalently, by inverting $\phi(L)$).
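A small numpy sketch checking the stationarity condition for an assumed AR(2) specification:

```python
# Sketch: checking the AR stationarity condition via the roots of
# phi(z) = 1 - phi_1 z - ... - phi_p z^p.
import numpy as np

phi = [0.6, 0.2]        # AR(2): y_t = mu + 0.6 y_{t-1} + 0.2 y_{t-2} + u_t

# np.roots expects coefficients from the highest power down to the constant;
# here the polynomial is -0.2 z^2 - 0.6 z + 1.
coeffs = [-p for p in phi[::-1]] + [1.0]
roots = np.roots(coeffs)
print("roots     :", roots)
print("stationary:", np.all(np.abs(roots) > 1))   # True for this example
```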
Moments of AR model
The expected value is
$$E[y_t] = \frac{\mu}{1 - \phi_1 - \phi_2 - \dots - \phi_p}.$$
The autocorrelations can be found by solving the Yule-Walker equations,
$$\tau_s = \phi_1 \tau_{s-1} + \phi_2 \tau_{s-2} + \dots + \phi_p \tau_{s-p}, \qquad s \ge 1,$$
with $\tau_0 = 1$ and $\tau_{-s} = \tau_s$.
From the definitions of variance, covariance and autocorrelation we can calculate these quantities for an AR model analytically by hand using the moments.
ARMA
An ARMA process is a combination of an AR and an MA process. Mathematically an ARMA(p,q) process is
$$y_t = \mu + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + u_t + \theta_1 u_{t-1} + \dots + \theta_q u_{t-q},$$
or, expressed with lag operators,
$$\phi(L) y_t = \mu + \theta(L) u_t.$$
The stationarity condition of the ARMA model requires that all roots of the characteristic polynomial $\phi(z)$ have absolute value larger than one, as explained above.
Note also that for lags larger than q (the MA order), the autocorrelations of the model behave like those of an AR model.
Building ARMA models
The Box-Jenkins approach is a 3-step process for building an ARMA model.
- Identification: Determine the order of an ARMA model.
- Estimation: Typically done with least squares or maximum likelihood.
- Diagnostics: We may use two methods: deliberate overfitting and residual diagnostics.
Identification is typically done by finding the model that minimizes the different information criteria (see the registry of information criteria below, and the sketch that follows).
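A sketch of the identification step: fit a small grid of ARMA(p, q) models with statsmodels and keep the order with the lowest AIC (SBIC/BIC could be used the same way). The series is simulated here.

```python
# Sketch: Box-Jenkins identification by grid-searching over (p, q) with AIC.
import itertools
import warnings
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

warnings.filterwarnings("ignore")             # some (p, q) fits emit convergence warnings

rng = np.random.default_rng(9)
T = 500
u = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):                         # simulate an ARMA(1,1): phi=0.6, theta=0.3
    y[t] = 0.6 * y[t - 1] + u[t] + 0.3 * u[t - 1]

best = None
for p, q in itertools.product(range(3), range(3)):
    res = ARIMA(y, order=(p, 0, q)).fit()
    if best is None or res.aic < best[0]:
        best = (res.aic, p, q)

print("lowest AIC at (p, q) =", best[1:])     # typically (1, 1) or close
```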
ARMA variants
There are many different variants of ARMA models. One of them is called ARIMA, where the added I stands for integrated. An integrated autoregressive process has at least one unit root, i.e. a root of the characteristic polynomial on the unit circle, so it violates the stationarity condition. The ARIMA model therefore differences the data d times to make it stationary. An ARIMA(p,d,q) model is equivalent to an ARMA(p,q) model where the original data has been differenced d times.
Forecasting
TODO: Write about forecasting procedures
We may evaluate forecasting procedures with Mean Squared Error (MSE). It is useful because it is differentiable, but it is not easily interpretable.
Root mean squared error (RMSE) and mean absolute error (MAE) try to solve this, as they are easier to interpret. MAE is less sensitive to outliers than RMSE, but MAE is not differentiable while RMSE is. They are defined as
$$\mathrm{MSE} = \frac{1}{T}\sum_{t=1}^{T}(\hat{y}_t - y_t)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad \mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T}\lvert \hat{y}_t - y_t \rvert.$$
We may also use Theil's U-statistic, which measures the forecast error relative to that of a benchmark (e.g. naive) forecast, i.e. it measures error relatively.
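A short numpy sketch of these metrics, plus a relative measure in the spirit of Theil's U that compares against a naive "no-change" benchmark (exact definitions of Theil's U vary in the literature):

```python
# Sketch: common forecast error metrics.
import numpy as np

actual   = np.array([1.0, 1.2, 0.9, 1.1, 1.3])
forecast = np.array([1.1, 1.0, 1.0, 1.2, 1.2])

errors = forecast - actual
mse  = np.mean(errors ** 2)
rmse = np.sqrt(mse)
mae  = np.mean(np.abs(errors))
print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}")

# Relative error versus a naive benchmark that forecasts the previous value.
naive = np.roll(actual, 1)[1:]
u = rmse / np.sqrt(np.mean((naive - actual[1:]) ** 2))
print("relative error vs naive benchmark:", u)
```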
We also have multiple tests for comparing forecasts. The Diebold-Mariano test statistic is
$$DM = \frac{\bar{d}}{\sqrt{\widehat{\mathrm{Var}}(\bar{d})}},$$
which is asymptotically standard normal, where $d_t$ is the loss differential between two forecasts and $\bar{d}$ is its sample mean.
We also have the Clark-West test, which adjusts the loss differential of nested models for the extra noise introduced by estimating the larger model, so that the comparison is not unfairly biased against the larger model.
Volatility
The ACF of returns is statistically insignificant, but the ACF of squared returns (a volatility proxy) is significant for some lags. This is indicative of volatility clustering.
When discussing volatility we differentiate between empirical and implied volatility. Empirical volatility is backward looking and measures the volatility of previous returns. Implied volatility is forward looking and is a measure of expected volatility; it can be inferred from the Black-Scholes model.
We also differentiate between two types of volatility modelling: deterministic and stochastic. We will discuss three families of deterministic volatility models: the GARCH family, the HAR family and the MIDAS family. We will only discuss one type of stochastic volatility model, the Heston model.
ARCH and GARCH models
ARCH
ARCH is short for AutoRegressive Conditional Heteroscedasticity. The ARCH(q) model is
$$u_t \mid \Omega_{t-1} \sim D(0, \sigma_t^2),$$
where D is some distribution (typically normal) and $\Omega_{t-1}$ is the information set at time $t-1$, with conditional variance
$$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \dots + \alpha_q u_{t-q}^2,$$
where $u_t$ are the error terms (residuals) of the mean equation.
We can test for ARCH effects with Engle's ARCH test.
The null hypothesis is that there are no ARCH effects, i.e. that all lag coefficients in an auxiliary regression of the squared residuals on their own q lags are zero. The test statistic is $TR^2 \sim \chi^2(q)$, where T is the number of observations and $R^2$ comes from the auxiliary regression.
We typically estimate ARCH models numerically by maximum likelihood. When the assumed error distribution may be misspecified, this is called Quasi-Maximum Likelihood (QML).
GARCH
But ARCH does not always result in a good fit because it needs many parameters (a long lag q). Therefore Generalized ARCH (GARCH) was developed. GARCH(p,q) keeps the same error specification as above, with conditional variance
$$\sigma_t^2 = \alpha_0 + \sum_{i=1}^{q}\alpha_i u_{t-i}^2 + \sum_{j=1}^{p}\beta_j \sigma_{t-j}^2.$$
GARCH(p,q) is stationary if
$$\sum_{i=1}^{q}\alpha_i + \sum_{j=1}^{p}\beta_j < 1.$$
The unconditional variance of the GARCH(p,q) model is
$$\mathrm{Var}(u_t) = \frac{\alpha_0}{1 - \sum_{i=1}^{q}\alpha_i - \sum_{j=1}^{p}\beta_j}.$$
From this equation we can see why the condition above is necessary for the GARCH model (see the simulation sketch below).
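A numpy sketch simulating a GARCH(1,1) process and checking that the sample variance matches the unconditional variance formula. In practice such models are usually estimated with a dedicated package (e.g. the `arch` library) rather than by hand.

```python
# Sketch: simulating a GARCH(1,1) process and checking its unconditional variance.
import numpy as np

rng = np.random.default_rng(10)
T = 20_000
alpha0, alpha1, beta1 = 0.1, 0.1, 0.8         # alpha1 + beta1 < 1 -> stationary

u = np.zeros(T)
sigma2 = np.zeros(T)
sigma2[0] = alpha0 / (1 - alpha1 - beta1)     # start at the unconditional variance

for t in range(1, T):
    sigma2[t] = alpha0 + alpha1 * u[t - 1] ** 2 + beta1 * sigma2[t - 1]
    u[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

print("theoretical unconditional variance:", alpha0 / (1 - alpha1 - beta1))  # 1.0
print("sample variance of u_t            :", u.var())                        # roughly 1.0
```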
GJR-GARCH
The greatest limitation of GARCH is that it cannot capture leverage effects (the asymmetric effect of the sign of returns on volatility): volatility does not depend on the sign of returns in this model. Therefore we can estimate a GJR-GARCH model,
$$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \gamma u_{t-1}^2 I_{t-1} + \beta \sigma_{t-1}^2,$$
where $I_{t-1} = 1$ if $u_{t-1} < 0$ and $I_{t-1} = 0$ otherwise, so negative shocks increase volatility by an extra $\gamma u_{t-1}^2$.
This model may also be written with separate coefficients for positive and negative shocks instead of a single indicator term.
EGARCH
There also exists Exponential GARCH (EGARCH) to capture asymmetric leverage effects. It is formulated as
$$\ln(\sigma_t^2) = \omega + \beta \ln(\sigma_{t-1}^2) + \gamma \frac{u_{t-1}}{\sqrt{\sigma_{t-1}^2}} + \alpha\left[\frac{\lvert u_{t-1}\rvert}{\sqrt{\sigma_{t-1}^2}} - \sqrt{\frac{2}{\pi}}\right].$$
The logarithmic form guarantees a positive variance without parameter restrictions, while the term in $u_{t-1}/\sqrt{\sigma_{t-1}^2}$, which enters with its sign, creates the desired asymmetry.
HAR models
To define a HAR model we first need to define realized volatility.
For some day t, a consistent estimator of the true latent variance is the realized variance
$$RV_t = \sum_{i=1}^{M} r_{t,i}^2,$$
where $r_{t,i}$ are the M intraday returns on day t.
The HAR (Heterogeneous AutoRegressive) model is then a regression model,
$$RV_{t+1} = \beta_0 + \beta_d RV_t + \beta_w RV_t^{(w)} + \beta_m RV_t^{(m)} + u_{t+1},$$
where $RV_t^{(w)}$ and $RV_t^{(m)}$ are the averages of the daily realized variances over the past week (5 trading days) and month (22 trading days), respectively.
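Since the HAR model is just a regression, it can be estimated with OLS. The sketch below uses a simulated placeholder series in place of actual realized variances and pandas rolling means for the weekly and monthly components.

```python
# Sketch: a HAR regression of next-day realized variance on daily, weekly
# and monthly components (the RV series here is just simulated noise).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(11)
rv = pd.Series(np.abs(rng.normal(size=1000)))     # placeholder realized variances

df = pd.DataFrame({
    "rv_d": rv,                                   # daily component
    "rv_w": rv.rolling(5).mean(),                 # weekly component (5 days)
    "rv_m": rv.rolling(22).mean(),                # monthly component (22 days)
})
df["target"] = rv.shift(-1)                       # next day's realized variance
df = df.dropna()

X = sm.add_constant(df[["rv_d", "rv_w", "rv_m"]])
res = sm.OLS(df["target"], X).fit()
print(res.params)
```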
MIDAS models
MIDAS is short for Mixed Data Sampling. It is a regression method that may use data sampled at different frequencies: a low-frequency dependent variable is regressed on a weighted sum of high-frequency lags of the regressor,
$$y_t = \beta_0 + \beta_1 \sum_{j=0}^{J} w_j(\theta)\, x^{(m)}_{t - j/m} + u_t,$$
where the weights $w_j(\theta)$ come from a parsimonious weighting function (e.g. a beta or exponential Almon lag polynomial).
Panel data models
SUR model
Panel data is a matrix of data that measures values related to different quantities (individuals) and times; it is the extension of cross-sectional data across time. It is useful to study such datasets because we can discover relationships between quantities with these techniques.
A simple way to model this data is with a "pooled regression". But in cases where there is heterogeneity between individuals, and some parameters are not constant across time, the resulting model may fall into a special case of Simpson's paradox, where for example the intercept is not constant across individuals.
A solution to this might be to use SUR (Seemingly Unrelated Regression). In SUR each individual gets its own equation, so in reality there are N seemingly unrelated regressions; they are estimated jointly because their error terms are allowed to be correlated across individuals.
The model in matrix form stacks the individual equations,
$$\begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} = \begin{pmatrix} X_1 & & \\ & \ddots & \\ & & X_N \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_N \end{pmatrix} + \begin{pmatrix} u_1 \\ \vdots \\ u_N \end{pmatrix}.$$
FE and RE models
We deal with two types of model that model panel data with heterogeneity: Fixed Effects (FE) and Random Effects (RE).
Fixed effects models assume that an individual's heterogeneity is correlated with the explanatory variables. In FE models the heterogeneity is captured by allowing different intercepts and removing bias by differencing it out for each individual. But this does not allow for time-invariant regressors: since FE models capture "within-individual" variation (across time) and not variation within a time period (across individuals), any variable that does not vary across time will be collinear with the individual intercept and should not be included in the model. It therefore becomes impossible to interpret the effects of time-invariant regressors on the dependent variable. An example of this might be the effect of gender on income.
A random effects model assumes that the individual-specific heterogeneity is not correlated with the explanatory variables - the opposite of the fixed effects model. Since there are no correlations with the regressors, the heterogeneity is put into the random error term. But if there in reality is a correlation with the explanatory variables, the results will be biased.
The FE model can be estimated either by Least Squares Dummy Variables (LSDV) or by the within estimator. The LSDV approach uses a dummy variable for each individual-specific effect and estimates by OLS. The within estimator demeans the data for each individual, which removes the unobserved heterogeneity across individuals (see the sketch below).
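A pandas sketch of the within transformation on simulated panel data: demean y and x for each individual and run OLS on the demeaned data. This removes the individual-specific intercepts (and, as noted above, any time-invariant regressor).

```python
# Sketch: the fixed-effects "within" estimator by demeaning per individual.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(12)
n_ind, n_t = 100, 10
ids = np.repeat(np.arange(n_ind), n_t)
alpha = rng.normal(scale=2.0, size=n_ind)[ids]    # individual-specific intercepts
x = rng.normal(size=n_ind * n_t) + 0.5 * alpha    # regressor correlated with heterogeneity
y = alpha + 1.5 * x + rng.normal(size=n_ind * n_t)

df = pd.DataFrame({"id": ids, "x": x, "y": y})

# Within transformation: subtract each individual's time average
demeaned = df[["x", "y"]] - df.groupby("id")[["x", "y"]].transform("mean")

res = sm.OLS(demeaned["y"], demeaned[["x"]]).fit()
print("within estimate of beta:", res.params["x"])   # close to 1.5; pooled OLS would be biased
```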
There is also a third approach, the first-difference estimator, which models changes over time instead of levels.
It is generally thought that FE is only plausible when the sample effectively constitutes the whole population. The RE model often produces more efficient estimates than FE because RE has fewer parameters, but RE also assumes that the error term is uncorrelated with the explanatory variables.
We can use a Hausman specification test to determine the model.
Risk measurement and Portfolio optimization
Risk measures
Say we have some portfolio value at time 0 and time 1. With a given set of investments, the portfolio value at time 1 is a random variable X with some probability distribution.
A risk measure $\rho$ maps this random variable to a real number, interpreted as the amount of buffer capital that must be added to make the position acceptable. We require some properties of a good risk measure:
Translation invariance: $\rho(X + c) = \rho(X) - c$,
where c is a deterministic (risk-free) amount added to the position; adding risk-free capital reduces the required buffer by the same amount.
Monotonicity: If $X_1 \le X_2$ in every outcome, then $\rho(X_1) \ge \rho(X_2)$,
with this property saying that a position that is always worth more is less risky.
Normalization: $\rho(0) = 0$,
saying that taking no position requires no buffer capital and is acceptable.
Positive homogeneity: $\rho(\lambda X) = \lambda \rho(X)$ for $\lambda \ge 0$,
so scaling a position scales the required buffer capital proportionally.
Subadditivity: $\rho(X_1 + X_2) \le \rho(X_1) + \rho(X_2)$,
rewarding diversification: combining positions never requires more buffer capital than holding them separately.
When a risk measure satisfies all these properties above, it is coherent.
Value-at-Risk (VaR)
The value at risk at level $\alpha$ (e.g. 95% or 99%) is the loss that will not be exceeded with probability $\alpha$ over the chosen horizon,
$$\mathrm{VaR}_\alpha(L) = \inf\{\, l : P(L \le l) \ge \alpha \,\},$$
where L is the loss of the portfolio, i.e. VaR is the $\alpha$-quantile of the loss distribution.
Note that VaR is not subadditive and is therefore not coherent. Another disadvantage of VaR is that the user has to choose a time horizon and confidence level. It also does not describe losses larger than the VaR, as it says nothing about the tail of the distribution beyond that point.
Expected shortfall tries to remedy this.
Expected Shortfall (ES)
Expected shortfall tries to remedy the fact that the VaR metric does not describe losses larger than the VaR. Expected shortfall is therefore the expected loss, given that the loss exceeds the VaR,
$$\mathrm{ES}_\alpha = E\left[\,L \mid L \ge \mathrm{VaR}_\alpha(L)\,\right].$$
Note that ES is subadditive, and therefore a coherent risk measure that rewards diversification. A sketch of computing both measures by historical simulation follows.
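A numpy sketch of historical-simulation VaR and ES at the 95% level, using a simulated fat-tailed return series:

```python
# Sketch: historical-simulation VaR and Expected Shortfall at the 95% level.
import numpy as np

rng = np.random.default_rng(13)
returns = rng.standard_t(df=4, size=10_000) * 0.01   # fat-tailed daily returns
losses = -returns                                     # losses are negative returns

alpha = 0.95
var = np.quantile(losses, alpha)                      # 95% Value-at-Risk
es = losses[losses >= var].mean()                     # expected loss beyond the VaR

print(f"VaR(95%) = {var:.4f}")
print(f"ES(95%)  = {es:.4f}")                         # ES >= VaR by construction
```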
Tests and diagnostics registry
Heteroscedasticity tests
Goldfeld-Quandt (GQ) test
The GQ test is a test for heteroscedasticity. It is useful for testing the homoscedasticity assumption in regression analysis.
It splits the sample into two subsamples of lengths T1 and T2. Then a regression is run on both subsamples and the residual variances $s_1^2$ and $s_2^2$ are calculated. The null hypothesis is that the variances are equal.
The test statistic is
$$GQ = \frac{s_1^2}{s_2^2} \sim F(T_1 - k, T_2 - k),$$
with the larger residual variance placed in the numerator.
White's test
White's test tries to model the squared residuals with an auxiliary regression containing the regressors of the original equation, their squares and their cross-products. Then we can run a significance test with an F-test or with the LM statistic $TR^2 \sim \chi^2$ from the auxiliary regression.
ARCH test
The ARCH test detects (conditional) heteroscedasticity in a regression by estimating an ARCH(q) model on the residuals of the original regression. Then we run an LM test with the test statistic $TR^2 \sim \chi^2(q)$ and the null hypothesis stating that all the ARCH coefficients are zero (no ARCH effects). A small sketch is shown below.
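A sketch of the ARCH LM test using statsmodels' `het_arch` (applied here to a simulated homoscedastic residual series, so the null should not be rejected); the exact return signature assumed is (LM statistic, LM p-value, F statistic, F p-value).

```python
# Sketch: Engle's ARCH LM test on a residual series with statsmodels.
import numpy as np
from statsmodels.stats.diagnostic import het_arch

rng = np.random.default_rng(14)
resid = rng.normal(size=1000)                 # homoscedastic residuals

lm_stat, lm_pval, f_stat, f_pval = het_arch(resid, nlags=5)
print("ARCH LM statistic:", lm_stat)
print("p-value          :", lm_pval)          # large -> no ARCH effects
```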
Autocorrelation
The Durbin-Watson test tests for first-order autocorrelation only; the statistic is distributed around 2 under the null hypothesis of no autocorrelation.
There are many other tests described in this course, like Breusch-Godfrey, Box-Pierce and Ljung-Box, but none are explained here yet.
Information criteria (IC)
Information criteria are typically used for model identification for ARMA models (finding the appropriate (p,q)-order). The main mechanisms of these criteria reward RSS minimization, but add a penalty for the number of parameters. This way we can increase the goodness of fit, but still avoid overfitting the data.
There are three popular IC: AIC, SBIC, and HQIC.
They are defined as
$$\mathrm{AIC} = \ln(\hat{\sigma}^2) + \frac{2k}{T}, \qquad \mathrm{SBIC} = \ln(\hat{\sigma}^2) + \frac{k}{T}\ln T,$$
where $\hat{\sigma}^2$ is the estimated residual variance, k the number of estimated parameters and T the sample size.
They differ in which attributes they reward/punish the most. SBIC has a stiffer parameter penalty than the AIC, as may be seen in the formulas. SBIC is also consistent, while AIC is not.
TODO: Write about HQIC.
Hausman specification test
The Hausman test compares the estimators under the FE and RE models. The null hypothesis is that the random effects are uncorrelated with the explanatory variables, so that both estimators are consistent and RE is efficient.
If we reject H0 then RE is inconsistent and FE is preferred, and if we do not reject H0 then RE is consistent and efficient. The test statistic is
$$H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})^T\left[\widehat{\mathrm{Var}}(\hat{\beta}_{FE}) - \widehat{\mathrm{Var}}(\hat{\beta}_{RE})\right]^{-1}(\hat{\beta}_{FE} - \hat{\beta}_{RE}) \sim \chi^2(k),$$
where k is the number of parameters being compared.
The logic of the test is that the RE model is efficient under H0, but inconsistent if the random effects (the individual-specific heterogeneity) are in fact correlated with the explanatory variables.
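A small numpy sketch of the statistic above, given FE and RE coefficient vectors and their covariance matrices (the numbers below are made up; in practice they come from your panel estimators):

```python
# Sketch: computing the Hausman statistic from FE and RE estimates.
import numpy as np
from scipy.stats import chi2

b_fe = np.array([1.52, -0.31])                 # FE coefficient estimates (made up)
b_re = np.array([1.40, -0.25])                 # RE coefficient estimates (made up)
cov_fe = np.array([[0.010, 0.001], [0.001, 0.008]])
cov_re = np.array([[0.006, 0.001], [0.001, 0.005]])

diff = b_fe - b_re
H = diff @ np.linalg.inv(cov_fe - cov_re) @ diff
k = len(diff)
p_value = 1 - chi2.cdf(H, df=k)
print(f"Hausman statistic = {H:.3f}, p-value = {p_value:.3f}")
# small p-value -> reject H0 -> RE inconsistent -> prefer FE
```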
Types of data
There are three types of data:
- Cross-sectional data: Data set concentrated at a single time across individuals, i.
- Time-series data: Concentrated at a single individual across time, t.
- Panel data: A 2-dimensional set of data points across individuals, i, and time, t.
We take the word across to mean that there is a set of distinct individuals (times) indexed by i (t). Individuals may be any meaningful set of data that are connected somehow. It may be a tuple of a person's age, education and income; a company's tuple of (D/E ratio, EBIT); or a country's tuple of (unemployment rate, GDP growth).