TIØ4317: Empirical and Quantitative Methods in Finance
Compendium for TIØ4317
This compendium was written for the 2025 version of the course.
Learning objectives of the course
There are three parts:
- Linear regression and diagnostic tests
- Autoregressive models
- Risk analysis and portfolio optimization
Statistical background
It is advantageous to be acquainted with concepts from statistics such as probability distributions, correlation/covariance, moments, estimator theory, confidence intervals, hypothesis testing, and to have some familiarity with linear regression.
These concepts are explained in the compendium for TMA4240: Statistics.
Linear regression
Classical linear regression model
These are the assumptions of a classical linear regression model (CLRM).
For n independent variables per individual i, the model is

$y_i = \beta_0 + \beta_1 x_{1,i} + \dots + \beta_n x_{n,i} + u_i,$

and the assumptions are:
- $E[u_i | X_i] = 0$ (the error term has zero mean)
- $Var(u_i \,|\, X_i) = \sigma^2$, i.e. homoscedasticity (constant variance in error term)
- $\text{Cov}(u_i, u_j) = 0 \; \forall \ i \neq j$, i.e. no autocorrelation (error terms are uncorrelated across individuals)
- $u_i \sim N(0, \sigma^2)$ (error term is unbiased and normally distributed)
- No perfect collinearity (the independent variables are not perfectly correlated)
Assumption violation (important)
If any of the assumptions are violated, the coefficient estimates, their standard errors, and the distributions we use to evaluate the estimators may all be wrong. Therefore we often try to remedy violations with different techniques, such as non-linear variables (explained below), until the assumptions are satisfied.
Below is a list of assumption violations. In the sections following this one, some of the terms and procedures mentioned in the list will be elaborated upon.
If $E[u] \neq 0$
- Detection: There will be model bias
- Effect: Residual test statistics, e.g. $R^2$, will be wrong because the residuals are affected by the bias
- Solution: Add an intercept, $\beta_0$, to the regression model
If $Var(u_i)$ is not constant (heteroscedasticity)
- Detection: GQ-test, White-test, ARCH-test (not explained here yet; a small detection sketch follows this list)
- Effect: OLS estimators will not be BLUE
- Solution: Find some $z_t$ such that $Var(u_t) = \sigma^2 z_t^2$. Then divide every variable in the regression model by $z_t$.
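Below is a minimal sketch of the detection tests named above, using simulated data where the error variance grows with the regressor. The data, variable names, and the use of the statsmodels library are illustrative assumptions, not part of the compendium.

```python
# Heteroscedasticity detection sketch on simulated data (illustrative, not from the compendium).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white, het_goldfeldquandt

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1, 10, n)
u = rng.normal(0, 0.5 * x)           # error variance grows with x: homoscedasticity violated
y = 1.0 + 2.0 * x + u

X = sm.add_constant(x)               # design matrix with intercept
res = sm.OLS(y, X).fit()

# White test: regress squared residuals on regressors, their squares and cross products.
lm_stat, lm_pval, f_stat, f_pval = het_white(res.resid, X)
print(f"White test: LM p-value = {lm_pval:.4f}")

# Goldfeld-Quandt test: compare residual variances in two sub-samples, here sorted by x.
order = np.argsort(x)
gq_stat, gq_pval, _ = het_goldfeldquandt(y[order], X[order])
print(f"Goldfeld-Quandt: F p-value = {gq_pval:.4f}")
```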
If there is autocorrelation ($\text{Cov}(u_i, u_j) \neq 0$ for some $i \neq j$)
- Detection: Durbin-Watson test, Breusch-Godfrey, Box-Pierce, Box-Ljung (a small sketch follows this list)
- Effect: OLS will not be BLUE and $R^2$ will be inflated
- Solution: Adding lags (see ARMA models), but this is debated and can be dangerous. The autocorrelation may also be caused by omitted variables.
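Here is a minimal sketch of the autocorrelation diagnostics named above, run on simulated AR(1) errors. The data-generating process, variable names, and the statsmodels functions used are illustrative assumptions.

```python
# Autocorrelation diagnostics sketch on simulated AR(1) errors (illustrative).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_breusch_godfrey, acorr_ljungbox

rng = np.random.default_rng(1)
T = 300
x = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):                       # AR(1) errors -> autocorrelated residuals
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 0.5 + 1.5 * x + u

res = sm.OLS(y, sm.add_constant(x)).fit()

print("Durbin-Watson:", durbin_watson(res.resid))        # values near 2 indicate no autocorrelation
lm, lm_pval, f, f_pval = acorr_breusch_godfrey(res, nlags=4)
print("Breusch-Godfrey LM p-value:", lm_pval)
# Ljung-Box and Box-Pierce statistics on the residuals:
print(acorr_ljungbox(res.resid, lags=[10], boxpierce=True))
```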
If there is endogeneity ($E[u | X] \neq 0$ )
- Detection: By checking the exogeneity assumption, $E[u | X] = 0$. Endogeneity may be caused by omitted variables, measurement error, and sample selection.
- Effect: If an important variable is omitted, the coefficients on all other variables will be biased and inconsistent. Standard errors are also biased.
- Solution: Control for omitted variables, use an instrumental variable (see below), Heckman correction (see below), ...
Including an unimportant (irrelevant) variable, on the other hand, does not bias the coefficients: they remain unbiased, but become inefficient.
If the error term is not normally distributed
- Detection: The Bera-Jarque test checks the skewness and kurtosis of a distribution for normality (a small sketch follows this list). The test statistic is
$W = T\left[\frac{b_1^2}{6} + \frac{(b_2 - 3)^2}{24}\right] \sim \chi^2(2),$
where $b_1$ is the skewness and $b_2$ the kurtosis of the residuals, and T is the sample size.
- Effect: Will make hypothesis testing faulty, as we assume distributions in the tests.
- Solution: No clear solution
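A minimal sketch of the normality test on regression residuals, assuming simulated data with heavy-tailed (non-normal) errors; names and data are illustrative.

```python
# Bera-Jarque normality test on residuals (illustrative data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera

rng = np.random.default_rng(2)
x = rng.normal(size=400)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=400)   # heavy-tailed errors violate normality

res = sm.OLS(y, sm.add_constant(x)).fit()
jb_stat, jb_pval, skew, kurt = jarque_bera(res.resid)
print(f"BJ = {jb_stat:.2f}, p-value = {jb_pval:.4f}, skew = {skew:.2f}, kurtosis = {kurt:.2f}")
```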
In case of multicollinearity
- Detection: Check for correlation between explanatory variables. If they are highly correlated, there is multicollinearity (a small sketch follows this list).
- Effect: $R^2$ will be high, and the standard errors of the coefficients will be high. The regression is too sensitive to modification (not stable), and therefore the confidence intervals of the coefficients will be too wide.
- Solution: Change the regression model to remove the collinearity. Another solution is PCA.
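A minimal sketch of multicollinearity detection via pairwise correlations and variance inflation factors (VIFs). The simulated variables, their names, and the rule of thumb in the comment are illustrative assumptions.

```python
# Multicollinearity detection sketch (illustrative data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # x2 is almost a copy of x1
x3 = rng.normal(size=n)

data = np.column_stack([x1, x2, x3])
print(np.corrcoef(data, rowvar=False))     # corr(x1, x2) is close to 1

# VIF above roughly 10 is a common rule of thumb for problematic collinearity.
X = sm.add_constant(data)
for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, variance_inflation_factor(X, i))
```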
Instrumental variables (IV)
If an independent variable X is correlated with the error term u, we try to divide the variable into two parts: one term correlated with the error and one term uncorrelated.
The technique starts by identifying some other variable (the IV) which is correlated with the endogenous variable, but is uncorrelated with unobserved effects on the dependent variable (the error term). Then we use the two-stage least squares (2SLS) method.
The first stage involves predicting the endogenous variable from the IV with a linear regression. This gives predictions for the endogenous variable where the unobserved effects on the dependent variable are removed, so we use these predictions in the final regression model instead of the endogenous variable itself. Mathematically it looks like this:
If we have a regression model,
$y_i = \beta_0 + \beta_1 x_i + u_i,$
but
$\text{Cov}(x_i, u_i) \neq 0,$
then in the first stage we regress the endogenous variable on the instrument $z_i$,
$x_i = \gamma_0 + \gamma_1 z_i + v_i,$
where the part of $x_i$ that is correlated with the unobserved effects on the dependent variable ends up in $v_i$. Now the fitted values $\hat x_i = \hat\gamma_0 + \hat\gamma_1 z_i$ are uncorrelated with $u_i$, so we use them in place of $x_i$ in the second-stage regression,
$y_i = \beta_0 + \beta_1 \hat x_i + u_i.$
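A minimal sketch of 2SLS done "by hand" with two OLS stages, following the description above. The data-generating process, the instrument z, and all names are illustrative assumptions; note that the second-stage standard errors printed here are not the correct 2SLS standard errors.

```python
# 2SLS sketch with two manual OLS stages (illustrative simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1000
z = rng.normal(size=n)                        # instrument: correlated with x, not with u
u = rng.normal(size=n)
x = 0.8 * z + 0.5 * u + rng.normal(size=n)    # x is endogenous: correlated with u
y = 1.0 + 2.0 * x + u

# Stage 1: regress the endogenous variable on the instrument and keep the fitted values.
stage1 = sm.OLS(x, sm.add_constant(z)).fit()
x_hat = stage1.fittedvalues

# Stage 2: use the fitted values in place of x.
stage2 = sm.OLS(y, sm.add_constant(x_hat)).fit()
print("OLS (biased):  ", sm.OLS(y, sm.add_constant(x)).fit().params)
print("2SLS (stage 2):", stage2.params)       # slope should be close to the true value 2.0
```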
Sample selection and Heckman correction
Sample selection occurs when the observations are not made randomly and a specific subset of the population is not in the sample. Heckman correction attempts to remedy this. It is done in two steps:
The first step consists of estimating a selection model to predict the probability of an observation being included in a random sample. This is called a limited dependent variable regression.
The second step involves estimating the main model with a correction variable, the inverse Mills ratio (IMR), which is computed from the first-step model. The purpose of the IMR is to correct for the selection. A small sketch of the two steps follows.
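The sketch below illustrates the two steps on simulated data: a probit selection equation, the inverse Mills ratio from its linear index, and an OLS outcome regression on the selected sub-sample. The selection equation, the exclusion variable z, and all names are illustrative assumptions.

```python
# Two-step Heckman correction sketch (illustrative simulated data).
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 5000
x = rng.normal(size=n)
z = rng.normal(size=n)                                   # drives selection only (exclusion restriction)
e = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=n)
selected = (0.5 + 1.0 * z - 0.8 * x + e[:, 0] > 0)       # observation is in the sample if True
y = 1.0 + 2.0 * x + e[:, 1]                              # outcome, only observed when selected

# Step 1: probit selection model on the full sample (limited dependent variable regression).
W = sm.add_constant(np.column_stack([x, z]))
probit = sm.Probit(selected.astype(float), W).fit(disp=0)
xb = W @ probit.params
imr = norm.pdf(xb) / norm.cdf(xb)                        # inverse Mills ratio

# Step 2: OLS on the selected sub-sample with the IMR as an extra regressor.
X2 = sm.add_constant(np.column_stack([x[selected], imr[selected]]))
heckit = sm.OLS(y[selected], X2).fit()
print(heckit.params)     # the coefficient on x should be close to the true value 2.0
```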
TO-DO: write about DiD
PCA (Principal component analysis)
Principal component analysis, in the context of regression, takes the explanatory variables as vectors and finds a new basis of orthogonal vectors. These new vectors can be used as explanatory variables; they are uncorrelated and can be ranked by "importance", i.e. how much of the variance in the original explanatory variables they explain, so we can pick the k most important components and use them in the regression instead of using all the original variables.
Mathematically, PCA is done by finding the eigenvectors of the covariance (or correlation) matrix of the explanatory variables; the eigenvectors with the largest eigenvalues are the most important principal components.
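A minimal sketch of PCA via the eigendecomposition of the covariance matrix of the regressors, as described above. The simulated data and variable names are illustrative.

```python
# PCA sketch via eigendecomposition of the covariance matrix (illustrative data).
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 4))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]        # make two columns strongly correlated

Xc = X - X.mean(axis=0)                        # center the data
cov = np.cov(Xc, rowvar=False)                 # covariance matrix of the regressors
eigvals, eigvecs = np.linalg.eigh(cov)         # eigh: the covariance matrix is symmetric

# Sort components by explained variance (largest eigenvalue first).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
components = Xc @ eigvecs[:, :k]               # the k most important principal components
print("share of variance explained:", eigvals[:k].sum() / eigvals.sum())
```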
OLS estimator
Ordinary least squares (OLS) is an estimation method for the linear regression model. The method works by minimizing the sum of squares of the error terms,
$\sum_i \hat u_i^2 = \sum_i (y_i - \hat y_i)^2.$
We often write the LR model in its matrix form when discussing OLS,
$Y = X\beta + u,$
where Y and u are n-dimensional vectors (one entry per observation), X is an $n \times k$ matrix of observations of the independent variables (including a column of ones for the intercept), and $\beta$ is a k-dimensional coefficient vector.
We define the OLS estimator by,
$\hat \beta = (X^T X)^{-1} X^T Y.$
These are the properties of the OLS estimator:
- $E[\hat \beta] = \beta$, i.e. it is an unbiased estimator
- It is efficient (minimal variance)
These two properties make it the best linear unbiased estimator (BLUE). In other words, this is the best estimator when the assumptions above are satisfied.
For the CLRM with one independent variable, the OLS estimator is given in the formula sheet for the exam by,
$\hat \beta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}, \qquad \hat \beta_0 = \bar y - \hat \beta_1 \bar x.$
Now we define an estimator for the variance of the parameter estimates. This is needed for hypothesis testing in the next section about the t-test. The estimated variance of each parameter is given by the corresponding diagonal element of
$\widehat{Var}(\hat \beta) = s^2 (X^T X)^{-1},$
where
$s^2 = \frac{\sum_i \hat u_i^2}{n - k}$
is the estimated variance of the error term and k is the number of estimated parameters.
Note on the estimated variance of an estimator
The equation given above is not the exact variance: the true variance is $Var(\hat \beta) = \sigma^2 (X^T X)^{-1}$, but since we do not know $\sigma$ we use the estimator $s$. Therefore, the equation above is an estimator for the variance of the estimator $\hat \beta$.
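A minimal sketch of the two formulas above, $\hat \beta = (X^T X)^{-1} X^T Y$ and $\widehat{Var}(\hat \beta) = s^2 (X^T X)^{-1}$, checked against statsmodels. The simulated data and names are illustrative.

```python
# OLS estimator and its estimated variance via the matrix formulas (illustrative data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, k = 200, 3
X = sm.add_constant(rng.normal(size=(n, k - 1)))       # n x k design matrix with intercept
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)            # (X'X)^{-1} X'y, solved stably
resid = y - X @ beta_hat
s2 = resid @ resid / (n - k)                            # estimated error variance s^2
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))      # standard errors of beta_hat

fit = sm.OLS(y, X).fit()
print(beta_hat, se)
print(fit.params, fit.bse)                              # should match the manual results
```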
Statistics and hypothesis testing for LR
The t-ratio and t-test
The t-ratio is a test statistic for hypothesis testing. It is defined as the difference between the estimated parameter and its hypothesised value, divided by the standard error,
$t = \frac{\hat \beta_i - \beta_i^*}{SE(\hat \beta_i)}.$
This statistic follows a t-distribution with $T - k$ degrees of freedom, so we follow the t-distribution procedure for hypothesis testing.
The most common case is the significance test with null hypothesis $H_0: \beta_i = 0$, i.e. testing whether the regressor belongs in the model (a small computation sketch follows below).
Also note the F-statistic which is not discussed in this text yet. It is used to test regressions under restrictions.
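A minimal sketch of the t-ratio for $H_0: \beta_i = 0$. The numbers below are placeholders standing in for estimates like those produced by the OLS sketch above; they and the use of scipy are assumptions for illustration.

```python
# t-ratios and two-sided p-values for H0: beta_i = 0 (placeholder estimates).
import numpy as np
from scipy import stats

beta_hat = np.array([1.02, 1.97, -0.48])   # placeholder coefficient estimates
se = np.array([0.07, 0.07, 0.07])          # placeholder standard errors
n, k = 200, 3

t_ratio = (beta_hat - 0.0) / se                       # (estimate - hypothesised value) / SE
p_value = 2 * stats.t.sf(np.abs(t_ratio), df=n - k)   # two-sided p-values, T - k degrees of freedom
print(t_ratio, p_value)
```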
Goodness of fit
We need a statistic that determines how well a model fits the data (goodness of fit). This is the purpose of the $R^2$ statistic.
It relates the explained sum of squares (ESS), the residual sum of squares (RSS), and the total sum of squares (TSS),
$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}.$
The purposes of the t-ratio and the $R^2$ statistic are different: the former measures the relevance of a regressor, while the latter measures goodness of fit.
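A tiny worked example of the $R^2$ formula from the sums of squares; the fitted values below are illustrative placeholders.

```python
# R^2 from the sums of squares (illustrative y and fitted values).
import numpy as np

y = np.array([2.0, 3.5, 4.0, 5.5, 7.0])
y_hat = np.array([2.2, 3.1, 4.3, 5.6, 6.8])

tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
ess = tss - rss                     # explained sum of squares (TSS = ESS + RSS for OLS with intercept)
print(ess / tss, 1 - rss / tss)     # both expressions equal R^2
```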
Hypothesis testing by maximum likelihood methods
For now we only concern ourselves with the distributions of the following three statistics as the distribution is important for interpretation (relevant for exam).
- Wald
Let q be the number of restrictions being tested, T the sample size, and k the number of parameters. Then the Wald statistic approximately follows an F-distribution, F(q, T - k).
- Lagrange multiplier (LM)
The LM statistic asymptotically follows a $\chi^2(q)$ distribution.
- Likelihood ratio (LR)
The LR statistic is also asymptotically $\chi^2(q)$ distributed.
Generally, the finite-sample (F-distributed) Wald statistic is better for small sample sizes, as it accounts for the sample size T.
Nonlinear regression
Any independent variable might enter with a polynomial degree higher than 1 (e.g. squared) and the model is still a valid regression model; the coefficients are then interpreted differently. The dependent variable might also vary logarithmically with the independent variables. In these cases it can be worth considering how a change in the independent variables affects the dependent variable (i.e. taking a derivative).
The most "important" nonlinear type of regressor in this course is the dummy variable. Dummy variables can be considered as binary variables that check whether a condition is satisfied. You can add dummy variables up to a certain limit; this limit is decided by the number of possible outcomes, as explained in the note below (a small sketch follows the note).
Note on the risk of dummy variables
You should be careful, though, not to add as many dummy variables as there are outcomes while still having an intercept. For example, if we add a dummy variable in a regression for every day of the week, one of them will always be true on every single day. In other words, the intercept (which may be interpreted as an independent variable that is always set to 1) will be related to the dummy variables as
$\mathbf X_0 = 1 = D_1 + D_2 + \dots + D_7,$
and there is perfect multicollinearity, which is assumed not to be present. This is fixed by not having any intercept in the regression, or equivalently by dropping one of the dummy variables.
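A minimal sketch of day-of-week dummies that avoids the trap described above: with an intercept, only 6 of the 7 dummies are kept. The simulated data, column names, and the use of pandas/statsmodels are illustrative assumptions.

```python
# Dummy variable sketch avoiding perfect multicollinearity (illustrative data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(8)
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
df = pd.DataFrame({
    "day": rng.choice(days, size=500),
    "x": rng.normal(size=500),
})
df["y"] = 1.0 + 0.5 * df["x"] + rng.normal(size=500)

dummies = pd.get_dummies(df["day"], drop_first=True, dtype=float)   # drops one category
X = sm.add_constant(pd.concat([df[["x"]], dummies], axis=1))
print(sm.OLS(df["y"], X).fit().params)
```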
Other notes
Types of data
There are three types of data:
- Cross-sectional data: Data set concentrated at a single time across individuals, i.
- Time-series data: Concentrated at a single individual across time, t.
- Panel data: A 2-dimensional set of data points across individuals, i, and time, t.
We take the word across to mean that there is a set of distinct individuals (times) indexed by i (t). An individual may be any meaningful unit of data: for example a person's tuple of (age, education, income), a company's pair of (D/E ratio, EBIT), or a country's pair of (unemployment rate, GDP growth).