
TIØ4317: Empirical and Quantitative Methods in Finance

Tags:
  • finance
  • statistikk
  • statistics

Compendium for TIØ4317

This compendium was written for the 2025 version of the course.

Learning objective of course

There are three parts:

  • Linear regression and diagnostic tests
  • Autoregressive models
  • Risk analysis and portfolio optimization

Statistical background

It is advantageous to be acquainted with concepts from statistics such as probability distributions, correlation and covariance, moments, estimator theory, confidence intervals, and hypothesis testing, as well as to have some familiarity with linear regression.

These concepts are explained in the compendium for TMA4240: Statistics.


Linear regression

Classical linear regression model

These are the assumptions of a classical linear regression model (CLRM).

For n independent variables per individual i, $X_{i,1}, X_{i,2}, ..., X_{i,n}$ and one dependent variable $Y_i$, we have

$$ Y_i = \beta_0 + \beta_1 \ X_{i,1} + \beta_2 X_{i,2} + \dots + \beta_n X_{i,n} + u_i $$

and,

  1. $E[u_i | X_i] = 0$
  2. $Var(u_i \ | X_i ) = \sigma^2$ i.e. homoscedasticity (constant variance in error term).
  3. $\text{Cov}(u_i, u_j) = 0 \; \forall \ i \neq j$ i.e. no autocorrelation (error terms are uncorrelated across individuals).
  4. $u_i \sim N(0, \sigma^2)$ (error term has zero mean and is normally distributed)
  5. No perfect collinearity (the independent variables are not perfectly correlated).

Assumption violation (important)

If any of the assumptions are violated, the coefficient estimates, their standard errors, and the distributions we use to evaluate the estimators may all be wrong. Therefore we often try to remedy violations with different techniques, such as introducing non-linear variables (explained below), until the assumptions are satisfied.

Below is a list of assumption violations. Some of the terms and procedures mentioned in the list are elaborated upon in the sections that follow.

If $E[u] \neq 0$
  • Detection: There will be model bias
  • Effect: Residual-based test statistics, e.g. $R^2$, will be wrong because the residuals are affected by the bias
  • Solution: Add an intercept, $\beta_0$, to the regression model
If $Var(u_i)$ is variable (heteroscedasticity)
  • Detection: GQ-test, White-test, ARCH-test (not explained here yet)
  • Effect: OLS estimators will not be BLUE
  • Solution: Find some variable $z_t$ such that $Var(u_t) = \sigma^2 z_t^2$. Then divide every variable in the regression model by $z_t$, so that the transformed error term $u_t / z_t$ has constant variance $\sigma^2$.
If there is autocorrelation ($Cov(u_i, u_j) \neq 0$)
  • Detection: Durbin-Watson test, Breusch-Godfrey, Box-Pierce, Box-Ljung
  • Effect: OLS will not be BLUE and $R^2$ will be inflated
  • Solution: Adding lags (see ARMA models), although this is debated and can be dangerous. The autocorrelation may also be caused by omitted variables.
If there is endogeneity ($E[u | X] \neq 0$)
  • Detection: By checking the exogeneity assumption, $E[u | X] = 0$. It may be caused by omitted variables, measurement error and sample selection.
  • Effect: If an important variable is omitted, the coefficients on all other variables will be biased and inconsistent. The standard errors are also biased.
  • Solution: Control for omitted variables, use an instrumental variable (see below), apply a Heckman correction (see below), ...

Including an unimportant variable, on the other hand, leaves the coefficients unbiased, but inefficient.

If error term is not normally distributed
  • Detection: The Bera-Jarque test checks whether the skewness and kurtosis of the residuals are consistent with a normal distribution (a small sketch computing $W$ follows this item).

$$ W = T\ \left[ \frac{S^2}{6} + \frac{(K - 3)^2}{24} \right] \sim \chi^2(2) $$

where $S = \frac{E[u_i^3]}{\sigma^3}$ (skewness) and $K = \frac{E[u_i^4]}{\sigma^4}$ (kurtosis)

  • Effect: Hypothesis testing becomes unreliable, since the tests assume a particular distribution for the error terms.
  • Solution: No clear solution
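
A minimal numpy sketch (not from the course material) of how the statistic above can be computed from a vector of residuals; the function name and test data are illustrative:

```python
# Bera-Jarque statistic W computed directly from the moment definitions above.
import numpy as np

def bera_jarque(residuals):
    """Return the Bera-Jarque statistic W for a sample of residuals."""
    u = np.asarray(residuals, dtype=float)
    T = u.size
    sigma = u.std()                       # standard deviation of the residuals
    S = np.mean(u**3) / sigma**3          # sample skewness
    K = np.mean(u**4) / sigma**4          # sample kurtosis
    W = T * (S**2 / 6 + (K - 3)**2 / 24)  # approximately chi^2(2) under normality
    return W

# Example: normal residuals should give a small W (compare with chi^2(2) critical values).
rng = np.random.default_rng(0)
print(bera_jarque(rng.normal(size=1000)))
```
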
In case of multicollinearity
  • Detection: Check for correlation between the explanatory variables. If they are highly correlated, there is multicollinearity.
  • Effect: $R^2$ will be high, but the standard errors of the individual coefficients will also be high. The regression is very sensitive to small changes in the specification (not stable), and therefore the confidence intervals of the coefficients will be too wide.
  • Solution: Change the regression model to remove the collinearity, for example by dropping one of the correlated variables. Another solution is PCA (see below).

Instrumental variables (IV)

If an independent variable X is correlated with the error term u, we try to split the variable into two parts: one part that is correlated with the error and one part that is not.

The technique starts by identifying some other variable (the IV) which is correlated with the endogenous variable, but uncorrelated with the unobserved effects on the dependent variable (the error term). Then we use the two-stage least squares (2SLS) method.

The first stage involves predicting the endogenous variable from the IV with a linear regression. This gives predictions of the endogenous variable in which the unobserved effects on the dependent variable have been removed. We then use these predictions in the final regression model instead of the endogenous variable itself. Mathematically it looks like this:

If we have a regression model,

$$ Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + u_i $$

but $X_1$ is endogenous, then we find some other variable, Q, which has $Cov(X_1, Q) \neq 0$ and $Cov(Q, u_i) = 0$. Then we model the endogenous variable $X_1$ with a regression model like

$$ X_{1,i} = \lambda_0 + \lambda_1 Q_i + v_i $$

where the part of $X_1$ that is correlated with the error term ends up in $v_i$. So if we predict $X_{1,i}$ with $\hat X_{1,i} = \hat\lambda_0 + \hat\lambda_1 Q_i$, then we can formulate a new regression model,

$$ Y_i = \beta_0 + \beta_1 \hat X_{1,i} + \beta_2 X_{2,i} + u_i $$

By applying an instrumental variable in this way, the regressor $\hat X_{1,i}$ is no longer endogenous.
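
A minimal numpy sketch of the two-stage procedure on simulated data; the data-generating process and variable names are illustrative assumptions, not from the course:

```python
# Two-stage least squares (2SLS) sketch with plain numpy on simulated data.
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Illustrative setup: Q is a valid instrument, X1 is endogenous because it
# shares the unobserved component e with the error term u.
Q = rng.normal(size=n)
e = rng.normal(size=n)                     # unobserved confounder
X1 = 0.8 * Q + e + rng.normal(size=n)      # endogenous regressor
X2 = rng.normal(size=n)                    # exogenous regressor
u = 2.0 * e + rng.normal(size=n)           # error term correlated with X1
Y = 1.0 + 0.5 * X1 + 1.5 * X2 + u

def ols(X, y):
    """OLS estimate (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

ones = np.ones(n)

# Stage 1: regress X1 on the instrument Q (with intercept) and form predictions.
Z = np.column_stack([ones, Q])
lam = ols(Z, X1)
X1_hat = Z @ lam

# Stage 2: regress Y on the predicted X1_hat and the exogenous X2.
beta_2sls = ols(np.column_stack([ones, X1_hat, X2]), Y)

# Naive OLS with the endogenous X1 for comparison (biased estimate of 0.5).
beta_ols = ols(np.column_stack([ones, X1, X2]), Y)
print("2SLS:", beta_2sls, " OLS:", beta_ols)
```
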

Sample selection and Heckman correction

Sample selection occurs when the observations are not made randomly, so that a specific subset of the population is missing from the sample. The Heckman correction attempts to remedy this. It is done in two steps:

The first step consists of estimating a selection model that predicts the probability of an observation being included in the sample. This is a limited dependent variable regression (typically a probit model).

The second step involves estimating the main model with an additional correction variable, the inverse Mills ratio, computed from the first-step model. The purpose of the inverse Mills ratio is to correct for the selection.
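
A rough sketch of the two steps, assuming the statsmodels and scipy libraries are available; the simulated data and variable names are made up for illustration:

```python
# Heckman two-step sketch: probit selection model, then OLS with the
# inverse Mills ratio added as a regressor. Illustrative simulated data.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 2000

X = rng.normal(size=n)                         # regressor in the outcome equation
W = rng.normal(size=n)                         # extra variable driving selection
eps = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=n)
selected = (0.5 + 1.0 * W + eps[:, 0] > 0)     # selection rule (latent index > 0)
Y = 1.0 + 2.0 * X + eps[:, 1]                  # outcome, only observed if selected

# Step 1: probit model for the probability of being in the sample.
Z = sm.add_constant(W)
probit_res = sm.Probit(selected.astype(float), Z).fit(disp=0)
xb = Z @ probit_res.params
imr = norm.pdf(xb) / norm.cdf(xb)              # inverse Mills ratio

# Step 2: OLS on the selected observations, with the inverse Mills ratio added.
X_out = sm.add_constant(np.column_stack([X[selected], imr[selected]]))
ols_res = sm.OLS(Y[selected], X_out).fit()
print(ols_res.params)                          # [intercept, beta on X, coef. on IMR]
```
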

TO-DO: write about DiD (difference-in-differences)

PCA (Principal component analysis)

Principal component analysis in the context of regression takes the explanatory variables as vectors and finds a new basis of orthogonal vectors. These new vectors (the principal components) can be used as explanatory variables, and since they are orthogonal they are uncorrelated with each other. The components can also be ranked by how much of the variation in the original explanatory variables they capture, so we can pick the k most important components and use them in the regression instead of using them all.

Mathematically, PCA is done by finding the eigenvectors of the $X^T X$ matrix (usually after standardizing the variables). The eigenvectors define the new orthogonal explanatory variables, and the amount of variation each component captures is given by the corresponding eigenvalue.
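
A minimal numpy sketch of principal component regression along these lines; the data and the choice of k are illustrative assumptions:

```python
# PCA via the eigendecomposition of X'X on standardized data,
# followed by a regression on the first k principal components.
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 500, 5, 2

# Illustrative data: five correlated explanatory variables and a response.
base = rng.normal(size=(n, 2))
X = base @ rng.normal(size=(2, m)) + 0.1 * rng.normal(size=(n, m))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

# Standardize, then eigendecompose X'X (proportional to the correlation matrix).
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xs.T @ Xs)
order = np.argsort(eigvals)[::-1]          # sort components by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the k components with the largest eigenvalues and regress y on them.
PC = Xs @ eigvecs[:, :k]                   # scores of the first k components
Z = np.column_stack([np.ones(n), PC])
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)   # OLS on the principal components
print("share of variation kept:", eigvals[:k].sum() / eigvals.sum())
print("coefficients:", beta)
```
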

OLS estimator

Ordinary least squares (OLS) is an estimation method for the linear regression model. The method works by minimizing the sum of squares of the error terms, $u_i$, in other words, the differences between the model predictions and the observed values.

We often write the LR model in its matrix form when discussing OLS,

$$ \mathbf Y = \mathbf X\mathbf \beta + \mathbf u $$

where Y and u are n-dimensional vectors, X is an $n \times (m+1)$-dimensional matrix, $\mathbf \beta$ is an (m+1)-dimensional vector, and m is the number of independent variables. The first column of X contains only 1's, representing the intercept $\beta_0$ in the model above, and the remaining columns contain the observations of the independent variables.

We define the OLS estimator by,

$$ \hat \beta = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf Y $$

These are the properties of the OLS estimator:

  • $E[\hat \beta] = \beta$, unbiased estimator
  • Efficient (minimal variance)

An estimator with these two properties is called the best linear unbiased estimator (BLUE). In other words, OLS is the best estimator when the assumptions above are satisfied.

For the CLRM with one independent variable the OLS estimator is given in the formula sheet for the exam by,

$$ \hat\beta_1 = \frac{\sum_{i=1}^n(Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n(X_i -\bar X)^2}, $$

$$ \hat\beta_0 = \bar Y - \bar X\hat\beta_1 $$

Now we define an estimator for the variance of the parameter estimates. This is needed for hypothesis testing in the next section about the t-test. The estimated variance is,

$$ Var(\hat \beta) = s^2 (X^T X)^{-1}, $$

where $s^2 = \frac{\hat u^T \hat u}{T-k}$, T is the number of observations, and k is the number of estimated parameters (regressors including the intercept).
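
A minimal numpy sketch of these formulas on simulated data; the data-generating process is an illustrative assumption:

```python
# OLS estimator and its estimated variance, computed directly from the formulas.
import numpy as np

rng = np.random.default_rng(4)
T, k = 200, 3                                   # observations and parameters (incl. intercept)

# Simulated CLRM data: Y = 1 + 2*X1 - 1*X2 + u with homoscedastic normal errors.
X = np.column_stack([np.ones(T), rng.normal(size=T), rng.normal(size=T)])
beta_true = np.array([1.0, 2.0, -1.0])
Y = X @ beta_true + rng.normal(scale=0.5, size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)    # (X'X)^{-1} X'Y
u_hat = Y - X @ beta_hat                        # residuals
s2 = (u_hat @ u_hat) / (T - k)                  # estimate of sigma^2
var_beta = s2 * np.linalg.inv(X.T @ X)          # estimated covariance matrix of beta_hat
se = np.sqrt(np.diag(var_beta))                 # standard errors

print("beta_hat:", beta_hat)
print("standard errors:", se)
```
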

Note on the estimated variance of an estimator

The equation given above is not the true variance, but it is the one used in practice: the true variance is $Var(\hat \beta) = \sigma^2 (X^T X)^{-1}$, but since we do not know $\sigma$ we use the estimate $s$. Therefore, the equation above is an estimator for the variance of the estimator $\hat \beta$.

Statistics and hypothesis testing for LR

The t-ratio and t-test

The t-ratio is a test statistic for hypothesis testing. It is defined as the difference between the estimated and hypothesized parameter value, divided by the standard error,

$$ t = \frac{\hat \beta - \beta}{SE(\hat \beta)}, $$

This statistic follows a t-distribution (with $T-k$ degrees of freedom), so we follow the usual t-distribution procedure for hypothesis testing.

The $\beta$ will be indexed in most linear regressions because each estimator has its own t-ratio, like $t_j = \frac{\hat \beta_j - \beta_j}{SE(\hat \beta_j)}$. Then we can estimate the significance of each coefficient separately.
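
As a rough cross-check, the same t-ratios can be obtained from a standard library; the sketch below assumes statsmodels is available and uses made-up data:

```python
# t-ratios and p-values for each coefficient, using statsmodels OLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.0 * x2 + rng.normal(size=n)   # x2 is irrelevant by construction

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

print(res.tvalues)   # t-ratio for each coefficient (H0: beta_j = 0)
print(res.pvalues)   # corresponding p-values; x2 should come out insignificant
```
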

Also note the F-statistic, which is not discussed in this text yet. It is used to test joint restrictions on a regression.

Goodness of fit

We need a statistic that determines how well a model fits the data (goodness of fit). This is the purpose of the $R^2$ statistic. It is the square of the correlation between $y$ and $\hat y$, and it measures how much of the variation in $y$ is explained by the model.

It is the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS),

$$ R^2 = \frac{ESS}{TSS} = \frac{\sum_i (\hat y_i - \bar y)^2}{\sum_i (y_i - \bar y)^2} $$

The purposes of the t-ratio and the $R^2$ statistic are different: the former measures the relevance of an individual regressor, while the latter measures the overall goodness of fit of the model.
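
A short numpy check of the definition above on made-up data, verifying that ESS/TSS coincides with the squared correlation between $y$ and $\hat y$:

```python
# R^2 computed both as ESS/TSS and as the squared correlation of y and y_hat.
import numpy as np

rng = np.random.default_rng(6)
n = 300
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta

ess = np.sum((y_hat - y.mean())**2)            # explained sum of squares
tss = np.sum((y - y.mean())**2)                # total sum of squares
r2_ratio = ess / tss
r2_corr = np.corrcoef(y, y_hat)[0, 1]**2       # squared correlation of y and y_hat

print(r2_ratio, r2_corr)                       # the two values coincide
```
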

Hypothesis testing by maximum likelihood methods

For now we only concern ourselves with the distributions of the following test statistics, as the distribution is important for interpretation (relevant for the exam).

  • Wald

Let m be the number of restrictions being tested, T the sample size, and k the number of estimated parameters. Then the Wald statistic (in its F-test form) approximately follows an F-distribution, F(m, T-k).

  • Lagrange multiplier (LM)

The LM statistic follows a $\chi_m^2$-distribution, where m is again the number of restrictions.

Generally, the finite-sample (F-form) Wald statistic is better for small sample sizes, as it accounts for the sample size T.

Nonlinear regression

An independent variable may enter with a polynomial degree higher than 1, for example squared, and the model is still a valid regression model, but the coefficients must then be interpreted differently. The dependent variable might also vary logarithmically with the independent variables. In these cases it is worth considering how a change in an independent variable affects the dependent variable (i.e. taking a derivative).
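
For example, in a quadratic specification the marginal effect is no longer a single coefficient:

$$ Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + u_i \quad \Rightarrow \quad \frac{\partial Y_i}{\partial X_i} = \beta_1 + 2\beta_2 X_i, $$

so the effect of a change in $X_i$ depends on the level of $X_i$.
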

The most "important" nonlinear type of correlation in regression in this course is dummy variables. Dummy variables can be considered as binary variables that check if a condition is satisfied. You can add dummy variables up to a certain limit. This limit is decided by

Note on the risk of dummy variables

You should be careful not to add as many dummy variables as there are outcomes while still having an intercept. For example, if we add a dummy variable to a regression for every day of the week, exactly one of them will be equal to 1 for every single observation. In other words, the intercept (which may be interpreted as an independent variable that is always set to 1) is related to the dummy variables by $\mathbf X_0 = 1 = D_1 + D_2 + \dots + D_7$, so there is perfect multicollinearity, which is assumed not to be present. This is fixed by removing the intercept from the regression (or, equivalently, by dropping one of the dummy variables). A small sketch demonstrating the problem is shown below.
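
A short numpy illustration of the dummy variable trap; the day-of-week setup is made up for illustration:

```python
# The dummy variable trap: intercept + a dummy for every category makes X rank-deficient.
import numpy as np

n = 56
day = np.arange(n) % 7                          # day of week for each observation (0-6)

dummies = np.zeros((n, 7))
dummies[np.arange(n), day] = 1.0                # one dummy per day; exactly one is 1 per row

X_trap = np.column_stack([np.ones(n), dummies])        # intercept + all 7 dummies
X_ok = np.column_stack([np.ones(n), dummies[:, 1:]])   # intercept + 6 dummies (one dropped)

print(np.linalg.matrix_rank(X_trap), X_trap.shape[1])  # rank 7 < 8 columns: perfect collinearity
print(np.linalg.matrix_rank(X_ok), X_ok.shape[1])      # rank 7 = 7 columns: no trap
```
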


Other notes

Types of data

There are three types of data:

  • Cross-sectional data: Data collected at a single point in time across individuals, i.
  • Time-series data: Data for a single individual across time, t.
  • Panel data: A 2-dimensional set of data points across individuals, i, and time, t.

We take the word "across" to mean that there is a set of distinct individuals (or times) indexed by i (or t). An individual may be any meaningful collection of connected data points: for example, a tuple of a person's age, education, and income; a company's pair of (D/E ratio, EBIT); or a country's tuple of (unemployment rate, GDP growth).

Written by

aleksfasting
Last updated: Mon, 7 Jul 2025 00:08:33 +0200.