Table of Contents
  1. Learning objective of course
  2. Statistical background
  3. Linear regression
    1. Classical linear regression model
      1. Assumption violation (important)
        1. If $E[u] \neq 0$
        2. If $Var(u_i)$ is variable (heteroscedasticity)
        3. If there is autocorrelation ($Cov(u_i, u_j) \neq 0$)
        4. If there is endogeneity ($E[u | X] \neq 0$)
        5. If error term is not normally distributed
        6. In case of multicollinearity
      2. Instrumental variables (IV)
      3. Sample selection and Heckman correction
      4. Difference-in-Difference (DiD)
      5. PCA (Principal component analysis)
    2. OLS estimator
    3. Statistics and hypothesis testing for LR
      1. The t-ratio and t-test
      2. Goodness of fit
      3. Hypothesis testing by maximum likelihood methods
    4. Nonlinear regression
  4. Time series models
    1. Characteristics of financial returns
      1. Autocorrelation
      2. Stationarity
      3. Seasonality
      4. Characteristics
    2. Tests for time series models
    3. Univariate time series models
      1. Moving average models (MA)
      2. Autoregressive models (AR)
        1. Stationarity of AR model
        2. Moments of AR model
      3. ARMA
        1. Building ARMA models
      4. ARMA variants
      5. Forecasting
        1. TODO: Write about forecasting procedures
    4. Volatility
      1. ARCH and GARCH models
        1. ARCH
        2. GARCH
        3. GJR-GARCH
        4. EGARCH
      2. HAR models
      3. MIDAS models
  5. Panel data models
    1. SUR model
    2. FE and RE models
  6. Risk measurement and Portfolio optimization
    1. Risk measures
      1. Value-at-Risk (VaR)
      2. Expected Shortfall (ES)
  7. Tests and diagnostics registry
    1. Heteroscedasticity tests
      1. Goldfeld-Quandt (GQ) test
      2. White's test
      3. ARCH test
    2. Autocorrelation
    3. Information criteria (IC)
      1. TODO: Write about HQIC.
    4. Hausman specification test
    5. Types of data
      1. Note on the estimated variance of an estimator
        1. Note on the risk of dummy variable
      2. Stationarity example: White noise
      3. Non-stationarity example: Random walk

TIØ4317: Empirical and Quantitative Methods in Finance

Tags:
  • statistikk
  • finance
  • statistics

This compendium was written for the 2025 version of the course.

Learning objective of course

There are three parts:

  • Linear regression and diagnostic tests
  • Autoregressive models
  • Risk analysis and portfolio optimization

Statistical background

It is advantageous to be acquainted with concepts from statistics such as probability distributions, correlation/covariance, moments, estimator theory, confidence intervals and hypothesis testing, and to have some familiarity with linear regression.

These concepts are explained in the compendium for TMA4240: Statistics.


Linear regression

Classical linear regression model

These are the assumptions of a classical linear regression model (CLRM).

For n independent variables per individual i, $X_{i,1}, X_{i,2}, ..., X_{i,n}$ and one dependent variable $Y_i$, we have

$$ Y_i = \beta_0 + \beta_1 \ X_{i,1} + \beta_2 X_{i,2} + \dots + \beta_n X_{i,n} + u_i $$

and,

  1. $E[u_i | X_i] = 0$
  2. $Var(u_i \ | X_i ) = \sigma^2$ i.e. homoscedasticity (constant variance in error term).
  3. $\text{Cov}(u_i, u_j) = 0 \; \forall \ i,j$ i.e. no autocorrelation (error terms are uncorrelated across individuals).
  4. $u_i \sim N(0, \sigma^2)$ (the error term has zero mean and is normally distributed)
  5. No perfect collinearity (the independent variables are not perfectly correlated).

Assumption violation (important)

If any of the assumptions are violated, the coefficient estimates, their standard errors and the distributions we use to evaluate the estimators may all be wrong. We therefore often try to remedy violations with different techniques, like non-linear variables (explained below), until the assumptions are satisfied.

Below we find a list of assumption violations. Some of the terms and procedures mentioned in the list are elaborated upon in the sections that follow.

If $E[u] \neq 0$

  • Detection: There will be model bias
  • Effect: Residual test statistics will be wrong because the residuals are affected by the bias, e.g. $R^2$
  • Solution: Add an intercept to regression model, $\beta_0$

If $Var(u_i)$ is variable (heteroscedasticity)

  • Detection: GQ-test, White-test, ARCH-test
  • Effect: OLS estimators will not be BLUE
  • Solution: Find some $z_t$ such that $Var(u_t) = \sigma^2 z_t^2$. Then divide every variable in the regression model by $z_t$ (generalized least squares).

If there is autocorrelation ($Cov(u_i, u_j) \neq 0$)

  • Detection: Durbin-Watson test, Breusch-Godfrey, Box-Pierce, Box-Ljung
  • Effect: OLS will not be BLUE and $R^2$ will be inflated
  • Solution: Adding lags (see ARMA models), but this is debated and can be dangerous. The autocorrelation may also be caused by omitted variables.

If there is endogeneity ($E[u | X] \neq 0$)

  • Detection: By checking the exogeneity assumption, $E[u | X] = 0$. Endogeneity may be caused by omitted variables, measurement error and sample selection.
  • Effect: The coefficients on all other variables will be biased and inconsistent if an important variable is omitted. Standard errors are also biased.
  • Solution: Control for omitted variables, use an instrumental variable (see below), Heckman correction (see below), ...

A related misspecification is the inclusion of an irrelevant variable. Then the coefficients will still be unbiased, but inefficient.

If error term is not normally distributed

  • Detection: The Bera-Jarque (Jarque-Bera) test tests the skewness and kurtosis of a distribution for normality.

$$ W = T\ \left[ \frac{S^2}{6} + \frac{(K - 3)^2}{24} \right] \sim \chi^2(2) $$

where $S = \frac{E[u_i^3]}{\sigma^3}$ is the skewness and $K = \frac{E[u_i^4]}{\sigma^4}$ is the kurtosis.

  • Effect: Will make hypothesis testing faulty, as we assume distributions in the tests.
  • Solution: No clear solution

In case of multicolinearity

  • Detection: Check for correlation between the explanatory variables. If they are highly correlated there is multicollinearity.
  • Effect: $R^2$ will be high, but the standard errors of the coefficients will also be high. The regression is too sensitive to modification (not stable) and therefore the confidence intervals of the coefficients will be too wide.
  • Solution: Change the regression model to remove the collinearity. Another solution is PCA (see below).

Instrumental variables (IV)

If an independent variable X is correlated with the error term u, we try to divide the variable into two parts: one part correlated with the error and one part uncorrelated with it.

The technique starts by identifying some other variable (the IV) which is correlated with the endogenous variable, but uncorrelated with the unobserved effects on the dependent variable (the error term). Then we use the two-stage least squares method (2SLS).

The first stage predicts the endogenous variable from the IV with a linear regression. This gives predictions for the endogenous variable in which the unobserved effects on the dependent variable are removed, and we use these predictions in the final regression model instead of the endogenous variable. Mathematically it looks like this:

If we have a regression model,

$$ Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + u_i $$

but $X_1$ is endogenous, then we find some other variable, Q, which has $Cov(X_1, Q)\neq 0$ and $Cov(Q, u_i) = 0$. Then we model the endogenous variable $X_1$ with a regression model like

$$ X_{1,i} = \lambda_0 + \lambda_1 Q_i + v_i $$

where all of the unobserved effects of $X_1$ "are" in $v_i$. So if we predict $X_{1,i}$ with $\hat X_{1,i} = \hat\lambda_0 + \hat\lambda_1 Q_i$, then we can reformulate a new regression model,

$$ Y_i = \beta_0 + \beta_1 \hat X_{1,i} + \beta_2 X_{2,i} + u_i $$

By applying the instrumental variable, $\hat X_{1,i}$ is no longer endogenous.
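As a rough illustration (not from the course material; the data, parameter values and helper name are made up), the two stages can be written out with plain numpy. The exogenous regressor is also included in the first stage, as is common in 2SLS:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data: X1 is endogenous (correlated with u), Q is the instrument.
Q = rng.normal(size=n)
v = rng.normal(size=n)
u = rng.normal(size=n)
X1 = 0.8 * Q + 0.5 * u + v          # endogenous: depends on the error u
X2 = rng.normal(size=n)             # exogenous regressor
Y = 1.0 + 2.0 * X1 + 3.0 * X2 + u

def ols(X, y):
    """OLS coefficients via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Stage 1: regress the endogenous X1 on the instrument (and the exogenous regressor).
Z = np.column_stack([np.ones(n), Q, X2])
X1_hat = Z @ ols(Z, X1)

# Stage 2: replace X1 with its fitted values in the structural equation.
X_stage2 = np.column_stack([np.ones(n), X1_hat, X2])
beta_2sls = ols(X_stage2, Y)

# For comparison: naive OLS on the endogenous regressor.
X_naive = np.column_stack([np.ones(n), X1, X2])
beta_ols = ols(X_naive, Y)
print("2SLS:", beta_2sls)   # roughly [1, 2, 3]
print("OLS: ", beta_ols)    # beta_1 is biased (here upward)
```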

Sample selection and Heckman correction

Sample selection occurs when the observations are not made randomly and a specific subset of the population is missing from the sample. Heckman correction attempts to remedy this. It is done in two steps:

The first step consists of estimating a selection model to predict the probability of an observation being included in the sample. This is a limited dependent variable regression.

The second step involves estimating the main model with a correction variable, the inverse Mills ratio (IMR), obtained from the first model. The purpose of the IMR is to correct for the selection.

Difference-in-Difference (DiD)

Difference-in-Difference tries to estimate a causal relationship by comparing the change over time in an outcome for a group that receives a treatment (treatment group) with the change for a group that does not (control group).

PCA (Principal component analysis)

Principal component analysis in the context of regression takes the explanatory variables as vectors and finds a new basis of orthogonal vectors. These new vectors can be used as explanatory variables, and they are uncorrelated variables that together explain the variance in the original explanatory data. The components can be ranked in order of how much variance they explain ("importance"), so we can pick the k most important ones and use them in the regression instead of using them all.

Mathematically, PCA is done by finding the eigenvectors of the $X^T X$ matrix (of the centred explanatory variables). The eigenvectors are the new orthogonal explanatory variables, and the amount of variance each of them explains is given by the corresponding eigenvalue.
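A minimal numpy sketch of this eigen-decomposition, on made-up data with two nearly collinear columns:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 0] + 0.1 * rng.normal(size=200)   # make two columns nearly collinear

Xc = X - X.mean(axis=0)                 # center the regressors
eigval, eigvec = np.linalg.eigh(Xc.T @ Xc)

# Sort components by explained variance (largest eigenvalue first).
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

k = 3                                   # keep the k "most important" components
PC = Xc @ eigvec[:, :k]                 # new orthogonal explanatory variables
print(eigval / eigval.sum())            # share of variance per component
```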

OLS estimator

Ordinary least squares (OLS) is an estimation method for the linear regression model. The method works by minimizing the sum of squared error terms, $u_i$, in other words the squared differences between the model predictions and the real measurements.

We often write the LR model in its matrix form when discussing OLS,

$$ \mathbf Y = \mathbf X\mathbf \beta + \mathbf u $$

where Y and u are n-dimensional vectors, X is an $n \times (m+1)$-dimensional matrix, $\boldsymbol\beta$ is an $(m+1)$-dimensional vector, and m is the number of independent variables. The first column of X contains only 1's to represent the intercept $\beta_0$ in the model above, and the remaining columns contain the observations of the independent variables.

We define the OLS estimator by,

$$ \hat \beta = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf Y $$

These are the properties of the OLS estimator:

  • $E[\hat \beta] = \beta$, unbiased estimator
  • Efficient (minimal variance)

These two properties make OLS the best linear unbiased estimator (BLUE). In other words, this is the best estimator when the assumptions above are satisfied.

For the CLRM with one independent variable the OLS estimator is given in the formula sheet for the exam by,

$$ \hat\beta_1 = \frac{\sum_{i=1}^n(Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n(X_i -\bar X)^2}, $$

$$ \hat\beta_0 = \bar Y - \bar X\hat\beta_1 $$

Now we define the estimator for the variance of the model parameters. This is needed for hypothesis testing in the next section about the t-test. The estimated variance of the parameters is,

$$ Var(\hat \beta) = s^2 (X^T X)^{-1}, $$

where $s^2 = \frac{\hat u^T \hat u}{T-k}$, T is the number of observations and k is the number of estimated coefficients (regressors including the intercept).

Note on the estimated variance of an estimator

The equation given above is not the exact variance, but it is the one we use, because the true variance is $Var(\hat \beta) = \sigma^2 (X^T X)^{-1}$ and we do not know $\sigma$, so we use the estimator $s$ instead. The equation above is therefore an estimator of the variance of the estimator $\hat \beta$.
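As a small sketch (with illustrative data and a hypothetical helper name), the estimator and its estimated variance follow directly from the matrix formulas above:

```python
import numpy as np

def ols_fit(X, y):
    """OLS estimate, estimated error variance s^2 and standard errors.

    X is assumed to already contain a column of ones for the intercept.
    """
    T, k = X.shape                            # k counts all estimated coefficients (incl. intercept)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (T - k)              # s^2 = u_hat' u_hat / (T - k)
    var_beta = s2 * XtX_inv                   # estimated Var(beta_hat)
    se = np.sqrt(np.diag(var_beta))
    return beta_hat, s2, se

# Hypothetical example data
rng = np.random.default_rng(2)
T = 100
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.5, size=T)
beta_hat, s2, se = ols_fit(X, y)
print(beta_hat, se)
```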

Statistics and hypothesis testing for LR

The t-ratio and t-test

The t-ratio is a test statistic for hypothesis testing. It is defined as the deviation of a parameter estimate from its hypothesized value divided by its standard error,

$$ t = \frac{\hat \beta - \beta}{SE(\hat \beta)}, $$

This statistic follows a t-distribution, so we follow the usual t-distribution procedure for hypothesis testing.

The $\beta$ will be indexed in most linear regressions because each estimator has its own t-ratio, like $t_j = \frac{\hat \beta_j - \beta_j}{SE(\hat \beta_j)}$. Then we can estimate the significance of each coefficient separately.

Also note the F-statistic which is not discussed in this text yet. It is used to test regressions under restrictions.

Goodness of fit

We need a statistic that measures how well a model fits the data (goodness of fit). This is the purpose of the $R^2$ statistic. It is the square of the correlation between $y$ and $\hat y$, and it measures how much of the total variation in $y$ is explained by the model.

It is the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS), or equivalently one minus the ratio of the residual sum of squares (RSS) to TSS,

$$ R^2 = \frac{ESS}{TSS} = \frac{\sum_i (\hat y_i - \bar y)^2}{\sum_i (y_i - \bar y)^2} = 1 - \frac{RSS}{TSS} $$

The purposes of the t-ratio and the R^2 statistic are different. The former measures relevancy of regressor, and the latter measures goodness of fit.

Hypothesis testing by maximum likelihood methods

For now we only concern ourselves with the distributions of the following three statistics as the distribution is important for interpretation (relevant for exam).

  • Wald

Let q be the number of restrictions being tested, T the sample size and k the number of parameters. Then the Wald statistic approximately follows an F-distribution, $F(q, T-k)$.

  • Lagrange multiplier (LM)

The LM statistic follows a $\chi_q^2$-distribution.

  • Likelihood ratio (LR)

The LR statistic is asymptotically $\chi_q^2$-distributed as well.

Generally, the finite-sample (F-distributed) Wald statistic is better for small sample sizes, as it accounts for the sample size T.

Nonlinear regression

Any independent variable might enter with a polynomial degree higher than 1 (for example squared) and the model is still a valid regression model; the coefficients are then interpreted differently. The dependent variable might also vary logarithmically with the independent variables. In these cases it is worth considering how a change in an independent variable affects the dependent variable (i.e. taking a derivative).

The most "important" nonlinear type of variable in regression in this course is the dummy variable. Dummy variables can be considered binary variables that check whether a condition is satisfied. You can add dummy variables up to a certain limit; this limit is explained in the note below.

Note on the risk of dummy variable

However, you should be careful not to add as many dummy variables as there are outcomes while still having an intercept. For example, if we add a dummy variable in a regression for every day of the week, exactly one of them will be equal to 1 on every single day. In other words, the intercept (which may be interpreted as an independent variable that is always set to 1) will be related to the dummy variables by $\mathbf X_0 = 1 = D_1 + D_2 + \dots + D_7$, and there is perfect multicollinearity, which is assumed not to be present. This is fixed by not having any intercept in the regression, or by dropping one of the dummy variables.


Time series models

Characteristics of financial returns

We need to be acquainted with some terms: stationarity, autocorrelation, volatility clustering and leverage effects.

Volatility clustering is present in a time series when large volatility is usually followed by large volatility, so absolute returns tend to be autocorrelated at small lags.

Leverage effects are effects where lower-than-average returns lead to higher volatility, while higher-than-average returns do not result in correspondingly increased volatility.

Autocorrelation

We have already discussed autocorrelation, but for the following material it is essential to be very familiar with the term. When autocorrelation is present there is a correlation between the value of a time series at some time $t_i$ and the value at a later time $t_i + k$, and this correlation depends only on the lag $k$: it is the same for all $t_i$ as long as $k$ is the same. A time series therefore has one autocorrelation for each lag $k$. Mathematically, autocorrelation is defined as,

$$ \tau_k = \frac{\mathrm{Cov}(r_t, r_{t-k})}{\sqrt{\mathrm{Var}(r_t)\mathrm{Var}(r_{t-k})}} = \frac{\mathbb{E}[(r_t - \mu)(r_{t-k} - \mu)]}{\sigma^2} $$

The function mapping each integer lag to the corresponding autocovariance of a time series is the autocovariance function (acf), or mathematically $s \mapsto \gamma_s = E\left[(y_t - E[y_t])(y_{t-s}-E[y_{t-s}])\right]\;\; \forall s\in\mathbb{N}$. It is often more useful to use autocorrelations, because then the values are normalized to lie between $-1$ and $1$. Conveniently, the autocorrelations are easily found from the autocovariances by $\tau_k = \frac{\gamma_k}{\gamma_0}$.

We also have another autocorrelation function besides the typical ACF, called the partial autocorrelation function (PACF). This function measures the correlation between an observation and one made $l$ periods earlier with the effect of the intermediate lags removed. So if part of the correlation at lag $l$ is only the effect of shorter lags propagating forward, the PACF at lag $l$ excludes that indirect effect. The PACF often has a complicated closed-form expression.
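A minimal sketch of the sample ACF in numpy (the helper name and data are made up; the PACF is usually computed with a library such as statsmodels rather than by hand):

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelation function tau_k = gamma_k / gamma_0."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    gamma0 = y @ y / len(y)                    # sample variance (gamma_0)
    acf = [1.0]
    for k in range(1, max_lag + 1):
        gamma_k = y[k:] @ y[:-k] / len(y)      # sample autocovariance at lag k
        acf.append(gamma_k / gamma0)
    return np.array(acf)

# Hypothetical use on a simulated return series
rng = np.random.default_rng(3)
returns = rng.normal(size=1000)
print(sample_acf(returns, max_lag=5))   # close to zero for white-noise-like data
```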

Stationarity

Stationarity loosely means that a time series has no trend apart from seasonality: it has constant variance and constant mean. Mathematically we have two definitions of stationarity, a weak and a strict one.

A time series is strictly (strongly) stationary if,

$$ P(y_{t_1} \leq b_1,... ,y_{t_n} \leq b_n) = P(y_{t_1+\lambda} \leq b_1,... ,y_{t_n+\lambda} \leq b_n) $$

i.e. if the probability distribution of a time series is the same with respect to lag.

A process is a weakly stationary time series if all of the following conditions are satisfied:

  • $E[y_t] = \mu$ (constant mean)
  • $E[(y_t-\mu)(y_t-\mu)] = \sigma^2 < \infty$ (constant variance)
  • $E[(y_{t_1}-\mu)(y_{t_2} - \mu)] = \gamma_{t_2-t_1}$ (the autocovariance depends only on the lag)

Stationarity example: White noise

White noise is an example of a stationary process: it has constant mean, constant variance and no non-trivial autocorrelation (only $\tau_0 = 1$, while $\tau_k = 0$ for $k \neq 0$). We test for white noise by assuming that the sample autocorrelations are normally distributed with variance $\sigma^2 = 1/T$, so that we can do a significance test.

Non-stationarity example: Random walk

Random walks are discussed frequently in this course. They are defined by $y_t = y_{t-1} + u_t$, where $u_t$ is white noise. They are not stationary, as the probability distribution of the series is not constant over time; it changes as the white noise accumulates over many lags. However, the ACF of a random walk decays only very slowly, so it is more predictable than white noise in many ways.

Seasonality

Seasonality is the property where some value fluctuates predictably within some timeframe.

There are many ways of correcting for seasonality. Seasonal differencing is the most relevant preprocessing technique for this course. If a time series has lag-k seasonality, we subtract the observation k periods earlier to find the corrected series, in other words $Y'_t = Y_t - Y_{t-k}$. If a time series is sampled daily and has yearly seasonality, we use $k=365$.

Other methods of correcting for seasonality involve different types of preprocessing, feature engineering, machine learning, or using seasonal models like SARIMA.

Characteristics

The following characteristics of financial returns are discussed numerous times in this course:

  • Non-stationarity of prices, but stationarity of returns.
  • No autocorrelations of returns
  • There is autocorrelation of squared returns.
  • Volatility clustering
  • Fat-tailed distributions (non-normal, with large kurtosis)
  • Leverage effects

We would also like to define the notion of unit roots. If a non-stationary process $y_t$ can be differenced $d$ times to become a stationary process then $y_t \sim I(d)$ (the process has d unit roots). Most financial time series have 1 unit root.

Tests for time series models

There are many things to test for in financial time series.

  • We test for normality with the Jarque-Bera test, as discussed above. The JB test statistic is distributed as $\chi^2_2$.
  • We test for autocorrelation with the Durbin-Watson, Box-Pierce or Ljung-Box tests.
  • To detect heteroscedasticity we use the ARCH test.
  • We test for stationarity and unit roots with the Dickey-Fuller test or the Phillips-Perron test.
  • The Newey-West estimator is used to obtain heteroscedasticity- and autocorrelation-consistent (HAC) standard errors.

Univariate time series models

Univariate time series models make predictions about a time series based only on the history of the series. In other words, the models use $E[y_t | F_{t-1}]$, where $F_{t-1}$ is the history of the series up to time $t-1$.

Moving average models (MA)

A q-th order moving average model is defined by,

$$ y_t = \mu + u_t + \theta_1 \ u_{t-1} + \theta_2 \ u_{t-2} +\dots + \theta_q \ u_{t-q} $$

where $u_i$ are i.i.d random variables with $E[u_t]=0, \ Var(u_t)=\sigma^2$.

Now we study some properties of MA models. They all have expected value $E[y_t] = E[\mu]+ E[u_t + \theta_1 \ u_{t-1} + \theta_2 \ u_{t-2} +\dots + \theta_q \ u_{t-q}] = \mu$. They have variance, $Var(y_t)=\left(1+ \theta_1^2 + \dots + \theta_q^2\right)\ \sigma^2$.

The autocovariances can be calculated as,

$$\gamma_s = (\theta_s + \theta_{s+1}\theta_1 + \theta_{s+2}\theta_2 + \dots + \theta_q \theta_{q-s})\sigma^2$$

for $1 \leq s \leq q$; for $s > q$ the autocovariance (and hence the autocorrelation) is 0.
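As a quick sanity check of these moment formulas, one can simulate an MA(2) process with hypothetical parameters and compare the sample moments with the theoretical ones:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, theta, sigma = 0.1, np.array([0.6, 0.3]), 1.0   # hypothetical MA(2) parameters
T = 100_000

u = rng.normal(scale=sigma, size=T + len(theta))
y = mu + u[len(theta):]                              # contemporaneous shock u_t
for i, th in enumerate(theta, start=1):
    y = y + th * u[len(theta) - i:-i]                # add theta_i * u_{t-i}

# Theoretical moments from the formulas above
var_theory = (1 + (theta ** 2).sum()) * sigma ** 2
cov1_theory = (theta[0] + theta[1] * theta[0]) * sigma ** 2   # gamma_1 for an MA(2)
cov1_sample = np.cov(y[1:], y[:-1])[0, 1]

print(y.mean(), y.var(), var_theory)     # sample mean/variance vs mu and theoretical variance
print(cov1_sample, cov1_theory)          # sample vs theoretical lag-1 autocovariance
```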

Autoregressive models (AR)

An autoregressive model of order p is defined as,

$$ y_t = \mu + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + u_t $$

Note on notation: Lag operator

Sometimes you might see the lag operator used. It is defined by $L^i y_t = y_{t-i}$. We may also define a lag polynomial $\phi$ where,

$\phi(L) = 1-(\phi_1L^1+\dots+\phi_pL^p)$

Then $\phi(L)y_t = \mu + u_t$ encapsulates the AR model as defined above.

Stationarity of AR model

The stationarity condition of an AR model is that all the roots of the polynomial below lie outside the unit circle.

$$ 1 - \phi_1 z - \phi_2 z^2 - \dots - \phi_p z^p $$

From algebra we know that this polynomial will have at most p unique roots.

An AR model can also be expressed as an MA($\infty$) model, but the AR model has to be stationary for such a representation to exist.
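A small numpy sketch of how the stationarity condition can be checked numerically (the helper name is made up):

```python
import numpy as np

def is_stationary(phi):
    """Check the AR stationarity condition: all roots of
    1 - phi_1 z - ... - phi_p z^p lie outside the unit circle."""
    phi = np.asarray(phi, dtype=float)
    # Polynomial coefficients in increasing powers of z, then reversed
    # because np.roots expects the highest power first.
    coeffs = np.concatenate(([1.0], -phi))[::-1]
    roots = np.roots(coeffs)
    return np.all(np.abs(roots) > 1.0), roots

print(is_stationary([0.5]))          # stationary: root at z = 2
print(is_stationary([1.0]))          # unit root (random walk): root at z = 1
print(is_stationary([0.6, 0.5]))     # not stationary: one root inside the unit circle
```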

Moments of AR model

The expected value is,

$$ E[y_t] = \frac{\mu}{1-\phi_1 - \phi_2 - \dots - \phi_p} $$

The autocorrelations can be found from solving the Yule-Walker equations:

$$ \begin{bmatrix} \tau_1 \\ \tau_2 \\ \vdots \\ \tau_p \end{bmatrix} = \begin{bmatrix} 1 & \tau_1 & \dots & \tau_{p-1} \\ \tau_1 & 1 & \dots & \tau_{p-2} \\ \vdots & \vdots & \ddots & \vdots \\ \tau_{p-1} & \tau_{p-2} & \dots & 1 \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{bmatrix} $$

From the definitions of variance, covariance and autocorrelation we can calculate these quantities for an AR model analytically by hand with the moments.

ARMA

An ARMA process is a combination between an AR- and an MA process. Mathematically an ARMA(p,q) process is,

$$ y_t = \mu + u_t + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \theta_1 u_{t-1} + \dots + \theta_q u_{t-q} $$

or if they are expressed with lag operators,

$$\phi(L)y_t = \mu + \theta(L) u_t$$

The stationarity condition of the ARMA model requires that all roots of the AR characteristic polynomial have absolute value larger than one, as explained above.

Note also that for lags larger than q, the MA order, the model will behave like an AR model when discussing autocorrelations.

Building ARMA models

The Box-Jenkins approach is a 3-step process for building an ARMA model.

  1. Identification: Determine the order of an ARMA model.
  2. Estimation: Typically done with least squares or maximum likelihood.
  3. Diagnostics: We may use two methods: deliberate overfitting and residual diagnostics.

Identification is typically done by finding the model that minimizes one of the information criteria; see the section on information criteria below.

ARMA variants

There are many different variants of ARMA models. One of them is called ARIMA, where the added I stands for integrated. An integrated autoregressive process has at least one unit root, i.e. a root of the characteristic polynomial on the unit circle, so it does not satisfy the stationarity condition. The ARIMA model therefore differences the data d times to make it stationary, and an ARIMA(p,d,q) model is equivalent to an ARMA(p,q) model where the original data has been differenced d times.

Forecasting

TODO: Write about forecasting procedures

We may evaluate forecasts with the mean squared error (MSE). It is useful because it is differentiable, but it is not easily interpretable.

Root mean squared error (RMSE) and mean absolute error (MAE) try to solve this, as they are easier to interpret. MAE is less sensitive to outliers than RMSE, but MAE is not differentiable while RMSE is. They are defined as,

$$ RMSE = \sqrt{\frac{1}{n}\sum(y_i - \hat y_i)^2} $$

$$ MAE = \frac{1}{n} \sum |y_i - \hat y_i| $$

We may also use Theil's U-statistic, defined by

$$ U = \sqrt{\frac{\sum(y_i - \hat y_i)^2}{\sum y_i^2}} $$

This metric measures error relatively.
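A minimal sketch computing these three metrics for a vector of forecasts (the helper name and numbers are hypothetical):

```python
import numpy as np

def forecast_metrics(y, y_hat):
    """RMSE, MAE and Theil's U for actual values y and forecasts y_hat."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    err = y - y_hat
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    theil_u = np.sqrt(np.sum(err ** 2) / np.sum(y ** 2))   # relative error measure
    return rmse, mae, theil_u

print(forecast_metrics([1.0, 2.0, -1.5], [0.8, 2.5, -1.0]))
```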

We also have multiple tests for comparing forecasts. The Diebold-Mariano Test is defined by,

$$ DM = \frac{\overline d}{\sqrt{Var(d)/n}} $$

where d is the loss differential between two forecasts.

We also have the Clark-West test:

$$ CW = \frac{\overline d - \lambda}{\sqrt{Var(d)/n}} $$

where $\lambda$ is an adjustment term for nested models; unlike the DM test, the CW test is designed for comparing nested models.

Volatility

The ACF of returns is typically statistically insignificant, but the ACF of squared returns (volatility) is significant for some lags. This is indicative of volatility clustering.

When discussing volatility we differentiate between empirical and implied volatility. Empirical volatility is backward looking and measures the volatility of previous returns. Implied volatility is forward looking and is a measure of expected volatility; it can be inferred from the Black-Scholes model.

We also differentiate between two types of volatility modeling: deterministic and stochastic. We will discuss three families of deterministic volatility models: the GARCH family, the HAR family and the MIDAS family. We will only discuss one type of stochastic volatility model, Heston.

ARCH and GARCH models

ARCH

ARCH is short for AutoRegressive Conditional Heteroscedasticity. The ARCH(q) model is,

$$ \epsilon_t = \sqrt{\sigma_t^2} z_t, \ \ \ z_t\sim D(0,1) $$

where D is some distribution and

$$ \sigma_t^2 = \omega + \alpha_1 \epsilon_{t-1}^2 + \alpha_2 \epsilon_{t-2}^2 + \dots + \alpha_q \epsilon_{t-q}^2, $$

where $\alpha_i \geq 0$ for all i and $\omega>0$.

We can test for ARCH effects with Engle's ARCH test.

The null hypothesis is $H_0:\ \alpha_1 = 0, \ \dots, \ \alpha_q = 0$, and we can find the $R^2$ of this hypothesis and the test statistic will be

$$ T\cdot R^2 \sim \chi_q^2 $$

where T is the number of observations.

We typically estimate ARCH models numerically with maximum likelihood. When a normal likelihood is used even though the true distribution of $z_t$ may not be normal, this is called quasi-maximum likelihood (QML).

GARCH

ARCH does not always result in a good fit because it tends to need many lags, and hence many parameters. Therefore the generalized ARCH (GARCH) model was developed. GARCH(p,q) is defined by,

$$ \epsilon_t = \sqrt{\sigma_t^2} z_t, \ \ \ z_t \sim D(0,1) $$

and,

$$ \sigma_t^2 = \omega + \alpha_1\epsilon_{t-1}^2 + \dots + \alpha_q \epsilon_{t-q}^2 + \beta_1 \sigma_{t-1}^2 + \dots + \beta_p \sigma_{t-p}^2 $$

GARCH(p,q) is (covariance) stationary if

$$ \sum_{i=1}^{q}\alpha_i + \sum_{j=1}^{p}\beta_j < 1 $$

The uncondititonal variance of the GARCH(p,q) model is,

$$ E[\epsilon_t^2] = \sigma^2 = \frac{\omega}{1 - \sum_{i=1}^{q}\alpha_i - \sum_{j=1}^{p}\beta_j} $$

From this equation we can see why the condition above is necessary for the GARCH model.
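To illustrate the recursion and the unconditional-variance formula, a GARCH(1,1) path can be simulated with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(5)
omega, alpha, beta = 0.05, 0.08, 0.90     # hypothetical GARCH(1,1) parameters, alpha + beta < 1
T = 5000

eps = np.zeros(T)
sigma2 = np.zeros(T)
sigma2[0] = omega / (1 - alpha - beta)    # start at the unconditional variance
eps[0] = np.sqrt(sigma2[0]) * rng.normal()
for t in range(1, T):
    sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    eps[t] = np.sqrt(sigma2[t]) * rng.normal()

print(eps.var(), omega / (1 - alpha - beta))   # sample variance vs unconditional variance
```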

GJR-GARCH

The greatest limitation of GARCH is that it cannot capture leverage effects (the asymmetric effect of the sign of returns on volatility): volatility does not depend on the sign of past returns in this model. Therefore we can estimate a GJR-GARCH model like,

$$ \sigma_t^2 = \omega + \sum_{i=1}^q \alpha_i \epsilon_{t-i}^2 + \sum_{i=1}^{q} \gamma_i \epsilon_{t-i}^2 I(\epsilon_{t-i} < 0) + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2 $$

where $I$ is an indicator function.

This model may also be written as,

$$ \sigma_t^2 = \omega + \sum_{i=1}^q \left(\alpha_i + \gamma_i I(\epsilon_{t-i} < 0)\right)\epsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2 $$

EGARCH

There also exists exponential GARCH (EGARCH) to capture asymmetric leverage effects. It is formulated as,

$$ \log(\sigma_t^2) = \omega + \sum_{i=1}^q \left[\alpha_i \left| \frac{u_{t-i}}{\sigma_{t-i}} \right| + \gamma_i \frac{u_{t-i}}{\sigma_{t-i}}\right] + \sum_{j=1}^p \beta_j \log(\sigma_{t-j}^2) $$

The $\gamma_i$ term on the signed standardized residual creates the desired asymmetry, while the log formulation guarantees a positive variance without parameter restrictions.

HAR models

To define a HAR model we first need to define realized volatility.

For some time t, there is a consistent estimator for the true latent variance,

$$ RV_t^2 = \sum_{i=1}^{M} r_{t,i}^2 $$

where, $M = 1/\Delta$ and the $\Delta$-period intraday return is,

$$ r_{t,i} = \log(P_{t-1+i\times\Delta}) - \log(P_{t-1 + (i-1)\times\Delta}) \ \ \ (\text{i.e. log-returns}) $$

The HAR model is then a regression model,

$$ RV_t = \beta_0 + \beta_D \ RV_{t-1} + \beta_W \ RV_{t-1,t-5} + \beta_M \ RV_{t-1,t-22} $$

where $RV_{t-1,t-a}$ is the average intraperiod realized volatility between the two times $t-1$ and $t-a$.
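A rough numpy sketch of how the realized-variance series and the HAR regressors might be constructed, assuming access to intraday prices and a daily RV series (the helper names are made up):

```python
import numpy as np

def realized_variance(intraday_prices):
    """One day's realized variance: the sum of squared intraday log-returns
    (its square root would give realized volatility)."""
    r = np.diff(np.log(intraday_prices))
    return np.sum(r ** 2)

def har_design(rv):
    """Build daily, weekly (5-day) and monthly (22-day) HAR regressors from a daily RV series."""
    rv = np.asarray(rv, float)
    t = np.arange(22, len(rv))                         # first usable observation
    rv_d = rv[t - 1]                                   # RV_{t-1}
    rv_w = np.array([rv[i - 5:i].mean() for i in t])   # average RV over t-5 .. t-1
    rv_m = np.array([rv[i - 22:i].mean() for i in t])  # average RV over t-22 .. t-1
    y = rv[t]                                          # RV_t, the regressand
    X = np.column_stack([np.ones(len(t)), rv_d, rv_w, rv_m])
    return X, y

# An OLS routine (such as the one sketched earlier) can then be used to
# estimate (beta_0, beta_D, beta_W, beta_M) from X and y.
```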

MIDAS models

MIDAS is short for Mixed Data Sampling. It is a regression method that may use data sampled at different frequencies. The model is defined as,

$$ y_t = X_t'\beta + f\left( {X_{t/S}^H}, \theta, \lambda \right) + \epsilon_t $$


Panel data models

SUR model

Panel data is a matrix of data that measures values related to different quantities (individuals) and times. It is the extension of cross-sectional data across time. It is useful to study these datasets because we can discover relationships between quantities with these techniques.

A simple way to model this data is with a "pooled regression". But in cases where there is heterogeneity between individuals and some parameters are not constant across time, the resulting model might fall victim to a special case of Simpson's paradox, as illustrated in the figure below (in this case the intercept is not constant).

(Figure: Simpson's paradox)

A solution to this might be to use SUR (seemingly unrelated regressions). In SUR each individual is treated separately, so in reality there are $N$ equations in the model (where $N$ is the number of individuals and $T$ is the number of temporal observations). Of course, this method is only accurate when $T > N$.

The model in matrix format is,

$$ Y_{i, t}= X_{i,t} \beta_i + u_{i, t} $$

FE and RE models

We deal with two types of model that model panel data with heterogeneity: Fixed Effects (FE) and Random Effects (RE).

Fixed effects models assume that an individual's heterogeneity is correlated with the explanatory variables. In FE models the heterogeneity is captured by allowing for different intercepts and minimizing bias by differencing it out for each individual. But this does not allow for time-invariant regressors: since FE models capture "within-individual" variation (across time) rather than variation across individuals, any variable that does not vary across time will be collinear with the individual intercept and should not be included in the model. It therefore becomes impossible to interpret the effects of time-invariant regressors on the dependent variable. An example of this might be the effect of gender on income.

A random effects model assumes that individual-specific heterogeneity is not correlated with the explanatory variables - the opposite of the fixed effects model. Since there are no correlations with the regressors, the heterogeneity is placed in the random error term. But if there in reality is a correlation with the explanatory variables, the results will be biased.

The FE model can be estimated either by least squares dummy variable (LSDV) or by the within estimator. The LSDV approach uses a dummy variable for each individual-specific effect and estimates with OLS. The within estimator instead demeans the data for each individual, which removes the unobserved individual-specific heterogeneity. A minimal sketch of the within estimator is given below.
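The sketch assumes stacked arrays of observations and an id vector; all names are illustrative:

```python
import numpy as np

def within_estimator(y, X, ids):
    """Fixed-effects 'within' estimator: demean y and X per individual, then run OLS.

    y: (NT,) stacked observations, X: (NT, k) time-varying regressors (no intercept),
    ids: (NT,) individual identifiers.
    """
    y, X, ids = np.asarray(y, float), np.asarray(X, float), np.asarray(ids)
    y_dm, X_dm = y.copy(), X.copy()
    for i in np.unique(ids):
        m = ids == i
        y_dm[m] -= y[m].mean()
        X_dm[m] -= X[m].mean(axis=0)          # removes the individual-specific intercept
    beta = np.linalg.solve(X_dm.T @ X_dm, X_dm.T @ y_dm)
    return beta
```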

There is also a third approach, the first-difference estimator, which models changes over time instead of levels.

It is generally thought that FE is only plausible when the sample effectively represents the whole population. The RE model often produces more efficient estimates than FE because RE has fewer parameters to estimate, but RE also assumes that the error term is uncorrelated with the explanatory variables.

We can use a Hausman specification test to determine the model.


Risk measurement and Portfolio optimization

Risk measures

Say we have some portfolio value at time 0 and at time 1. Then, with a given set of investments, the portfolio value at time 1 will be a random variable $V_1$ with some probability distribution.

A risk measure $\rho$ reduces this distribution to a single number (in the same units as the currency) that measures riskiness by defining the amount of buffer capital needed to make the position acceptable. A risk measure may satisfy the following properties,

Translation invariance: $\rho(X + cR_0) = \rho(X) - c$

Where $R_0$ is the risk-free rate. We interpret this by saying that if we add $c$ capital at the risk free rate $R_0$ to some portfolio X, we would decrease the buffer capital by that same amount $c$.

Monotonicity: If $X_2 \leq X_1$ then $\rho(X_1) \leq \rho(X_2)$

with this property saying that a position that is always worth at least as much as another requires no more buffer capital, i.e. it is no riskier.

Normalization: $\rho(0) = 0$

Saying that taking no position is acceptable as well.

Positive homogeneity: $\rho(\lambda X) = \lambda \rho(X)$ for all $\lambda \geq 0$.

Scaling a position scales the required buffer capital proportionally.

Subadditivity: $\rho(X_1+X_2) \leq \rho(X_1)+\rho(X_2)$

This rewards diversification: holding two positions together never requires more buffer capital than holding them separately.

When a risk measure satisfies all these properties above, it is coherent.

Value-at-Risk (VaR)

The value at risk at level $p\in(0,1)$ of a portfolio X at time 1 is defined as,

$$ VaR_p(X) = \min\{m : P(m R_0 + X < 0) \leq p\} $$

where $R_0$ is the risk free rate. So the value at risk is the capital needed to be invested into risk free assets so that the probability of losing money (discounted) is less than $p$. In other words, it uses quantiles to determine risk.

Note that VaR is not subadditive and is therefore not coherent. Another disadvantage of VaR is that the user has to choose a time horizon and a confidence level. It also does not describe losses larger than the VaR, as it says nothing about the tail of the distribution beyond that point.

Expected shortfall tries to remedy this.

Expected Shortfall (ES)

Expected shortfall tries to remedy the fact that the VaR metric does not describe losses larger than the VaR. Expected shortfall is the expected loss given that the loss exceeds the VaR, and can be defined as

$$ ES_p(X) = \frac 1 p \int_0^p VaR_u (X) du $$

Note that ES is a coherent risk measure; in particular it is subadditive and therefore rewards diversification.
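As a simplified illustration (historical simulation with the risk-free asset ignored, i.e. $R_0$ set to 1, and made-up data), empirical VaR and ES can be computed from a sample of losses:

```python
import numpy as np

def var_es_historical(losses, p=0.05):
    """Empirical VaR and ES at level p from a sample of losses (positive = loss).

    Simplified version: ES averages the losses at or beyond the VaR quantile.
    """
    losses = np.sort(np.asarray(losses, float))
    var_p = np.quantile(losses, 1 - p)            # (1 - p)-quantile of the loss distribution
    es_p = losses[losses >= var_p].mean()         # average loss in the worst p-tail
    return var_p, es_p

rng = np.random.default_rng(6)
losses = -rng.normal(loc=0.0005, scale=0.01, size=2500)   # hypothetical daily P&L turned into losses
print(var_es_historical(losses, p=0.05))
```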


Tests and diagnostics registry

Heteroscedasticity tests

Goldfeld-Quandt (GQ) test

The GQ test is a test for heteroscedasticity. It is useful for testing the homoscedasticity assumption in regression analysis.

It splits the sample into two subsamples of lengths $T_1$ and $T_2$. Then we run a regression on each subsample and calculate the residual variances. The null hypothesis is that the variances are equal.

We have the test statistic

$$ GQ = \frac{S_1^2}{S_2^2} \sim F(T_1 - k, T_2 - k) $$
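A rough sketch of the GQ statistic and a one-sided p-value, assuming the two subsamples (each X including an intercept column) have already been split by the user; the convention of which variance goes in the numerator is not enforced here:

```python
import numpy as np
from scipy import stats

def goldfeld_quandt(X1, y1, X2, y2):
    """GQ statistic from two subsample regressions."""
    def resid_var(X, y):
        beta = np.linalg.solve(X.T @ X, X.T @ y)
        resid = y - X @ beta
        return resid @ resid / (len(y) - X.shape[1])   # s^2 with T_i - k degrees of freedom
    s1, s2 = resid_var(X1, y1), resid_var(X2, y2)
    gq = s1 / s2
    df1, df2 = len(y1) - X1.shape[1], len(y2) - X2.shape[1]
    p_value = stats.f.sf(gq, df1, df2)                 # one-sided: H1 is sigma_1^2 > sigma_2^2
    return gq, p_value
```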

White's test

White's test models the squared residuals with an auxiliary regression on the original regressors, their squares and their cross-products. We can then run a significance test (an F-test or a $T \cdot R^2$ test) on the auxiliary regression to find whether the squared residuals depend on the regressors.

ARCH test

The ARCH test detects heteroscedasticity in a regression by estimating an ARCH(q)-type auxiliary regression on the squared residuals of the original regression. Then we run an LM test with the test statistic

$$ LM = T \ R^2 \sim \chi_q^2 $$

and the null hypothesis stating that $\gamma_1 = \gamma_2 = \dots = \gamma_q = 0$, i.e. no ARCH effects.

Autocorrelation

The Durbin-Watson test tests for first-order autocorrelation only, and the statistic is distributed around 2 under the null hypothesis of no autocorrelation.

There are several other tests mentioned in this course, like Breusch-Godfrey, Box-Pierce and Ljung-Box, but none are explained here yet.

Information criteria (IC)

Information criteria are typically used for model identification for ARMA models (finding the appropriate (p,q)-order). The main mechanisms of these criteria reward RSS minimization, but add a penalty for the number of parameters. This way we can increase the goodness of fit, but still avoid overfitting the data.

There are three popular IC: AIC, SBIC, and HQIC.

They are defined as,

$$ AIC = ln(\hat\sigma^2) + \frac{2k}{T} $$

$$ SBIC = ln(\hat \sigma^2) + \frac{k}{T} \ln T $$

$$ HQIC = \ln(\hat \sigma^2) + \frac{2k}{T} \ln\ln T $$

where $k = p+q+1$ and $T$ is the sample size. All of these metrics are meant to be minimized.

They differ in what attributes they reward/punish the most. SBIC has a stiffer parameter-penalty than the AIC, as may be seen in the formulas. SBIC is also consistent, and AIC is not.

TODO: Write about HQIC.
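A small helper computing the three criteria directly from the formulas above (the values are hypothetical; lower is better):

```python
import numpy as np

def information_criteria(sigma2_hat, p, q, T):
    """AIC, SBIC and HQIC for an ARMA(p, q) fit, with k = p + q + 1."""
    k = p + q + 1
    aic = np.log(sigma2_hat) + 2 * k / T
    sbic = np.log(sigma2_hat) + k / T * np.log(T)
    hqic = np.log(sigma2_hat) + 2 * k / T * np.log(np.log(T))
    return aic, sbic, hqic

# Hypothetical comparison of two candidate orders on the same sample:
print(information_criteria(sigma2_hat=1.10, p=1, q=0, T=500))
print(information_criteria(sigma2_hat=1.08, p=2, q=1, T=500))
```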

Hausman specification test

The Hausman test compares the estimators under FE and RE models. The null hypothesis is:

$$ H_0: \ \hat \beta_{FE} = \hat \beta_{RE} $$

If we reject $H_0$, then RE is inconsistent and FE is preferred; if we do not reject $H_0$, then RE is consistent and efficient. The test statistic is,

$$ H = (\hat\beta_{FE} - \hat\beta_{RE})^T\left[Var(\hat\beta_{FE}) - Var(\hat\beta_{RE})\right]^{-1}(\hat\beta_{FE} - \hat\beta_{RE}) \sim \chi^2_k $$

where k is the number of parameters.

The logic of the test is based on the fact that the RE model is efficient, but it is inconsistent if the random effects $\epsilon_i$ are correlated with the regressors, while the FE estimates are consistent without this assumption. Therefore, if the RE parameters equal the FE parameters, RE is the appropriate model, but if they differ, FE is the appropriate model.

Types of data

There are three types of data:

  • Cross-sectional data: Data set concentrated at a single time across individuals, i.
  • Time-series data: Concentrated at a single individual across time, t.
  • Panel data: A 2-dimensional set of data points across individuals, i, and time, t.

We take the word across to mean that there is a set of distinct individuals (times) indexed by i (t). An individual may be any meaningful collection of data that belong together: a tuple of a person's age, education and income, a company's pair of (D/E ratio, EBIT), or a country's tuple of (unemployment rate, GDP growth).
