CS-434: Unsupervised and Reinforcement Learning in Neural Networks
Introduction
This compendium is made for the course CS-434 Unsupervised and Reinforcement Learning in Neural Networks at École Polytechnique Fédérale de Lausanne (EPFL) and is a summary of the lectures and lecture notes. It is not the complete curriculum, but rather a list of reading material.
Prerequisites
Eigendecomposition
Compute eigenvalues and eigenvectors
Matrix operations
Positive semi-definite: a symmetric matrix $M$ is positive semi-definite if $\vec{x}^T M \vec{x} \geq 0$ for all $\vec{x}$; equivalently, all its eigenvalues are non-negative.
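As a concrete check, here is a minimal numpy sketch (the matrix is an arbitrary example) that computes an eigendecomposition and tests positive semi-definiteness via the eigenvalues:

```python
import numpy as np

# A symmetric example matrix (arbitrary choice)
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Eigendecomposition: M v = lambda v for each eigenpair
eigenvalues, eigenvectors = np.linalg.eigh(M)  # eigh: for symmetric matrices
print(eigenvalues)   # [1. 3.]
print(eigenvectors)  # columns are the (orthonormal) eigenvectors

# M is positive semi-definite iff all eigenvalues are non-negative
print(np.all(eigenvalues >= 0))  # True
```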
Unsupervised Learning
- Synaptic Plasticity: In neuroscience, synaptic plasticity is the ability of synapses to strengthen or weaken over time, in response to increases or decreases in their activity
Hebbian learning
When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased. (Hebb, 1949)
Rate-based Hebbian Learning
"active" = high rate = many spikes per second. The simplest rate-based rule is $\Delta w_{ij} = \eta\, \nu_i^{\text{post}} \nu_j^{\text{pre}}$: the synapse is strengthened when pre- and postsynaptic rates are high together.
Oja rule (1982)
Detects the first principal component. The update $\Delta \vec{w} = \eta\, y (\vec{x} - y \vec{w})$ with output $y = \vec{w}^T \vec{x}$ adds a decay term to plain Hebbian learning, which keeps $\|\vec{w}\|$ bounded.
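As a sanity check, a minimal numpy sketch of the Oja update on synthetic anisotropic data (data scale, learning rate and sample count are arbitrary choices); the weight vector converges to the first principal component:

```python
import numpy as np

rng = np.random.default_rng(0)
# Zero-mean 2-D data with a dominant first direction
X = rng.normal(size=(5000, 2)) * np.array([3.0, 0.5])

w = rng.normal(size=2)  # random initial weights
eta = 0.01              # learning rate (arbitrary)

for x in X:
    y = w @ x                   # postsynaptic output y = w . x
    w += eta * y * (x - y * w)  # Oja rule: Hebbian term plus decay

print(w)  # approximately (+/-1, 0): unit-norm first principal component
```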
Spike-timing-dependent learning window
A function $W(\Delta t)$ that maps the time difference between pre- and postsynaptic spikes to the resulting weight change: typically pre-before-post leads to potentiation, post-before-pre to depression.
Component Analysis
Principal component analysis
- Subtract mean
- Calculate covariance matrix
- Find the eigenvalues and corresponding eigenvectors
- The eigenvector with the largest eigenvalue points in the direction of the first principal component
$\text{FinalData} = \text{RowFeatureVector} \times \text{RowDataAdjust}$
$\text{RowOriginalData} = (\text{RowFeatureVector}^T \times \text{FinalData}) + \text{OriginalMean}$
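A numpy sketch following the steps above, using the same FinalData / RowFeatureVector naming (the data matrix is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.5],
                                          [0.0, 1.0]])

mean = X.mean(axis=0)
Xc = X - mean                                  # subtract mean
C = np.cov(Xc, rowvar=False)                   # covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)  # ascending eigenvalues

RowFeatureVector = eigenvectors[:, ::-1].T     # rows = eigenvectors, largest eigenvalue first
RowDataAdjust = Xc.T
FinalData = RowFeatureVector @ RowDataAdjust   # project onto the principal axes
RowOriginalData = RowFeatureVector.T @ FinalData + mean[:, None]
print(np.allclose(RowOriginalData.T, X))       # True: keeping all components is lossless
```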
Independent component analysis
- Subtract mean
- Whitening (unit variance in all dimensions), achieved by the next two steps:
- Change to the coordinates of maximum variance: PCA
- Normalize: divide each component $n$ by $\sqrt{\lambda_n}$
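A short whitening sketch following these steps (the mixing matrix is an arbitrary example); after the transform, the covariance matrix is the identity:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 1.0],
                                           [0.0, 0.5]])

Xc = X - X.mean(axis=0)                            # subtract mean
lam, E = np.linalg.eigh(np.cov(Xc, rowvar=False))  # PCA of the covariance
Z = (Xc @ E) / np.sqrt(lam)                        # divide component n by sqrt(lambda_n)
print(np.cov(Z, rowvar=False).round(3))            # identity matrix
```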
Independence
A whitened Gaussian distribution is rotationally symmetric, so it has no preferred axis; ICA therefore has to exploit non-Gaussianity to identify the independent directions.
Kurtosis
The classical measure of non-Gaussianity is kurtosis, or the fourth-order cumulant. The kurtosis of a zero-mean random variable $y$ is
$\text{kurt}(y) = E[y^4] - 3 \left( E[y^2] \right)^2$
Since the data are whitened, $E[y^2] = 1$ and this simplifies to $\text{kurt}(y) = E[y^4] - 3$, and we know that the Gaussian has fourth moment $E[y^4] = 3 (E[y^2])^2$, so its kurtosis is zero.
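A quick sample-based check of this (the distribution choices are illustrative; the uniform distribution is sub-Gaussian, so its kurtosis comes out negative):

```python
import numpy as np

def kurtosis(y):
    # kurt(y) = E[y^4] - 3 (E[y^2])^2; zero for a Gaussian
    return np.mean(y**4) - 3 * np.mean(y**2) ** 2

rng = np.random.default_rng(0)
print(kurtosis(rng.normal(size=100_000)))          # approximately 0
print(kurtosis(rng.uniform(-1, 1, size=100_000)))  # approximately -0.13 (sub-Gaussian)
```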
Temporal ICA
Find the unmixing matrix $W$ such that the recovered signals $\vec{s}(t) = W \vec{x}(t)$ are statistically independent, exploiting the temporal structure of the sources.
Clustering
K-means
- Determine the winner $k$: $\| \vec{w}_k - \vec{x}^{\mu} \| \leq \| \vec{w}_i - \vec{x}^{\mu} \|$ for all $i$
- Update winner
$\Delta \vec{w}_k = \eta ( \vec{x}^{\mu} - \vec{w}_k )$
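A minimal online k-means (competitive learning) sketch implementing exactly these two steps (data and parameters are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D clusters (illustrative data)
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
               rng.normal(3.0, 0.3, size=(100, 2))])
rng.shuffle(X)

K, eta = 2, 0.1
W = X[rng.choice(len(X), K, replace=False)].copy()  # init prototypes from the data

for x in X:
    k = np.argmin(np.linalg.norm(W - x, axis=1))  # determine the winner k
    W[k] += eta * (x - W[k])                      # update only the winner
print(W)  # prototypes near (0, 0) and (3, 3)
```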
Kohonen maps
An extension of k-means in which the prototypes are arranged on a (typically low-dimensional) grid, so that neighboring units learn to respond to similar inputs.
Learning rule: $\Delta \vec{w}_j = \eta\, \Lambda(j, k) (\vec{x}^{\mu} - \vec{w}_j)$, where $k$ is the winner and $\Lambda$ is a neighborhood function (e.g. a Gaussian in grid distance) centered on $k$.
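A sketch of this update for a 1-D map (map size, learning rate and neighborhood width are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 2))  # inputs from the unit square

N, eta, sigma = 10, 0.1, 1.0           # 1-D map with N units
W = rng.uniform(0, 1, size=(N, 2))
grid = np.arange(N)

for x in X:
    k = np.argmin(np.linalg.norm(W - x, axis=1))       # winner
    Lam = np.exp(-(grid - k) ** 2 / (2 * sigma ** 2))  # Gaussian neighborhood
    W += eta * Lam[:, None] * (x - W)                  # update winner and neighbors
print(W)  # prototypes ordered along the map, covering the input space
```

Unlike k-means, the neighborhood function pulls units adjacent to the winner along as well, which is what makes the map topology-preserving.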
Reinforcement Learning
Reinforcement learning = Hebbian learning + reward
Bellman equation (optimality form): $Q(s,a) = \sum_{s'} P^{a}_{s \to s'} \left[ R^{a}_{s \to s'} + \gamma \max_{a'} Q(s', a') \right]$
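As a worked illustration, a tiny Q-iteration sketch on a hypothetical two-state, two-action MDP (the transition and reward tables are made up), repeatedly applying the Bellman equation until the Q-values converge:

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(200):
    # Bellman update: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')
    Q = R + gamma * P @ Q.max(axis=1)
print(Q)  # fixed point of the Bellman equation
```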
Exploration vs. exploitation dilemma: the agent must trade off trying new actions to learn their value (exploration) against choosing the action currently believed best (exploitation).
- On-policy: The same policy is used to select the next action and to update the Q-values.
- Off-policy: The action is selected using one policy A (e.g. softmax), while the Q-values are updated using a different policy B (e.g. $\epsilon$-greedy).
SARSA (on-policy)
- TD: Temporal Difference
- Eligibility traces: a mixture between TD and Monte Carlo methods.
Q-learning (off-policy)
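Side by side, the two tabular updates (a minimal sketch; Q is assumed to be a numpy array indexed by state and action, and eta, gamma are arbitrary defaults):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, eta=0.1, gamma=0.9):
    # On-policy: bootstrap with the action a_next actually chosen by the policy
    Q[s, a] += eta * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, eta=0.1, gamma=0.9):
    # Off-policy: bootstrap with the greedy action, whatever the policy does next
    Q[s, a] += eta * (r + gamma * Q[s_next].max() - Q[s, a])
```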
Policy Gradient
Forget Q-values: optimize the expected reward of an action directly, by adjusting the parameters of a policy that maps stimuli/observations to actions.
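A minimal REINFORCE-style sketch for a two-armed bandit with a softmax policy (the bandit, rewards and parameters are all illustrative assumptions); the gradient of the expected reward is estimated with the log-likelihood trick, $\nabla_\theta J \approx R \, \nabla_\theta \log \pi_\theta(a)$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # one preference parameter per action
eta = 0.1
true_reward = np.array([1.0, 2.0])  # hypothetical bandit: action 1 is better

for _ in range(2000):
    pi = np.exp(theta) / np.exp(theta).sum()  # softmax policy
    a = rng.choice(2, p=pi)                   # sample an action
    r = true_reward[a] + rng.normal(0, 0.1)   # observe a noisy reward
    grad_log_pi = -pi                         # d/dtheta_j log pi(a) = 1[j=a] - pi_j
    grad_log_pi[a] += 1.0
    theta += eta * r * grad_log_pi            # ascend the estimated gradient

print(pi)  # probability mass concentrates on the better action
```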
When not to use TD algorithms:
- Continuous state spaces need hard-to-tune function approximation
- TD with function approximation can diverge, even if the environment is fully observable
- Continuous actions are difficult to represent with TD algorithms