CS-434: Unsupervised and Reinforcement Learning in Neural Networks
Introduction
This compendium is made for the course CS-434 Unsupervised and Reinforcement Learning in Neural Networks at École Polytechnique Fédérale de Lausanne (EPFL) and is a summary of the lectures and lecture notes. It is not the complete curriculum, but rather a list of reading material.
Prerequisites
Eigendecomposition
Compute eigenvalues and eigenvectors
Matrix operations
Positive semi-definite: a symmetric matrix $M$ is positive semi-definite if $\vec{x}^T M \vec{x} \geq 0$ for all $\vec{x}$; equivalently, all its eigenvalues are non-negative.
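As a concrete check, here is a minimal numpy sketch (the matrix is an arbitrary example) that computes an eigendecomposition and tests positive semi-definiteness via the eigenvalues:

```python
import numpy as np

# A symmetric example matrix (arbitrary choice)
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Eigendecomposition: M v = lambda v for each eigenpair
eigenvalues, eigenvectors = np.linalg.eigh(M)  # eigh: for symmetric matrices
print(eigenvalues)   # [1. 3.]
print(eigenvectors)  # columns are the (orthonormal) eigenvectors

# M is positive semi-definite iff all eigenvalues are non-negative
print(np.all(eigenvalues >= 0))  # True
```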
Unsupervised Learning
- Synaptic Plasticity: In neuroscience, synaptic plasticity is the ability of synapses to strengthen or weaken over time, in response to increases or decreases in their activity
Hebbian learning
When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased. (Hebb, 1949)
Rate-based Hebbian Learning
"active" = high rate = many spikes per second. The simplest rate-based rule is $\Delta w_{ij} = \eta\, \nu_i^{\text{post}} \nu_j^{\text{pre}}$: the synapse is strengthened when pre- and postsynaptic rates are high together.
Oja rule (1982)
Detects the first principal component. The update $\Delta \vec{w} = \eta\, y (\vec{x} - y \vec{w})$ with output $y = \vec{w}^T \vec{x}$ adds a decay term to plain Hebbian learning, which keeps $\|\vec{w}\|$ bounded.
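As a sanity check, a minimal numpy sketch of the Oja update on synthetic anisotropic data (data scale, learning rate and sample count are arbitrary choices); the weight vector converges to the first principal component:

```python
import numpy as np

rng = np.random.default_rng(0)
# Zero-mean 2-D data with a dominant first direction
X = rng.normal(size=(5000, 2)) * np.array([3.0, 0.5])

w = rng.normal(size=2)  # random initial weights
eta = 0.01              # learning rate (arbitrary)

for x in X:
    y = w @ x                   # postsynaptic output y = w . x
    w += eta * y * (x - y * w)  # Oja rule: Hebbian term plus decay

print(w)  # approximately (+/-1, 0): unit-norm first principal component
```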
Spike-timing-dependent learning window
A function $W(\Delta t)$ that maps the time difference between pre- and postsynaptic spikes to the resulting weight change: typically pre-before-post leads to potentiation, post-before-pre to depression.
Component Analysis
Principal component analysis
- Subtract mean
- Calculate covariance matrix
- Find the eigenvalues and corresponding eigenvectors
- The eigenvector with the largest eigenvalue points in the direction of the first principal component
$\text{FinalData} = \text{RowFeatureVector} \times \text{RowDataAdjust}$
$\text{RowOriginalData} = (\text{RowFeatureVector}^T \times \text{FinalData}) + \text{OriginalMean}$
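A numpy sketch following the steps above, using the same FinalData / RowFeatureVector naming (the data matrix is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.5],
                                          [0.0, 1.0]])

mean = X.mean(axis=0)
Xc = X - mean                                  # subtract mean
C = np.cov(Xc, rowvar=False)                   # covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)  # ascending eigenvalues

RowFeatureVector = eigenvectors[:, ::-1].T     # rows = eigenvectors, largest eigenvalue first
RowDataAdjust = Xc.T
FinalData = RowFeatureVector @ RowDataAdjust   # project onto the principal axes
RowOriginalData = RowFeatureVector.T @ FinalData + mean[:, None]
print(np.allclose(RowOriginalData.T, X))       # True: keeping all components is lossless
```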
Independent component analysis
- Subtract mean
- Whitening (unit variance in all dimensions), achieved by the next two steps:
- Change to the coordinates of maximum variance: PCA
- Normalize: divide each component $n$ by $\sqrt{\lambda_n}$
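A short whitening sketch following these steps (the mixing matrix is an arbitrary example); after the transform, the covariance matrix is the identity:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 1.0],
                                           [0.0, 0.5]])

Xc = X - X.mean(axis=0)                            # subtract mean
lam, E = np.linalg.eigh(np.cov(Xc, rowvar=False))  # PCA of the covariance
Z = (Xc @ E) / np.sqrt(lam)                        # divide component n by sqrt(lambda_n)
print(np.cov(Z, rowvar=False).round(3))            # identity matrix
```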
Independence
A whitened Gaussian distribution is rotationally symmetric, so it has no preferred axis; ICA therefore has to exploit non-Gaussianity to identify the independent directions.
Kurtosis
The classical measure of non-Gaussianity is kurtosis, or the fourth-order cumulant. The kurtosis of a zero-mean random variable $y$ is
$\text{kurt}(y) = E[y^4] - 3 \left( E[y^2] \right)^2$
Since the data are whitened, $E[y^2] = 1$ and this simplifies to $\text{kurt}(y) = E[y^4] - 3$, and we know that the Gaussian has fourth moment $E[y^4] = 3 (E[y^2])^2$, so its kurtosis is zero.
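A quick sample-based check of this (the distribution choices are illustrative; the uniform distribution is sub-Gaussian, so its kurtosis comes out negative):

```python
import numpy as np

def kurtosis(y):
    # kurt(y) = E[y^4] - 3 (E[y^2])^2; zero for a Gaussian
    return np.mean(y**4) - 3 * np.mean(y**2) ** 2

rng = np.random.default_rng(0)
print(kurtosis(rng.normal(size=100_000)))          # approximately 0
print(kurtosis(rng.uniform(-1, 1, size=100_000)))  # approximately -0.13 (sub-Gaussian)
```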
Temporal ICA
Find the unmixing matrix $W$ such that the recovered signals $\vec{s}(t) = W \vec{x}(t)$ are statistically independent, exploiting the temporal structure of the sources.
Clustering
K-means
- Determine the winner $k$: $\| \vec{w}_k - \vec{x}^{\mu} \| \leq \| \vec{w}_i - \vec{x}^{\mu} \|$ for all $i$
- Update winner
$\Delta \vec{w}_k = \eta ( \vec{x}^{\mu} - \vec{w}_k )$
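A minimal online k-means (competitive learning) sketch implementing exactly these two steps (data and parameters are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D clusters (illustrative data)
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
               rng.normal(3.0, 0.3, size=(100, 2))])
rng.shuffle(X)

K, eta = 2, 0.1
W = X[rng.choice(len(X), K, replace=False)].copy()  # init prototypes from the data

for x in X:
    k = np.argmin(np.linalg.norm(W - x, axis=1))  # determine the winner k
    W[k] += eta * (x - W[k])                      # update only the winner
print(W)  # prototypes near (0, 0) and (3, 3)
```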
Kohonen maps
An extension of k-means in which the prototypes are arranged on a (typically low-dimensional) grid, so that neighboring units learn to respond to similar inputs.
Learning rule: $\Delta \vec{w}_j = \eta\, \Lambda(j, k) (\vec{x}^{\mu} - \vec{w}_j)$, where $k$ is the winner and $\Lambda$ is a neighborhood function (e.g. a Gaussian in grid distance) centered on $k$.
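A sketch of this update for a 1-D map (map size, learning rate and neighborhood width are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 2))  # inputs from the unit square

N, eta, sigma = 10, 0.1, 1.0           # 1-D map with N units
W = rng.uniform(0, 1, size=(N, 2))
grid = np.arange(N)

for x in X:
    k = np.argmin(np.linalg.norm(W - x, axis=1))       # winner
    Lam = np.exp(-(grid - k) ** 2 / (2 * sigma ** 2))  # Gaussian neighborhood
    W += eta * Lam[:, None] * (x - W)                  # update winner and neighbors
print(W)  # prototypes ordered along the map, covering the input space
```

Unlike k-means, the neighborhood function pulls units adjacent to the winner along as well, which is what makes the map topology-preserving.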
Reinforcement Learning
Reinforcement learning = Hebbian learning + reward
Bellman equation (optimality form): $Q(s,a) = \sum_{s'} P^{a}_{s \to s'} \left[ R^{a}_{s \to s'} + \gamma \max_{a'} Q(s', a') \right]$
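As a worked illustration, a tiny Q-iteration sketch on a hypothetical two-state, two-action MDP (the transition and reward tables are made up), repeatedly applying the Bellman equation until the Q-values converge:

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(200):
    # Bellman update: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')
    Q = R + gamma * P @ Q.max(axis=1)
print(Q)  # fixed point of the Bellman equation
```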
Exploration vs. exploitation dilemma: the agent must trade off trying new actions to learn their value (exploration) against choosing the action currently believed best (exploitation).
- On-policy: The same policy is used to select the next action and to update the Q-values.
- Off-policy: The action is selected using one policy A (e.g. softmax), while the Q-values are updated using a different policy B (e.g. $\epsilon$-greedy).
SARSA (on-policy)
- TD: Temporal Difference
- Eligibility traces: a mixture between TD and Monte Carlo methods.
Q-learning (off-policy)
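Side by side, the two tabular updates (a minimal sketch; Q is assumed to be a numpy array indexed by state and action, and eta, gamma are arbitrary defaults):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, eta=0.1, gamma=0.9):
    # On-policy: bootstrap with the action a_next actually chosen by the policy
    Q[s, a] += eta * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, eta=0.1, gamma=0.9):
    # Off-policy: bootstrap with the greedy action, whatever the policy does next
    Q[s, a] += eta * (r + gamma * Q[s_next].max() - Q[s, a])
```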
Policy Gradient
Forget Q-values: optimize the expected reward of an action directly, by adjusting the parameters of a policy that maps stimuli/observations to actions.
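A minimal REINFORCE-style sketch for a two-armed bandit with a softmax policy (the bandit, rewards and parameters are all illustrative assumptions); the gradient of the expected reward is estimated with the log-likelihood trick, $\nabla_\theta J \approx R \, \nabla_\theta \log \pi_\theta(a)$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # one preference parameter per action
eta = 0.1
true_reward = np.array([1.0, 2.0])  # hypothetical bandit: action 1 is better

for _ in range(2000):
    pi = np.exp(theta) / np.exp(theta).sum()  # softmax policy
    a = rng.choice(2, p=pi)                   # sample an action
    r = true_reward[a] + rng.normal(0, 0.1)   # observe a noisy reward
    grad_log_pi = -pi                         # d/dtheta_j log pi(a) = 1[j=a] - pi_j
    grad_log_pi[a] += 1.0
    theta += eta * r * grad_log_pi            # ascend the estimated gradient

print(pi)  # probability mass concentrates on the better action
```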
When not to use TD algorithms:
- Continuous state spaces need hard-to-tune function approximation
- TD with function approximation can diverge, even if the environment is fully observable
- Continuous actions are difficult to represent with TD algorithms