Table of Contents
  1. Introduction
    1. Prerequisites
      1. Eigendecomposition
      2. Matrix operations
        1. Positive semi-definite:
  2. Unsupervised Learning
    1. Hebbian learning
      1. Rate-based Hebbian Learning
      2. Oja rule (1989)
      3. Spike time dependent learning window
    2. Component Analysis
      1. Principal component analysis
      2. Independent component analysis
        1. Independence
        2. Kurtosis
        3. Temporal ICA
    3. Clustering
      1. K-means
      2. Kohonen maps
  3. Reinforcement Learning
    1. Bellman equation
    2. SARSA (on-policy)
    3. Q-learning (off-policy)
    4. Policy Gradient

CS-434: Unsupervised and Reinforcement Learning in Neural Networks

Tags:
  • Hebbian learning
  • Reinforcement learning
  • Unsupervised learning
  • Self Organizing Maps
  • Kohonen Maps
  • Neural networks
  • Artificial intelligence

Introduction

This compendium is made for the course CS-434 Unsupervised and Reinforcement Learning in Neural Networks at École Polytechnique Fédérale de Lausanne (EPFL) and is a summary of the lectures and lecture notes. It is not the complete curriculum, but rather an overview of the reading material.

Prerequisites

Eigendecomposition

Compute eigenvalues and eigenvectors

Matrix operations

Positive semi-definite:

$\vec{v}^T A \vec{v} \geq 0$ for all vectors $\vec{v}$.
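
As a quick sketch (the matrix below is just an illustrative example), eigenvalues and eigenvectors can be computed with NumPy, and the eigenvalues give a direct test for positive semi-definiteness of a symmetric matrix:

```python
import numpy as np

# Illustrative symmetric matrix
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Eigendecomposition: A @ v_i = lambda_i * v_i
eigvals, eigvecs = np.linalg.eigh(A)    # eigh: for symmetric/Hermitian matrices
print(eigvals)                          # [1. 3.]
print(eigvecs)                          # columns are the eigenvectors

# A symmetric matrix is positive semi-definite iff all eigenvalues are >= 0
print(np.all(eigvals >= 0))             # True
```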

Unsupervised Learning

  • Synaptic Plasticity: In neuroscience, synaptic plasticity is the ability of synapses to strengthen or weaken over time, in response to increases or decreases in their activity

Hebbian learning

When an axon of cell $j$ repeatedly or persistently takes part in firing cell $i$, then $j$’s efficiency as one of the cells firing $i$ is increased.

Rate-based Hebbian Learning

active = high rate = many spikes per second

Oja rule (1989)

Detects first principal component.

$$\frac{d}{dt}w_{ij} = a^{corr}_2 v^{pre}_j v^{post}_i - c(w_{ij})(v^{post}_i)^2$$
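
A discrete-time sketch of the rule (the learning rate, toy data, and the standard simplified form $\Delta \vec{w} = \eta\, v^{post}(\vec{x} - v^{post}\vec{w})$ are assumptions, not taken verbatim from the lecture notes); the weight vector converges to the first principal component of the input:

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean toy data with one dominant direction (illustrative)
C = np.array([[3.0, 1.0],
              [1.0, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], C, size=5000)

# Discrete-time Oja update: dw = eta * y * (x - y * w), with y = w . x
w = rng.normal(size=2)
eta = 0.01
for x in X:
    y = w @ x
    w += eta * y * (x - y * w)

# w converges (up to sign) to the leading eigenvector of C
print(w / np.linalg.norm(w))
print(np.linalg.eigh(C)[1][:, -1])      # leading eigenvector, for comparison
```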

Spike time dependent learning window

Function that maps the timing difference between pre- and postsynaptic spikes to the resulting change in the synaptic weight (how strongly the two neurons wire together).
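
One common choice is an exponential window; the amplitudes and time constants below are illustrative assumptions, not values from the course:

```python
import numpy as np

def stdp_window(dt, A_plus=1.0, A_minus=1.0, tau_plus=20.0, tau_minus=20.0):
    """Weight change as a function of dt = t_post - t_pre (in ms).

    Pre-before-post (dt > 0) -> potentiation,
    post-before-pre (dt < 0) -> depression.
    """
    dt = np.asarray(dt, dtype=float)
    return np.where(dt >= 0,
                    A_plus * np.exp(-dt / tau_plus),
                    -A_minus * np.exp(dt / tau_minus))

print(stdp_window([-40.0, -10.0, 10.0, 40.0]))
```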

Component Analysis

Principal component analysis

  1. Subtract mean
  2. Calculate covariance matrix
  3. Find the eigenvalues and corresponding eigenvectors
  4. The vector with the greatest eigenvalue points in the direction of the principal component
    • $FinalData = RowFeatureVector \times RowDataAdjust$
    • $RowOriginalData = (RowFeatureVector^T \times FinalData) + OriginalMean $
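
The steps above as a short NumPy sketch (the toy data set is an assumption); it also checks the reconstruction formula:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([5.0, 2.0], [[3.0, 1.0], [1.0, 1.0]], size=500)  # toy data

# 1. Subtract the mean
mean = X.mean(axis=0)
Xc = X - mean

# 2. Covariance matrix
C = np.cov(Xc, rowvar=False)

# 3./4. Eigenvalues and eigenvectors, sorted so the largest eigenvalue comes first
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
row_feature_vector = eigvecs[:, order].T          # rows = eigenvectors

# FinalData = RowFeatureVector x RowDataAdjust
final_data = row_feature_vector @ Xc.T

# RowOriginalData = (RowFeatureVector^T x FinalData) + OriginalMean
reconstructed = (row_feature_vector.T @ final_data).T + mean
print(np.allclose(reconstructed, X))              # True when all components are kept
```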

Independent component analysis

  1. Subtract mean
  2. Whitening (unit variance in all dimensions)
    1. Change to coordinates of maximum variance: PCA
    2. Normalize: divide each component by $\sqrt{\lambda_n}$
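
A short sketch of the whitening step on toy data (the data are an assumption); after rotating with PCA and dividing by $\sqrt{\lambda_n}$, the covariance becomes the identity:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 2.0], [2.0, 3.0]], size=2000)  # toy data

# 1. Subtract the mean
Xc = X - X.mean(axis=0)

# 2a. Rotate to the coordinates of maximum variance (PCA)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Y = Xc @ eigvecs

# 2b. Normalize: divide each component by sqrt(lambda_n)
Z = Y / np.sqrt(eigvals)

# Whitened data: the covariance matrix is (approximately) the identity
print(np.round(np.cov(Z, rowvar=False), 2))
```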

Independence

$p(y_1, y_2) = p(y_1)p(y_2)$

A normalized Gaussian distribution has no preferred axis.

Kurtosis

The classical measure of non-Gaussianity is kurtosis or the fourth-order cumulant. The kurtosis of $y$ is classically defined by

$ kurt(y) = E\{y^4\} - 3(E\{y^2\})^2 $

Since $y$ has unit variance, the equation simplifies to

$ kurt(y) = E\{y^4\} - 3 $

and we know that the Gaussian has fourth moment $E\{y^4\} = 3(E\{y^2\})^2 = 3$ for unit variance, which makes the kurtosis zero for a Gaussian distribution.
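
A quick numerical check of this definition (the sample distributions are chosen for illustration):

```python
import numpy as np

def kurt(y):
    """kurt(y) = E{y^4} - 3 (E{y^2})^2 -- zero for a Gaussian."""
    y = np.asarray(y)
    return np.mean(y**4) - 3 * np.mean(y**2)**2

rng = np.random.default_rng(0)
n = 200_000
print(kurt(rng.normal(size=n)))           # ~0  (Gaussian)
print(kurt(rng.uniform(-1, 1, size=n)))   # <0  (sub-Gaussian, flat)
print(kurt(rng.laplace(size=n)))          # >0  (super-Gaussian, heavy-tailed)
```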

Temporal ICA

Find unmixing matrix $W$ such that outputs are independent.

Clustering

K-means

  1. Determine winner $k$
    • $\| \vec{w}_k - \vec{x}^{\mu} \| \leq \| \vec{w}_i - \vec{x}^{\mu} \|$ for all $i$
  2. Update winner
    • $\Delta \vec{w}_k = \eta ( \vec{x}^{\mu} - \vec{w}_k )$
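
A minimal sketch of this online (competitive) k-means update; the toy data, number of prototypes, and learning rate are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: three clusters (illustrative)
X = np.concatenate([rng.normal(c, 0.3, size=(200, 2))
                    for c in ([0.0, 0.0], [3.0, 0.0], [0.0, 3.0])])
rng.shuffle(X)

K, eta = 3, 0.05
W = X[rng.choice(len(X), size=K, replace=False)].copy()   # initial prototypes

for x in X:
    # 1. Determine the winner k: the prototype closest to the input
    k = np.argmin(np.linalg.norm(W - x, axis=1))
    # 2. Update only the winner towards the input
    W[k] += eta * (x - W[k])

print(np.round(W, 2))   # learned prototypes after one pass over the data
```

With winner-take-all updates some prototypes can stay unused if they never win; Kohonen maps (next section) soften this by also updating the winner's neighbours.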

Kohonen maps

An extension of k-means in which not only the winner but also its neighbours on the map are updated.

Learning rule, where $\Lambda(j, k)$ is a neighbourhood function centred on the winner $k$:

$$ \Delta \vec{w}_{j} = \eta \Lambda(j, k) (\vec{x}^{\mu} - \vec{w}_j)$$
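
A sketch with a 1-D grid of units and a Gaussian neighbourhood function for $\Lambda$ (the grid size, width $\sigma$, and toy data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D map of N units receiving 2-D inputs (sizes and data are illustrative)
N, eta, sigma = 20, 0.1, 2.0
W = rng.uniform(0.0, 1.0, size=(N, 2))        # one prototype per map unit
X = rng.uniform(0.0, 1.0, size=(5000, 2))     # toy inputs
positions = np.arange(N)                      # unit positions on the grid

def neighbourhood(j, k, sigma):
    """Lambda(j, k): Gaussian in the grid distance between units j and k."""
    return np.exp(-((j - k) ** 2) / (2.0 * sigma ** 2))

for x in X:
    k = np.argmin(np.linalg.norm(W - x, axis=1))   # winner
    Lam = neighbourhood(positions, k, sigma)       # strongest for units near the winner
    W += eta * Lam[:, None] * (x - W)              # learning rule above

print(np.round(W[:5], 2))   # neighbouring units develop similar prototypes
```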

Reinforcement Learning

Reinforcement learning = Hebbian learning + reward

$Q(s,a)$ is expected reward: $$Q(s,a) = \sum_{s'} P^a_{s \rightarrow s'} R^a_{s \rightarrow s'}$$

Bellman equation

$$Q(s,a) = \sum_{s'} P^a_{s \rightarrow s'} \Big[ R^a_{s \rightarrow s'} + \gamma \sum_{a'} \pi(s', a') Q(s',a') \Big] $$
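
To make the equation concrete, a tiny tabular policy-evaluation sketch (the MDP transition probabilities, rewards, and uniform policy are made up for illustration); it iterates the right-hand side until $Q$ converges:

```python
import numpy as np

# Tiny illustrative MDP: 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a, s'] = reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 0.0]]])
pi = np.full((2, 2), 0.5)      # uniform random policy pi(s, a)
gamma = 0.9

# Iterate the Bellman equation until Q stops changing (policy evaluation)
Q = np.zeros((2, 2))
for _ in range(200):
    V = np.sum(pi * Q, axis=1)                            # sum_a' pi(s', a') Q(s', a')
    Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2)

print(np.round(Q, 3))
```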

Exploration vs. exploitation dilemma.

  • On-policy: The same policy is used to select the next action and to update the Q-values.
  • Off-policy: The action is selected using policy A (e.g. softmax), while the Q-values are updated using policy B (e.g. $\epsilon$-greedy).

SARSA (on-policy)

  • TD: Temporal Difference
  • Eligibility traces: A mixture between TD and Monte Carlo methods.

$\Delta Q(s,a) = \eta [r-(Q(s,a) - \gamma Q(s',a'))]$
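
Written as code, one tabular SARSA step could look like the following (the learning rate, discount factor, and toy indices are assumptions):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, eta=0.1, gamma=0.9):
    """One on-policy SARSA step on a tabular Q, following the update above:
    Delta Q(s, a) = eta * [r + gamma * Q(s', a') - Q(s, a)]."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += eta * td_error
    return Q

# Toy usage: 5 states, 2 actions; indices and reward are illustrative
Q = np.zeros((5, 2))
sarsa_update(Q, s=0, a=1, r=1.0, s_next=3, a_next=0)
print(Q)
```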

Q-learning (off-policy)
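
For comparison with SARSA, the standard Q-learning update (general formulation, not taken verbatim from the lecture notes) bootstraps with the greedy action instead of the action actually taken:

$$\Delta Q(s,a) = \eta \big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big]$$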

Policy Gradient

Forget Q-values. Optimize the expected reward of the actions directly.

Stimuli/observations $\rightarrow$ action table

When not to use TD algorithms:

  1. Continuous state spaces need hard-to-tune function approximators
  2. TD methods with function approximation can diverge even when the state is fully observable
  3. Continuous actions are difficult to represent with TD algorithms
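
A minimal REINFORCE-style sketch in the bandit setting (a single state), optimizing the reward directly without Q-values; the softmax policy, learning rate, and reward probabilities are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros(2)                     # one policy parameter per action
eta = 0.1
p_reward = np.array([0.2, 0.8])         # true (unknown) reward probabilities

for _ in range(2000):
    pi = np.exp(theta) / np.sum(np.exp(theta))     # softmax policy over actions
    a = rng.choice(2, p=pi)
    r = float(rng.random() < p_reward[a])          # stochastic binary reward
    grad_log_pi = -pi                              # d log pi(a) / d theta for softmax
    grad_log_pi[a] += 1.0
    theta += eta * r * grad_log_pi                 # follow the reward gradient

print(np.round(np.exp(theta) / np.sum(np.exp(theta)), 2))   # policy prefers the better arm
```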

Written by

Martin Hallén
Last updated: Tue, 26 Jan 2016 17:26:02 +0100.