Table of Contents

Modern Deep-learning based Computer Vision
Traditional/Classical Computer Vision
"Common" Computer Vision

‹

TDT4265: Computer Vision and Deep Learning

Tags:

Modern Deep-learning based Computer Vision

Shallow fully-connected Feed-Forward Artificial Neural Networks

Artificial Neural Networks

ANNs are inspired by biological neural networks and is built up by layers of neurons. This video explains neural networks in a very good way.

Wordlist:

Shallow NN: a neural net with few hidden layers. The opposite is deep neural networks with many layers
Fully-connected: all neurons are connected to all neurons in both the previous and the next layer
Feed-Forward: only forward links between the layers

Notation:

$w_{jk}^l$: weight to neuron j in layer l from neuron k in layer l-1
$b_j^l$: bias of neuron j in layer l
$z_j^l$: weighted input for neuron j in layer l
$a_j^l$: activation of neuron j in layer l

Activation functions

Sigmoid AF: $$ \sigma (z) = \frac{1}{1+e^{-z}} $$ "Squashes" the input between 0 and 1.
Rectified Linear Unit (ReLU): $$f(x)=max(0,x)$$ Does'nt update the weights if $x<0$ (dead ReLU-problem)
Leaky ReLU: $$f(x) = max(\alpha x, x)$$ Fixes the problem with the dead ReLU
Softmax AF: $$a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}$$ Scales the output as probabilities (all are between 0 and 1 and sums up to 1). Often used in the ouput layer.

Learning

When training a neural network we actually just adjust the parameters $w$ and $b$ so that the network best fits the data. To do so we define a cost function for the network and tries to minimize this. To do so we use the Gradient descent method.

Gradient descent

Watch this video to get an understanding of the Gradient descent method.

Summarized we want to find a minima of the cost function by gradually moving towards it by moving a small step in the direction that brings us most downhill. The gradient descent method uses backpropagation to estimate the gradient of the cost function, so understanding backpropagation is essential to understand how neural networks learn.

The cost function that is often used in describing the Gradient descent-method is the quadratic cost function $$ C = \frac{1}{2n} \sum_x |y(x)-a^L(x)|^2 $$

Stochastic Gradient Descent (SGD)

Stochastic gradient descent is based on an idea to estimate the gradient of $C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$ which speeds up the process. We choose $m$ randomly training inputs $X_1, X_2, \mbox{...}, X_m$ and refer to them as a mini-batch. We would expect that the average value of $\nabla C_{X_j}$ will be roughly the same to the overall $\nabla C_x$.

$$ \nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_j} $$

Backpropagation

A very good explanation for the backpropagation-algorithm is found here and the calculus is well explained here.

Backpropagation is the backbone of learning in neural network. It gives us an expression for the partial derivate of the cost function, $\frac{\partial C}{\partial w}$, that tells us how the cost function will change when we change the different weights and biases.

Algorithm

Input x: set the corresponding activation $a^1$ for the input layer.
Feedforward: Compute the weighted input $z^l = w^l a^{l-1} + b^l$ and the activation $a^l = \sigma(z^l)$ for every layer $l=1,2,...,L$.
Output error: Compute the error $\delta^L$ for the output error, $$\delta^L = \nabla_a C \odot \sigma'(z^L)$$
Backpropagate the error: Start at layer L-1 and compute the error for every layer $$\delta^l = ((w^{l+1})^T \delta^{l+1})\odot \sigma'(z^l)$$
Output: The gradient of the cost function is now given by $$\frac{\partial C}{\partial w_{jk}^l}=a_k^{l-1} \delta_j^l$$ and $$\frac{\partial C}{\partial b_{j}^l} = \delta_j^l$$.

Improving learning

The cross-entropy cost function

$$C = -\frac{1}{n}\sum_x (y\ln a + (1-y)\ln (1-a))$$

This cost function is positive and tends toward zero as the neuron gets better at computing the desired output. When $y=0$ and $a\approx0$, $C \to 0$ and same when $y=1$ and $a \approx 1$. This cost function avoids the problem of learning slowing down because it learns fast even when the error is big. With quadratic cost function the neuron learns slower when the error is big, which is less like humans learn. The larger the error, the faster the network with a cross-entropy cost function learns. Using this cost function is nearly always better than quadratic cost when the output neurons are sigmoid neurons.

Vanishing gradients

The vanishing gradients is what causes learning slowdown. It´s caused by the term $\sigma'(z)$ in the derivative of the cost function. When $\sigma '(z)$ is small, $\frac{\partial C}{\partial b_1} \approx 0$ . This is in particular when we do backpropagation and calculating gradients of loss with respect to the weights, the gradients tends to get smaller and smaller as we keep moving backwards. This means that the neurons in the earlier layers learn very slowly as compared to the neurons in the later layers in the hierarchy.

Softmax

Another way to address the problem of learning slowdown is looking at softmax layers of neurons. The idea is to define a new type of output layer which uses the softmax function to compute the output instead of the sigmoid function.

The softmax function: $$a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}$$

The sum of all activations in the ouput layers with this activation function is 1, so the output is kind of like a probability distribution.

Overfitting and regularization

With a large number of free parameters you can describe a wide range of phenomena, but this does'nt necessarily make it a good model, it just means it has a lot of freedom and can be used to describe a lot of datasets. With a lot of free parameters a model can be very good on existing data, but fail to generalize on new data. When training a network the accuracy will converge towards a value and stop increasing. Sometimes the cost function will continue to decrease even though the accuracy is constant, but the network is not getting better at generalizing. This is overfitting.

Signs of overfitting:

the cost on the test data gets worse after a while
accuracy does'nt improve
accuracy on the training set is $100\%$ while it's less on the test set. the network starts "memorizing" the training data instead of generalizing.

How can we prevent and reduce overfitting?

stop training when accuracy stops improving
using the validation set for testing and stopping when accuracy for this set has saturated
increase the size of the training set
different regularization techniques

The validation dataset is often used to find the right hyper-parameters instead of the test set. This is to prevent the network from overfitting to the test set, as you add another set to fine-tune the parameters on. This separation of the dataset into training, validation and test is called the "hold out-method".

Regularization

L2 regularization / Weight decay

Adds a regularization term to the cost function. The effect of this term is that the network will prefer to learn small weights. When tuning the regularization parameter, $\lambda$, you choose a compromise between making the weights small and minimizing the original cost function (which of the two is most "important").

L2 regularization term: $$C = C_0 + \underline{ \frac{\lambda}{2n} \sum_w w^2}$$

The regularization causes the network to generalize better and get a higher accuracy because it can train more without overfitting.

Why does this kind of regularization prevent overfitting?

L2 regularization keeps the weights small and the overall network "simpler", aka less complex. Less complex models is less sensitive to noise as the behaviour changes less if we change a few inputs than if it was a complex model. Think of a dataset of points and the difference between a linear model and a model of 9th order that fits the datapoints. If both models fit the points pretty well and the linear model only contain a little noise we still prefer this model because it's much less complex and will, most likely, fit better to new datapoints. We say it's better to generalize, and this also is true for our neural network. The less complex the better, and small weights will contribute towards this.

L1 regularization

Like L2 it adds a regularization term to the cost function.

L1 regularization term: $$C = C_0 + \underline{ \frac{\lambda}{n} \sum_w |w|}$$

Like L2 it shrinks the weights. L1 concentrates the weight of the network in a small number of high-importance connections while the other weights are driven towards zero.

Dropout

Dropout is a regularization technique that modifies the network itself instead of of just the cost function. For every iteration of feed-forwarding the input and backpropagating the result, we leave out half of the neurons and only run the algorithm on the other half. For the next iteration we use a different set of "active" neurons.

The effect of this is like we would train several different networks and then averaging the resulting weights and biases. Because we train a different set of active neurons each iteration these will be affected by overfitting in different ways, and averaging the result will hopefully remove these effects. The result is that the network is robust to the loss of any individual piece of evidence.

Artificially expanding the training data

A large training set is a good way to improve the ckassification accuracy, but it's often expensive to produce more training data. A technique to get a bigger set without getting new data is to manipulate the data that we already have by rotations, translations, adding noise, speeding up/down ++.

Weight initializing

Initializing the weights as a Gaussian random variable between 0 and 1 can lead to learning slowdown because of saturation. To avoid this we can instead initialize the weights as Gaussian variables between 0 and the standard deviation $\frac{1}{\sqrt{n_{in}}}$. Good weight initializion can bring ut to a good accuracy faster, so it speeds up the learning process.

Choosing the hyper-parameters

There is a very large number of possible combinations of hyper-parameters and no easy solution or ways to tell what is best. The best way is to experiment with your dataset and see what gives the best results. A good basic strategy is to start by reducing the complexity of the problem at hand and tuning the parameters with a simpler task, then reducing the complexity of the network, and all over have a high frequency of monitoring so you discover changes in outcome faster.

Convolutional Neural Networks and Image classification

CNNs are deep neural networks, which means that they have many hidden layers. They are built up by convolutional layers, ReLU-layers, pooling-layers and some fully-connected layers.

Convolution

The convolutional layers have several filters (or kernels) that's slid over the input image to check how well that filter matches with a part of the input image. These filters together form a feature map.

Each neuron in the feature map-layer is connected to a region in the previous layer. You can think of it as lighting a flashlight from every neuron onto a certain part of the input image. These flashlights or regions are called local receptive fields. The weights that connect the neuron and the local receptive field are the same for all neurons. This is the same as thinking that the flashlights are held still, but they are looking for the same thing (they are the same filter). These filters we refer to is the same as the weights that connect the neurons in the input image and feature map.

The activations in the feature map are calculated with convolution. That is, you take the inner product of the activations in the input layer and the filter. $$s_{i,j} = \sum_m \sum_n a_{m,n} w_{m,n} +b$$ Convolution is similar to the traditional "template matching" method.

We can have several of these filters, or kernels, in a layer. If we have 3 filters we get 3 feature maps. Connecting this to our flashlights we can say that we have 3 flashlights in a row pointing at the same part of the input image, but looking for different things. Convolution is faster and less computationally expensive that fully-connected layers because of the filter-size and the fact that you don't have connections between every neuron. For example if you have a 20x20 input image and 10 neurons in the first layer. Then you need $20*20*10=4000$ weights for a fully-connected layer, but if you use a convolutional layer with 10 filters of size 3x3 you only need $3x3x10=90$ weights.

Pooling

Pooling is a method for downsizing the filtered images. A pooling layer is added after a convolution layer (and a ReLU) to reduce the number of feature-map coefficients. The way it does this is by sliding a "window" over the feature map, for example of size 2x2, and putting the max value inside this window into a new pooled feature map. This makes the feature map smaller while still keeping the most important information.

Another benefit of pooling is that it makes the feature map less sensitive to location since it doesn't care where in the window the max value is.

Structure

In a convolutional network you start with a convolutional layer which creates a feature map from the input image. Then these neurons are taken through the ReLU-function, followed by a max pooling layer.

Padding and stride

Padding: for convolution-windows to fit around every input tile we add an appropriate number of rows and columns around our input image to fit center convolution windows around every input tile.

Stride: the distance between two windows (convolutional or pooling). if the stride is $>1$ the feature map will be downsized.

Classification

After several Convolution+ReLU+Pooling layers we put some fully-connected layers at the end of the network to support classification. These are traditional FC-layers of neurons, and the last layer uses the softmax activation function so we get the output as probabilities for each class.

Object detection and Segmentation

Object detection is the task of finding which objects are in the image and where they are.

Classification: what object is in the image (one)
Localization: where is the object
Detection: find and localize all objects in image

Object detection as regression

Regression: instead of giving a class you output a number (function output).

Instead of training on the error from the outputted class, you give the system a ground truth bounding box and use the L2 distance to calculate the loss the prediction and the ground truth.

Object detection as Classification

Sliding window

Run a classifier by sliding a "window" over the image and detect objects based on the output from these classified windows.

The problem with this approach is that you need to test many positions and scales and run classification on a large number of regions.

Region proposals

A way to reduce the number of regions is using an algorithm to propose regions of interest instead of classificating all windows.

How to measure performance?

Intersection over Union (IoU)

Measures how much of the boxes that are overlapping (predicted box and ground truth). $$\frac{Intersection}{Union} = \frac{A \cap B}{A \cup B}$$

mean Average Precision

Average of how many true positives the system predicts of all of the classes. $$mAP = \frac{1}{\# classes} \sum_{c \in classes} \frac{\# TP(c)}{\# TP(c) + \# FP(c) }$$

Region of Interest Pooling

A pooling technique to extract fixed size features from arbitrary sized region proposals.

Obtain region of interest from feature map
Divide the RoI into pooling sections corresponding with output size
Apply max pooling on these sections

Region-based CNNs

Main method:

Generate a set of region proposals
Run these regions through a CNN
Compute output

We are introduced to 3 types of R-CNNs which does each of these three steps differently: R-CNN, Fast R-CNN and Faster R-CNN.

R-CNN

Scan image for possible objects with Selective Search. This generates $\approx 2000$ region proposals.
Run a CNN on each of the region proposals.
Take the output of each of the CNNs and feed it into a SVM for classification and a linear regressor to tighten the bounding boxes.

Fast R-CNN

Feature extraction is done before proposing regions, so it only runs one CNN over the full image instead of 2000
Replaced the SVM for classification with a softmax layer.
- faster because now we only have to train one CNN instead

Faster R-CNN

Replaced the Selective Search algorithm with a neural net: the region proposal network (RPN)
- At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and maps it to a lower dimention
- For each sliding window location it generates multiple possible regions based on $k$ fixed-ratio anchor boxes
- Each region proposal gets an "objectness" score
The RPN only proposes regions, it does not classify them
Faster R-CNN = RPN + Fast R-CNN

You Only Look Once (YOLO)

YOLO divides the input image into a grid of $SxS$ cells. Each of these cells are responsible for predicting $N$ bounding boxes and a class. Each of the bounding boxes gets a confidence score (probability that it contains an object), and this confidence score and the class prediction are combined.

YOLO is much faster than Faster R-CNN because even though it has to do $SxSxN$ (normally this number is $13x13x5=845$) separate predictions it only has to run through the neural network once.

Single Shot MultiBox Detector (SSD)

Instead of proposing regions and classifying them in 2 steps like the R-CNNs does, SSD does it in a "single shot".

Method:

Sends image thorugh several convolutional layers, giving several sets of feature maps at different scales
For each location in feature map, use a 3x3 convolutional filter to evaluate a set of fixed-size bounding boxes
For each box, predict the bbox offset and class probability
During training, match ground truth with predicted boxes based on IoU

Traditional/Classical Computer Vision

Point descriptors

Consider the problem of building a panorama-image from two images. Two problems arises:

how to find the same points independently in both images
for each point, how do we match this with the corresponding point in the other image?

To solve these problems we need to find good features. A feature is a piece of information which is relevant for solving the computational task related to a certain application (source), and in computer vision it can for example be points, edges or objects.

Characteristics of good features:

Repeatability/Precision
Saliency/Matchability (distinctive descriptor)
Compactness and efficiency (fewer features than pixels)
Locality (feature is relatively small)

Harris corners

The Harris corner algorithm is a way to detect corners in images.

We consider a gray-scale image $I$. Take an image patch over the area $(x,y)$ and shift it by $[u,v]$. The change in appearance for the shift is given by: $$E(u,v) = \sum_{x,y} w(x,y) [I(x+u,y+v)-I(x,y)]^2 $$ $w(x,y)$ is the window function, and can be square or Gaussian (Gaussian windows are better but square windows compute faster).

This error function $E(u,v)$ can be approximated by a second order Taylor expansion: $$E(u,v) \approx \begin{bmatrix} u & v \end{bmatrix} \begin{bmatrix} \sum w(x,y) I_x^2 (x,y) & \sum w(x,y) I_x (x,y) I_y (x,y) \\ \sum w(x,y) I_x (x,y) I_y (x,y) & \sum w(x,y) I_y^2 (x,y) \end{bmatrix} $$

which can be written in matrix form as: $$E(u,v) \approx \begin{bmatrix} u & v \end{bmatrix} M \begin{bmatrix} u \\ v \end{bmatrix} $$

where the structure tensor $M$ is $$M = \sum_{x,y} w(x,y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} $$

We define $R$: $$R = det(M) - \alpha Trace(M)^2 = \lambda_1 \lambda_2 - \alpha(\lambda_1+\lambda_2)$$ We can use this $R$ to get a picture of what the image patch consists of.

Corner: $R$ large, $R>0$
Edge: $R<0$
Flat region: R is small

The Harris Corner Detection Algorithm:

Convert image to grayscale
Compute the partial derivatives $I_x (x,y)$ and $I_y (x,y)$
Setup the structure tensor $M$
Calculate the Harris response
Non-maximum suppression

Scale-Invariant Feature Transform (SIFT)

Scale-space Extrema Detection
- Find the Difference of Gaussians (DoG) by comparing two images with different blurring levels
- Search for local extrema over scale and space. These are our potential keypoints
Keypoint Localization
- Remove edges (corners are far more valuable)
- Disregard extremas under a certain threshold
Orientation Assignment
- The gradient magnitude and direction is calculated in an area around the keypoint and organized in a histogram
- An orientation is given to each keypoint according to the peak in the histogram. This ensures rotation variance
Keypoint Descriptor
- A $16x16$ neighbourhood around the keypoint is divided into 16 sub-blocks of $4x4$ size, and for each of these sub-blocks a 8 bin orientation histogram is created
- These histograms are represented as a vector: the keypoint descriptor

Random Sample Consensus (RANSAC)

Fitting a line or model to a dataset is easy if we know which points belong to the model and which do not. RANSAC is built on the idea that if we had a proposed line (model) we can guess which points belong to that line (model).

Algorithm:

Sample (randomly) the number of points required to fit the model (the minimal set $s$)
Instantiate the model for these sample points
Score the model by counting inliers (points within a threshold from the model)
Repeat until a model with high confidence is found

Region descriptors

Histogram of Oriented Gradients (HOG)

The HOG feature descriptor uses the distribution of the directions of the gradients as features. It utilizes the fact that the magnitude of the gradients for edges and corners are large.

Calculation:

1.Preprocessing

patches at multiple scales are analyzed

2.Calculate the Gradient Images

calculate horizontal and vertical gradients, then find magnitude and direction $$ g = \sqrt{g_x^2+g_y^2} $$ $$ \theta = arctan(\frac{g_y}{g_x}) $$

3.Calculate Histogram of Gradients in $8x8$ cells

divide image into $8x8$ cells and calculate histogram for each
bin in the histogram is selected based on direction and weighted on the magnitude
9 bins for angles: 0, 20, 40, 60, 80, 100, 120, 140 and 160

4.$16x16$ block normalization

to make the histograms independent of lighting variations we normalize them
normalize over a $16x16$ block instead of the $8x8$ that its already divided into
concenate the 4 histograms into a $36x1$ element vector and normalize

5.Calculate the HOG feature vector

all the $36x1$ vectors are concenated into one big feature vector

Viola-Jones

The Viola-Jones object detection framework was primarily used of facial detection. The algorithm consists of 4 steps:

1.Haar Feature selection

All faces share some similarities which can be matched using Haar features. There are 3 types of features: two-rectangle feature, three-rectangle feature and four-rectangle feature. The features are related to a certain spatial location in the subwindow.

2.Creating an Integral Image

An integral image is a summed area table. The integral image at location $x,y$ contains the sum of pixels above and to the left. $$ ii(x,y) = \sum_{x'\leq x, y'\leq y} i(x',y') $$

Using this integral image, any rectangular sum can be computed using four array references.

3.Adaboost training

Selects the best features

4.Cascade classifiers

Features are grouped into stages. If a window fails the first stage it is disregarded. This is a way to remove windows that definitely doesn't include a face.

Deformable Parts Model (DPM)

Models an object as a set of parts
- Example: models a face as a set of eyes, a mouth and a nose
Spring-model: "springs" between the parts, so deviations from expected geometric arrangements are penalized
Want to maximize the local responses and minimize the global shape deformation
Good model when it's variations in object, e.g. body poses

Supervised classification

Supervised classification is classification on data where the training data is labeled with the true classification.

k-Nearest Neighbour

Choose the class that belongs to the $k$ nearest neighbours of the point you are classifying.

Algorithm:

Compute distance $D(x,x_i)$ from point $x$ to every training example $x_i$
Select the $k$ closest instances and their labels
Output the classification which is most frequent among these points

The most common distance measure is Eucledian distance $$ D(x,x_i) = \sqrt{\sum_i (x-x_i)^2} $$

Pros and cons:

no assumptions about data
sensitive to outliers
computationally expensive

Support Vector Machine (SVM)

The idea of the SVM is that given a set of feature vectors we find a separating hyperplane that separates the classifications. We can use this hyperplane (or line for 2D) to classify new samples.

For non-linear classification we use the kernel trick, which is to map the input into high-dimension feature spaces. This way the set becomes linearly separable.

Pros and cons:

good with high dimensional data
works well on small data sets
hard to pick right kernel and parameters

Unsupervised classification

k-Means

Clustering: grouping of datapoints into groups/clusters

Algorithm:

select the number of classes ($k$) and randomly select $k$ center points
classify data points based on distance to the center points
recompute the group center point by taking the mean of all the points in each group
repeat until the group center is found (when it doesn't change from one iteration to the next)

Pros and cons:

fast
have to select how many classes there are
starts with random cluster centers, may get different results based on these initial points

Mean-Shift

The Mean-Shift method is sliding-window based. Aims to find dense area of points by using a circular sliding window which moves at each iteration.

Mean-Shift uses a Hill Climbing-algorithm. At every iteration the sliding window is shifted towards the region of higher density by shifting the center point to the mean of the points within the window. This is done with several sliding windows which starts at different places in the data set, so it can locate several different clusters.

Pros and cons:

no need to select number of classes
the selection of the window-size can be non-trivial

Principal Components Analysis (PCA)

PCA is a method to understand global properties of a dataset consisting of vectors. It analyzes the covariance matrix to understand what dimensions are more important.

Principal components: the eigenvectors with the highest eigenvalues (the "most important")

We can use PCA to represent images more concisely and to help with recognition and matching.

"Common" Computer Vision

Pinhole Camera Model

The Pinhole camera model describes the mathematical relationship between the coordinates of a point in 3D and its projection onto the image plan of an ideal pinhole camera. In a pinhole camera each ray from the scene passes undeflected to the image plane, and so does not take lenses and aperture into account.

Coordinate frames

Transformations

Transformations can be mapping a point from A to its representation in frame B, or describing a frame A relative to frame B.

2D-2D

Described by a translation vector $\textbf{t}=(\Delta x,\Delta y)^T$ and a rotation angle $\theta$. To map a point from one frame to another you need the rotation matrix

$$ \textbf{R} = \begin{bmatrix} \cos{\theta} & -\sin{\theta} \\ \sin{\theta} & \cos{\theta} \end{bmatrix} $$

The projection transform is then given by $$ \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} \textbf{R} & & \textbf{t} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} $$

$$\tilde{\textbf{x}}' = \begin{bmatrix} \textbf{R} & & \textbf{t} \\ 0 & 0 & 1 \end{bmatrix} \tilde{\textbf{x}} $$

3D-3D

The rotation matrix for a 3D-3D-transformation is $3x3$.

For a rotation around the Z-axis: $$ \begin{bmatrix} \cos{\theta_z} & -\sin{\theta_z} & 0 \\ \sin{\theta_z} & \cos{\theta_z} & 0 \\ 0 & 0 & 1 \end{bmatrix} $$

For a rotation around the X-axis: $$ \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos{\theta_x} & -\sin{\theta_x} \\ 0 & \sin{\theta_x} & \cos{\theta_x} \end{bmatrix} $$

For a rotation around the Y-axis: $$ \begin{bmatrix} \cos{\theta_y} & 0 & \sin{\theta_y} \\ 0 & 1 & 0 \\ -\sin{\theta_y} & 0 & \cos{\theta_y} \end{bmatrix} $$

These rotation matrices can be multiplied to form rotations around multiple axes.

3D-2D

To transform points in the real world (3D) onto the image plane (2D) we use the intrinsic camera parameter matrix K and the extrinsic camera matrix M.

Stereo

Stereo is the method of combining several 2D images to get a 3D understanding of the scene. The relations are based on the assumption that the camera can be approximated by the pinhole camera model.

The stereo principle is that each image point corresponds to a ray, and to find the absolute point position you can intersect the ray found from images from two different angles.

For every pixel $x$ in the first image:

find corresponding epipolar scanline in other image
examine pixels on scanline and find the best match $x'$
compute the disparity $d=x-x'$ and set $depth = b*f/d$

The goal is to compute a disparity map. The biggest problem with this is the correspondance problem: how to match features in the two images. A way to solve this is to limit the search for matches by using known geometry of the camera positions or by an epipolar constraint; to search for matches only along the epipolar line.

The Essential Matrix

The Essential matrix is a matrix $E$ that relates the image of a point in one camera to its image in the other, given a translation and rotation.

The Fundamental Matrix

The Fundamental matrix $F$ relates corresponding points in stereo images. $F \cdot x$ describes an epipolar line on which the corresponding point $x'$ on the other image must lie. For all corresponding points holds ${x'}^T Fx=0$. And with respect to the essential matrix and the intristic calibration matrix: $E = K^T F K$

Motion estimation and optical flow

Motion analysis has many application, e.g. segmentation of objects, estimating 3D structure, learning how things move (dynamic models) or improving video quality (motion stabilization).

Optical flow estimates the apparent motion of objects or surfaces from image sequences.

Optic flow problem: estimate pixel motion by searching for nearby pixels of the same color. -> Given a pixel in $I(x,y,t)$, search for nearby pixels of same color in $I(x,y,t+1)$.

This assumes that there is small motion in the sequence and that the color is kept constant. This can be problematic and there exists several methods to solve these problems.

Aperture problem

The motion of a contour is hard to decide without the full object.

We don't know if this is moving to the right, down or a place in the middle.

Horn-Schunck method

Introduces a global constraints of smoothness in the flow to solve the aperture problem.

Lucas-Kanade method (LK)

Assumes that the flow is constant (same $(u,v)$ for all pixels) in a local neighbourhood of the pixel and solves the basic optical flow equations for all pixels in the neighbourhood by the least square criterion.

Iterative LK method

Estimates the velocity at each pixel by solving the LK-equations. Then warps the image towards the next in the sequence using estimated flow field. It repeats this until convergence.

Hierarchical LK method

Solves the problem with assuming there is small motion by reducing the resolution before estimating motion. It computes iterative LK at one level of resolution, warps and then upsamples before it continues with iterative LK at the next level.

Sparse LK method

Like the Hierarchical LK, but is only applied to good feature locations.

Tracking

Kalman filter

The Kalman filter is an iterative mathematical process that uses a set of equations and data inputs to estimate the coming values.

For tracking, it uses the current sensor information to give an educated guess as to where you are going to end up next. Then it updates its tracking information with the new sensor information, and uses this to again predict the next step. In the calculation it takes into account both process and measurement noise.

Particle filters

Generate hypotheses for where we are. In the beginning this is just a random distribution of points. These hypotheses are referred to as "particles".
Evaluate the probability for a particle given all sensor data and prior data (e.g. a map). Do this for all particles.
Resample by removing unlikely particles and generating new particles at the likely positions.
When the object moves, move the particles according to a speed model for the object.
Evaluate all particles again based on the new sensor data, and resample again.
Continue updating the weights of the particles, resampling and propagate particles in time.

Written by

mariepauline kristiap hward

Last updated: Fri, 25 May 2018 10:00:13 +0200 .