TDT4265: Computer Vision and Deep Learning
Modern Deep-learning based Computer Vision
Shallow fully-connected Feed-Forward Artificial Neural Networks
Artificial Neural Networks
ANNs are inspired by biological neural networks and is built up by layers of neurons. This video explains neural networks in a very good way.
Wordlist:
- Shallow NN: a neural net with few hidden layers. The opposite is deep neural networks with many layers
- Fully-connected: all neurons are connected to all neurons in both the previous and the next layer
- Feed-Forward: only forward links between the layers
Notation:
$w_{jk}^l$ : weight to neuron j in layer l from neuron k in layer l-1$b_j^l$ : bias of neuron j in layer l$z_j^l$ : weighted input for neuron j in layer l$a_j^l$ : activation of neuron j in layer l
Activation functions
- Sigmoid AF
$$ \sigma (z) = \frac{1}{1+e^{-z}} $$ "Squashes" the input between 0 and 1.- Rectified Linear Unit (ReLU)
$$f(x)=max(0,x)$$ Does'nt update the weights if$x<0$ (dead ReLU-problem)- Leaky ReLU
$$f(x) = max(\alpha x, x)$$ Fixes the problem with the dead ReLU- Softmax AF
$$a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}$$ Scales the output as probabilities (all are between 0 and 1 and sums up to 1). Often used in the ouput layer.
Learning
When training a neural network we actually just adjust the parameters
Gradient descent
Watch this video to get an understanding of the Gradient descent method.
Summarized we want to find a minima of the cost function by gradually moving towards it by moving a small step in the direction that brings us most downhill. The gradient descent method uses backpropagation to estimate the gradient of the cost function, so understanding backpropagation is essential to understand how neural networks learn.
The cost function that is often used in describing the Gradient descent-method is the quadratic cost function
Stochastic Gradient Descent (SGD)
Stochastic gradient descent is based on an idea to estimate the gradient of
Backpropagation
A very good explanation for the backpropagation-algorithm is found here and the calculus is well explained here.
Backpropagation is the backbone of learning in neural network. It gives us an expression for the partial derivate of the cost function,
Algorithm
-
Input x: set the corresponding activation
$a^1$ for the input layer. -
Feedforward: Compute the weighted input
$z^l = w^l a^{l-1} + b^l$ and the activation$a^l = \sigma(z^l)$ for every layer$l=1,2,...,L$ . -
Output error: Compute the error
$\delta^L$ for the output error,$$\delta^L = \nabla_a C \odot \sigma'(z^L)$$ -
Backpropagate the error: Start at layer L-1 and compute the error for every layer
$$\delta^l = ((w^{l+1})^T \delta^{l+1})\odot \sigma'(z^l)$$ -
Output: The gradient of the cost function is now given by
$$\frac{\partial C}{\partial w_{jk}^l}=a_k^{l-1} \delta_j^l$$ and$$\frac{\partial C}{\partial b_{j}^l} = \delta_j^l$$ .
Improving learning
The cross-entropy cost function
This cost function is positive and tends toward zero as the neuron gets better at computing the desired output. When
Vanishing gradients
The vanishing gradients is what causes learning slowdown. It´s caused by the term
Softmax
Another way to address the problem of learning slowdown is looking at softmax layers of neurons. The idea is to define a new type of output layer which uses the softmax function to compute the output instead of the sigmoid function.
The softmax function:
The sum of all activations in the ouput layers with this activation function is 1, so the output is kind of like a probability distribution.
Overfitting and regularization
With a large number of free parameters you can describe a wide range of phenomena, but this does'nt necessarily make it a good model, it just means it has a lot of freedom and can be used to describe a lot of datasets. With a lot of free parameters a model can be very good on existing data, but fail to generalize on new data. When training a network the accuracy will converge towards a value and stop increasing. Sometimes the cost function will continue to decrease even though the accuracy is constant, but the network is not getting better at generalizing. This is overfitting.
Signs of overfitting:
- the cost on the test data gets worse after a while
- accuracy does'nt improve
- accuracy on the training set is
$100\%$ while it's less on the test set. the network starts "memorizing" the training data instead of generalizing.
How can we prevent and reduce overfitting?
- stop training when accuracy stops improving
- using the validation set for testing and stopping when accuracy for this set has saturated
- increase the size of the training set
- different regularization techniques
The validation dataset is often used to find the right hyper-parameters instead of the test set. This is to prevent the network from overfitting to the test set, as you add another set to fine-tune the parameters on. This separation of the dataset into training, validation and test is called the "hold out-method".
Regularization
L2 regularization / Weight decay
Adds a regularization term to the cost function. The effect of this term is that the network will prefer to learn small weights. When tuning the regularization parameter,
L2 regularization term:
The regularization causes the network to generalize better and get a higher accuracy because it can train more without overfitting.
Why does this kind of regularization prevent overfitting?
L2 regularization keeps the weights small and the overall network "simpler", aka less complex. Less complex models is less sensitive to noise as the behaviour changes less if we change a few inputs than if it was a complex model. Think of a dataset of points and the difference between a linear model and a model of 9th order that fits the datapoints. If both models fit the points pretty well and the linear model only contain a little noise we still prefer this model because it's much less complex and will, most likely, fit better to new datapoints. We say it's better to generalize, and this also is true for our neural network. The less complex the better, and small weights will contribute towards this.
L1 regularization
Like L2 it adds a regularization term to the cost function.
L1 regularization term:
Like L2 it shrinks the weights. L1 concentrates the weight of the network in a small number of high-importance connections while the other weights are driven towards zero.
Dropout
Dropout is a regularization technique that modifies the network itself instead of of just the cost function. For every iteration of feed-forwarding the input and backpropagating the result, we leave out half of the neurons and only run the algorithm on the other half. For the next iteration we use a different set of "active" neurons.
The effect of this is like we would train several different networks and then averaging the resulting weights and biases. Because we train a different set of active neurons each iteration these will be affected by overfitting in different ways, and averaging the result will hopefully remove these effects. The result is that the network is robust to the loss of any individual piece of evidence.
Artificially expanding the training data
A large training set is a good way to improve the ckassification accuracy, but it's often expensive to produce more training data. A technique to get a bigger set without getting new data is to manipulate the data that we already have by rotations, translations, adding noise, speeding up/down ++.
Weight initializing
Initializing the weights as a Gaussian random variable between 0 and 1 can lead to learning slowdown because of saturation. To avoid this we can instead initialize the weights as Gaussian variables between 0 and the standard deviation
Choosing the hyper-parameters
There is a very large number of possible combinations of hyper-parameters and no easy solution or ways to tell what is best. The best way is to experiment with your dataset and see what gives the best results. A good basic strategy is to start by reducing the complexity of the problem at hand and tuning the parameters with a simpler task, then reducing the complexity of the network, and all over have a high frequency of monitoring so you discover changes in outcome faster.
Convolutional Neural Networks and Image classification
CNNs are deep neural networks, which means that they have many hidden layers. They are built up by convolutional layers, ReLU-layers, pooling-layers and some fully-connected layers.
Convolution
The convolutional layers have several filters (or kernels) that's slid over the input image to check how well that filter matches with a part of the input image. These filters together form a feature map.
Each neuron in the feature map-layer is connected to a region in the previous layer. You can think of it as lighting a flashlight from every neuron onto a certain part of the input image. These flashlights or regions are called local receptive fields. The weights that connect the neuron and the local receptive field are the same for all neurons. This is the same as thinking that the flashlights are held still, but they are looking for the same thing (they are the same filter). These filters we refer to is the same as the weights that connect the neurons in the input image and feature map.
The activations in the feature map are calculated with convolution. That is, you take the inner product of the activations in the input layer and the filter.
We can have several of these filters, or kernels, in a layer. If we have 3 filters we get 3 feature maps. Connecting this to our flashlights we can say that we have 3 flashlights in a row pointing at the same part of the input image, but looking for different things.
Convolution is faster and less computationally expensive that fully-connected layers because of the filter-size and the fact that you don't have connections between every neuron. For example if you have a 20x20 input image and 10 neurons in the first layer. Then you need
Pooling
Pooling is a method for downsizing the filtered images. A pooling layer is added after a convolution layer (and a ReLU) to reduce the number of feature-map coefficients. The way it does this is by sliding a "window" over the feature map, for example of size 2x2, and putting the max value inside this window into a new pooled feature map. This makes the feature map smaller while still keeping the most important information.
Another benefit of pooling is that it makes the feature map less sensitive to location since it doesn't care where in the window the max value is.
Structure
In a convolutional network you start with a convolutional layer which creates a feature map from the input image. Then these neurons are taken through the ReLU-function, followed by a max pooling layer.
Padding and stride
- Padding: for convolution-windows to fit around every input tile we add an appropriate number of rows and columns around our input image to fit center convolution windows around every input tile.
- Stride: the distance between two windows (convolutional or pooling). if the stride is
$>1$ the feature map will be downsized.
Classification
After several Convolution+ReLU+Pooling layers we put some fully-connected layers at the end of the network to support classification. These are traditional FC-layers of neurons, and the last layer uses the softmax activation function so we get the output as probabilities for each class.
Object detection and Segmentation
Object detection is the task of finding which objects are in the image and where they are.
- Classification: what object is in the image (one)
- Localization: where is the object
- Detection: find and localize all objects in image
Object detection as regression
Regression: instead of giving a class you output a number (function output).
Instead of training on the error from the outputted class, you give the system a ground truth bounding box and use the L2 distance to calculate the loss the prediction and the ground truth.
Object detection as Classification
Sliding window
Run a classifier by sliding a "window" over the image and detect objects based on the output from these classified windows.
The problem with this approach is that you need to test many positions and scales and run classification on a large number of regions.
Region proposals
A way to reduce the number of regions is using an algorithm to propose regions of interest instead of classificating all windows.
How to measure performance?
Intersection over Union (IoU)
Measures how much of the boxes that are overlapping (predicted box and ground truth).
mean Average Precision
Average of how many true positives the system predicts of all of the classes.
Region of Interest Pooling
A pooling technique to extract fixed size features from arbitrary sized region proposals.
- Obtain region of interest from feature map
- Divide the RoI into pooling sections corresponding with output size
- Apply max pooling on these sections
Region-based CNNs
Main method:
- Generate a set of region proposals
- Run these regions through a CNN
- Compute output
We are introduced to 3 types of R-CNNs which does each of these three steps differently: R-CNN, Fast R-CNN and Faster R-CNN.
R-CNN
- Scan image for possible objects with Selective Search. This generates
$\approx 2000$ region proposals. - Run a CNN on each of the region proposals.
- Take the output of each of the CNNs and feed it into a SVM for classification and a linear regressor to tighten the bounding boxes.
Fast R-CNN
- Feature extraction is done before proposing regions, so it only runs one CNN over the full image instead of 2000
- Replaced the SVM for classification with a softmax layer.
- faster because now we only have to train one CNN instead
Faster R-CNN
- Replaced the Selective Search algorithm with a neural net: the region proposal network (RPN)
- At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and maps it to a lower dimention
- For each sliding window location it generates multiple possible regions based on
$k$ fixed-ratio anchor boxes - Each region proposal gets an "objectness" score
- The RPN only proposes regions, it does not classify them
- Faster R-CNN = RPN + Fast R-CNN
You Only Look Once (YOLO)
YOLO divides the input image into a grid of
YOLO is much faster than Faster R-CNN because even though it has to do
Single Shot MultiBox Detector (SSD)
Instead of proposing regions and classifying them in 2 steps like the R-CNNs does, SSD does it in a "single shot".
Method:
- Sends image thorugh several convolutional layers, giving several sets of feature maps at different scales
- For each location in feature map, use a 3x3 convolutional filter to evaluate a set of fixed-size bounding boxes
- For each box, predict the bbox offset and class probability
- During training, match ground truth with predicted boxes based on IoU
Traditional/Classical Computer Vision
Point descriptors
Consider the problem of building a panorama-image from two images. Two problems arises:
- how to find the same points independently in both images
- for each point, how do we match this with the corresponding point in the other image?
To solve these problems we need to find good features. A feature is a piece of information which is relevant for solving the computational task related to a certain application (source), and in computer vision it can for example be points, edges or objects.
Characteristics of good features:
- Repeatability/Precision
- Saliency/Matchability (distinctive descriptor)
- Compactness and efficiency (fewer features than pixels)
- Locality (feature is relatively small)
Harris corners
The Harris corner algorithm is a way to detect corners in images.
We consider a gray-scale image
This error function
which can be written in matrix form as:
where the structure tensor
We define
- Corner:
$R$ large,$R>0$ - Edge:
$R<0$ - Flat region: R is small
The Harris Corner Detection Algorithm:
- Convert image to grayscale
- Compute the partial derivatives
$I_x (x,y)$ and$I_y (x,y)$ - Setup the structure tensor
$M$ - Calculate the Harris response
- Non-maximum suppression
Scale-Invariant Feature Transform (SIFT)
-
Scale-space Extrema Detection
- Find the Difference of Gaussians (DoG) by comparing two images with different blurring levels
- Search for local extrema over scale and space. These are our potential keypoints
-
Keypoint Localization
- Remove edges (corners are far more valuable)
- Disregard extremas under a certain threshold
-
Orientation Assignment
- The gradient magnitude and direction is calculated in an area around the keypoint and organized in a histogram
- An orientation is given to each keypoint according to the peak in the histogram. This ensures rotation variance
-
Keypoint Descriptor
- A
$16x16$ neighbourhood around the keypoint is divided into 16 sub-blocks of$4x4$ size, and for each of these sub-blocks a 8 bin orientation histogram is created - These histograms are represented as a vector: the keypoint descriptor
- A
Random Sample Consensus (RANSAC)
Fitting a line or model to a dataset is easy if we know which points belong to the model and which do not. RANSAC is built on the idea that if we had a proposed line (model) we can guess which points belong to that line (model).
Algorithm:
- Sample (randomly) the number of points required to fit the model (the minimal set
$s$ ) - Instantiate the model for these sample points
- Score the model by counting inliers (points within a threshold from the model)
- Repeat until a model with high confidence is found
Region descriptors
Histogram of Oriented Gradients (HOG)
The HOG feature descriptor uses the distribution of the directions of the gradients as features. It utilizes the fact that the magnitude of the gradients for edges and corners are large.
Calculation:
1.Preprocessing
- patches at multiple scales are analyzed
2.Calculate the Gradient Images
- calculate horizontal and vertical gradients, then find magnitude and direction
$$ g = \sqrt{g_x^2+g_y^2} $$ $$ \theta = arctan(\frac{g_y}{g_x}) $$
3.Calculate Histogram of Gradients in
- divide image into
$8x8$ cells and calculate histogram for each - bin in the histogram is selected based on direction and weighted on the magnitude
- 9 bins for angles: 0, 20, 40, 60, 80, 100, 120, 140 and 160
4.
- to make the histograms independent of lighting variations we normalize them
- normalize over a
$16x16$ block instead of the$8x8$ that its already divided into - concenate the 4 histograms into a
$36x1$ element vector and normalize
5.Calculate the HOG feature vector
- all the
$36x1$ vectors are concenated into one big feature vector
Viola-Jones
The Viola-Jones object detection framework was primarily used of facial detection. The algorithm consists of 4 steps:
1.Haar Feature selection
All faces share some similarities which can be matched using Haar features. There are 3 types of features: two-rectangle feature, three-rectangle feature and four-rectangle feature. The features are related to a certain spatial location in the subwindow.
2.Creating an Integral Image
An integral image is a summed area table. The integral image at location
Using this integral image, any rectangular sum can be computed using four array references.
3.Adaboost training
Selects the best features
4.Cascade classifiers
Features are grouped into stages. If a window fails the first stage it is disregarded. This is a way to remove windows that definitely doesn't include a face.
Deformable Parts Model (DPM)
- Models an object as a set of parts
- Example: models a face as a set of eyes, a mouth and a nose
- Spring-model: "springs" between the parts, so deviations from expected geometric arrangements are penalized
- Want to maximize the local responses and minimize the global shape deformation
- Good model when it's variations in object, e.g. body poses
Supervised classification
Supervised classification is classification on data where the training data is labeled with the true classification.
k-Nearest Neighbour
Choose the class that belongs to the
Algorithm:
- Compute distance
$D(x,x_i)$ from point$x$ to every training example$x_i$ - Select the
$k$ closest instances and their labels - Output the classification which is most frequent among these points
The most common distance measure is Eucledian distance
Pros and cons:
- no assumptions about data
- sensitive to outliers
- computationally expensive
Support Vector Machine (SVM)
The idea of the SVM is that given a set of feature vectors we find a separating hyperplane that separates the classifications. We can use this hyperplane (or line for 2D) to classify new samples.
For non-linear classification we use the kernel trick, which is to map the input into high-dimension feature spaces. This way the set becomes linearly separable.
Pros and cons:
- good with high dimensional data
- works well on small data sets
- hard to pick right kernel and parameters
Unsupervised classification
k-Means
Clustering: grouping of datapoints into groups/clusters
Algorithm:
- select the number of classes (
$k$ ) and randomly select$k$ center points - classify data points based on distance to the center points
- recompute the group center point by taking the mean of all the points in each group
- repeat until the group center is found (when it doesn't change from one iteration to the next)
Pros and cons:
- fast
- have to select how many classes there are
- starts with random cluster centers, may get different results based on these initial points
Mean-Shift
The Mean-Shift method is sliding-window based. Aims to find dense area of points by using a circular sliding window which moves at each iteration.
Mean-Shift uses a Hill Climbing-algorithm. At every iteration the sliding window is shifted towards the region of higher density by shifting the center point to the mean of the points within the window. This is done with several sliding windows which starts at different places in the data set, so it can locate several different clusters.
Pros and cons:
- no need to select number of classes
- the selection of the window-size can be non-trivial
Principal Components Analysis (PCA)
PCA is a method to understand global properties of a dataset consisting of vectors. It analyzes the covariance matrix to understand what dimensions are more important.
Principal components: the eigenvectors with the highest eigenvalues (the "most important")
We can use PCA to represent images more concisely and to help with recognition and matching.
"Common" Computer Vision
Pinhole Camera Model
The Pinhole camera model describes the mathematical relationship between the coordinates of a point in 3D and its projection onto the image plan of an ideal pinhole camera. In a pinhole camera each ray from the scene passes undeflected to the image plane, and so does not take lenses and aperture into account.
Coordinate frames
Transformations
Transformations can be mapping a point from A to its representation in frame B, or describing a frame A relative to frame B.
2D-2D
Described by a translation vector
The projection transform is then given by
3D-3D
The rotation matrix for a 3D-3D-transformation is
For a rotation around the Z-axis:
For a rotation around the X-axis:
For a rotation around the Y-axis:
These rotation matrices can be multiplied to form rotations around multiple axes.
3D-2D
To transform points in the real world (3D) onto the image plane (2D) we use the intrinsic camera parameter matrix K and the extrinsic camera matrix M.
Stereo
Stereo is the method of combining several 2D images to get a 3D understanding of the scene. The relations are based on the assumption that the camera can be approximated by the pinhole camera model.
The stereo principle is that each image point corresponds to a ray, and to find the absolute point position you can intersect the ray found from images from two different angles.
For every pixel
- find corresponding epipolar scanline in other image
- examine pixels on scanline and find the best match
$x'$ - compute the disparity
$d=x-x'$ and set$depth = b*f/d$
The goal is to compute a disparity map. The biggest problem with this is the correspondance problem: how to match features in the two images. A way to solve this is to limit the search for matches by using known geometry of the camera positions or by an epipolar constraint; to search for matches only along the epipolar line.
The Essential Matrix
The Essential matrix is a matrix
The Fundamental Matrix
The Fundamental matrix
Motion estimation and optical flow
Motion analysis has many application, e.g. segmentation of objects, estimating 3D structure, learning how things move (dynamic models) or improving video quality (motion stabilization).
Optical flow estimates the apparent motion of objects or surfaces from image sequences.
Optic flow problem: estimate pixel motion by searching for nearby pixels of the same color.
-> Given a pixel in
This assumes that there is small motion in the sequence and that the color is kept constant. This can be problematic and there exists several methods to solve these problems.
Aperture problem
The motion of a contour is hard to decide without the full object.
We don't know if this is moving to the right, down or a place in the middle.
Horn-Schunck method
Introduces a global constraints of smoothness in the flow to solve the aperture problem.
Lucas-Kanade method (LK)
Assumes that the flow is constant (same
Iterative LK method
Estimates the velocity at each pixel by solving the LK-equations. Then warps the image towards the next in the sequence using estimated flow field. It repeats this until convergence.
Hierarchical LK method
Solves the problem with assuming there is small motion by reducing the resolution before estimating motion. It computes iterative LK at one level of resolution, warps and then upsamples before it continues with iterative LK at the next level.
Sparse LK method
Like the Hierarchical LK, but is only applied to good feature locations.
Tracking
Kalman filter
The Kalman filter is an iterative mathematical process that uses a set of equations and data inputs to estimate the coming values.
For tracking, it uses the current sensor information to give an educated guess as to where you are going to end up next. Then it updates its tracking information with the new sensor information, and uses this to again predict the next step. In the calculation it takes into account both process and measurement noise.
Particle filters
- Generate hypotheses for where we are. In the beginning this is just a random distribution of points. These hypotheses are referred to as "particles".
- Evaluate the probability for a particle given all sensor data and prior data (e.g. a map). Do this for all particles.
- Resample by removing unlikely particles and generating new particles at the likely positions.
- When the object moves, move the particles according to a speed model for the object.
- Evaluate all particles again based on the new sensor data, and resample again.
- Continue updating the weights of the particles, resampling and propagate particles in time.