TDT4173: Machine Learning and Case-Based Reasoning
# Introduction
## Well Posed Learning Problems
> __Definition__: A computer program is said to __learn__ from experience _E_ with respect to some class of tasks _T_ and performance measure _P_ if its performance at tasks in _T_, as measured by _P_, improves with experience _E_.
An example of a task, inspired by [MarI/O](https://www.youtube.com/watch?v=qv6UVOQ0F44):
- Task _T_: Playing the first level of Super Mario Bros.
- Performance measure _P_: How far the Mario character is able to advance.
- Training experience _E_: Playing the level over and over.
## Designing a learning system
- __Direct and indirect training examples__: Whether the system will receive examples of _one_ choice and whether it’s bad or good, or only a sequence of choices and their final outcome.
- __Credit assignment__: Determining the degree to which each choice in a sequence contributes to the final outcome (in indirect examples).
### Adjusting weights (LMS weight update rule)
The ideal target function can be called $V$. This is the magical best function which we are trying to approximate with our machine learning. The approximation that we achieve is called $\hat{V}$. One training example can be called $b$, and each of its attributes is called $x_i$. The output for a given example is thus $\hat{V}(b)$.
Let’s imagine that the training example $b$ has $n$ attributes. In a program which represents $\hat{V}(b)$ as a linear function $\hat{V}(b) = w_0 + \sum_{i=1}^n w_i \times x_i$, it is possible to approximate $V$ with the following method, called the Least Mean Squares (LMS) training rule:
For each training example $\langle b, V_{train}(b)\rangle$:
- Use the current weights to calculate $\hat{V}(b)$
- For each weight $w_i$, update it as $w_i \leftarrow w_i + \eta (V_{train}(b) - \hat{V}(b)) x_i$
$\eta$ is a small constant (e.g. 0.1) that dampens the size of the weight update.
## Learning
"Any process by which a system improves performance" -H. Simon
Basic definition: "Methods and techniques that enable computers to improve their performance through their own experience"
## Why machine learning?
* Today, lots of data is collected everywhere (Big Data)
* There are huge data centers
# Concept learning and the General-to-Specific Ordering
Concept learning is about inferring a boolean-valued function from training examples of input and output.
We may for instance be trying to learn whether a certain person will be enjoying sports on a particular day given different attributes of the weather. Let us call this the _EnjoySports_ target concept.
## A concept learning task
Let's take a look at the _EnjoySports_ target concept. When learning the target concept, the learner presented a set of training examples. Each instance is represented by the attributes Sky, AirTemp, Humidity, Wind, Water and Forecast.
|| Example || Sky || AirTemp || Humidity || Wind || Water || Forecast || EnjoySport ||
|| 1 || Sunny || Warm || Normal || Strong || Warm || Same || Yes ||
|| 2 || Sunny || Warm || High || Strong || Warm || Same || Yes ||
|| 3 || Rainy || Cold || High || Strong || Warm || Change || No ||
|| 4 || Sunny || Warm || High || Strong || Cool || Change || Yes ||
### The inductive learning hypothesis
Our learning task is to determine a hypothesis $h$ which is identical to the target concept $c$ over the entire set of instances, $X$. However, our only information about the target concept $c$ is our knowledge of it through the training examples themselves. (If we had already known the true $c$, there wouldn’t be much of a point to learning it.)
In order to determine what hypothesis will work best on unseen examples, we have to make an assumption (i.e. we can’t really prove it): What works best for the training examples will work best for the unseen examples. This is the _fundamental assumption of inductive learning_.
> __The inductive learning hypothesis__: Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.
## Hypothesis space
## Candidate elimination algorithm
This is an important part of the curriculum. Learn it.
## Inductive bias
TL;DR: Assumptions that you need to make in order to generalize
An inductive bias is the set of assumptions needed to take some new input data, which the system has _not_ encountered before, and predict the output. E.g: a self-driving car may have been trained to avoid cats in the road, but suddenly it encounters a dog. What should it do?
If the system does not have an inductive bias, then it has not learned anything other than reacting to the specific examples that it has been trained on. In that case, it is basically just a [key-value store](https://en.wikipedia.org/wiki/Key-value_database) for its training examples. An inductive bias is essential to machine learning.
There are two types of bias: __Representational bias__ and __Search bias__. For example, in decision tree learning, there is no representational bias (because anything can be represented), but there is search bias. The search is greedy and ID3 keeps the most important information high up in the tree tends to produce the shortest possible decision tree. The candidate elimination algorithm has representational bias because it cannot represent anything. It assumes that the solution to the problem can be expressed as a conjunction of concepts.
# Decision Tree Learning
> Decision tree learning is a method for approximating discrete-valued target functions in which the learned function is represented by a decision tree.

_This flowchart could be reshaped to a decision tree. (CC-BY-NC 2.5 Randall Munroe)._
## Overfitting
TL;DR: A hypothesis overfits training data if there is an alternative hypothesis that generalizes better.
Decision trees can run into the problem of overfitting. This usually occurs when there is noise in the training data (which is often the case).
Techniques for reducing overfitting:
* Stop growing then data split is not statistically significant
* Post prune decision tree. Minimum Description Length principle for pruning decision trees.
* Select the best tree by measuring performance over validation data set
## Cross validation
# Evaluating Hypotheses
# Bayesian Learning
# Computational Learning Theory
# Instance-Based Learning
# Learning Sets of Rules
# Analytical Learning
# Case–Based Reasoning
# Deep Learning
# Ensemble Methods
# Support Vector Machines