TDT4300: Data Warehousing and Data Mining
This is a summary of the textbook "Data Mining" by Tan.
#Data Mining, TAN
##chapter 1 INTRODUCTION
###chapter 1.1 What is data mining?
__Data mining__
is the process of automatically discovering useful information in large repositories.
__Knowledge discovery in databases (KDD)__
is the overall process of converting raw data into useful information. Data mining is an integral part of KDD.
__Preprocessing__
feature selection, dimensionality reduction, normalization, data subsetting. Transforms the data into an appropriate format, reduces noise, etc.
__Postprocessing__
filtering patterns, visualization, pattern interpretation. Only useful and valid results are incorporated into a decision support system (DSS).
###chapter 1.2 Motivating challenges
__Scalability __
Algorithms must support growing data sets.
__High dimensionality __
algorithms must handle the increased complexity that comes with high dimensionality (a large number of features).
__Heterogeneous and complex data __
need techniques for heterogeneous attributes and complex data models.
__Data ownership and distribution __
sometimes the data needed is not owned or stored in one location/organization, which requires distributed data mining techniques. Challenges: how to reduce the amount of communication needed, how to effectively consolidate the results, and data security issues.
__Non-traditional analysis __
non-traditional methods, types of data, and data distributions.
###chapter 1.3 The origins of data mining
Data mining draws upon ideas such as sampling, estimation, and hypothesis testing from statistics, and search algorithms, modeling techniques, and learning theories from AI, pattern recognition, and machine learning. Other areas: optimization, evolutionary computing, information theory, signal processing, visualization, information retrieval.
Others: database systems, (parallel) computing (for massive data sets), distributed techniques.
###chapter 1.4 Data mining tasks
__two categories:__
_Predictive tasks:_ the objective is to predict the value of a particular attribute (the target/dependent variable) based on the values of other attributes (the explanatory/independent variables).
_Descriptive tasks:_ the objective is to derive patterns (correlations, trends, clusters, trajectories, anomalies) that summarize underlying relationships in the data.
__Predictive modeling __
building a model for the target variable as a function of the explanatory variables. Two types of predictive modeling tasks:
Classification, for discrete target variables.
Regression, for continuous target variables.
The model should minimize the error between the predicted and the true values of the target variable.
__Association analysis __
used to discover patterns that describe strongly associated features in the data. Patterns represented as implication rules or feature subsets. Extract the most interesting patterns in an efficient manner.
__Anomaly detection __
the task of identifying observations whose characteristics are significantly different from the rest of the data. The observations are called anomalies/outliers. The goal is to find them and not label normal objects as anomalous.
##chapter 2 DATA
###chapter 2.1 Types of data
A data set is a collection of data objects (records).
Data objects are described by a number of attributes (variables).
__definition Attribute:__
is a property or characteristic of an object that may vary, either from one object to another (eye color) or from one time to another (temperature).
__Definition Measurement scale:__
is a rule (function) that associates a numerical or symbolic value with an attribute of an object.
The values used to represent an attribute may have properties that are not properties of the attribute itself, and vice versa.
__Attribute types __
(see table 2.2 p.26): nominal and ordinal (categorical/qualitative), interval and ratio (quantitative, numeric).
Permissible transformations are transformations that do not change the meaning of the attribute (table 2.3 p.27).
__Number of values__
Describing attributes by the number of values
- Discrete attributes have a finite or countably infinite set of values.
- Special case: binary attributes
- Continuous attributes have real-number values (temperature, height, weight), typically represented as floating-point variables.
The number of values can be combined with the attribute types, but not all combinations make sense.
Example: count attributes are both discrete (number of values) and ratio (attribute type).
__Asymmetric attributes __
only presence (non-zero attribute values) is regarded as important. Example: whether a student takes a course (1) or not (0) at a university. There are far fewer 1's than 0's, so it is smarter to focus on the 1's. Special case: asymmetric binary attributes, which are important for association analysis. Also: asymmetric discrete and asymmetric continuous attributes.
__Data sets __
general characteristics
- Dimensionality
- the number of attributes that the objects in the data set possess. The curse of dimensionality refers to the problems that arise with high dimensionality; dimensionality reduction is then needed in the preprocessing phase.
- Sparsity
- for sparse data, only the non-zero values need to be processed and stored.
- Resolution
- data can be captured at different resolutions; patterns may be invisible at one resolution (e.g., buried in noise) and visible at another.
__Transaction (market basket) data __
a special type of record data, where each record (transaction) involves a set of items, like the contents of a supermarket basket.
__The data matrix __
if the data objects in a collection all have the same fixed set of numeric attributes, then the data objects can be thought of as points (vectors) in a multidimensional space, where each dimension represents a distinct attribute describing the object. The collection is an m × n matrix with one row for each object (m rows) and one column for each attribute (n columns). Special case: the sparse data matrix, where only non-zero attribute values are important. A document-term matrix is a sparse data matrix of documents where each word is an attribute and the order of the words is not important.
__Graph-based data__
A graph is a powerful and convenient representation of data.
_data with relationships among objects_
Data is often represented as a graph: data objects are mapped to nodes, and relationships between objects are captured by the edges (possibly with direction and weight). Example: the Web, where pages link to other pages.
_data with objects that are graphs_
If objects have structure (the object contains sub-objects with relationships) they are frequently represented as graphs. Example molecules. Substructure mining is a branch of data mining.
__Ordered data__
Attributes have relationships that involve time or space.
_Sequential data (temporal data)_
extension of record data where each record has a time associated with it. Example a transaction of groceries with timestamp.
_Sequence data_
consists of a data set that is a sequence of individual entities (sequence of words, letters, etc.). Not timestamps but a position in an ordered sequence. Example DNA sequences (ATGC).
_Time series data_
A special case of sequential data, where each record is a time series (a series of measurements over time). Example: average monthly temperatures of Minneapolis.
Temporal autocorrelation is important to consider: if two measurements are close in time, their values are often similar.
_Spatial data_
Positions, areas, etc.
Example: weather data (several attributes for each geographical location). spatial autocorrelation (close points have same values).
__Handling non-record data__
Most data mining algorithms are designed for record data. For non-record data, extract features from the data objects and create a record for each object. This does not capture all relationships (e.g., temporal ones).
###chapter 2.2 Data Quality
Because preventing data quality problems is typically not an option, data mining focuses on
- the detection and correction of data quality problems
- the use of algorithms that can tolerate poor data quality.
The first step, detection and correction, is called data cleaning.
__Measurement and data collection issues__
Data are not perfect. Human error, limitations of measuring devices, flaws in data collection process, missing data/data sets, spurious or duplicate objects.
_Measurement and data collection error_
A common problem: the recorded value differs from the true value.
For continuous attributes, the numerical difference between the measured and true value is called the error.
Data collection errors include omitting data objects or attribute values, and inappropriately including a data object.
Noise and artifacts:
Noise is the random distortion of a value or the addition of spurious objects.
Techniques from signal processing can be used to discover patterns lost in the noise.
Complete elimination of noise is difficult, so data mining algorithms must be robust and produce good results even when noise is present.
Artifacts are deterministic distortions of data, such as a streak in the same place on a set of photographs.
precision, bias, accuracy:
In statistics, quality of measurement process and resulting data is measured by precision and bias.
precision: the closeness of repeated measurements (of the same quantity) to one another.
Bias: a systematic variation of measurements from the quantity being measured.
Accuracy: the closeness of measurements to the true value of the quantity being measured.
depends on precision and bias.
Significant digits, use only the number of digits specified in the data.
Outliers:
data objects that, in some sense, have characteristics that are different from most of the other data objects in the data set
values of an attribute that are unusual with respect to the typical values for that attribute
“anomalous” objects
noise and outliers are not the same
outliers can be legitimate
outliers can be of interest (fraud, network intrusion detection)
missing values:
information not collected, information people refuse to provide (age, weight), or fields that only apply conditionally on a form.
should be taken into account in data analysis.
strategies for dealing with missing data
eliminate data objects or attributes with missing values
(-) lose important information.
(+) simple and effective.
can be done with care (if the data set has only a few objects missing data, it can be expedient).
estimate missing values (reliably)
interpolation with remaining values
ignore the missing values during analysis
if the total number of attributes is high enough
if the number of missing values is low
Inconsistent values:
Detect and, if possible, correct.
It can be necessary to consult with an external source of information.
Duplicate data:
or almost duplicates.
To detect and eliminate, two main issues must be addressed:
if there are two objects that actually represent a single object, then the values of corresponding attributes may differ, and these inconsistent values must be resolved.
care needs to be taken to avoid accidentally combining data objects that are similar, but not duplicates. Such as two distinct people with identical names. Can cause problems for algorithms.
Deduplication = dealing with the two issues.
__Issues related to applications__
- Timeliness
- some data age as soon as they are collected (snapshots of ongoing processes)
- Relevance
- Make sure that the objects in the data set are relevant. Common problem: sampling bias
- occurs when a sample does not contain the different types of objects in proportion to their actual occurrence in the population.
- example: survey data describes only those who respond.
- Knowledge about the data
- data that have documentation.
- type of feature (nominal, ordinal, interval, ratio)
- scale of measurement (meters, feet, etc.)
- origin of data
###chapter 2.3 Data preprocessing
To make data more suitable for data mining.
- aggregation
- sampling
- dimensionality reduction
- feature subset selection
- feature creation
- discretization and binarization
- variable transformation
__Aggregation__
“Less is more”: combine two or more objects into one object.
Example: aggregate all transactions in a store on a single day into one combined transaction.
Issue: how should the data be combined?
The behaviour of groups of objects is more stable than that of individual objects,
but aggregation loses information about interesting details.
__Sampling__
Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed.
Statisticians use sampling because obtaining the entire data set of interest is too expensive and time consuming.
Data mining uses sampling because processing the entire data set is too expensive and time consuming.
If the sample is representative, it works almost as well as using the entire data set.
A sample is representative if it has approximately the same property of interest as the original data.
Sampling is a statistical process.
_Sampling approaches_
simple random sampling
- variation 1: sampling without replacement
- variation 2: sampling with replacement
- same object can be picked more than once
- easier to analyse since the size of the data set is constant
- can fail if population consists of different types of objects, with widely different numbers of objects. Might not get representation of all object types.
Stratified sampling
- starts with prespecified groups of objects.
- simple version: extract equal number of objects from each group
- complex version: extract the number of objects from a group that is proportional to the size of the group
progressive sampling
- the proper sample size can be difficult to determine, so adaptive or progressive sampling is sometimes used.
- start with a small sample and increase it until a sample of sufficient size has been obtained.
__Dimensionality reduction__
Benefits:
- many data mining algorithms work better if the dimensionality is lower
- reduce noise
- eliminate irrelevant features
- more understandable model
- more easily visualized
- amount of time and memory required by an algorithm is reduced
Dimensionality reduction usually refers to techniques that create new attributes that are combinations of the old attributes; reducing dimensionality by selecting a subset of the old attributes is instead called feature subset selection (next section).
Two important topics:
__1)the curse of dimensionality__
data analysis becomes significantly harder as the dimensionality of the data increases.
- classification:
- not enough data objects to allow the creation of a model that reliably assigns a class to all possible objects
- clustering:
- the definitions of density and the distance between points become less meaningful.
Therefore a lot of algorithms have trouble with high-dimensional data.
__2)Linear algebra techniques for dimensionality reduction__
Principal component analysis (PCA) (see the sketch after this list)
- technique that for continuous attributes finds new attributes that
- are linear combinations of the original attributes
- are orthogonal to each other
- capture the maximum amount of variation in the data
- Singular Value Decomposition (SVD)
- related to PCA
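A minimal PCA sketch (not from the book), assuming a small numeric data matrix `X` with one row per object: it finds orthogonal directions of maximum variance via the eigenvectors of the covariance matrix and projects the data onto the top k of them.

```python
# Minimal PCA sketch: project data onto the top-k principal components.
# Assumes a small numeric data matrix X (rows = objects, columns = attributes).
import numpy as np

def pca(X, k):
    """Return the data projected onto the k principal components."""
    X_centered = X - X.mean(axis=0)              # center each attribute
    cov = np.cov(X_centered, rowvar=False)       # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]            # sort components by variance
    components = eigvecs[:, order[:k]]           # top-k orthogonal directions
    return X_centered @ components               # new, lower-dimensional attributes

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca(X, k=1))                               # 5 objects reduced to 1 dimension
```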
__Feature Subset Selection__
Another way to reduce the dimensionality is to use only a subset of the features. This does not lose information if redundant and irrelevant features are present.
- redundant features
- duplicate much/all information contained in one or more other attributes.
- irrelevant features
- contain almost no useful information for the task at hand
Which features are redundant or irrelevant depends on the objective of the final data mining algorithm.
For n attributes there are 2^n possible subsets, so enumerating them all is impractical in most situations.
Three standard approaches:
1. embedded
1. occurs naturally as part of the data mining algorithm. the algorithm decides what attributes to use, and what to ignore.
2. filter
1. features are selected before the algorithm runs.
3. wrapper
1. use the target algorithm as a black box to find best subset, but does not enumerate all possible subsets.
_architecture for feature subset selection (filter/wrapper)_
feature selection process consists of:
- measure for evaluating a subset
- search strategy that controls the generation of a new subset of features
- stopping criterion
- validation procedure
Filter (evaluation independent of the target data mining algorithm) and wrapper (uses the target algorithm itself) approaches differ in the way they evaluate a subset of features.
Flowchart of a feature subset selection process (figure in the book).
The stopping criterion is often based on one or more conditions:
- the number of iterations
- whether the value of the subset evaluation measure is optimal or exceeds a threshold
- whether a subset of a certain size has been obtained
- whether the evaluation criteria have been achieved
Evaluation approaches:
- run the algorithm with both the full data set and the selected feature subset and compare the results
- run the algorithm with several different subsets and compare the results
_feature weighting_
more important features are assigned a higher weight. Sometimes based on domain knowledge, or can be determined automatically. Features with higher weighting plays a more important role in a model.
__Feature extraction__
Creation of new set of features from the original raw data.
Highly domain-specific. example image processing. Features and techniques to extract features are specific to fields.
_mapping the data to a new space_
Example: in a time series containing periodic patterns and a significant amount of noise, the patterns are hard to detect. If a Fourier transform is applied to the time series, the patterns can be detected because the underlying frequency information becomes explicit.
_feature construction_
If the original features are in a form not suitable for the data mining algorithm, one or more new features can be constructed out of the original features.
__Discretization and binarization__
Some algorithms require data to be categorical attributes (classification) or binary attributes (association patterns).
Continuous to categorical = discretization
Continuous/discrete to binary = binarization
If some categorical attribute has a large number of values, or if some values occur infrequently, it can be beneficial to reduce number of categories by combining values.
Binarization
Simple techniques:
- if there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m-1]. If the attribute is ordinal, the order must be maintained by the assignment.
- convert each of the m integers to a binary number.
- If only the presence of an attribute value is important, use asymmetric binary attributes: introduce one binary attribute for each of the m categorical values.
- Some attributes may need to be divided into two or more binary attributes. Example: gender, with one attribute that is 1 if female and another that is 1 if male.
discretization of continuous attributes:
Discretization for attributes used for classification or association analysis.
Transformation from continuous attribute to categorical attribute:
1. deciding how many categories, and how to map attribute-values to these categories
- after the values of the continuous attribute are sorted, divide them into n intervals by specifying n-1 split points.
- all the values in one interval are mapped to the same categorical value.
2. problem:
- how many split points?
- where to place them?
3. the result is represented as a set of intervals, or as a series of inequalities (x_0 < x ≤ x_1, etc.)
Unsupervised discretization:
Discretization methods are distinguished by whether class information is used (supervised) or not (unsupervised).
Unsupervised discretization uses relatively simple approaches, such as equal width (divide the range of the attribute into a user-specified number of intervals, each with the same width), which can be badly affected by outliers. Equal frequency (equal depth), which puts the same number of objects into each interval, is often preferred since it handles outliers better.
K-means clustering and visual inspection are other examples (see the sketch below).
Unsupervised discretization is usually better than no discretization.
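A minimal sketch (not from the book) contrasting equal-width and equal-frequency binning on a small array with one outlier; the values and bin counts are illustrative.

```python
# Unsupervised discretization sketch: equal width vs. equal frequency.
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100.0])   # note the outlier 100

def equal_width(x, n_bins):
    # split the range [min, max] into n_bins intervals of the same width
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.digitize(x, edges[1:-1])                # interval index per value

def equal_frequency(x, n_bins):
    # choose split points so each interval gets (roughly) the same number of values
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    return np.digitize(x, edges[1:-1])

print(equal_width(values, 3))      # the outlier forces most values into the first bin
print(equal_frequency(values, 3))  # bins stay balanced despite the outlier
```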
Supervised discretization
Use the end purpose in mind and use class labels. Produce better result than unsupervised. Things to consider: purity and minimum size of an interval.
Entropy of the ith interval: e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij}
where k is the number of class labels, m_i is the number of values in the ith interval, m_{ij} is the number of values of class j in the ith interval, and p_{ij} = m_{ij}/m_i is the fraction (probability) of class j in the ith interval.
Total entropy: e = \sum_{i=1}^{n} w_i e_i
where w_i = m_i/m is the fraction of values in the ith interval, m is the total number of values, and n is the number of intervals.
Entropy is a measure of the purity of an interval. If an interval contains only objects of the same class (pure), the entropy equals 0. If the classes of values in an interval occur equally often, the interval is as impure as possible, and the entropy is maximum.
A simple approach to partitioning starts by bisecting the initial (ordered) values so that the two resulting intervals give minimum entropy. The interval with the highest entropy is then split again, and this is repeated until a stopping criterion is reached.
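A minimal sketch of this bisection idea: for a sorted, labeled attribute, try every split point and keep the one with minimum weighted entropy. The toy `values`/`labels` data is illustrative.

```python
# Entropy-based supervised discretization sketch: find the single best split point.
import math
from collections import Counter

def entropy(labels):
    m = len(labels)
    return -sum((c / m) * math.log2(c / m) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (total entropy, split point) with minimum weighted entropy."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        w_left, w_right = len(left) / len(pairs), len(right) / len(pairs)
        total = w_left * entropy(left) + w_right * entropy(right)
        split = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or total < best[0]:
            best = (total, split)
    return best

values = [1, 2, 3, 10, 11, 12]
labels = ['a', 'a', 'a', 'b', 'b', 'b']
print(best_split(values, labels))   # -> (0.0, 6.5): a pure split between 3 and 10
```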
_problem: categorical attributes sometimes have too many values_
Values can be joined together based on domain knowledge; if that does not help or no such knowledge exists, use a more empirical approach, such as grouping values together only if the grouping improves classification accuracy (or a similar measure).
__Variable transformation__
A transformation that is applied to all the values of a variable. For each object, the transformation is applied to the value of the variable for that object.
_Simple functions_
A simple mathematical function is applied to each value individually.
Variable transformation should be done with caution, since it changes the nature of the data in ways that should be fully appreciated. Example: 1/x increases the magnitude of values in (0, 1) but decreases the magnitude of values greater than 1.
_Normalization or standardization_
Common type of variable transformation. The goal is to make an entire set of values have a particular property.
Example: comparing two people based on income and age. Income varies much more than age and would dominate the comparison unless the variables are standardized.
The mean and standard deviation are strongly affected by outliers; the mean is therefore sometimes replaced by the median (middle value), and the standard deviation by the absolute standard deviation (see the sketch below).
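A minimal sketch contrasting ordinary z-score standardization with a more outlier-robust variant built from the median and the average absolute deviation (used here as a stand-in for the absolute standard deviation); the income values are illustrative.

```python
# Standardization sketch: classic z-score vs. a more outlier-robust variant.
import numpy as np

income = np.array([30_000, 35_000, 40_000, 45_000, 1_000_000.0])  # outlier present

def z_score(x):
    return (x - x.mean()) / x.std()                 # sensitive to the outlier

def robust_standardize(x):
    med = np.median(x)
    aad = np.mean(np.abs(x - med))                  # average absolute deviation
    return (x - med) / aad

print(z_score(income))
print(robust_standardize(income))                   # ordinary incomes stay comparable
```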
###chapter 2.4 Measures of similarity and dissimilarity
Used by a number of data mining techniques; clustering, nearest neighbour classification, anomaly detection.
In many cases initial data set is not needed after similarities or dissimilarities have been computed.
__Proximity__
Refers to similarity or dissimilarity.
__Similarity__
between two objects is a numerical measure of the degree to which the two objects are alike.
0 = no similarity
1 = complete similarity.
__Dissimilarity__
between two objects is a numerical measure of the degree to which two objects are different. Distance is a synonym. [0,1] or [0, infinity].
__Transformations__
often applied to convert a similarity to a dissimilarity (or vice versa):
- d = 1 - s and s = 1 - d
- for s > 0: s = -d and d = -s (negation)
Transformations are also used to map a measure into a specific range, e.g., [0, 1]:
- s' = (s - min_s)/(max_s - min_s). Example: s' = (s - 1)/9, where min_s = 1 and max_s = 10.
__Similarity and dissimilarity between simple attributes__
For objects having a single nominal attribute: s = 1 if the values match, s = 0 if they don't; d = 0 if they match, d = 1 if they don't.
single ordinal attribute. If poor = 0, fair = 1, ok = 2, good = 3, wonderful = 4, then d(good, ok) = 3-2 = 1.
downside: assumes equal distance between the values.
(See Table 2.7, p. 69.)
__Dissimilarities between data objects__
distance
The Euclidean distance d between x and y is given by
d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}
where n is the number of dimensions and x_k and y_k are the kth attributes of x and y.
The Euclidean distance is generalized by the Minkowski distance metric:
d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}
where r is a parameter. The three most common examples:
1. r = 1: city block (Manhattan, taxicab, L1 norm) distance.
2. r = 2: Euclidean distance (L2 norm).
3. r = infinity: supremum (L_max or L_infinity norm) distance, the maximum difference between any attribute of the two objects:
d(x, y) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r} = \max_k |x_k - y_k|
Properties of distances:
positivity
d(x,y) >= 0 for all x and y
d(x,y) = 0 only if x = y
symmetry
d(x,y) = d(y,x) for all x and y
triangle inequality
d(x,z) <= d(x,y) + d(y,z) for all points x, y, z
Measures that satisfy all three properties are called metrics (see the distance sketch below).
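A minimal sketch of the Minkowski distance for the three cases named above; the vectors x and y are illustrative.

```python
# Minkowski distance sketch: r = 1 (city block), r = 2 (Euclidean), r -> inf (supremum).
import numpy as np

def minkowski(x, y, r):
    x, y = np.asarray(x, float), np.asarray(y, float)
    if r == float('inf'):
        return np.max(np.abs(x - y))               # supremum (L_max) distance
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

x, y = [0, 2], [3, 6]
print(minkowski(x, y, 1))             # 3 + 4 = 7        (city block)
print(minkowski(x, y, 2))             # sqrt(9 + 16) = 5 (Euclidean)
print(minkowski(x, y, float('inf')))  # max(3, 4) = 4    (supremum)
```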
__Similarities between data objects__
For similarities the triangle inequality often does not hold.
__Examples of proximity measures__
Simple matching coefficient
SMC = (number of matching attribute values) / (number of attributes) = (f_11 + f_00) / (f_01 + f_10 + f_11 + f_00)
treats presences and absences equally.
Jaccard coefficient
handles objects consisting of asymmetric binary attributes. Example if number of true is very little combined with false.
J = (number of matching presences) / (number of attributes not involved in 0-0 matches) = f_11 / (f_01 + f_10 + f_11)
Example: Jaccard and SMC
x = (1,0,0,0,0,0,0,0,0,0)
y = (0,0,0,0,0,0,1,0,0,1)
f_01 = 2, f_10 = 1, f_00 = 7, f_11 = 0
SMC = (0 + 7)/(2 + 1 + 0 + 7) = 0.7
J = 0/(2 + 1 + 0) = 0
Cosine similarity
Document vectors typically have thousands of attributes (terms), most of which are zero. Like the Jaccard measure, cosine similarity does not depend on (it ignores) 0-0 matches, but it must also be able to handle non-binary vectors.
If x and y are two document vectors, then
cos(x, y) = (x · y) / (||x|| ||y||)
where x · y is the vector dot product and ||x|| is the length (Euclidean norm) of x.
(See p. 75 for a worked example; a sketch follows below.)
if cos = 0, the angle is 90 degrees, and x and y do not share any terms.
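A minimal sketch (not from the book) computing SMC, Jaccard, and cosine similarity; the binary vectors reproduce the worked example above, and the non-binary vectors are illustrative.

```python
# Proximity measure sketch: SMC, Jaccard, and cosine similarity.
import math

def smc(x, y):
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

def jaccard(x, y):
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    non00 = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return f11 / non00 if non00 else 0.0

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(smc(x, y), jaccard(x, y))            # 0.7 and 0.0, matching the example above
print(cosine((3, 2, 0, 5), (1, 0, 0, 0)))  # cosine also handles non-binary vectors
```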
Extended Jaccard coefficient (Tanimoto coefficient)
useful for document data; reduces to the Jaccard coefficient in the case of binary attributes.
EJ(x, y) = (x · y) / (||x||^2 + ||y||^2 - x · y)
##chapter 3 EXPLORING DATA
Data exploration
The preliminary investigation of the data in order to better understand its specific characteristics. It can aid in preprocessing and data analysis techniques. Help in data mining when patterns can be found visually. Understand and interpret results.
This chapter overlaps with Exploratory Data Analysis (EDA), which was created in the 1970s.
###chapter 3.2 Summary statistics
Summary statistics are quantities that capture various characteristics of a potentially large set of values with a single number or a small set of numbers. Mean, standard deviation etc.
Frequency
For unordered categorical values, frequency is almost the only useful thing to compute.
frequency(v_i) = (number of objects with attribute value v_i) / m
where m is the number of objects.
Mode
the mode of a categorical attribute is the value that has the highest frequency. May contain information about the presence of missing values (heights in cm and not in mm, or if a value is used to indicate a missing value).
Percentiles
Ordered data.
Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile x_p is a value of x such that p% of the observed values of x are less than x_p. (Example on p. 101.)
Measures of location
Mean
mean(x) = \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i
The middle of a set of values, if the values are distributed in a symmetric manner. If not, use median.
Sensitive to outliers.
Trimmed mean, when a percentage p of the top and bottom are trimmed, and mean is calculated as usual. (median is trimmed mean where p=100%).
Median
median(x) = x_{(r+1)} if m is odd, i.e., m = 2r + 1 (the middle value)
median(x) = \frac{1}{2}(x_{(r)} + x_{(r+1)}) if m is even, i.e., m = 2r (the average of the two middle values)
Not so sensitive to outliers.
Measures of spread
Measures if widely spread or relatively concentrated around a single point.
range
Simplest measure of spread.
range(x) = max(x) - min(x) = x_{(m)} - x_{(1)}
in the set {x1, x2, …, xm}.
Not good if most of the values are concentrated, but there are some extreme min and max outliers.
variance
variance(x) = s_x^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2
Standard deviation (sx) is the square root of the variance.
Variance is computed with mean, and is therefore also sensitive to outliers.
More robust measures:
AAD: average absolute deviation
AAD(x) = \frac{1}{m} \sum_{i=1}^{m} |x_i - \bar{x}|
MAD: median absolute deviation
MAD(x) = median({|x_1 - \bar{x}|, ..., |x_m - \bar{x}|})
IQR: interquartile range
IQR(x) = x_{75%} - x_{25%} (see the sketch after this list)
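A minimal sketch computing the location and spread measures above on a small sample with one outlier, to show which measures are robust; the quantile-based trimmed mean is an approximation, and the values are illustrative.

```python
# Summary statistics sketch: location and spread measures on data with an outlier.
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 90.0])          # 90 is an outlier

mean, median = x.mean(), np.median(x)
trimmed_mean = x[(x >= np.quantile(x, 0.1)) & (x <= np.quantile(x, 0.9))].mean()
variance = x.var(ddof=1)                           # 1/(m-1) definition from above
aad = np.mean(np.abs(x - mean))                    # average absolute deviation
mad = np.median(np.abs(x - median))                # median absolute deviation
iqr = np.quantile(x, 0.75) - np.quantile(x, 0.25)  # interquartile range

print(mean, median, trimmed_mean)  # mean is pulled toward 90, median/trimmed mean are not
print(variance, aad, mad, iqr)     # variance explodes; MAD and IQR stay small
```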
Multivariate summary statistics
measures of location can be obtained by computing mean and median separately for each attribute. For multivariate data, it can be computed independent of each other.
For continuous variables, the spread is most commonly captured with covariance matrix S.
s_{ij} = covariance(x_i, x_j)
covariance(x_i, x_j) = \frac{1}{m-1} \sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)
where x_{ki} is the value of the ith attribute for the kth object. The variances lie along the diagonal.
Covariance is the measure of how two attributes vary together and depends on the magnitudes of the variables.
Near 0, not have a relationship.
Correlation measure how strong two attributes are linearly related. Preferred for data exploration.
correlation(x_i, x_j) = covariance(x_i, x_j) / (s_i s_j), where s_i and s_j are the standard deviations of x_i and x_j.
The diagonal entries of the correlation matrix are 1; the other entries vary between -1 and 1. (See the sketch below.)
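A minimal sketch of the covariance and correlation matrices using numpy; the data values (heights and weights) are illustrative.

```python
# Covariance and correlation matrix sketch: one row per object, one column per attribute.
import numpy as np

data = np.array([[170, 65],      # e.g. height (cm), weight (kg) per object
                 [180, 80],
                 [160, 55],
                 [175, 72.0]])

S = np.cov(data, rowvar=False)       # covariance matrix; variances on the diagonal
R = np.corrcoef(data, rowvar=False)  # correlation matrix; diagonal entries are 1

print(S)
print(R)                             # off-diagonal entries lie between -1 and 1
```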
Other ways to summarize data
Skewness
The degree to which the values are asymmetrically distributed around the mean.
###chapter 3.3 Visualization
the display of information in a graphic or tabular format. Visual data mining.
Motivation for visualization
use information that is “locked up in people’s heads”.
Maps of world and temperature.
eliminate patterns and focus on the important ones.
General concepts
Representation: Mapping data to graphical elements
First step.
Objects usually represented as:
if only a single categorical attribute is being considered
categories based on values of the attribute are displayed as an entry in a table or an area on a screen.
if an object has multiple attributes
objects can be represented as a row (/column) in a table or as a line in a graph
if an object is a point in a 2- or 3-dimensional space
points can be represented as a geometric figure
circle, cross, box…
__Techniques__
based on number or type of attributes involved, or if the data has some special characteristics.
visualizing small numbers of attributes
Stem and leaf plots
one-dimensional integer or continuous data.
Split the values into groups, where each group contains those values that are the same except for the last digit
each group becomes a stem
last value in a group is a leaf
example: 35, 36, 42, 51.
stems: 3, 4, 5
leaves: 5, 6, 2, 1.
_Histogram_
Stem and leaf plots are a type of histogram where values are divided into bins.
For categorical data, each value is a bin. If too many bins, combine.
for continuous attributes, the range of values is divided into bins of (often) equal width.
Representation:
Each bin is represented by a bar. Bar plot.
Variations:
relative (frequency) histogram - change in scale in y axis.
pareto histogram - categories are sorted by count. Decreases from left to right.
two-dimensional histograms - each attribute is divided into intervals, and the two sets of intervals define two-dimensional rectangles of values; can show how the values of two attributes co-occur. (Figure on p. 114.)
_Box plots_
(Figure on p. 115.) Outliers are marked with “+”. Box plots are compact, and many can be shown in the same plot.
_Pie chart_
typically used with categorical attributes with relatively small number of values. not popular in technical publications because the size of relative areas can be hard to judge.
_Cumulative distribution function (CDF) - percentile plots_
For each value in a statistical distribution, a CDF shows the probability that a point is less than that value. For each observed value, an empirical CDF shows the fraction of points that are less than this value. (Figure on p. 117.)
_Scatter plots_
each data object is plotted as a point in the plane using the values of the two attribute as x and y coordinates.
main uses for scatter plots:
graphically show the relationship between two attributes. scatter plots can also be used to detect non-linear relationships, either directly or using a scatter plot of the transformed attributes.
when class labels are available, they can be used to investigate the degree to which two attributes separate the classes.
_visualizing spatio-temporal data_
when data has spatial or temporal attributes.
The earth is a spatial grid.
various points in time.
Data may have only a temporal component, such as time series data that give the daily prices of stocks.
_contour plots_
For three-dimensional data where the third dimension is a continuous value (temperature, elevation), a contour plot breaks the plane (the first two dimensions) into separate regions where the value of the third dimension is roughly the same.
surface plot
the third attribute is the height above the plane (x, y).
Require the z to be defined for all combinations of x,y.
_Vector field plot_
some characteristics have both a magnitude and a direction; a vector field plot displays both. (Figure on p. 123.)
_Lower-dimensional slices_
When a spatio-temporal data set is recorded over time = 4 dimensions. Not easily displayed. Separate slices can be displayed (example one for each month).
_Animation_
To deal with slices of data. Display successive two-dimensional slices of data.
__visualizing higher-dimensional data__
Can show higher dimensions, but show only some aspects of the data.
_Matrices_
An array of values. If class labels are known, the data should be reordered so that similar class labels are grouped together. If different attributes have different ranges, the data is often standardized to have a mean of 0 and a standard deviation of 1.
If class labels are not known, matrix reordering or seriation can be used to rearrange the rows and columns of the similarity matrix, so that groups of highly similar objects and attributes are together and can be visually identified.
_parallel coordinates_
one coordinate axis for each attribute; the axes are parallel to one another instead of perpendicular (like x and y axes).
An object is represented as a line (not a point).
- each attribute value is plotted as a point on its axis, and the points are connected to form the line
- (figure on p. 127)
_star coordinates and chernoff faces_
encode objects as glyphs or icons.
star coordinates
- uses one axis for each attribute, and radiate from a center point.
- typically, all attribute values are mapped to the range [0, 1].
chernoff face
- program to map values to features of a face.
- (figure on p. 129)
they do not scale well, and are of limited use in data mining.
Do’s and Don’ts
Guidelines: the ACCENT principles for effective graphical display (D. A. Burn).
__A__pprehension
- ability to correctly perceive relations among variables. Does the graph maximize apprehension of the relations among variables?
__C__larity
- ability to visually distinguish all the elements of a graph. are the most important elements or relations visually most prominent?
__C__onsistency
- ability to interpret a graph based on similarity to previous graphs. are the elements, symbol shapes, and colors consistent with their use in previous graphs?
__E__fficiency
- ability to portray a possibly complex relation in as simple a way as possible. are the elements of the graph economically used? is the graph easy to interpret?
__N__ecessity
- the need for the graph, and the graphical elements. Is the graph a more useful way to represent the data than alternatives? are all the graph elements necessary to convey the relationships?
__T__ruthfulness
- ability to determine the true value represented by any graphical element by its magnitude relative to the implicit or explicit scale. Are the graph elements accurately positioned and scaled?
_Tufte’s guidelines for graphical excellence:_
- is the well-designed presentation of interesting data - a matter of substance, of statistics and of design.
- consists of complex ideas communicated with clarity, precision, and efficiency.
- is that which gives the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
- is nearly always multivariate.
- requires telling the truth about the data.
###chapter 3.4 OLAP and multidimensional data analysis
viewing data sets as multidimensional arrays.
OLAP = On-Line Analytical Processing.
OLAP systems have a strong focus on the interactive analysis of data and typically provide extensive capabilities for visualizing the data and generating summary statistics.
Representing data as a table
each row is an object, each column is an attribute.
(Multidimensional table example on p. 132.)
General case for multidimensional data:
starting point:
- tabular representation of the data = fact table.
step 1:
- identification of the dimensions and identification of an attribute that is the focus of the analysis.
- The dimensions are categorical attributes, or continuous attributes converted into categorical ones.
- The size of the dimension is the number of attribute values.
- Content of each cell = target quantity.
step 2:
- each row is mapped to a cell. The indices of the cell are specified by the values of the attributes (dimensions). The value of the cell is the value of the target attribute. Cells not defined assumes to be 0.
Analyzing multidimensional data
- data cubes: computing aggregate quantities
- aggregation is important for finding totals/summaries of data, and for avoiding too many entries when representing/computing.
- There are a number of different aggregations (totals) that can be computed for a multidimensional array, depending on how many attributes we sum over.
- For n dimensions, summing over k of them gives \binom{n}{k} arrays of totals, each with dimensionality n - k.
data cube
a multidimensional representation of the data, together with all possible totals (aggregates).
The size of each dimension does not need to be equal.
Can have more or less than 3 dimensions.
Generalization of cross-tabulation (statistical terminology).
Dimensionality reduction
Eliminating a dimension by aggregating over it collapses the cells along that dimension into a single cell. If s_j is the number of possible values of the jth dimension, the number of cells is reduced by a factor of s_j.
Pivoting is aggregating over all dimensions except two; the result is a two-dimensional cross-tabulation.
Slicing and dicing
slicing is selecting a group of cells from the entire multidim. array by specifying a specific value for one or more dimensions.
dicing is selecting a subset of cells by specifying a range of attribute values. (selecting a subset from the complete array).
Roll-up, drill-down
data categories can be viewed as hierarchies. A chair is a piece of furniture, and chairs can be subdivided into lounge chairs, sitting chairs, armchairs, etc.
Roll-up aggregates data up into a single higher-level category (e.g., from days to one month).
Drill-down splits data from a single category into several lower-level categories (e.g., from a month to days). (See the sketch below.)
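A minimal OLAP-style sketch (not from the book) using pandas: a small fact table is pivoted into a cross-tabulation, rolled up from days to months, and sliced; all column names and values are made up for illustration.

```python
# OLAP-style operations on a tiny fact table: pivot, roll-up, and slice.
import pandas as pd

fact = pd.DataFrame({
    "date":     pd.to_datetime(["2024-01-01", "2024-01-02", "2024-02-01", "2024-02-02"]),
    "product":  ["chair", "table", "chair", "chair"],
    "location": ["Oslo", "Oslo", "Trondheim", "Oslo"],
    "sales":    [3, 1, 5, 2],
})

# Pivot: aggregate over everything except two dimensions -> 2-D cross-tabulation.
print(pd.pivot_table(fact, values="sales", index="product",
                     columns="location", aggfunc="sum", fill_value=0))

# Roll-up: aggregate the date dimension from days up to months.
fact["month"] = fact["date"].dt.to_period("M")
print(fact.groupby(["month", "product"])["sales"].sum())

# Slice: fix one dimension to a single value.
print(fact[fact["location"] == "Oslo"])
```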
##chapter 4 CLASSIFICATION: BASIC CONCEPTS, DECISION TREES, AND MODEL EVALUATION
__Classification__
The task of assigning objects to one of several predefined categories (mapping input attribute set x into its class y).
A problem that encompasses many diverse applications.
Detecting spam messages
medical results
galaxy-classifications based on shape
###chapter 4.1 Preliminaries
Record
The input for a classification task is a collection of records.
Each record (instance, example) is characterized by a tuple (x, y).
x = attribute set
y = special attribute, class label
Definition classification:
The task of learning a target function f that maps the attribute set x to one of the predefined class labels y.
f is known as a classification model. Useful for
Descriptive modeling
serves as an explanatory tool to distinguish between objects of different classes.
Predictive modeling
predicts the class label of unknown records.
###chapter 4.2 General approach to solving a classification problem
Classification technique / classifier
To build a classification model from an input data set.
Each technique employs a learning algorithm to find the best model for x and y.
Training set
a set where labels are known. Is used to build a classification model which is applied to the test set. A test set has unknown labels.
Confusion matrix
counts of correct and incorrect results from the test records.
Performance metric
a single number summarizing the information in the confusion matrix.
Accuracy
= number of correct predictions / total number of predictions
Error rate
= number of wrong predictions / total number of predictions
###chapter 4.3 Decision tree induction
decision tree is a classifier (classification technique).
Root node
0 incoming edges, 0 or more outgoing edges
Internal nodes
1 incoming edge, two or more outgoing edges
Leaf (terminal) nodes
1 incoming edge, 0 outgoing edges
Root and internal nodes contains attribute test conditions.
__Algorithms for making decision trees:__
_Hunt’s algorithm_
basis for: ID3, C4.5, CART
grow a tree recursively.
- Step 1:
- if all records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
- Step 2:
- If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition, and Dt is distributed to the children based on the outcomes.
- Algorithm is then applied to each child node.
Will work if every combination of attribute values is present in the training data, and each combination has a unique class label.
Additional conditions:
- a child node created in step 2 can be empty if no records are associated with it; the node is then declared a leaf with the majority class label of its parent's training records.
In step 2, if all records associated with Dt have identical attribute values, it is not possible to split those records any further; the node is declared a leaf with the majority class label of the associated records.
Design issues of Decision tree induction
Learning algorithm should address these issues:
How should the training records be split?
How should the splitting procedure stop?
Methods for expressing attribute test conditions
Attribute types:
- binary
- must generate two potential outcomes
- nominal
- multiway split (1 to several)
- binary split (OR - conditions)
- Ordinal
- binary or multiway splits must keep the order of the values (e.g., {small, medium} vs. {large, extra large} is valid, but {small, large} vs. {medium, extra large} is not)
- continuous
- can be expressed as a comparison test (A < v) or (A >= v) with binary outcomes, or as ranges for a multiway split.
Measures for selecting the best split
The measures are often based on the degree of impurity of the child nodes: the smaller the degree of impurity, the more skewed the class distribution.
A class distribution of (0, 1) has zero impurity;
a class distribution of (0.5, 0.5) has the highest impurity.
(p. 158)
Entropy, Gini, and classification error are examples of impurity measures for binary classification; the book shows a graph comparing them.
Gain
The difference between the impurity of the parent node (before splitting) and the weighted impurity of the child nodes (after splitting). The larger the difference, the better the test condition. Used to decide how well a test condition performs.
Decision tree algorithms often use the test condition that maximizes the gain. Since I(parent) is the same for all test conditions, maximizing the gain is equivalent to minimizing the weighted average impurity of the child nodes.
When entropy is used as the impurity measure, the difference in entropy is known as information gain.
(p. 160; see the sketch below)
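A minimal sketch of the impurity measures and the gain computation described above; the parent and child class counts are illustrative.

```python
# Impurity and gain sketch: entropy, Gini, classification error, and the gain of a split.
import math

def entropy(counts):
    m = sum(counts)
    return -sum((c / m) * math.log2(c / m) for c in counts if c > 0)

def gini(counts):
    m = sum(counts)
    return 1 - sum((c / m) ** 2 for c in counts)

def classification_error(counts):
    m = sum(counts)
    return 1 - max(counts) / m

def gain(parent_counts, children_counts, impurity=entropy):
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * impurity(child) for child in children_counts)
    return impurity(parent_counts) - weighted     # information gain when entropy is used

parent = [10, 10]                 # class distribution (0.5, 0.5): maximum impurity
split = [[9, 1], [1, 9]]          # a fairly good split into two child nodes
print(entropy(parent), gini(parent), classification_error(parent))  # 1.0 0.5 0.5
print(gain(parent, split))        # > 0: the split reduces the weighted impurity
```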
Splitting binary attributes
splitting of nominal attributes
splitting of continuous attributes
Gain ratio
Impurity measures (entropy, Gini index) tend to favor attributes with a large number of distinct values.
Strategies to avoid test conditions that produce many small but pure partitions (compare gender vs. car type vs. customer ID):
1. restrict test conditions to binary splits only.
- employed by several decision tree alg.
2. modify the splitting criterion to take into account the number of outcomes produced by the attribute test condition.
- gain ratio = information gain / split info, where split info = -\sum_{i=1}^{k} P(v_i) \log_2 P(v_i) over the k outcomes.
(Missing: p. 164, the TreeGrowth algorithm (decision tree induction) and the example on Web robot detection.)
__Characteristics of decision tree induction__
1. nonparametric approach for building classification models. Does not require any prior assumptions.
2. Finding an optimal decision tree is an NP-complete problem. Many alg are heuristic.
3. Inexpensive techniques make it possible to quickly construct models even when training set is large.
4. Easy to interpret.
5. provide an expressive representation for learning discrete-valued functions.
6. Decision tree algorithms are quite robust to the presence of noise, especially when methods for avoiding overfitting are employed.
7. Redundant attributes do not adversely affect the accuracy of decision trees (long item, see p. 169).
8. The number of records becomes smaller as one traverses down the tree. The data fragmentation problem arises when the number of records at a leaf node is too small to make a statistically significant decision.
9. subtree can be replicated multiple times in a D.T. Makes the D.T more complex and more difficult to interpret.
10. border between two regions of different classes is known as decision boundary.
1. Oblique decision trees allow more than one attribute in a test condition.
2. Constructive induction creates composite attributes representing arithmetic or logical combinations of the existing attributes, partitioning the data into homogeneous, non-rectangular regions.
11. choice of impurity method has little effect on the performance of the decision tree induction algorithms because the measures are so consistent with each other.
###chapter 4.4 Model overfitting
errors by a classification model divided into 2 types:
1. training errors (resubstitution / apparent error)
- number of misclassification errors committed on training records.
2. generalization errors.
- the expected error of the model on previously unseen records.
Model overfitting
When a model has low training error but higher generalization error than another model with higher training error.
Also, when the number of nodes becomes too high, the test error begins to increase even though the training error continues to decrease (see p. 174).
Model underfitting
High error when the tree size is small. The model has yet to learn the true structure of the data.
Handling the overfitting in DT induction
strategies:
pre-pruning (early stopping rule)
the algorithm is halted before generating a fully grown tree that perfectly fits the entire data set. A sufficiently restrictive stopping condition is required: too high a threshold results in underfitting, too low a threshold may not sufficiently overcome overfitting.
Post-pruning
the tree is initially grown to its full size, followed by a tree-pruning step that trims the tree in a bottom-up fashion,
replacing a subtree with a new leaf node or with the most frequently used branch of the subtree.
Pruning terminates when no further improvement is observed.
###chapter 4.5 Evaluating the performance of a classifier
Holdout method
two sets: training set, test set.
Classification model is induced from the training set.
Performance evaluated on test set.
Limitations:
fewer labeled examples are available for training, since some are withheld for testing.
the model may be highly dependent on the composition of the training and test sets: the smaller the training set, the larger the variance of the model; conversely, a larger training set leaves a smaller, less reliable test set.
Random subsampling
Repeating the holdout method several times. Overall accuracy: acc_sub = \frac{1}{k} \sum_{i=1}^{k} acc_i
Problems: it does not utilize as much data as possible for training, and there is no control over how many times each record is used for training and testing.
Cross-validation
each record is used the same number of times for training and exactly once for testing.
Divide the data into k equal-sized partitions (k-fold). In each run, one partition is used for testing and the rest are used for training. This is repeated k times so that each partition is used exactly once for testing.
Special case: k = N, the leave-one-out approach.
bootstrap
Unlike the methods above, bootstrap does not assume that the training records are sampled without replacement (i.e., that there are no duplicate records in the training and test sets).
In bootstrap, the records are placed back in the pool, and are as likely as the others to be drawn.
The probability that a specific record is chosen for a bootstrap sample is 1 - (1 - 1/N)^N; when N is large, this approaches 1 - e^{-1} = 0.632.
Accuracy (.632 bootstrap): acc_boot = \frac{1}{b} \sum_{i=1}^{b} (0.632 \cdot \epsilon_i + 0.368 \cdot acc_s)
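A minimal sketch (not from the book) of k-fold cross-validation and the .632 bootstrap accuracy formula, using scikit-learn's KFold and a 1-nearest-neighbor model purely as an example classifier; the tiny data set and the out-of-bag evaluation of ε_i are illustrative assumptions.

```python
# Evaluation sketch: k-fold cross-validation and .632 bootstrap accuracy.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0], [1], [2], [3], [10], [11], [12], [13]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def cross_val_accuracy(model, X, y, k=4):
    accs = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model.fit(X[train_idx], y[train_idx])
        accs.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))
    return np.mean(accs)                      # each record is tested exactly once

def bootstrap_632_accuracy(model, X, y, b=20):
    rng = np.random.default_rng(0)
    model.fit(X, y)
    acc_s = np.mean(model.predict(X) == y)    # accuracy on the full training set
    accs = []
    for _ in range(b):
        idx = rng.integers(0, len(X), len(X))          # sample with replacement
        oob = np.setdiff1d(np.arange(len(X)), idx)     # records not drawn (out-of-bag)
        if len(oob) == 0:
            continue
        model.fit(X[idx], y[idx])
        eps = np.mean(model.predict(X[oob]) == y[oob])
        accs.append(0.632 * eps + 0.368 * acc_s)       # the .632 bootstrap formula
    return np.mean(accs)

model = KNeighborsClassifier(n_neighbors=1)
print(cross_val_accuracy(model, X, y))
print(bootstrap_632_accuracy(model, X, y))
```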
##chapter 5 CLASSIFICATION: ALTERNATIVE TECHNIQUES
Nearest-neighbor classifiers
From chapter 4, a two-step process for classification:
an inductive step for constructing a classification model from data,
a deductive step for applying the model to test examples.
eager learners:
Decision tree and rule-based classifiers.
Designed to learn a model that maps the input attributes to the class label as soon as the training data becomes available.
Lazy learners:
delay the process of modelling the training data until it is needed to classify the test examples.
Example: the rote classifier, which memorizes the entire training data and performs classification only if the attributes of a test instance match one of the training examples exactly. Drawback: some records may not be classified.
Approach: Nearest neighbor
find all the training examples that are relatively similar to the attributes of the test example; nearest neighbors. Can use them to determine class labels of the test example.
“if it walks like a duck, quacks like a duck, and looks like a duck, then it’s probably a duck”.
Algorithm
Computation can be costly if the set of training examples is large. However, efficient indexing techniques are available to reduce the amount of computation needed to find the nearest neighbors of a test example.
Once the list is obtained, the test example classified based on the majority class of its nearest neighbors:
Majority voting: y' = \arg\max_v \sum_{(x_i, y_i) \in D_z} I(v = y_i)
where v is a class label, y_i is the class label of one of the nearest neighbors (D_z is the set of the k nearest neighbors of the test example), and I(·) is an indicator function that returns 1 if its argument is true and 0 otherwise. Every neighbor has the same impact, which makes the classifier sensitive to the choice of k.
To reduce the impact of k, the influence of each neighbor x_i can be weighted according to its distance:
w_i = 1/d(x', x_i)^2, so that neighbors farther away from the test example x' have less impact.
Distance-weighted voting: y' = \arg\max_v \sum_{(x_i, y_i) \in D_z} w_i \cdot I(v = y_i)
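A minimal nearest-neighbor sketch implementing plain majority voting and distance-weighted voting as defined above; the training points and test points are illustrative.

```python
# Nearest-neighbor sketch: majority voting vs. distance-weighted voting.
import math
from collections import defaultdict

def knn_predict(train, test_point, k=3, weighted=False):
    """train is a list of (attribute_vector, class_label) pairs."""
    dist = lambda a, b: math.dist(a, b)                     # Euclidean distance
    neighbors = sorted(train, key=lambda xy: dist(xy[0], test_point))[:k]
    votes = defaultdict(float)
    for x_i, y_i in neighbors:
        w = 1.0 / (dist(x_i, test_point) ** 2 + 1e-12) if weighted else 1.0
        votes[y_i] += w                                     # w_i * I(v = y_i)
    return max(votes, key=votes.get)                        # argmax over class labels

train = [((1, 1), 'A'), ((1, 2), 'A'), ((2, 1), 'A'), ((6, 6), 'B'), ((7, 7), 'B')]
print(knn_predict(train, (2, 2), k=3))                 # 'A' by simple majority
print(knn_predict(train, (5, 5), k=5))                 # 'A': plain majority (3 of 5)
print(knn_predict(train, (5, 5), k=5, weighted=True))  # 'B': closer neighbors dominate
```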
Characteristics of nearest-neighbor classifiers
part of a more general technique known as instance-based learning (make predictions without making an abstraction(model)).
require a proximity-measure to determine the similarity or distance between instances
require a classification function that returns the predicted class of a test instance based on its proximity to other instances.
Lazy learners do not require model building, but classifying a test example can be expensive because the proximity between the test example and each training example must be computed individually.
Eager learners spend a lot of resources on building the model, but classify test examples quickly.
Make their predictions based on local information,
trees and rule-based classifiers try to find a global model that fits the entire input space.
Nearest neighbor classifiers (with small k) are quite susceptible to noise.
can produce arbitrarily shaped decision boundaries. They provide a more flexible model representation compared to eager methods (often constrained to rectilinear decision boundaries).
boundaries have high variability because they depend on the composition of training examples. Increasing the number of nearest neighbors may reduce such variability.
can produce wrong predictions unless appropriate proximity measure and data preprocessing steps are taken.
such as weight (on the order of tens of kg) and height (on the order of a couple of meters): weight will dominate the proximity measure unless the difference in scale is taken into account.
##chapter 6 ASSOCIATION ANALYSIS: BASIC CONCEPTS AND ALGORITHMS
TID = transaction ID
Association analysis
useful for discovering interesting relationships hidden in large datasets. Relationships can be represented as association rules.
###chapter 6.1 Problem definition
Binary representation
market basket represented in a binary format. each transaction is a row, and a present item = 1, non-present = 0. Ignores the quantity, price etc.
Asymmetric binary value
an item that is more important when present than absent.
itemset
I = {i_1, i_2, ..., i_d} is the set of all items in the data. A collection of zero or more items is called an itemset; if an itemset contains k items, it is called a k-itemset. {beer, diapers, milk} is a 3-itemset.
Transaction set
T = {t1,t2,...,tn} is the set of all transactions. ti contains a subset of items chosen from I.
Support count
The number of transactions that contain a particular itemset.
\sigma(X) = |\{ t_i \mid X \subseteq t_i, t_i \in T \}|
association rule
an implication expression of the form X → Y, where X and Y are disjoint itemsets. The strength of a rule can be measured with its support and confidence.
support
how often a rule is applicable to a given data set.
support: s(X → Y) = \sigma(X \cup Y) / N
Used to eliminate rules with low support (most likely to be random and uninteresting).
Confidence
how frequently items in Y appears in transactions that contain X.
confidence: c(X → Y) = \sigma(X \cup Y) / \sigma(X)
Measures the reliability of the inference made by the rule; it gives the conditional probability of Y given X. (See the sketch below.)
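A minimal sketch computing support and confidence from a small set of illustrative transactions, following the σ, s, and c definitions above.

```python
# Support and confidence sketch for market basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)     # sigma(X)

def support(X, Y):
    return support_count(X | Y) / len(transactions)         # s(X -> Y)

def confidence(X, Y):
    return support_count(X | Y) / support_count(X)          # c(X -> Y)

rule = ({"milk", "diapers"}, {"beer"})
print(support(*rule), confidence(*rule))   # 2/5 = 0.4 and 2/3 ≈ 0.67
```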
Association rule mining problem (definition):
given a set of transactions T, find all rules having support >= minsup and confidence >= minconf, where minsup and minconf are the corresponding support and confidence thresholds.
Brute-force (1)
compute support and confidence for all possible rules.
Expensive: there are exponentially many rules that can be extracted from a data set, R = 3^d - 2^{d+1} + 1, where d is the number of items in the data set.
###chapter 6.2 Frequent itemset generation
A data set containing k items can generate up to 2^k - 1 frequent itemsets (excluding the empty set).
Ways to reduce computational complexity:
reduce the number of candidate itemsets (M)
the Apriori principle eliminates some candidate itemsets without counting their support values
reduce the number of comparisons
instead of matching each candidate itemset against every transaction, use more advanced data structures to store the candidate itemsets or to compress the data set.
Apriori principle
if an itemset is frequent, then all of its subsets must also be frequent; and if an itemset is infrequent, then all of its supersets are infrequent too. Pruning based on this is called support-based pruning.
Anti-monotone property:
the support of an itemset never exceeds the support for its subsets. X is subset of Y, f(Y) must not exceed f(X). Anti-monotone (Downward closed).
Monotone property:
Monotone (upward closed) = if X is a subset of Y, f(X) must not exceed f(Y).
Apriori algorithm (s.337)
initially makes a single pass over the data set to determine the support of each item. Upon completion of this step, the set of all frequent 1-itemset F1 will be known.
Next, the algorithm iteratively generates new candidate k-itemsets using the frequent (k-1)-itemsets found in the previous iteration.
The algorithm then makes an additional pass over the data set to count the support of the candidates.
Then the algorithm eliminates all candidate itemsets whose support counts are less than minsup.
The algorithm terminates when there are no new frequent itemsets generated.
The frequent-itemset-generation part of the Apriori algorithm is a
level-wise algorithm
it traverses the itemset lattice one level at a time (from k = 1 to k = n).
generate-and-test
first generate candidate itemsets from frequent sets, and then test each candidate.
total number of iterations is kmax + 1 (kmax is the max size of the frequent itemsets).
List of requirements for an effective candidate generation procedure:
avoid generating too many unnecessary candidates.
unnecessary: at least one of its subsets is infrequent.
Anti-monotone property!
the candidate set is complete (no frequent itemsets are left out by the procedure).
a candidate itemset should not be generated more than once.
Brute-force method (2)
The computation needed for each candidate is O(k). The overall complexity is
O\left( \sum_{k=1}^{d} k \binom{d}{k} \right) = O(d \cdot 2^{d-1})
F_{k-1} × F_1 method (figure p. 340)
A frequent (k-1)-itemset is extended with a frequent 1-itemset. E.g., for k = 3, a frequent 2-itemset is extended with a frequent 1-itemset.
Produces O(|F_{k-1}| × |F_1|) candidate k-itemsets, where |F_j| is the number of frequent j-itemsets.
Overall complexity:
O\left( \sum_k k |F_{k-1}| |F_1| \right)
The method is complete.
The method does not prevent that an itemset is generated more than once.
can be prevented by keeping the items in lexicographic order and extending each (k-1)-itemset only with items that are lexicographically larger than all of its items:
{bread, diapers} with {milk}
not {diapers, milk} with {bread}
Better than brute-force, but still produce a lot of unnecessary candidates.
Heuristics are available to improve this: for example, every item in a candidate k-itemset that survives the pruning step must be contained in at least k-1 of the frequent (k-1)-itemsets; otherwise the candidate is guaranteed to be infrequent.
Fk-1 x Fk-1 method (figur s.342)
The procedure merges a pair of frequent (k-1)-itemsets only if their first k-2 items are identical.
Lexicographic ordering prevents duplicate candidates.
An additional candidate pruning step is needed to ensure that the remaining k-2 subsets of size k-1 are also frequent. (See the sketch below.)
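A minimal sketch (not the full Apriori algorithm) of the F_{k-1} × F_{k-1} candidate generation and pruning step; the frequent 2-itemsets in F2 are illustrative.

```python
# F_{k-1} x F_{k-1} candidate generation sketch: merge (k-1)-itemsets whose first
# k-2 items are identical (items kept in lexicographic order), then prune any
# candidate that has an infrequent (k-1)-subset.
from itertools import combinations

def generate_candidates(frequent_prev):
    """frequent_prev: set of frequent (k-1)-itemsets, each a sorted tuple."""
    prev = sorted(frequent_prev)
    k_minus_1 = len(prev[0])
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] == b[:-1]:                         # first k-2 items identical
                cand = tuple(sorted(set(a) | set(b)))    # merged k-itemset
                # prune: every (k-1)-subset must itself be frequent
                if all(s in frequent_prev
                       for s in combinations(cand, k_minus_1)):
                    candidates.add(cand)
    return candidates

F2 = {("beer", "diapers"), ("bread", "diapers"), ("bread", "milk"),
      ("diapers", "milk"), ("beer", "milk")}
print(generate_candidates(F2))   # the surviving candidate 3-itemsets
```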
Support counting
one approach: compare each transaction against every candidate itemset, and to update the support counts of candidates contained in the transaction.
expensive (computationally)
another approach: enumerate the itemsets contained in each transaction and use them to update the support counts of their respective candidate itemsets.
(Figure on p. 343 explaining the procedure.)
First enumerate the itemsets contained in the transaction, then increment the support count of each candidate itemset that matches one of them.
Hash tree function (figur s.344)
hash the candidate itemsets into buckets, and hash the itemsets of each transaction (of the same size) into the same buckets.
Matching then only needs to be done within each bucket.
(Read up on this, p. 344.)
Computational complexity
can be affected by following factors:
support threshold
lowering the threshold → more itemsets are frequent, so the computation becomes more complex.
number of items (dimensionality)
more space needed to store the support counts of items.
number of transactions
run time increases since it repeatedly passes over the data set
average transaction width
the maximum size of frequent itemsets tend to increase. Complexity increases.
more itemsets are contained in the transaction, number of hash tree traversals increases during counting.
Time complexity:
generation of frequent 1-itemsets
for each transaction, need to update the support count for every item present in the transaction.
w is average transaction width
N is total number of transactions
O(Nw) is the time the operation requires.
candidate generation
(not summarized; see p. 349)
support counting
cost of support counting: O\left( N \sum_k \binom{w}{k} \alpha_k \right)
where w is the maximum transaction width and \alpha_k is the cost of updating the support count of a candidate k-itemset in the hash tree.
###chapter 6.3 Rule generation
Each frequent itemset of size k can produce at most 2^k - 2 association rules.
Partition an itemset Y into non-empty subsets X and Y - X, such that X → Y - X satisfies the confidence threshold. (All such rules automatically satisfy the support threshold, since Y is frequent and, by the anti-monotone property, so are all of its subsets.)
Confidence of the rule X → Y - X: c = \sigma(Y) / \sigma(X)
Confidence-based pruning
No anti-monotone property for confidence.
Theorem:
If a rule X → Y-X does not satisfy the confidence threshold, then any rule X’ → Y - X’ where X’ is a subset of X, must not satisfy the confidence threshold as well.
Rule generation in apriori algorithm
(Insert the figure from p. 350.)
Algorithm: see the book.
Also read the example on p. 353. (A sketch follows below.)
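A simplified, level-wise sketch of confidence-based rule generation for a single frequent itemset, using the pruning theorem above; the support_counts table is an illustrative set of precomputed σ values, not the book's algorithm verbatim.

```python
# Rule generation sketch: produce rules X -> Y - X meeting a confidence threshold,
# growing consequents level by level and pruning consequents that fail.
support_counts = {
    frozenset({"milk"}): 4, frozenset({"diapers"}): 4, frozenset({"beer"}): 3,
    frozenset({"milk", "diapers"}): 3, frozenset({"milk", "beer"}): 2,
    frozenset({"diapers", "beer"}): 3,
    frozenset({"milk", "diapers", "beer"}): 2,
}

def rules_from_itemset(itemset, minconf):
    itemset = frozenset(itemset)
    rules, consequents = [], [frozenset({i}) for i in itemset]
    while consequents:
        next_level = set()
        for Y in consequents:
            X = itemset - Y
            if not X:
                continue
            conf = support_counts[itemset] / support_counts[X]
            if conf >= minconf:
                rules.append((set(X), set(Y), round(conf, 2)))
                next_level.update(frozenset(Y | {i}) for i in itemset - Y)
            # else: pruned; larger consequents built from Y cannot reach minconf
        consequents = list(next_level)
    return rules

print(rules_from_itemset({"milk", "diapers", "beer"}, minconf=0.6))
```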
###chapter 6.4 Compact representation of frequent itemsets
in practice, the number of frequent itemsets produced from a transaction data set can be very large.
identify a small representative set of itemsets from which all other frequent itemsets can be derived.
Maximal frequent itemsets
Definition: A maximal frequent itemset is defined as a frequent itemset for which none of its immediate supersets are frequent.
figure 6.16:
Itemsets are divided into frequent/infrequent itemsets.
{a,d} is maximal frequent because all of its immediate supersets ({a,b,d},{a,c,d},{a,d,e}) are infrequent.
An itemset is non-maximal if at least one of its immediate supersets is frequent.
Maximal frequent itemsets provide a compact representation because they form the smallest set of itemsets from which all frequent itemsets can be derived.
In figure 6.16 the maximal frequent itemsets are {a,c,e}, {a,d} and {b,c,d,e}.
Maximal frequent itemsets do not contain info about support of their subsets. Additional pass over data set is needed to determine support counts of the non-maximal frequent itemsets.
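A small sketch that extracts the maximal frequent itemsets from a given collection of frequent itemsets; the example mirrors the {a,d} / {a,c,e} itemsets mentioned above but is otherwise made up.

```python
def maximal_frequent(frequent_itemsets):
    """An itemset is maximal if no other frequent itemset is a proper superset of it."""
    freq = [frozenset(s) for s in frequent_itemsets]
    return [s for s in freq
            if not any(s < other for other in freq)]   # '<' is the proper-subset test

F = [{"a"}, {"d"}, {"a", "d"}, {"c"}, {"e"}, {"a", "c"},
     {"a", "e"}, {"c", "e"}, {"a", "c", "e"}]
print(maximal_frequent(F))   # the maximal ones: {a, d} and {a, c, e}
```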
Closed frequent itemsets
minimal representation without losing support information.
definition closed itemset: an itemset X is closed if none of its immediate supersets has exactly the same support count as X.
or: X is not closed if at least one of its immediate supersets has the same support count as X.
figure 6.17
definition closed frequent itemset: an itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.
The closed frequent itemsets can be used to determine the support counts of the non-closed frequent itemsets: the support of a non-closed itemset equals the largest support among its immediate supersets. For example, {a,d} must have the same support as {a,c,d}, since {a,c,d} has a larger support count than {a,b,d} and {a,d,e}.
figure 6.18
Closed frequent itemsets remove some of the redundant association rules. All maximal frequent itemsets are closed, because none of the maximal frequent itemsets can have the same support count as their immediate supersets.
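A sketch of how closed frequent itemsets are identified and how they preserve support information; the support counts below are hypothetical, chosen to mirror the {a,d} / {a,c,d} example above.

```python
def closed_frequent(support):
    """support: dict mapping frequent itemsets (frozensets) to support counts.
    An itemset is closed if none of its immediate supersets has the same support;
    the support of a non-closed itemset equals the largest support among its
    immediate supersets."""
    closed = {}
    for itemset, sup in support.items():
        supersets = [s for s in support
                     if len(s) == len(itemset) + 1 and itemset < s]
        if not any(support[s] == sup for s in supersets):
            closed[itemset] = sup
    return closed

support = {frozenset("ad"): 4, frozenset("acd"): 4,
           frozenset("abd"): 2, frozenset("ade"): 3}
print(closed_frequent(support))   # {a,d} is dropped: {a,c,d} has the same support
```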
###chapter 6.5 Alternative methods for generating frequent itemsets
Apriori - the earliest method and a significant performance improvement, but it still incurs considerable I/O overhead since it requires several passes over the data set. Performance can also degrade when the transaction width is high.
Traversal of itemset lattice
(tree from {} at the top to {a,b,c,d,e} at the bottom).
General-to-specific, specific-to-general, bidirectional:
Apriori uses a general-to-specific strategy: merging frequent (k-1)-itemsets to obtain candidate k-itemsets.
Alternatively, one can look at the most specific itemsets first and then at more general ones.
A bidirectional approach combines both, and can be used to find the frequent itemset border quickly.
figure 6.19
Equivalence classes:
Partition the lattice into disjoint groups of nodes (equivalence classes). A frequent itemset generation algorithm is applied within each class.
Equivalence classes can also be defined according to the prefix or suffix labels of the itemsets: first search for frequent itemsets with prefix a, then b, c, d, etc. A suffix-based search starts from the back.
figure 6.20
Breadth-first, depth-first:
Apriori is breadth-first: it starts with 1-itemsets, then 2-itemsets, etc.
Depth first is often used by algorithms designed to find maximal frequent itemsets. Finds the border more quickly.
figure 6.21
figure 6.22
Representation of transaction data set
choice of representation can affect the I/O cost.
Horizontal:
adopted by apriori etc.
vertical:
each item is stored with a list of the transaction IDs it occurs in (TID-list). The TID-lists may be too large to fit into memory, so more sophisticated methods are needed to compress them.
###chapter 6.6 FP-growth algorithm
Not generate-and-test. Instead it encodes the data set using a compact data structure called an FP-tree and extracts frequent itemsets directly from this structure.
FP-tree representation
Compressed representation of the input data. Read the transactions one at a time, and map each transaction onto a path in the tree. Paths may overlap due to similar items.
More overlap is more compression. If tree can fit in main memory, frequent data sets can be extracted directly without passes over data stored on disk.
figure 6.24 -- read descriptions 1-5 below it
Best case: all transactions contain the same items and there is only one path in the tree.
Worst case: none of the paths overlap. The pointers require space, so the tree needs more space in memory than the original data set.
Pointers (between nodes that have the same items) help to facilitate the rapid access of individual items in the tree.
If the items are ordered by decreasing support, the result is this tree:
figure 6.25
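A compact FP-tree construction sketch, assuming one counting pass followed by one insertion pass; the transactions are made up, and the header table with node-link pointers (the dashed lines in the figure) is omitted.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                        # item -> FPNode

def build_fp_tree(transactions, minsup):
    """Reorder each transaction by decreasing global item support, drop
    infrequent items, and map the transaction onto a path in the tree."""
    counts = Counter(item for t in transactions for item in t)
    root = FPNode(None)
    for t in transactions:
        items = sorted((i for i in t if counts[i] >= minsup),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:                        # shared prefixes overlap
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"}]
tree = build_fp_tree(transactions, minsup=2)
print({item: child.count for item, child in tree.children.items()})
# {'a': 3, 'd': 1} -- three of the four transactions share the 'a' prefix
```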
###chapter 6.7 Evaluation of association patterns
We need evaluation criteria to judge the quality of the association patterns produced by the algorithms.
First set of criteria derived from statistical arguments. Patterns that involve a set of mutually independent items or cover very few transactions are considered uninteresting because they may capture spurious relationships in the data. Patterns can be eliminated by applying an:
objective interestingness measure
uses statistics derived from data to determine if a pattern is interesting.
support, confidence, correlation.
Second set: subjective arguments. A pattern is considered subjectively uninteresting unless it reveals unexpected information about the data or provides useful knowledge that can lead to profitable actions.
{Butter} → {Bread} vs {Diapers} → {Beer}.
Requires a lot of domain knowledge.
Approaches for incorporating subjective knowledge into pattern discovery:
visualization - interpreting and verifying the discovered patterns.
template-based approach - only rules that satisfy a user-specified template are returned to the user.
subjective interestingness measure -
a subjective measure can be defined based on domain information such as a concept hierarchy or the profit margins of items; the measure can then be used to filter out patterns that are obvious or non-actionable.
Objective measures of interestingness
A data driven approach for evaluating the quality of association patterns. Domain-independent, requires minimal input from the users, other than a threshold.
An objective measure is usually computed based on the frequency counts tabulated in a contingency table:
(figure 6.7)
Limitations of the support-confidence framework
Support eliminates low support items, even though they might be of interest.
Limitation of confidence:
Example: 75% of tea drinkers also drink coffee, so {Tea} → {Coffee} seems like a good rule; but 80% of all people drink coffee regardless of whether they drink tea, so the rule is misleading.
(did not fully understand what is on p. 373)
Interest factor
The tea-coffee example shows that high-confidence rules can be misleading because they ignore the support of the rule consequent. This can be addressed by applying a metric called lift:
$\text{Lift} = \dfrac{c(A \rightarrow B)}{s(B)}$
which computes the ratio between the rule's confidence and the support of the itemset in the rule consequent.
For binary variables, lift is equivalent to the interest factor:
$I(A,B) = \dfrac{s(A,B)}{s(A) \times s(B)} = \dfrac{N f_{11}}{f_{1+} f_{+1}}$
….????....
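A small computation of lift / interest factor from a 2x2 contingency table; the counts are reconstructed from the 75% / 80% tea-coffee figures above and may not match the book's table exactly.

```python
def interest_factor(f11, f10, f01, f00):
    """I(A,B) = s(A,B) / (s(A)*s(B)) = N*f11 / (f1+ * f+1).
    f11: both A and B, f10: A only, f01: B only, f00: neither.
    I ~ 1: independent, I > 1: positive correlation, I < 1: negative correlation."""
    n = f11 + f10 + f01 + f00
    f1_plus = f11 + f10              # support count of A
    f_plus1 = f11 + f01              # support count of B
    return n * f11 / (f1_plus * f_plus1)

# A = tea, B = coffee: 150 drink both, 50 only tea, 650 only coffee, 150 neither.
# Confidence(tea -> coffee) = 150/200 = 0.75, s(coffee) = 0.8.
print(round(interest_factor(150, 50, 650, 150), 3))   # 0.938 < 1: negative correlation
```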
##chapter 8 CLUSTER ANALYSIS: BASIC CONCEPTS AND ALGORITHMS
Cluster analysis divide the data into groups (clusters) that are meaningful (should capture real structure of data) or/and useful (ex. for summarization).
Clustering: dividing the data into groups.
Classification: assigning particular objects to these groups.
Taxonomy: a hierarchical classification.
Fields for clustering use
biology: find groups of genes that have similar functions.
information retrieval: group search results into a small number of clusters.
Climate: find patterns in the atmospheric pressure and areas of ocean that have a significant impact on climate.
psychology and medicine: identify different types of depression. detect patterns in the spatial or temporal distribution of a disease.
business: segment customers into a small number of groups for additional analysis and marketing activities.
Cluster prototype
A data object that is representative for the other objects in the cluster. Used by several techniques.
Cluster analysis
The study of techniques for finding the most representative cluster prototypes:
summarization
Apply the analysis technique (algorithm) to a small set of only cluster prototypes. Should result in almost as good result as for whole set, but depends on accuracy and number of prototypes.
compression
a table is created with the prototype of each cluster (each prototype gets an integer index in the table). Each object is then represented by the index of the prototype associated with its cluster; this is called vector quantization. Used with image, sound and video data where
many of the data objects are highly similar to one another
some loss of information is acceptable
substantial reduction in the data size is desired
efficiently finding nearest neighbours
can require computing the pairwise distance between all points. Reduce number of computations by using the prototypes if objects are relatively close to their prototype.
if two prototypes are far apart, their objects cannot be nearest neighbours. The nearness is measured by the distance between their prototypes.
section 8.1 Overview
Cluster analysis
groups data objects based only on information found in the data that describes the objects and their relationships. The goal is that objects within a group are similar to each other and different from objects outside the group. The greater the similarity within a group and the greater the difference between groups, the better the clustering.
The definition of a cluster is imprecise, and it should depend on the nature of the data and the desired result.
Clustering could be regarded as a form of classification in that it creates a labeling of objects with class (cluster) labels. Cluster analysis is sometimes referred to as unsupervised classification (supervised is as in chapter 4).
Different types of clustering
Hierarchical versus partitional
Nested (hierarchical) clustering
if clusters are allowed to have subclusters, one gets a set of nested clusters organized as a tree.
often (but not always) the leaf nodes are singleton clusters of individual data objects.
Can be viewed as a sequence of partitional clusterings.
unnested (partitional) clustering
each object is in exactly one cluster(subset).
Exclusive versus overlapping versus fuzzy
exclusive clustering
assign each object to a single cluster. If it is reasonable to place an object in several clusters, it is addressed as non-exclusive clustering.
overlapping (non-exclusive) clustering
can simultaneously belong to more than one group (class).
a student can both be enrolled and an employee at an university.
rather than making an arbitrary assignment with placing an object in only one cluster, one can place the object in all of the “equally good” clusters.
fuzzy clustering
every object belongs to every cluster with a membership weight between 0 (does not belong) and 1 (absolutely belongs).
the weights are often required to sum to 1 for each object.
useful to avoid arbitrarily assigning an object to only one cluster when it may be close to several (without definitely belonging to several, as in the student/employee case).
Complete versus partial
complete clustering
assigns every object to a cluster. May be desired if the clustering is used to organize documents for browsing; then every document must be assigned to some cluster.
partial clustering
does not. Some data objects may not belong to well-defined groups. Some data objects may be noise, outliers or uninteresting background.
Different types of clusters
Well separated
data contains natural clusters that are quite far from each other. The distance between two points within a group is always smaller than the distance between two points from different groups.
Can have any shape (globular, etc.)
Prototype-based
cluster where each object is closer to the prototype in the cluster than to a prototype in a different cluster.
For continuous attributes, the prototype is often a centroid.
For categorical attributes, the prototype is often medoid (the most representative point in the cluster).
Center-based clusters: when the prototype is the most central point. tend to be globular clusters.
graph-based
when the data is represented as a graph with nodes and connections. A cluster is a group of connected nodes that has no connections to nodes outside the group.
Important example: contiguity-based clusters
two objects are connected only if they are within a specified distance of each other.
can have trouble when noise is present.
other example: clique
a set of nodes in a graph that are completely connected to each other. If we start by adding connections in order of increasing distance, a cluster is formed when a clique appears.
Tend to be globular.
Density-based
cluster is a dense region of objects that is surrounded by a region of low density.
Often employed when the clusters are irregular or intertwined, and when noise and outliers are present.
shared-property (conceptual clusters)
clusters defined as sets of objects that share some property. Encompasses all the previous definitions and includes new types of clusters. The process of finding such new (rare) types is called conceptual clustering.
Road map
K-means
a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids.
agglomerative hierarchical clustering
refers to a collection of closely related clustering techniques that produce a hierarchical clustering by starting with each point as a singleton cluster and merge.
DBSCAN
a density-based clustering algorithm that produces a partitional clustering in which the number of clusters is determined automatically by the algorithm. Points in low-density regions are classified as noise and omitted, so it does not produce a complete clustering.
figure 8.2
###chapter 8.2 K-means
K-means define a prototype in terms of a centroid (mean of a group of points), typically applied to objects in a continuous n-dimensional space.
K-medoid defines a prototype in terms of a medoid (most representative point for a group of points), can be applied to a wide range of data since it requires only a proximity measure for a pair of objects. Must be an actual data point.
Basic K-means algorithm
choose K initial centroids (user specified). K is the number of wanted clusters.
each point is then assigned to the closest centroid. Clusters are formed.
For each cluster the centroid is updated based on the points assigned to the cluster.
Repeat and update until no point changes clusters (the centroids remain the same).
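A minimal K-means sketch following the four steps above (random initialization, assignment, centroid update, repeat until stable); the data points are made up and squared Euclidean distance is used.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means on tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                       # step 1: initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                    # step 2: assign to closest
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        new_centroids = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
                         for i, c in enumerate(clusters)]   # step 3: recompute means
        if new_centroids == centroids:                      # step 4: stop when stable
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)]
print(kmeans(points, k=2)[0])   # two centroids, one per natural group
```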
Assigning points to the closest centroid
Proximity measures:
Euclidean (L2) distance, for data points in Euclidean space (more general)
cosine similarity, more appropriate for documents (more general)
Manhattan (L1) distance, for Euclidean data (more specialized)
Jaccard measure, for documents (more specialized)
In low-dimensional Euclidean space it is possible to avoid many of the computations. Bisecting K-means is another way to reduce the number of similarities computed.
Centroids and objective functions
need to specify an objective function and the proximity measure, then the centroids that we should choose can often be determined mathematically.
Data in Euclidean space
As the objective function we can choose the SSE (sum of squared errors), also known as scatter: compute the error of each data point (its Euclidean distance to the nearest centroid) and sum the squared errors.
If two different cluster sets from two different runs of K-means have different SSE, we choose the one with the lowest SSE (the centroids are better).
$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(c_i, x)^2$
dist is the standard Euclidean(L2) distance between two objects in Euclidean space.
(Equation 8.2, p. 500) The mean of the ith cluster: $c_i = \frac{1}{m_i} \sum_{x \in C_i} x$
The mean is the centroid that minimizes the SSE.
Document data
K-means can also be used on document data and with cosine similarity measure.
Cohesion: to maximize the similarity of the documents in a cluster to the cluster centroid.
The mean is the cluster centroid.
Total cohesion (eq. 8.3), the analogue of the SSE: $\text{Total Cohesion} = \sum_{i=1}^{K} \sum_{x \in C_i} \text{cosine}(x, c_i)$
The general case
| proximity function | centroid | objective function |
| --- | --- | --- |
| Manhattan (L1) | median | minimize the sum of the L1 distances of an object to its cluster centroid |
| squared Euclidean (L2 squared) | mean | minimize the sum of the squared L2 distances of an object to its cluster centroid |
| cosine | mean | maximize the sum of the cosine similarities of an object to its cluster centroid |
| Bregman divergence | mean | minimize the sum of the Bregman divergences of an object to its cluster centroid |
Bregman divergence is actually a class of proximity measures that includes the squared Euclidean distance, the Mahalanobis distance and cosine similarity.
Choosing initial centroid
random initialization: different runs produce different total SSEs. A common approach, but the resulting clusters are often poor.
approach: do different runs, select clusters with minimum SSE.
take a sample of points and cluster them using a hierarchical clustering technique. K clusters are extracted from the hierarchical clustering, and the centroids of those clusters are used as the initial centroids. practical only if:
sample is relatively small
K is relatively small compared to sample size.
Select first point at random, or take the centroid of all points. Then, for each successive initial centroid, select the point that is farthest from any of the initial centroids already selected. Randomly and well-separated centroids.
problem: outliers can be selected as initial centroids.
it is expensive to compute the farthest point from the current set of initial centroids.
the approach is therefore often applied to a sample of the points; outliers rarely show up in a sample since they are rare.
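A sketch of the farthest-point initialization on a random sample, as described above; the helper names and data points are illustrative.

```python
import random

def farthest_first_centroids(points, k, sample_size=None, seed=0):
    """First centroid at random; each subsequent centroid is the candidate point
    farthest from its nearest already-chosen centroid.  Working on a sample keeps
    the cost down and makes rare outliers unlikely to be picked."""
    rng = random.Random(seed)
    candidates = rng.sample(points, sample_size) if sample_size else list(points)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = [rng.choice(candidates)]
    while len(centroids) < k:
        next_c = max(candidates,
                     key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(next_c)
    return centroids

points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0)]
print(farthest_first_centroids(points, k=3))
```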
Time and space complexity
little space required (only data points and centroids are stored). O((m+K)n), where m is the number of points and n is the number of attributes.
The time required is also little. O(I*K*m*n) where I is the number of iterations required for convergence.
Additional issues
Handling empty clusters
if no points are allocated to a cluster during the assignment step. Need a replacement strategy:
choose the point that is farthest away from any current centroid
eliminates the point that contributes most to the SSE.
choose the replacement centroid from the cluster that has the highest SSE.
split cluster and reduce the overall SSE.
For several empty clusters, the procedure is repeated.
Outliers
Reduce the representativeness of the prototypes and increase the SSE.
Useful to discover and eliminate outliers beforehand. But not always (data compression, financial analysis etc.).
Can remove them afterwards by tracking the SSE contribution of every point over several runs and removing the points that consistently contribute the most.
Remove small clusters, since they frequently represent groups of outliers.
Reducing the SSE with Postprocessing
Find more clusters (higher K). But in most cases we want to improve SSE without increasing K. Often possible because K-means converge to a local minimum.
Techniques to fix up resulting clusters:
splitting or merging clusters
Escape local SSE minima and still produce a clustering solution with the desired number of clusters.
splitting phase:
strategy: split a cluster, e.g. the one with the largest SSE or the largest standard deviation for a particular attribute.
strategy: introduce a new cluster centroid, e.g. the point farthest from any cluster center, or a point chosen randomly among those with the highest SSE.
merge phase:
strategy: disperse a cluster. Accomplished by removing the centroid that corresponds to the cluster and reassigning its points to other clusters. Typically the cluster whose removal increases the total SSE the least is dispersed.
strategy: merge two clusters. Merge the two clusters with the closest centroids, or the two whose merger results in the smallest increase in total SSE.
Updating centroids incrementally
Centroids can be updated after each point is assigned instead of after all points have been assigned. Each assignment requires either zero or two centroid updates (the point either stays in its cluster (0) or moves to a new cluster (2)).
No empty clusters are produced.
Can give better accuracy and faster convergence if the relative weight of points is decreased over time, but it is difficult to choose the relative weight well.
Negative:
introduces order dependency (the order the points are introduced). Can be addressed by randomizing points.
more expensive.
Bisecting K-means
An extension of basic K-means. Simple idea: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.
algorithm 8.2, p. 509:
Ways to split:
can choose the largest cluster
can choose the one with largest SSE
can use a criterion based on both size and SSE.
Often the resulting clusters are refined by using their centroids as the initial centroids for the basic K-means algorithm. ?? (did not fully understand the rest of p. 509).
Recording the sequence of bisection, we can produce a hierarchical clustering.
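A rough bisecting K-means sketch that always splits the cluster with the largest SSE (one of the splitting criteria listed above); the small 2-means helper, trial counts and data points are all illustrative.

```python
import random

def mean(cluster):
    return tuple(sum(d) / len(cluster) for d in zip(*cluster))

def sse(cluster, centroid):
    return sum(sum((a - b) ** 2 for a, b in zip(p, centroid)) for p in cluster)

def two_means(points, trials=5, iters=20, seed=0):
    """Split one set of points into two clusters with basic K-means (K=2),
    keeping the trial with the lowest total SSE."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        c1, c2 = rng.sample(points, 2)
        part = None
        for _ in range(iters):
            a = [p for p in points
                 if sum((x - y) ** 2 for x, y in zip(p, c1)) <=
                    sum((x - y) ** 2 for x, y in zip(p, c2))]
            b = [p for p in points if p not in a]
            if not a or not b:
                break                                  # degenerate split
            part = (a, b)
            c1, c2 = mean(a), mean(b)
        if part is not None:
            total = sse(part[0], mean(part[0])) + sse(part[1], mean(part[1]))
            if best is None or total < best[0]:
                best = (total, part)
    return best[1]

def bisecting_kmeans(points, k):
    """Repeatedly pick the cluster with the largest SSE and split it in two
    until K clusters have been produced."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=lambda c: sse(c, mean(c)))
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return clusters

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8), (9.0, 0.0), (9.1, 0.2)]
print(bisecting_kmeans(points, k=3))
```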
K-means and different types of clusters
Difficulties with finding the natural clusters (when clusters have non-spherical shapes or widely different sizes or densities).
Can be mitigated if the number of clusters is increased, e.g. finding 6-7 clusters instead of 2-3. Each small cluster is then pure (contains only points from one of the natural clusters).
Strengths and weaknesses
strengths
simple
can be used for a wide variety of data types
efficient, even though multiple runs are often performed
bisecting K-means is even more efficient
weaknesses
can’t handle all types of data
non-globular clusters, different size and density etc.
difficulties with outliers
can help with detection and removal
restricted to data for which there is a notion of a center (centroid)
can use K-medoid, but is more expensive.
###chapter 8.3 Agglomerative Hierarchical Clustering
Old, but in widespread use.
Two basic approaches for hierarchical clustering:
Agglomerative starts with points as individual clusters, and at each step merge the closest pair of clusters. Requires definition of cluster proximity. (most common)
Divisive starts with one, all-inclusive cluster, and at each step split a cluster until only singleton clusters of individual points remain. Need to decide which cluster to split at each step and how to do the splitting.
Dendrogram
tree-like diagram used to display the hierarchical clustering. Shows the relationships and the order of the merging (or splitting).
(figure p. 516)
Nested cluster diagram
Another method of graphical representation. (figure p. 516)
Defining proximity between clusters
MIN
defines cluster proximity as the proximity between the closest two points that are in different clusters, (or the shortest edge between two nodes in different subsets of nodes).
Single link (similarities)
MAX
the proximity between the farthest two points in different clusters.
complete link (similarities)
Group average
defines cluster proximity to be the average pairwise proximities (average length of edges) of all pairs of points from different clusters.
figure 8.14, p. 517
Centroids
If we use centroids, the cluster proximity is defined as the proximity between cluster centroids.
Ward’s method
assumes that a cluster is represented by its centroid, but measures the proximity between two clusters in terms of the increase in SSE that results from merging them.
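A small sketch that computes the cluster proximities defined above (single link, complete link, group average, and centroid distance) for two example clusters; the point coordinates are made up.

```python
def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5    # Euclidean distance

def centroid(cluster):
    return tuple(sum(d) / len(cluster) for d in zip(*cluster))

def cluster_proximity(ci, cj, method):
    """'min' = single link, 'max' = complete link, 'avg' = group average,
    'centroid' = distance between the two cluster centroids."""
    pairs = [dist(p, q) for p in ci for q in cj]
    if method == "min":
        return min(pairs)
    if method == "max":
        return max(pairs)
    if method == "avg":
        return sum(pairs) / (len(ci) * len(cj))
    if method == "centroid":
        return dist(centroid(ci), centroid(cj))
    raise ValueError(method)

A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (4.0, 0.0)]
for m in ("min", "max", "avg", "centroid"):
    print(m, round(cluster_proximity(A, B, m), 3))
```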
Time and space complexity
Space
uses a proximity matrix: storage of $\frac{1}{2}m^2$ proximities is required (assuming the matrix is symmetric), where m is the number of data points.
The space needed to keep track of the clusters is proportional to the number of clusters, which is m-1 (excluding singleton clusters).
Total space: $O(m^2)$.
Time
$O(m^2)$ is required to compute the proximity matrix.
m-1 iterations to merge clusters (m clusters at the start, and two clusters are merged in each iteration).
Overall time: $O(m^2 \log m)$.
Specific techniques
Single link/MIN:
Add shortest links one at a time. Handles non-elliptical shapes well, but is sensitive to noise and outliers.
Complete link/MAX/CLIQUE
Cluster proximity is the distance between the two farthest points in different clusters; the pair of clusters with the smallest such distance is merged. (Do the example in the book.)
A group is not a cluster until all the points in it are completely linked (form a clique).
Group average
An intermediate approach between the single and complete link approaches.
cluster proximity: $\text{proximity}(C_i, C_j) = \dfrac{\sum_{x \in C_i,\, y \in C_j} \text{proximity}(x, y)}{m_i \cdot m_j}$
where $m_i$ and $m_j$ are the sizes of clusters $C_i$ and $C_j$.
Ward’s method and centroid methods
Proximity between two clusters is the increase in the squared error that results when two clusters are merged. Uses same objective function as K-means clustering.
Similar to the group average method when the proximity between two points is taken to be the square of the distance between them.
Possibility of inversions: two clusters that are merged may be more similar (less distant) than the pair of clusters that was merged in a previous step.
Key issues in hierarchical clustering
Lack of global objective function
the techniques decide locally, with different criteria at each step, which clusters should be merged.
Ability to handle different cluster sizes
applies only to proximity schemes that involve sums, such as Ward's method and group average.
how to treat the relative sizes of the pairs of clusters that are merged.
two approaches (applied to the points)
weighted: treats all clusters equally
(gives points in different clusters different weights)
unweighted: takes the number of points in each cluster into account
(gives points in different clusters the same weight)
unweighted is preferred unless there is a reason to do otherwise (e.g. when classes of objects have been sampled unevenly).
Merging decisions are final
once a decision to merge two clusters is made, it cannot be undone later. This prevents a local optimization criterion from becoming a global one. The clusters are not even stable: a point in one cluster may be closer to the centroid of a different cluster.
Some techniques try to improve on this by moving branches of the tree around to improve a global objective function, or by using a partitional clustering technique (e.g. K-means) to create many small clusters and then performing hierarchical clustering with these small clusters as the starting point.
Strengths and weaknesses
strengths
used because underlying application (creation of a taxonomy) requires a hierarchy.
Some studies suggest that these algorithms can produce better-quality clusters.
Weaknesses
Expensive (high storage and computational requirements).
All merges are final (especially problematic for noisy, high-dimensional data such as documents).
can be mitigated by first using K-means or another method (as described above).
###chapter 8.4 DBSCAN
density-based clustering locates regions of high density that are separated from one another by regions of low density. DBSCAN is simple and effective density-based clustering algorithm.
Traditional density: center-based approach
(other definitions of density are given in chapter 9 (not in the curriculum)).
Density is estimated for a particular point in the data set by counting the number of points within a specified radius, Eps, of that point. It includes the point itself.
Simple to implement. Needs a suitable Eps: the density of any point depends on the radius, ranging from 1 (only the point itself) if Eps is too small to m (all points) if Eps is too large.
Classification of points
interior of a dense region (core point)
if the number of points within a given neighborhood around the point, as determined by the distance function and a user-specified distance parameter Eps, exceeds a certain user-specified threshold MinPts.
on the edge of a dense region (border point)
is not a core point, but falls within the Eps-neighborhood of a core point.
sparsely occupied region (noise point)
any point that is neither a core point nor a border point.
DBSCAN algorithm
Any two core points that are close enough (within a distance Eps of one another) are put in the same cluster. Likewise, any border point that is close enough to a core point is put in the same cluster as the core point.
Noise points are discarded.
Label all points as core, border or noise points
eliminate noise points
put an edge between all core points that are within Eps of each other.
Make each group of connected core points into a separate cluster.
Assign each border point to one of the clusters of its associated core points.
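A minimal DBSCAN sketch following the five steps above; Eps, MinPts and the points are illustrative, and the naive neighborhood search corresponds to the O(m^2) worst case discussed below.

```python
def dbscan(points, eps, min_pts):
    """Returns one cluster label per point; None marks a noise point."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    n = len(points)
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]                       # includes the point itself
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = [None] * n

    cluster_id = 0                                        # connect core points
    for c in core:
        if labels[c] is not None:
            continue
        labels[c] = cluster_id
        stack = [c]
        while stack:
            for nb in neighbors[stack.pop()]:
                if nb in core and labels[nb] is None:
                    labels[nb] = cluster_id
                    stack.append(nb)
        cluster_id += 1

    for i in range(n):                                    # attach border points
        if labels[i] is None:
            for nb in neighbors[i]:
                if nb in core:
                    labels[i] = labels[nb]
                    break
    return labels

points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (5.0, 5.0), (5.0, 5.4), (9.0, 9.0)]
print(dbscan(points, eps=1.0, min_pts=2))   # e.g. [0, 0, 0, 1, 1, None] -- last point is noise
```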
Time and space complexity
Time
O(m x time to find points in the Eps-neighborhood), m is the number of points.
worst case $O(m^2)$.
In low-dimensional spaces, kd-trees can be used for efficient retrieval of all points within a given distance of a specified point; best case $O(m \log m)$.
Space
all dimensions: O(m).
Need only to keep small amount of data for each point (cluster label, identification of core/border/noise).
Selection of DBSCAN parameters
Basic approach: look at the behaviour of the distance from a point to its kth nearest neighbor (k-dist). It will be small for points inside a cluster and large for points that are not, but it varies with the density. Compute k-dist for all points, sort the values in increasing order and plot them; a sharp change in k-dist suggests a suitable value for Eps.
The value of k used for the selected Eps becomes MinPts (points with k-dist less than Eps become core points).
Clusters of varying density
DBSCAN can have trouble when the density of the clusters varies widely. If Eps is small enough to find the high-density clusters, the lower-density clusters are marked as noise; if Eps is large enough to find the low-density clusters, nearby high-density clusters may be merged into one.
Strength and weaknesses
strengths
density-based definition of a cluster
relatively resistant to noise
can handle arbitrary shapes and sizes
weaknesses
trouble with varying density
high-dimensional data (hard to define density)
can be expensive when the computation of nearest neighbors requires computing all pairwise proximities (high-dim data).
###chapter 8.5 Cluster evaluation
Cluster evaluation is not a well-developed or commonly used part of cluster analysis.
Important. Almost every clustering algorithm will find clusters in a data set, even if that data set has no natural cluster structure.
Important issues of cluster validation
1. Determining the cluster tendency of a set of data; distinguishing whether non-random structure exists in the data.
2. Determining the correct number of clusters.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
4. Comparing the results of a cluster analysis to externally known results, such as externally provided class labels.
5. Comparing two sets of clusters to determine which is better.
Issues 1, 2 and 3 use no external information (unsupervised techniques); 4 requires external information; 5 can be performed in either a supervised or an unsupervised way. Issues 3, 4 and 5 can be applied to the entire clustering or only to individual clusters.
Evaluation measures
unsupervised (internal indices)
measures the goodness of a clustering structure without respect to external information.
SSE.
Further divided into two classes
measures of cluster cohesion (tightness, compactness)
how closely related the objects in a cluster are
measures of cluster separation (isolation)
how distinct or well-separated a cluster is from other clusters.
Supervised (external indices)
measure the extent to which the clustering structure discovered by a clustering algorithm matches some external structure.
Entropy (cluster label match)
Relative
compare different clusterings or clusters.
Supervised/unsupervised evaluation measure with the purpose of comparison.
Use both SSE, Entropy etc.
Unsupervised cluster evaluation using cohesion and separation
many internal measures of cluster validity for partitional clustering schemes are based on the notions of cohesion and separation.
The overall cluster validity for a set of K clusters is a weighted sum of the validities of the individual clusters:
$\text{overall validity} = \sum_{i=1}^{K} w_i \, \text{validity}(C_i)$
Validity function can be cohesion, separation or some combinations of those quantities.
Cohesion: higher values are best
separation: lower values are best.
Graph-based view of cohesion and separation
cohesion of a cluster can be looked at as the sum of the weights of the links in the proximity graph that connect points within the cluster.
$\text{cohesion}(C_i) = \sum_{x \in C_i,\, y \in C_i} \text{proximity}(x, y)$
The separation between two clusters can be measured by the sum of the weights of the links from points in one cluster to points in the other cluster.
$\text{separation}(C_i, C_j) = \sum_{x \in C_i,\, y \in C_j} \text{proximity}(x, y)$
Prototype-based view of cohesion and separation
Cohesion can be defined as the sum of the proximities with respect to the prototype (centroid/medoid) of the cluster.
$\text{cohesion}(C_i) = \sum_{x \in C_i} \text{proximity}(x, c_i)$ (e.g. squared Euclidean distance)
Separation between two clusters is the proximity of the two cluster prototypes.
two measures of separation, where $c_i$ is the centroid of $C_i$ and c is the overall prototype:
$\text{separation}(C_i, C_j) = \text{proximity}(c_i, c_j)$
$\text{separation}(C_i) = \text{proximity}(c_i, c)$
figure 8.28, p. 538
Overall measures of cohesion and separation
Any unsupervised measure of cluster validity potentially can be used as an objective function for a clustering algorithm, and vice versa.
(some of this is on a separate sheet)
Unsupervised cluster evaluation using the proximity matrix
measuring cluster validity via correlation
evaluate the goodness by looking at the correlation between the actual similarity matrix and an ideal version of the similarity matrix based on the cluster labels.
Ideal cluster: similarity 1 to points in the cluster, 0 to points outside the cluster.
Matrix: one row and one column for each data point. Assign 1 to an entry if it belongs to the same cluster. 0 if not.
High correlation indicates that the points that belong to the same cluster are close to each other.
technique: order the similarity matrix with respect to cluster labels and then plot it.
computing the proximity matrix takes $O(m^2)$, where m is the number of objects. This is large, but sampling can be used; small clusters may need to be oversampled and large ones undersampled to obtain an adequate representation of all clusters.
unsupervised evaluation of hierarchical clustering
previous approaches: partitional clusterings.
cophenetic distance
the distance between two objects is the proximity at which an agglomerative hierarchical clustering technique puts the objects in the same cluster for the first time.
See the example on p. 545.
Cophenetic correlation coefficient (CPCC)
is the correlation between the entries of this matrix and the original dissimilarity matrix and is a standard measure of how well a hierarchical clustering fits the data.
To evaluate which type of hierarchical clustering is best for a particular type of data.
Better with higher values.
Determining the correct number of clusters
one can try to find the natural number of clusters in a data set by looking for the number of clusters at which there is a knee, peak or dip in the plot of the evaluation measure.
clusters may be intertwined or overlapping, or can be nested.
one can look at the average silhouette coefficient and the SSE together (p. 847) when caution is needed.
cluster tendency
to determine if a data set has clusters, one can try to cluster it.
but algorithms will find clusters even if the data has no natural ones.
use several algorithms; if some of them produce clusters of good quality, that supports the presence of real structure.
try to evaluate whether a data set has clusters without clustering.
use statistical tests for spatial randomness.
many approaches, most in low-dim Euclidean space.
Supervised measures of cluster validity
when external information is available, it is usual to measure the degree of correspondence between the cluster labels and the class labels.
The point of the comparison is to see whether a manual labeling could have been produced equally well by cluster analysis.
Two approaches:
Classification-oriented
use measures of classification (entropy, purity, F-measure, precision, recall, etc) to evaluate the extent to which a cluster contains objects of a single class.
entropy: the degree to which each cluster consists of objects of a single class (chapter 4).
purity: the extent to which a cluster contains objects of a single class.
$\text{purity} = \sum_{i=1}^{K} \frac{m_i}{m} p_i$
where $p_i = \max_j p_{ij}$ and $p_{ij}$ is the fraction of objects in cluster i that belong to class j.
Compare with entropy; the formula has the same structure, with $p_i$ replaced by the cluster entropy $e_i$.
precision: the fraction of a cluster that consists of objects of a specified class. The precision of cluster i with respect to class j is $\text{precision}(i,j) = p_{ij}$.
recall: the extent to which a cluster contains all objects of a specified class. $\text{recall}(i,j) = m_{ij}/m_j$, where $m_{ij}$ is the number of objects of class j in cluster i
and $m_j$ is the number of objects in class j.
F-measure: a combination of precision and recall that measures the extent to which a cluster contains only objects of a particlular class and all objects of that class.
$F(i,j) = \dfrac{2 \times \text{precision}(i,j) \times \text{recall}(i,j)}{\text{precision}(i,j) + \text{recall}(i,j)}$
See the example and table on p. 550.
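A sketch computing purity and the per-(cluster, class) precision / recall / F-measure from cluster and class label lists; the labels below are made up.

```python
from collections import Counter

def cluster_class_measures(cluster_labels, class_labels):
    """m_ij = number of objects of class j in cluster i."""
    m = len(class_labels)
    m_ij = Counter(zip(cluster_labels, class_labels))
    m_i = Counter(cluster_labels)                     # cluster sizes
    m_j = Counter(class_labels)                       # class sizes

    # purity = sum_i (m_i/m) * max_j p_ij = (sum of the majority counts) / m
    purity = sum(max(cnt for (ci, cj), cnt in m_ij.items() if ci == i)
                 for i in m_i) / m

    def precision(i, j): return m_ij[(i, j)] / m_i[i]
    def recall(i, j): return m_ij[(i, j)] / m_j[j]
    def f_measure(i, j):
        p, r = precision(i, j), recall(i, j)
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)

    return purity, f_measure

clusters = [0, 0, 0, 1, 1, 1]
classes = ["x", "x", "y", "y", "y", "x"]
purity, f = cluster_class_measures(clusters, classes)
print(round(purity, 3), round(f(0, "x"), 3))   # 0.667 0.667
```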
similarity-oriented
similarity measures for binary data (Jaccard measure etc.) to measure the extent to which two objects that are in the same class are in the same cluster and vice versa.
use matrices:
ideal cluster similarity matrix: has 1 in the ijth entry if two objects i and j, are in the same cluster. 0 otherwise.
ideal class similarity matrix: 1 in the ijth entry if two objects i and j, are in the same class. 0 otherwise.
compute the correlation between the two (how?)
Rand statistic and jaccard coefficient to measure cluster validity:
Rand statistic $= \dfrac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}}$
Jaccard coefficient $= \dfrac{f_{11}}{f_{01} + f_{10} + f_{11}}$

|  | same cluster | different cluster |
| --- | --- | --- |
| same class | $f_{11}$ | $f_{10}$ |
| different class | $f_{01}$ | $f_{00}$ |
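A sketch of the pair-counting comparison: count f11, f10, f01, f00 over all pairs of objects, then compute the Rand statistic and the Jaccard coefficient. The label lists are made up.

```python
from itertools import combinations

def rand_and_jaccard(cluster_labels, class_labels):
    """f11: same class & same cluster, f10: same class & different cluster,
    f01: different class & same cluster, f00: different class & different cluster."""
    f11 = f10 = f01 = f00 = 0
    for i, j in combinations(range(len(class_labels)), 2):
        same_class = class_labels[i] == class_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_class and same_cluster:
            f11 += 1
        elif same_class:
            f10 += 1
        elif same_cluster:
            f01 += 1
        else:
            f00 += 1
    total = f11 + f10 + f01 + f00
    return (f11 + f00) / total, f11 / (f01 + f10 + f11)

clusters = [0, 0, 0, 1, 1]
classes = ["x", "x", "y", "y", "y"]
print(rand_and_jaccard(clusters, classes))   # (0.6, 0.333...)
```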
Cluster validity for hierarchical clusterings
supervised evaluation of a hierarchical clustering is more difficult: a pre-existing hierarchical structure often does not exist.
Key idea
evaluate whether a hierarchical clustering contains, for each class, at least one cluster that is relatively pure and includes most of the objects of that class.
Compute F-measure for each class.
For each class, take the max F-measure for any cluster.
Calculate the overall F-measure for the hierarchical clustering by computing the weighted average of all per-class F-measures
weights based on class size
$F = \sum_{j} \frac{m_j}{m} \max_i F(i,j)$
max is taken over all clusters i, at all levels.
mj is the number of objects in class j.
m is the total number of objects.
Assessing the significance of cluster validity measures
Difficult to interpret the significance of the number/value obtained by a cluster validity.
purity: 1 good, 0 not good
Entropy: 0 good
SSE: 0 good.
There may not be a minimum or maximum value, and intermediate values need to be interpreted.
Can use a limit (absolute standard, tolerate a certain level of error etc.).
Can interpret the value of our validity measure in statistical terms.
the value is good if it is unusual (if it is unlikely to be the result of random chance).
When comparing two clusterings (a relative measure), the difference may not be significant; many would not regard a 0.1% difference as significant.
two aspects to this significance:
whether the difference is statistically significant (repeatable)
whether the magnitude of the difference is meaningful with respect to the application.