Data Mining Cheat Sheet
Introduction
Definition 1: The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.
Definition 2: Data mining is the process of automatically discovering useful information in large data repositories.
Directed Knowledge Discovery: Goal-oriented. Example: classification
Undirected Knowledge Discovery: Bottom-up approach, no target field. Example: clustering
Data Mining Tasks:
Classification, Regression, Clustering, Association Rule Discovery, Anomaly Detection
Data Preprocessing
4 types of attributes:
Nominal, Ordinal, Interval and Ratio
Data quality examples:
Missing values, duplication and inconsistent data, noise and outliers
Methods for improving data quality:
Sampling, aggregation, dimensionality reduction, feature subset selection, feature creation, discretization and binarization, attribute transformation
Clustering
Similarity: often falls in the range [0,1] or [-1,1]
Dissimilarity: upper bound can be 1 or $\infty$
Different types of similarities
Euclidean Distance: $L_2$ Norm
City Block: $dist(x,y) = \sum ^p _{i=1} |x_i - y_i|$
Minkowski Distance: $dist(x,y) = (\sum ^p _{i = 1}|x_i - y_i|^ \lambda ) ^{\frac{1}{\lambda}}$
Supremum Distance: $dist(x,y) = \max_i |x_i - y_i|$
Cosine Similarity: $sim(x,y) = \frac{\sum _{i=1} ^p x_i y_i }{[\sum _{i=1} ^p x_i ^2 \sum ^p _{i=1} y_i^2]^{\frac{1}{2}}}$
Pearson's Correlation: $corr(x,y) = \frac{\sum ^p _{i=1} (x_i - \bar{x})(y_i - \bar{y})}{[\sum ^p _{i=1} (x_i - \bar{x})^2 \sum ^p _{i=1} (y_i - \bar{y})^2]^{\frac{1}{2}}}$
Mahalanobis Distance: $dist(x,y) = (x - y)^T S^{-1}(x - y)$, where $S_{jk} = \frac{1}{n-1}\sum ^n _{i=1} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$
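A quick sketch (not from the original notes) of these measures in Python with NumPy/SciPy; the vectors x, y and the sample X are made-up examples.

```python
# Illustrative only: x, y and the sample X are made-up examples.
import numpy as np
from scipy.spatial.distance import cityblock, minkowski, chebyshev, mahalanobis

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 5.0])

euclidean = np.linalg.norm(x - y)          # L2 norm
city_block = cityblock(x, y)               # sum of |x_i - y_i|
mink3 = minkowski(x, y, p=3)               # lambda = 3
supremum = chebyshev(x, y)                 # max_i |x_i - y_i|
cos_sim = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
pearson = np.corrcoef(x, y)[0, 1]

# Mahalanobis distance needs the inverse covariance S^{-1} estimated from data;
# note SciPy returns the square root of the quadratic form given in the notes.
X = np.random.default_rng(0).normal(size=(100, 3))
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
maha = mahalanobis(x, y, S_inv)
```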
Sequential K-Means
- make the initial guess for the means $m_1,...,m_k$
- set the counts $n_1,...,n_k$ to 0
- until interrupted
  - acquire the next example $x$
  - if $x$ is closest to $m_i$, then increase $n_i$ and update $m_i$ by $m_i + \frac{1}{n_i}(x - m_i)$
- end
Forgetful Sequential K-Means
Replace the counts $n_1,...,n_k$ with a constant $a \in [0,1]$, and update $m_i$ by $m_i + a (x - m_i)$
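A rough Python sketch of both update rules above; the data stream, initial means, and the step size a are illustrative assumptions.

```python
# Illustrative sketch; the stream, k=2, and initial means are assumptions.
import numpy as np

def sequential_kmeans(stream, init_means, forgetful_a=None):
    # forgetful_a=None -> standard sequential K-Means (1/n_i step)
    # forgetful_a=a    -> forgetful variant with constant step a in [0, 1]
    means = np.array(init_means, dtype=float)
    counts = np.zeros(len(means))
    for x in stream:                                       # "until interrupted"
        i = np.argmin(np.linalg.norm(means - x, axis=1))   # closest mean
        if forgetful_a is None:
            counts[i] += 1
            means[i] += (x - means[i]) / counts[i]
        else:
            means[i] += forgetful_a * (x - means[i])
    return means

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (250, 2)), rng.normal(5, 1, (250, 2))])
print(sequential_kmeans(data, init_means=[[0, 0], [4, 4]]))
print(sequential_kmeans(data, init_means=[[0, 0], [4, 4]], forgetful_a=0.05))
```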
Expectation-Maximization Algorithm (Gaussian Mixture Models)
- Initialize the model parameters
- Until there is no change in any parameters
  - Expectation Step: calculate the probability that each object belongs to each cluster
  - Maximization Step: find the new estimates of the parameters that maximize the expected likelihood
- end until
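A hedged example: scikit-learn's GaussianMixture runs this EM loop for Gaussian mixtures; the synthetic data and the choice of two components are assumptions.

```python
# Synthetic data; two components and full covariances are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                     # alternates E and M steps until convergence
soft = gmm.predict_proba(X)    # E-step output: P(cluster | x) for each point
hard = gmm.predict(X)          # most likely cluster per point
print(gmm.means_, gmm.weights_)
```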
Kernel K-Means
Transform data point $x_i$ into a feature space $\phi(x_i)$ and then apply the basic K-Means
$K(x_i,x_j) = \phi(x_i)^T \phi(x_j)$
$\|\phi(x_i) - \phi(x_j)\|^2 = K(x_i,x_i) + K(x_j,x_j) - 2K(x_i,x_j)$
Homogeneous polynomial kernel: $K_q(x,y) = (x^Ty)^q$
Inhomogeneous polynomial kernel ($c \geq 0$): $K_q(x,y) = (c + x^Ty)^q$
Gaussian Kernel: $K(x,y) = e^{-\frac{\|x-y\|^2}{2\sigma ^2}}$
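A small sketch of the kernel trick above: squared feature-space distances computed from the kernel matrix alone, without forming $\phi(x)$ explicitly; kernel parameters and data are illustrative.

```python
# Kernel parameters (sigma, c, q) and the sample X are illustrative.
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T       # ||x_i - x_j||^2
    return np.exp(-d2 / (2 * sigma ** 2))

def polynomial_kernel(X, c=1.0, q=2):
    return (c + X @ X.T) ** q                          # homogeneous if c = 0

X = np.random.default_rng(0).normal(size=(5, 3))
K = gaussian_kernel(X)
# ||phi(x_i) - phi(x_j)||^2 = K_ii + K_jj - 2 K_ij
feat_d2 = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K
print(feat_d2)
```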
Self-Organizing Feature Maps
Unlike sequential K-Means, the centres (called units or neurons) have a predetermined topographic ordering relationship
- make initial guesses for centroids $m_1,...,m_k$
- repeat
  - acquire the next example $x$
  - if $m_i$ is closest to $x$, then update $m_i$ by $m_i + a(x - m_i)$; update $m_j$ (neighbor of $m_i$) by $m_j + a_n (x - m_j)$
- end until centroids do not change or a threshold is exceeded
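A minimal sketch of this update for units arranged on a 1-D line, assuming only the immediate neighbors of the winner are updated; learning rates and data are made up.

```python
# Assumes units on a line with immediate neighbors only; rates a, a_n,
# the number of units, and the data are made up.
import numpy as np

def som_1d(data, n_units=5, a=0.1, a_n=0.05, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    m = data[rng.choice(len(data), n_units, replace=False)].copy()
    for _ in range(epochs):
        for x in data:
            i = np.argmin(np.linalg.norm(m - x, axis=1))  # winning unit
            m[i] += a * (x - m[i])                        # move the winner
            for j in (i - 1, i + 1):                      # topographic neighbors
                if 0 <= j < n_units:
                    m[j] += a_n * (x - m[j])
    return m

data = np.random.default_rng(1).normal(size=(200, 2))
print(som_1d(data))
```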
Agglomerative Hierarchical Clustering
Start out with each data point forming its own cluster and gradually merge clusters until all points have been gathered together in one big cluster
- compute the proximity matrix
- repeat
  - merge the two closest clusters
  - update the proximity matrix to reflect the proximity between the new cluster and the original clusters
- until only one cluster remains
Different methods to define the distances between clusters:
Single Linkage Clustering: distance between groups is defined as that of the closest pair of data
Pros: can handle non-elliptical shapes. Cons: sensitive to noise and outliers
Complete Linkage Clustering: distance between two clusters is given by the distance between their most distant members
Pros: less susceptible to noise and outliers. Cons: tends to break large clusters; biased towards globular clusters
Group Average Clustering: distance between two clusters is defined as the average of the distances between all pairs of records
Pros: less susceptible to noise and outliers. Cons: biased towards globular clusters
Centroid Clustering: distance between two clusters is defined as the distance between the mean vectors of the two clusters
Median Clustering: distance between two clusters is defined as the distance between the prototypes of the two clusters
the prototype is the data point when there is only one point in the cluster; when we merge two clusters, the new prototype is the midpoint of the two prototypes
Ward's Method: Similarity of two clusters is based on the increase in squared error when two clusters are merged
$\Delta SSE(C_i,C_j) = \frac{n_i n_j}{n_i+n_j}\|\mu _i - \mu _j\|^2$
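A sketch using SciPy's hierarchical clustering; the "method" argument selects among the linkage strategies listed above, and the data and cut level are arbitrary.

```python
# Synthetic data and an arbitrary 3-cluster cut; "method" picks the linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))

# method: 'single', 'complete', 'average', 'centroid', 'median', or 'ward'
Z = linkage(X, method="ward")                      # merge history (dendrogram data)
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
print(labels)
```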
DBSCAN
Core point: has more than a specified number of points (MinPts) within Eps
Border point: has fewer than MinPts within Eps, but is in the neighborhood of a core point
Noise point: neither a core point nor a border point
- Label all points as core, border, or noise points
- Eliminate noise points
- Put an edge between all core points that are within Eps of each other
- Make each group of connected core points into a separate cluster
- Assign each border point to one of the clusters containing a core point from which the border point is directly density-reachable
Pros: resistant to noise, can handle clusters of different shapes and sizes
Cons: sensitive to varying densities, cannot handle high-dimensional data well
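A usage sketch with scikit-learn's DBSCAN, where eps and min_samples play the roles of Eps and MinPts; the parameter values and data are arbitrary.

```python
# eps and min_samples correspond to Eps and MinPts; values and data are arbitrary.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))   # noise points get the label -1; other labels are cluster ids
```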
Cluster Evaluation
$SSE = \sum _i \sum _{x \in C_i}(x - c_i)^2$
$SSB = \sum_i |C_i| (c - c_i)^2$
$TSS = \sum _{i=1} ^K \sum _{x \in C_i} (x - c)^2$
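A small numeric sketch (synthetic data and labels, both assumptions) of the three quantities; for any partition they satisfy TSS = SSE + SSB.

```python
# Synthetic data and labels; for any partition, SSE + SSB = TSS.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
labels = np.repeat([0, 1], 50)

c = X.mean(axis=0)                                           # overall mean
centroids = [X[labels == i].mean(axis=0) for i in range(2)]  # c_i
sizes = [(labels == i).sum() for i in range(2)]              # |C_i|

sse = sum(((X[labels == i] - centroids[i]) ** 2).sum() for i in range(2))
ssb = sum(sizes[i] * ((centroids[i] - c) ** 2).sum() for i in range(2))
tss = ((X - c) ** 2).sum()
print(sse + ssb, tss)                                        # equal up to rounding
```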
Silhouette Coefficient:
$a_i:$ average distance of $i$ to the points in its cluster
$b_i$: min (average distance of $i$ to points in another cluster)
$s_i = \frac{b_i - a_i}{\max(a_i,b_i)}$
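A sketch computing the silhouette with scikit-learn; the K-Means clustering and the data are illustrative.

```python
# The K-Means clustering and the data are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.random.default_rng(0).normal(size=(100, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s_i = silhouette_samples(X, labels)   # per-point (b_i - a_i) / max(a_i, b_i)
print(silhouette_score(X, labels))    # mean silhouette over all points
```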
Classification
Minimum Distance Classifier
$m_j = \frac{1}{N_j} \sum _{x \in C_j} x$
The goal is to minimize $D_j(x) = \|x - m_j\|$
which is equivalent to maximizing $d_j(x) = m_j^Tx - \frac{1}{2} m_j^Tm_j$, the decision function for class $j$
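A minimal sketch of this classifier under the definitions above; the data, labels, and query point are made-up examples.

```python
# Data, labels, and the query point are made-up examples.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.repeat([0, 1], 50)

means = np.array([X[y == j].mean(axis=0) for j in range(2)])   # class means m_j

def predict(x):
    d = means @ x - 0.5 * (means * means).sum(axis=1)          # d_j(x)
    return int(np.argmax(d))

print(predict(np.array([4.0, 4.5])))   # -> 1 (closer to the second class mean)
```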
Linear Discriminant Classifier
$g(x) = W^Tx + W_0$
$x$ is assigned to one class if $g(x) \gt 0$ and another class if $g(x) \lt 0$
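A tiny sketch of this two-class decision rule; the weights W and W_0 below are arbitrary placeholders, not a fitted discriminant.

```python
# W and W_0 are arbitrary placeholders, not a fitted discriminant.
import numpy as np

W = np.array([1.0, -2.0])
W0 = 0.5

def classify(x):
    g = W @ x + W0
    return "class 1" if g > 0 else "class 2"   # g(x) = 0 is the decision boundary

print(classify(np.array([3.0, 1.0])))
```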
k Nearest Neighbor
lazy learner; does not build a model explicitly
can have arbitrary decision boundaries
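A usage sketch with scikit-learn's KNeighborsClassifier; the dataset and k are illustrative. As a lazy learner, fit() here only stores the training data.

```python
# Dataset and k are illustrative; fit() only stores the training data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.repeat([0, 1], 50)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[3.0, 3.0]]))
```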
Bayes Classifier
$P(A \mid C) = P(A)P(C \mid A) / P(C)$
A point $x$ belongs to class $i$ if $P(m_i \mid x)$ is the largest
Normally we are given $P(x \mid m_i)$ and $P(m_i)$ to compute $P(m_i \mid x)$
the decision function is $d_j(x) = p(w_j)p(x \mid w_j)$
How to estimate $p(x \mid w_j)$: assuming the attributes are conditionally independent (naive Bayes), $p(x \mid w_j) = P(x_1 \mid w_j) \cdots P(x_n \mid w_j)$
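A sketch with scikit-learn's GaussianNB, which estimates each $P(x_i \mid w_j)$ with a univariate Gaussian under the independence assumption above; the data is synthetic.

```python
# Synthetic data; GaussianNB models each P(x_i | w_j) as a univariate Gaussian.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)

nb = GaussianNB().fit(X, y)
print(nb.predict([[2.0, 2.0]]))          # most probable class
print(nb.predict_proba([[2.0, 2.0]]))    # posterior P(w_j | x)
```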