# Data Mining Summary

December 22, 2020

# Data Mining Cheat Sheet

## Introduction

Definition 1: The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.

Definition 2: Data mining is the process of automatically discovering useful information in large data repositories.

Directed Knowledge Discovery: Goal-oriented. Example: classification

Undirected Knowledge Discovery: Bottom-up approach, no target field. Example: clustering

Typical tasks: Classification, Regression, Clustering, Association Rule Discovery, Anomaly Detection

## Data Preprocessing

4 types of attributes:

Nominal, Ordinal, Interval and Ratio

Data quality examples:

Missing values, duplication and inconsistent data, noise and outliers

Methods for improving data quality:

Sampling, aggregation, dimensionality reduction, feature subset selection, feature creation, discretization and binarization, attribute transformation

## Clustering

Similarity: often falls in the range [0,1] or [-1,1]

Dissimilarity: upper bound can be 1 or $\infty$

### Different types of similarities

Euclidean Distance: $L_2$ Norm

City Block: $dist(x,y) = \sum ^p _{i=1} |x_i - y_i|$

Minkowski Distance: $dist(x,y) = (\sum ^p _{i = 1}|x_i - y_i|^ \lambda ) ^{\frac{1}{\lambda}}$

Supremum Distance: $dist(x,y) = max_i |x_i - y_i|$

Cosine Similarity: $sim(x,y) = \frac{\sum _{i=1} ^p x_i y_i }{[\sum _{i=1} ^p x_i ^2 \sum _{i=1} ^p y_i^2]^{\frac{1}{2}}}$

Pearson's Correlation: $corr(x,y) = \frac{\sum ^p _{i=1} (x_i - \bar{x})(y_i - \bar{y})}{[\sum ^p _{i=1} (x_i - \bar{x})^2 \sum^p _{i=1} (y_i - \bar{y})^2]^{\frac{1}{2}}}$

Mahalanobis Distance: $dist(x,y) = (x - y)^T S ^{-1}(x - y)$, where $S_{jk} = \frac{1}{n-1}\sum ^n _{i=1} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$
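These measures can be sketched in plain Python; the sample vectors below are illustrative:

```python
import math

def euclidean(x, y):
    # L2 norm of the difference
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    # L1 norm: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, lam):
    # generalizes city block (lam=1) and Euclidean (lam=2)
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1 / lam)

def supremum(x, y):
    # limit of Minkowski as lam -> infinity
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(city_block(x, y))                       # 1 + 2 + 3 = 6.0
print(supremum(x, y))                         # 3.0
print(round(cosine_similarity(x, y), 6))      # parallel vectors -> 1.0
```

Note that with $\lambda = 1$ the Minkowski distance reduces to the city block distance, and with $\lambda = 2$ to the Euclidean distance.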

### Sequential K-Means

1. make the initial guess for the means $m_1, \dots, m_k$
2. set the counts $n_1, \dots, n_k$ to 0
3. until interrupted:
   1. acquire the next example $x$
   2. if $x$ is closest to $m_i$, increment $n_i$ and update $m_i$ by $m_i + \frac{1}{n_i}(x - m_i)$
4. end
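The loop above can be sketched as follows (the stream and initial means are illustrative):

```python
import math

def sequential_kmeans(stream, initial_means):
    means = [list(m) for m in initial_means]   # step 1: initial guesses
    counts = [0] * len(means)                  # step 2: counts start at 0
    for x in stream:                           # step 3: until interrupted
        # find the closest mean m_i
        i = min(range(len(means)),
                key=lambda j: math.dist(x, means[j]))
        counts[i] += 1
        # update m_i by m_i + (1/n_i)(x - m_i)
        means[i] = [m + (xv - m) / counts[i]
                    for m, xv in zip(means[i], x)]
    return means, counts

means, counts = sequential_kmeans(
    [(0.0,), (10.0,), (0.2,), (9.8,)],         # a tiny 1-D "stream"
    [(1.0,), (9.0,)])
print(means, counts)                           # means pulled toward the two groups
```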

### Forgetful Sequential K-Means

Replace the counts $n_1,...,n_k$ with a constant $a \in [0,1]$, update $m_i$ by $m_i + a (x - m_i)$
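As a sketch, the forgetful update is a single line; the rate $a = 0.5$ here is an arbitrary choice:

```python
# Forgetful sequential K-means replaces 1/n_i with a fixed rate a,
# so the influence of old examples decays geometrically.
def forgetful_update(m, x, a=0.5):
    return [mi + a * (xi - mi) for mi, xi in zip(m, x)]

print(forgetful_update([0.0, 0.0], [4.0, 2.0]))  # moves halfway toward x: [2.0, 1.0]
```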

### Expectation-Maximization Algorithm (Gaussian Mixture Models)

1. initialize the model parameters
2. until there is no change in any parameter:
   1. Expectation step: calculate the probability that each object belongs to each cluster
   2. Maximization step: find the new estimates of the parameters that maximize the expected likelihood
3. end
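The E and M steps can be sketched for a two-component 1-D mixture; fixing unit variances (an assumption made here for brevity) keeps the M-step to the two means and a mixing weight:

```python
import math

def gauss(x, mu):
    # unit-variance normal density
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def em_two_means(data, mu, n_iter=50):
    mu1, mu2 = mu
    w = 0.5                                    # mixing weight of component 1
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each point
        r = []
        for x in data:
            p1 = w * gauss(x, mu1)
            p2 = (1 - w) * gauss(x, mu2)
            r.append(p1 / (p1 + p2))
        # M-step: re-estimate means and weight from the responsibilities
        s1 = sum(r)
        s2 = len(data) - s1
        mu1 = sum(ri * x for ri, x in zip(r, data)) / s1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / s2
        w = s1 / len(data)
    return mu1, mu2, w

data = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]
mu1, mu2, w = em_two_means(data, (-1.0, 1.0))
print(mu1, mu2, w)                             # means near -2 and 2, weight near 0.5
```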

### Kernel K-Means

Transform datapoint $x_i$ into a feature space $\phi(x_i)$ and then apply the basic K-means

$K(x_i,x_j) = \phi(x_i)^T \phi(x_j)$

$||\phi(x_i) -\phi(x_j)||^2 = K(x_i,x_i) + K(x_j,x_j) - 2K(x_i,x_j)$

Homogeneous polynomial kernel: $K_q(x,y) = (x^Ty)^q$

Inhomogeneous polynomial kernel (c >= 0): $K_q(x,y) = (c + x^Ty)^q$

Gaussian Kernel: $K(x,y) = e^{-\frac{||x-y||^2}{2\sigma ^2}}$
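The identity above is what lets K-means run in feature space without ever computing $\phi$ explicitly; a short sketch with the Gaussian kernel ($\sigma = 1$ assumed):

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))

def feature_space_sq_dist(x, y, k):
    # ||phi(x) - phi(y)||^2 = K(x,x) + K(y,y) - 2 K(x,y)
    return k(x, x) + k(y, y) - 2 * k(x, y)

d = feature_space_sq_dist([0.0, 0.0], [1.0, 1.0], gaussian_kernel)
print(d)   # 2 - 2*exp(-1), since K(x,x) = K(y,y) = 1 for the Gaussian kernel
```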

### Self-Organizing Feature Maps

Unlike sequential K-Means, the centres (called units or neurons) have a pre-determined topographic ordering relationship

1. make initial guesses for the centroids $m_1, \dots, m_k$
2. repeat:
   1. acquire the next example $x$
   2. if $m_i$ is closest to $x$, update $m_i$ by $m_i + a(x - m_i)$ and update each neighbor $m_j$ of $m_i$ by $m_j + a_n (x - m_j)$
3. until the centroids do not change or a threshold is exceeded
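A minimal 1-D sketch of one SOM update, with units ordered on a line and arbitrary rates $a$ and $a_n$:

```python
# The winner moves toward x at rate a; its immediate neighbours in the
# topographic ordering also move, at the smaller rate a_n.
def som_step(units, x, a=0.5, a_n=0.25):
    win = min(range(len(units)), key=lambda j: abs(units[j] - x))
    for j in range(len(units)):
        if j == win:
            units[j] += a * (x - units[j])
        elif abs(j - win) == 1:                # topographic neighbours
            units[j] += a_n * (x - units[j])
    return units

units = [0.0, 5.0, 10.0]
som_step(units, 4.0)
print(units)                                   # [1.0, 4.5, 8.5]
```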

### Agglomerative Hierarchical Clustering

Start out with each data point forming its own cluster and gradually merge clusters until all points have been gathered together in one big cluster

1. compute the proximity matrix
2. repeat:
   1. merge the two closest clusters
   2. update the proximity matrix to reflect the proximity between the new cluster and the original clusters
3. until only one cluster remains

Different methods to define the distances between clusters:

Single Linkage Clustering: distance between groups is defined as that of the closest pair of data

Pros: can handle non-elliptical shapes Cons: sensitive to noise and outliers

Complete Linkage Clustering: distance between two clusters is given by the distance between their most distant members

Pros: less susceptible to noise and outliers Cons: tends to break large clusters; biased towards globular clusters

Group Average Clustering: distance between two clusters is defined as the average of the distances between all pairs of records

Pros: less susceptible to noise and outliers Cons: biased towards globular clusters

Centroid Clustering: distance between two clusters is defined as the distance between the mean vectors of the two clusters

Median Clustering: distance between two clusters is defined as the distance between the prototypes of the two clusters

the prototype is the data point itself when the cluster contains only one point; when two clusters are merged, the new prototype is the midpoint of the two prototypes

Ward's Method: Similarity of two clusters is based on the increase in squared error when two clusters are merged

$\Delta SSE(C_i,C_j) = \frac{n_in_j}{n_i+n_j}||\mu _i - \mu _j||^2$
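Ward's merge cost can be computed directly from this formula; the 1-D clusters below are illustrative:

```python
# Delta SSE for merging two clusters: (n_i n_j / (n_i + n_j)) * ||mu_i - mu_j||^2.
# Clusters are lists of 1-D points to keep the sketch short.
def ward_delta_sse(ci, cj):
    ni, nj = len(ci), len(cj)
    mi, mj = sum(ci) / ni, sum(cj) / nj
    return (ni * nj) / (ni + nj) * (mi - mj) ** 2

print(ward_delta_sse([0.0, 2.0], [10.0, 12.0]))  # (2*2/4) * (1 - 11)^2 = 100.0
```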

### DBSCAN

Core point: has more than a specified number of points (MinPts) within Eps

Border point: has fewer than MinPts within Eps, but is in the neighborhood of a core point

Noise point: neither a core point nor a border point

1. Label all points as core, border, or noise points
2. Eliminate noise points
3. Put an edge between all core points that are within Eps of each other
4. Make each group of connected core points into a separated cluster
5. Assign each border point to one of the clusters containing a core point from which the border point is directly density-reachable

Pros: resistant to noise, can handle clusters of different shapes and sizes

Cons: sensitive to varying densities, cannot handle high-dimensional data well
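The five steps above can be sketched with a brute-force neighbour search (fine for tiny data sets; the Eps and MinPts values below are illustrative):

```python
import math

def dbscan(points, eps, min_pts):
    n = len(points)
    # neighbourhoods within Eps (each point counts itself)
    nbrs = [[j for j in range(n)
             if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = [i for i in range(n) if len(nbrs[i]) >= min_pts]
    labels = [None] * n                        # None = noise (so far)
    cid = 0
    for i in core:                             # grow one cluster per unvisited core point
        if labels[i] is not None:
            continue
        cid += 1
        stack = [i]
        while stack:
            p = stack.pop()
            if labels[p] is not None:
                continue
            labels[p] = cid
            if p in core:                      # only core points expand the cluster;
                stack.extend(nbrs[p])          # border points are absorbed but stop here
    return labels                              # None marks noise points

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (5, 5)]
print(dbscan(pts, eps=1.5, min_pts=2))         # two clusters plus one noise point
```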

### Cluster Evaluation

$SSE = \sum _i \sum _{x \in C_i}(x - c_i)^2$

$SSB = \sum_i |C_i| (c - c_i)^2$

$TSS = \sum _{i=1} ^K \sum _{x \in C_i} (x-c)^2$

where $c$ is the overall mean and $c_i$ the centroid of cluster $C_i$; note that $TSS = SSE + SSB$.

Silhouette Coefficient:

$a_i:$ average distance of $i$ to the points in its cluster

$b_i$: min (average distance of $i$ to points in another cluster)

$s_i = \frac{b_i - a_i}{\max(a_i,b_i)}$
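A sketch of the coefficient for a single point, straight from the definitions of $a_i$ and $b_i$ (the clusters below are invented):

```python
import math

def silhouette(point, own, others):
    # a_i: average distance to the other points in its own cluster
    a = sum(math.dist(point, q) for q in own) / len(own)
    # b_i: smallest average distance to the points of any other cluster
    b = min(sum(math.dist(point, q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)

own = [(0.0, 1.0), (1.0, 0.0)]                 # the rest of the point's cluster
others = [[(10.0, 10.0), (10.0, 11.0)]]        # one other, far-away cluster
print(round(silhouette((0.0, 0.0), own, others), 3))   # close to 1: well clustered
```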

## Classification

### Minimum Distance Classifier

$m_j = \frac{1}{N_j} \sum _{x \in C_j} x$

The goal is to minimize $D_j(x) = ||x - m_j||$

which is equivalent to maximizing the decision function for class $j$: $d_j(x) = m_j^Tx - \frac{1}{2} m_j^Tm_j$
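A quick sketch checking that the two formulations pick the same class (the means and query point are invented):

```python
import math

def classify_min_dist(x, means):
    # minimize D_j(x) = ||x - m_j||
    return min(range(len(means)), key=lambda j: math.dist(x, means[j]))

def classify_decision_fn(x, means):
    # maximize d_j(x) = m_j^T x - (1/2) m_j^T m_j
    def d(m):
        return sum(mi * xi for mi, xi in zip(m, x)) - 0.5 * sum(mi * mi for mi in m)
    return max(range(len(means)), key=lambda j: d(means[j]))

means = [(0.0, 0.0), (4.0, 4.0)]
x = (3.0, 3.5)
print(classify_min_dist(x, means), classify_decision_fn(x, means))  # both pick class 1
```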

### Linear Discriminant Classifier

$g(x) = W^Tx + W_o$

$x$ is assigned to one class if $g(x) \gt 0$ and another class if $g(x) \lt 0$

### k Nearest Neighbor

a lazy learner: it does not build a model explicitly

can produce arbitrary decision boundaries
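A minimal sketch of the lazy evaluation: every query scans the stored training set (the toy data is invented):

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    # train: list of (point, label) pairs; no model is built up front
    nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]          # majority vote among the k nearest

train = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"),
         ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((1.0, 0.0), "a")]
print(knn_predict(train, (0.5, 0.5)))          # "a"
```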

### Bayes Classifier

$P(A|C) = P(A)P(C|A) / P(C)$

A point $x$ belongs to class $i$ if $P(m_i | x)$ is the largest

Normally we are given $P(x|m_i)$ and $P(m_i)$ to compute $P(m_i|x)$

the decision function is $d_j(x) = p(w_j)p(x|w_j)$

How to estimate $p(x|w_j)$: assume the attributes are conditionally independent given the class (the naive Bayes assumption), so $p(x|w_j) = P(x_1|w_j)...P(x_n|w_j)$
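A tiny categorical naive Bayes sketch using this factorization; the weather-style data and feature names are invented:

```python
from collections import defaultdict

def nb_train(rows):
    # rows: (features_tuple, label) pairs
    class_counts = defaultdict(int)
    feat_counts = defaultdict(int)             # (label, position, value) -> count
    for feats, label in rows:
        class_counts[label] += 1
        for pos, val in enumerate(feats):
            feat_counts[(label, pos, val)] += 1
    return class_counts, feat_counts

def nb_predict(model, feats):
    class_counts, feat_counts = model
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for label, n in class_counts.items():
        p = n / total                          # prior P(w_j)
        for pos, val in enumerate(feats):      # product of the P(x_i | w_j) terms
            p *= feat_counts[(label, pos, val)] / n
        if p > best_p:
            best, best_p = label, p
    return best

rows = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rain", "mild"), "yes"), (("rain", "cool"), "yes")]
model = nb_train(rows)
print(nb_predict(model, ("rain", "mild")))     # "yes"
```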