This series is mainly based on the Machine Learning course by Andrew Ng, hosted on coursera.org.

Unsupervised Learning – Clustering

K-means Algorithm

K-means algorithm

Input:
  • $K$ (number of clusters)
  • Training set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, with $x^{(i)} \in \mathbb{R}^n$ (drop the $x_0 = 1$ convention)

Algorithm:

Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n$

Repeat {
  for $i = 1$ to $m$: $c^{(i)}$ := index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$
  for $k = 1$ to $K$: $\mu_k$ := average (mean) of the points assigned to cluster $k$
}
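A minimal Octave sketch of these two steps (the function name `runKMeans`, the fixed iteration count, and the seeding of centroids from random examples are my own choices, not the course's starter code):

```matlab
% Minimal K-means sketch: cluster assignment step + move-centroid step.
function [centroids, idx] = runKMeans(X, K, maxIters)
  m = size(X, 1);
  centroids = X(randperm(m, K), :);        % initialize from K random examples
  for iter = 1:maxIters
    % Cluster assignment: c(i) := index of the closest centroid to x(i)
    dists = zeros(m, K);
    for k = 1:K
      dists(:, k) = sum((X - centroids(k, :)).^2, 2);
    end
    [~, idx] = min(dists, [], 2);
    % Move centroids: mu_k := mean of the points assigned to cluster k
    for k = 1:K
      centroids(k, :) = mean(X(idx == k, :), 1);  % a real implementation would
    end                                           % re-seed empty clusters here
  end
end
```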

K-means for Non-separated clusters


Optimization Objective for K-means

$$J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K)=\frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)}-\mu_{c^{(i)}}\right\rVert^2$$

$$\min_{\substack{c^{(1)},\dots,c^{(m)} \\ \mu_1,\dots,\mu_K}} J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K)$$

where
  • $c^{(i)}$ = index of the cluster ($1, 2, \dots, K$) to which example $x^{(i)}$ is currently assigned
  • $\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
  • $\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned


The K-means algorithm is equivalent to repeatedly solving this optimization problem: first minimize $J$ over $c^{(i)}$ for $i = 1, \dots, m$ (holding the centroids fixed), then over $\mu_k$ for $k = 1, \dots, K$ (holding the assignments fixed).
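As a quick check of convergence, the distortion $J$ can be computed after each iteration. A sketch, assuming `idx` holds the assignments $c^{(i)}$ and `centroids` the centroids $\mu_k$:

```matlab
% Distortion J for the current assignments: mean squared distance of each
% example (row of X) to the centroid of the cluster it is assigned to.
J = mean(sum((X - centroids(idx, :)).^2, 2));
```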

Initial Guesses - Random Initialization

A bad starting point can lead to a poor local optimum:


Solution: run K-means multiple times with different random starting points, then pick the run with the lowest cost $J$ (see the sketch below).
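A hedged sketch of this multi-restart strategy, reusing the hypothetical `runKMeans` helper from above:

```matlab
% Run K-means with many random initializations and keep the best clustering.
numRestarts = 100;
bestJ = Inf;
for t = 1:numRestarts
  [centroids, idx] = runKMeans(X, K, 10);
  J = mean(sum((X - centroids(idx, :)).^2, 2));  % distortion of this run
  if J < bestJ
    bestJ = J; bestCentroids = centroids; bestIdx = idx;
  end
end
```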

Choose the number of clusters


Evaluate the choice of $K$ based on the later/downstream purpose it serves.


Dimensionality Reduction

Motivation

  • Data Compression
  • Data Visualization



But what is the meaning of these new features? They are usually difficult to interpret.

Principal Component Analysis


Reduce from $n$ dimensions to $k$ dimensions: find $k$ vectors $u^{(1)}, u^{(2)}, \dots, u^{(k)}$ onto which to project the data, so as to minimize the projection error.


For linear regression: $\min_x \lVert Ax - y \rVert^2$. Since $Ax = (a_1, a_2, \dots, a_n)\,x$, linear regression is essentially the projection of $y$ onto the linear space spanned by the column vectors of $A$.

For PCA, the first principal component is the unit vector $u$ that maximizes $\lVert A u \rVert^2$, where $A$ is now the data matrix.

Mathematically, linear regression is the problem of projecting a point in $\mathbb{R}^m$ onto the linear space spanned by $x_0, x_1, \dots, x_n$. In other words, the basis is given.

For PCA, however, the problem is to find a subspace for the given data such that most of the information is retained. In other words, the task is to find the basis of that subspace: $u^{(1)}, \dots, u^{(k)} \in \mathbb{R}^n$.

Details in Mathematics

First Component

In order to maximize variance, the first weight vector w(1) thus has to satisfy:

$$w_{(1)} = \arg\max_{\lVert w \rVert = 1} \left\{ \sum_i \left(t_1\right)^2_{(i)} \right\} = \arg\max_{\lVert w \rVert = 1} \left\{ \sum_i \left(x_{(i)} \cdot w\right)^2 \right\}$$

Equivalently, writing this in matrix form gives

$$w_{(1)} = \arg\max_{\lVert w \rVert = 1} \left\{ \lVert Xw \rVert^2 \right\} = \arg\max_{\lVert w \rVert = 1} \left\{ w^T X^T X w \right\}$$

Since w(1) has been defined to be a unit vector, it equivalently also satisfies

$$w_{(1)} = \arg\max \left\{ \frac{w^T X^T X w}{w^T w} \right\}$$

The quantity to be maximised can be recognised as a Rayleigh quotient. A standard result for a positive semidefinite matrix such as XTX is that the quotient’s maximum possible value is the largest eigenvalue of the matrix, which occurs when w is the corresponding eigenvector.
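One way to see this (a standard argument, not spelled out in the course): maximize $w^T X^T X w$ subject to $w^T w = 1$ with a Lagrange multiplier $\lambda$,

$$\mathcal{L}(w, \lambda) = w^T X^T X w - \lambda\,(w^T w - 1), \qquad \nabla_w \mathcal{L} = 2 X^T X w - 2\lambda w = 0 \;\Longrightarrow\; X^T X w = \lambda w,$$

so every stationary point is an eigenvector of $X^T X$, and at such a point the objective equals $w^T X^T X w = \lambda$, which is largest for the top eigenvalue.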

With $w_{(1)}$ found, the first principal component of a data vector $x_{(i)}$ can then be given as a score $t_{1(i)} = x_{(i)} \cdot w_{(1)}$ in the transformed co-ordinates, or as the corresponding vector in the original variables, $t_{1(i)} w_{(1)} = \left(x_{(i)} \cdot w_{(1)}\right) w_{(1)}$.

Further Component

The kth component can be found by subtracting the first k − 1 principal components from X:

$$\hat{X}_k = X - \sum_{s=1}^{k-1} X w_{(s)} w_{(s)}^T$$

and then finding the weight vector which extracts the maximum variance from this new data matrix

$$w_{(k)} = \arg\max_{\lVert w \rVert = 1} \left\{ \lVert \hat{X}_k w \rVert^2 \right\} = \arg\max \left\{ \frac{w^T \hat{X}_k^T \hat{X}_k w}{w^T w} \right\}$$

It turns out that this gives the remaining eigenvectors of $X^T X$, with the maximum values for the quantity in brackets given by their corresponding eigenvalues. Thus the weight vectors are eigenvectors of $X^T X$. The full principal-components decomposition of $X$ can therefore be given as $T = XW$, where $W$ is an $n$-by-$n$ matrix of weights whose columns are the eigenvectors of $X^T X$. The transpose of $W$ is sometimes called the whitening or sphering transformation. Columns of $W$ multiplied by the square root of the corresponding eigenvalues, i.e. eigenvectors scaled up by the variances, are called loadings in PCA or in factor analysis.
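A small Octave sketch of this decomposition (my own illustration, assuming `X` is the mean-normalized data matrix):

```matlab
% Full principal-components decomposition T = X * W, where the columns of W
% are the eigenvectors of X' * X sorted by decreasing eigenvalue.
[W, D] = eig(X' * X);
[~, order] = sort(diag(D), 'descend');
W = W(:, order);
T = X * W;                % scores: each row is one example in the new basis
```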

PCA algorithm

Step 1: Data preprocessing. Apply mean normalization so that the features are centered at the origin; feature scaling is also necessary when the features are on very different scales:

$$X := \frac{X - \operatorname{mean}(X)}{\operatorname{std}(X)}$$
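A hedged Octave helper for this step that also returns the statistics needed later for the validation and test sets (the function and variable names are my own choices):

```matlab
% Mean normalization and feature scaling: center each column, divide by its std.
function [Xnorm, mu, sigma] = featureNormalize(X)
  mu = mean(X, 1);                 % per-feature mean, 1-by-n
  sigma = std(X, 0, 1);            % per-feature standard deviation, 1-by-n
  Xnorm = (X - mu) ./ sigma;       % broadcast over the m rows
end
```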



Vectorization:

Mean normalization and optionally feature scaling:

X = bsxfun(@minus, X, mean(X, 1))

$\Sigma = \frac{1}{m} X^T X$

[U, S, V] = svd(Sigma)

Then we have:

$$U = \begin{bmatrix} | & | & & | \\ u^{(1)} & u^{(2)} & \cdots & u^{(n)} \\ | & | & & | \end{bmatrix} \in \mathbb{R}^{n \times n}$$

To map $x \in \mathbb{R}^n$ to $z \in \mathbb{R}^k$:

Ureduce = U(:, 1:k)

$z = U_{\text{reduce}}^T\, x$ (for a single example $x$; for the whole data matrix, $Z = X\, U_{\text{reduce}}$)

Note 1: there is no $x_0 = 1$ term in this convention. Note 2: $U$ comes from the SVD of $\Sigma = \frac{1}{m} X^T X$, so $U \in \mathbb{R}^{n \times n}$; its columns are the eigenvectors of $X^T X$ (equivalently, the right singular vectors of $X$).
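Putting the steps together, a minimal sketch assuming the hypothetical `featureNormalize` helper above and a chosen `k`:

```matlab
% PCA: normalize, form the covariance matrix, take its SVD, project to k dims.
[Xnorm, mu, sigma] = featureNormalize(X);   % X is m-by-n
m = size(Xnorm, 1);
Sigma = (1/m) * (Xnorm' * Xnorm);           % n-by-n covariance matrix
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);                        % top-k principal directions
Z = Xnorm * Ureduce;                        % m-by-k compressed representation
```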

Applying PCA

Reconstruction from compressed representation

$$z = U_{\text{reduce}}^T\, x \qquad\Longrightarrow\qquad x_{\text{approx}} = U_{\text{reduce}}\, z$$

where $x_{\text{approx}} \in \mathbb{R}^n$.
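In vectorized form, continuing the sketch above:

```matlab
% Reconstruct an approximation of the normalized data from the k-dim scores.
Xapprox = Z * Ureduce';          % m-by-n, lies in the k-dimensional subspace
```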

Choose the number of principal components


Since the diagonal entries $S_{ii}$ are the eigenvalues of $\Sigma = \frac{1}{m} X^T X$, each $S_{ii}$ is (up to the $\frac{1}{m}$ factor) the square of the corresponding singular value $s_i$ of $X$.

Choosing $k$ (the number of principal components):

[U, S, V] = svd(Sigma)

Pick the smallest value of $k$ for which

$$\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \geq 0.99 \qquad (99\% \text{ of variance retained})$$
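A minimal sketch of this rule using the diagonal of the `S` returned by `svd(Sigma)`:

```matlab
% Pick the smallest k that retains at least 99% of the variance.
sv = diag(S);                       % eigenvalues of Sigma, in decreasing order
retained = cumsum(sv) / sum(sv);    % fraction of variance retained for each k
k = find(retained >= 0.99, 1);      % smallest k meeting the threshold
```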

Advice for applying PCA

Supervised learning speedup


Note: $X$ must be mean-normalized before PCA; feature scaling is optional, but recommended if the features have very different scales.

The same transformation (the $\mu$, $\sigma$, and $U_{\text{reduce}}$ learned from the training set) must be applied to xval and xtest.
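For example, a sketch assuming the `mu`, `sigma`, and `Ureduce` computed from the training set above:

```matlab
% Map the cross-validation and test sets with parameters fit on the training set only.
Zval  = ((Xval  - mu) ./ sigma) * Ureduce;
Ztest = ((Xtest - mu) ./ sigma) * Ureduce;
```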

In summary, the applications of PCA are:

  • Compression
    • Reduce memory/disk needed for storage of data
    • speed up learning algorithm
  • Visualization

Note: It might work OK, but using PCA to prevent overfitting is generally a bad application; use regularization instead:

$$\min_\theta \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

If overfitting is the problem, using PCA would probably throw away some useful information.

People like to use PCA by default in the design of any ML system:

Design of an ML system:
  • Get the training set $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$
  • Run PCA to reduce $x^{(i)}$ in dimension to get $z^{(i)}$
  • Train logistic regression on $\{(z^{(1)}, y^{(1)}), \dots, (z^{(m)}, y^{(m)})\}$
  • Test on the test set: map $x^{(i)}_{\text{test}}$ to $z^{(i)}_{\text{test}}$ and run $h_\theta(z)$ on $\{(z^{(1)}_{\text{test}}, y^{(1)}_{\text{test}}), \dots, (z^{(m)}_{\text{test}}, y^{(m)}_{\text{test}})\}$

Before implementing PCA, it is better to first try running on the original/raw data without PCA. Only if that does not do what is desired should PCA be implemented.

Anomaly Detection

Density Estimation

Problem Motivation



Algorithm for Anomaly Detection

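In outline, the density-estimation algorithm fits a Gaussian $\mathcal{N}(\mu_j, \sigma_j^2)$ to each feature independently, models $p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$, and flags $x$ as an anomaly if $p(x) < \epsilon$. A minimal Octave sketch (variable names such as `Xnew` and `epsilon` are my own):

```matlab
% Fit an independent Gaussian to each feature of the (mostly normal) training set.
mu = mean(X, 1);                    % 1-by-n feature means
sigma2 = var(X, 1);                 % 1-by-n feature variances (1/m normalization)

% p(x) = prod_j p(x_j; mu_j, sigma2_j) for each new example (row of Xnew).
p = prod((1 ./ sqrt(2*pi*sigma2)) .* exp(-((Xnew - mu).^2) ./ (2*sigma2)), 2);

isAnomaly = (p < epsilon);          % flag low-probability examples
```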

Building an Anomaly Detection System


Due to the possible skewness of the data/labels (anomalies are rare), prediction accuracy is not a good measure of the algorithm's performance.

  • Possible evaluation metrics:

Precision:

$$\frac{\#\text{True Positives}}{\#\text{Total Predicted Positives}}$$

Recall:

$$\frac{\#\text{True Positives}}{\#\text{Total Actual Positives}}$$


$$F_1\ \text{Score} = \frac{2PR}{P+R}$$

  • Use the cross-validation set to choose the parameter $\epsilon$, as in the sketch below.
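A hedged sketch of choosing $\epsilon$ by maximizing the $F_1$ score on the cross-validation set, where `pval` holds $p(x)$ for the CV examples and `yval` their labels (names are my own):

```matlab
% Scan a range of thresholds and keep the one with the best F1 on the CV set.
bestEpsilon = 0; bestF1 = 0;
stepsize = (max(pval) - min(pval)) / 1000;
for epsilon = min(pval):stepsize:max(pval)
  pred = (pval < epsilon);                 % 1 = predicted anomaly
  tp = sum((pred == 1) & (yval == 1));
  fp = sum((pred == 1) & (yval == 0));
  fn = sum((pred == 0) & (yval == 1));
  prec = tp / (tp + fp);
  rec  = tp / (tp + fn);
  F1 = (2 * prec * rec) / (prec + rec);
  if F1 > bestF1
    bestF1 = F1; bestEpsilon = epsilon;
  end
end
```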

Anomaly Detection (Gaussian) vs. Supervised Learning

Since we have labeled data, why not just use supervised learning to detect the anomalies?


Few anomalous (positive) examples: use anomaly detection, as the limited number of positives makes it hard for a supervised learner to learn what anomalies look like.


Choosing What Features to Use in Anomaly Detection

  • If a feature is non-Gaussian, it is better to transform it (e.g. $\log(x)$ or $x^{1/2}$) so that it looks more like a Gaussian distribution.
  • If it is difficult to identify the anomalies with the current features, it helps to add more features, especially ones derived from examining the anomalies.

Multivariate Gaussian Distribution

In the previous anomaly detection algorithm, the features are assumed to be independent and normally distributed. If we want to account for dependency or correlation among the $n$ features, we can use the multivariate Gaussian distribution.


Note: could PCA be used to capture this correlation?

Multivariate Gaussian (Normal) distribution: $x \in \mathbb{R}^n$. Don't model $p(x_1), p(x_2), \dots$ separately; model $p(x)$ all in one go.

Parameters: $\mu \in \mathbb{R}^n$, $\Sigma \in \mathbb{R}^{n \times n}$ (covariance matrix)

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\, \lvert\Sigma\rvert^{1/2}} \exp\!\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right)$$
  • Parameter fitting: given the training set $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, we can estimate the parameters:
$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \qquad \Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu)(x^{(i)} - \mu)^T$$

Here $\mu, x^{(i)} \in \mathbb{R}^n$. In other words, $\Sigma = \mathbb{E}\!\left[(x-\mu)(x-\mu)^T\right]$; stacking the examples as the rows of $X = [x^{(1)}; x^{(2)}; \dots; x^{(m)}] \in \mathbb{R}^{m \times n}$, this is $\Sigma = \frac{1}{m}(X - \mu^T)^T (X - \mu^T)$, with $\mu^T$ subtracted from each row.

  • Algorithm: fit $\mu$ and $\Sigma$ on the training set, evaluate $p(x; \mu, \Sigma)$ for a new example $x$, and flag an anomaly if $p(x) < \epsilon$ (a sketch follows below).

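A hedged Octave sketch of this algorithm (names are my own; the rows of `X` are the training examples and the rows of `Xnew` the examples to score):

```matlab
% Fit the multivariate Gaussian parameters on the training set.
[m, n] = size(X);
mu = mean(X, 1);                    % 1-by-n
Xc = X - mu;                        % center each row
Sigma = (1/m) * (Xc' * Xc);         % n-by-n covariance matrix

% Evaluate p(x; mu, Sigma) for the new examples and flag anomalies.
Xd = Xnew - mu;
p = (2*pi)^(-n/2) * det(Sigma)^(-1/2) ...
    * exp(-0.5 * sum((Xd / Sigma) .* Xd, 2));
isAnomaly = (p < epsilon);
```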

Note that for the multivariate Gaussian:

  • The $n$ feature columns (vectors in $\mathbb{R}^m$) must be linearly independent, i.e. $X$ must have full column rank $n$, which requires $m \geq n$; otherwise there is some $w \neq 0$ with $w^T \Sigma w = 0$, which means $\Sigma$ is non-invertible.
  • It is computationally more expensive than the independent (univariate) model.
