This series is mainly based on the Machine Learning course by Andrew Ng, hosted on coursera.org.
## Unsupervised Learning – Clustering
### K-means Algorithm
K-means algorithm

Input:

- $K$ (number of clusters)
- Training set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, with $x^{(i)} \in \mathbb{R}^n$ (drop the $x_0 = 1$ convention)

Algorithm:

Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n$.

Repeat {
- $c^{(i)} :=$ index (from $1$ to $K$) of the cluster centroid closest to $x^{(i)}$, for $i = 1, \dots, m$
- $\mu_k :=$ average (mean) of the points assigned to cluster $k$, for $k = 1, \dots, K$

}

K-means can also be applied to non-separated clusters, i.e. data without obviously distinct groups.
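A minimal Octave sketch of these two steps (toy data; `K`, `max_iters`, and the variable names are my own choices, not code from the course):

```matlab
% K-means: alternate the cluster-assignment and move-centroid steps.
X = [randn(50, 2); randn(50, 2) + 5];        % toy data: two blobs in R^2
K = 2;  max_iters = 10;
centroids = X(randperm(size(X, 1), K), :);   % initialize centroids from random examples
c = zeros(size(X, 1), 1);
for iter = 1:max_iters
  % Assignment step: c(i) = index of the centroid closest to x(i)
  for i = 1:size(X, 1)
    [~, c(i)] = min(sum(bsxfun(@minus, centroids, X(i, :)) .^ 2, 2));
  end
  % Move-centroid step: mu_k = mean of the points assigned to cluster k
  for k = 1:K
    centroids(k, :) = mean(X(c == k, :), 1); % (empty clusters not handled in this sketch)
  end
end
```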
### Optimization Objective for K-means
$$J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^2$$

$$\min_{\substack{c^{(1)},\dots,c^{(m)} \\ \mu_1,\dots,\mu_K}} J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K)$$

where

- $c^{(i)}$ = index of the cluster ($1, 2, \dots, K$) to which example $x^{(i)}$ is currently assigned
- $\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
- $\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned

K-means is equivalent to repeatedly solving this optimization problem: first minimize $J$ over $c^{(i)}$ for $i = 1, \dots, m$ (the cluster-assignment step), then over $\mu_k$ for $k = 1, \dots, K$ (the move-centroid step).
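For concreteness, a small sketch of evaluating this distortion cost for a given assignment (reusing `X`, `c`, and `centroids` from the K-means sketch above):

```matlab
% Distortion cost J = (1/m) * sum_i ||x(i) - mu_c(i)||^2
J = sum(sum((X - centroids(c, :)) .^ 2)) / size(X, 1);
```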
### Initial Guesses - Random Initialization
A bad starting point can lead K-means to a poor local optimum.

Solution: run the algorithm several times with different random starting points, then pick the run with the lowest cost $J$, as sketched below.
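A hedged sketch of this multi-start strategy (reusing the toy `X` and `K` above; `run_kmeans` is a hypothetical helper wrapping the assignment/update loop shown earlier, and `n_trials` is my own name):

```matlab
% Run K-means n_trials times from different random initializations
% and keep the clustering with the lowest distortion J.
n_trials = 100;
best_J = Inf;
for trial = 1:n_trials
  [c, centroids] = run_kmeans(X, K);   % hypothetical helper for the loop above
  J = sum(sum((X - centroids(c, :)) .^ 2)) / size(X, 1);
  if J < best_J
    best_J = J;  best_c = c;  best_centroids = centroids;
  end
end
```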
### Choose the Number of Clusters
The most common approach is to choose $K$ manually, evaluating it based on the later/downstream purpose the clusters will serve.
## Dimensionality Reduction
### Motivation
- Data Compression
- Data Visualization
But what is the meaning of these new features? They are often difficult to interpret.
### Principal Component Analysis
For linear regression the problem is $\min_x \lVert Ax - y \rVert^2$; since $Ax = (a_1, a_2, \dots, a_n)\,x$, linear regression is essentially the projection of $y$ onto the linear span of the column vectors of $A$.

For PCA, the first principal component is the unit direction $u$ that maximizes $\lVert Au \rVert^2$.

Mathematically, linear regression projects a point in $\mathbb{R}^m$ onto the linear space spanned by $x_0, x_1, \dots, x_n$; in other words, the basis is given.

PCA, however, must find a subspace for the given data such that most of the information is retained; in other words, the task is to find the basis of that subspace ($u_1, \dots, u_k \in \mathbb{R}^n$).
### Details in Mathematics
#### First Component
In order to maximize the variance, the first weight vector $w_{(1)}$ thus has to satisfy

$$w_{(1)} = \arg\max_{\lVert w \rVert = 1} \left\{ \sum_i \left( t_1 \right)_{(i)}^2 \right\} = \arg\max_{\lVert w \rVert = 1} \left\{ \sum_i \left( x_{(i)} \cdot w \right)^2 \right\}$$

Equivalently, writing this in matrix form gives

$$w_{(1)} = \arg\max_{\lVert w \rVert = 1} \left\{ \lVert Xw \rVert^2 \right\} = \arg\max_{\lVert w \rVert = 1} \left\{ w^T X^T X w \right\}$$

Since $w_{(1)}$ has been defined to be a unit vector, it equivalently also satisfies

$$w_{(1)} = \arg\max \left\{ \frac{w^T X^T X w}{w^T w} \right\}$$

The quantity to be maximised can be recognised as a Rayleigh quotient. A standard result for a positive semidefinite matrix such as $X^T X$ is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when $w$ is the corresponding eigenvector.
With $w_{(1)}$ found, the first principal component of a data vector $x_{(i)}$ can then be given as a score $t_{1(i)} = x_{(i)} \cdot w_{(1)}$ in the transformed coordinates, or as the corresponding vector in the original variables, $t_{1(i)} w_{(1)} = \left( x_{(i)} \cdot w_{(1)} \right) w_{(1)}$.
#### Further Components
The $k$-th component can be found by subtracting the first $k-1$ principal components from $X$:

$$\hat{X}_k = X - \sum_{s=1}^{k-1} X w_{(s)} w_{(s)}^T$$

and then finding the weight vector which extracts the maximum variance from this new data matrix

$$w_{(k)} = \arg\max_{\lVert w \rVert = 1} \left\{ \lVert \hat{X}_k w \rVert^2 \right\} = \arg\max \left\{ \frac{w^T \hat{X}_k^T \hat{X}_k w}{w^T w} \right\}$$

It turns out that this gives the remaining eigenvectors of $X^T X$, with the maximum values for the quantity in brackets given by their corresponding eigenvalues. Thus the weight vectors are eigenvectors of $X^T X$. The full principal components decomposition of $X$ can therefore be given as

$$T = XW$$

where $W$ is a $p \times p$ matrix of weights whose columns are the eigenvectors of $X^T X$. The transpose of $W$ is sometimes called the whitening or sphering transformation. Columns of $W$ multiplied by the square root of the corresponding eigenvalues, i.e. eigenvectors scaled up by the variances, are called loadings in PCA or in factor analysis.
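A small numerical sketch of this result (Octave, toy data, all names mine): the weight vectors are the eigenvectors of $X^T X$ sorted by decreasing eigenvalue, and the scores are $T = XW$.

```matlab
% Principal components as eigenvectors of X'X (columns of W), scores T = X*W.
X = randn(200, 3) * [3 0 0; 0 1 0; 0 0 0.2];   % toy data in R^3
X = bsxfun(@minus, X, mean(X, 1));             % center the columns
[W, D] = eig(X' * X);
[vals, order] = sort(diag(D), 'descend');      % largest variance first
W = W(:, order);
w1 = W(:, 1);                                  % first weight vector w_(1)
T = X * W;                                     % full principal components decomposition
```

In practice the same directions are obtained from the SVD of the covariance matrix, which is how the PCA algorithm below computes them.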
### PCA Algorithm
Step 1: Data preprocessing. Apply mean normalization, so that every feature is centered at the origin, and feature scaling if the features are on very different scales:

$$X := \frac{X - \operatorname{mean}(X)}{\operatorname{std}(X)}$$

Vectorization:

Mean normalization and optionally feature scaling:

```matlab
X = bsxfun(@minus, X, mean(X, 1));
```

Compute the covariance matrix and its singular value decomposition:

$$\Sigma = \frac{1}{m} X^T X$$

```matlab
[U, S, V] = svd(Sigma);
```

Then we have:
$$U = \begin{bmatrix} | & | & & | \\ u^{(1)} & u^{(2)} & \cdots & u^{(n)} \\ | & | & & | \end{bmatrix} \in \mathbb{R}^{n \times n}$$

To map $x \in \mathbb{R}^n \to z \in \mathbb{R}^k$:

```matlab
Ureduce = U(:, 1:k);
z = Ureduce' * x;
```
Note 1: there is no $x_0 = 1$ feature in this convention. Note 2: $U$ comes from $U S V^* = \Sigma = \frac{1}{m} X^T X$, so $U \in \mathbb{R}^{n \times n}$ and its columns are the eigenvectors of $X^T X$ (equivalently, the right singular vectors of $X$).
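Putting the steps together, a minimal sketch of the whole pipeline (assuming a data matrix `X` with one example per row, and `k` chosen as in the next sections):

```matlab
% PCA via SVD of the covariance matrix, then projection to k dimensions.
X = bsxfun(@minus, X, mean(X, 1));    % mean normalization (scale too, if needed)
Sigma = (1 / size(X, 1)) * (X' * X);  % covariance matrix, n-by-n
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);                  % first k eigenvectors
Z = X * Ureduce;                      % row i of Z is z(i) = Ureduce' * x(i)
```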
### Applying PCA
#### Reconstruction from Compressed Representation
$$z = U_{\text{reduce}}^T\, x \implies x_{\text{approx}} = U_{\text{reduce}}\, z$$

where $x_{\text{approx}} \in \mathbb{R}^n$.
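Continuing the sketch above (same `Z` and `Ureduce`), the reconstruction is just the transpose mapping:

```matlab
% Reconstruct an approximation of X from the k-dimensional scores Z.
X_approx = Z * Ureduce';   % each row is Ureduce * z(i), mapped back into R^n
```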
#### Choose the Number of Principal Components
Since the diagonal of $S$ contains the eigenvalues of $\Sigma = \frac{1}{m} X^T X$, each $S_{ii}$ is (up to the $\frac{1}{m}$ factor) the square of $s_i$, the $i$-th singular value of $X$.
Choosing $k$ (the number of principal components):

```matlab
[U, S, V] = svd(Sigma);
```

Pick the smallest value of $k$ for which

$$\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99 \qquad \text{(99\% of variance retained)}$$
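A minimal sketch of this check (using the `S` returned by the `svd` call above; `variance_retained` is my own name):

```matlab
% Pick the smallest k retaining at least 99% of the variance.
s = diag(S);                              % eigenvalues of Sigma, largest first
variance_retained = cumsum(s) / sum(s);
k = find(variance_retained >= 0.99, 1);   % first k that passes the threshold
```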
#### Advice for Applying PCA

Supervised learning speedup
Note: $X$ must be mean-normalized before PCA; feature scaling is optional but recommended when the features have very different variances.
The same transformation (the mean, any scaling, and $U_{\text{reduce}}$) must be learned on the training set only and then applied unchanged to `xval` and `xtest`.
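A sketch of reusing the training-set mapping on the other sets (`Xtrain`, `Xval`, `Xtest`, and `k` are assumed to exist; names are mine):

```matlab
% Learn the PCA mapping on the training set only...
mu = mean(Xtrain, 1);
Xc = bsxfun(@minus, Xtrain, mu);
Sigma = (1 / size(Xtrain, 1)) * (Xc' * Xc);
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);
% ...then apply exactly the same mu and Ureduce to the other sets.
Ztrain = Xc * Ureduce;
Zval   = bsxfun(@minus, Xval,  mu) * Ureduce;
Ztest  = bsxfun(@minus, Xtest, mu) * Ureduce;
```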
In summary, the applications of PCA are:

- Compression
  - Reduce the memory/disk space needed to store the data
  - Speed up a learning algorithm
- Visualization
Note: it might work OK, but using PCA to prevent overfitting is generally a bad application; use regularization instead:
$$\min_\theta \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

If overfitting is the problem, PCA would probably throw away some useful information, since it ignores the labels $y^{(i)}$.
People like to use PCA in the design of any ML system:
Design of an ML system:

- Get the training set $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$
- Run PCA to reduce $x^{(i)}$ in dimension, obtaining $z^{(i)}$
- Train logistic regression on $\{(z^{(1)}, y^{(1)}), \dots, (z^{(m)}, y^{(m)})\}$
- Test on the test set: map $x^{(i)}_{\text{test}}$ to $z^{(i)}_{\text{test}}$ and run $h_\theta(z)$ on $\{(z^{(1)}_{\text{test}}, y^{(1)}_{\text{test}}), \dots, (z^{(m)}_{\text{test}}, y^{(m)}_{\text{test}})\}$

Before implementing PCA, it is better to first run the system on the original/raw data. Only if that does not do what is desired should PCA be added.
## Anomaly Detection
### Density Estimation
#### Problem Motivation
#### Algorithm for Anomaly Detection
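Since the later sections assume this per-feature Gaussian model (fit $\mu_j$ and $\sigma_j^2$ for each feature, then flag $p(x) < \epsilon$), here is a minimal sketch of it; the variable names and the placeholder `epsilon` value are my own:

```matlab
% Density estimation with independent per-feature Gaussians.
mu = mean(X, 1);                          % mu_j for each feature j
sigma2 = var(X, 1);                       % sigma_j^2 (normalized by m)
diff2 = bsxfun(@minus, X, mu) .^ 2;       % (x_j - mu_j)^2 for every example
pj = bsxfun(@rdivide, exp(bsxfun(@rdivide, -diff2, 2 * sigma2)), sqrt(2 * pi * sigma2));
p = prod(pj, 2);                          % p(x) = product over the n features
epsilon = 0.01;                           % threshold, chosen on a CV set (see below)
is_anomaly = (p < epsilon);               % flag low-probability examples
```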
### Building an Anomaly Detection System
Because the labels are typically very skewed (anomalies are rare), prediction accuracy is not a good measure of the algorithm's performance.
- Possible evaluation metrics:

  Precision: $P = \dfrac{\#\text{True Positives}}{\#\text{Total Predicted Positives}}$

  Recall: $R = \dfrac{\#\text{True Positives}}{\#\text{Total Actual Positives}}$

  $F_1$ score: $F_1 = \dfrac{2PR}{P + R}$
- Use the cross-validation set to choose the parameter $\epsilon$ (e.g. pick the $\epsilon$ with the best $F_1$ score), as sketched below.
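A hedged sketch of selecting $\epsilon$ on the cross-validation set by maximizing $F_1$ (assuming `pval` holds $p(x)$ for the CV examples and `yval` their 0/1 anomaly labels; names are mine):

```matlab
% Try a range of epsilon values and keep the one with the best F1 score.
best_epsilon = 0;  best_F1 = 0;
for epsilon = linspace(min(pval), max(pval), 1000)
  pred = (pval < epsilon);               % predicted anomalies
  tp = sum((pred == 1) & (yval == 1));   % true positives
  fp = sum((pred == 1) & (yval == 0));   % false positives
  fn = sum((pred == 0) & (yval == 1));   % false negatives
  prec = tp / (tp + fp);  rec = tp / (tp + fn);
  F1 = 2 * prec * rec / (prec + rec);
  if F1 > best_F1
    best_F1 = F1;  best_epsilon = epsilon;
  end
end
```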
### Anomaly Detection (Gaussian) vs. Supervised Learning
Since we have labeled data, why not just use supervised learning to detect the anomalies?
Few positive (anomalous) examples: use anomaly detection, as the limited number of anomalies makes it hard for a supervised algorithm to learn what they look like.
### Choosing What Features to Use in Anomaly Detection
- If a feature is non-Gaussian, it is better to transform it (e.g. take $\log(x)$ or $x^{1/3}$) so that its distribution looks more Gaussian.
- If it is difficult to identify the anomalies with the current features, it helps to add more features, especially ones suggested by examining the anomalous examples.
### Multivariate Gaussian Distribution
In the previous anomaly detection algorithm, the features are assumed to be independent and normally distributed. If we want to account for dependency or correlation among the $n$ features, we can use the multivariate Gaussian distribution.
Note: use PCA to capture the normal?
Multivariate Gaussian (normal) distribution, $x \in \mathbb{R}^n$: don't model $p(x_1), p(x_2), \dots$ separately; model $p(x)$ all in one go.

Parameters: $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$ (the covariance matrix).

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} \lvert\Sigma\rvert^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

- Parameter fitting: given the training set $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, we can estimate the parameters as

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}, \qquad \Sigma = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \mu \right)\left( x^{(i)} - \mu \right)^T$$
Here $\mu, x^{(i)} \in \mathbb{R}^n$; in other words $\Sigma = \mathrm{E}\!\left[ (x - \mu)(x - \mu)^T \right]$, which in matrix form is $\frac{1}{m}(X - \mu)^T (X - \mu)$ with $X = \left[ x^{(1)\prime}; x^{(2)\prime}; \dots; x^{(m)\prime} \right]$ (one example per row), i.e. $X' = \left[ x^{(1)}, x^{(2)}, \dots, x^{(m)} \right]$.
- Algorithm: flag $x$ as an anomaly if $p(x; \mu, \Sigma) < \epsilon$ (see the sketch after the notes below).
Note that for the multivariate Gaussian:
- The $n$ feature vectors ($\in \mathbb{R}^m$) must be linearly independent, i.e. $\Sigma$ must have full rank $n$; this requires $m \ge n$ (in practice $m \gg n$), otherwise there is some $x$ with $x^T \Sigma x = 0$, which means $\Sigma$ is non-invertible.
- It is computationally more expensive, since the $n \times n$ matrix $\Sigma$ must be inverted.
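A minimal sketch of the multivariate-Gaussian version (fit $\mu$ and $\Sigma$, evaluate $p(x; \mu, \Sigma)$ for every example, flag values below $\epsilon$; names are mine, and `epsilon` is chosen on a CV set as above):

```matlab
% Multivariate Gaussian anomaly detection.
[m, n] = size(X);
mu = mean(X, 1);                       % mu in R^n
Xc = bsxfun(@minus, X, mu);            % centered data
Sigma = (Xc' * Xc) / m;                % covariance matrix, n-by-n
% p(x; mu, Sigma) for every example (one row of X at a time):
p = exp(-0.5 * sum((Xc / Sigma) .* Xc, 2)) / ((2 * pi)^(n / 2) * sqrt(det(Sigma)));
is_anomaly = (p < epsilon);            % flag low-probability examples
```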