Hamming distance

My work is in genetics and I'm using the Hamming distance (in MATLAB) to calculate the genetic distance between genotypes of a virus.
For example: Type 1 has structure 01234 and Type 2 has structure 21304, and so on. Obviously there are many genotypes present. Because the genotypes all have the same length, I thought using the Hamming distance would be fine.
My question is this: how can I order the genotypes based on the Hamming distance? Another way of putting this: how can I sort the genotypes into clusters based on the Hamming distance between them?
Thanks

You can use several methods to cluster such data.
Based on the distance matrix you can use UPGMA or neighbor joining.
Single linkage and complete linkage are also distance-based clustering methods.
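For instance, a minimal MATLAB sketch of the distance-matrix route (the genotype matrix G and the cluster count here are illustrative, not from the question):

% Genotypes as rows of a numeric matrix, one symbol per column.
G = [0 1 2 3 4;
     2 1 3 0 4;
     0 1 2 3 0];

% pdist with 'hamming' returns the fraction of differing positions;
% multiply by the sequence length for the raw Hamming distance.
D = pdist(G, 'hamming') * size(G, 2);

% Average linkage on the condensed distances corresponds to UPGMA.
Z = linkage(D, 'average');

% Cut the tree into, e.g., two clusters; dendrogram(Z) shows the tree.
idx = cluster(Z, 'maxclust', 2);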

Related

Discriminant analysis method to classify data

My aim is to classify the data into two sections, upper and lower, by finding the mid line between the peaks.
I would like to apply machine learning methods, i.e. discriminant analysis.
Could you let me know how to do that in MATLAB?
It seems that what you are looking for is a GMM (Gaussian mixture model). With K=2 (the number of mixture components) and dimension equal to 1, this is a simple, fast method that gives you a direct solution. Given the fitted components, it is easy to find a local minimum of the density analytically (roughly a weighted average of the means, with weights proportional to the standard deviations).
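A minimal sketch of that approach with fitgmdist from the Statistics and Machine Learning Toolbox (the synthetic data vector x is illustrative; the dividing line is found numerically rather than analytically):

% 1-D data: two synthetic groups standing in for the upper/lower peaks.
x = [randn(100,1); randn(100,1) + 5];

% Fit a two-component Gaussian mixture.
gm = fitgmdist(x, 2);

% Assign each point to a component (upper vs. lower section).
idx = cluster(gm, x);

% The dividing line: the local minimum of the fitted density
% between the two component means.
mus = sort(gm.mu);
xx  = linspace(mus(1), mus(2), 1000)';
[~, i]  = min(pdf(gm, xx));
midline = xx(i);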

String clustering using MATLAB?

I have a cell array of ~200k entries containing relatively small strings (1-2 words). I'm trying to cluster them based on string similarity. I've tried using Levenshtein distances to create a distance matrix (using a loop to compare each string to all other strings) in order to run hierarchical or k-means clustering on it, but I'm confused about how to use the distance matrix once it is formed (specifically in MATLAB). Any ideas or suggestions would be greatly appreciated.
k-means cannot operate on distance matrices.
It uses means, and squared deviation (= variance) from the mean only.
Hierarchical clustering works fine on distance matrices. See the documentation for how to pass a precomputed distance matrix.
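A minimal sketch of the precomputed-matrix route in MATLAB; the small matrix D here is a stand-in for the Levenshtein matrix built in the question's loop:

% Illustrative symmetric distance matrix for 4 strings.
D = [0 1 4 5;
     1 0 4 4;
     4 4 0 1;
     5 4 1 0];

% linkage expects the condensed vector form that pdist produces;
% squareform converts the full symmetric matrix into that form.
d = squareform(D, 'tovector');

Z   = linkage(d, 'average');       % or 'single' / 'complete'
idx = cluster(Z, 'maxclust', 2);   % cut the tree into 2 clusters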

Use Absolute Pearson Correlation as Distance in K-Means Algorithm (MATLAB)

I need to do some clustering using a correlation distance, but instead of the built-in 'distance' option 'correlation', which is defined as d = 1 - r, I need the absolute Pearson distance. In my application, anti-correlated data should get the same cluster ID. Right now, when using the kmeans() function, I'm getting centroids that are highly anti-correlated, which I would like to avoid by combining them. I'm not that fluent in MATLAB yet and have some problems reading the kmeans function. Would it be possible to edit it for my purpose?
Example:
Rows 1 and 2 should get the same cluster ID when using the correlation distance as the metric.
I made some attempts to edit the built-in MATLAB function (open kmeans -> line 775), but what's weird is that when I change the distance function I get a valid distance matrix but wrong cluster indexes; I can't find the reason for it.
Would love to get some tips! All best!
This is a good example of why you should not use k-means with other distance functions.
k-means does not minimize distances. It minimizes the sum of squared 1-dimensional deviations (SSQ).
Which is mathematically equivalent to squared Euclidean distance, so it does minimize Euclidean distances, as a mathematical side effect. It does not minimize arbitrary other distances, which are not equivalent to variance minimization.
In your case, it's pretty nice to see why it fails; I have to remember this as a demo case.
As you may know, k-means (Lloyd's algorithm, that is) consists of two steps: assign each point by minimum squared deviation, and then recompute the means.
Now the problem is that recomputing the mean is not consistent with absolute Pearson correlation.
Let's take two of your vectors, which are -1 correlated:
+1 +2 +3 +4 +5
-1 -2 -3 -4 -5
and compute the mean:
0 0 0 0 0
Boom. They are not at all correlated to their mean. In fact, Pearson correlation is not even well-defined for this vector anymore, because it has zero variance...
Why does this happen? Because you misinterpreted k-means as distance-based. It is just as much arithmetic-mean-based. The arithmetic mean is a least-squares (!) estimator: it minimizes the sum of squared deviations. And that is why squared Euclidean distance works: it optimizes the same quantity as recomputing the mean. Optimizing the same objective in both steps makes the algorithm converge.
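A quick check of the counter-example in MATLAB (corr is in the Statistics and Machine Learning Toolbox):

x = [ 1  2  3  4  5];
y = [-1 -2 -3 -4 -5];

corr(x', y')       % -1: perfectly anti-correlated
m = mean([x; y])   % [0 0 0 0 0]: the recomputed centroid
corr(x', m')       % NaN: correlation to the mean is undefined (zero variance)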
See also this counter-example for Earth Mover's distance, where the mean step of k-means yields suboptimal results (although probably not as bad as with absolute Pearson).
Instead of using k-means, consider using k-medoids aka PAM, which does work for arbitrary distances. Or one of the many other clustering algorithms, including DBSCAN and OPTICS.
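In MATLAB, kmedoids (Statistics and Machine Learning Toolbox) accepts a pdist-style function handle, so the absolute-Pearson distance can be passed directly; the data matrix X and k = 2 below are illustrative:

% Rows 1-2 are perfectly anti-correlated; rows 3-4 are uncorrelated
% with them but perfectly correlated with each other.
X = [ 1  2  3  4  5;
     -1 -2 -3 -4 -5;
      1 -1  1 -1  1;
      2 -2  2 -2  2];

% d = 1 - |r|, so anti-correlated rows come out as close.
% ZI is 1-by-n, ZJ is m-by-n; the result must be m-by-1.
absPearson = @(ZI, ZJ) 1 - abs(corr(ZI', ZJ'))';

idx = kmedoids(X, 2, 'Distance', absPearson);  % rows 1-2 share a cluster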
You can also try to modify another version of kmeans: that version is also efficient, but much simpler (around 10 lines of code).
There you can find the explanation of the code as well.

How do I choose k when using k-means clustering with the Silhouette function?

I've been studying k-means clustering, and one big thing which is not clear is what the Silhouette function really tells me.
I know it indicates what the appropriate k should be, but I can't understand what the value of the Silhouette function actually means.
I read somewhere that if the mean silhouette is less than 0.5 your clustering is not valid.
Thanks for your answers in advance.
From the definition of silhouette:
Silhouette Value
The silhouette value for each point is a measure of how similar that point is to points in its own cluster compared to points in other clusters, and ranges from -1 to +1.
The silhouette value for the ith point, Si, is defined as
Si = (bi - ai) / max(ai, bi)
where ai is the average distance from the ith point to the other points in the same cluster as i, and bi is the minimum average distance from the ith point to points in a different cluster, minimized over clusters.
This method simply compares intra-cluster similarity to the similarity of the closest other cluster. If a point's average distance to the other members of its own cluster is higher than its average distance to the members of some other cluster, its silhouette value is negative and the clustering is not successful. On the other hand, silhouette values close to 1 indicate a successful clustering. The 0.5 threshold is a rule of thumb, not an exact criterion.
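A minimal sketch of the k-selection sweep in MATLAB (silhouette and evalclusters are in the Statistics and Machine Learning Toolbox; the data X and the range 2:6 are illustrative):

% Two well-separated synthetic blobs.
X = [randn(50,2); randn(50,2) + 4];

for k = 2:6
    idx = kmeans(X, k);
    s   = silhouette(X, idx);       % per-point silhouette values
    fprintf('k = %d: mean silhouette = %.3f\n', k, mean(s));
end

% Or let evalclusters do the sweep and report the best k directly.
eva   = evalclusters(X, 'kmeans', 'silhouette', 'KList', 2:6);
bestK = eva.OptimalK;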
@fatihk gave a good citation; additionally, you may think of the silhouette value as a degree of how much clusters overlap with each other, i.e. -1: they overlap perfectly, +1: the clusters are perfectly separable.
BUT low silhouette values for a particular algorithm do NOT mean that there are no clusters; rather, they mean that the algorithm used cannot separate the clusters, and you may consider tuning your algorithm or using a different one (think of k-means on concentric circles vs. DBSCAN).
There is also an explicit formula associated with the elbow method for automatically determining the number of clusters: it quantifies the strength of the elbow(s) detected when applying the elbow method (the "Enhanced Elbow rule").

Passing Custom Distance Functions in K-Means

Is there a way of passing custom distance functions (e.g. Jaccard distance) to MATLAB's k-means implementation?
Jaccard distance function:
D = pdist(X,'jaccard');
What you need to do is break your distance matrix down into a feature space using SVD, then perform k-means on the new feature space represented by the scores of the SVD. See The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman.
Or you can use k-medoids, which works with a distance matrix; as.dist() in R will convert a matrix to a dist object that you can then run k-medoids on.
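A minimal MATLAB sketch of that embedding route, using classical multidimensional scaling (cmdscale, itself an eigendecomposition of the centred distance matrix) as the SVD-style step; the random binary data and the choices of 5 dimensions and 3 clusters are illustrative:

% Illustrative binary data: 100 observations, 20 binary features.
X = double(rand(100, 20) > 0.5);

% Full square Jaccard distance matrix.
D = squareform(pdist(X, 'jaccard'));

% Embed the distances into a Euclidean feature space.
Y = cmdscale(D);

% k-means on the leading coordinates of the embedding.
idx = kmeans(Y(:, 1:5), 3);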
From the documentation, we learn that we can pass a 'distance' option to kmeans:
'distance'
Distance measure, in p-dimensional space. kmeans minimizes with respect to this parameter. kmeans computes centroid clusters differently for the different supported distance measures.
'sqEuclidean'
Squared Euclidean distance (default). Each centroid is the mean of the points in that cluster.
'cityblock'
Sum of absolute differences, i.e., the L1 distance. Each centroid is the component-wise median of the points in that cluster.
'cosine'
One minus the cosine of the included angle between points (treated as vectors). Each centroid is the mean of the points in that cluster, after normalizing those points to unit Euclidean length.
'correlation'
One minus the sample correlation between points (treated as sequences of values). Each centroid is the component-wise mean of the points in that cluster, after centering and normalizing those points to zero mean and unit standard deviation.
'Hamming'
Percentage of bits that differ (only suitable for binary data). Each centroid is the component-wise median of points in that cluster.
So, for example:
[idx,ctrs] = kmeans(X,2, 'Distance','cityblock');
As for custom functions (i.e., user-implemented): AFAIK, this is not possible without hacking the relevant m-files.
