Friday, May 27, 2011

One-liner R: remove data frame columns by names

Sometimes I want to remove some columns from my data. What I did in that case was find the unwanted column ids and remove those. It's just a couple of lines in R. But why not write a function that does the job and adds some error handling? Next time, I can just copy and paste!
rm.col.by.name <- function(dat, colname.list) {

  l1 <- length(colname.list)
  l2 <- which(colnames(dat) %in% colname.list)

  if (length(l2) == 0) {
    warning('Did not find any matching columns. Check input column names, please.')
    return(dat)  # nothing matched, so return the data unchanged
  }

  if (length(l2) < l1) {
    warning('Did not find all matching columns. Only those found are removed.')
  }

  # drop = FALSE keeps the result a data frame even if only one column remains
  dat[, -l2, drop = FALSE]
}
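
A quick usage example with the built-in mtcars data:

head(rm.col.by.name(mtcars, c('mpg', 'cyl')))   # both columns removed
head(rm.col.by.name(mtcars, c('mpg', 'foo')))   # warns; only 'mpg' is removed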

Wednesday, May 25, 2011

Canopy Clustering

Lately I was introduced to a clustering algorithm called "canopy clustering". In plain English, like all other clustering techniques, it is designed to divide observations into segments (subsets, clusters, groups, buckets, whatever you call them), so that observations inside each segment are "similar" in terms of certain measurement criteria. Those criteria are distances or dissimilarities.

There are a lot of algorithms that cluster observations. Some are more computation-intensive, like k-medoids (compared to k-means). Some yield hard assignments, i.e., one observation can belong to only one cluster. Others generate soft assignments, i.e., one observation can belong to multiple clusters, like LSH (locality-sensitive hashing) and canopy clustering. Some scale better than others.

The canopy clustering procedure is relatively straightforward. The key ingredients are two distance thresholds T1 and T2, where T1 > T2. For a list of data points, if a point is within distance T2 of any of the existing canopy centers, then that point cannot be a new canopy center; otherwise, it is added to the canopy center set. With this rule, a set of canopy centers is obtained. Then, if an input data point is within distance T1 of an existing center, that point belongs to the canopy formed by that center. As one can see, a data point can belong to multiple canopies. The following plot illustrates the concept.

[Figure: data points and canopy centers, with the outer T1 radius and inner T2 radius drawn around each center.]

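Here is a minimal R sketch of the two steps just described. The function and argument names (canopy.centers, canopy.membership, dat, T1, T2) are my own, and the sketch assumes a plain Euclidean distance on a numeric matrix with one row per point:

# Step 1: greedily pick canopy centers. A point within T2 of an existing
# center cannot become a new center.
canopy.centers <- function(dat, T2) {
  centers <- dat[1, , drop = FALSE]   # the first point seeds the center set
  for (i in seq_len(nrow(dat))[-1]) {
    d <- sqrt(rowSums(sweep(centers, 2, dat[i, ])^2))   # distance to each center
    if (all(d > T2)) centers <- rbind(centers, dat[i, ])
  }
  centers
}

# Step 2: soft assignment. A point joins every canopy whose center is within
# T1, so it can belong to multiple canopies.
canopy.membership <- function(dat, centers, T1) {
  lapply(seq_len(nrow(dat)), function(i) {
    d <- sqrt(rowSums(sweep(centers, 2, dat[i, ])^2))
    which(d <= T1)   # indices of the canopies point i belongs to
  })
}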
Usually, canopy clustering uses "cheap" distance metrics (easy and fast to calculate), so the process takes less time than k-means clustering. This is a typical trade-off between precision and speed. Thus people combine k-means and canopy clustering to cluster huge amounts of data. The way the two work together is:
1. Canopy clustering uses the cheaper distance metric (though one in the same direction as the k-means distance, meaning that if two points are close in terms of the k-means distance, they should be close in terms of the canopy distance too) and partitions the data into canopies.
2. K-means still tries to cluster all the data points; however, the distance between an input data point and a k-means centroid is calculated if and only if they are in the same canopy. In this way, a lot of "unnecessary" calculations are avoided, as the sketch below shows.
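
A minimal sketch of that restricted assignment step, again with hypothetical names (assign.with.canopies, point.canopies, centroid.canopies); the two membership lists are assumed to come from a function like canopy.membership above, and every point is assumed to share a canopy with at least one centroid (the orphan case is discussed at the end of this post):

# For each point, compute distances only to centroids that share at least one
# canopy with it, and return the index of the nearest eligible centroid.
assign.with.canopies <- function(dat, centroids, point.canopies, centroid.canopies) {
  sapply(seq_len(nrow(dat)), function(i) {
    shared <- which(sapply(centroid.canopies, function(cc)
      length(intersect(cc, point.canopies[[i]])) > 0))
    d <- sqrt(rowSums(sweep(centroids[shared, , drop = FALSE], 2, dat[i, ])^2))
    shared[which.min(d)]
  })
}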

There are some interesting variations around combining those two, or simply using canopy clustering to generate the initial k-means centroids. If serving canopy clustering as a k-means centroid feed, one can perform only the center generation part, or perform both the center generation and membership assignment parts and then re-calculate the center of each canopy cluster. If instead utilizing canopy clustering to partition the data for k-means first, keeping the ratio between the number of canopies and k between 1:1 and 1:10 has been suggested. That means you always want the number of canopies to be smaller than k, so that each point is likely to be in a canopy with at least one k-means centroid. But how do you find out how many canopies you are going to get? The trick is that you have to play with T1 and T2. The centroid-feed variation might look like the sketch below.
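
A hedged sketch of that centroid-feed variation, reusing the helpers above; the name seed.kmeans.with.canopies is my own, and the final call is to base R's kmeans():

# Generate canopy centers, assign membership, re-calculate each canopy's
# center as the mean of its members, and use those means to seed k-means.
seed.kmeans.with.canopies <- function(dat, T1, T2, ...) {
  centers <- canopy.centers(dat, T2)
  members <- canopy.membership(dat, centers, T1)
  seeds <- t(sapply(seq_len(nrow(centers)), function(j) {
    in.canopy <- sapply(members, function(m) j %in% m)
    colMeans(dat[in.canopy, , drop = FALSE])
  }))
  kmeans(dat, centers = seeds, ...)
}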

For implementation in a map-reduce framework, this blog post is very clear. One thing worth pointing out is that after the canopy centers are generated by one map-reduce job, the next map-reduce job assigns all of the input data to those centers. At that point, some data may no longer be within T1 of any canopy center. Solutions in this case include forming a new canopy with such a point as its center, assigning the point to the nearest center, or simply discarding it.
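
A minimal sketch of the nearest-center fallback, with the hypothetical name assign.or.fallback:

# Return the indices of all canopy centers within T1 of point p; if there
# are none, fall back to the single nearest center.
assign.or.fallback <- function(p, centers, T1) {
  d <- sqrt(rowSums(sweep(centers, 2, p)^2))
  hits <- which(d <= T1)
  if (length(hits) > 0) hits else which.min(d)
}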