Thursday, June 16, 2011

One-linear R: multiple ggplots on one page

First, par(mfrow=c(nrow,ncol)) does not go together with ggplot2 plots. Too bad, the simplest way I know of just wouldn't work.

There are two solutions I found out. The first is posted here. It's a smart function that performs par(mfrow) tasks. I just copied the code here
# Source: http://gettinggeneticsdone.blogspot.com/2010/03/arrange-multiple-ggplot2-plots-in-same.html

# Function used below in the function 'arrange'
vp.layout <- function(x, y) viewport(layout.pos.row=x, layout.pos.col=y)

# Function to plot multiple ggplots together
arrange <- function(..., nrow=NULL, ncol=NULL, as.table=FALSE) {
dots <- list(...)
n <- length(dots)
if(is.null(nrow) & is.null(ncol)) { nrow = floor(n/2) ; ncol = ceiling(n/nrow)}
if(is.null(nrow)) { nrow = ceiling(n/ncol)}
if(is.null(ncol)) { ncol = ceiling(n/nrow)}

## NOTE see n2mfrow in grDevices for possible alternative
grid.newpage()
pushViewport(viewport(layout=grid.layout(nrow,ncol) ) )
ii.p <- 1
for(ii.row in seq(1, nrow)){
ii.table.row <- ii.row
if(as.table) {ii.table.row <- nrow - ii.table.row + 1}

for(ii.col in seq(1, ncol)){
ii.table <- ii.p
if(ii.p > n) break
print(dots[[ii.table]], vp=vp.layout(ii.table.row, ii.col))
ii.p <- ii.p + 1
}}}

Another solution would be to use the 'grid.arrange' function in 'gridExtra' package. This function cooperates with not only plots, but also tables ('tableGrob'). So a plot can have both plots and text tables in it. And it has more controls over the title and sub titles. Very slick!

# Test
x <- qplot(mpg, wt, data=mtcars)
y <- qplot(1:10, letters[1:10])
arrange(x, y, nrow=1)
grid.arrange(x, y, ncol=1)

Wednesday, June 15, 2011

Use LDA Models on User Behaviors

LDA (Latent Dirichlet Allocation) model is a Bayesian statistical technique that is designed to find hidden (latent) groups of the data. For example, LDA helps to find topics (latent groups of individual words) among large document corpora. LDA is a probabilistic generative model, i.e., it assumes a generating mechanism underneath the observations and then use the observations to infer the parameters for the underlying mechanism.

When trying to find the topics among documents, LDA makes the following assumptions:
  • ignoring the order of words, documents are simply bags of words and documents exhibits multiple topics.
  • topics have distributions over words. So each word could appear in all the topics with different probabilities though. Say "transmission" will have a higher probability in the topic about auto repair, but lower probability in the topic about child education.
  • each document is obtained by choosing some topics proportions, say {(topic A: .55), (topic B: .20), (topic C: 25)}.
  • and for each word inside a document, choosing a topic from those proportions (say you randomly got number .6 (between 0 and 1 of course), the topic got chosen is B) then looking at the topic over words distribution and drawing a word from that topic (say word X).

Only the actual words are visible to us, so inference has to be made on per-word topic assignment, per-document topic proportion and per-corpus topic distribution. Dirichlet distributions are chosen to be the prior distributions for the latter two distributions. And various algorithms are used to make the inference, like Gibbs sampling (implemented in R package 'topicmodels').

LDA assumes a fixed number of topics existing in a document corpora and assigns soft membership of the topics to documents (i.e, a topic could be exhibited in multiple documents). Indeed, LDA is under a bigger umbrella called "mixed membership model framework" [refer to source [1] for more) .

The soft assignment and mixed membership nature of LDA models make it a reasonable candidate for user behavior analysis, particularly user behavior through a certain period of time. Say one has a log of user behavior at different stage of user life cycle (1st day, 2nd day, etc of a user's life). Then one can treat behavior as words, each day's users aggregated behavior log as document, and let LDA figure out the topics (group of behaviors that tend to appear together, 'co-appearance') and the proportions of the topics. And that will give some idea on how behavior of the group evolves over time. A R visualization could look like this (each tick on the y axis represents a topic).



However, I do find that sometime LDA tries too hard on discovering topics with niche words (after filtering out stop words and rare words). Depending on application, this could be a bless or a curse.

sources:
[1] http://www.cs.cmu.edu/~lafferty/pub/efl.pdf
[2] http://videolectures.net/mlss09uk_blei_tm/
[3] http://cran.r-project.org/web/packages/topicmodels/index.html
[4] http://www.cs.princeton.edu/~blei/

Tuesday, June 14, 2011

One-linear R: neat way to slice the data

Recently found out that library 'caret' has a few interesting functions, which are very handy to do the following tasks:
series of test/training partition (createDataPartition function)
create one or more bootstrap samples (createResample function)
split the data into k groups (createFolds function)

'Caret' is a super wrapper package. One can run different type of models by calling functions in this package only.