Friday, July 9, 2010

R functions that handle binary response

Lately, I have been working in R with data that has a binary response.

The project's focus is to predict a user's click probability. My predictors this time are all numerical.

To me, R has evolved to a point where serious summary and information aggregation are needed. That's why I think r-bloggers is a terrific idea and highly recommend it to every R beginner like myself.

Well, this post is meant to be a summary page of some of the R functions/packages that I found can handle binary response data and maybe also perform variable selection at the same time. Of course, they might handle other types of responses too, like Gaussian or Poisson. But this post is written from the perspective of binary responses or success/failure counts.

Next, I am going to list them by function name, and give a brief example-like syntax template.


# glm
** package: default
** input format: glm(Y~X1+X2, data, family='binomial', ...) or glm(cbind(success, total-success)~X1+X2, data, family='binomial')
** variable selection: no, unless you pick predictors manually or use functions from other packages to help, like 'stepAIC' in the MASS package. You can get an ANOVA (analysis of deviance) table for the predictors, but that won't necessarily choose the best set.
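
To make the two input formats concrete, here is a minimal sketch on simulated data. The data frames d and g, the column names and the coefficients are all made up for illustration; only glm, anova, predict and MASS::stepAIC are real functions.

```r
# Simulated data: every name and number here is an assumption for illustration.
set.seed(1)
n <- 500
d <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
d$Y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * d$X1))  # X2 has no true effect

# Form 1: one row per observation, Y coded 0/1
fit1 <- glm(Y ~ X1 + X2, data = d, family = binomial)
summary(fit1)
anova(fit1, test = "Chisq")                      # sequential analysis of deviance table
p_hat <- predict(fit1, newdata = d, type = "response")  # fitted probabilities

# Form 2: grouped data with success/failure counts
g <- data.frame(X1 = c(-1, 0, 1, 2), total = rep(50, 4))
g$success <- rbinom(nrow(g), size = g$total, prob = plogis(-0.5 + 1.2 * g$X1))
fit2 <- glm(cbind(success, total - success) ~ X1, data = g, family = binomial)

# Stepwise selection by AIC, starting from the full model
library(MASS)
fit_step <- stepAIC(fit1, direction = "both", trace = FALSE)
formula(fit_step)                                # predictors kept by the search
```

Note that stepAIC only searches along a stepwise path, so the model it returns is not guaranteed to be the best subset.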

# gbm
** package: gbm (Generalized Boosted Regression Models)
** input format: gbm(Y~X1+X2, data, distribution='bernoulli', ...) (or distribution='adaboost'; pick one)
** variable selection: yes.
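
A rough sketch of a gbm fit for a 0/1 response, assuming the gbm package is installed. The simulated data, the tuning values (n.trees, interaction.depth, shrinkage) and the variable names are my own placeholders, not recommendations. As far as I can tell, the "variable selection" here comes in the form of a relative influence ranking rather than a hard in/out decision.

```r
library(gbm)   # assumes install.packages("gbm") has been run

# Simulated data: names and coefficients are made up for illustration.
set.seed(1)
n <- 1000
d <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
d$Y <- rbinom(n, 1, plogis(-0.5 + 1.5 * d$X1 - 1.0 * d$X2))  # X3 is pure noise

# Y must be coded 0/1 for the bernoulli distribution
fit <- gbm(Y ~ X1 + X2 + X3, data = d,
           distribution = "bernoulli",
           n.trees = 300, interaction.depth = 2, shrinkage = 0.05)

# Relative influence of each predictor (a data frame with columns var, rel.inf)
infl <- summary(fit, n.trees = 300, plotit = FALSE)
print(infl)

# Predicted probabilities on the response scale
p_hat <- predict(fit, newdata = d, n.trees = 300, type = "response")
```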

# gl1ce
** package: lasso2 (L1 Constrained Estimation)
** input format: similar to glm
** variable selection: yes.

# earth
** package: earth (Multivariate Adaptive Regression Spline Models)
** input format: earth(Y~X1+X2, data, glm=list(family=binomial), ...) or earth(cbind(success, total-success)~X1+X2, data, glm=list(family=binomial), ...)
** variable selection: yes.
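
Here is a small sketch of the earth call with a binomial glm on top, assuming the earth package is installed. The data are simulated and all names and coefficients are invented for illustration; evimp is the package's variable importance helper.

```r
library(earth)   # assumes install.packages("earth") has been run

# Simulated data: names and coefficients are made up for illustration.
set.seed(1)
n <- 1000
d <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
# A hinge-shaped effect of X2, which is the kind of shape MARS terms can pick up
d$Y <- rbinom(n, 1, plogis(-0.5 + 1.5 * d$X1 - 1.0 * pmax(d$X2, 0)))

# MARS basis functions are built first, then a binomial glm is fit on them
fit <- earth(Y ~ X1 + X2 + X3, data = d, glm = list(family = binomial))
summary(fit)

# Which predictors survived the pruning pass, and how important they look
imp <- evimp(fit)
print(imp)

# Predicted probabilities (returned as a one-column matrix)
p_hat <- predict(fit, newdata = d, type = "response")
```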

# bestglm
** package: bestglm (Best subset glm using AIC, BIC, EBIC, BICq or Cross-Validation)
** input format: bestglm(Xy, family='binomial', IC=..., ...), where Xy is a data frame containing the design matrix X and the output variable y. All columns must be named. I think y must be binary.
** variable selection: yes.
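
Since bestglm takes a data frame instead of a formula, a quick sketch of how I would build Xy may help. This assumes the bestglm package is installed and, as far as I understand the documentation, that y sits in the last column. The data and names are simulated placeholders, and I pass the family function binomial here rather than a string.

```r
library(bestglm)   # assumes install.packages("bestglm") has been run

# Simulated data: names and coefficients are made up for illustration.
set.seed(1)
n <- 500
X <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * X$X1))   # only X1 has a true effect

# Predictors first, response in the last column (my reading of the docs)
Xy <- cbind(X, y = y)

# Exhaustive best subset search scored by BIC
fit <- bestglm(Xy, family = binomial, IC = "BIC")

fit$BestModel   # the glm fit for the winning subset
fit$Subsets     # best subset of each size with its criterion value
```

The search is exhaustive for non-Gaussian families, so it can get slow as the number of predictors grows.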

# step.plr
** package: stepPlr (L2 Penalized Logistic Regression with a Stepwise Variable Selection)
** input format: step.plr(X, Y, type=c("both", "forward", "forward.stagewise"), ...)
** variable selection: yes.
