Friday, July 30, 2010

Exists and Correlated Subquery are tricky

Today is the second day of our SQL training. The material was too trivial yesterday, but it is starting to get interesting now.

A subquery is a SELECT statement nested inside another (SELECT) query; this nesting is where the "structured" part of SQL comes in. The nested query is commonly called the "subquery", "inner query", or "lower query", relative to the "main query", "outer query", or "upper query".

There is a lot going on with subqueries. They are flexible; however, many subqueries can be replaced by joins, which are often more efficient.

A tricky example I found during the class is this one:

SELECT empno, bonus, comm
FROM emp
WHERE EXISTS (SELECT 'x' FROM dept WHERE mgrno=EMP.empno);

This query is designed to return the employee number, bonus, and commission for any employee who is listed as a manager in the dept table.

There are 4 things that draw my attention:
(1) The emp table and dept table are correlated in the sub-select query.
(2) EXISTS is used. This is an operator I barely use.
(3) The constant 'x' is arbitrary. It can be replaced with any constant, like 1 or 'y', because the value makes no difference: EXISTS only checks whether the subquery returns any rows at all, so it just serves as "TRUE"/"FALSE".
(4) The way the queries become correlated is quite subtle. You just reference the emp table directly in the subquery. However, if you try to run the subquery (SELECT 'x' FROM dept WHERE mgrno=EMP.empno) by itself, you will get an error, because EMP.empno can only be resolved in the context of the outer query.

One thing worth pointing out: if the two tables are not correlated in the subquery, EXISTS returns TRUE for every outer row (as long as dept has any rows at all), so you end up selecting every row in the emp table.
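As noted above, this kind of correlated EXISTS can usually be rewritten as a join. A sketch of the equivalent query, using the same table and column names as the example (the DISTINCT is a guard in case an employee manages more than one department):

SELECT DISTINCT e.empno, e.bonus, e.comm
FROM emp e
JOIN dept d ON d.mgrno = e.empno;

Either form returns only employees who appear as a manager in dept; the join version just makes the matching condition explicit instead of hiding it in a subquery.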

Tuesday, July 20, 2010

online vs cross-media marketing

The conversation between the interviewer and me (right, I am looking for another opportunity) yesterday made me think a lot about how advertisers decide which media to invest in and how much, and how marketing as a whole still needs a lot of improvement in terms of efficiency.

On the online advertising side, you have Google, Yahoo, and Microsoft, which make up the head of the long-tail curve. They pour their resources (machines, technology, people) into figuring out what the user is really interested in and what he is going to click next, and of course into grabbing as much market share as possible in the online world. Machine learning, collaborative filtering, and so on have wowed a lot of us. The competition is so intense that you almost feel unable to breathe for even a second!

On the other side, when advertisers look at the budgets in their pockets, the internet is just one channel for broadcasting their products and their brand. They probably rely on analysis done by firms that do marketing-mix modeling. Those people seem to come from a completely different background than the ones in the online world; they often have training in economics and marketing. The models they use to analyze lift from different media and recommend a mix are more or less at the level of linear regression.

So think about it: it is those "simple" models that recommend how much money actually goes into this industry, and that very important starting point may not be very efficient. Then the internet firms use all the "fancy" stuff to make sure the money gets spent efficiently. How ironic this is! Of course, there is nothing wrong with using simple models. What shocked me is really the contrast!

Friday, July 9, 2010

R functions that handle binary response

Lately, I have been working with binary-response data in R.

The project's focus is to predict a user's click probability. My predictors this time are all numerical.

To me, R has evolved to a point where serious summarization and aggregation of information are needed. That's why I think r-bloggers is a terrific idea and highly recommend it to every R beginner like myself.

Well, this post is meant to be a summary page on some of the R functions/packages that I found can handle binary-response data, and maybe also perform variable selection at the same time. Of course, they may handle other types of responses as well, like Gaussian or Poisson, but this post is written from the perspective of binary responses or success/failure counts.

Next, I am going to list them by function name, and give a brief example-like syntax template.

# glm
** package: default
** input format: glm(Y~X1+X2, data, family='binomial', ...) or glm(cbind(success, total-success)~X1+X2, data, family='binomial')
** variable selection: no, unless you tweak manually or use functions from other packages to help select predictors, like 'stepAIC' in the MASS package. anova() will produce a table for the predictors, but it won't necessarily choose the best set.
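A minimal sketch of both input forms, using made-up names (mydata as the data frame, clicked as a 0/1 response, x1 and x2 as numeric predictors):

# Bernoulli (0/1) response form
fit <- glm(clicked ~ x1 + x2, data = mydata, family = binomial)
summary(fit)

# Success/failure-count form: each row is a group of trials
fit2 <- glm(cbind(success, total - success) ~ x1 + x2,
            data = mydata, family = binomial)

Note that family = binomial and family = 'binomial' are both accepted; glm() resolves the string to the same family object.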

# gbm
** package: gbm (Generalized Boosted Regression Models)
** input format: gbm(Y~X1+X2, data, distribution='bernoulli' (or 'adaboost'), ...)
** variable selection: yes.
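A sketch along the same lines, assuming the gbm package is installed and reusing the same hypothetical data frame and column names; the tuning values are only illustrative:

library(gbm)

# 'bernoulli' expects a 0/1 response
fit <- gbm(clicked ~ x1 + x2, data = mydata, distribution = "bernoulli",
           n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)

# Relative influence of each predictor; this is where the
# built-in variable ranking/selection shows up
summary(fit)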

# gl1ce
** package: lasso2 (L1 Constrained Estimation)
** input format: similar to glm
** variable selection: yes.

# earth
** package: earth (Multivariate Adaptive Regression Spline Models)
** input format: earth(Y~X1+X2, data, glm=list(family=binomial), ...) or earth(cbind(success, total-success)~X1+X2, glm=list(family=binomial), ...)
** variable selection: yes.
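A sketch for earth with the same hypothetical names; evimp() reports the importance of the variables that survive the pruning pass:

library(earth)

fit <- earth(clicked ~ x1 + x2, data = mydata,
             glm = list(family = binomial))
summary(fit)
evimp(fit)  # estimated variable importances from the pruned model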

# bestglm
** package: bestglm (Best subset glm using AIC, BIC, EBIC, BICq or Cross-Validation)
** input format: bestglm(Xy, family='binomial', IC=..., ...) (Xy is a data frame containing the design matrix X and the output variable y; all columns must be named. I think y must be binary.)
** variable selection: yes.

# step.plr (L2 Penalized Logistic Regression with a Stepwise Variable Selection)
** package: stepPlr
** input format: step.plr(X, Y, type=c("both", "forward", "forward.stagewise"), ...)
** variable selection: yes.