Wednesday, November 3, 2010

Convert R "xtabs" object to a dataframe

Lately, I used R's "xtabs" command to generate a two-dimensional contingency table. Then I tried to merge the result with a data frame I had created earlier. However, the "merge" function does not like the mismatched input types, meaning I had to transform the xtabs result into a data frame. The challenges were (1) the regular as.data.frame did not work, so I had to use as.data.frame.matrix, and (2) the column naming convention was not the one I wanted. So I wrote a little R function to accomplish the task of converting a 2-dimensional xtabs object to a data frame.

# define x
x <- data.frame(id=c(123,123), model=c(2,3), value=c(2.73,0.36))

# > x
# id model value
# 1 123 2 2.73
# 2 123 3 0.36

# create a two-dimensional xtabs object
aa <- xtabs(value~id+model, data=x)

xtabs_2_dataframe <- function(aa){
  # recover the dimension names originally used in the xtabs call
  nm_tmp <- attributes(dimnames(aa))$names
  if(length(nm_tmp) != 2){
    # this function only handles 2-dimensional xtabs objects
    cat('Error: the input xtabs object has to have exactly 2 dimensions!\n')
    return(NULL)
  } else {
    bb <- as.data.frame.matrix(aa)

    # prefix the value columns with the name of the second dimension
    colnames(bb) <- paste(nm_tmp[2], colnames(aa), sep='_')
    # turn the row names (levels of the first dimension) into a regular column
    bb$newcol <- dimnames(bb)[[1]]
    colnames(bb)[ncol(bb)] <- nm_tmp[1]
    rownames(bb) <- NULL

    # move that column to the front and restore its name
    cc <- data.frame(bb[, ncol(bb)], bb[, 1:(ncol(bb)-1)], stringsAsFactors = F)
    colnames(cc)[1] <- nm_tmp[1]
    return(cc)
  }
}

# > xtabs_2_dataframe(aa)
# id model_2 model_3
# 1 123 2.73 0.36
# > str(xtabs_2_dataframe(aa))
# 'data.frame': 1 obs. of 3 variables:
# $ id : chr "123"
# $ model_2: num 2.73
# $ model_3: num 0.36


Of course, you may ask: why not use the cast function in the reshape package? Well, I tried that and got a list object. So if I want a data frame, I still have to process that result, for example using do.call(cbind, ...) and assigning the proper column names (see the sketch after the output below).

# > cast(x, id~model+value)
# id 2_2.73 3_0.36
# 1 123 2.73 0.36
# > str(y<-cast(x, id~model+value))
# List of 3
# $ id : num 123
# $ 2_2.73: num 2.73
# $ 3_0.36: num 0.36
# - attr(*, "row.names")= int 1
# - attr(*, "idvars")= chr "id"
# - attr(*, "rdimnames")=List of 2
# ..$ :'data.frame': 1 obs. of 1 variable:
# .. ..$ id: num 123
# ..$ :'data.frame': 2 obs. of 2 variables:
# .. ..$ model: num [1:2] 2 3
# .. ..$ value: num [1:2] 2.73 0.36
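For completeness, here is a minimal sketch of that post-processing, assuming the reshape package is installed and using the x defined at the top of this post (the variable names y and z are just for illustration):

library(reshape)
y <- cast(x, id ~ model + value)        # the list-like object shown above
z <- as.data.frame(do.call(cbind, y))   # bind the list elements into columns
colnames(z) <- names(y)                 # restore the original column names
# z is now a plain data frame with columns id, 2_2.73 and 3_0.36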

Thursday, October 21, 2010

Thinking about self-join? Well, think again!

This week, the Greenplum people came on-site and talked about a few tricks that are very new to me. The ones that struck me most are

(1) about function overloading

So PostgreSQL allows you to have functions with the exact same name but different inputs or outputs, or even a different number of inputs. This is called function overloading. It's a feature of various programming languages like C++ and Java. PostgreSQL supports it for user-defined functions, so Greenplum has this feature by birth. And this is exactly WHY you have to specify the argument types when dropping a function in Greenplum!
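For example, here is a minimal sketch (with a hypothetical add_one function) of how two overloads coexist and why the argument types are needed to drop one of them:

-- two functions sharing one name but taking different argument types
CREATE FUNCTION add_one(integer) RETURNS integer AS $$ SELECT $1 + 1; $$ LANGUAGE sql;
CREATE FUNCTION add_one(numeric) RETURNS numeric AS $$ SELECT $1 + 1; $$ LANGUAGE sql;

-- dropping one of them requires the argument types to pick the right overload
DROP FUNCTION add_one(integer);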

I know, this can be very annoying! They should really have a switch so you can choose whichever option you want: follow the overloading rule or not. If you opt out of overloading and try to give a new function the same name as an existing one, Greenplum would throw an error at you. Overloading is fine as a language feature, but Greenplum could have given users the option.

(2) about "STRICT"

When writing user-defined functions, adding the "STRICT" option at the end tells PostgreSQL to return NULL automatically when someone passes a NULL to the function. The function will not even try to run with NULL inputs; it returns NULL immediately. This can be a nice option to speed things up.
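A minimal sketch, using a hypothetical safe_ratio function:

-- STRICT: skip the body and return NULL whenever any input is NULL
CREATE FUNCTION safe_ratio(numeric, numeric) RETURNS numeric AS $$
SELECT $1 / $2;
$$ LANGUAGE sql STRICT;

-- SELECT safe_ratio(10, NULL);   -- returns NULL without evaluating the division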

(3) about windowing function

Windowing functions are not a new concept to me. But the instructor's comment that all self-joins can be replaced by windowing functions is quite new, and kind of surprising to me. Of course, I believe it's not hard, it just takes time to get used to, particularly when one has been so used to table self-joins.

A windowing function lets you access a set of rows associated with the current row and perform various calculations over it, like sum, rank, average, max/min, etc. PostgreSQL has a very good tutorial on windowing functions. There are two things I particularly like: running totals and named windows.

So now the running total revenue calculation of campaigns is

SELECT campaign_id
, revenue
, SUM(revenue) OVER (w) AS running_total
, SUM(revenue) OVER () AS overall_total
, (SUM(revenue) OVER (w)) / (SUM(revenue) OVER ()) AS pct
FROM rev_table
WINDOW w AS (ORDER BY revenue DESC);

Very nice and clean, right?

Last but not least, first_value, last_value, or nth_value can be extremely useful sometimes; for example, I'd like to know what the last campaign a user clicked on in the past week was.

SELECT enc_email
, FIRST_VALUE(campaign_id) OVER(PARTITION BY enc_email ORDER BY click_date DESC) AS last_campaign
FROM user_history
LIMIT 1;

The "limit 1" at the end is very important. Without it, the end result will has as many rows as the original table filling with exact same rows.

Thursday, August 12, 2010

Extracting "gems" from web pages using BeautifulSoup and Python - extracting II

**** step 4 ****
extract the business address (suppose tt=results[0]).
The address itself is enclosed in a tag called "address", for example

<address>
3002 W 47th Ave <br/>Kansas City, MO 64103 <br/>
</address>

Since I want the address and zip code separate, I used

# address
>> address_tmp=tt.find('address')
>> address_tmp=BeautifulSoup(re.sub('(<.*?/>)',' ',str(address_tmp))).text
>> address=address_tmp.rstrip('1234567890 ')
>> zipcode=re.search(r'\d+$',address_tmp).group()

The first line says "find the tag named address". Because of the <br/> tags in the middle of the string, I have to use a regular expression to replace them with a single space. Then I turn the string back into a BeautifulSoup object and finally get the text between the "address" tags.

u'3002 W 47th Ave Kansas City, KS 66103'

The third line trims off the zip code and any spaces after the state abbreviation, while the fourth line uses a regular expression to extract the zip code (the digits at the end of the string).

One thing that needs attention is using "?" to make the search non-greedy, meaning the match stops as soon as the pattern is found; see the Google Python Class for more details.

**** step 5 ****
Oops, did I forget to get the business name first? All right, once you notice that the name part is a heading inside a div tag whose class attribute is "itemheading", you can get the entire piece and the name part easily through

>> name_rank=tt.find('div', attrs={'class':'itemheading'}).text
>> name_rank
u"1.\n \tOklahoma Joe's BBQ & Catering"
>> names=name_rank.lstrip('1234567890. ').replace('\n \t', '')
>> names
"Oklahoma Joe's BBQ & Catering"

Very similar to the address part; the new piece added is the replace function, operating on strings, to remove '\n \t'.

**** step 6 ****
Next, I'd like to know the category or categories for those restaurants. These are stored in the text of 'a' tags with the class label "category". The 2nd line below joins all category labels using ',' to make them one string, and the 3rd line removes '\n' from the string.

# multiple categories so
>> cats=tt.findAll('a',attrs={'class':'category'})
>> category=','.join([x.text for x in cats])
>> category=str(category).replace('\n', '')
>> category
u'Barbeque'


**** step 7 ****
A similar idea applies to getting the number of reviews and the rating, as follows

# reviews '217 reviews', have to parse the number part using regular expression
>> review_tmp=tt.find('a',attrs={'class':'reviews'}).text
>> rev_count=re.search(r'^\d+', review_tmp).group()
>> rev_count
u'217'
# rating scale
>> rating_tmp=tt.find('div',attrs={'class':'rating'})
>> rating=re.search(r'^\w+\.*\w*',rating_tmp('img')[0]['title']).group()
>> rating
u'4.5'



After all these 7 steps, I write the results into a csv file through

out = open('some.csv','a') # append and write
TWriter = csv.writer(out, delimiter='|', quoting=csv.QUOTE_MINIMAL)
TWriter.writerow([ranks, names, category, address, zipcode, rev_count,rating])
out.close()
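Putting steps 4 through 7 together, a minimal sketch of the whole loop might look like the following; the rank parsing via re.search is my own guess at how ranks was obtained, everything else mirrors the snippets above.

import re
import csv
from BeautifulSoup import BeautifulSoup

out = open('some.csv', 'a')                     # append and write
TWriter = csv.writer(out, delimiter='|', quoting=csv.QUOTE_MINIMAL)

for tt in results:                              # results from step 3
    # name and rank (step 5); the rank regex is an assumption
    name_rank = tt.find('div', attrs={'class': 'itemheading'}).text
    ranks = re.search(r'^\d+', name_rank).group()
    names = name_rank.lstrip('1234567890. ').replace('\n \t', '')

    # address and zip code (step 4)
    address_tmp = tt.find('address')
    address_tmp = BeautifulSoup(re.sub('(<.*?/>)', ' ', str(address_tmp))).text
    address = address_tmp.rstrip('1234567890 ')
    zipcode = re.search(r'\d+$', address_tmp).group()

    # categories (step 6)
    cats = tt.findAll('a', attrs={'class': 'category'})
    category = ','.join([x.text for x in cats]).replace('\n', '')

    # review count and rating (step 7)
    review_tmp = tt.find('a', attrs={'class': 'reviews'}).text
    rev_count = re.search(r'^\d+', review_tmp).group()
    rating_tmp = tt.find('div', attrs={'class': 'rating'})
    rating = re.search(r'^\w+\.*\w*', rating_tmp('img')[0]['title']).group()

    TWriter.writerow([ranks, names, category, address, zipcode, rev_count, rating])

out.close()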

Tuesday, August 3, 2010

Extracting "gems" from web pages using BeautifulSoup and Python - extracting I

The web crawling piece I talked about previously seems a little bit easier, so I will start from there. There are actually two pieces involved: the first is to crawl the website and download the related web pages, and the second is to extract information out of the saved html pages.

I will start from the 2nd half of the work and show, step by step, how I work with the sample web pages from yelp.com and process them in Python using BeautifulSoup.

Ok, let's get started. First of all, to give you an idea of what yelp.com looks like, below is a small screenshot of the restaurant search results for the Kansas City metro area. What I really want to extract from this page is: the business unit name, address, zip, category, number of reviews and rating.



This is a typical search result page from the yelp website, which includes things totally unrelated to what we need (like the sponsor ads for "The Well"), then the stuff we care about, and finally totally irrelevant stuff again. Not that bad, right? But when you look at the underlying html code, you will see this



Really messy!!! How can you clearly see the stuff you want, let alone extract anything from such an html page? This is when you should think about using BeautifulSoup, which is a terrific, flexible HTML/XML parser for Python.

Now, it's time to get our hands dirty with some python code.
**** step 1 ****
inside python (under windows), load the modules that will be used

import re # regular expression module
import os # operation system module
import csv
from BeautifulSoup import BeautifulSoup

**** step 2 ****
point Python to the html file I saved and open it, of course

os.chdir('C:\\Documents and Settings\\v\\Desktop\\playground\\yelp data')
f=open('rest3.htm','r')

**** step 3 ****
create a soup object and extract the business units

soup=BeautifulSoup(f)

results=soup.findAll('div', attrs={'class':'businessresult clearfix'})

After the "BeautifulSoup" command applied to file f, the output "soup" becomes a soup object (basically just a text file). Observing the original html pages, you will find yelp organize the search result using a div with class="businessresult clearfix". So the second line is really saying give me all the div sections which has class="businessresult clearfix". In this case, "results" is a list of 40 text stings, each one contains the information we want about a single restaurant. Next I will loop through each element in the list ("for tt in results:") and extract "gems" one by one.

Extracting "gems" from web pages using BeautifulSoup and Python - BI part

Lately, I have been trying to find a way to answer this question: what are the best local restaurants or spas? We are trying to figure out which local businesses will be worth the most to go after.

Ok, that could be a supervised or unsupervised learning problem. If I can find some data that indicates brand awareness and use some metrics as predictors, then it's a supervised problem. If I cannot find that kind of response indicator, then I have to figure out another way to build such an index using whatever predictive metrics I can find online and for free.

The first thing that came to my mind was yelp.com, which is a relatively comprehensive rating website for local businesses. Naturally, what I want to do is crawl that website and get some useful information out, like the location of a business (in order to calculate its distance to a particular user), the number of ratings (one indicator of brand awareness), the rating itself (brand quality), etc.

Secondly, I want to see if I can get some twitter data. If I happen to know the IP of a twitter user, I can figure out his/her location and see how many tweets a local business gets. Similarly, google analytics might be helpful too, but I am not sure.

Friday, July 30, 2010

Exists and Correlated Subquery are tricky

Today is the second day of our sql training. The material was too trivial yesterday, but it starts to get interesting now.

A subquery is a SELECT statement nested inside another (SELECT) query; that is how the SQL gets its structure. The nested query is normally called the "subquery", "inner query", or "lower query", as opposed to the "main query", "outer query", or "upper query".

There is a lot going on with subqueries. They are flexible; however, a lot of subqueries can be replaced by joins, which are more efficient.

A tricky example I found during the class is this one:


SELECT empno, bonus, comm
FROM emp
WHERE EXISTS (SELECT 'x' FROM dept WHERE mgrno=EMP.empno);


This query is designed to return the employee number, bonus, and commission for any employee who is listed as a manager in the dept table.

There are 4 things that draw my attention:
(1) emp table and dept table are correlated in the sub select query.
(2) EXISTS is used. This is an operator I barely use.
(3) An alphabetic constant 'x' is used. It can be replaced with any constant, like 1 or 'y', because it simply makes no difference: EXISTS only checks whether any row comes back (TRUE/FALSE).
(4) The way those queries become correlated is quite subtle: you simply reference the emp table's column directly in the subquery. However, if you try to run the subquery (SELECT 'x' FROM dept WHERE mgrno=EMP.empno) by itself, you will get an error.

One thing that needs to be pointed out: if the two tables are not correlated in the subquery, EXISTS will always return TRUE (as long as dept returns any rows), so you end up selecting every row in the emp table.
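As mentioned above, this particular pattern can also be written as a join; a minimal sketch (the DISTINCT guards against an employee managing more than one department):

SELECT DISTINCT e.empno, e.bonus, e.comm
FROM emp e
JOIN dept d
ON d.mgrno = e.empno;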

Tuesday, July 20, 2010

online vs cross-media marketing

The conversation between me and the interviewer (yes, I am looking for another opportunity) yesterday made me think a lot about how advertisers decide which media to use and how much to invest, and how marketing as a whole still needs a lot of improvement in terms of efficiency.

On the online advertising side, you have google, yahoo and microsoft, which make up the head of the long tail curve. They use their resources (machines, technology, people) just to figure out what the user is really interested in, what he is going to click next, and of course how to grab as much market share as possible in the online world. Machine learning, collaborative filtering, etc., etc. wowed a lot of us. The competition is so intense that you almost feel unable to breathe for even a second!

On the other side, when advertisers look at the budgets in their pockets, the internet is just one channel for broadcasting their products and their brand. They probably rely on analysis done by firms that do marketing mix analysis. Those people seem to come from a completely different background than those in the online world: they often have training in economics and marketing, and the models they use to analyze lift from different media and recommend a mix are more or less at the level of linear regression.

So think about this: it is those "simple" models that recommend how much money actually goes into this industry, a very important starting point that may not be very efficient. And the internet firms use all the "fancy" stuff to make sure that money gets spent efficiently. How ironic! Of course, there is nothing wrong with using simple models. What shocked me is really the contrast.

Friday, July 9, 2010

R functions that handle binary response

Lately, I have been working with data that has binary response in R.

The project's focus is to predict a user's click probability; my predictors this time are all numerical.

To me, R has evolved to a point where serious summarization and information aggregation are needed. That's why I think r-bloggers is a terrific idea and I highly recommend it to every R beginner like myself.

Well, this post is meant to be a summary page of some R functions/packages that I found can handle binary response data and maybe also perform variable selection at the same time. Of course, they may handle other types of responses too, like Gaussian or Poisson, but this post is written from the perspective of binary responses or success/failure counts.

Next, I am going to list them by function name and give a brief example-like syntax template; a small worked sketch follows the list.


# glm
** package: default
** input format: glm(Y~X1+X2, data, family='binomial', ...) or glm(cbind(success, total-success)~X1+X2, data, family='binomial')
** variable selection: no, unless you tweak manually or use functions from other packages to help select predictors, like 'stepAIC' in the MASS package. It produces an ANOVA table for the predictors but won't necessarily choose the best set.

# gbm
** package: gbm (Generalized Boosted Regression Models)
** input format: gbm(Y~X1+X2, data, distribution=c('bernoulli', 'adaboost'), ...)
** variable selection: yes.

# gl1ce
** package: lasso2 (L1 Constrained Estimation)
** input format: similar to glm
** variable selection: yes.

# earth
** package: earth (Multivariate Adaptive Regression Spline Models)
** input format: earth(Y~X1+X2, data, glm=list(family=binomial), ...) or earth(cbind(success, total-success)~X1+X2, glm=list(family=binomial), ...)
** variable selection: yes.

# bestglm
** package: bestglm (Best subset glm using AIC, BIC, EBIC, BICq or Cross-Validation)
** input format: bestglm(Xy, family='binomial', IC=..., ...) (Xy is a data frame containing the design matrix X and the output variable y; all columns must be named, and I think y must be binary)
** variable selection: yes.

# step.plr (L2 Penalized Logistic Regression with a Stepwise Variable Selection)
** package: stepPlr
** input format: step.plr(X, Y, type=c("both", "forward", "forward.stagewise"), ...)
** variable selection: yes.
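To make the templates above concrete, here is a minimal sketch that fits the first two of these on simulated binary data (the data, coefficients, and column names are made up purely for illustration):

# simulated binary-response data
set.seed(1)
n <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, 1, plogis(0.5 * df$x1 - 1.0 * df$x2))

# plain logistic regression
fit_glm <- glm(y ~ x1 + x2, data = df, family = 'binomial')
summary(fit_glm)

# boosted trees with a bernoulli loss (requires the gbm package)
library(gbm)
fit_gbm <- gbm(y ~ x1 + x2, data = df, distribution = 'bernoulli', n.trees = 100)
summary(fit_gbm)   # relative influence doubles as a crude variable ranking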