Tuesday, August 3, 2010

Extracting "gems" from web pages using BeautifulSoup and Python - BI part

Lately I have been trying to answer this question: what are the best local restaurants or spas? In other words, we are trying to figure out which local businesses are most worth going after.

OK, that could be a supervised or an unsupervised learning problem. If I can find some data that indicates brand awareness, and use other metrics as predictors, then it's a supervised problem. If I cannot find that kind of response variable, then I have to figure out another way to build the index myself, using whatever predictive metrics I can find online for free.
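To make the "build the index myself" idea a bit more concrete, here is a minimal sketch of one possible approach: standardize each metric and take a weighted sum. This is only an illustration, not a settled method. The field names (`review_count`, `rating`), the equal weights, and the sample numbers are all assumptions made up for the example.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a list of numbers to mean 0 and unit (population) std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: no business stands out on this metric.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def awareness_index(businesses, weights=(0.5, 0.5)):
    """Combine standardized metrics into one score per business.

    `businesses` is a list of dicts with hypothetical keys
    'name', 'review_count' and 'rating'. The weights are arbitrary
    placeholders, not tuned values.
    """
    counts = z_scores([b["review_count"] for b in businesses])
    ratings = z_scores([b["rating"] for b in businesses])
    w_count, w_rating = weights
    return {
        b["name"]: w_count * c + w_rating * r
        for b, c, r in zip(businesses, counts, ratings)
    }


# Made-up sample data, purely for illustration.
sample = [
    {"name": "A", "review_count": 200, "rating": 4.5},
    {"name": "B", "review_count": 20, "rating": 3.0},
    {"name": "C", "review_count": 100, "rating": 4.0},
]
scores = awareness_index(sample)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Standardizing first keeps a metric with a large range (number of ratings) from drowning out one with a small range (the rating itself). How to weight the two is a judgment call that this sketch does not answer.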

The first thing that came to my mind was yelp.com, a relatively comprehensive rating website for local businesses. Naturally, what I want to do is crawl that website and pull out some useful information, such as the location of a business (in order to calculate its distance from a particular user), the number of ratings (one indicator of brand awareness), the rating itself (brand quality), etc.
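As a rough sketch of the extraction step, the snippet below uses BeautifulSoup to pull the name, address, rating and number of ratings out of a small HTML fragment. The markup is invented for illustration: the tag and class names (`business`, `name`, `address`, `rating`, `review-count`) are assumptions and are not Yelp's real page structure, which would have to be inspected first. It also assumes the `bs4` package (BeautifulSoup 4) is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these class names are made up for the example
# and do not reflect any real site's HTML.
SAMPLE_HTML = """
<div class="business">
  <span class="name">Example Bistro</span>
  <span class="address">123 Main St, Springfield</span>
  <span class="rating">4.5</span>
  <span class="review-count">212 reviews</span>
</div>
<div class="business">
  <span class="name">Sample Spa</span>
  <span class="address">9 Oak Ave, Springfield</span>
  <span class="rating">3.0</span>
  <span class="review-count">17 reviews</span>
</div>
"""


def text_of(block, css_class):
    """Return the stripped text of the first <span> with the given class."""
    return block.find("span", class_=css_class).get_text(strip=True)


def extract_businesses(html):
    """Parse the HTML and return one dict per business listing."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for block in soup.find_all("div", class_="business"):
        results.append({
            "name": text_of(block, "name"),
            "address": text_of(block, "address"),
            "rating": float(text_of(block, "rating")),
            # "212 reviews" -> 212
            "review_count": int(text_of(block, "review-count").split()[0]),
        })
    return results


businesses = extract_businesses(SAMPLE_HTML)
for b in businesses:
    print(b)
```

On a real page the selectors would need to match whatever markup the site actually serves, and the site's terms of use and robots.txt should be checked before crawling it.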

Secondly, I want to see if I can get some Twitter data. If I happened to know the IP address of a Twitter user, I could figure out his/her location and see how many tweets a local business gets. Similarly, Google Analytics might be helpful too, but I am not sure.
