Thursday, August 12, 2010

Extracting "gems" from web pages using BeautifulSoup and Python - extracting II

**** step 4 ****
extract the business address (suppose tt=results[0]).
The address itself is closed to a tag called "address", for example

<address>
3002 W 47th Ave <br/>Kansas City, MO 64103 <br/>
</address>

Since I want the address and zip code separate, I used

# address
>> address_tmp=tt.find('address')
>> address_tmp=BeautifulSoup(re.sub('(<.*?/>)',' ',str(address_tmp))).text
>> address=address_tmp.rstrip('1234567890 ')
>> zipcode=re.search(r'\d+$',address_tmp).group()

The first line is saying "find the tag with name address". Because of the
stuff in the middle of the string, I have to use regular expression to replace them with a single space. Then I change the string back to BeautifulSoup object and finally get the text between the "address" tages.

u'3002 W 47th Ave Kansas City, KS 66103'

The third line tends to trim off zipcode and any space after the state abbreviation. On the contrary, the fourth line utilizes regular expression to extract the zipcodes (which are few digits at the end of the string).

There is one thing that needs to pay attention is to use "?" to make the search non-greedy, meaning as long as the search find the pattern, it will return it. see here Google Python Class for more details.

**** step 5 ****
Oops, did I forget to get the business name first? All right, if you realize the name part is heading in a div tag with it's attributes called "itemheading", you will get the entire piece and the name part easily through

>> name_rank=tt.find('div', attrs={'class':'itemheading'}).text
>> name_rank
u"1.\n \tOklahoma Joe's BBQ & Catering"
>> names=name_rank.lstrip('1234567890. ').replace('\n \t', '')
>> names
"Oklahoma Joe's BBQ & Catering"

Very similar to the address part, the new piece added is the replace function operated on strings to remove '\n \t'.

**** step 6 ****
Next, I'd like to know the category or categories for those restaurants. Those are stored in the text of 'a' tags with class label of "category". The 2nd line below join all category labels using ',' to make them one string. And the 3rd line removes '\n' from the string.

# multiple categories so
>> cats=tt.findAll('a',attrs={'class':'category'})
>> category=','.join([x.text for x in cats])
>> category=str(category).replace('\n', '')
>> category
u'Barbeque'


**** step 7 ****
Similar idea applies to get number of reviews and rating as following

# reviews '217 reviews', have to parse the number part using regular expression
>> review_tmp=tt.find('a',attrs={'class':'reviews'}).text
>> rev_count=re.search(r'^\d+', review_tmp).group()
>> rev_count
u'217'
# rating scale
>> rating_tmp=tt.find('div',attrs={'class':'rating'})
>> rating=re.search(r'^\w+\.*\w*',rating_tmp('img')[0]['title']).group()
>> rating
u'4.5'



After all these 7 steps, I write them into a csv file through

out = open('some.csv','a') # append and write
TWriter = csv.writer(out, delimiter='|', quoting=csv.QUOTE_MINIMAL)
TWriter.writerow([ranks, names, category, address, zipcode, rev_count,rating])
out.close()

Tuesday, August 3, 2010

Extracting "gems" from web pages using BeautifulSoup and Python - extracting I

The web crawling piece I talked previously seems a little bit easier. So I will start from there. There are actually two pieces involved: first one is to crawl the website and download its related webpages, second is to extract information out of the saved html pages.

I will start from the 2nd half of work and show step by step how I work with the sample webpages from yelp.com and process them in Python using BeautifulSoup.

Ok, let's get started. First of all, give you an idea of what does yelp.com look like. Below is a small screen shot on the restaurant search result for Kansas City metro area. What I really want to extract from this page is: the business unit name, address, zip, category, number of reviews and rating.



This is a typical search result from yelp website, which includes something totally unrelated with what we need (like the sponsor ads for "The Well"), then the stuff we care, and finally stuff totally irrelevant again. Not that bad, right? But when you look at the underneath html codes, you will see this



Really messy!!! How can you clearly see the stuff you want, let along extract anything them from such a html page? Here is when you should have thought about using BeautifulSoup, which is a terrific, flexible HTML/XML parser for Python.

Now, it's time to get hands dirty with some python codes.
**** step 1 ****
inside python (under windows), load the modules that will be used

import re # regular expression module
import os # operation system module
import csv
from BeautifulSoup import BeautifulSoup

**** step 2 ****
point python the html file I saved and open it of course

os.chdir('C:\\Documents and Settings\\v\\Desktop\\playground\\yelp data')
f=open('rest3.htm','r')

**** step 3 ****
create a soup object and extract the business units

soup=BeautifulSoup(f)

results=soup.findAll('div', attrs={'class':'businessresult clearfix'})

After the "BeautifulSoup" command applied to file f, the output "soup" becomes a soup object (basically just a text file). Observing the original html pages, you will find yelp organize the search result using a div with class="businessresult clearfix". So the second line is really saying give me all the div sections which has class="businessresult clearfix". In this case, "results" is a list of 40 text stings, each one contains the information we want about a single restaurant. Next I will loop through each element in the list ("for tt in results:") and extract "gems" one by one.

Extracting "gems" from web pages using BeautifulSoup and Python - BI part

Lately, I am trying to find a way to answer this question: what's the best local restaurants or spas? So we are trying to figure out which local businesses will worth the most to go after.

Ok, that could be a supervised or unsupervised learning problem. If I can find some data that indicates brand awareness, and use some metrics as predictors, then it's a supervised problem. Well, if I cannot find the indicator kind of response, then I have to figure out another way to sort of build that index using whatever predictive metrics I can find online and for free.

The first thing came to my mind was yelp.com, which is a relatively comprehensive rating website on local businesses. Naturally what I want to do is to crawl that website and get some useful information out, like location of business (in order to calculate distance of that business to a particular user), number of ratings (one indicator of brand awareness), rating itself (brand quality), etc.

Secondly, I want to see if I can get some twitter data. If I happened to know the ip of the twitter user, I can figure out his/her location and just see how many tweets a local business can have. Similarly, google analytics might be helpful too, but I am not sure.