Thursday, July 7, 2011

A Few Things Learned on Hadoop Streaming

Over the past couple of days, I tried running some map-reduce jobs on EMR through Python streaming. The API I used is boto. It's really basic and not well documented; I was only able to find one example. One thing I learned the hard way concerns the data coming out of Hive. Surprisingly, no matter what input format you use (in terms of separators), the data coming out of Hive is always 'ctrl-A' separated. Check this.
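
To make the ctrl-A point concrete, here is a minimal sketch of how a Python streaming mapper might split Hive's output. It assumes the data is plain text using ctrl-A (the byte `\x01`) as the field separator. The names `parse_hive_line` and `run_mapper` are made up for illustration, and emitting the first field with a count of 1 is just a placeholder for whatever your job really does.

```python
import sys

# ctrl-A: the field separator in Hive's text output
HIVE_FIELD_SEP = "\x01"


def parse_hive_line(line):
    """Split one line of Hive text output into a list of fields."""
    return line.rstrip("\n").split(HIVE_FIELD_SEP)


def run_mapper(lines, out=sys.stdout):
    """Hypothetical mapper: emit 'first_field<TAB>1' for each input line."""
    for line in lines:
        fields = parse_hive_line(line)
        # Skip blank lines and rows whose first field is empty
        if fields[0]:
            out.write(fields[0] + "\t1\n")


# In an actual streaming script, finish with: run_mapper(sys.stdin)
```

Splitting on tabs or commas here would silently give you one big field per line, which is exactly the trap I fell into.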
