Monday, December 17, 2012

End of year data thinking

Recently, Big Analytics 2012 in NY hosted an interesting live panel discussion. The name was quite catchy: "Do you believe in Santa? How about Data Scientists?" Four guests were invited to the panel, including Geoffrey Guerdat, Director of the Data Engineering Group at Gilt.
Guests and audience had a heated discussion around the topic. Personally, I enjoyed almost all of Geof's comments and opinions. He precisely described what I have been observing and thinking this year. Below, I highlight a few points that really resonated with me.
  •  Team building

The Data Engineering group has 12 members, covering 3 main areas - Business Intelligence, Data Engineering and Data Science. This is the mix I'd like to see and be involved with as well. Data Engineers are extremely important, in my opinion, sometimes even more important than both BI and DS analysts. They are in charge of the plumbing (ETL) and making sure everything works. Organizations normally start their data team with BI, so BI has a longer history and more "credibility and reputation" than the fancy DS. Having a good BI sub-team ensures the company has access to vital measurements and can make smarter decisions. The DS sub-team is crucial as well. As Geof pointed out, given the amount of data and the time it takes to process it, DS analysts bridge the gap between BI and DE, and they are aware of more techniques/tools than traditional BI analysts.

Ideally, I'd like to see the mix of those 3 functions change over time. At the beginning, one might want more BI and DE people and far fewer DS people (though definitely not none; DS people need to get trained on the company's data over time). This mix will focus on sorting things out and serving other departments inside the organization. As things become more stable, one would add more DS and fewer BI people, so the team could work closely with a few teams to solve harder problems.

Geof builds his team around two Data Scientists, one strong in Statistics and one strong in Computer Science. They provide guidance and act as quarterbacks. The solution sounds very clever to me. My ideal team would include one director, who is very good at working inside an organization (aka politics, as some call it), and two tech leads (one stats and one CS). All other team members are hired around this golden triangle. However, I see many companies hire "managers" to manage data teams who have never written a single line of code. They have a hard time identifying problems and bottlenecks; they even have a hard time recognizing/accepting suggestions from the data scientists inside the group. All because they often don't know the data as well as the people who work with it 40 hours a week.

I'd like data managers to have good "listening" and "summarizing" skills, and data scientists to have natural curiosity and the ability to prove or implement their own ideas.

  • As more people become data scientists by clicking buttons inside "tools", is it good or bad?

Some companies in the tool business have actually set their goal as "let everyone be a data scientist". Please allow me to frown at such claims. It would be very dangerous if everybody were a data scientist. "Data people" are armed with more and more powerful data mining weapons, so they are capable of doing more harm. Not only do they need to understand the underlying models well enough to explain them to others, they also need to know them well enough to recognize where to reconstruct and optimize them. People need to be trained to understand the input data (meaning being familiar with the business and knowing what's available) and the output data (to identify how to act upon the insights).

  • How can someone tie a dollar amount to Data teams?

All the panelists agreed that it's hard to do so. Well, my response is "don't even go there". At both of my jobs, the company tried to tie a revenue goal/gain to the data team. Both failed. Whenever I see companies try to put a price tag on data teams, it tells me that they haven't realized the value of their data, or truly recognized that the guidance and advice the data team provides are helpful and important. They probably still think of data as an accessory, something supplementary rather than necessary. However, in my opinion, data should be treated as one of the organization's product lines, as important as all the other products. With the amount of data we have on our users, and the amount of insight we have about them, we have only just started the data journey.

  • Tools data scientists use

Geof mentioned R/sql/vi/emacs/shell/java. That seems rather primitive, but these tools are really powerful. I hate it when teams become tool-dependent, which naturally creates bottlenecks, because it's hard for others to maintain the system and make changes, particularly when the tool experts are not around.

  • What makes good data scientists?

"Moving the info around, reconstructing info in some other way, and making use of it ...", Geof summarized. This truly describes what I have been working on in the past few months: consolidating data in a way that is easy to consume and makes sense to both analysts and the entire company. I believe that without a solid foundation, no building on top of it should be called a "success". So data plumbing and pumping is really the key to everything.

One of the audience members raised an interesting point: the shortage of data scientists is just an education gap. Right now some schools are teaching statistics to elementary schoolers. It's going to be fun to teach my kindergartener "averages" and tell him that the "average American kindergartener" actually doesn't exist!

Sunday, August 26, 2012

One of the biggest challenges as a data professional...

Lately, when I hear others having heated discussions on algorithms or models, my heart feels a little bitter (and no kidding, that is the taste of jealousy). How come? I feel like I am facing one of the biggest challenges as a data professional, which is to fight against other people in order to help them and show them the truth using data.

This sounds rather odd, doesn't it? I have seen people work really hard to add a new feature to a product, and set their success metric to be an X% overall revenue lift Y days after rollout. Ignoring the fact that the feature has to be activated by clicking a 4mm x 4mm icon, after the users move their mouse over that otherwise invisible icon, the success metric does not seem to be a bad one, right?

First of all, it's a good thing that people try to set up metrics to measure the success of their project before they actually implement it. However, in my opinion, this metric still has a few aspects that need to be reconsidered and validated.
  1.  From a product integrity perspective, the UI design needs to "promote" the new feature, or at least do it no harm. Making it so invisible is very unfortunate. Other functional teams had better work together on this as well, for example by sending emails, messages, etc.
  2. Everybody wants a revenue lift. Who doesn't? But not everyone realizes that there are 99 steps before that goal can be reached. For example, users need time to discover new things, time to learn, time to use, and time to increase usage if possible. The entire process is "time"-consuming. Will those Y days be enough? If not, then setting up a metric far down the road with a tight time constraint does not seem to be a smart move; one is likely to fail the project according to this metric.
  3. An organization with a test system is going to make the feature available in a "rolling-out" fashion, instead of putting it in front of all users at the same time. If during the entire Y days only a small number of customers are in the test, it's very likely that one won't be able to see that X% lift.
  4. If the feature is new and only going to affect a specific group of users, then the baseline revenue needs to be carefully chosen, and a historical data set should be examined to get some idea of how the baseline changes over time. If the baseline has a larger variation than the X%, it is very unlikely one is going to detect the change they desire to see.
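To make point 4 concrete, here is a minimal simulation sketch (all numbers are invented for illustration): it asks how often the treated group's average daily revenue even comes out above the control's, when the lift is small relative to the baseline's day-to-day noise.

```python
import random
import statistics

random.seed(42)

def detectable(lift_pct, baseline_sd_pct, days=14, trials=2000):
    """Fraction of trials where a lift_pct revenue lift shows up as a
    higher Y-day average, despite baseline_sd_pct daily noise."""
    hits = 0
    for _ in range(trials):
        # daily revenue indexed to 100: noise only vs. noise plus lift
        control = [random.gauss(100, baseline_sd_pct) for _ in range(days)]
        treated = [random.gauss(100 + lift_pct, baseline_sd_pct) for _ in range(days)]
        if statistics.mean(treated) > statistics.mean(control):
            hits += 1
    return hits / trials

# a 2% lift buried in 10% daily noise vs. the same lift with 1% noise
print(detectable(lift_pct=2, baseline_sd_pct=10))
print(detectable(lift_pct=2, baseline_sd_pct=1))
```

With the noisier baseline, the lift is far from a sure thing to observe, which is exactly the point: look at the historical variation before committing to the X% target.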
As more people/organizations realize the value of their data, some of them need to be "educated" on how to make sense of it. That's one of the roles data scientists play. Just as data professionals come in all sorts of sizes and shapes (I mean backgrounds and training :P), their jobs vary in the percentage mix among the roles of "interpreter, teacher, visualizer, programmer and data cruncher".

Thursday, April 26, 2012

Visualizing too much data

This week, I got a chance to study a dataset of 13 million rows. Luckily, the file has no strings in it, so my (local) R reads it just fine, though it takes a while. The original idea was to find a model/systematic pattern that describes the relationship between two fields in the dataset. Most of the time, it's a good idea to take a peek at things before rolling up your sleeves. So I ended up facing the problem of visualizing too much data.

Of course, brute-force plotting of every point won't work. The points will step on top of each other, the density of the data points will get lost in the sea of points, and the command takes forever to run. A better solution is to use hexbin plots, which can handle millions of data points. The data points are first assigned to hexagons that cover the plotting area; then head counts are done for each cell; at the end, the hexagons are plotted on a color ramp. R has a hexbin package to draw hexbin plots plus a few more interesting functions, and R's ggplot2 package also has a stat_binhex function.
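For anyone outside R, the same technique is available in Python's matplotlib via hexbin. Here is a minimal sketch on synthetic data (the random x/y below merely stand in for the two fields of that 13-million-row dataset):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # draw off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 200_000  # stand-in for the real 13 million rows
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.5, size=n)

# points are binned into hexagons; color encodes the per-cell count
fig, ax = plt.subplots()
hb = ax.hexbin(x, y, gridsize=80, bins="log", cmap="viridis")
fig.colorbar(hb, ax=ax, label="log10(count)")
fig.savefig("hexbin.png")
```

The bins="log" option puts the counts on a log color ramp, which helps when a few dense cells would otherwise wash out everything else.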

[Figure: hexbinning those 13 million data points]
Quite surprisingly, I did not find a lot of literature online regarding this binning technique. But a few things worth noting are:

  • Enrico Bertini has a post regarding things that can be done to deal with visualizing a lot of data.
  • Zachary Forest Johnson has a post devoted to hexbins exclusively, which is very helpful.
  • Last but not least, the hexbin package documentation talks about why hexagons rather than squares, and the algorithm used to generate those hexagons.

Tuesday, April 17, 2012

Some Python

The other day, I was trying to flag a recommendation that had been posted (to the Redis server for quick lookup) in a dictionary of recommendations. Then in the next round, I could do some weighted random sampling among the unposted ones, without actually redoing the entire recommendation calculation. Anyway, it's a little bit tricky to neatly flag a 'used' recommendation in a Python dictionary. Fortunately, someone has already provided a solution, and I'd like to borrow it here for quick reference.
>>> x = {'a':1, 'b': 2}
>>> y = {'b':10, 'c': 11}
>>> z = dict(x.items() + y.items())
>>> z
{'a': 1, 'c': 11, 'b': 10}

b's value is properly overwritten by the value in the second dictionary. In Python 3, where items() returns a view rather than a list, this is suggested instead:
>>> z = dict(list(x.items()) + list(y.items()))
>>> z
{'a': 1, 'c': 11, 'b': 10}
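A side note of my own, beyond the borrowed solution: newer Python (3.5 and up) can also merge dictionaries with unpacking, with the same later-keys-win behavior:

```python
x = {'a': 1, 'b': 2}
y = {'b': 10, 'c': 11}

# later entries win, so y's value for 'b' overwrites x's
z = {**x, **y}
print(z)  # {'a': 1, 'b': 10, 'c': 11}
```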

And I saw a cool post about applying Python's map, reduce, and filter functions to a dictionary. Once again, something could be cleanly and flexibly accomplished. Python is such a beautiful language.
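For instance, on a toy recommendation dictionary of my own (not the post's example), filter, map and reduce chain together quite naturally:

```python
from functools import reduce

recs = {'item1': {'score': 0.9, 'posted': True},
        'item2': {'score': 0.5, 'posted': False},
        'item3': {'score': 0.7, 'posted': False}}

# filter: keep only the recommendations not yet posted
unposted = dict(filter(lambda kv: not kv[1]['posted'], recs.items()))

# map: pull out their scores
scores = dict(map(lambda kv: (kv[0], kv[1]['score']), unposted.items()))

# reduce: total score mass, e.g. for weighted sampling later
total = reduce(lambda acc, s: acc + s, scores.values(), 0.0)
print(scores)  # {'item2': 0.5, 'item3': 0.7}
```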

Wednesday, February 29, 2012

Day 2 @ Strata 2012

The day went by very fast, but there were not a lot of interesting topics. The keynote talks by Ben Goldacre and Avinash Kaushik were all right. The Netflix one was interesting: the speaker talked about the (quite a lot of) other things Netflix does beyond predicting ratings. The 'science of data visualization' session was informative too.

One interesting observation I made today was that during the break hour, the men's room had lines while the ladies' did not. That's totally different from other places I have been to, for example, the shopping mall. :P

Tuesday, February 28, 2012

Go Strata! Go DATA!

Today I finally walked into the Strata Conference for Data (and thank God that I live in California now). I was quite excited about this, because there is a ton going on at this conference, and people won't think you are a nerd when you express your passion for ... DATA. Well, in my mind, the entire universe is a big dynamic information system. And what's floating inside the system? Of course, the data! Knowing more about data essentially helps people understand the system, and the universe, better. It's so important that it will become a bigger and bigger part of your life. And maybe someday people will consider data as vital as water and air :)

Anyway, today was the training day of Strata. I chose the 'Deep Data' track. The speakers were all fantastic! It was a great opportunity to see what others actually do with data and how they do it, instead of the tutorial sessions where people just talk about the data. The talks I enjoyed the most were Claudia Perlich's 'From knowing what to understanding why' (she really held nothing back on practical data mining tips. And I like the fact that she baked a lot of statistics knowledge into problem solving, which in my mind is missing in some data scientists. And I really liked her assertive attitude when she said 'I will even look at the data, if somebody else pulled it'.), Ben Gimpert's 'The importance of importance: introduction to feature selection' (well, I always like these types of high-level summary talks.), and Matt Biddulph's 'Social network analysis isn't just for people' (the example that impressed me most: he used the fact that developers often listen to music while they write their code, so there is a connection between music and the programming language. Things that seem totally unrelated got thrown into the wok and cooked together. Besides, he had some cool visualizations using Gephi.)

At the end of the day, there was an hour-long debate between leading data scientists in the field (most of them came or come from Linkedin). The topic was 'Does domain expertise matter more than machine learning expertise?', meaning when you are trying to assemble a team and make a hire, do you hire the machine learning guy or the domain expert? I personally voted against the statement; I think machine learning expertise matters more for the first hire. Think about it this way: when you have such an opening, you, the company, should at least have an idea of what you are trying to solve (unless you are starting a machine learning consulting company, in which case the first hire had better be a machine learning person). So by that point, you already have some business domain experts inside your company. Bringing in data miners will then help you solve the problems a domain expert couldn't. For example, your in-house domain experts could complain that the data is not very accessible, or that there are too many predictors and they don't know how, and at which ones, to look. A machine learning person could hopefully provide advice on data storage, data processing, and modeling to help you sort the data into some workable format, and systematically tell you that you are spending too much time on features that do not make any difference while some other features deserve more of your attention. To me, it's always an interactive feedback loop between your data person and your domain experts. And it's the way of thinking about business problems systematically, in an approachable and organized fashion, that matters the most, not necessarily how many models or techniques a machine learning candidate knows.

Overall, Strata is a well-organized conference that I want to attend every year!

Monday, February 13, 2012

Funnel plot, bar plot and R

I just finished my Omniture SiteCatalyst training in McLean, VA a few days ago. It was OK (somewhat boring); we only went through how to click buttons inside SiteCatalyst to generate reports, not how to implement it and make it track the information we want to track.

I got two impressions out of the class: one is that Omniture is a great and powerful web analytics tool; the other is that funnel plots can be misleading from a data visualization perspective. For example, regardless of why the second event 'Reg Form Viewed' has a higher frequency than the first event 'Web Landing Viewed', the funnel bar for the second event is still narrower than the one for the first event, just because it's designed to be the second stage in the funnel report.

This is a typical example of visualization components not matching up with the numbers. There are other types of funnel plots that are misleading as well, as pointed out by Jon Peltier in his blog article. I totally agree with him on using a simple barplot as an alternative to the funnel plot. And I also like his idea of adding another panel for visualizing some small yet important metric, like purchases as shown in his example.

Then I turned to R to see if I could do some quick poking around on how to turn the misleading funnel I have here into something meaningful and hopefully beautiful. Since I always feel like I don't have a good grasp of how to do barplots in R, this was going to be a good exercise for me.

As always, figuring out the three-letter parameters of the base package's plot functions is painful. And I had to set up appropriate margin sizes so that my category names wouldn't be cut off.

Then I drew the same plot using ggplot2. All the command names make sense, and the plot is built up layer by layer. However, I did not manage to get the x-axis to the top of the plot, which would involve creating a new customized geom.

There are some nice R barchart tips on the web, for example on learning_r, stackoverflow, and the ggplot site. Anyway, this is what I used:

##### barchart

dd = data.frame(cbind(234, 334, 82, 208, 68))
colnames(dd) = c('web_landing_viewed', 'reg_form_viewed', 'registration_complete', 'download_viewed', 'download_clicked')
dd_pct = round(unlist(c(1, dd[,2:5]/dd[,1:4]))*100, digits=0)

# plain horizontal barchart
# widen the left margin so the category names aren't cut off
par(mar=c(2, 12, 3, 2))
# las: direction of tick labels for the x/y axes, range 0-3, so 4 combinations
mp <- barplot(as.matrix(rev(dd)), horiz=TRUE, col='gray70', las=1, xaxt='n')
tot <- paste(rev(dd_pct), '%')
# add the percentage numbers next to the bars
text(rev(dd)+17, mp, format(tot), xpd=TRUE, col='blue', cex=.65)
# axis on top (side=3), 'at' gives tick locations, las: parallel or perpendicular to the axis
axis(side=3, at=seq(from=0, to=30, by=5)*10, las=0)

# with ggplot2
dd2=data.frame(metric=c('web_landing_viewed', 'reg_form_viewed', 'registration_complete', 'download_viewed', 'download_clicked'), value=c(234, 334, 82, 208, 68))

ggplot(dd2, aes(metric, value)) +
  geom_bar(stat='identity', fill=I('grey50')) +
  coord_flip() + ylab('') + xlab('') +
  geom_errorbar(aes(ymin = value+10, ymax = value+10), size = 1) +
  geom_text(aes(y = value+20, label = paste(dd_pct, '%', sep=' ')), vjust = 0.5, size = 3.5)