The virtual goods market is hot right now. Experts predict total sales will reach $2.5 billion by 2013.

And there is plenty of information out there about virtual goods and what the market will look like going forward. In particular, the post by Maria Korolov, a few posts by Avril Korman, and the "Inside Virtual Goods" report by Justin Smith from Inside Network helped me better understand the market's mechanics and its future.

I also studied IMVU, a site that lets you register an account and start socializing with others in 3D within a minute. I found their business model (or, in general, this type of company's) very self-sufficient. Users generate content (virtual goods) and sell it to other users, and the company collects the money. The company's job is then to retain existing users and reach even more new ones. And that's not an easy job!

Based on the limited information and experience I have so far, over the next few years what I would like to see more of is: (1) new ways to partner with real-life brand owners (by which I mean retailers and manufacturers); (2) capturing the non-English-speaking market; (3) offering something for users with different tastes and expectations.

Could brand-name companies produce virtual goods that are both suitable for selling on these sites and related to the products they sell in real life, and sell them at a lower price than user-created goods? That way they could use social sites as a lab for product development and nurture potential customers at the same time. Users would probably feel as comfortable with brand names in the virtual world as they do in everyday life.

I found Avril's post on fashion very interesting. Could companies be more aggressive in this direction, creating more products just to improve the user experience in certain categories? IMVU already has a daily outfit contest; could there be a fashion show, with users deciding what they want to see? And could they partner with young designers, on a revenue-share basis, to raise awareness of their labels and drive more sales? By marketing more deeply in certain categories, users will get more involved and more engaged. It's a win-win for both sides.

And how can anybody forget the other side of the world, the more populous continents and the more conservative groups of people? Conservative here only means their culture teaches them to value the group more than themselves. People like that may have an even stronger wish to express themselves, and a virtual world seems like an ideal outlet for them: nobody knows who you are, and you can do whatever you want without worrying about the consequences. So I would like to see companies like IMVU acquiring international users as fast as they can. That user base will be an asset of unpredictable value.

Lastly, say I have a user who has no interest in outfits or digital sex, but who really wants to climb up a cliff and jump off. That does not mean the user wants to kill himself. It simply means he wants a different type of experience. Will that type of user find what they want in the virtual world too?

## Friday, January 28, 2011

## Tuesday, January 18, 2011

### Visual That

Yesterday I opened my BusinessWeek magazine and saw a picture in an article about CES in Las Vegas this year. The picture compares the size and weight of Panasonic's new garage-size 3D TV to things people are very familiar with: Kobe Bryant's height, the length of a Mini Cooper, the weight of a cow. Readers of the article get a rough idea of how big the TV is.

Precision of measurement is not very important in this case; familiarity and approximation are the key. This is an idea that has sat in my mind for a month now. How desperately I want to realize it by building my own website! I actually already have a name for it: "visualthat.com". Isn't that a cool name? Maybe at this moment I don't have the ability to build it. At the very minimum, I can record my thoughts here. :)

What I have been thinking of is a website that lets users input a metric they are interested in (for example height, weight, volume, currency, etc.) and an amount, and spits out a visualization comparing that amount against something the average person is familiar with. For example, I am interested in 1 yard. The website spits out a picture of an SUV and visualizes how much a yard is compared to its height, either by cropping off part of the SUV (supposing the SUV is taller than 1 yard) or with a ruler next to the picture indicating the approximate position of 1 yard. Another example: I want to know how much 1 dollar is worth in Chinese currency. I input 1 dollar, and it shows me a burger from McDonald's dollar menu, and perhaps three quarters of the same burger over a map of China, or 2 apples. To make it more fun, users could choose the baseline they'd like to compare against. And an app for mobile devices could be created too.

The closest thing I can find on the internet is Wolfram Alpha. However, I don't quite like that it (1) has no pictures and (2) crowds too many query results onto one page. The Scale of the Universe from Primax Studio is also cool, but it doesn't (1) allow user input or (2) let you choose a baseline.

Anyway, this is going to be a very fun project: measurement conversion with user-input queries and visualized results. And it's going to be a challenging one! Hmmm, maybe I should start learning some HTML and JavaScript.

## Friday, January 14, 2011

### Market Basket Analysis using R

R has a package for association rule mining tasks called "arules". It implements the Apriori and Eclat algorithms by calling C code in the back end.

I found a small transaction dataset (2k rows) online with two columns, 'order_id' and 'item'. The first few rows look like this:

```
10248 item042
10248 item072
10249 item014
10249 item051
10250 item041
10250 item051
...
```

Then I ran the following R code to analyze the data:

```r
## call the 'arules' library
## (if you have not installed it, run install.packages('arules') first)
library(arules)

## import the data
# read.transactions is a very nice function for reading in this type
# of transaction data quickly. I have seen others read the data with
# read.table and then transform it, which is too much manipulation.
# The "format" argument tells R whether each line has just one item
# ('single') or multiple items ('basket') separated by "sep".
# The "cols" argument is a numeric vector of length 2 giving the
# numbers of the columns with the transaction and item ids for
# 'single' format; for 'basket' format it is a numeric scalar giving
# the number of the column with the transaction ids.
trans_data <- read.transactions(file='trans_data.csv', rm.duplicates=FALSE,
                                format='single', sep=',', cols=c(1,2))

# take a peek at what's in each basket
inspect(trans_data)

## after reading the data in, there are some basic
## functions you can use on it
class(trans_data)
summary(trans_data)

## for each item, you can count its frequency of appearance (absolute)
## or its ratio relative to the total number of transactions
## (relative, i.e., item_freq/total_num_of_orders)
itemFrequency(trans_data, type='absolute')
itemFrequency(trans_data, type='relative')

## there are also some visualization functions you can use
## to "see" the data
image(trans_data)
itemFrequencyPlot(trans_data)

## apply the Apriori algorithm to mine association rules.
# There are a few parameters you can tweak; see the package manual
# on 'APparameter-class'. In this example, I specifically tell R I
# want 1-way rules with minimum support .002 and minimum confidence
# .2 by setting up the parameters in the following way:
rules <- apriori(trans_data, parameter = list(supp = 0.002, conf = 0.2,
                                              target = "rules", maxlen = 2, minlen = 1))
```

```
parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target ext
        0.2    0.1    1 none FALSE            TRUE   0.002      1      2  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Warning in apriori(trans_data, parameter = list(supp = 0.002, conf = 0.2, :
  You chose a very low absolute support count of 1. You might run out of memory!
  Increase minimum support.

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[77 item(s), 830 transaction(s)] done [0.01s].
sorting and recoding items ... [77 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [30 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
```

```r
## check the rules
inspect(rules[30])
#   lhs         rhs           support      confidence  lift
# 1 {item021} => {item061}    0.009638554  0.2051282   7.094017

## extract the quality measurements or other components you want
str(rules)
attributes(rules)$quality
```

The same rule showed up in the SQL-based mining too. The support, confidence, and lift metrics match up between R and SQL:

```
-[ RECORD 1 ]-----+-------------------------
rules             | item021 => item061
lhs               | item021
rhs               | item061
num_lhs_rhs       | 8.0
num_basket        | 830.0
num_lhs           | 39.0
num_rhs           | 24.0
num_no_lhs        | 791.0
num_no_rhs        | 806.0
num_lhs_no_rhs    | 31.0
num_no_lhs_rhs    | 16.0
num_no_lhs_no_rhs | 775.0
support           | 0.0096385542168
confidence        | 0.2051282051282
lift              | 7.0940170940170
chi_square        | 45.253247622206
laplace           | 0.2195121951219
conviction        | 1.2216867469879
added_value       | 0.1762125424776
certainty_factor  | 0.1814595660749
j_measure         | 0.0114057917394
gini_index        | 0.0030619052969
jaccard           | 0.1454545454545
shapiro           | 0.0082798664537
cosine            | 0.2614881801842
correlation       | 0.2334994327337
odds_ratio        | 12.500000000000
```
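As a sanity check, the support, confidence, and lift for this rule can be recomputed by hand from the raw counts in the record above (8 co-occurrences, 39 baskets with item021, 24 baskets with item061, 830 baskets total). A minimal base-R sketch:

```r
# raw counts for the rule {item021} => {item061}, taken from the output above
n_lhs_rhs <- 8    # baskets containing both items
n_lhs     <- 39   # baskets containing item021
n_rhs     <- 24   # baskets containing item061
n_basket  <- 830  # total number of baskets

support    <- n_lhs_rhs / n_basket              # P(lhs & rhs)
confidence <- n_lhs_rhs / n_lhs                 # P(rhs | lhs)
lift       <- confidence / (n_rhs / n_basket)   # confidence / P(rhs)

round(c(support, confidence, lift), 6)
# matches the arules output: 0.009639 0.205128 7.094017
```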


## Wednesday, January 12, 2011

### Market Basket Analysis using SQL

Essentially, association rule mining is all about counting, so it fits the database environment very well, especially big, scalable databases. With some GROUP BYs and JOINs, the job can be accomplished easily in SQL.

Here I use the Apriori algorithm to generate all 0-way (i.e., {} => item B) and 1-way (i.e., item A => item B) rules. The code first generates the frequent itemsets that contain a single item and have a support count bigger than minsup. Then it generates all 1-way rules whose 2-item itemsets have support bigger than minsup and confidence bigger than some minconf. Rules beyond 1-way are more complicated and require passing over the data iteratively until no more frequent itemsets are generated. The following code is just meant to show how market basket analysis works and what kinds of metrics are available to evaluate the rules.

```sql
/* Assuming your transaction data consists of (order_id, item) pairs like this
   order 1, item 1
   order 1, item 2
   order 1, item 32
   order 2, item 3
   order 2, item 2
   order 2, item 5
   order 3, item 3
   ...
   and it's already stored in the database in a table named trans_data.
*/
```

```sql
-- Generate 0-way frequent itemsets using minsup=5
CREATE TABLE arule_0w AS
SELECT item, SUM(1) AS cnt
FROM trans_data
GROUP BY item
HAVING SUM(1) > 5
DISTRIBUTED BY (item)
;

-- Generate 1-way frequent itemsets
CREATE TABLE arule_1w AS
SELECT lhs, rhs, COUNT(1) AS num_lhs_rhs
FROM (
    SELECT item_lhs.order_id, lhs, rhs
    FROM (SELECT a.order_id, a.item AS lhs
          FROM trans_data a
          JOIN arule_0w USING(item)
          GROUP BY order_id, item) item_lhs
    JOIN (SELECT b.order_id, b.item AS rhs
          FROM trans_data b
          JOIN arule_0w USING(item)
          GROUP BY order_id, item) item_rhs
    ON (item_lhs.order_id = item_rhs.order_id
        AND item_lhs.lhs <> item_rhs.rhs)
) rules
GROUP BY lhs, rhs
HAVING COUNT(1) > 5
DISTRIBUTED BY (lhs, rhs)
;
```

You might have noticed that trans_data is joined with the 0-way itemsets table to filter out items that are not frequent at the 0-way level. This is exactly how the Apriori algorithm works: at each iteration, it filters out items that were not frequent in the previous run. As we search deeper in the data the candidate sets become smaller and smaller, because if an itemset is not frequent at one iteration, none of its supersets can be frequent at later iterations.
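This anti-monotone property (an itemset's support can never exceed the support of any of its subsets) is easy to check directly on a toy basket list. A small base-R sketch with made-up baskets:

```r
# toy baskets (hypothetical data, one character vector per order)
baskets <- list(c("diaper", "beer", "wipes"),
                c("diaper", "beer"),
                c("diaper", "wipes"),
                c("beer"))

# support = fraction of baskets containing every item in the itemset
support <- function(itemset)
  mean(sapply(baskets, function(b) all(itemset %in% b)))

support("diaper")                      # 0.75
support(c("diaper", "beer"))           # 0.50
support(c("diaper", "beer", "wipes"))  # 0.25

# adding items to an itemset can only lower its support, so an
# infrequent itemset can be pruned along with all of its supersets
stopifnot(support(c("diaper", "beer")) <= support("diaper"))
```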

All right, the queries above show how to generate 0-way and 1-way itemsets. The next chunk of queries shows how to generate rules from the itemsets, along with the measurements used to evaluate the rules. Besides support and confidence, there are other measurements too: cosine, Kappa, Jaccard, Laplace, Gini index, and J-measure, to name a few (for more details, see sample chapter 6 on the "Introduction to Data Mining" book website). Most of the interest measures are calculated from the following 2x2 contingency table. f11 is the number of transactions where A and B appear together; f10 is the number of transactions that contain A but not B. The column sum f+1 is the support count for B, and the row sum f1+ is the support count for A. N is the total number of transactions.

```
       |  B  | !B  |
  -----|-----|-----|-----
    A  | f11 | f10 | f1+
   !A  | f01 | f00 | f0+
  -----|-----|-----|-----
       | f+1 | f+0 |  N
```
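For instance, plugging the counts for the {item021} => {item061} rule from the R post (f11=8, f10=31, f01=16, f00=775, N=830) into these formulas reproduces the reported values. A base-R check:

```r
# contingency counts for {item021} => {item061} (from the R post's output)
f11 <- 8; f10 <- 31; f01 <- 16; f00 <- 775
N   <- f11 + f10 + f01 + f00   # 830 transactions
f1p <- f11 + f10               # support count of A (lhs): 39
fp1 <- f11 + f01               # support count of B (rhs): 24

# expected counts under independence, then Pearson's chi-square
obs <- c(f11, f10, f01, f00)
exp <- c(f1p * fp1, f1p * (N - fp1), (N - f1p) * fp1, (N - f1p) * (N - fp1)) / N
chi_square <- sum((obs - exp)^2 / exp)
round(chi_square, 6)   # 45.253248, matching the query output

# a couple of the other interest measures from the results table
jaccard    <- f11 / (f1p + fp1 - f11)      # 8/55  = 0.1454545...
odds_ratio <- (f11 * f00) / (f10 * f01)    # 12.5
```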

Next, I show how to generate the rules and calculate some of those interest measures in a query:

```sql
CREATE TABLE arule_1w_with_measure AS
SELECT a.*
     , (power(exp_lhs_rhs - num_lhs_rhs, 2)/exp_lhs_rhs
        + power(exp_lhs_no_rhs - num_lhs_no_rhs, 2)/exp_lhs_no_rhs
        + power(exp_no_lhs_rhs - num_no_lhs_rhs, 2)/exp_no_lhs_rhs
        + power(exp_no_lhs_no_rhs - num_no_lhs_no_rhs, 2)/exp_no_lhs_no_rhs) AS chi_square
     , (num_lhs_rhs + 1)/(num_lhs + 2) AS laplace
     , (num_lhs*num_no_rhs)/(num_basket*num_lhs_no_rhs) AS conviction
     , (num_lhs_rhs/num_lhs - num_rhs/num_basket) AS added_value
     , (num_lhs_rhs/num_lhs - num_rhs/num_basket)/(1 - num_rhs/num_basket) AS certainty_factor
     , (num_lhs_rhs/num_basket * ln((num_basket*num_lhs_rhs)/(num_lhs*num_rhs))
        + num_lhs_no_rhs/num_basket * ln((num_basket*num_lhs_no_rhs)/(num_lhs*num_no_rhs))) AS j_measure
     , num_lhs/num_basket * (power(num_lhs_rhs, 2) + power(num_lhs_no_rhs, 2))/power(num_lhs, 2)
       - power(num_rhs/num_basket, 2)
       + num_no_lhs/num_basket * (power(num_no_lhs_rhs, 2) + power(num_no_lhs_no_rhs, 2))/power(num_no_lhs, 2)
       - power(num_no_rhs/num_basket, 2) AS gini_index
     , num_lhs_rhs/(num_lhs + num_rhs - num_lhs_rhs) AS jaccard
     , (num_lhs_rhs/num_basket - num_rhs*num_lhs/power(num_basket, 2)) AS shapiro
     , num_lhs_rhs/sqrt(num_lhs*num_rhs) AS cosine
     , (num_lhs_rhs*num_basket - num_lhs*num_rhs)/sqrt(num_lhs*num_rhs*num_no_lhs*num_no_rhs) AS correlation
     , num_lhs_rhs*num_no_lhs_no_rhs/(num_lhs_no_rhs*num_no_lhs_rhs) AS odds_ratio
FROM (SELECT lhs_rhs.lhs || ' => ' || lhs_rhs.rhs AS rules
           , lhs_rhs.lhs
           , lhs_rhs.rhs
           , num_lhs_rhs*1.0 AS num_lhs_rhs
           , num_basket*1.0 AS num_basket
           , num_lhs*1.0 AS num_lhs
           , num_rhs*1.0 AS num_rhs
           , (num_basket - num_lhs)*1.0 AS num_no_lhs
           , (num_basket - num_rhs)*1.0 AS num_no_rhs
           , (num_lhs - num_lhs_rhs)*1.0 AS num_lhs_no_rhs
           , (num_rhs - num_lhs_rhs)*1.0 AS num_no_lhs_rhs
           , (num_basket - num_lhs - num_rhs + num_lhs_rhs)*1.0 AS num_no_lhs_no_rhs
           , num_lhs * num_rhs * 1.0 / num_basket AS exp_lhs_rhs
           , num_lhs * (1.0 * num_basket - num_rhs) * 1.0 / num_basket AS exp_lhs_no_rhs
           , (1.0 * num_basket - num_lhs) * num_rhs / num_basket AS exp_no_lhs_rhs
           , (1.0 * num_basket - num_lhs) * (1.0 * num_basket - num_rhs) / num_basket AS exp_no_lhs_no_rhs
           , num_lhs_rhs * 1.0 / num_basket AS support
           , num_lhs_rhs * 1.0 / num_lhs AS confidence
           , num_lhs_rhs * num_basket * 1.0 / (num_lhs * num_rhs) AS lift
      FROM arule_1w AS lhs_rhs
      JOIN (SELECT item AS lhs, COUNT(1) AS num_lhs
            FROM trans_data
            GROUP BY lhs) sum_lhs
        ON lhs_rhs.lhs = sum_lhs.lhs
      JOIN (SELECT item AS rhs, COUNT(1) AS num_rhs
            FROM trans_data
            GROUP BY rhs) sum_rhs
        ON lhs_rhs.rhs = sum_rhs.rhs
      CROSS JOIN (SELECT COUNT(1) AS num_basket
                  FROM (SELECT order_id FROM trans_data GROUP BY order_id) foo) total
     ) a
WHERE lift > 1          -- only the good (positively related) rules
  AND confidence > .0001
DISTRIBUTED BY (lhs, rhs)
;
```

Notice that the WHERE clause at the end ensures we only generate rules with confidence bigger than a threshold of .0001 and with lift > 1, meaning positively related rules. One trick I used is to multiply the integer counts by 1.0, so that the later measure calculations are done in floating point and SQL doesn't complain about integer division or overflow.


## Tuesday, January 11, 2011

### Association Rule Mining (Market Basket Analysis)

Association rule mining is a well-researched, classical data mining technique. It's designed to find hidden patterns or relationships in large data sets, a pattern being the frequent co-appearance of different items given the existence of some item. It's particularly well suited for analyzing commercial transaction data. In that scenario (where it's usually called "market basket analysis"), the presence of an item can be expressed as a binary variable taking values in {0,1}. For example, the transaction data at the checkout counters (each customer holding a basket of one or more items) shows which customers who buy A also purchase B. By examining huge amounts of transaction data, interesting relationships/rules can be revealed, like people who buy diapers also buy wipes (a trivial rule, not very interesting), or people who buy diapers also buy beer (a lot more interesting than the previous rule). Rules like this can be very helpful for cross-marketing, catalog design, shelf organization, product recommendations on the place-order page, etc. Of course, besides this kind of consumer preference analysis, association rule mining works for other types of problems too, e.g., human resource management (employees who said positive things about initiative A also frequently complain about issue B) or the history of language.

Because association rule mining does not require users to provide much prior information, and because it deals with single values, words, or items, it's well suited for data or text mining in large databases. In the next posts, I will demonstrate how to do market basket analysis using SQL and R.

The rules are denoted A => B, meaning A implies B. The most commonly used measurements in association rule mining are support and confidence, where support = count(A & B together) / count(total transactions) (the joint probability of A and B), confidence = count(A & B together) / count(A) (the conditional probability of B given A), and count(A) counts the number of transactions for event A. Many algorithms use support and confidence to prune the uninteresting, less confident rules. For example, the well-known Apriori algorithm rapidly processes data based on preferred threshold values.
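These two definitions are easy to compute directly. A minimal base-R sketch over four hypothetical transactions:

```r
# four hypothetical transactions, one character vector per basket
baskets <- list(c("diaper", "beer"),
                c("diaper", "beer", "wipes"),
                c("diaper", "wipes"),
                c("beer", "chips"))

# number of baskets containing all of `items`
count <- function(items)
  sum(sapply(baskets, function(b) all(items %in% b)))

# for the rule {diaper} => {beer}:
support    <- count(c("diaper", "beer")) / length(baskets)  # 2/4 = 0.5
confidence <- count(c("diaper", "beer")) / count("diaper")  # 2/3
```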

Professor Kumar has posted some sample chapters from his book "Introduction to Data Mining" here, including the chapter on association rule mining.

