algorithm - Records correlation/clustering using Hadoop


Our Hadoop cluster ingests several terabytes of web logs daily. Each log record contains information such as the user's IP address, cookie ID, and so on. However, different IP addresses and cookie IDs can correspond to one physical user (home/work computers, etc.). We have designed a function that calculates a matching score for a pair of records; a higher score means a higher probability that both records correspond to one physical user.

The goal is to split the records into groups that presumably correspond to one physical user, using the scoring function, and to mark the records in each group with a unique group ID (i.e. a physical user ID). What is the best way to implement this logic using Hadoop/Mahout?

For a start, I'm going to assume that you know how to chain MapReduce jobs. If not, see http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining for details.

Second, I'll assume that you have a distributed key/value store available to you, such as Cassandra.

Third, your scoring function does not make sense to me. I wouldn't think that "one record here, one record there" would let you know that it is the same person. I would believe "these records here compared with those records there = an estimate of whether or not it is the same person". So I'll assume, contrary to your description, that this is how your scoring function works.

Now, what is a theoretically nice way to solve your problem?

  1. Process the logs, and put into your store a map from unique machine identifier (IP address + cookie) + date range to the logged events.

  2. Extract the list of unique machine identifiers. Store that as well.

  3. Perform a MapReduce whose map takes each machine identifier, grabs the list of all the others, and emits all pairs of different machine identifiers where the first is less than the second. The reduce queries the logged events of each, computes the score, and if the score is over your threshold, emits a data point saying that the larger machine identifier maps to the smaller one.

  4. The output of 3 is piped to a MapReduce whose map does nothing, and whose reduce finds, for each machine identifier, the smallest machine identifier it maps to.

  5. The output of 4 is piped to a MapReduce whose map takes each pair (machine identifier, canonical machine identifier), grabs the events from the store in #1, and remaps them to (canonical machine identifier, rest of event). The reduce stores the canonical machine identifier (aka group ID) with the associated events. (By date if you want.)
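To make steps 3 to 5 concrete, here is a minimal in-memory sketch in Python. It is not Hadoop code, just the same data flow on a toy scale. The `EVENTS` dictionary stands in for the key/value store, and the Jaccard-style `score` function and the 0.4 threshold are assumptions for illustration, not your real scoring function.

```python
from itertools import combinations

# Hypothetical stand-in for the key/value store from step 1:
# machine identifier -> logged events (here just visited paths).
EVENTS = {
    "ip1+cookieA": ["/home", "/mail", "/news"],
    "ip2+cookieB": ["/home", "/mail", "/sports"],
    "ip3+cookieC": ["/shop", "/cart"],
}

def score(events_a, events_b):
    """Toy scoring function (an assumption): Jaccard overlap of two event sets."""
    a, b = set(events_a), set(events_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def candidate_pairs(identifiers):
    """Step 3, map side: every pair of identifiers with first < second."""
    return combinations(sorted(identifiers), 2)

def matched_pairs(pairs, threshold):
    """Step 3, reduce side: emit (larger, smaller) when the score is over the threshold."""
    for small, large in pairs:
        if score(EVENTS[small], EVENTS[large]) > threshold:
            yield large, small

def canonical_ids(identifiers, matches):
    """Step 4: for each identifier, the smallest identifier it maps to."""
    canonical = {ident: ident for ident in identifiers}
    for large, small in matches:
        canonical[large] = min(canonical[large], small)
    return canonical

def regroup(canonical):
    """Step 5: re-key the stored events by canonical identifier (group ID)."""
    groups = {}
    for ident, group_id in canonical.items():
        groups.setdefault(group_id, []).extend(EVENTS[ident])
    return groups

canonical = canonical_ids(EVENTS, matched_pairs(candidate_pairs(EVENTS), 0.4))
groups = regroup(canonical)
```

Note that a single pass like `canonical_ids` only looks at direct matches. If C matches B and B matches A, but C does not match A directly, C ends up pointing at B rather than A.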

OK, that's a nice theory. What can go wrong?

The problem is that the pairs of identifiers are too darned many. The list could wind up being on the order of 10^18, and for each of those you're pulling logs. Unless you have phenomenal hardware, you're going to run out of processing power to calculate that. Therefore you need to find heuristics to reduce it.
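For a sense of scale: n identifiers give n(n-1)/2 unordered pairs, so a pair count on the order of 10^18 corresponds to roughly 1.4 billion machine identifiers. A small Python sketch of that arithmetic (the function names are just illustrative):

```python
import math

def pair_count(n):
    """Number of unordered pairs among n identifiers: n * (n - 1) / 2."""
    return math.comb(n, 2)

def identifiers_for_pairs(pairs):
    """Smallest n whose pair count reaches the given number of pairs."""
    # Start from the integer solution of n * (n - 1) / 2 = pairs, then adjust upward.
    n = (1 + math.isqrt(1 + 8 * pairs)) // 2
    while math.comb(n, 2) < pairs:
        n += 1
    return n

print(pair_count(1_000_000))            # pairs among one million identifiers
print(identifiers_for_pairs(10 ** 18))  # identifiers needed to reach 10^18 pairs
```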

The first, and simplest, is that anything already identified as "these two identifiers map to the same one" should be stored and reused up front wherever possible.

Second, after a large initial job, you can get away with asking only, "Of the newly created identifiers, what do they map to?" Those are the candidates for addition to the canonical mapping, which you don't want to be recreating.
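A rough Python sketch of these first two heuristics together, assuming the canonical mapping from an earlier run is available as a dictionary. The `is_match` callback is a hypothetical stand-in for "score over threshold", and the example matcher is deliberately silly.

```python
def assign_new_identifiers(new_ids, canonical, is_match):
    """Extend a stored canonical mapping, scoring only identifiers not seen before."""
    for ident in sorted(new_ids):
        if ident in canonical:
            continue  # resolved by an earlier run: reuse it, do not recompute
        matched_groups = [group for known, group in canonical.items()
                          if is_match(ident, known)]
        # Join the smallest matching group, or start a new group of one.
        canonical[ident] = min(matched_groups) if matched_groups else ident
    return canonical

# Mapping stored by a previous run (hypothetical data).
stored = {"a": "a", "b": "a", "c": "c"}

# Toy matcher (an assumption): identifiers match if they share a first letter.
same_first_letter = lambda x, y: x[0] == y[0]

updated = assign_new_identifiers(["d", "a2", "b"], stored, same_first_letter)
```

This only attaches new identifiers to existing groups. It does not merge two existing groups when a new identifier happens to match both.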

Third, I'm sure that you have some notion of a "similar record". Map records to some sort of record group, and in the reduce take the "small enough" groups and map pairs off as "possibly the same". Send those pairs to a MapReduce that grabs the "possibly the same" records and creates a lookup mapping each machine identifier to the ones that "came up as possibly the same more than x times". Save that. Then repeat the above, except that in step 2 you send only the machine identifier pairs that "came up as possibly the same" as each other. This shortcut will reduce the work.

This general strategy is a lot of work. Good luck.
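Here is a minimal Python sketch of that candidate-pair filter, again in memory rather than as MapReduce jobs. The `group_key` function is whatever your notion of "similar record" turns out to be; using the visited path as the key is purely an assumption for the example, as are the parameter names.

```python
from collections import Counter, defaultdict
from itertools import combinations

def possibly_same_pairs(records, group_key, max_group_size, x):
    """Return identifier pairs that shared a small record group more than x times.

    records: iterable of (machine_identifier, record) tuples.
    group_key: function mapping a record to its record group.
    """
    # "Map": bucket machine identifiers by record group.
    groups = defaultdict(set)
    for machine_id, record in records:
        groups[group_key(record)].add(machine_id)

    # "Reduce": only small enough groups pair their members off.
    counts = Counter()
    for members in groups.values():
        if 2 <= len(members) <= max_group_size:
            counts.update(combinations(sorted(members), 2))

    # Keep the pairs that came up as possibly the same more than x times.
    return {pair for pair, n in counts.items() if n > x}

# Hypothetical records: (machine identifier, visited path).
records = [
    ("m1", "/a"), ("m2", "/a"),
    ("m1", "/b"), ("m2", "/b"), ("m3", "/b"),
    ("m1", "/c"), ("m3", "/c"),
]
pairs = possibly_same_pairs(records, group_key=lambda r: r, max_group_size=3, x=1)
```

Only the pairs returned here would then go through the expensive scoring, instead of every possible pair.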

