hadoop - How to read a record that is split into multiple lines, and how to handle records broken by an input split
I have the log file below:
```
begin ... 12-07-2008 02:00:05         ----> record1
incidentid: inc001
description: blah blah blah
owner: abc
status: resolved
end .... 13-07-2008 02:00:05

begin ... 12-07-2008 03:00:05         ----> record2
incidentid: inc002
description: blah blah blah ... (a much longer description) ... blah blah blah
owner: abc
status: resolved
end .... 13-07-2008 03:00:05
```

I want to process this with MapReduce, extracting the incident id, the status, and the time taken by each incident.
How do I handle the fact that records have variable lengths, and what happens if an input split falls before a record ends?
You'll need to write your own InputFormat and RecordReader to ensure proper file splitting around your record delimiters.
Basically, your RecordReader needs to seek to its split's byte offset, then scan forward (reading lines) until it finds either:
- a "begin ..." line - it then reads lines up to the next "end ..." line and provides the lines between "begin" and "end" as the input for the next record, or
- the end of the split, or EOF - in which case there are no more records for this reader.
This is similar in algorithm to how Mahout's XmlInputFormat handles multi-line XML input - in fact you might be able to amend that source code directly to handle this situation. A sketch of such a reader follows.
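Here's a minimal sketch of that approach using the new mapreduce API. The class name IncidentRecordReader and the exact "begin"/"end" marker tests are assumptions based on the log sample above, not a definitive implementation - adjust them to match your real delimiters:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class IncidentRecordReader extends RecordReader<LongWritable, Text> {
    private long start, end, pos;
    private LineReader in;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();
    private final Text line = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        Configuration conf = context.getConfiguration();
        start = fileSplit.getStart();
        end = start + fileSplit.getLength();
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream fsin = fs.open(file);
        fsin.seek(start);                 // seek to this split's byte offset
        in = new LineReader(fsin, conf);
        pos = start;
        // If start != 0 the first readLine() may return the tail of a line cut
        // by the split boundary; it won't start with "begin", so the scan in
        // nextKeyValue() simply skips it.
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // Scan forward for a "begin" line, but stop once we pass the end of
        // the split - a record beginning in the next split belongs to the
        // reader that owns that split.
        while (pos < end) {
            int size = in.readLine(line);
            if (size == 0) return false;               // EOF
            long lineStart = pos;
            pos += size;
            if (line.toString().startsWith("begin")) {
                StringBuilder record = new StringBuilder(line.toString());
                // Read lines up to the matching "end" line, deliberately
                // continuing past the split boundary if the record straddles it.
                while (true) {
                    size = in.readLine(line);
                    if (size == 0) return false;       // truncated record at EOF
                    pos += size;
                    record.append('\n').append(line.toString());
                    if (line.toString().startsWith("end")) {
                        key.set(lineStart);
                        value.set(record.toString());
                        return true;
                    }
                }
            }
        }
        return false;                                  // scanned past split end
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public void close() throws IOException { in.close(); }

    @Override
    public float getProgress() {
        return end == start ? 0f
                : Math.min(1f, (pos - start) / (float) (end - start));
    }
}
```

You'd wire this up with a small FileInputFormat subclass whose createRecordReader() returns this reader. Your mapper then receives one complete begin/end record as its value, and can pull out incidentid, status, and the two timestamps (to compute time taken) with plain string parsing or a regex.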
As mentioned in @irw's answer, NLineInputFormat is another option if your records have a fixed number of lines per record, but it is inefficient for larger files since the input format has to open and read the entire file to discover the line offsets in its getSplits() method.
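For completeness, a sketch of the NLineInputFormat route - this only applies if every record spans exactly the same number of lines, and the 7 below is purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "incident-report");
        job.setInputFormatClass(NLineInputFormat.class);
        // Each split (and hence each map task) gets exactly 7 input lines;
        // the 7 is hypothetical - use your actual lines-per-record count.
        NLineInputFormat.setNumLinesPerSplit(job, 7);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        // ... set mapper/reducer, output format and path, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the mapper still receives its lines one at a time; with one record per split it can simply accumulate everything it sees in a single task back into one record.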