hadoop - How to read a record that is split into multiple lines and also how to handle broken records during input split


I have the log file below:

begin ... 12-07-2008 02:00:05         ----> record1 incidentid: inc001 description: blah blah blah  owner: abc  status: resolved  end .... 13-07-2008 02:00:05  begin ... 12-07-2008 03:00:05         ----> record2  incidentid: inc002  description: blah blah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah owner: abc  status: resolved  end .... 13-07-2008 03:00:05 

I want to use MapReduce to process this, and I want to extract the incident ID, the status, and the time taken for each incident.

How do I handle both records, given that they have variable record lengths, and what happens if an input split falls before a record ends?

You'll need to write your own input format and record reader to ensure proper file splitting around your record delimiter.

Basically, your record reader will need to seek to its split's byte offset and scan forward (reading lines) until either:

  • it finds the begin ... line
    • it then reads lines up to the next end ... line, and provides the lines between begin and end as the input for the next record
  • it scans past the end of the split or finds EOF
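The scanning steps above can be sketched in plain Java. This is only an illustration of the algorithm, not Hadoop code: it works on an in-memory byte array instead of an HDFS stream, and the names SplitRecordScanner and readRecords are made up for the example. It assumes '\n' line endings and that lines starting with "begin" or "end" only ever occur as record delimiters. A record belongs to the split in which its begin line starts, so the reader may read past the end of its split to finish a record, and it skips any record tail left over from the previous split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: simulates one record reader
// working on the byte range [start, end) of a file held in memory.
class SplitRecordScanner {

    /** Returns the records whose "begin" line starts inside [start, end). */
    static List<String> readRecords(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // A split rarely starts on a line boundary, so move to the first
        // line that starts at or after 'start'.
        if (pos > 0) {
            pos = nextLineStart(data, pos - 1);
        }
        while (pos < end && pos < data.length) {
            int next = nextLineStart(data, pos);
            String line = lineAt(data, pos, next);
            pos = next;
            if (!line.startsWith("begin")) {
                continue; // tail of a record owned by the previous split
            }
            StringBuilder record = new StringBuilder(line);
            boolean complete = false;
            // Keep reading, even past 'end', until the closing line.
            while (pos < data.length) {
                next = nextLineStart(data, pos);
                line = lineAt(data, pos, next);
                pos = next;
                record.append('\n').append(line);
                if (line.startsWith("end")) {
                    complete = true;
                    break;
                }
            }
            if (complete) {
                records.add(record.toString());
            }
            // An incomplete record here means EOF was hit first; it is dropped.
        }
        return records;
    }

    /** Offset of the first byte after the next '\n' at or after 'from'. */
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    /** Decodes the line in [from, next), without its trailing newline. */
    static String lineAt(byte[] data, int from, int next) {
        int stop = next;
        if (stop > from && data[stop - 1] == '\n') {
            stop--;
        }
        return new String(data, from, stop - from, StandardCharsets.UTF_8);
    }
}
```

Calling readRecords once per split range, for example (0, 40) and (40, data.length), should return every record exactly once even when offset 40 falls in the middle of a record. In a real job the same logic would sit inside a custom RecordReader's initialize and nextKeyValue methods, reading from the split's file stream.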

This is similar in algorithm to how Mahout's XmlInputFormat handles multi-line XML input. In fact, you might be able to amend that source code directly to handle your situation.

As mentioned in @irw's answer, NLineInputFormat is another option if your records have a fixed number of lines per record, but it is inefficient for larger files, as it has to open and read the entire file to discover the line offsets in the input format's getSplits() method.
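Once the record reader hands a whole record to the mapper, pulling out the incident ID, status, and time taken is ordinary string parsing. Here is a rough sketch, again in plain Java with made-up names (IncidentParser, Incident). It assumes the timestamps are dd-MM-yyyy HH:mm:ss, which is inferred from the sample (13-07-2008 cannot be month 13), and that each field sits on its own line as in the log above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the mapper side; not part of Hadoop.
class IncidentParser {

    /** Simple value holder for the three fields the question asks for. */
    static class Incident {
        String id;
        String status;
        Duration taken; // null if either timestamp is missing
    }

    // Assumed timestamp layout, inferred from the sample log.
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("dd-MM-yyyy HH:mm:ss");
    private static final Pattern STAMP =
            Pattern.compile("(\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2})");

    /** Parses one complete begin ... end record. */
    static Incident parse(String record) {
        Incident incident = new Incident();
        LocalDateTime begin = null;
        LocalDateTime end = null;
        for (String raw : record.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("incidentid:")) {
                incident.id = line.substring("incidentid:".length()).trim();
            } else if (line.startsWith("status:")) {
                incident.status = line.substring("status:".length()).trim();
            } else if (line.startsWith("begin")) {
                begin = stamp(line);
            } else if (line.startsWith("end")) {
                end = stamp(line);
            }
        }
        if (begin != null && end != null) {
            incident.taken = Duration.between(begin, end);
        }
        return incident;
    }

    /** Finds the first timestamp on the line, ignoring any trailing text. */
    private static LocalDateTime stamp(String line) {
        Matcher m = STAMP.matcher(line);
        return m.find() ? LocalDateTime.parse(m.group(1), TS) : null;
    }
}
```

A mapper could call parse on each record value and emit the incident ID as the key, with the status and the duration as the value.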

