hadoop - How to read a record that is split into multiple lines, and how to handle records broken by an input split
I have the log file below:
```
begin ... 12-07-2008 02:00:05         ----> record1
incidentid: inc001
description: blah blah blah
owner: abc
status: resolved
end .... 13-07-2008 02:00:05

begin ... 12-07-2008 03:00:05         ----> record2
incidentid: inc002
description: blah blah blah ... (a much longer description) ... blah blah blah
owner: abc
status: resolved
end .... 13-07-2008 03:00:05
```

I want to process this with MapReduce, extracting the incident id, the status, and the time taken by each incident.
How do I handle the fact that records have variable lengths, and what happens if an input split falls before a record ends?
You'll need to write your own InputFormat and RecordReader to ensure proper file splitting around your record delimiters.
Basically, your RecordReader needs to seek to its split's byte offset, then scan forward (reading lines) until it finds either:
- a "begin ..." line - it then reads lines up to the next "end ..." line and provides the lines between "begin" and "end" as the input for the next record, or
- the end of the split, or EOF - in which case there are no more records for this reader.
This is similar in algorithm to how Mahout's XmlInputFormat handles multi-line XML input - in fact you might be able to amend that source code directly to handle this situation. A sketch of such a reader follows.
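Here's a minimal sketch of that approach using the new mapreduce API. The class name IncidentRecordReader and the exact "begin"/"end" marker tests are assumptions based on the log sample above, not a definitive implementation - adjust them to match your real delimiters:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class IncidentRecordReader extends RecordReader<LongWritable, Text> {
    private long start, end, pos;
    private LineReader in;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();
    private final Text line = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        Configuration conf = context.getConfiguration();
        start = fileSplit.getStart();
        end = start + fileSplit.getLength();
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream fsin = fs.open(file);
        fsin.seek(start);                 // seek to this split's byte offset
        in = new LineReader(fsin, conf);
        pos = start;
        // If start != 0 the first readLine() may return the tail of a line cut
        // by the split boundary; it won't start with "begin", so the scan in
        // nextKeyValue() simply skips it.
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // Scan forward for a "begin" line, but stop once we pass the end of
        // the split - a record beginning in the next split belongs to the
        // reader that owns that split.
        while (pos < end) {
            int size = in.readLine(line);
            if (size == 0) return false;               // EOF
            long lineStart = pos;
            pos += size;
            if (line.toString().startsWith("begin")) {
                StringBuilder record = new StringBuilder(line.toString());
                // Read lines up to the matching "end" line, deliberately
                // continuing past the split boundary if the record straddles it.
                while (true) {
                    size = in.readLine(line);
                    if (size == 0) return false;       // truncated record at EOF
                    pos += size;
                    record.append('\n').append(line.toString());
                    if (line.toString().startsWith("end")) {
                        key.set(lineStart);
                        value.set(record.toString());
                        return true;
                    }
                }
            }
        }
        return false;                                  // scanned past split end
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public void close() throws IOException { in.close(); }

    @Override
    public float getProgress() {
        return end == start ? 0f
                : Math.min(1f, (pos - start) / (float) (end - start));
    }
}
```

You'd wire this up with a small FileInputFormat subclass whose createRecordReader() returns this reader. Your mapper then receives one complete begin/end record as its value, and can pull out incidentid, status, and the two timestamps (to compute time taken) with plain string parsing or a regex.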
As mentioned in @irw's answer, NLineInputFormat is another option if your records have a fixed number of lines per record, but it is inefficient for larger files since the input format has to open and read the entire file to discover the line offsets in its getSplits() method.
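For completeness, a sketch of the NLineInputFormat route - this only applies if every record spans exactly the same number of lines, and the 7 below is purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "incident-report");
        job.setInputFormatClass(NLineInputFormat.class);
        // Each split (and hence each map task) gets exactly 7 input lines;
        // the 7 is hypothetical - use your actual lines-per-record count.
        NLineInputFormat.setNumLinesPerSplit(job, 7);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        // ... set mapper/reducer, output format and path, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the mapper still receives its lines one at a time; with one record per split it can simply accumulate everything it sees in a single task back into one record.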