hadoop - How to read a record that is split into multiple lines, and how to handle records broken across input splits
I have a log file like the one below:
    begin ... 12-07-2008 02:00:05 ----> record1
    incidentid: inc001
    description: blah blah blah
    owner: abc
    status: resolved
    end .... 13-07-2008 02:00:05
    begin ... 12-07-2008 03:00:05 ----> record2
    incidentid: inc002
    description: blah blah blahblahblahblahblahblahblahblahblahblahblah... (a much longer description that runs over many lines)
    owner: abc
    status: resolved
    end .... 13-07-2008 03:00:05
I want to use MapReduce to process this, extracting the incident id, the status, and the time taken by each incident.
How do I handle the records, given that they have variable lengths, and what happens if an input split occurs before a record ends?
You'll need to write your own input format and record reader to ensure proper file splitting around the record delimiters.
Basically your record reader needs to seek to its split's byte offset, then scan forward (reading lines) until it finds either:
- the next "begin ..." line, in which case it reads lines up to the following "end ..." line and provides the lines between begin and end as the input for the next record (a sketch follows this list)
- that it has scanned past the end of its split or hit EOF
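Here is a minimal sketch of what that record reader (with a trivial input format around it) could look like, using the org.apache.hadoop.mapreduce API. The class names and the exact "begin"/"end" prefix matching are illustrative assumptions on my part, not code from the original post:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.util.LineReader;

    public class IncidentInputFormat extends FileInputFormat<LongWritable, Text> {
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new IncidentRecordReader();
        }

        // One record = all lines from a "begin ..." line through the next "end ..." line.
        public static class IncidentRecordReader extends RecordReader<LongWritable, Text> {
            private long start, end, pos;
            private FSDataInputStream in;
            private LineReader reader;
            private final LongWritable key = new LongWritable();
            private final Text value = new Text();
            private final Text line = new Text();

            @Override
            public void initialize(InputSplit genericSplit, TaskAttemptContext context)
                    throws IOException {
                FileSplit split = (FileSplit) genericSplit;
                Configuration conf = context.getConfiguration();
                start = split.getStart();
                end = start + split.getLength();
                Path file = split.getPath();
                in = file.getFileSystem(conf).open(file);
                in.seek(start);
                reader = new LineReader(in, conf);
                pos = start;
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                // Only open a record whose "begin" line starts inside this split;
                // a record begun in the previous split is that split's responsibility.
                while (pos < end) {
                    int read = reader.readLine(line);
                    if (read == 0) return false;               // EOF
                    long recordStart = pos;
                    pos += read;
                    if (line.toString().startsWith("begin")) {
                        StringBuilder record = new StringBuilder(line.toString());
                        // Keep reading, even past the split boundary, until "end ...".
                        while ((read = reader.readLine(line)) > 0) {
                            pos += read;
                            record.append('\n').append(line.toString());
                            if (line.toString().startsWith("end")) break;
                        }
                        key.set(recordStart);
                        value.set(record.toString());
                        return true;
                    }
                }
                return false;  // scanned past the end of the split without a "begin"
            }

            @Override public LongWritable getCurrentKey() { return key; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() {
                return end == start ? 0.0f
                        : Math.min(1.0f, (pos - start) / (float) (end - start));
            }
            @Override public void close() throws IOException { in.close(); }
        }
    }

Your mapper then receives one complete record per call and can pull out the incidentid, status, and the begin/end timestamps with a regex or simple string parsing.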
This is similar in algorithm to how Mahout's XmlInputFormat handles multi-line XML input - in fact you might be able to amend that source code directly to handle this situation.
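If you do go the XmlInputFormat route, note that it matches arbitrary byte sequences rather than actual XML tags, so the job setup might look roughly like this. The xmlinput.start / xmlinput.end keys come from Mahout's implementation; everything else here is an assumption, and the class's package differs between Mahout versions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.mahout.text.wikipedia.XmlInputFormat; // package varies by Mahout version

    public class IncidentJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The "tags" are plain byte sequences, so the log's record markers work,
            // provided "begin"/"end" never occur inside a record's field values.
            conf.set("xmlinput.start", "begin");
            conf.set("xmlinput.end", "end");
            Job job = Job.getInstance(conf, "incident extraction");
            job.setInputFormatClass(XmlInputFormat.class);
            // ... set mapper, reducer, input/output paths as usual ...
        }
    }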
As mentioned in @irw's answer, NLineInputFormat is an option if your records have a fixed number of lines each, but it is inefficient for larger files since it has to open and read the entire file to discover the line offsets in the input format's getSplits() method.
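For completeness, the fixed-line-count approach would look something like this, assuming purely for illustration that every record were exactly six lines (as in the sample). NLineInputFormat gives each mapper N lines per split, with map() still called once per line, so the mapper would have to buffer the lines of its record:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class FixedLengthIncidentDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "fixed-length incidents");
            job.setInputFormatClass(NLineInputFormat.class);
            // Each split (and therefore each mapper) gets exactly 6 input lines,
            // i.e. one whole record, though still delivered line by line.
            NLineInputFormat.setNumLinesPerSplit(job, 6);
            // ... set mapper, reducer, input/output paths as usual ...
        }
    }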