Process large data in Python


I need to process data that is a few hundred times bigger than my RAM. I'd like to read it in a large chunk, process it, save the result, free the memory, and repeat. Is there a way to make this efficient in Python?

The general key is that you want to process the file iteratively.

If you're dealing with a text file, this is trivial: for line in f: reads in one line at a time. (It actually buffers things up, but the buffers are small enough that you don't have to worry about them.)
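For example, a minimal sketch of counting matching lines without ever holding the whole file in memory (the file name and search string here are just placeholders):

count = 0
with open('huge.log') as f:     # text mode; iterating yields one line at a time
    for line in f:
        if 'ERROR' in line:
            count += 1
print(count)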

If you're dealing with some other specific file type (a NumPy binary file, a CSV file, an XML document, etc.), there are similar special-purpose solutions, but nobody can describe them for you unless you tell us what kind of data you have.
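For instance, a minimal sketch using the csv module, which hands you one row at a time (the filename is just a placeholder; numpy.memmap and xml.etree.ElementTree.iterparse play similar roles for NumPy binaries and XML):

import csv

with open('huge.csv', newline='') as f:
    for row in csv.reader(f):    # rows are read lazily, one at a time
        process(row)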

But what if you have a general binary file?


First, the read method takes an optional max bytes to read. So, instead of this:

data = f.read()
process(data)

you can do this:

while True:
    data = f.read(8192)
    if not data:
        break
    process(data)

Alternatively, you may want to write a function like this:

def chunks(f):
    while True:
        data = f.read(8192)
        if not data:
            break
        yield data

Then you can just do this:

for chunk in chunks(f):
    process(chunk)

You could also do this with two-argument iter, but many people find that a bit obscure:

from functools import partial

for chunk in iter(partial(f.read, 8192), b''):
    process(chunk)

Either way, this option applies to all of the other variants below (except for the single mmap, which is trivial enough that there's no point).


There's nothing magic about the number 8192 there. You generally want a power of 2, and ideally a multiple of your system's page size. Beyond that, your performance won't vary that much whether you're using 4KB or 4MB, and if it does, you'll have to test what works best for your use case.
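If you do want to test it, here's a minimal sketch of timing a few chunk sizes against your own process() on your own file (the path and the sizes tried here are just placeholders):

import time

def time_chunksize(path, chunksize):
    start = time.perf_counter()
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunksize)
            if not data:
                break
            process(data)
    return time.perf_counter() - start

for chunksize in (4096, 65536, 1024*1024, 4*1024*1024):
    print(chunksize, time_chunksize('bigfile.bin', chunksize))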


Anyway, this assumes you can just process each 8K at a time without keeping around any context. If you're, e.g., feeding data into a progressive decoder or hasher or something, that's perfect.
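For example, a sketch of the hasher case: hashlib objects accept data incrementally, so you can feed each chunk in as you read it (the path is a placeholder, and chunks() is the generator defined above):

import hashlib

h = hashlib.sha256()
with open('bigfile.bin', 'rb') as f:
    for chunk in chunks(f):    # feed the file through 8K at a time
        h.update(chunk)
print(h.hexdigest())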

But if you need to process one "chunk" at a time, your chunks may end up straddling an 8K boundary. How do you deal with that?

It depends on how your chunks are delimited in the file, but the basic idea is pretty simple. For example, let's use NUL bytes as a separator (not very likely, but easy to show in a toy example).

data = b''
while True:
    buf = f.read(8192)
    if not buf:
        process(data)
        break
    data += buf
    chunks = data.split(b'\0')
    for chunk in chunks[:-1]:
        process(chunk)
    data = chunks[-1]
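If you prefer the generator style from above, the same buffering logic can be packaged up like this (a sketch; delimited_chunks is just a name introduced here):

def delimited_chunks(f, delim=b'\0', bufsize=8192):
    data = b''
    while True:
        buf = f.read(bufsize)
        if not buf:
            if data:
                yield data        # whatever's left after the last delimiter
            return
        data += buf
        *complete, data = data.split(delim)
        yield from complete       # every complete, delimiter-terminated chunk

for chunk in delimited_chunks(f):
    process(chunk)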

This kind of code is very common in networking (because sockets can't just "read all", you always have to read into a buffer and chunk things up into messages), so you may find useful examples in networking code that uses a protocol similar to your file format.


Alternatively, you can use mmap.

If your virtual memory size is larger than the file, this is trivial:

import mmap

with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    process(m)

Now m acts like a giant bytes object, just as if you'd called read() to read the whole thing into memory, but the OS will automatically page bits in and out of memory as necessary.
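For instance (a sketch, with a placeholder filename), you can slice and search the mapping just like a bytes object without ever loading the whole file:

import mmap

with open('bigfile.bin', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        print(len(m))          # size of the whole file
        print(m[:16])          # slicing returns a bytes copy of that range
        print(m.find(b'\0'))   # searching pages data in only as needed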


If you're trying to read a file too big to fit in your virtual memory size (e.g., a 4GB file with 32-bit Python, or a 20EB file with 64-bit Python, which is only likely to happen in 2013 if you're reading a sparse or virtual file like, say, the VM file for another process on Linux), you have to implement windowing: mmap in a piece of the file at a time. For example:

import os
import mmap

windowsize = 8*1024*1024
size = os.fstat(f.fileno()).st_size
for start in range(0, size, windowsize):
    length = min(windowsize, size - start)   # don't map past the end of the file
    with mmap.mmap(f.fileno(), length=length, offset=start,
                   access=mmap.ACCESS_READ) as m:
        process(m)

Of course mapping windows has the same issue as reading chunks if you need to delimit things, and you can solve it the same way.

But, as an optimization, instead of buffering, you can just slide the window forward to the page containing the end of the last complete message, rather than advancing a full 8MB at a time, and then you can avoid any copying. This is a bit more complicated, so if you want to do it, search for something like "sliding mmap window", and write a new question if you get stuck.
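Just to give the flavor, here's a minimal sketch of that idea for the NUL-delimited toy format from above (the name process_sliding is made up here; it assumes no single record is longer than one window, and it re-aligns the offset because mmap requires it to be a multiple of mmap.ALLOCATIONGRANULARITY):

import mmap
import os

def process_sliding(f, windowsize=8*1024*1024):
    granularity = mmap.ALLOCATIONGRANULARITY
    size = os.fstat(f.fileno()).st_size
    pos = 0                                          # first unprocessed byte in the file
    while pos < size:
        start = (pos // granularity) * granularity   # mmap offset must be aligned
        length = min(windowsize, size - start)
        with mmap.mmap(f.fileno(), length=length, offset=start,
                       access=mmap.ACCESS_READ) as m:
            skip = pos - start                       # bytes already handled in this window
            end = m.rfind(b'\0')                     # last complete record boundary
            if end < skip:
                # No complete record left; treat the remainder as the final record.
                if skip < len(m):
                    process(m[skip:])
                break
            for chunk in m[skip:end].split(b'\0'):
                process(chunk)
            pos = start + end + 1                    # just past the last NUL we processed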

