Memory: oversized Python dictionary
I have a large CSV file (around 15e6 rows).
Each row looks like: id_user|id_software.
My aim is to build a dictionary whose keys are tuples of 2 distinct softwares and whose values are the probability that these 2 softwares are installed on the same computer (a computer = an id_user).
The first step consists in reading the CSV file and building a dictionary whose keys are user ids and whose values are tuples containing the softwares installed on that user's computer.
The second step creates the final dictionary by reading the first one.
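For illustration, here is a small toy example of the two dictionaries I am describing (the data is made up, not from my real file):

# Step 1: user id -> tuple of softwares installed on that user's computer.
dict_apparition = {
    1: (10, 20, 30),   # user 1 has softwares 10, 20 and 30
    2: (10, 20),       # user 2 has softwares 10 and 20
}
nb_user = 2

# Step 2: (software_a, software_b) -> probability that both are installed
# on the same computer, i.e. number of users having both / nb_user.
dict_prob_jointe = {
    (10, 20): 1.0,   # both users have 10 and 20
    (10, 30): 0.5,   # only user 1 has 10 and 30
    (20, 30): 0.5,   # only user 1 has 20 and 30
}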
My problem is that the first 5e5 lines of the CSV file already give birth to a 1 GB dictionary (I use heapy to profile my algorithm).
Any idea to help me solve this issue?
Here is the code:
import csv
import itertools

def compute_probabilities():
    dict_apparition = {}
    dict_prob_jointe = {}
    nb_user = 0
    with open('discovery_requests_id.csv', 'rU') as f:
        f = csv.reader(f)
        f.next()  # skip the header row
        for row in f:
            a = int(row[1])  # id_software
            b = int(row[0])  # id_user
            try:
                if a not in dict_apparition[b]:
                    dict_apparition[b] += (a,)
            except KeyError:
                dict_apparition[b] = (a,)
                nb_user += 1  # first time this user is seen
    for key1 in dict_apparition.keys():
        l = itertools.combinations(dict_apparition[key1], 2)
        for item in l:
            try:
                dict_prob_jointe[item] += 1.0 / nb_user
            except KeyError:
                dict_prob_jointe[item] = 1.0 / nb_user
    return dict_prob_jointe
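And this is roughly how I profile the memory with heapy (a minimal sketch, assuming the guppy package that provides heapy, and calling the function above):

from guppy import hpy

hp = hpy()
dict_prob_jointe = compute_probabilities()  # runs the code above
print hp.heap()  # prints a breakdown of the objects currently on the heap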
Thanks!