python - Organizing data read from Excel to Pandas DataFrame -
my goal script to: 1.read timseries data in excel file (>100,000k rows) headers (labels, units) 2.convert excel numeric dates best datetime object pandas dataframe 3.be able use timestamps reference rows , series labels reference columns
so far used xlrd read excel data list. made pandas series each list , used time list index. combined series series headers make python dictionary. passed dictionary pandas dataframe. despite efforts df.index seems set column headers , i'm not sure when convert dates datetime object.
i started using python 3 days ago advice great! here's code:
#open excel workbook , first sheet wb = xlrd.open_workbook("c:\greencsv\calgary\cwater.xlsx") sh = wb.sheet_by_index(0) #read rows containing labels , units labels = sh.row_values(1, start_colx=0, end_colx=none) units = sh.row_values(2, start_colx=0, end_colx=none) #initialize list hold data data = [none] * (sh.ncols) #read column column , store in list colnum in range(sh.ncols): data[colnum] = sh.col_values(colnum, start_rowx=5, end_rowx=none) #delete unecessary rows , columns del labels[3],labels[0:2], units[3], units[0:2], data[3], data[0:2] #create pandas series s = [none] * (sh.ncols - 4) colnum in range(sh.ncols - 4): s[colnum] = series(data[colnum+1], index=data[0]) #create dictionary of series dictionary = {} in range(sh.ncols-4): dictionary[i]= {labels[i] : s[i]} #pass dictionary pandas dataframe df = pd.dataframe.from_dict(dictionary)
you can use pandas directly here, create dictionary of dataframes (with keys being sheet name):
in [11]: xl = pd.excelfile("c:\greencsv\calgary\cwater.xlsx") in [12]: xl.sheet_names # in example may different out[12]: [u'sheet1', u'sheet2', u'sheet3'] in [13]: dfs = {sheet: xl.parse(sheet) sheet in xl.sheet_names} in [14]: dfs['sheet1'] # access dataframe sheet name
you can check out docs on parse
offers more options (for example skiprows
), , these allows parse individual sheets more control...
Comments
Post a Comment