Parsing HTML data into a Python list for manipulation


I am trying to read in HTML websites and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years of companies. Basically, I can read the page in and can use either BeautifulSoup or html2text to create a huge text block. I then want to search that text (I have been using re.search), but I can't seem to get it to work properly. Here is the line I am trying to access:

EPS (Basic)\n13.4620.6226.6930.1732.81\n\n

So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].

Thanks for any help.

    from stripogram import html2text
    from urllib import urlopen
    import re
    from BeautifulSoup import BeautifulSoup

    ticker_symbol = 'goog'
    url = 'http://www.marketwatch.com/investing/stock/'
    full_url = url + ticker_symbol + '/financials'  # build the URL

    text_soup = BeautifulSoup(urlopen(full_url).read())  # read in the page

    # join every text node into one big string
    text_parts = text_soup.findAll(text=True)
    text = ''.join(text_parts)

    eps = re.search("EPS\s+(\d+)", text)
    if eps is not None:
        print eps.group(1)

It's not a good practice to use regex for parsing HTML. Use the BeautifulSoup parser instead: find the cell with the rowTitle class and the EPS (Basic) text in it, then iterate over its next siblings that have the valueCell class:

    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup

    url = 'http://www.marketwatch.com/investing/stock/goog/financials'
    text_soup = BeautifulSoup(urlopen(url).read())  # read in the page

    # every row label lives in a <td class="rowTitle"> cell
    titles = text_soup.findAll('td', {'class': 'rowTitle'})
    for title in titles:
        if 'EPS (Basic)' in title.text:
            # the yearly values are the non-empty valueCell siblings of that label
            print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]

This prints:

['13.46', '20.62', '26.69', '30.17', '32.81'] 
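The values above are strings, while the question asks for a list of numbers. A minimal sketch of that last step follows, assuming every cell holds a plain decimal such as '13.46' (no thousands separators or placeholder dashes). The helper names `to_floats` and `split_two_decimal_values` are hypothetical and not part of BeautifulSoup. The second helper is only a fallback for the flattened text block from the question, and it assumes each value has exactly two decimal places.

```python
import re


def to_floats(cells):
    """Convert cell strings such as '13.46' to floats, skipping blank cells."""
    return [float(cell) for cell in cells if cell.strip()]


def split_two_decimal_values(blob):
    """Split run-together values, assuming each has exactly two decimals."""
    return [float(match) for match in re.findall(r'\d+\.\d{2}', blob)]


cells = ['13.46', '20.62', '26.69', '30.17', '32.81']
print(to_floats(cells))
# -> [13.46, 20.62, 26.69, 30.17, 32.81]

print(split_two_decimal_values('13.4620.6226.6930.1732.81'))
# -> [13.46, 20.62, 26.69, 30.17, 32.81]
```

Working from the parsed cells is the more robust route, since the two-decimal assumption breaks as soon as a value is printed with a different precision.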

Hope that helps.

