Parsing HTML data into a Python list for manipulation
I am trying to read in HTML websites and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years of companies. Basically, I can read the page in and use either BeautifulSoup or html2text to create a huge text block. I then want to search that text (I have been using re.search) but can't seem to get it to work properly. Here is the line I am trying to access:
EPS (Basic)\n13.4620.6226.6930.1732.81\n\n
So I would like to create a list called eps = [13.46, 20.62, 26.69, 30.17, 32.81].
Thanks for any help.
from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup

ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials'  # build url

text_soup = BeautifulSoup(urlopen(full_url).read())  # read in

text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)

eps = re.search("EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)
It's not good practice to use regex for parsing HTML. Use the BeautifulSoup parser instead: find the cell with the rowTitle class and the EPS (Basic) text in it, then iterate over its next siblings that have the valueCell class:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read())  # read in

# find every row-title cell, then pick the one labelled EPS (Basic)
titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        # collect the non-empty value cells that follow the title cell
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
Prints:
['13.46', '20.62', '26.69', '30.17', '32.81']
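Since the question asks for a list of numbers rather than strings, the cell texts can be passed through float. This is only a minimal sketch, assuming the texts look exactly like the output above; the names `cells` and `eps` are illustrative and not part of the original code.

```python
# Cell texts as printed above; in practice they would come from the
# list comprehension over the value cells.
cells = ['13.46', '20.62', '26.69', '30.17', '32.81']

# float() raises ValueError on non-numeric text such as '-' or '1.2B',
# so cells from a real page may need extra cleaning first.
eps = [float(value) for value in cells]

print(eps)
```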
Hope that helps.
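For readers without BeautifulSoup installed, the same idea (find the title cell, then read the value cells that follow it in the same row) can be sketched with Python 3's standard html.parser module. This is an illustration under assumptions, not the answer's method: `SAMPLE_HTML` is a made-up fragment shaped like the row described above, `RowValueParser` is a hypothetical helper, and each cell is assumed to carry a single class name.

```python
from html.parser import HTMLParser

# Made-up fragment; only the first row's numbers come from the question.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="rowTitle">EPS (Basic)</td>
    <td class="valueCell">13.46</td>
    <td class="valueCell">20.62</td>
    <td class="valueCell">26.69</td>
    <td class="valueCell">30.17</td>
    <td class="valueCell">32.81</td>
    <td class="valueCell"></td>
  </tr>
  <tr>
    <td class="rowTitle">Some other row</td>
    <td class="valueCell">1.00</td>
  </tr>
</table>
"""


class RowValueParser(HTMLParser):
    """Collect valueCell numbers from the row whose rowTitle contains `title`."""

    def __init__(self, title):
        super().__init__()
        self.title = title
        self.values = []
        self._cell_class = None       # class of the <td> currently open
        self._in_target_row = False   # True once the wanted title was seen

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._in_target_row = False  # every row starts unmatched
        elif tag == "td":
            self._cell_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "td":
            self._cell_class = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace between tags
        if self._cell_class == "rowTitle":
            self._in_target_row = self.title in text
        elif self._cell_class == "valueCell" and self._in_target_row:
            self.values.append(float(text))


parser = RowValueParser("EPS (Basic)")
parser.feed(SAMPLE_HTML)
print(parser.values)
```

The empty value cell produces no text event, so it is skipped much like the `if td.text` filter in the answer, and the second row is ignored because its title does not match.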