python - Is it possible for Scrapy to get plain text from raw HTML data directly, instead of using XPath selectors?
For example:

    scrapy shell http://scrapy.org/
    content = hxs.select('//*[@id="content"]').extract()[0]
    print content

Then I got the following raw HTML code:
<div id="content"> <h2>welcome scrapy</h2> <h3>what scrapy?</h3> <p>scrapy fast high-level screen scraping , web crawling framework, used crawl websites , extract structured data pages. can used wide range of purposes, data mining monitoring , automated testing.</p> <h3>features</h3> <dl> <dt>simple</dt><dt> </dt><dd>scrapy designed simplicity in mind, providing features need without getting in way</dd> <dt>productive</dt> <dd>just write rules extract data web pages , let scrapy crawl entire web site you</dd> <dt>fast</dt> <dd>scrapy used in production crawlers scrape more 500 retailer sites daily, in 1 server</dd> <dt>extensible</dt> <dd>scrapy designed extensibility in mind , provides several mechanisms plug new code without having touch framework core </dd><dt>portable, open-source, 100% python</dt> <dd>scrapy written in python , runs on linux, windows, mac , bsd</dd> <dt>batteries included</dt> <dd>scrapy comes lots of functionality built in. check <a href="http://doc.scrapy.org/en/latest/intro/overview.html#what-else">this section</a> of documentation list of them.</dd> <dt>well-documented & well-tested</dt> <dd>scrapy <a href="/doc/">extensively documented</a> , has comprehensive test suite <a href="http://static.scrapy.org/coverage-report/">very code coverage</a></dd> <dt><a href="/community">healthy community</a></dt> <dd> 1,500 watchers, 350 forks on github (<a href="https://github.com/scrapy/scrapy">link</a>)<br> 700 followers on twitter (<a href="http://twitter.com/scrapyproject">link</a>)<br> 850 questions on stackoverflow (<a href="http://stackoverflow.com/tags/scrapy/info">link</a>)<br> 200 messages per month on mailing list (<a href="https://groups.google.com/forum/?fromgroups#!aboutgroup/scrapy-users">link</a>)<br> 40-50 users connected irc channel (<a href="http://webchat.freenode.net/?channels=scrapy">link</a>) </dd> <dt><a href="/support">commercial support</a></dt> <dd>a few companies provide scrapy consulting , support</dd> <p>still not sure if 
scrapy you're looking for?. check out <a href="http://doc.scrapy.org/en/latest/intro/overview.html">scrapy @ glance</a>. </p><h3>companies using scrapy</h3> <p>scrapy being used in large production environments, crawl thousands of sites daily. here list of <a href="/companies/">companies using scrapy</a>.</p> <h3>where start?</h3> <p>start reading <a href="http://doc.scrapy.org/en/latest/intro/overview.html">scrapy @ glance</a>, <a href="/download/">download scrapy</a> , follow <a href="http://doc.scrapy.org/en/latest/intro/tutorial.html">tutorial</a>. </p></dl></div>

But I want to get the following plain text directly from Scrapy:
welcome scrapy
what scrapy?
scrapy fast high-level screen scraping , web crawling framework, used crawl websites , extract structured data pages. can used wide range of purposes, data mining monitoring , automated testing.
features
- simple
- scrapy designed simplicity in mind, providing features need without getting in way
- productive
- just write rules extract data web pages , let scrapy crawl entire web site you
- fast
- scrapy used in production crawlers scrape more 500 retailer sites daily, in 1 server
- extensible
- scrapy designed extensibility in mind , provides several mechanisms plug new code without having touch framework core
- portable, open-source, 100% python
- scrapy written in python , runs on linux, windows, mac , bsd
- batteries included
- scrapy comes lots of functionality built in. check section of documentation list of them.
- well-documented & well-tested
- scrapy extensively documented , has comprehensive test suite very code coverage
- healthy community
- 1,500 watchers, 350 forks on github (link)
700 followers on twitter (link)
850 questions on stackoverflow (link)
200 messages per month on mailing list (link)
40-50 users connected irc channel (link)
- commercial support
- a few companies provide scrapy consulting , support
still not sure if scrapy you're looking for?. check out scrapy @ glance.
companies using scrapy
scrapy being used in large production environments, crawl thousands of sites daily. here list of companies using scrapy.
where start?
start reading scrapy @ glance, download scrapy , follow tutorial.
I do not want to use XPath selectors to extract the p, h2, h3, etc. tags, because I am crawling a website whose main content is embedded in table and tbody elements, nested recursively. Finding the right XPath can be a tedious task. Can this be done with a built-in function in Scrapy, or do I need an external tool to convert it? I have read through Scrapy's docs, but found nothing. Here is a sample site that can convert raw HTML to plain text: http://beaker.mailchimp.com/html-to-text
Scrapy doesn't have such functionality built in. html2text is what you are looking for.
Here's a sample spider that scrapes Wikipedia's Python page, gets the first paragraph using XPath, and converts the HTML into plain text using html2text:
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider
    import html2text


    class WikiSpider(BaseSpider):
        name = "wiki_spider"
        allowed_domains = ["www.wikipedia.org"]
        start_urls = ["http://en.wikipedia.org/wiki/python_(programming_language)"]

        def parse(self, response):
            # Select the first paragraph of the article body as raw HTML.
            hxs = HtmlXPathSelector(response)
            sample = hxs.select("//div[@id='mw-content-text']/p[1]").extract()[0]

            # Convert the HTML fragment to plain text, dropping link targets.
            converter = html2text.HTML2Text()
            converter.ignore_links = True
            print(converter.handle(sample))  # Python 3 print syntax

It prints:
**Python** is a widely used general-purpose, high-level programming language.[11][12][13] Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C.[14][15] The language provides constructs intended to enable clear programs on both a small and large scale.[16]
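If adding html2text as a dependency is not an option, roughly the same idea can be sketched with only the standard library's `html.parser`. This is a minimal sketch, not a replacement for html2text: `TextExtractor`, `html_to_text`, and the tag sets below are hypothetical names chosen for illustration, and the "- " prefix for `dt`/`dd`/`li` is an assumption made to mirror the output asked for in the question.

```python
from html.parser import HTMLParser

# Assumption: tags that should start and end a line in the plain-text output.
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6",
              "li", "dt", "dd", "tr", "table"}
# Assumption: tags rendered as "- " items, as in the question's desired output.
ITEM_TAGS = {"li", "dt", "dd"}
# Tags whose content is never visible text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Hypothetical helper that collects the visible text of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")
            if tag in ITEM_TAGS:
                self._chunks.append("- ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Ignore text that sits inside <script> or <style>.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace per line; drop blank lines and empty items.
        raw_lines = "".join(self._chunks).split("\n")
        lines = (" ".join(line.split()) for line in raw_lines)
        return "\n".join(line for line in lines if line and line != "-")


def html_to_text(html):
    """Return the visible text of an HTML string, one block element per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

In the spider above, `html_to_text(sample)` could be called where `converter.handle(sample)` is used. Unlike html2text, this sketch produces no Markdown emphasis and drops link targets entirely, so it only suits cases where bare text is enough.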