python - Is it possible for Scrapy to get plain text from raw HTML data directly instead of using XPath selectors?


For example:

scrapy shell http://scrapy.org/
content = hxs.select('//*[@id="content"]').extract()[0]
print content

Then I got the following raw HTML:

<div id="content">
  <h2>Welcome to Scrapy</h2>
  <h3>What is Scrapy?</h3>
  <p>Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.</p>
  <h3>Features</h3>
  <dl>
    <dt>Simple</dt><dt> </dt>
    <dd>Scrapy was designed with simplicity in mind, by providing the features you need without getting in your way</dd>
    <dt>Productive</dt>
    <dd>Just write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you</dd>
    <dt>Fast</dt>
    <dd>Scrapy is used in production crawlers to scrape more than 500 retailer sites daily, in 1 server</dd>
    <dt>Extensible</dt>
    <dd>Scrapy was designed with extensibility in mind and provides several mechanisms to plug new code without having to touch the framework core</dd>
    <dt>Portable, open-source, 100% Python</dt>
    <dd>Scrapy is written in Python and runs on Linux, Windows, Mac and BSD</dd>
    <dt>Batteries included</dt>
    <dd>Scrapy comes with lots of functionality built in. Check <a href="http://doc.scrapy.org/en/latest/intro/overview.html#what-else">this section</a> of the documentation for a list of them.</dd>
    <dt>Well-documented &amp; well-tested</dt>
    <dd>Scrapy is <a href="/doc/">extensively documented</a> and has a comprehensive test suite with <a href="http://static.scrapy.org/coverage-report/">very good code coverage</a></dd>
    <dt><a href="/community">Healthy community</a></dt>
    <dd>
      1,500 watchers, 350 forks on GitHub (<a href="https://github.com/scrapy/scrapy">link</a>)<br>
      700 followers on Twitter (<a href="http://twitter.com/scrapyproject">link</a>)<br>
      850 questions on StackOverflow (<a href="http://stackoverflow.com/tags/scrapy/info">link</a>)<br>
      200 messages per month on mailing list (<a href="https://groups.google.com/forum/?fromgroups#!aboutgroup/scrapy-users">link</a>)<br>
      40-50 users connected to IRC channel (<a href="http://webchat.freenode.net/?channels=scrapy">link</a>)
    </dd>
    <dt><a href="/support">Commercial support</a></dt>
    <dd>A few companies provide Scrapy consulting and support</dd>
    <p>Still not sure if Scrapy is what you're looking for?. Check out <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a glance</a>.
    </p><h3>Companies using Scrapy</h3>
    <p>Scrapy is being used in large production environments, to crawl thousands of sites daily. Here is a list of <a href="/companies/">Companies using Scrapy</a>.</p>
    <h3>Where to start?</h3>
    <p>Start by reading <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a glance</a>, then <a href="/download/">download Scrapy</a> and follow the <a href="http://doc.scrapy.org/en/latest/intro/tutorial.html">Tutorial</a>.
    </p></dl></div>

But I want to get plain text like the following directly from Scrapy:

Welcome to Scrapy

What is Scrapy?

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features

Simple
Scrapy was designed with simplicity in mind, by providing the features you need without getting in your way
Productive
Just write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you
Fast
Scrapy is used in production crawlers to scrape more than 500 retailer sites daily, in 1 server
Extensible
Scrapy was designed with extensibility in mind and provides several mechanisms to plug new code without having to touch the framework core
Portable, open-source, 100% Python
Scrapy is written in Python and runs on Linux, Windows, Mac and BSD
Batteries included
Scrapy comes with lots of functionality built in. Check this section of the documentation for a list of them.
Well-documented & well-tested
Scrapy is extensively documented and has a comprehensive test suite with very good code coverage
Healthy community
1,500 watchers, 350 forks on GitHub (link)
700 followers on Twitter (link)
850 questions on StackOverflow (link)
200 messages per month on mailing list (link)
40-50 users connected to IRC channel (link)
Commercial support
A few companies provide Scrapy consulting and support

Still not sure if Scrapy is what you're looking for?. Check out Scrapy at a glance.

Companies using Scrapy

Scrapy is being used in large production environments, to crawl thousands of sites daily. Here is a list of Companies using Scrapy.

Where to start?

Start by reading Scrapy at a glance, then download Scrapy and follow the Tutorial.

I do not want to use XPath selectors to extract the p, h2, h3, etc. tags, because the website I am crawling has its main content embedded recursively in table and tbody elements, and finding the right XPath can be a tedious task. Can this be done with a built-in function in Scrapy, or do I need external tools to convert it? I have read through all of Scrapy's docs but have found nothing. Here is a sample site that can convert raw HTML into plain text: http://beaker.mailchimp.com/html-to-text

Scrapy doesn't have such functionality built in. html2text is what you are looking for.

Here's a sample spider that scrapes Wikipedia's Python page, gets the first paragraph using XPath, and converts the HTML into plain text using html2text:

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
import html2text


class WikiSpider(BaseSpider):
    name = "wiki_spider"
    allowed_domains = ["www.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        # Select the first paragraph of the article body as an HTML string.
        hxs = HtmlXPathSelector(response)
        sample = hxs.select("//div[@id='mw-content-text']/p[1]").extract()[0]

        # Convert the HTML fragment to plain text, dropping link targets.
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print(converter.handle(sample))  # Python 3 print syntax

This prints:

**Python** is a widely used general-purpose, high-level programming language.[11][12][13] Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C.[14][15] The language provides constructs intended to enable clear programs on both a small and large scale.[16]
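If installing an external package is not an option, a rough alternative is to strip the tags with Python's standard library. The following is only a minimal sketch, not html2text and not a Scrapy feature: the names `TextExtractor`, `html_to_text`, `BLOCK_TAGS` and `SKIP_TAGS` are made up for this example, and the choice of which tags produce a line break is an assumption you would probably need to tune. It ignores nesting depth, so content buried inside table/tbody layers comes out the same way as top-level content.

```python
from html.parser import HTMLParser

# Tags treated as line-break boundaries (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6",
              "li", "dt", "dd", "tr", "table"}
# Tags whose content is never visible text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects visible text regardless of how deeply the markup is nested."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace inside each line and drop empty lines.
        lines = (" ".join(line.split())
                 for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = ('<table><tbody><tr><td><div><h2>Welcome</h2>'
              '<p>Scrapy is   <a href="/doc/">fast</a> &amp; simple.</p>'
              '</div></td></tr></tbody></table>')
    print(html_to_text(sample))
```

In a spider you could pass the string returned by `.extract()[0]` to `html_to_text` in place of the html2text converter. Unlike html2text, this sketch produces no Markdown formatting (no `**bold**`, no link handling), so html2text is still the better fit when you want readable structure.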

