xml - Scrapy with deeper level using Xpath -

I am trying to scrape some information. The information I am looking for is accessible from this search address: [http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3dd+and+state%3dmn&first=2008&last=2011][1]

I want Scrapy to follow the link written [html] in each result row (let's try with one link first, before generalizing the row number associated with it). The XPath of one such link is:

/html/body/div/table/tbody/tr[29]/td[3]/a[2]

After crawling that link, I want Scrapy to crawl the XML file available on the next page. The XPath of that link is, in general:

//*[@id="formdiv"]/div/table/tbody/tr[3]/td[3]/a

And then I want Scrapy to scrape the data from the XML page.

I am launching Scrapy with: scrapy crawl dform -o items.json -t json, and all I get in the JSON file is: "[".

items.py

    from scrapy.item import Item, Field


    class SecFormD(Item):
        company = Field()
        filling_date = Field()
        types_of_securities = Field()
        offering_amount = Field()
        sold_amount = Field()
        remaining = Field()
        investors_accredited = Field()
        investors_non_accredited = Field()

*formds_crawler.py*

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.selector import XmlXPathSelector
    from formds.items import SecFormD


    class SecDForm(CrawlSpider):
        name = "dform"
        allowed_domain = ["http://www.sec.gov"]
        start_urls = [
            "http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3dd+and+state%3dmn&first=2008&last=2011"
            ]
        rules = (
            Rule(SgmlLinkExtractor(restrict_xpaths=('/html/body/div/table/tbody/tr[27]/td[3]/a[2]')), callback='parse_formd', follow=True),)

        def parse_formd(self, response):
            xxs = XmlXPathSelector(response)
            hsx = HtmlXPathSelector(response)

            sites = xxs.select('//*[@id="formdiv"]/div/table/tbody/tr[3]/td[3]/a')
            items = []
            for site in sites:
                item = SecFormD()
                item['company'] = site.select('//*[@id="collapsible1"]/div[1]/div[2]/div[2]/span[2]/text()').extract()
                item['filling_date'] = site.select('//*[@id="collapsible40"]/div[1]/div[2]/div[5]/span[2]/text()').extract()
                item['types_of_securities'] = site.select('//*[@id="collapsible37"]/div[1]/div[2]/div[1]/span[2]/text()').extract()
                item['offering_amount'] = site.select('//*[@id="collapsible39"]/div[1]/div[2]/div[1]/span[2]/text()').extract()
                item['sold_amount'] = site.select('//*[@id="collapsible39"]/div[1]/div[2]/div[2]/span[2]/text()').extract()
                item['remaining'] = site.select('//*[@id="collapsible39"]/div[1]/div[2]/div[3]/span[2]/text()').extract()
                item['investors_accredited'] = site.select('//*[@id="collapsible40"]/div[1]/div[2]/div[2]/span[2]/text()').extract()
                item['investors_non_accredited'] = site.select('//*[@id="collapsible40"]/div[1]/div[2]/div[1]/span[2]/text()').extract()

                items.append(item)
            return items

***Here is the log:***

    uscomputer:formds psykoboy$ scrapy crawl dform -o items.json -t json
    2013-07-18 21:18:37-0500 [scrapy] INFO: Scrapy 0.16.4 started (bot: formds)
    2013-07-18 21:18:38-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2013-07-18 21:18:38-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2013-07-18 21:18:38-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2013-07-18 21:18:38-0500 [scrapy] DEBUG: Enabled item pipelines:
    2013-07-18 21:18:38-0500 [dform] INFO: Spider opened
    2013-07-18 21:18:38-0500 [dform] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2013-07-18 21:18:38-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2013-07-18 21:18:38-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2013-07-18 21:18:42-0500 [dform] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3dd%20and%20state%3dmn&start=81&count=80&first=2008&last=2011> (referer: None)
    2013-07-18 21:18:43-0500 [dform] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3dd+and+state%3dmn&first=2008&last=2011> (referer: None)
    2013-07-18 21:18:44-0500 [dform] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3dd%20and%20state%3dmn&start=161&count=80&first=2008&last=2011> (referer: None)
    2013-07-18 21:18:45-0500 [dform] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3dd%20and%20state%3dmn&start=241&count=80&first=2008&last=2011> (referer: None)
    2013-07-18 21:18:45-0500 [dform] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3dd%20and%20state%3dmn&start=321&count=80&first=2008&last=2011> (referer: None)
    2013-07-18 21:18:46-0500 [dform] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3dd%20and%20state%3dmn&start=401&count=80&first=2008&last=2011> (referer: None)
    2013-07-18 21:18:46-0500 [dform] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3dd%20and%20state%3dmn&start=481&count=80&first=2008&last=2011> (referer: None)
    2013-07-18 21:18:47-0500 [dform] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3dd%20and%20state%3dmn&start=561&count=80&first=2008&last=2011> (referer: None)
    2013-07-18 21:18:47-0500 [dform] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3dd%20and%20state%3dmn&start=641&count=80&first=2008&last=2011> (referer: None)
    2013-07-18 21:18:48-0500 [dform] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3dd%20and%20state%3dmn&start=721&count=80&first=2008&last=2011> (referer: None)
    2013-07-18 21:18:48-0500 [dform] DEBUG: Crawled (200) <GET http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3dd%20and%20state%3dmn&start=721&count=80&first=2008&last=2011> (referer: None)
    2013-07-18 21:18:48-0500 [dform] INFO: Closing spider (finished)
    2013-07-18 21:18:48-0500 [dform] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 3419,
         'downloader/request_count': 11,
         'downloader/request_method_count/GET': 11,
         'downloader/response_bytes': 68182,
         'downloader/response_count': 11,
         'downloader/response_status_count/200': 11,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 7, 19, 2, 18, 48, 189346),
         'log_count/DEBUG': 17,
         'log_count/INFO': 4,
         'response_received_count': 11,
         'scheduler/dequeued': 11,
         'scheduler/dequeued/memory': 11,
         'scheduler/enqueued': 11,
         'scheduler/enqueued/memory': 11,
         'start_time': datetime.datetime(2013, 7, 19, 2, 18, 38, 701571)}
    2013-07-18 21:18:48-0500 [dform] INFO: Spider closed (finished)

If you remove tbody/, the first 2 XPath expressions work in the Scrapy shell:

    paul@wheezy:~$ scrapy shell 'http://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3dd+and+state%3dmn&first=2008&last=2011'
    ...
    In [1]: hxs.select('/html/body/div/table/tr[27]/td[3]/a[2]/@href').extract()
    Out[1]: [u'/archives/edgar/data/1490747/000149074710000001/0001490747-10-000001-index.htm']

    In [2]: next = hxs.select('/html/body/div/table/tr[27]/td[3]/a[2]/@href').extract()[0]

    In [3]: import urlparse

    In [4]: next_url = urlparse.urljoin(response.url, next)

    In [5]: next_url
    Out[5]: u'http://www.sec.gov/archives/edgar/data/1490747/000149074710000001/0001490747-10-000001-index.htm'

    In [6]: fetch(next_url)
    2013-07-19 09:42:58+0200 [default] DEBUG: Crawled (200) <GET http://www.sec.gov/archives/edgar/data/1490747/000149074710000001/0001490747-10-000001-index.htm> (referer: None)
    ...
    In [8]: hxs.select('//*[@id="formdiv"]/div/table/tr[3]/td[3]/a')
    Out[8]: [<HtmlXPathSelector xpath='//*[@id="formdiv"]/div/table/tr[3]/td[3]/a' data=u'<a href="/archives/edgar/data/1490747/00'>]
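A likely reason the tbody/ step fails: browsers insert a <tbody> element into the DOM even when the served HTML has none, so an XPath copied from a browser inspector may not match the raw markup that Scrapy parses. Here is a minimal sketch of that effect using Python's standard xml.etree.ElementTree on a made-up, well-formed table. The markup, hrefs and variable names are illustrative assumptions, not the real SEC page.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw markup as served by a site: the table has no <tbody>.
RAW_HTML = """
<html><body><div><table>
  <tr><td>1</td><td>D</td><td><a href="/a.txt">[text]</a><a href="/a-index.htm">[html]</a></td></tr>
  <tr><td>2</td><td>D</td><td><a href="/b.txt">[text]</a><a href="/b-index.htm">[html]</a></td></tr>
</table></div></body></html>
"""

root = ET.fromstring(RAW_HTML.strip())

# Path as a browser inspector would show it (with tbody): matches nothing.
with_tbody = root.findall("./body/div/table/tbody/tr/td[3]/a[2]")

# Same path without tbody: matches the second link of every row.
without_tbody = root.findall("./body/div/table/tr/td[3]/a[2]")

print(len(with_tbody))
print([a.get("href") for a in without_tbody])
```

Dropping the row index, as in the second path, is also one way to select the link in every row instead of a single one.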

But the

    sites = xxs.select('//*[@id="formdiv"]/div/table/tbody/tr[3]/td[3]/a')
    items = []
    for site in sites:
        ... extract item values

part does not do what you meant.

If you want to follow the links to the XML documents and parse them, you need to tell Scrapy to fetch those pages. sites = xxs.select('//*[@id="formdiv"]/div/table/tbody/tr[3]/td[3]/a') does not do that: it only returns the a tags and does not issue a request for each document.

You need something like:

    import urlparse
    from scrapy.http import Request
    ...
    class SecDForm(CrawlSpider):
        ...
        def parse_formd(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//*[@id="formdiv"]/div/table/tr[3]/td[3]/a/@href').extract()
            for site in sites:
                yield Request(url=urlparse.urljoin(response.url, site), callback=self.parse_xml_document)

And define a new parse_xml_document() callback method that contains the item extraction logic for these XML documents.
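The URL-joining step used when building each request can be checked on its own. The sketch below assumes Python 3, where urljoin lives in urllib.parse rather than in the Python 2 urlparse module used in this answer. The page URL and the first href are taken from the shell session; the bare "primary_doc.xml" href is a hypothetical relative form added only for illustration.

```python
from urllib.parse import urljoin  # Python 3 location of urlparse.urljoin

page_url = ("http://www.sec.gov/cgi-bin/srch-edgar"
            "?text=form-type%3dd+and+state%3dmn&first=2008&last=2011")

# An href starting with "/" replaces the whole path (and query) of the page URL.
href = "/archives/edgar/data/1490747/000149074710000001/0001490747-10-000001-index.htm"
index_url = urljoin(page_url, href)
print(index_url)

# Hypothetical case: a bare file name is resolved against the page's directory.
xml_url = urljoin(index_url, "primary_doc.xml")
print(xml_url)
```

Joining against response.url rather than a hard-coded host keeps the callback working for both absolute-path and relative hrefs.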

Your XPath expressions for the item fields come from the Chrome or Firebug explorer, right? (the "collapsible..." ids etc.) You need to work on the XML structure directly, not on what the browser converts to HTML for display. I did the "company" part to illustrate.

        def parse_xml_document(self, response):
            xxs = XmlXPathSelector(response)
            item = SecFormD()
            item["company"] = xxs.select('./primaryIssuer/entityName/text()').extract()[0]
            ...
            return item

A good way to work out the XPath expressions for the items is to use scrapy shell <url_of_xml_document>, as shown below for "company" (see http://doc.scrapy.org/en/latest/intro/tutorial.html#trying-selectors-in-the-shell):

    paul@wheezy:~$ scrapy shell http://www.sec.gov/archives/edgar/data/1490747/000149074710000001/primary_doc.xml
    In [6]: xxs.select('./primaryIssuer')
    Out[6]: [<XmlXPathSelector xpath='./primaryIssuer' data=u'<primaryIssuer>\n        <cik>0001490747<'>]

    In [7]: xxs.select('./primaryIssuer/entityName')
    Out[7]: [<XmlXPathSelector xpath='./primaryIssuer/entityName' data=u'<entityName>aei credit tenant fund 35 lp'>]

    In [8]: xxs.select('./primaryIssuer/entityName/text()')
    Out[8]: [<XmlXPathSelector xpath='./primaryIssuer/entityName/text()' data=u'aei credit tenant fund 35 lp'>]

    In [9]: xxs.select('./primaryIssuer/entityName/text()').extract()
    Out[9]: [u'aei credit tenant fund 35 lp']

    In [10]:
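The same path can also be tried outside Scrapy with Python's standard xml.etree.ElementTree. This is only a sketch: the sample document is a tiny fragment modeled on the shell output above, and the root element name and the extract_company helper are assumptions made for illustration. A real filing has many more fields and may need namespace handling.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for primary_doc.xml; the root element name is an assumption.
SAMPLE_XML = """<edgarSubmission>
  <primaryIssuer>
    <cik>0001490747</cik>
    <entityName>aei credit tenant fund 35 lp</entityName>
  </primaryIssuer>
</edgarSubmission>
"""


def extract_company(xml_text):
    """Return the issuer name, or None when the element is missing."""
    root = ET.fromstring(xml_text)
    # XML element names are case-sensitive, so the path must match exactly.
    return root.findtext("./primaryIssuer/entityName")


company = extract_company(SAMPLE_XML)
missing = extract_company("<edgarSubmission/>")
print(company)
print(missing)
```

Returning None for a missing element avoids the IndexError that .extract()[0] raises on an empty result.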

Edit: I updated a gist with rules to follow the [Next] pages and the links to the docs in the rows: https://gist.github.com/redapple/02a55aa6aaac0df2fb75
