python - Correct greenlet termination


I am using gevent to download some HTML pages. Some websites are way too slow, and some stop serving requests after a period of time. That is why I had to limit the total time a group of requests can take. For that I use gevent "Timeout".

import gevent
from gevent import Timeout

timeout = Timeout(10)
timeout.start()

def downloadsite():
    # code to download the site's urls one by one
    url1 = downloadurl()
    url2 = downloadurl()
    url3 = downloadurl()

try:
    gevent.spawn(downloadsite).join()
except Timeout:
    print('Lost state here')

But the problem is that I lose all the state when the exception fires.

Imagine I crawl the site 'www.test.com'. I have managed to download 10 URLs right before the site admins decided to switch the webserver for maintenance. In such a case I will lose all the information about the already-crawled pages when the exception fires.

The question is: how do I save the state and process the already-downloaded data if the timeout happens?

Why not try something like this:

import gevent
from gevent import Timeout

def downloadsite(url):
    # each request gets its own 10-second timeout,
    # so one slow url only kills its own greenlet
    with Timeout(10):
        downloadurl(url)

urls = ["url1", "url2", "url3"]

workers = []
limit = 5
counter = 0
for i in urls:
    # limit to 5 url requests at a time
    if counter < limit:
        workers.append(gevent.spawn(downloadsite, i))
        counter += 1
    else:
        gevent.joinall(workers)
        # start the next batch with the current url
        workers = [gevent.spawn(downloadsite, i)]
        counter = 1
gevent.joinall(workers)
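Incidentally, gevent also ships gevent.pool.Pool, which implements the same "at most N greenlets at a time" idea without the manual counter. A minimal sketch of that alternative, with downloadurl() stubbed out since the real fetch function lives in your code:

import gevent
from gevent.pool import Pool
from gevent import Timeout

def downloadurl(url):
    # stand-in for the real fetch function
    gevent.sleep(1)

def downloadsite(url):
    with Timeout(10):
        downloadurl(url)

pool = Pool(5)  # at most 5 greenlets run concurrently
for url in ["url1", "url2", "url3"]:
    pool.spawn(downloadsite, url)
pool.join()  # block until every spawned greenlet has finished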

You can save the status in a dict or something for every url, or append the ones that fail to a different array, to retry them later.
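A minimal sketch of that bookkeeping, with downloadurl() again stubbed out; the results/failed names are just illustrative:

import gevent
from gevent import Timeout

results = {}  # url -> downloaded body
failed = []   # urls that timed out, to retry later

def downloadurl(url):
    # stand-in for the real fetch, assumed to return the page body
    gevent.sleep(1)
    return "<html>...</html>"

def downloadsite(url):
    try:
        with Timeout(10):
            results[url] = downloadurl(url)
    except Timeout:
        failed.append(url)  # everything downloaded so far stays in results

urls = ["url1", "url2", "url3"]
gevent.joinall([gevent.spawn(downloadsite, u) for u in urls])
print('downloaded:', list(results))
print('to retry:', failed)

Because the Timeout is caught inside each greenlet, one stalled url no longer throws away the state accumulated by the others.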

