</code>
===== Scrapy OpenSources =====
Sorted using the **Most stars** filter:
  * https://github.com/scrapinghub/portia
  * https://github.com/gnemoug/distribute_crawler -> distributed spiders
  * https://github.com/darkrho/scrapy-redis -> distributed spiders sharing a single redis for receiving items (see the sketch after this list)
  * https://github.com/geekan/scrapy-examples
  * https://github.com/holgerd77/django-dynamic-scraper -> manage scrapy spiders via the django admin
  * https://github.com/scrapinghub/scrapyjs
  * https://github.com/scrapy/scrapyd (scheduling example after this list)
  * https://github.com/scrapinghub/scrapylib -> collection of reusable code
  * https://github.com/mvanveen/hncrawl
  * https://github.com/scrapinghub/scrapyrt (covered in the scrapyd example after this list)
  * https://github.com/aivarsk/scrapy-proxies (settings sketch after this list)
  * https://github.com/kalessin/finance -> crawls finance data and stores it in a MySQL server
  * https://github.com/istresearch/scrapy-cluster
  * https://github.com/scrapy/scrapely
  * https://github.com/octoberman/scrapy-indeed-spider
  * https://github.com/dcondrey/scrapy-spiders -> collection of spiders
  * https://github.com/arthurk/scrapy-german-news -> good spider (simhash duplicate detection, sqlite; sketch after this list)
  * https://github.com/jackliusr/scrapy-crawlers -> collection of spiders
  * https://github.com/hemslo/poky-engine -> good architecture
  * https://github.com/anderson916/google-play-crawler
  * https://github.com/duydo/scrapy-crunchbase
  * https://github.com/richardkyeung/pandora-food-scrapy
  * https://github.com/rahulrrixe/Financial-News-Crawler -> spiders configured via JSON; crawled data stored in MongoDB
  * https://github.com/supercoderz/hydbusroutes
  * https://github.com/shijilspark/scrapy -> scrapy integrated with a django app and crontab
  * https://github.com/walbuc/Django-Scrapy -> good architecture with django (see the Celery sketch after this list):
    * sqlite3 in development, postgres or mongodb in production
    * uses Celery, a distributed task queue
    * uses a redis server for caching
    * scrapy exposes a REST webservice and Django consumes that API to fetch the data
  * https://github.com/sdiwu/PlayStoreScrapy
  * https://github.com/amferraz/9gag-scraper
  * https://github.com/junwei-wang/GoogleLoginSpider
  * https://github.com/lodow/portia-proxy -> you can **annotate a web page** to identify the data you wish to extract
  * https://github.com/voliveirajr/seleniumcrawler -> scrapy with selenium (middleware sketch after this list)
  * https://github.com/pelick/VerticleSearchEngine
  * https://github.com/eliangcs/pystock-crawler
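
A minimal sketch of the distributed setup scrapy-redis enables: every worker shares one redis instance for the request queue, the dupefilter and the scraped items. Setting names follow the scrapy-redis README; the spider name and redis location are assumptions.

<code python>
# settings.py -- route scheduling, dedup and items through one shared redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
ITEM_PIPELINES = {"scrapy_redis.pipelines.RedisPipeline": 300}
REDIS_URL = "redis://localhost:6379"  # assumed redis location
</code>
<code python>
# the same spider runs on every worker; start URLs are popped from a redis list
from scrapy_redis.spiders import RedisSpider

class ExampleSpider(RedisSpider):  # hypothetical spider name
    name = "example"
    redis_key = "example:start_urls"  # LPUSH URLs here to feed all workers

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
</code>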
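scrapyd and scrapyrt both put an HTTP API in front of your spiders: scrapyd schedules jobs, scrapyrt runs a crawl per request. A sketch of talking to a local scrapyd, assuming the default port 6800 and hypothetical project/spider names:

<code python>
import requests

# schedule a job on a local scrapyd instance (default port 6800)
r = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "myproject", "spider": "myspider"},  # hypothetical names
)
print(r.json())  # e.g. {"status": "ok", "jobid": "..."}

# scrapyrt instead runs a crawl on demand per HTTP request, roughly:
#   GET http://localhost:9080/crawl.json?spider_name=myspider&url=http://example.com/
</code>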
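scrapy-proxies rotates requests through a proxy list via a downloader middleware. A settings sketch, assuming the middleware path and setting names from its README; the proxy file path is hypothetical:

<code python>
# settings.py -- retry failed responses through randomly chosen proxies
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 90,
    "scrapy_proxies.RandomProxy": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}

PROXY_LIST = "/path/to/proxy/list.txt"  # one proxy URL per line (hypothetical path)
PROXY_MODE = 0  # 0 = pick a random proxy for every request
</code>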
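The simhash idea scrapy-german-news uses for duplicate detection: near-identical articles hash to fingerprints that differ in only a few bits, so a small Hamming distance flags a duplicate. A self-contained sketch, not the project's actual implementation:

<code python>
import hashlib

def simhash(tokens, bits=64):
    # weighted bit-voting over per-token hashes
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

a = simhash("chancellor speaks in berlin about the budget".split())
b = simhash("chancellor speaks in berlin about budget".split())
print(hamming(a, b))  # small distance -> likely a near-duplicate article
</code>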
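In a Django + Scrapy architecture like Django-Scrapy's, crawls are typically kicked off as Celery tasks so the web process never blocks on a crawl. A minimal sketch, assuming a redis broker; the app, broker URL and spider name are hypothetical and this is not the project's actual code:

<code python>
import subprocess
from celery import Celery

app = Celery("crawls", broker="redis://localhost:6379/0")  # assumed broker URL

@app.task
def run_spider(spider_name):
    # run the crawl in a worker process instead of the web request cycle
    subprocess.run(["scrapy", "crawl", spider_name], check=True)

# from a Django view: run_spider.delay("myspider")
</code>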
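The scrapy-with-selenium idea behind seleniumcrawler: a downloader middleware renders each page in a real browser and hands the resulting HTML back to Scrapy, so JavaScript-built pages can be parsed. A sketch only, not the project's actual code:

<code python>
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware:
    """Render requests in a browser before Scrapy parses them."""

    def __init__(self):
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        # returning a Response here short-circuits Scrapy's own downloader
        self.driver.get(request.url)
        return HtmlResponse(
            request.url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )

# enable via DOWNLOADER_MIDDLEWARES, e.g.
# {"myproject.middlewares.SeleniumMiddleware": 543}  # hypothetical module path
</code>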