</code>
===== Scrapy OpenSources =====
Sorted using the **Most stars** filter:
  * https://github.com/scrapinghub/portia
  * https://github.com/gnemoug/distribute_crawler -> distributed spiders
  * https://github.com/darkrho/scrapy-redis -> distributed spiders sharing a single redis for receiving items (see the sketch after this list)
  * https://github.com/geekan/scrapy-examples
  * https://github.com/holgerd77/django-dynamic-scraper -> manage scrapy spiders via the django admin
  * https://github.com/scrapinghub/scrapyjs
  * https://github.com/scrapy/scrapyd (scheduling example after this list)
  * https://github.com/scrapinghub/scrapylib -> collection of reusable code
  * https://github.com/mvanveen/hncrawl
  * https://github.com/scrapinghub/scrapyrt (covered in the scrapyd example after this list)
  * https://github.com/aivarsk/scrapy-proxies (settings sketch after this list)
  * https://github.com/kalessin/finance -> crawls finance data and stores it in a MySQL server
  * https://github.com/istresearch/scrapy-cluster
  * https://github.com/scrapy/scrapely
  * https://github.com/octoberman/scrapy-indeed-spider
  * https://github.com/dcondrey/scrapy-spiders -> collection of spiders
  * https://github.com/arthurk/scrapy-german-news -> good spider (simhash duplicate detection, sqlite; sketch after this list)
  * https://github.com/jackliusr/scrapy-crawlers -> collection of spiders
  * https://github.com/hemslo/poky-engine -> good architecture
  * https://github.com/anderson916/google-play-crawler
  * https://github.com/duydo/scrapy-crunchbase
  * https://github.com/richardkyeung/pandora-food-scrapy
  * https://github.com/rahulrrixe/Financial-News-Crawler -> spiders configured via JSON; crawled data stored in MongoDB
  * https://github.com/supercoderz/hydbusroutes
  * https://github.com/shijilspark/scrapy -> scrapy integrated with a django app and crontab
  * https://github.com/walbuc/Django-Scrapy -> good architecture with django (see the Celery sketch after this list):
    * sqlite3 in development, postgres or mongodb in production
    * uses Celery, a distributed task queue
    * uses a redis server for caching
    * scrapy exposes a REST webservice and Django consumes that API to fetch the data
  * https://github.com/sdiwu/PlayStoreScrapy
  * https://github.com/amferraz/9gag-scraper
  * https://github.com/junwei-wang/GoogleLoginSpider
  * https://github.com/lodow/portia-proxy -> you can **annotate a web page** to identify the data you wish to extract
  * https://github.com/voliveirajr/seleniumcrawler -> scrapy with selenium (middleware sketch after this list)
  * https://github.com/pelick/VerticleSearchEngine
  * https://github.com/eliangcs/pystock-crawler
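
A minimal sketch of the distributed setup scrapy-redis enables: every worker shares one redis instance for the request queue, the dupefilter and the scraped items. Setting names follow the scrapy-redis README; the spider name and redis location are assumptions.

<code python>
# settings.py -- route scheduling, dedup and items through one shared redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
ITEM_PIPELINES = {"scrapy_redis.pipelines.RedisPipeline": 300}
REDIS_URL = "redis://localhost:6379"  # assumed redis location
</code>
<code python>
# the same spider runs on every worker; start URLs are popped from a redis list
from scrapy_redis.spiders import RedisSpider

class ExampleSpider(RedisSpider):  # hypothetical spider name
    name = "example"
    redis_key = "example:start_urls"  # LPUSH URLs here to feed all workers

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
</code>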
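scrapyd and scrapyrt both put an HTTP API in front of your spiders: scrapyd schedules jobs, scrapyrt runs a crawl per request. A sketch of talking to a local scrapyd, assuming the default port 6800 and hypothetical project/spider names:

<code python>
import requests

# schedule a job on a local scrapyd instance (default port 6800)
r = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "myproject", "spider": "myspider"},  # hypothetical names
)
print(r.json())  # e.g. {"status": "ok", "jobid": "..."}

# scrapyrt instead runs a crawl on demand per HTTP request, roughly:
#   GET http://localhost:9080/crawl.json?spider_name=myspider&url=http://example.com/
</code>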
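scrapy-proxies rotates requests through a proxy list via a downloader middleware. A settings sketch, assuming the middleware path and setting names from its README; the proxy file path is hypothetical:

<code python>
# settings.py -- retry failed responses through randomly chosen proxies
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 90,
    "scrapy_proxies.RandomProxy": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}

PROXY_LIST = "/path/to/proxy/list.txt"  # one proxy URL per line (hypothetical path)
PROXY_MODE = 0  # 0 = pick a random proxy for every request
</code>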
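The simhash idea scrapy-german-news uses for duplicate detection: near-identical articles hash to fingerprints that differ in only a few bits, so a small Hamming distance flags a duplicate. A self-contained sketch, not the project's actual implementation:

<code python>
import hashlib

def simhash(tokens, bits=64):
    # weighted bit-voting over per-token hashes
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

a = simhash("chancellor speaks in berlin about the budget".split())
b = simhash("chancellor speaks in berlin about budget".split())
print(hamming(a, b))  # small distance -> likely a near-duplicate article
</code>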
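In a Django + Scrapy architecture like Django-Scrapy's, crawls are typically kicked off as Celery tasks so the web process never blocks on a crawl. A minimal sketch, assuming a redis broker; the app, broker URL and spider name are hypothetical and this is not the project's actual code:

<code python>
import subprocess
from celery import Celery

app = Celery("crawls", broker="redis://localhost:6379/0")  # assumed broker URL

@app.task
def run_spider(spider_name):
    # run the crawl in a worker process instead of the web request cycle
    subprocess.run(["scrapy", "crawl", spider_name], check=True)

# from a Django view: run_spider.delay("myspider")
</code>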
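The scrapy-with-selenium idea behind seleniumcrawler: a downloader middleware renders each page in a real browser and hands the resulting HTML back to Scrapy, so JavaScript-built pages can be parsed. A sketch only, not the project's actual code:

<code python>
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware:
    """Render requests in a browser before Scrapy parses them."""

    def __init__(self):
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        # returning a Response here short-circuits Scrapy's own downloader
        self.driver.get(request.url)
        return HtmlResponse(
            request.url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )

# enable via DOWNLOADER_MIDDLEWARES, e.g.
# {"myproject.middlewares.SeleniumMiddleware": 543}  # hypothetical module path
</code>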