crawler:scrapyarchitecturecode
        self.db[self.collection_name].insert(dict(item))
        return item
</code>
=== Duplicates filter ===
A filter that looks for duplicate items and drops those that were already processed. Let’s say that our items have a unique id, but our spider returns multiple items with the same id:<code python>
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
</code>
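To make Scrapy actually run this pipeline, it has to be enabled through the ''ITEM_PIPELINES'' setting. A minimal sketch, assuming the class lives in a hypothetical ''myproject.pipelines'' module:
<code python>
# settings.py -- the integer (0-1000) sets the pipeline's run order; lower values run first
ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 300,
}
</code>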
==== scope of allowed_domains ====
        return url_is_from_spider(request.url, spider)
</code>
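For context, here is a minimal sketch of how ''allowed_domains'' is typically declared in a spider (the spider name and domain are placeholders). Requests whose URLs fall outside the listed domains are dropped by Scrapy's OffsiteMiddleware, which relies on a check like the one above:
<code python>
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    # Only requests to this domain (and its subdomains) are followed;
    # off-site requests are filtered out by the OffsiteMiddleware.
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Follow every link on the page; off-site links are dropped automatically.
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse)
</code>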
==== Integrate Scrapy with Other Systems ====
Scrapy can be integrated with the following systems:
  * Database: MySQL, MongoDB
  * Cache: Redis Cache, Cm Cache -> You can **start multiple spider instances that share a single Redis queue**, which is best suited for **broad multi-domain crawls** (see the settings sketch after this list).
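As a sketch of the Redis integration: the third-party scrapy-redis extension swaps in a Redis-backed scheduler and duplicates filter, so several spider processes can feed from one shared queue. The module paths below are the ones scrapy-redis documents; the Redis URL is an assumption you would adjust for your setup:
<code python>
# settings.py -- sketch of a scrapy-redis setup (requires the scrapy-redis package)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # requests are queued in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # request fingerprints shared via Redis
SCHEDULER_PERSIST = True                                     # keep the queue between crawler runs
REDIS_URL = "redis://localhost:6379"                         # assumed local Redis instance
</code>
Every spider process started with these settings reads from and writes to the same Redis queue, which is what lets a broad multi-domain crawl scale across processes or machines.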