crawler:scrapyarchitecturecode
<code python>
    def process_item(self, item, spider):
        self.db[self.collection_name].insert(dict(item))
        return item
</code>
=== Duplicates filter ===
A filter that looks for duplicate items, and drops those items that were already processed. Let's say that our items have a unique id, but our spider returns multiple items with the same id:<code python>
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
</code>
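Like any item pipeline, DuplicatesPipeline only takes effect once it is enabled in the project settings. A minimal sketch, assuming the class lives in a hypothetical ''myproject.pipelines'' module:
<code python>
# settings.py -- the module path below is an assumption; point it at
# wherever DuplicatesPipeline is actually defined in your project.
ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 300,  # lower numbers run earlier (0-1000)
}
</code>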
==== scope of allowed_domains ====
<code python>
        return url_is_from_spider(request.url, spider)
</code>
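To illustrate the scope of allowed_domains in practice, a minimal sketch (the spider name and domains are made up): requests whose host is not covered by allowed_domains are dropped by the built-in OffsiteMiddleware, unless the request sets dont_filter=True.
<code python>
import scrapy

class ScopedSpider(scrapy.Spider):
    name = 'scoped'
    # example.com and its subdomains count as on-site.
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']

    def parse(self, response):
        # Followed: www.example.com matches allowed_domains.
        yield scrapy.Request('https://www.example.com/about')
        # Dropped by OffsiteMiddleware: other.org is not in allowed_domains.
        yield scrapy.Request('https://other.org/page')
        # Followed anyway: dont_filter=True bypasses the offsite check.
        yield scrapy.Request('https://other.org/feed', dont_filter=True)
</code>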
==== Integrate Scrapy with Other Systems ====
Scrapy can be integrated with external systems such as:
  * Database: MySQL, MongoDB
  * Cache: Redis Cache, Cm Cache -> You can **start multiple spider instances that share a single redis queue**, which is best suited for **broad multi-domain crawls** (see the settings sketch below).
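One common way to get the shared-Redis-queue setup mentioned above is the third-party scrapy-redis extension. A minimal settings sketch, assuming scrapy-redis is installed and Redis runs locally on its default port (the setting names come from scrapy-redis, not from Scrapy itself):
<code python>
# settings.py -- every spider process started with these settings pulls
# requests from the same Redis queue, so a broad crawl can be spread
# across multiple processes or machines.

# Replace Scrapy's in-memory scheduler with the Redis-backed one.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all spider instances via Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs so the crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Location of the shared Redis server; adjust for your environment.
REDIS_URL = 'redis://localhost:6379'
</code>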
