        self.db[self.collection_name].insert_one(dict(item))
        return item
</code>
=== Duplicates filter ===
A filter that looks for duplicate items, and drops those items that were already processed. Let’s say that our items have a unique id, but our spider returns multiple items with the same id:<code python>
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
</code>
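Pipelines only run once they are enabled in the project settings. A minimal sketch, assuming a hypothetical project module named ''myproject'' whose ''pipelines.py'' holds both pipelines above:<code python>
# settings.py -- "myproject" and the MongoDB pipeline's class name
# (MongoPipeline) are assumptions for illustration.
# Lower numbers run earlier in the pipeline chain (valid range: 0-1000).
ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 100,
    'myproject.pipelines.MongoPipeline': 300,
}
</code>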
==== scope of allowed_domains ====
<code python>
        return url_is_from_spider(request.url, cls)
</code>
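The check above delegates to ''url_is_from_spider'' from ''scrapy.utils.url''. A small sketch of how it matches URLs against a spider's ''allowed_domains'' (the spider name and URLs here are made up for illustration):<code python>
from scrapy.spiders import Spider
from scrapy.utils.url import url_is_from_spider

class ExampleSpider(Spider):
    name = 'example'
    allowed_domains = ['example.com']

# Matches the allowed domain itself and any of its subdomains...
print(url_is_from_spider('http://www.example.com/page', ExampleSpider))  # True
# ...but rejects URLs from unrelated hosts.
print(url_is_from_spider('http://other.org/page', ExampleSpider))        # False
</code>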
==== Integrate Scrapy with Other Systems ====
Scrapy can be integrated with the following kinds of external systems:
  * Database: MySQL, MongoDB
  * Cache: Redis Cache, Cm Cache -> you can **start multiple spider instances that share a single redis queue**, which is best suited for **broad multi-domain crawls** (see the sketch below).
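Sharing one redis queue between several spider processes is what the scrapy-redis extension provides. A minimal settings sketch, assuming scrapy-redis is installed and a redis server is reachable on localhost:<code python>
# settings.py -- requires the scrapy-redis package (pip install scrapy-redis)
# Every spider instance started with these settings pulls requests
# from the same redis-backed scheduler queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Deduplicate requests across all instances via a shared redis set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue in redis when a spider stops, so crawls can resume.
SCHEDULER_PERSIST = True
REDIS_URL = 'redis://localhost:6379'
</code>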