        self.db[self.collection_name].insert_one(dict(item))
        return item
</code>
=== Duplicates filter ===
A filter that looks for duplicate items, and drops those items that were already processed. Let’s say that our items have a unique id, but our spider returns multiple items with the same id:<code python>
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
</code>
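Pipelines only run once they are enabled in the project settings. A minimal sketch, assuming a hypothetical project module named ''myproject'' whose ''pipelines.py'' holds both pipelines above:<code python>
# settings.py -- "myproject" and the MongoDB pipeline's class name
# (MongoPipeline) are assumptions for illustration.
# Lower numbers run earlier in the pipeline chain (valid range: 0-1000).
ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 100,
    'myproject.pipelines.MongoPipeline': 300,
}
</code>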
==== scope of allowed_domains ====
<code python>
        return url_is_from_spider(request.url, cls)
</code>
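The check above delegates to ''url_is_from_spider'' from ''scrapy.utils.url''. A small sketch of how it matches URLs against a spider's ''allowed_domains'' (the spider name and URLs here are made up for illustration):<code python>
from scrapy.spiders import Spider
from scrapy.utils.url import url_is_from_spider

class ExampleSpider(Spider):
    name = 'example'
    allowed_domains = ['example.com']

# Matches the allowed domain itself and any of its subdomains...
print(url_is_from_spider('http://www.example.com/page', ExampleSpider))  # True
# ...but rejects URLs from unrelated hosts.
print(url_is_from_spider('http://other.org/page', ExampleSpider))        # False
</code>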
==== Integrate Scrapy with Other Systems ====
Scrapy can be integrated with the following kinds of external systems:
  * Database: MySQL, MongoDB
  * Cache: Redis Cache, Cm Cache -> you can **start multiple spider instances that share a single redis queue**, which is best suited for **broad multi-domain crawls** (see the sketch below).
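Sharing one redis queue between several spider processes is what the scrapy-redis extension provides. A minimal settings sketch, assuming scrapy-redis is installed and a redis server is reachable on localhost:<code python>
# settings.py -- requires the scrapy-redis package (pip install scrapy-redis)
# Every spider instance started with these settings pulls requests
# from the same redis-backed scheduler queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Deduplicate requests across all instances via a shared redis set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue in redis when a spider stops, so crawls can resume.
SCHEDULER_PERSIST = True
REDIS_URL = 'redis://localhost:6379'
</code>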