  - Step6: **Continue downloading new requests**: the crawler takes all requests yielded from the **parse** function and continues downloading them
  - Step7: **Process downloaded data**: the crawler takes all items yielded from the **parse** function and processes them
  - Step8: Every request has a callback, either **parse(self, response)** or **_response_downloaded(self, response)**; after the request is downloaded, Scrapy calls this callback to process the response (see the sketch after this list). Depending on what the callback returns:
    * if it returns a **Request** -> these requests will continue to be downloaded
    * if it returns an **Item** -> it is passed to the item pipeline for data processing
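A minimal sketch of such a callback, assuming a hypothetical spider (the name, start URL, and item fields are placeholders):<code python>
import scrapy

class ExampleSpider(scrapy.Spider):
    # Spider name and start URL are placeholders for illustration
    name = 'example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Yielded Requests are scheduled for download; their callback is
        # invoked with the downloaded response later
        for href in response.css('a::attr(href)').getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)

        # Yielded items (here a plain dict) are sent through the configured
        # item pipelines for processing
        yield {'url': response.url, 'title': response.css('title::text').get()}
</code>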
  
=== Spider Classes ===
</code>
==== ItemPipeline for processing and storing data ====
Refer to: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

For **ImagesPipeline and FilesPipeline** specifically, the media requests are **issued and downloaded inside the pipeline's process_item function** (a sketch follows below).

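A minimal sketch of this, assuming the built-in ImagesPipeline, items that are plain dicts with an **image_urls** field, and made-up setting values (the IMAGES_STORE path and pipeline module path):<code python>
# settings.py (assumed values):
#   ITEM_PIPELINES = {'myproject.pipelines.MyImagesPipeline': 1}
#   IMAGES_STORE = '/path/to/store/images'

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Called by the pipeline itself (via its process_item), not by the
        # spider: each Request yielded here is downloaded by the pipeline
        for image_url in item.get('image_urls', []):
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # results holds (success, detail) tuples for the downloads above;
        # 'image_paths' is an assumed field used only for illustration
        item['image_paths'] = [detail['path'] for ok, detail in results if ok]
        return item
</code>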
=== ItemPipeline Classes ===
{{:crawler:scrapy_pipeline.png|}}
    return checksum
</code>
=== Write items to a JSON file ===
The following pipeline stores all scraped items (from all spiders) into a single items.jl file, containing one item per line serialized in JSON format:<code python>
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # Open in text mode ('w'), since json.dumps() returns a str
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
</code>
Note: the purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items in a JSON file, you should use the Feed exports.
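For comparison, a feed export needs only settings; a minimal sketch (the FEEDS setting requires Scrapy 2.1+, and the output file name is arbitrary):<code python>
# settings.py -- feed-export sketch; 'items.jl' is just an example file name
FEEDS = {
    'items.jl': {'format': 'jsonlines'},
}
# On older Scrapy versions the equivalent settings were:
# FEED_URI = 'items.jl'
# FEED_FORMAT = 'jsonlines'
</code>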
=== Write items to MongoDB ===
In this example we'll write items to MongoDB using pymongo. The MongoDB address and database name are specified in the Scrapy settings; the MongoDB collection is named after the item class.

The main point of this example is to show how to use the from_crawler() method and how to clean up the resources properly:<code python>
import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection settings from the Scrapy settings object
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one() replaces the deprecated pymongo insert()
        self.db[self.collection_name].insert_one(dict(item))
        return item
</code>
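The custom settings read by from_crawler() above must be defined in the project settings; a minimal sketch with placeholder values:<code python>
# settings.py -- placeholder values for the settings used by MongoPipeline
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'items'
</code>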
=== Duplicates filter ===
A filter that looks for duplicate items and drops those items that were already processed. Let's say that our items have a unique id, but our spider returns multiple items with the same id:<code python>
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
</code>
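None of these example pipelines run until they are activated in ITEM_PIPELINES; a sketch of chaining them (the module path 'myproject.pipelines' and the priority numbers are assumptions):<code python>
# settings.py -- lower numbers run first, so duplicates are dropped
# before anything is written to the JSON file or MongoDB
ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 100,
    'myproject.pipelines.JsonWriterPipeline': 200,
    'myproject.pipelines.MongoPipeline': 300,
}
</code>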
==== Scope of allowed_domains ====
allowed_domains is filtered in **site-packages\scrapy\utils\url.py**:<code python>
def url_is_from_any_domain(url, domains):
    """Return True if the url belongs to any of the given domains"""
    host = parse_url(url).netloc.lower()

    if host:
        return any(((host == d.lower()) or (host.endswith('.%s' % d.lower())) for d in domains))
    else:
        return False

def url_is_from_spider(url, spider):
    """Return True if the url belongs to the given spider"""
    return url_is_from_any_domain(url,
        [spider.name] + list(getattr(spider, 'allowed_domains', [])))
</code>
and the spider calls it to check each request before downloading:<code python>
class Spider(object_ref):
    @classmethod
    def handles_request(cls, request):
        return url_is_from_spider(request.url, cls)
</code>
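For example, with allowed_domains = ['example.com'] the check accepts the domain itself and its subdomains but rejects lookalike hosts (the URLs below are purely illustrative):<code python>
from scrapy.utils.url import url_is_from_any_domain

# Exact host and subdomains match; a lookalike host does not
print(url_is_from_any_domain('http://example.com/page', ['example.com']))        # True
print(url_is_from_any_domain('http://shop.example.com/item', ['example.com']))   # True
print(url_is_from_any_domain('http://notexample.com/page', ['example.com']))     # False
</code>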
==== Integrate Scrapy with Other Systems ====
Scrapy can be integrated with the following kinds of systems:
  * Database: MySQL, MongoDB
  * Cache: Redis Cache, Cm Cache -> you can **start multiple spider instances that share a single Redis queue**, which is best suited for **broad multi-domain crawls** (see the sketch after this list).
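A minimal sketch of the shared-Redis-queue setup, assuming the third-party scrapy-redis extension (the setting names follow its documentation; the Redis URL is a placeholder):<code python>
# settings.py -- assumes scrapy-redis is installed (pip install scrapy-redis)

# Schedule requests through a shared Redis queue instead of the
# per-process in-memory scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all spider instances via Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs so a crawl can be paused and resumed
SCHEDULER_PERSIST = True

# Placeholder connection string
REDIS_URL = 'redis://localhost:6379'
</code>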