==== ItemPipeline for handling data storage ====
refer: https://

For **ImagesPipeline and FilesPipeline** in particular, the media requests are created and downloaded by the pipeline itself (through the engine/downloader), not yielded by the spider.
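A pipeline only runs once it is enabled through the ITEM_PIPELINES setting. Below is a minimal settings.py sketch, assuming a project package named myproject and a custom MongoPipeline (both are placeholder names); ImagesPipeline additionally needs IMAGES_STORE:<code python>
# settings.py (sketch; "myproject" and the priorities are placeholders)
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,   # lower numbers run first (0-1000)
    'myproject.pipelines.MongoPipeline': 300,
}

# required when ImagesPipeline is enabled: directory for downloaded images
IMAGES_STORE = '/path/to/images'
</code>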
=== ItemPipeline Classes ===
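An item pipeline component is a plain Python class: only process_item() is mandatory, while the other methods are optional hooks that Scrapy calls if they are defined. A bare skeleton (the class name is illustrative):<code python>
class ExamplePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # optional: build the pipeline instance from crawler settings
        return cls()

    def open_spider(self, spider):
        # optional: called when the spider is opened
        pass

    def close_spider(self, spider):
        # optional: called when the spider is closed
        pass

    def process_item(self, item, spider):
        # required: return the item (possibly modified) or raise DropItem
        return item
</code>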
=== Write items to a JSON file ===
The following pipeline stores all scraped items (from all spiders) into a single items.jl file, containing one item per line serialized in JSON format:<code python>
import json


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
</code>
Note: the purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.
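For reference, the same items.jl output can be produced with the built-in feed exports, either by running scrapy crawl <spider> -o items.jl or through settings. A minimal sketch (the file name is just an example):<code python>
# settings.py (sketch): feed-export equivalent of JsonWriterPipeline
FEED_FORMAT = 'jsonlines'   # one JSON object per line
FEED_URI = 'items.jl'
</code>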
=== Write items to MongoDB ===
In this example we'll write items to MongoDB using pymongo. The MongoDB address and database name are specified in Scrapy settings; the MongoDB collection is named after the item class.

The main point of this example is to show how to use the from_crawler() method and how to clean up the resources properly:<code python>
import pymongo


class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert() is kept from the original example; newer pymongo uses insert_one()
        self.db[self.collection_name].insert(dict(item))
        return item
</code>
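The pipeline above pulls its configuration from the project settings via from_crawler(); the keys it reads are MONGO_URI and MONGO_DATABASE. A matching settings.py sketch (the URI, database name and module path are placeholders):<code python>
# settings.py (sketch; values are placeholders)
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapy_db'
</code>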
=== Duplicates filter ===
A filter that looks for duplicate items, and drops those items that were already processed. Let's say that our items have a unique id, but our spider returns multiple items with the same id:<code python>
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
</code>
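The behaviour is easy to see by driving process_item() by hand (a hypothetical check that reuses the DuplicatesPipeline class above with plain dict items and no real spider):<code python>
from scrapy.exceptions import DropItem

pipeline = DuplicatesPipeline()

print(pipeline.process_item({'id': 1, 'name': 'first'}, spider=None))   # passes through
try:
    pipeline.process_item({'id': 1, 'name': 'again'}, spider=None)      # same id -> dropped
except DropItem as exc:
    print(exc)
</code>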
==== scope of allowed_domains ====
allowed_domains is filtered in **site-packages\scrapy\utils\url.py**:<code python>
def url_is_from_any_domain(url, domains):
    """Return True if the url belongs to any of the given domains"""
    host = parse_url(url).netloc.lower()

    if host:
        return any(((host == d.lower()) or (host.endswith('.%s' % d.lower()))) for d in domains)
    else:
        return False


def url_is_from_spider(url, spider):
    """Return True if the url belongs to the given spider"""
    return url_is_from_any_domain(url,
        [spider.name] + list(getattr(spider, 'allowed_domains', [])))
</code>
and the spider class calls it to check a request before downloading:<code python>
class Spider(object_ref):
    @classmethod
    def handles_request(cls, request):
        return url_is_from_spider(request.url, cls)
</code>
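In other words, exact hosts and subdomains of an allowed domain pass the filter while unrelated hosts do not; this can be checked directly against the helper above (the domains below are made up):<code python>
from scrapy.utils.url import url_is_from_any_domain

print(url_is_from_any_domain('http://blog.example.com/post/1', ['example.com']))  # True  (subdomain)
print(url_is_from_any_domain('http://example.com/about', ['example.com']))        # True  (exact host)
print(url_is_from_any_domain('http://notexample.com/', ['example.com']))          # False (unrelated host)
</code>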
==== Integrate Scrapy with Other Systems ====
Scrapy can be integrated with the systems below:
  * Database: MySQL, MongoDB
  * Cache: Redis Cache, Cm Cache -> you can **start multiple spider instances that share a single redis queue**, which is best suited for **broad multi-domain crawls** (see the settings sketch below).
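One common way to get that shared Redis queue is the scrapy-redis extension. A minimal settings sketch, assuming the scrapy-redis package is installed and Redis runs locally (the URL is a placeholder):<code python>
# settings.py (sketch; requires the scrapy-redis package)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # requests are queued in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # duplicate filtering shared via Redis
SCHEDULER_PERSIST = True                                    # keep the queue between runs
REDIS_URL = 'redis://localhost:6379'
</code>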