crawler:scrapyarchitecturecode
  - Step6: **Continue downloading new requests**: the crawler takes every request yielded from the **parse** function and continues downloading them
  - Step7: **Process the downloaded data**
  - Step8: Every request has a callback function **parse(self, response)** (see the sketch after this list):
    * If the callback returns **Request** objects -> those requests will continue to be downloaded
    * If the callback returns **Item** objects -> they are passed to the item pipeline for data processing
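A minimal sketch of such a callback, assuming a standard scrapy.Spider subclass; the spider name, start URL and CSS selectors are made-up placeholders. Items yielded from **parse** are handed to the item pipelines, while yielded Requests go back to the scheduler for downloading:<code python>
import scrapy


class QuotesSpider(scrapy.Spider):
    # placeholder spider name and start URL
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yielding a dict/Item -> handed to the item pipelines (Step 7/8)
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}

        # yielding a Request -> scheduled, downloaded, and parse() is called again on the response
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
</code>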
=== Spider Classes ===
</code>
==== ItemPipeline for processing and storing data ====
refer: https://

**ImagesPipeline and FilesPipeline** in particular are built-in pipelines shipped with Scrapy; they need a few extra settings before they can be used (see the sketch below).

=== ItemPipeline Classes ===
{{:
        return checksum
</code>
=== Write items to a JSON file ===
The following pipeline stores all scraped items (from all spiders) into a single items.jl file, containing one item per line, serialized in JSON format:<code python>
import json


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # called when the spider is opened -> open the output file once
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        # called when the spider is closed -> release the file handle
        self.file.close()

    def process_item(self, item, spider):
        # serialize each item as one JSON line and keep it in the pipeline chain
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
</code>
Note: the purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items in a JSON file, you should use the Feed exports.
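A minimal sketch of the Feed exports alternative, assuming the project-wide settings.py is used; the output file name is an arbitrary choice, and newer Scrapy versions expose the same feature through the FEEDS setting:<code python>
# settings.py: export every scraped item as JSON lines to an arbitrary file name
FEED_FORMAT = 'jsonlines'
FEED_URI = 'items.jl'
</code>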
=== Write items to MongoDB ===
In this example we'll write items to MongoDB using pymongo. The MongoDB address and database name are specified in the Scrapy settings; the MongoDB collection name is set by the collection_name class attribute.

The main point of this example is to show how to use the from_crawler() method and how to clean up resources properly:<code python>
import pymongo


class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # pull the connection parameters from the Scrapy settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        # open one client/database handle per spider run
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # clean up the connection when the spider finishes
        self.client.close()

    def process_item(self, item, spider):
        # insert_one() replaces the insert() call removed in pymongo 4
        self.db[self.collection_name].insert_one(dict(item))
        return item
</code>
=== Duplicates filter ===
A filter that looks for duplicate items and drops items that were already processed. Let's say our items have a unique id, but our spider returns multiple items with the same id:<code python>
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            # already seen -> drop the item so later pipelines never receive it
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
</code>
==== scope of allowed_domains ====
allowed_domains is filtered in **site-packages\scrapy\utils\url.py**:<code python>
def url_is_from_any_domain(url, domains):
    """Return True if the url belongs to any of the given domains"""
    host = parse_url(url).netloc.lower()

    if host:
        # the host matches if it equals an allowed domain or is a subdomain of it
        return any(((host == d.lower()) or (host.endswith('.%s' % d.lower()))) for d in domains)
    else:
        return False


def url_is_from_spider(url, spider):
    """Return True if the url belongs to the given spider"""
    return url_is_from_any_domain(url,
        [spider.name] + list(getattr(spider, 'allowed_domains', [])))
</code>
and the spider calls it to check a request before downloading:<code python>
class Spider(object_ref):
    @classmethod
    def handles_request(cls, request):
        return url_is_from_spider(request.url, cls)
</code>
==== Integrate Scrapy with Other Systems ====
Scrapy can be integrated with the systems below:
  * Database: MySQL, MongoDB
  * Cache: Redis Cache, Cm Cache -> you can **start multiple spider instances that share a single Redis queue**, which is best suited for **broad multi-domain crawls** (see the sketch after this list).
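One common way to get the shared-Redis-queue setup mentioned above is the third-party scrapy-redis package; the sketch below assumes that package is installed and uses a placeholder Redis URL:<code python>
# settings.py: assumes the third-party scrapy-redis package; REDIS_URL is a placeholder
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'              # requests are queued in Redis
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # duplicate filtering shared by all instances
SCHEDULER_PERSIST = True                                    # keep the queue between runs
REDIS_URL = 'redis://localhost:6379'
</code>
Every spider instance started with these settings pulls from the same Redis queue, which is what makes the multi-instance, broad-crawl setup possible.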
