  - Step6: **Continue downloading new requests**: the crawler takes all requests yielded from the **parse** function and continues downloading them
  - Step7: **Process downloaded data**: the crawler takes all items yielded from the **parse** function and processes them
  - Step8: Every request has a callback, either **parse(self, response)** or **_response_downloaded(self, response)**; after the request is downloaded, Scrapy calls this callback to process the response (see the sketch after this list). Depending on what the callback returns:
    * if it returns a **Request** -> these requests will continue to be downloaded
    * if it returns an **Item** -> it is passed to the item pipeline for data processing
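A minimal sketch of such a callback, assuming a hypothetical spider (the name, start URL, and item fields are placeholders):<code python>
import scrapy

class ExampleSpider(scrapy.Spider):
    # Spider name and start URL are placeholders for illustration
    name = 'example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Yielded Requests are scheduled for download; their callback is
        # invoked with the downloaded response later
        for href in response.css('a::attr(href)').getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)

        # Yielded items (here a plain dict) are sent through the configured
        # item pipelines for processing
        yield {'url': response.url, 'title': response.css('title::text').get()}
</code>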
  
=== Spider Classes ===
</code>
==== ItemPipeline for processing and storing data ====
Refer to: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

For **ImagesPipeline and FilesPipeline** specifically, the media requests are **issued and downloaded inside the pipeline's process_item function** (a sketch follows below).

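A minimal sketch of this, assuming the built-in ImagesPipeline, items that are plain dicts with an **image_urls** field, and made-up setting values (the IMAGES_STORE path and pipeline module path):<code python>
# settings.py (assumed values):
#   ITEM_PIPELINES = {'myproject.pipelines.MyImagesPipeline': 1}
#   IMAGES_STORE = '/path/to/store/images'

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Called by the pipeline itself (via its process_item), not by the
        # spider: each Request yielded here is downloaded by the pipeline
        for image_url in item.get('image_urls', []):
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # results holds (success, detail) tuples for the downloads above;
        # 'image_paths' is an assumed field used only for illustration
        item['image_paths'] = [detail['path'] for ok, detail in results if ok]
        return item
</code>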
=== ItemPipeline Classes ===
{{:crawler:scrapy_pipeline.png|}}
    return checksum
</code>
=== Write items to a JSON file ===
The following pipeline stores all scraped items (from all spiders) into a single items.jl file, containing one item per line serialized in JSON format:<code python>
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # Open in text mode ('w'), since json.dumps() returns a str
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
</code>
Note: the purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items in a JSON file, you should use the Feed exports.
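For comparison, a feed export needs only settings; a minimal sketch (the FEEDS setting requires Scrapy 2.1+, and the output file name is arbitrary):<code python>
# settings.py -- feed-export sketch; 'items.jl' is just an example file name
FEEDS = {
    'items.jl': {'format': 'jsonlines'},
}
# On older Scrapy versions the equivalent settings were:
# FEED_URI = 'items.jl'
# FEED_FORMAT = 'jsonlines'
</code>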
=== Write items to MongoDB ===
In this example we'll write items to MongoDB using pymongo. The MongoDB address and database name are specified in the Scrapy settings; the MongoDB collection is named after the item class.

The main point of this example is to show how to use the from_crawler() method and how to clean up the resources properly:<code python>
import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection settings from the Scrapy settings object
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one() replaces the deprecated pymongo insert()
        self.db[self.collection_name].insert_one(dict(item))
        return item
</code>
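The custom settings read by from_crawler() above must be defined in the project settings; a minimal sketch with placeholder values:<code python>
# settings.py -- placeholder values for the settings used by MongoPipeline
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'items'
</code>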
=== Duplicates filter ===
A filter that looks for duplicate items and drops those items that were already processed. Let's say that our items have a unique id, but our spider returns multiple items with the same id:<code python>
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
</code>
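None of these example pipelines run until they are activated in ITEM_PIPELINES; a sketch of chaining them (the module path 'myproject.pipelines' and the priority numbers are assumptions):<code python>
# settings.py -- lower numbers run first, so duplicates are dropped
# before anything is written to the JSON file or MongoDB
ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 100,
    'myproject.pipelines.JsonWriterPipeline': 200,
    'myproject.pipelines.MongoPipeline': 300,
}
</code>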
==== Scope of allowed_domains ====
allowed_domains is filtered in **site-packages\scrapy\utils\url.py**:<code python>
def url_is_from_any_domain(url, domains):
    """Return True if the url belongs to any of the given domains"""
    host = parse_url(url).netloc.lower()

    if host:
        return any(((host == d.lower()) or (host.endswith('.%s' % d.lower())) for d in domains))
    else:
        return False

def url_is_from_spider(url, spider):
    """Return True if the url belongs to the given spider"""
    return url_is_from_any_domain(url,
        [spider.name] + list(getattr(spider, 'allowed_domains', [])))
</code>
and the spider calls it to check each request before downloading:<code python>
class Spider(object_ref):
    @classmethod
    def handles_request(cls, request):
        return url_is_from_spider(request.url, cls)
</code>
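For example, with allowed_domains = ['example.com'] the check accepts the domain itself and its subdomains but rejects lookalike hosts (the URLs below are purely illustrative):<code python>
from scrapy.utils.url import url_is_from_any_domain

# Exact host and subdomains match; a lookalike host does not
print(url_is_from_any_domain('http://example.com/page', ['example.com']))        # True
print(url_is_from_any_domain('http://shop.example.com/item', ['example.com']))   # True
print(url_is_from_any_domain('http://notexample.com/page', ['example.com']))     # False
</code>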
==== Integrate Scrapy with Other Systems ====
Scrapy can be integrated with the following kinds of systems:
  * Database: MySQL, MongoDB
  * Cache: Redis Cache, Cm Cache -> you can **start multiple spider instances that share a single Redis queue**, which is best suited for **broad multi-domain crawls** (see the sketch after this list).
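A minimal sketch of the shared-Redis-queue setup, assuming the third-party scrapy-redis extension (the setting names follow its documentation; the Redis URL is a placeholder):<code python>
# settings.py -- assumes scrapy-redis is installed (pip install scrapy-redis)

# Schedule requests through a shared Redis queue instead of the
# per-process in-memory scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all spider instances via Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs so a crawl can be paused and resumed
SCHEDULER_PERSIST = True

# Placeholder connection string
REDIS_URL = 'redis://localhost:6379'
</code>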