====== Scrapy Architecture Code ======
===== Scrapy commands =====
==== Overview about scrapy commands ====
  * Scrapy command format:<code>
scrapy --help
Scrapy 1.0.3 - project: templatedownload

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  commands
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
</code>
  * For example:<code>
scrapy startproject -h
Usage
=====
  scrapy startproject <project_name>

Create new project

Options
=======
--help, -h              show this help message and exit

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--lsprof=FILE           write lsprof profiling stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure
</code>
  * Content of the scrapy script (Windows wrapper), which simply dispatches to scrapy.cmdline (a sketch of a custom command follows this list):<code>
python -mscrapy.cmdline %*
</code>
  * scrapy command diagram
(diagram: scrapy command flow)
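Because the scrapy script only dispatches to scrapy.cmdline, a project can plug extra commands in through the COMMANDS_MODULE setting shown in the defaults below. The following is a hedged sketch of such a command; the module name mycommands and the command body are invented for illustration, and it assumes Scrapy's behaviour of discovering ScrapyCommand subclasses in the module named by COMMANDS_MODULE:<code python>
# mycommands/botname.py -- hypothetical custom command, enabled with
# COMMANDS_MODULE = 'mycommands' in the project settings.
from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return "[options]"

    def short_desc(self):
        return "Print the project's bot name"

    def run(self, args, opts):
        # self.settings is attached by scrapy.cmdline before run() is called.
        print(self.settings.get('BOT_NAME'))
</code>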
==== scrapy command code scripts ====
Scrapy command code scripts are stored in the directory **scrapy/commands**:<code>
bench.py
check.py
crawl.py
deploy.py
edit.py
fetch.py
genspider.py
list.py
parse.py
runspider.py
settings.py
shell.py
startproject.py
version.py
view.py
</code>
===== Scrapy Settings =====
==== Default settings ====
default_settings.py:
<code python>
"""
This module contains the default values for all settings used by Scrapy.

For more information about these settings you can read the settings
documentation in docs/topics/settings.rst

Scrapy developers, if you add a setting here remember to:

* add it in alphabetical order
* group similar settings without leaving blank lines
* add its documentation to the available settings documentation
  (docs/topics/settings.rst)

"""

import os
import sys
from importlib import import_module
from os.path import join, abspath, dirname

AJAXCRAWL_ENABLED = False

BOT_NAME = 'scrapybot'

CLOSESPIDER_TIMEOUT = 0
CLOSESPIDER_PAGECOUNT = 0
CLOSESPIDER_ITEMCOUNT = 0
CLOSESPIDER_ERRORCOUNT = 0

COMMANDS_MODULE = ''

COMPRESSION_ENABLED = True

CONCURRENT_ITEMS = 100

CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0

COOKIES_ENABLED = True
COOKIES_DEBUG = False

DEFAULT_ITEM_CLASS = 'scrapy.item.Item'

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

DEPTH_LIMIT = 0
DEPTH_STATS = True
DEPTH_PRIORITY = 0

DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 10000
DNS_TIMEOUT = 60

DOWNLOAD_DELAY = 0

DOWNLOAD_HANDLERS = {}
DOWNLOAD_HANDLERS_BASE = {
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
    'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}

DOWNLOAD_TIMEOUT = 180      # 3mins

DOWNLOAD_MAXSIZE = 1024*1024*1024   # 1024m
DOWNLOAD_WARNSIZE = 32*1024*1024    # 32m

DOWNLOADER = 'scrapy.core.downloader.Downloader'

DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'
DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'

DOWNLOADER_MIDDLEWARES = {}

DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}

DOWNLOADER_STATS = True

DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

try:
    EDITOR = os.environ['EDITOR']
except KeyError:
    if sys.platform == 'win32':
        EDITOR = '%s -m idlelib.idle'
    else:
        EDITOR = 'vi'

EXTENSIONS = {}

EXTENSIONS_BASE = {
    'scrapy.extensions.corestats.CoreStats': 0,
    'scrapy.extensions.telnet.TelnetConsole': 0,
    'scrapy.extensions.memusage.MemoryUsage': 0,
    'scrapy.extensions.memdebug.MemoryDebugger': 0,
    'scrapy.extensions.closespider.CloseSpider': 0,
    'scrapy.extensions.feedexport.FeedExporter': 0,
    'scrapy.extensions.logstats.LogStats': 0,
    'scrapy.extensions.spiderstate.SpiderState': 0,
    'scrapy.extensions.throttle.AutoThrottle': 0,
}

FEED_URI = None
FEED_URI_PARAMS = None  # a function to extend uri arguments
FEED_FORMAT = 'jsonlines'
FEED_STORE_EMPTY = False
FEED_EXPORT_FIELDS = None
FEED_STORAGES = {}
FEED_STORAGES_BASE = {
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
FEED_EXPORTERS = {}
FEED_EXPORTERS_BASE = {
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}

HTTPCACHE_ENABLED = False
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_MISSING = False
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_IGNORE_SCHEMES = ['file']
HTTPCACHE_DBM_MODULE = 'anydbm'
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_GZIP = False

ITEM_PROCESSOR = 'scrapy.pipelines.ItemPipelineManager'

ITEM_PIPELINES = {}
ITEM_PIPELINES_BASE = {}

LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_FORMATTER = 'scrapy.logformatter.LogFormatter'
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
LOG_STDOUT = False
LOG_LEVEL = 'DEBUG'
LOG_FILE = None

LOG_UNSERIALIZABLE_REQUESTS = False

LOGSTATS_INTERVAL = 60.0

MAIL_HOST = 'localhost'
MAIL_PORT = 25
MAIL_FROM = 'scrapy@localhost'
MAIL_PASS = None
MAIL_USER = None

MEMDEBUG_ENABLED = False        # enable memory debugging
MEMDEBUG_NOTIFY = []            # send memory debugging report by mail at engine shutdown

MEMUSAGE_ENABLED = False
MEMUSAGE_LIMIT_MB = 0
MEMUSAGE_NOTIFY_MAIL = []
MEMUSAGE_REPORT = False
MEMUSAGE_WARNING_MB = 0

METAREFRESH_ENABLED = True
METAREFRESH_MAXDELAY = 100

NEWSPIDER_MODULE = ''

RANDOMIZE_DOWNLOAD_DELAY = True

REACTOR_THREADPOOL_MAXSIZE = 10

REDIRECT_ENABLED = True
REDIRECT_MAX_TIMES = 20  # uses Firefox default setting
REDIRECT_PRIORITY_ADJUST = +2

REFERER_ENABLED = True

RETRY_ENABLED = True
RETRY_TIMES = 2  # initial response + 2 retries = 3 requests
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 408]
RETRY_PRIORITY_ADJUST = -1

ROBOTSTXT_OBEY = False

SCHEDULER = 'scrapy.core.scheduler.Scheduler'
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'

SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader'

SPIDER_MIDDLEWARES = {}

SPIDER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    # Spider side
}

SPIDER_MODULES = []

STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
STATS_DUMP = True

STATSMAILER_RCPTS = []

TEMPLATES_DIR = abspath(join(dirname(__file__), '..', 'templates'))

URLLENGTH_LIMIT = 2083

USER_AGENT = 'Scrapy/%s (+http://scrapy.org)' % import_module('scrapy').__version__

TELNETCONSOLE_ENABLED = 1
TELNETCONSOLE_PORT = [6023, 6073]
TELNETCONSOLE_HOST = '127.0.0.1'

SPIDER_CONTRACTS = {}
SPIDER_CONTRACTS_BASE = {
    'scrapy.contracts.default.UrlContract': 1,
    'scrapy.contracts.default.ReturnsContract': 2,
    'scrapy.contracts.default.ScrapesContract': 3,
}
</code>
==== Some sections in default_settings.py and custom settings ====
These are the groups of default settings most often overridden by a project (a sketch of such overrides follows the list):
  * Bot name:<code python>
BOT_NAME = 'scrapybot'
</code>
  * Download handlers:<code python>
DOWNLOAD_HANDLERS = {}
DOWNLOAD_HANDLERS_BASE = {
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
    'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}
</code>
  * Downloader middlewares:<code python>
DOWNLOADER_MIDDLEWARES = {}
DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}
</code>
  * Extensions:<code python>
EXTENSIONS = {}
EXTENSIONS_BASE = {
    'scrapy.extensions.corestats.CoreStats': 0,
    'scrapy.extensions.telnet.TelnetConsole': 0,
    'scrapy.extensions.memusage.MemoryUsage': 0,
    'scrapy.extensions.memdebug.MemoryDebugger': 0,
    'scrapy.extensions.closespider.CloseSpider': 0,
    'scrapy.extensions.feedexport.FeedExporter': 0,
    'scrapy.extensions.logstats.LogStats': 0,
    'scrapy.extensions.spiderstate.SpiderState': 0,
    'scrapy.extensions.throttle.AutoThrottle': 0,
}
</code>
  * Feed storages:<code python>
FEED_STORAGES = {}
FEED_STORAGES_BASE = {
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
</code>
  * Feed exporters:<code python>
FEED_EXPORTERS = {}
FEED_EXPORTERS_BASE = {
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}
</code>
  * Spider middlewares:<code python>
SPIDER_MIDDLEWARES = {}
SPIDER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    # Spider side
}
</code>
  * Spider modules:<code python>
SPIDER_MODULES = []
</code>
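The defaults above are merged with project and per-spider settings at run time. As a rough sketch (the module paths under myproject are placeholders, not part of the original page), a project's settings.py typically overrides a handful of these groups:<code python>
# settings.py of a hypothetical project "myproject": override a few defaults.
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

# Merged with DOWNLOADER_MIDDLEWARES_BASE; None disables a built-in middleware.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomProxyMiddleware': 750,   # hypothetical class
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}

# Enable an item pipeline (lower number = runs earlier).
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,        # hypothetical class
}
</code>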
===== scrapy engine =====
==== scrapy engine init and start ====
(diagram: scrapy engine init and start)
==== DownloadMiddleware ====
(diagram: DownloadMiddleware classes)
=== Instantiate DownloadMiddleware ===
(diagram: instantiating DownloadMiddleware)
=== DownloadMiddleware runs a Download ===
(diagram: DownloadMiddleware running a download)
==== Request in Scrapy ====
==== Crawl and Spider ====
=== scrapy crawl command start and run Spider ===
Command to start a spider:<code>
scrapy crawl <spider_name>
</code>
crawling diagram:
(diagram: crawling flow)
Basic steps for crawling:
  - Step 1: call the class method **update_settings** of the Spider
  - Step 2: call the class method **from_crawler** to create the spider object
  - Step 3: call self.spider.**start_requests()**, which returns all initial requests for downloading:<code python>
class Crawler(object):
    ................
    @defer.inlineCallbacks
    def crawl(self, *args, **kwargs):
        assert not self.crawling, "Crawling already taking place"
        self.crawling = True

        try:
            self.spider = self._create_spider(*args, **kwargs)
            self.engine = self._create_engine()
            start_requests = iter(self.spider.start_requests())
            yield self.engine.open_spider(self.spider, start_requests)
            yield defer.maybeDeferred(self.engine.start)
        except Exception:
            self.crawling = False
            raise
...........................
class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(*args, **kwargs)
        spider._set_crawler(crawler)
        return spider

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)

    def parse(self, response):
        raise NotImplementedError

    @classmethod
    def update_settings(cls, settings):
        settings.setdict(cls.custom_settings or {}, priority='spider')

    @classmethod
    def handles_request(cls, request):
        return url_is_from_spider(request.url, cls)

    @staticmethod
    def close(spider, reason):
        closed = getattr(spider, 'closed', None)
        if callable(closed):
            return closed(reason)
</code>
  - Step 4: add the spider's requests to the scheduler for downloading
  - Step 5: after a URL has been downloaded, the crawler calls the **parse** callback
  - Step 6: **continue downloading new requests**: the crawler takes all requests yielded from **parse** and keeps downloading them
  - Step 7: **process the downloaded data**
  - Step 8: every request has a callback, by default **parse(self, response)**; depending on what the callback yields:
    * if the callback yields a **Request**, Scrapy continues by downloading that request
    * if the callback yields an **Item**, it is passed to the item pipelines for data processing
A minimal spider illustrating steps 5-8 is sketched below.
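The sketch below (spider name, site and CSS selectors are invented for illustration, not taken from the original page) yields both Items and follow-up Requests from parse, exercising the two branches of step 8:<code python>
import scrapy

class QuotesSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate steps 5-8 above.
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Yielding a dict/Item sends it to the item pipelines (step 8, Item case).
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # Yielding a Request schedules another download (step 8, Request case).
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
</code>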
=== Spider Classes ===
(diagram: Spider class hierarchy)
=== Download and parse-response algorithm in Spider ===
Spider class:<code python>
class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """

    name = None
    custom_settings = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

    @property
    def logger(self):
        logger = logging.getLogger(self.name)
        return logging.LoggerAdapter(logger, {'spider': self})

    def log(self, message, level=logging.DEBUG, **kw):
        """Log the given message at the given log level

        This helper wraps a log call to the logger within the spider, but you
        can use it directly (e.g. Spider.logger.info('msg')) or use any other
        Python logger too.
        """
        self.logger.log(level, message, **kw)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(*args, **kwargs)
        spider._set_crawler(crawler)
        return spider

    def set_crawler(self, crawler):
        warnings.warn("set_crawler is deprecated, instantiate and bound the "
                      "spider to this crawler with from_crawler method "
                      "instead.",
                      category=ScrapyDeprecationWarning, stacklevel=2)
        assert not hasattr(self, 'crawler'), "Spider already bounded to a " \
                                             "crawler"
        self._set_crawler(crawler)

    def _set_crawler(self, crawler):
        self.crawler = crawler
        self.settings = crawler.settings
        crawler.signals.connect(self.close, signals.spider_closed)

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)

    def parse(self, response):
        raise NotImplementedError

    @classmethod
    def update_settings(cls, settings):
        settings.setdict(cls.custom_settings or {}, priority='spider')

    @classmethod
    def handles_request(cls, request):
        return url_is_from_spider(request.url, cls)

    @staticmethod
    def close(spider, reason):
        closed = getattr(spider, 'closed', None)
        if callable(closed):
            return closed(reason)

    def __str__(self):
        return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))

    __repr__ = __str__
</code>
Description of the algorithm:
  - Step 1: the Scrapy engine calls **start_requests**, which creates a download request for every link in **start_urls**:<code python>
    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)
</code>
  - Step 2: the response downloaded in step 1 is processed in **parse(self, response)**, which can return two kinds of data (a sketch of a spider that customizes **start_requests** follows this list):
    * Data set 1: scraped web data, which is sent to the item pipelines for processing
    * Data set 2: new links, which Scrapy will download and whose responses are sent back to **parse(self, response)** (or another callback)
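A spider can also build its initial requests itself instead of relying on start_urls; the sketch below follows the pattern shown in the Scrapy documentation, with URLs, form fields and callback names that are purely illustrative:<code python>
import scrapy

class LoginSpider(scrapy.Spider):
    # Hypothetical spider: builds its own initial requests instead of start_urls.
    name = 'login_example'

    def start_requests(self):
        # Step 1 equivalent: the engine iterates over whatever this method returns.
        return [scrapy.FormRequest('http://www.example.com/login',
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # Step 2 equivalent: parse the response and yield follow-up requests.
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_page)

    def parse_page(self, response):
        yield {'url': response.url,
               'title': response.css('title::text').extract_first()}
</code>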
=== CrawlSpider class ===
**CrawlSpider inherits from Spider** and relies on two basic algorithms:
  * a **recursive algorithm** that follows every URL linked from the start URLs, building up the network of URLs connected to them
  * a **link-extraction algorithm** that applies rules to filter out only the URLs it actually wants to download
=> This module implements the CrawlSpider, which is the recommended spider for scraping typical web sites that require crawling pages.
<code python>
"""
This modules implements the CrawlSpider which is the recommended spider to use
for scraping typical web sites that requires crawling pages.

See documentation in docs/topics/spiders.rst
"""

import copy

from scrapy.http import Request, HtmlResponse
from scrapy.utils.spider import iterate_spider_output
from scrapy.spiders import Spider

def identity(x):
    return x

class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None,
                 process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow

class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    def parse_start_url(self, response):
        return []

    def process_results(self, response, results):
        return results

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)

    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, basestring):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider._follow_links = crawler.settings.getbool(
            'CRAWLSPIDER_FOLLOW_LINKS', True)
        return spider

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)
</code>
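For comparison with the source above, here is a sketch of how a CrawlSpider is typically declared, following the pattern in the Scrapy documentation; the domain, rules and XPath selectors are illustrative only:<code python>
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleCrawlSpider(CrawlSpider):
    # Hypothetical CrawlSpider showing how rules drive the recursive crawl.
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Follow category pages but do not parse them (no callback => follow=True).
        Rule(LinkExtractor(allow=(r'category\.php',))),
        # Parse item pages with parse_item (callback given => follow defaults to False).
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'id': response.xpath('//td[@id="item_id"]/text()').extract_first(),
            'name': response.xpath('//td[@id="item_name"]/text()').extract_first(),
        }
</code>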
==== ItemPipeline for processing and storing data ====
refer: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

The **ImagesPipeline and FilesPipeline** are special cases and are described in their own sections below.

=== ItemPipeline Classes ===
(diagram: ItemPipeline class hierarchy)

With some public functions for pipelines:<code python>
@classmethod
def from_crawler(cls, crawler)
def open_spider(self, spider)
def process_item(self, item, spider)
def media_to_download(self, request, info)
def get_media_requests(self, item, info)
def media_downloaded(self, response, request, info)
def media_failed(self, failure, request, info)
def item_completed(self, results, item, info)
</code>
=== ItemPipelineManager ===
(diagram: ItemPipelineManager)
Notes on the diagram and a description of the pipeline architecture (a combined sketch follows this list):
  * All **pipeline classes** act as middleware that processes data after it has been parsed by the spiders (middleware here means that **items parsed by every spider are sent, in order, through all of these pipelines**). The list of pipelines in use is stored in the **ITEM_PIPELINES** and **ITEM_PIPELINES_BASE** settings:<code python>
ITEM_PIPELINES = {
    # project-specific mapping of pipeline class path -> order
}
</code>
  * When **the program starts**, Scrapy **calls the class methods of every pipeline class to construct it**:
    - it calls the class method **from_crawler** of the pipeline class (cls is the pipeline class):<code python>
@classmethod
def from_crawler(cls, crawler)
</code>
    - it calls the class method **from_settings**:<code python>
@classmethod
def from_settings(cls, settings)
</code>
  * While a **spider runs and returns parsed data**, Scrapy calls the methods below on every pipeline class to process it, in the following order:
    - call **open_spider**:<code python>
def open_spider(self, spider)
</code>
    - call **process_item**:<code python>
def process_item(self, item, spider)
</code>
    - call **close_spider**:<code python>
def close_spider(self, spider)
</code>
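Putting the hooks together, a minimal pipeline might look like the sketch below; the class name, the STATS_PREFIX setting and the myproject module path are hypothetical, chosen only to show the call order described above:<code python>
# pipelines.py -- hypothetical pipeline showing the hooks in the order Scrapy calls them.
class StatsLoggingPipeline(object):

    def __init__(self, prefix):
        self.prefix = prefix
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Called once at startup to build the pipeline from the crawler settings.
        return cls(prefix=crawler.settings.get('STATS_PREFIX', 'items'))  # STATS_PREFIX is hypothetical

    def open_spider(self, spider):
        # Called when the spider is opened.
        self.count = 0

    def process_item(self, item, spider):
        # Called for every item yielded by the spider.
        self.count += 1
        return item

    def close_spider(self, spider):
        # Called when the spider finishes.
        spider.logger.info("%s: %d items scraped" % (self.prefix, self.count))

# settings.py -- activate it; the number controls ordering across pipelines.
# ITEM_PIPELINES = {'myproject.pipelines.StatsLoggingPipeline': 100}
</code>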
=== ImagesPipeline ===
https://doc.scrapy.org/en/latest/topics/media-pipeline.html

(diagram: ImagesPipeline classes)
=== Store in FilesPipeline ===
(diagram: FilesPipeline store classes)

With 2 public functions:<code python>
def persist_file(self, path, buf, info, meta=None, headers=None)
def stat_file(self, path, info)
</code>
Called from FilesPipeline:<code python>
def file_downloaded(self, response, request, info):
    path = self.file_path(request, response=response, info=info)
    buf = BytesIO(response.body)
    self.store.persist_file(path, buf, info)
    checksum = md5sum(buf)
    return checksum
</code>
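To use these store classes in a project, the built-in pipelines are enabled and pointed at a storage URI. The sketch below uses the standard FILES_STORE/IMAGES_STORE settings and the conventional file_urls/files and image_urls/images item fields; the paths are illustrative:<code python>
# settings.py -- enable the built-in media pipelines and choose a store backend.
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
    'scrapy.pipelines.images.ImagesPipeline': 2,
}
FILES_STORE = '/data/files'      # local path -> FSFilesStore; 's3://bucket/path' -> S3FilesStore
IMAGES_STORE = '/data/images'

# items.py -- the default field names the media pipelines look at.
import scrapy

class MediaItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
</code>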
=== Write items to a JSON file ===
The following pipeline stores all scraped items (from all spiders) into a single items.jl file, containing one item per line, serialized in JSON format:<code python>
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'wb')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
</code>
Note: the purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.
=== Write items to MongoDB ===
In this example we'll write items to MongoDB using pymongo. The MongoDB address and database name are specified in the Scrapy settings; the MongoDB collection is named after the item class.

The main point of this example is to show how to use the from_crawler() method and how to clean up the resources properly:<code python>
import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert(dict(item))
        return item
</code>
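To wire this pipeline in, the settings it reads (MONGO_URI, MONGO_DATABASE) have to be defined and the pipeline enabled; a minimal sketch, where the myproject module path and the concrete values are placeholders:<code python>
# settings.py -- activate the pipeline and provide its connection settings.
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,   # 'myproject' is a placeholder module path
}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scraping'
</code>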
=== Duplicates filter ===
A filter that looks for duplicate items, and drops those items that were already processed. Let's say that our items have a unique id, but our spider returns multiple items with the same id:<code python>
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
</code>
==== scope of allowed_domains ====
allowed_domains is filtered in **site-packages\scrapy\utils\url.py**:<code python>
def url_is_from_any_domain(url, domains):
    """Return True if the url belongs to any of the given domains"""
    host = parse_url(url).netloc.lower()

    if host:
        return any(((host == d.lower()) or (host.endswith('.%s' % d.lower()))) for d in domains)
    else:
        return False

def url_is_from_spider(url, spider):
    """Return True if the url belongs to the given spider"""
    return url_is_from_any_domain(url,
        [spider.name] + list(getattr(spider, 'allowed_domains', [])))
</code>
and the spider calls it to check before downloading:<code python>
class Spider(object_ref):
    @classmethod
    def handles_request(cls, request):
        return url_is_from_spider(request.url, cls)
</code>
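In practice this scope is what limits which yielded links get crawled: the OffsiteMiddleware (enabled by default in SPIDER_MIDDLEWARES_BASE) also checks allowed_domains. A sketch, with illustrative domains:<code python>
import scrapy

class ScopedSpider(scrapy.Spider):
    # Hypothetical spider: only example.com and its subdomains pass the domain check.
    name = 'scoped'
    allowed_domains = ['example.com']          # sub.example.com matches, example.org does not
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            # Off-site requests yielded here are dropped by the OffsiteMiddleware,
            # which also checks allowed_domains.
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
</code>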
==== Integrate Scrapy with Other Systems ====
Scrapy can be integrated with the systems below:
  * Databases: MySQL, MongoDB
  * Caches: Redis cache, Cm Cache -> you can **start multiple spider instances that share a single redis queue**, which is best suited for **broad multi-domain crawls** (see the settings sketch below).
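Sharing one redis queue between spider instances is usually done with the third-party scrapy-redis package; the following is a sketch of its settings, assuming that package is installed (key names follow that project, and the redis URL is illustrative):<code python>
# settings.py -- a sketch assuming the third-party scrapy-redis package is installed.
# All spider instances started with these settings pull requests from one shared queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # replaces the default scheduler
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared request dedup in redis
SCHEDULER_PERSIST = True                                    # keep the queue between runs
REDIS_URL = 'redis://localhost:6379'
</code>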