Scrapy

Scrapy Overview

Basic Concepts

  • Command line tool: Learn about the command-line tool used to manage your Scrapy project.
  • Items: Define the data you want to scrape.
  • Spiders: Write the rules to crawl your websites.
  • Selectors: Extract the data from web pages using XPath.
  • Scrapy shell: Test your extraction code in an interactive environment.
  • Item Loaders: Populate your items with the extracted data.
  • Item Pipeline: Post-process and store your scraped data.
  • Feed exports: Output your scraped data using different formats and storages.
  • Link Extractors: Convenient classes to extract links to follow from pages.

Data Flow


The data flow in Scrapy is controlled by the execution engine and goes like this (a small downloader middleware sketch follows the list):

  1. The Engine opens a domain, locates the Spider that handles that domain, and asks the spider for the first URLs to crawl.
  2. The Engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler, as Requests.
  3. The Engine asks the Scheduler for the next URLs to crawl.
  4. The Scheduler returns the next URLs to crawl to the Engine and the Engine sends them to the Downloader, passing through the Downloader Middleware (request direction).
  5. Once the page finishes downloading, the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middleware (response direction).
  6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (input direction).
  7. The Spider processes the Response and returns scraped Items and new Requests (to follow) to the Engine.
  8. The Engine sends scraped Items (returned by the Spider) to the Item Pipeline and Requests (returned by the Spider) to the Scheduler.
  9. The process repeats (from step 2) until there are no more requests from the Scheduler, and the Engine closes the domain.
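
To make the "request direction" and "response direction" in steps 4-6 concrete, below is a minimal downloader middleware sketch. This is an illustration only, not code from this page: the class name, module path and header are made up.

# myscrapy/middlewares.py (hypothetical module)
# enable it in settings.py, e.g.:
# DOWNLOADER_MIDDLEWARES = {'myscrapy.middlewares.CustomHeaderMiddleware': 543}
class CustomHeaderMiddleware(object):

    def process_request(self, request, spider):
        # request direction (step 4): runs before the Downloader fetches the page
        request.headers.setdefault('X-Crawled-By', spider.name)
        return None  # None means: continue handling this request normally

    def process_response(self, request, response, spider):
        # response direction (step 5): sees the downloaded page on its way back to the Engine
        spider.log("downloaded %s (status %d)" % (response.url, response.status))
        return response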

Installation

Installation on Linux

Prerequisites:

Install Python 2.7 from source

Because the yum packages on CentOS 6 provide Python 2.6, we need to install Python 2.7 from source:

yum install bzip2-devel
yum install sqlite-devel
yum install 
wget https://www.python.org/ftp/python/2.7.7/Python-2.7.7.tgz
tar xf Python-2.7.7.tgz
cd Python-2.7.7
./configure
make
make install
Install pip
  • Install:
    wget https://bootstrap.pypa.io/get-pip.py
    python2.7 get-pip.py
  • Check the pip version:
    pip -V
    pip 1.5.6 from /usr/local/lib/python2.7/site-packages (python 2.7)
Install lxml
  • Install prerequisite packages:
    yum install libxml2-devel
    yum install libxslt-devel
  • Install lxml with pip:
    pip install lxml==3.1.2
  • Check the install location:
    /usr/lib64/python2.6/site-packages/lxml
    Or
    /usr/local/lib/python2.7/site-packages/lxml
Install pyopenssl
  • Install prerequisite packages:
    yum install openssl-devel
    yum install libffi-devel
  • Install pyopenssl
    • Install from source
      wget https://pypi.python.org/packages/source/p/pyOpenSSL/pyOpenSSL-0.14.tar.gz
      tar xf pyOpenSSL-0.14.tar.gz
      cd pyOpenSSL-0.14
      python2.7 setup.py --help
      python2.7 setup.py build
      python2.7 setup.py install
    • Or Install with pip
      pip install pyopenssl
Install Twisted-14.0.0 from source
wget http://twistedmatrix.com/Releases/Twisted/14.0/Twisted-14.0.0.tar.bz2
tar xf Twisted-14.0.0.tar.bz2
cd Twisted-14.0.0
python2.7 setup.py --help
python2.7 setup.py build
python2.7 setup.py install

Install scrapy

  • Install
    pip install scrapy
  • Update:
    pip install scrapy --upgrade
  • Build and install from source:
    wget https://codeload.github.com/scrapy/scrapy/legacy.tar.gz/master
    tar xf master
    cd scrapy-scrapy-9d57ecf
    python setup.py install

    Upgrade to master to fix the error below:

    trial test_crawl.py
    scrapy.tests.test_crawl
      CrawlTestCase
        test_delay ... Traceback (most recent call last):
      File "/usr/local/lib/python2.7/runpy.py", line 162, in _run_module_as_main
        "__main__", fname, loader, pkg_name)
      File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
        exec code in run_globals
      File "/usr/local/lib/python2.7/site-packages/scrapy/tests/mockserver.py", line 198, in <module>
        os.path.join(os.path.dirname(__file__), 'keys/cert.pem'),
      File "/usr/local/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/internet/ssl.py", line 104, in __init__
        self.cacheContext()
      File "/usr/local/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/internet/ssl.py", line 113, in cacheContext
        ctx.use_certificate_file(self.certificateFileName)
      File "/usr/local/lib/python2.7/site-packages/OpenSSL/SSL.py", line 391, in use_certificate_file
        _raise_current_error()
      File "/usr/local/lib/python2.7/site-packages/OpenSSL/_util.py", line 22, in exception_from_error_queue
        raise exceptionType(errors)
    OpenSSL.SSL.Error: [('system library', 'fopen', 'No such file or directory'), ('BIO routines', 'FILE_CTRL', 'system lib'), ('SSL routines', 'SSL_CTX_use_certificate_file', 'system lib')]

Install on Windows

Prepare environment variables

  • Set the environment variable for Python:
    set PATH=%PATH%;"D:\tools\Python27\";"D:\tools\Python27\Scripts"
  • Check the current environment variables of Visual Studio:
    set

    output:

    VS100COMNTOOLS=C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\Tools\
    VS110COMNTOOLS=D:\tools\Microsoft Visual Studio 11.0\Common7\Tools\
  • Set the environment variable VS90COMNTOOLS (used by Python to find the compiler on Windows):
    SET VS90COMNTOOLS=%VS100COMNTOOLS%

    ⇒ Fixes the error: Unable to find vcvarsall.bat

  • Upgrade setuptools:
    pip install -U setuptools

Install pyopenssl

Step-by-step installation of pyopenssl:

  1. Download the Visual C++ 2008 (or 2010) redistributable for your Windows version and architecture
  2. Download OpenSSL for your Windows and architecture (the regular version, not the light one)
  3. Prepare environment for building pyopenssl
    SET LIB
    SET INCLUDE
    SET VS90COMNTOOLS=%VS100COMNTOOLS%
    SET INCLUDE=d:\tools\OpenSSL-Win32\include
    SET LIB=d:\tools\OpenSSL-Win32\lib

    And on Windows 64-bit:

    SET LIB
    SET INCLUDE
    SET VS90COMNTOOLS=%VS110COMNTOOLS%
    SET INCLUDE=d:\tools\OpenSSL-Win64\include
    SET LIB=d:\tools\OpenSSL-Win64\lib

    We set VS90COMNTOOLS=%VS100COMNTOOLS% because Visual C++ 2010 registers the VS100COMNTOOLS environment variable when installed on Windows, not the Visual C++ 2008 one. Python uses the VS90COMNTOOLS environment variable to find the compiler on Windows, so we need to point VS90COMNTOOLS at VS100COMNTOOLS.

  4. Install pyopenssl: Go to directory d:\tools\Python27\Scripts\ and run
    easy_install.exe pyopenssl

Issues for building pyopenssl:

  • Issue:
    cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -ID:\tools\Python27\include -ID:\tools\Python27\PC /Tccryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.c /Fobuild\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.obj
    link.exe /DLL /nologo /INCREMENTAL:NO /LIBPATH:D:\tools\Python27\libs /LIBPATH:D:\tools\Python27\PCbuild libeay32.lib ssleay32.lib advapi32.lib /EXPORT:init_Cryptography_cffi_444d7397xa22f8491 build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.obj /OUT:build\lib.win32-2.7\cryptography\_Cryptography_cffi_444d7397xa22f8491.pyd /IMPLIB:build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.lib /MANIFESTFILE:build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest
    mt.exe -nologo -manifest build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest -outputresource:build\lib.win32-2.7\cryptography\_Cryptography_cffi_444d7397xa22f8491.pyd;2
    error: command 'mt.exe' failed with exit status 31
  • Root cause of issue:
    link.exe  /MANIFESTFILE:build\temp.win32-7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest

    ⇒ link.exe cannot create the .manifest file

  • Following the specification at http://msdn.microsoft.com/en-us/library/y0zzbyt4.aspx, we need to add the /MANIFEST option when using link.exe so that it creates the manifest file

Fix: To fix this issue, update msvc9compiler.py to add the build option /MANIFEST just before the line ld_args.append('/MANIFESTFILE:' + temp_manifest):

ld_args.append('/MANIFEST ')
ld_args.append('/MANIFESTFILE:' + temp_manifest)

Install pywin32, twisted

  • Install pywin32 from http://sourceforge.net/projects/pywin32/files/ or install it with easy_install:
    easy_install.exe pypiwin32
  • Go to the directory d:\tools\Python27\Scripts\ and run the command below to install Twisted:
    easy_install.exe twisted

Install lxml

Install lxml version 2.x.x

easy_install.exe lxml==2.3

Install lxml 3.x.x:

  1. Step 1: Download the lxml MS Windows installer package
  2. Step 2: Install it with easy_install:
    easy_install lxml-3.4.0.win32-py2.7.exe

Build lxml from source

git clone git://github.com/lxml/lxml.git lxml
cd lxml
SET VS90COMNTOOLS=%VS100COMNTOOLS%
pip install -r requirements.txt
python setup.py build
python setup.py install

Install scrapy

Go to directory d:\tools\Python27\Scripts\ and run:

pip.exe install scrapy

Install and run Scrapy in a virtual environment with miniconda

  1. Step 1: Create a virtual environment named scrapy and install the scrapy package in it:
    conda create -n scrapy scrapy python=2
  2. Step 2: Activate the scrapy virtual environment:
    activate scrapy

First Project

This tutorial will walk you through these tasks:

  1. Creating a new Scrapy project
  2. Writing a simple spider to crawl a site and running the crawl
  3. Modifying the spider to extract Items with Selectors
  4. Defining an Item to store the extracted data and returning Items to the Item Pipeline
  5. Writing an Item Pipeline to store the extracted Items

Creating a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and then run:

scrapy startproject myscrapy

This will create a myscrapy directory with the following contents:

myscrapy/
    scrapy.cfg
    myscrapy/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These are basically:

  • scrapy.cfg: the project configuration file
  • myscrapy/: the project’s python module, you’ll later import your code from here.
  • myscrapy/items.py: the project’s items file.
  • myscrapy/pipelines.py: the project’s pipelines file.
  • myscrapy/settings.py: the project’s settings file.
  • myscrapy/spiders/: a directory where you’ll later put your spiders.

Create simple Spider

This is the code for our first Spider; save it in a file named dmoz_spider.py under the myscrapy/spiders directory:

from scrapy.spider import Spider
 
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
 
    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

⇒ This code defines a spider with the basic information below:

  • Identifies the Spider:
    name = "dmoz"
  • Lists the domains allowed for crawling:
    allowed_domains = ["dmoz.org"]
  • List of URLs where the Spider will begin to crawl from:
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

Crawling

To run the spider named 'dmoz', go to the project’s top-level directory and run:

scrapy crawl dmoz

Output:

  • crawling log:
    2014-06-11 15:52:30+0700 [scrapy] INFO: Scrapy 0.22.2 started (bot: myscrapy)
    2014-06-11 15:52:30+0700 [scrapy] INFO: Optional features available: ssl, http11, django
    2014-06-11 15:52:30+0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'myscrapy.spiders', 'SPIDER_MODULES': ['myscrapy.spiders'], 'BOT_NAME': 'myscrapy'}
    2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled item pipelines:
    2014-06-11 15:52:30+0700 [dmoz] INFO: Spider opened
    2014-06-11 15:52:30+0700 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2014-06-11 15:52:30+0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2014-06-11 15:52:30+0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2014-06-11 15:52:31+0700 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
    2014-06-11 15:52:31+0700 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
    2014-06-11 15:52:31+0700 [dmoz] INFO: Closing spider (finished)
    2014-06-11 15:52:31+0700 [dmoz] INFO: Dumping Scrapy stats:
            {'downloader/request_bytes': 516,
             'downloader/request_count': 2,
             'downloader/request_method_count/GET': 2,
             'downloader/response_bytes': 16515,
             'downloader/response_count': 2,
             'downloader/response_status_count/200': 2,
             'finish_reason': 'finished',
             'finish_time': datetime.datetime(2014, 6, 11, 8, 52, 31, 299000),
             'log_count/DEBUG': 4,
             'log_count/INFO': 7,
             'response_received_count': 2,
             'scheduler/dequeued': 2,
             'scheduler/dequeued/memory': 2,
             'scheduler/enqueued': 2,
             'scheduler/enqueued/memory': 2,
             'start_time': datetime.datetime(2014, 6, 11, 8, 52, 30, 424000)}
    2014-06-11 15:52:31+0700 [dmoz] INFO: Spider closed (finished)
  • Output files created by the parse function in dmoz_spider:
    Books
    Resources

Modify the spider to extract Items with Selectors

Edit spiders\dmoz_spider.py:

from scrapy.spider import Spider
from scrapy.selector import Selector
 
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
 
    def parse(self, response):
        print "------response.url=",response.url
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title, link, desc

Defining an Item to store the extracted data and returning Items to the Item Pipeline

  • Define the Item class that stores the extracted data in items.py:
    from scrapy.item import Item, Field
     
    class DmozItem(Item):
        title = Field()
        link = Field()
        desc = Field()
  • Modify spiders\dmoz_spider.py to return Items to the Item Pipeline:
    from scrapy.spider import Spider
    from scrapy.selector import Selector
    from myscrapy.items import DmozItem
     
    class DmozSpider(Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]
     
        def parse(self, response):
            print "------response.url=",response.url
            filename = response.url.split("/")[-2]
            open(filename, 'wb').write(response.body)
            sel = Selector(response)
            sites = sel.xpath('//ul/li')
            items = []
            for site in sites:
                item = DmozItem()
                item['title'] = site.xpath('a/text()').extract()
                item['link'] = site.xpath('a/@href').extract()
                item['desc'] = site.xpath('text()').extract()
                items.append(item)
            return items

Storing the scraped data

The simplest way to store the scraped data is by using a feed export, for example in JSON format:

scrapy crawl dmoz -o items.json -t json
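
Other formats supported by the feed exports can be selected with the -t option in the same way; the file names below are arbitrary examples:

scrapy crawl dmoz -o items.csv -t csv
scrapy crawl dmoz -o items.jl -t jsonlines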

Item Pipeline

After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

Typical uses for item pipelines are:

  • cleansing HTML data
  • validating scraped data (checking that the items contain certain fields)
  • checking for duplicates (and dropping them)
  • storing the scraped item in a database

Follow the steps below to enable an Item Pipeline (a fuller pipeline sketch follows these steps):

  • Add the pipeline class to the ITEM_PIPELINES setting in settings.py (the value defines the order in which pipelines run):
    ITEM_PIPELINES = {'myscrapy.pipelines.MyscrapyPipeline': 300}
  • Define class MyscrapyPipeline in pipelines.py:
    class MyscrapyPipeline(object):
        def process_item(self, item, spider):
            return item
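
As a fuller sketch of the typical uses listed above (validation and duplicate dropping), a pipeline might look like the code below. The class name and logic are assumptions built around the DmozItem fields defined earlier, not code from this project:

from scrapy.exceptions import DropItem

class ValidateAndDedupePipeline(object):
    """Drop items that have no title, and items whose link was already seen."""

    def __init__(self):
        self.links_seen = set()

    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem("Missing title in %s" % item)
        link = tuple(item.get('link', []))
        if link in self.links_seen:
            raise DropItem("Duplicate item with link %s" % (link,))
        self.links_seen.add(link)
        return item

Enable it by adding it to ITEM_PIPELINES alongside MyscrapyPipeline; the pipeline with the lower order value runs first.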

Check spiders and run them

  • List all spiders:
    scrapy list
  • Run a spider:
    scrapy crawl spidername

Custom APIs

Selectors

refer: http://doc.scrapy.org/en/latest/topics/selectors.html

When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this:

  • BeautifulSoup is a very popular screen scraping library among Python programmers which constructs a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: it’s slow.
  • lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree (which is not part of the Python standard library).

Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.

How to create a Selector:

  • Create a Selector from text:
    from scrapy.http import HtmlResponse
    from scrapy.selector import Selector
     
    body = '<html><body><span>good</span></body></html>'
    Selector(text=body).xpath('//span/text()').extract()

    ⇒ output:

    [u'good']
  • Create a Selector from a Response:
    from scrapy.http import HtmlResponse
    from scrapy.selector import Selector
     
    body = '<html><body><span>good</span></body></html>'
    response = HtmlResponse(url='http://example.com', body=body)
    Selector(response=response).xpath('//span/text()').extract()

    ⇒ output:

    [u'good']
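  • Selectors also support CSS expressions; a small sketch using the same body (the ::text pseudo-element selects text nodes):
    from scrapy.selector import Selector
     
    body = '<html><body><span>good</span></body></html>'
    Selector(text=body).css('span::text').extract()

    ⇒ output:

    [u'good']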

Rule and linkextractors in CrawlSpider

CrawlSpider uses two basic algorithms:

  1. A recursive algorithm that finds all URLs linked from the start URLs, building up the network of URLs linked to them
  2. A link-extraction algorithm that applies rules to filter out the URLs it wants to download

Scrapy linkextractors package

refer: http://doc.scrapy.org/en/latest/topics/link-extractors.html

Link extractors are objects whose only purpose is to extract links from web pages. The only public method that every link extractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. Link extractors are meant to be instantiated once and their extract_links method called several times with different responses to extract links to follow.

linkextractors classes

Link extractor classes bundled with Scrapy are provided in the scrapy.linkextractors module. Some basic classes in scrapy.linkextractors used to extract links are:

  • scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor. Because of the alias below:
    # Top-level imports
    from .lxmlhtml import LxmlLinkExtractor as LinkExtractor

    ⇒ we can use LxmlLinkExtractor by importing LinkExtractor as shown below:

    from scrapy.linkextractors import LinkExtractor
  • scrapy.linkextractors.htmlparser.HtmlParserLinkExtractor ⇒ deprecated, will be removed in a future release
  • scrapy.linkextractors.sgml.SgmlLinkExtractor ⇒ deprecated, will be removed in a future release

Constructor of LxmlLinkExtractor (a usage sketch follows the parameter list):

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)
  • allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
  • deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (ie. not extracted). It has precedence over the allow parameter. If not given (or empty) it won’t exclude any links.
  • allow_domains (str or list) – a single value or a list of string containing domains which will be considered for extracting the links
  • deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links
  • deny_extensions
  • restrict_xpaths (str or list) – is an XPath (or list of XPath’s) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPath will be scanned for links. See examples below.
  • restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths
  • tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
  • attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',)
  • unique (boolean) – whether duplicate filtering should be applied to extracted links.
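
A small usage sketch combining some of these parameters; the regular expressions, domain and XPath below are illustrative values only, and response is assumed to be a downloaded response inside a spider callback:

from scrapy.linkextractors import LinkExtractor  # alias for LxmlLinkExtractor
 
link_extractor = LinkExtractor(
    allow=(r'/Computers/', ),                         # keep only URLs matching this regex
    deny=(r'/login', ),                               # exclude these URLs even if allowed
    allow_domains=('dmoz.org', ),
    restrict_xpaths=('//ul[@class="directory-url"]', ),
)
 
# extract_links() returns a list of scrapy.link.Link objects found in the response
for link in link_extractor.extract_links(response):
    print link.url, link.text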

How links are extracted in LxmlLinkExtractor:

class LxmlLinkExtractor(FilteringLinkExtractor):
 
    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()):
        ............................
        lx = LxmlParserLinkExtractor(tag=tag_func, attr=attr_func,
            unique=unique, process=process_value)
        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions)
        ............................
    def extract_links(self, response):
        html = Selector(response)
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [subdoc
                    for x in self.restrict_xpaths
                    for subdoc in html.xpath(x)]
        else:
            docs = [html]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            all_links.extend(self._process_links(links))
        return unique_list(all_links)
class LxmlParserLinkExtractor(object):
...................
    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        # hacky way to get the underlying lxml parsed document
        for el, attr, attr_val in self._iter_links(selector._root):
            # pseudo lxml.html.HtmlElement.make_links_absolute(base_url)
            attr_val = urljoin(base_url, attr_val)
            url = self.process_attr(attr_val)
            if url is None:
                continue
            if isinstance(url, unicode):
                url = url.encode(response_encoding)
            # to fix relative links after process_value
            url = urljoin(response_url, url)
            link = Link(url, _collect_string_content(el) or u'',
                nofollow=True if el.get('rel') == 'nofollow' else False)
            links.append(link)
 
        return unique_list(links, key=lambda link: link.url) \
                if self.unique else links
 
    def extract_links(self, response):
        html = Selector(response)
        base_url = get_base_url(response)
        return self._extract_links(html, response.url, response.encoding, base_url)
...................

Extract file links from an HTML page using tags=('script', 'img') and attrs=('src',), where sle is the link extractor alias imported in the download.py example below:

filesExtractor = sle(allow=("/*"), tags=('script', 'img'), attrs=('src',), deny_extensions=[])
links = [l for l in self.filesExtractor.extract_links(response) if l not in self.seen]
file_item = FileItem()
file_urls = []
if len(links) > 0:
    for link in links:
        self.seen.add(link)
        fullurl = getFullUrl(link.url, self.defaultBaseUrl)
        file_urls.append(fullurl)
    file_item['file_urls'] = file_urls

Scrapy Selector Package

Rule in CrawlSpider

Understanding how Rule is used in CrawlSpider:

  • Constructor of Rule:
    class Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity)
  • Using Rule in CrawlSpider:
    class CrawlSpider(Spider):
        rules = ()
        def __init__(self, *a, **kw):
        ...................
            self._compile_rules()
        def _compile_rules(self):
            def get_method(method):
                if callable(method):
                    return method
                elif isinstance(method, basestring):
                    return getattr(self, method, None)
     
            self._rules = [copy.copy(r) for r in self.rules]
            for rule in self._rules:
                rule.callback = get_method(rule.callback)
                rule.process_links = get_method(rule.process_links)
                rule.process_request = get_method(rule.process_request)    
        def _requests_to_follow(self, response):
            if not isinstance(response, HtmlResponse):
                return
            seen = set()
            for n, rule in enumerate(self._rules):
                links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
                if links and rule.process_links:
                    links = rule.process_links(links)
                for link in links:
                    seen.add(link)
                    r = Request(url=link.url, callback=self._response_downloaded)
                    r.meta.update(rule=n, link_text=link.text)
                    yield rule.process_request(r)        

The recursive algorithm in CrawlSpider:

  • The recursive function _parse_response:
    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item
     
        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item
    1. Step 1: Parse the response and collect all the links and items returned by the callback
    2. Step 2: Parse the response and collect all the links matching the rules defined by the CrawlSpider if follow=True
    3. Step 3: Download the content of all the links collected above
    4. Step 4: Call _parse_response on the content downloaded in Step 3 and go back to Step 1
    5. When does the recursion end? It ends when there are no more links to return from _parse_response
  • Overview of the processing algorithm in CrawlSpider:
    class CrawlSpider(Spider):
    ...................
        def parse(self, response):
            return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)
    1. Step 1: Download all the links defined in start_urls
    2. Step 2: Each response downloaded in Step 1 goes through parse → _parse_response with the default option follow=True, which collects from the response content all the links defined by the rules
    3. Step 3: Download all the links collected in Step 2
    4. Step 4: For the content downloaded in Step 3, if follow=True, collect all the links matching the rules and download them as well

Example: defining Rules in download.py

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
................
class DownloadSpider(CrawlSpider):
    name = "download"
    allowed_domains = ["template-help.com"]
    start_urls = [
        "http://livedemo00.template-help.com/wordpress_44396/"
    ]
    rules = [
        Rule(sle(allow=("/*")), callback='myparse', follow = False, process_request=process_request),
        Rule(sle(allow=("/*"), tags=('link'), deny_extensions = []), callback='parse_css')
    ]
................    
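
The callbacks 'myparse' and 'parse_css' and the process_request hook referenced by these rules are not shown in the excerpt. A hedged sketch of what they might look like (the bodies are assumptions, reusing the imports from the excerpt above):

def process_request(request):
    # Rule-level hook: may modify the Request built for a matched link
    # (returning None instead would drop the request)
    request.meta['from_rule'] = True
    return request

class DownloadSpider(CrawlSpider):
    # ... name, allowed_domains, start_urls and rules as in the excerpt above ...

    def myparse(self, response):
        # callback for pages matched by the first Rule
        self.log("downloaded page: %s" % response.url)

    def parse_css(self, response):
        # callback for <link> resources matched by the second Rule (e.g. stylesheets)
        self.log("downloaded asset: %s" % response.url)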

Code and Debug Scrapy with Eclipse

Scrapy Unittest with trial

Unittest with trial

Refer python unittest: Python Unittest

Create simpletest.py with the content below:

import unittest
class DemoTest(unittest.TestCase):
    def test_passes(self):
        pass
    def test_fails(self):
        self.fail("I failed.")

Run with trial:

trial simpletest.py

Output:

simpletest
  DemoTest
    test_fails ...                                                       [FAIL]
    test_passes ...                                                        [OK]

===============================================================================
[FAIL]
Traceback (most recent call last):
  File "D:\tools\Python27\lib\unittest\case.py", line 327, in run
    testMethod()
  File "D:\simpletest.py", line 6, in test_fails
    self.fail("I failed.")
  File "D:\tools\Python27\lib\unittest\case.py", line 408, in fail
    raise self.failureException(msg)
exceptions.AssertionError: I failed.

simpletest.DemoTest.test_fails
-------------------------------------------------------------------------------
Ran 2 tests in 1.540s

FAILED (failures=1, successes=1)

Scrapy Unittest

Install missing Python packages

  • Install some missing packages with pip (and vsftpd with yum):
    pip install mock
    pip install boto
    pip install django
    yum install vsftpd
  • Install missing packages for running test_pipeline_images.py:
    pip uninstall PIL
    pip install pillow

    And on Windows

    SET VS90COMNTOOLS=%VS100COMNTOOLS%
    easy_install pillow
  • Install the missing bz2 module:
    • Install bzip2-devel
      yum install bzip2-devel
    • Then rebuild the Python source and check the bz2 module:
      ./configure
      make
      ./python -c "import bz2; print bz2.__doc__"
    • Install Python if the bz2 module check is OK:
      make install
  • Fix error “Error loading either pysqlite2 or sqlite3 modules”
    • Install sqlite-devel
      yum install sqlite-devel
    • Then rebuild the Python source and check the sqlite3 module:
      ./configure
      make
      ./python -c "import sqlite3;print sqlite3.__doc__"
    • Install Python if the sqlite3 module check is OK:
      make install

Run scrapy Unittest

Run the command below to run all unit tests for the installed Scrapy:

trial scrapy

Or copy bin/runtests.sh from the Scrapy source to /usr/local/lib/python2.7/site-packages/scrapy/ and run:

./bin/runtests.sh

Go to the Scrapy source directory and run the script below to run all unit tests for the source:

./bin/runtests.sh

To run a single unit test script in \Lib\site-packages\Scrapy-0.22.2-py2.7.egg\scrapy\tests\, for example test_downloadermiddleware.py:

trial test_downloadermiddleware.py