
Scrapy

refer:

code examples:

scrapy overview

Basic Concepts

Data Flow


The data flow in Scrapy is controlled by the execution engine and goes like this (a minimal spider sketch illustrating steps 7 and 8 follows the list):

  1. The Engine opens a domain, locates the Spider that handles that domain, and asks the spider for the first URLs to crawl.
  2. The Engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler, as Requests.
  3. The Engine asks the Scheduler for the next URLs to crawl.
  4. The Scheduler returns the next URLs to crawl to the Engine and the Engine sends them to the Downloader, passing through the Downloader Middleware (request direction).
  5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middleware (response direction).
  6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (input direction).
  7. The Spider processes the Response and returns scraped Items and new Requests (to follow) to the Engine.
  8. The Engine sends scraped Items (returned by the Spider) to the Item Pipeline and Requests (returned by the Spider) to the Scheduler.
  9. The process repeats (from step 2) until there are no more requests from the Scheduler, and the Engine closes the domain.
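
As a minimal sketch of steps 7 and 8: a spider callback can return both Items and new Requests from the same parse() method, and the Engine then routes the Items to the Item Pipeline and the Requests to the Scheduler. The spider and item names below are illustrative only, not part of Scrapy:

import urlparse

from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.selector import Selector
from scrapy.spider import Spider

class PageItem(Item):
    # illustrative single-field item
    url = Field()

class FlowSpider(Spider):
    name = "flow"
    start_urls = ["http://www.dmoz.org/"]

    def parse(self, response):
        # step 7: the Spider returns scraped Items ...
        yield PageItem(url=response.url)
        # ... and new Requests to follow (step 8: the Engine sends the Items
        # to the Item Pipeline and the Requests to the Scheduler)
        sel = Selector(response)
        for href in sel.xpath('//a/@href').extract():
            yield Request(urlparse.urljoin(response.url, href), callback=self.parse)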

Installation

Installation on linux

Pre-requisites:

Install python 2.7 from source

Because the Python package in the CentOS 6 yum repositories is Python 2.6, we need to install Python 2.7 from source:

yum install bzip2-devel
yum install sqlite-devel
yum install 
wget https://www.python.org/ftp/python/2.7.7/Python-2.7.7.tgz
tar xf Python-2.7.7.tgz
cd Python-2.7.7
./configure
make
make install
Install pip
Install lxml
Install pyopenssl
Install Twisted-14.0.0 from source:
wget http://twistedmatrix.com/Releases/Twisted/14.0/Twisted-14.0.0.tar.bz2
tar xf Twisted-14.0.0.tar.bz2
cd Twisted-14.0.0
python2.7 setup.py --help
python2.7 setup.py build
python2.7 setup.py install
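
To sanity-check the build, here is a small illustrative script (not part of any official docs) that verifies the new interpreter and Twisted import cleanly; run it with python2.7 check_install.py:

# check_install.py -- illustrative sanity check
import sys
print "Python :", sys.version.split()[0]   # expect 2.7.x

import twisted
print "Twisted:", twisted.__version__      # expect 14.0.0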

Install scrapy

Install on windows

Prepare environment variables

Install pyopenssl

Step-by-step install of pyopenssl:

  1. Download the Visual C++ 2008 (or 2010) redistributables for your Windows version and architecture
  2. Download OpenSSL for your Windows and architecture (the regular version, not the light one)
  3. Prepare environment for building pyopenssl
    SET LIB
    SET INCLUDE
    SET VS90COMNTOOLS=%VS100COMNTOOLS%
    SET INCLUDE=d:\tools\OpenSSL-Win32\include
    SET LIB=d:\tools\OpenSSL-Win32\lib

    And on Windows 64-bit:

    SET LIB
    SET INCLUDE
    SET VS90COMNTOOLS=%VS110COMNTOOLS%
    SET INCLUDE=d:\tools\OpenSSL-Win64\include
    SET LIB=d:\tools\OpenSSL-Win64\lib

    We set the environment variable VS90COMNTOOLS=%VS100COMNTOOLS% because Visual C++ 2010 registers itself on Windows under the environment variable VS100COMNTOOLS, not the Visual C++ 2008 one. Python uses the environment variable VS90COMNTOOLS to find the compiler on Windows, so we need to set VS90COMNTOOLS=%VS100COMNTOOLS%.

  4. Install pyopenssl: Go to directory d:\tools\Python27\Scripts\ and run
    easy_install.exe pyopenssl

Issues for building pyopenssl:

Fix: to fix this issue, update msvc9compiler.py to add the build option /MANIFEST before the line ld_args.append('/MANIFESTFILE:' + temp_manifest):

ld_args.append('/MANIFEST ')
ld_args.append('/MANIFESTFILE:' + temp_manifest)

Install pywin32, twisted

Install lxml

Install lxml version 2.x.x

easy_install.exe lxml==2.3

Install lxml 3.x.x:

  1. Step1: Download lxml MS Windows Installer packages
  2. Step2: Install with easy_install:
    easy_install lxml-3.4.0.win32-py2.7.exe

Build lxml from source

git clone https://github.com/lxml/lxml.git lxml
cd lxml
SET VS90COMNTOOLS=%VS100COMNTOOLS%
pip install -r requirements.txt
python setup.py build
python setup.py install

Install scrapy

Go to directory d:\tools\Python27\Scripts\ and run:

pip.exe install scrapy

Install and run scrapy in a virtual environment with miniconda

  1. Step1: Create a virtual environment named scrapy and install the scrapy package in it
    conda create -n scrapy scrapy python=2
  2. Step2: Activate the virtual environment scrapy:
    activate scrapy

First Project

This tutorial will walk you through these tasks:

  1. Creating a new Scrapy project
  2. Writing a simple Spider to crawl a site and running the crawl
  3. Modifying the Spider to extract Items with Selectors
  4. Defining an Item object to store the extracted Items and returning them to the Item Pipeline
  5. Writing an Item Pipeline to store the extracted Items

Creating a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and then run:

scrapy startproject myscrapy

This will create a myscrapy directory with the following contents:

myscrapy/
    scrapy.cfg
    myscrapy/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These are basically:

  1. scrapy.cfg: the project configuration file
  2. myscrapy/: the project's Python module, you'll import your code from here
  3. myscrapy/items.py: the project's items file
  4. myscrapy/pipelines.py: the project's pipelines file
  5. myscrapy/settings.py: the project's settings file
  6. myscrapy/spiders/: a directory where you'll later put your spiders

Create simple Spider

This is the code for our first Spider; save it in a file named dmoz_spider.py under the myscrapy/spiders directory:

from scrapy.spider import Spider
 
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
 
    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

⇒ This code defines a Spider with the following basic attributes: name (which identifies the Spider), allowed_domains, start_urls (the URLs the crawl starts from), and the parse() callback that handles each downloaded Response.

Crawling

To run the spider named 'dmoz', go to the project’s top-level directory and run:

scrapy crawl dmoz

Output:

Modify spider to extract Items with Selector

Edit spiders\dmoz_spider.py:

from scrapy.spider import Spider
from scrapy.selector import Selector
 
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
 
    def parse(self, response):
        print "------response.url=",response.url
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title, link, desc

Defining an Item object to store the extracted Items and return them to the Item Pipeline
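
The items.py content for this step is not shown above; below is a minimal sketch of what it could look like for the dmoz spider. DmozItem and its fields are assumptions chosen to match the selectors used in parse(), not anything prescribed by Scrapy:

# myscrapy/items.py
from scrapy.item import Item, Field

class DmozItem(Item):
    # one Field per piece of data extracted in parse()
    title = Field()
    link = Field()
    desc = Field()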

Storing the scraped data

scrapy crawl dmoz -o items.json -t json
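
The export file only contains data if parse() returns Items instead of just printing them. A hedged sketch of the modified spider, assuming the DmozItem sketched above:

from scrapy.spider import Spider
from scrapy.selector import Selector

from myscrapy.items import DmozItem   # assumed to be defined as sketched above

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        for site in sel.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            yield item   # returned Items go to the feed export / Item Pipeline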

Item Pipeline

After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

Typical uses for item pipelines are:

Follow the steps below for the Item Pipeline (a minimal pipeline sketch follows):

Check the spiders and then run the crawl.
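
As a minimal sketch of an Item Pipeline (the class name, file name and priority value are assumptions, not fixed by Scrapy), here is a pipeline that appends each Item to a JSON-lines file:

# myscrapy/pipelines.py
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # called when the spider is opened
        self.file = open('items.jl', 'wb')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # called for every Item returned by the spider
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

Enable it in myscrapy/settings.py (dict form; very old Scrapy versions used a plain list instead):

ITEM_PIPELINES = {
    'myscrapy.pipelines.JsonWriterPipeline': 300,
}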

Custom APIs

Selectors

http://doc.scrapy.org/en/latest/topics/selectors.html
When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this, such as BeautifulSoup and lxml.

Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.

How to create a Selector:
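
A short hedged sketch of the common ways to construct a Selector; the HTML snippets and URL are made-up examples:

from scrapy.http import HtmlResponse
from scrapy.selector import Selector

# 1. From a Response object (what spiders usually do)
response = HtmlResponse(url='http://example.com',
                        body='<html><body><a href="/x">link</a></body></html>')
sel = Selector(response)
print sel.xpath('//a/@href').extract()     # [u'/x']

# 2. Directly from a text string
sel = Selector(text='<html><body><span>hello</span></body></html>')
print sel.xpath('//span/text()').extract() # [u'hello']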

Rule and linkextractors in CrawlSpider

CrawlSpider uses 2 basic algorithms:

  1. A recursive algorithm that finds all URLs linked from the start URLs, building up the network of URLs linked to them
  2. A link-extraction algorithm that applies the rules to filter out the URLs it wants to download

Scrapy linkextractors package

refer: http://doc.scrapy.org/en/latest/topics/link-extractors.html

Link extractors are objects whose only purpose is to extract links from web pages. The only public method that every link extractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. Link extractors are meant to be instantiated once and their extract_links method called several times with different responses to extract links to follow.

linkextractors classes

Link extractor classes bundled with Scrapy are provided in the scrapy.linkextractors module. Some basic classes in scrapy.linkextractors used to extract links:

Constructor of LxmlLinkExtractor:

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)
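
A short hedged usage sketch (the allow pattern, URL and HTML body are illustrative):

from scrapy.http import HtmlResponse
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

body = ('<html><body>'
        '<a href="/Computers/Programming/">programming</a>'
        '<a href="/News/">news</a>'
        '</body></html>')
response = HtmlResponse(url='http://www.dmoz.org/', body=body)

lx = LxmlLinkExtractor(allow=('/Computers/', ))
for link in lx.extract_links(response):
    # each result is a scrapy.link.Link with url, text and nofollow attributes
    print link.url, link.text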

How to extract links in LxmlLinkExtractor:

class LxmlLinkExtractor(FilteringLinkExtractor):
 
    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()):
        ............................
        lx = LxmlParserLinkExtractor(tag=tag_func, attr=attr_func,
            unique=unique, process=process_value)
        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions)
        ............................
    def extract_links(self, response):
        html = Selector(response)
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [subdoc
                    for x in self.restrict_xpaths
                    for subdoc in html.xpath(x)]
        else:
            docs = [html]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            all_links.extend(self._process_links(links))
        return unique_list(all_links)
class LxmlParserLinkExtractor(object):
...................
    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        # hacky way to get the underlying lxml parsed document
        for el, attr, attr_val in self._iter_links(selector._root):
            # pseudo lxml.html.HtmlElement.make_links_absolute(base_url)
            attr_val = urljoin(base_url, attr_val)
            url = self.process_attr(attr_val)
            if url is None:
                continue
            if isinstance(url, unicode):
                url = url.encode(response_encoding)
            # to fix relative links after process_value
            url = urljoin(response_url, url)
            link = Link(url, _collect_string_content(el) or u'',
                nofollow=True if el.get('rel') == 'nofollow' else False)
            links.append(link)
 
        return unique_list(links, key=lambda link: link.url) \
                if self.unique else links
 
    def extract_links(self, response):
        html = Selector(response)
        base_url = get_base_url(response)
        return self._extract_links(html, response.url, response.encoding, base_url)
...................

Extract file links in an HTML page from tags=('script', 'img') and attrs=('src',):

filesExtractor = sle(allow=("/*"), tags=('script', 'img'), attrs=('src',), deny_extensions=[])

links = [l for l in self.filesExtractor.extract_links(response) if l not in self.seen]
file_item = FileItem()
file_urls = []
if len(links) > 0:
    for link in links:
        self.seen.add(link)
        fullurl = getFullUrl(link.url, self.defaultBaseUrl)
        file_urls.append(fullurl)
    file_item['file_urls'] = file_urls

Scrapy Selector Package

Rule in CrawlSpider

Understanding how Rule is used in CrawlSpider:

The recursive algorithm in CrawlSpider:

Example defining Rules in download.py:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
................
class DownloadSpider(CrawlSpider):
    name = "download"
    allowed_domains = ["template-help.com"]
    start_urls = [
        "http://livedemo00.template-help.com/wordpress_44396/"
    ]
    rules = [
        Rule(sle(allow=("/*")), callback='myparse', follow=False, process_request=process_request),
        Rule(sle(allow=("/*"), tags=('link',), deny_extensions=[]), callback='parse_css')
    ]
................    
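
The callbacks and the process_request hook referenced in the rules are elided above; here is a hedged sketch of the full download.py, with illustrative bodies for the missing parts:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle

def process_request(request):
    # hypothetical hook: return the request unchanged (return None to drop it)
    return request

class DownloadSpider(CrawlSpider):
    name = "download"
    allowed_domains = ["template-help.com"]
    start_urls = [
        "http://livedemo00.template-help.com/wordpress_44396/"
    ]
    rules = [
        Rule(sle(allow=("/*")), callback='myparse', follow=False, process_request=process_request),
        Rule(sle(allow=("/*"), tags=('link',), deny_extensions=[]), callback='parse_css')
    ]

    def myparse(self, response):
        # illustrative callback: just log every page matched by the first rule
        self.log("crawled %s" % response.url)

    def parse_css(self, response):
        # illustrative callback for <link> targets (stylesheets etc.)
        self.log("downloaded link resource %s" % response.url)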

Code and Debug Scrapy with Eclipse

refer: Python Editor with pydev plugin on Eclipse

Scrapy Unittest with trial

Unittest with trial

Refer python unittest: Python Unittest

Create simpletest.py with the content below:

import unittest
class DemoTest(unittest.TestCase):
    def test_passes(self):
        pass
    def test_fails(self):
        self.fail("I failed.")

Run with trial:

trial simpletest.py

Output:

simpletest
  DemoTest
    test_fails ...                                                       [FAIL]
    test_passes ...                                                        [OK]

===============================================================================
[FAIL]
Traceback (most recent call last):
  File "D:\tools\Python27\lib\unittest\case.py", line 327, in run
    testMethod()
  File "D:\simpletest.py", line 6, in test_fails
    self.fail("I failed.")
  File "D:\tools\Python27\lib\unittest\case.py", line 408, in fail
    raise self.failureException(msg)
exceptions.AssertionError: I failed.

simpletest.DemoTest.test_fails
-------------------------------------------------------------------------------
Ran 2 tests in 1.540s

FAILED (failures=1, successes=1)

Scrapy Unittest

Install missing Python package

Run scrapy Unittest

Run the command below to run all unit tests for the installed scrapy:

trial scrapy

Or copy bin/runtests.sh from the scrapy source to /usr/local/lib/python2.7/site-packages/scrapy/ and run:

./bin/runtests.sh

Go to the scrapy source directory and run the script below to run all unit tests for the scrapy source:

./bin/runtests.sh

To run a single unit test script in \Lib\site-packages\Scrapy-0.22.2-py2.7.egg\scrapy\tests\, for example test_downloadermiddleware.py:

trial test_downloadermiddleware.py