Scrapy

Scrapy Overview

Basic Concepts

  • Command line tool: Learn about the command-line tool used to manage your Scrapy project.
  • Items: Define the data you want to scrape.
  • Spiders: Write the rules to crawl your websites.
  • Selectors: Extract the data from web pages using XPath.
  • Scrapy shell: Test your extraction code in an interactive environment.
  • Item Loaders: Populate your items with the extracted data.
  • Item Pipeline: Post-process and store your scraped data.
  • Feed exports: Output your scraped data using different formats and storages.
  • Link Extractors: Convenient classes to extract links to follow from pages.

Data Flow


The data flow in Scrapy is controlled by the execution engine and goes like this (a small downloader middleware sketch follows the list):

  1. The Engine opens a domain, locates the Spider that handles that domain, and asks the spider for the first URLs to crawl.
  2. The Engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler, as Requests.
  3. The Engine asks the Scheduler for the next URLs to crawl.
  4. The Scheduler returns the next URLs to crawl to the Engine and the Engine sends them to the Downloader, passing through the Downloader Middleware (request direction).
  5. Once the page finishes downloading, the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middleware (response direction).
  6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (input direction).
  7. The Spider processes the Response and returns scraped Items and new Requests (to follow) to the Engine.
  8. The Engine sends scraped Items (returned by the Spider) to the Item Pipeline and Requests (returned by the Spider) to the Scheduler.
  9. The process repeats (from step 2) until there are no more requests from the Scheduler, and the Engine closes the domain.
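
To make the "request direction" and "response direction" in steps 4-6 concrete, below is a minimal downloader middleware sketch. This is an illustration only, not code from this page: the class name, module path and header are made up.

# myscrapy/middlewares.py (hypothetical module)
# enable it in settings.py, e.g.:
# DOWNLOADER_MIDDLEWARES = {'myscrapy.middlewares.CustomHeaderMiddleware': 543}
class CustomHeaderMiddleware(object):

    def process_request(self, request, spider):
        # request direction (step 4): runs before the Downloader fetches the page
        request.headers.setdefault('X-Crawled-By', spider.name)
        return None  # None means: continue handling this request normally

    def process_response(self, request, response, spider):
        # response direction (step 5): sees the downloaded page on its way back to the Engine
        spider.log("downloaded %s (status %d)" % (response.url, response.status))
        return response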

Installation

Installation on Linux

Prerequisites:

Install Python 2.7 from source

Because the yum packages on CentOS 6 provide Python 2.6, we need to install Python 2.7 from source:

yum install bzip2-devel
yum install sqlite-devel
yum install 
wget https://www.python.org/ftp/python/2.7.7/Python-2.7.7.tgz
tar xf Python-2.7.7.tgz
cd Python-2.7.7
./configure
make
make install
Install pip
  • Install:
    wget https://bootstrap.pypa.io/get-pip.py
    python2.7 get-pip.py
  • Check the pip version:
    pip -V
    pip 1.5.6 from /usr/local/lib/python2.7/site-packages (python 2.7)
Install lxml
  • Install prerequisite packages:
    yum install libxml2-devel
    yum install libxslt-devel
  • Install lxml with pip:
    pip install lxml==3.1.2
  • Check the install location:
    /usr/lib64/python2.6/site-packages/lxml
    Or
    /usr/local/lib/python2.7/site-packages/lxml
Install pyopenssl
  • Install prerequisite packages:
    yum install openssl-devel
    yum install libffi-devel
  • Install pyopenssl
    • Install from source
      wget https://pypi.python.org/packages/source/p/pyOpenSSL/pyOpenSSL-0.14.tar.gz
      tar xf pyOpenSSL-0.14.tar.gz
      cd pyOpenSSL-0.14
      python2.7 setup.py --help
      python2.7 setup.py build
      python2.7 setup.py install
    • Or Install with pip
      pip install pyopenssl
Install Twisted-14.0.0 from source
wget http://twistedmatrix.com/Releases/Twisted/14.0/Twisted-14.0.0.tar.bz2
tar xf Twisted-14.0.0.tar.bz2
cd Twisted-14.0.0
python2.7 setup.py --help
python2.7 setup.py build
python2.7 setup.py install

Install scrapy

  • Install
    pip install scrapy
  • Update:
    pip install scrapy --upgrade
  • Build and install from source:
    wget https://codeload.github.com/scrapy/scrapy/legacy.tar.gz/master
    tar xf master
    cd scrapy-scrapy-9d57ecf
    python setup.py install

    Upgrade to master to fix the error below:

    trial test_crawl.py
    scrapy.tests.test_crawl
      CrawlTestCase
        test_delay ... Traceback (most recent call last):
      File "/usr/local/lib/python2.7/runpy.py", line 162, in _run_module_as_main
        "__main__", fname, loader, pkg_name)
      File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
        exec code in run_globals
      File "/usr/local/lib/python2.7/site-packages/scrapy/tests/mockserver.py", line 198, in <module>
        os.path.join(os.path.dirname(__file__), 'keys/cert.pem'),
      File "/usr/local/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/internet/ssl.py", line 104, in __init__
        self.cacheContext()
      File "/usr/local/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/internet/ssl.py", line 113, in cacheContext
        ctx.use_certificate_file(self.certificateFileName)
      File "/usr/local/lib/python2.7/site-packages/OpenSSL/SSL.py", line 391, in use_certificate_file
        _raise_current_error()
      File "/usr/local/lib/python2.7/site-packages/OpenSSL/_util.py", line 22, in exception_from_error_queue
        raise exceptionType(errors)
    OpenSSL.SSL.Error: [('system library', 'fopen', 'No such file or directory'), ('BIO routines', 'FILE_CTRL', 'system lib'), ('SSL routines', 'SSL_CTX_use_certificate_file', 'system lib')]

Install on Windows

Prepare environment variables

  • Set the environment variable for Python:
    set PATH=%PATH%;"D:\tools\Python27\";"D:\tools\Python27\Scripts"
  • Check the current environment variables of Visual Studio:
    set

    output:

    VS100COMNTOOLS=C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\Tools\
    VS110COMNTOOLS=D:\tools\Microsoft Visual Studio 11.0\Common7\Tools\
  • Set the environment variable VS90COMNTOOLS (used by Python to find the compiler on Windows):
    SET VS90COMNTOOLS=%VS100COMNTOOLS%

    ⇒ Fixes the error: Unable to find vcvarsall.bat

  • Upgrade setuptools:
    pip install -U setuptools

Install pyopenssl

Step-by-step installation of pyopenssl:

  1. Download the Visual C++ 2008 (or 2010) redistributable for your Windows version and architecture
  2. Download OpenSSL for your Windows and architecture (the regular version, not the light one)
  3. Prepare environment for building pyopenssl
    SET LIB
    SET INCLUDE
    SET VS90COMNTOOLS=%VS100COMNTOOLS%
    SET INCLUDE=d:\tools\OpenSSL-Win32\include
    SET LIB=d:\tools\OpenSSL-Win32\lib

    And on Windows 64-bit:

    SET LIB
    SET INCLUDE
    SET VS90COMNTOOLS=%VS110COMNTOOLS%
    SET INCLUDE=d:\tools\OpenSSL-Win64\include
    SET LIB=d:\tools\OpenSSL-Win64\lib

    We set VS90COMNTOOLS=%VS100COMNTOOLS% because Visual C++ 2010 registers the VS100COMNTOOLS environment variable when installed on Windows, not the Visual C++ 2008 one. Python uses the VS90COMNTOOLS environment variable to find the compiler on Windows, so we need to point VS90COMNTOOLS at VS100COMNTOOLS.

  4. Install pyopenssl: Go to directory d:\tools\Python27\Scripts\ and run
    easy_install.exe pyopenssl

Issues for building pyopenssl:

  • Issue:
    cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -ID:\tools\Python27\include -ID:\tools\Python27\PC /Tccryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.c /Fobuild\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.obj
    link.exe /DLL /nologo /INCREMENTAL:NO /LIBPATH:D:\tools\Python27\libs /LIBPATH:D:\tools\Python27\PCbuild libeay32.lib ssleay32.lib advapi32.lib /EXPORT:init_Cryptography_cffi_444d7397xa22f8491 build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.obj /OUT:build\lib.win32-2.7\cryptography\_Cryptography_cffi_444d7397xa22f8491.pyd /IMPLIB:build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.lib /MANIFESTFILE:build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest
    mt.exe -nologo -manifest build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest -outputresource:build\lib.win32-2.7\cryptography\_Cryptography_cffi_444d7397xa22f8491.pyd;2
    error: command 'mt.exe' failed with exit status 31
  • Root cause of issue:
    link.exe  /MANIFESTFILE:build\temp.win32-7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest

    ⇒ link.exe cannot create the .manifest file

  • Following the specification at http://msdn.microsoft.com/en-us/library/y0zzbyt4.aspx, we need to add the /MANIFEST option when using link.exe so that it creates the manifest file

Fix: To fix this issue, update msvc9compiler.py to add the build option /MANIFEST just before the line ld_args.append('/MANIFESTFILE:' + temp_manifest):

ld_args.append('/MANIFEST ')
ld_args.append('/MANIFESTFILE:' + temp_manifest)

Install pywin32, twisted

  • Install pywin32 from http://sourceforge.net/projects/pywin32/files/ or install it with easy_install:
    easy_install.exe pypiwin32
  • Go to the directory d:\tools\Python27\Scripts\ and run the command below to install Twisted:
    easy_install.exe twisted

Install lxml

Install lxml version 2.x.x

easy_install.exe lxml==2.3

Install lxml 3.x.x:

  1. Step 1: Download the lxml MS Windows installer package
  2. Step 2: Install it with easy_install:
    easy_install lxml-3.4.0.win32-py2.7.exe

Build lxml from source

git clone git://github.com/lxml/lxml.git lxml
cd lxml
SET VS90COMNTOOLS=%VS100COMNTOOLS%
pip install -r requirements.txt
python setup.py build
python setup.py install

Install scrapy

Go to directory d:\tools\Python27\Scripts\ and run:

pip.exe install scrapy

Install and run Scrapy in a virtual environment with miniconda

  1. Step 1: Create a virtual environment named scrapy and install the scrapy package in it:
    conda create -n scrapy scrapy python=2
  2. Step 2: Activate the scrapy virtual environment:
    activate scrapy

First Project

This tutorial will walk you through these tasks:

  1. Creating a new Scrapy project
  2. Writing a simple spider to crawl a site and running the crawl
  3. Modifying the spider to extract Items with Selectors
  4. Defining an Item to store the extracted data and returning Items to the Item Pipeline
  5. Writing an Item Pipeline to store the extracted Items

Creating a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and then run:

scrapy startproject myscrapy

This will create a myscrapy directory with the following contents:

myscrapy/
    scrapy.cfg
    myscrapy/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These are basically:

  • scrapy.cfg: the project configuration file
  • myscrapy/: the project’s python module, you’ll later import your code from here.
  • myscrapy/items.py: the project’s items file.
  • myscrapy/pipelines.py: the project’s pipelines file.
  • myscrapy/settings.py: the project’s settings file.
  • myscrapy/spiders/: a directory where you’ll later put your spiders.

Create simple Spider

This is the code for our first Spider; save it in a file named dmoz_spider.py under the myscrapy/spiders directory:

from scrapy.spider import Spider
 
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
 
    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

⇒ This code defines a spider with the basic information below:

  • Identifies the Spider:
    name = "dmoz"
  • Lists the domains allowed for crawling:
    allowed_domains = ["dmoz.org"]
  • List of URLs where the Spider will begin to crawl from:
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

Crawling

To run the spider named 'dmoz', go to the project’s top-level directory and run:

scrapy crawl dmoz

Output:

  • crawling log:
    2014-06-11 15:52:30+0700 [scrapy] INFO: Scrapy 0.22.2 started (bot: myscrapy)
    2014-06-11 15:52:30+0700 [scrapy] INFO: Optional features available: ssl, http11, django
    2014-06-11 15:52:30+0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'myscrapy.spiders', 'SPIDER_MODULES': ['myscrapy.spiders'], 'BOT_NAME': 'myscrapy'}
    2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled item pipelines:
    2014-06-11 15:52:30+0700 [dmoz] INFO: Spider opened
    2014-06-11 15:52:30+0700 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2014-06-11 15:52:30+0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2014-06-11 15:52:30+0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2014-06-11 15:52:31+0700 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
    2014-06-11 15:52:31+0700 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
    2014-06-11 15:52:31+0700 [dmoz] INFO: Closing spider (finished)
    2014-06-11 15:52:31+0700 [dmoz] INFO: Dumping Scrapy stats:
            {'downloader/request_bytes': 516,
             'downloader/request_count': 2,
             'downloader/request_method_count/GET': 2,
             'downloader/response_bytes': 16515,
             'downloader/response_count': 2,
             'downloader/response_status_count/200': 2,
             'finish_reason': 'finished',
             'finish_time': datetime.datetime(2014, 6, 11, 8, 52, 31, 299000),
             'log_count/DEBUG': 4,
             'log_count/INFO': 7,
             'response_received_count': 2,
             'scheduler/dequeued': 2,
             'scheduler/dequeued/memory': 2,
             'scheduler/enqueued': 2,
             'scheduler/enqueued/memory': 2,
             'start_time': datetime.datetime(2014, 6, 11, 8, 52, 30, 424000)}
    2014-06-11 15:52:31+0700 [dmoz] INFO: Spider closed (finished)
  • Output files created by the parse function in dmoz_spider:
    Books
    Resources

Modify the spider to extract Items with Selectors

Edit spiders\dmoz_spider.py:

from scrapy.spider import Spider
from scrapy.selector import Selector
 
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
 
    def parse(self, response):
        print "------response.url=",response.url
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title, link, desc

Defining an Item to store the extracted data and returning Items to the Item Pipeline

  • Define the Item class that stores the extracted data in items.py:
    from scrapy.item import Item, Field
     
    class DmozItem(Item):
        title = Field()
        link = Field()
        desc = Field()
  • Modify spiders\dmoz_spider.py to return Items to the Item Pipeline:
    from scrapy.spider import Spider
    from scrapy.selector import Selector
    from myscrapy.items import DmozItem
     
    class DmozSpider(Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]
     
        def parse(self, response):
            print "------response.url=",response.url
            filename = response.url.split("/")[-2]
            open(filename, 'wb').write(response.body)
            sel = Selector(response)
            sites = sel.xpath('//ul/li')
            items = []
            for site in sites:
                item = DmozItem()
                item['title'] = site.xpath('a/text()').extract()
                item['link'] = site.xpath('a/@href').extract()
                item['desc'] = site.xpath('text()').extract()
                items.append(item)
            return items

Storing the scraped data

The simplest way to store the scraped data is by using a feed export, for example in JSON format:

scrapy crawl dmoz -o items.json -t json
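
Other formats supported by the feed exports can be selected with the -t option in the same way; the file names below are arbitrary examples:

scrapy crawl dmoz -o items.csv -t csv
scrapy crawl dmoz -o items.jl -t jsonlines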

Item Pipeline

After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

Typical uses for item pipelines are:

  • cleansing HTML data
  • validating scraped data (checking that the items contain certain fields)
  • checking for duplicates (and dropping them)
  • storing the scraped item in a database

Follow the steps below to enable an Item Pipeline (a fuller pipeline sketch follows these steps):

  • Add the pipeline class to the ITEM_PIPELINES setting in settings.py (the value defines the order in which pipelines run):
    ITEM_PIPELINES = {'myscrapy.pipelines.MyscrapyPipeline': 300}
  • Define class MyscrapyPipeline in pipelines.py:
    class MyscrapyPipeline(object):
        def process_item(self, item, spider):
            return item
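
As a fuller sketch of the typical uses listed above (validation and duplicate dropping), a pipeline might look like the code below. The class name and logic are assumptions built around the DmozItem fields defined earlier, not code from this project:

from scrapy.exceptions import DropItem

class ValidateAndDedupePipeline(object):
    """Drop items that have no title, and items whose link was already seen."""

    def __init__(self):
        self.links_seen = set()

    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem("Missing title in %s" % item)
        link = tuple(item.get('link', []))
        if link in self.links_seen:
            raise DropItem("Duplicate item with link %s" % (link,))
        self.links_seen.add(link)
        return item

Enable it by adding it to ITEM_PIPELINES alongside MyscrapyPipeline; the pipeline with the lower order value runs first.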

Check spiders and run them

  • List all spiders:
    scrapy list
  • Run a spider:
    scrapy crawl spidername

Custom APIs

Selectors

refer: http://doc.scrapy.org/en/latest/topics/selectors.html

When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this:

  • BeautifulSoup is a very popular screen scraping library among Python programmers which constructs a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: it’s slow.
  • lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree (which is not part of the Python standard library).

Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.

How to create a Selector:

  • Create a Selector from text:
    from scrapy.http import HtmlResponse
    from scrapy.selector import Selector
     
    body = '<html><body><span>good</span></body></html>'
    Selector(text=body).xpath('//span/text()').extract()

    ⇒ output:

    [u'good']
  • Create a Selector from a Response:
    from scrapy.http import HtmlResponse
    from scrapy.selector import Selector
     
    body = '<html><body><span>good</span></body></html>'
    response = HtmlResponse(url='http://example.com', body=body)
    Selector(response=response).xpath('//span/text()').extract()

    ⇒ output:

    [u'good']
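  • Selectors also support CSS expressions; a small sketch using the same body (the ::text pseudo-element selects text nodes):
    from scrapy.selector import Selector
     
    body = '<html><body><span>good</span></body></html>'
    Selector(text=body).css('span::text').extract()

    ⇒ output:

    [u'good']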

Rule and linkextractors in CrawlSpider

CrawlSpider uses two basic algorithms:

  1. A recursive algorithm that finds all URLs linked from the start URLs, building up the network of URLs linked to them
  2. A link-extraction algorithm that applies rules to filter out the URLs it wants to download

Scrapy linkextractors package

refer: http://doc.scrapy.org/en/latest/topics/link-extractors.html

Link extractors are objects whose only purpose is to extract links from web pages. The only public method that every link extractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. Link extractors are meant to be instantiated once and their extract_links method called several times with different responses to extract links to follow.

linkextractors classes

Link extractor classes bundled with Scrapy are provided in the scrapy.linkextractors module. Some basic classes in scrapy.linkextractors used to extract links are:

  • scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor. Because of the alias below:
    # Top-level imports
    from .lxmlhtml import LxmlLinkExtractor as LinkExtractor

    ⇒ we can use LxmlLinkExtractor by importing LinkExtractor as shown below:

    from scrapy.linkextractors import LinkExtractor
  • scrapy.linkextractors.htmlparser.HtmlParserLinkExtractor ⇒ deprecated, will be removed in a future release
  • scrapy.linkextractors.sgml.SgmlLinkExtractor ⇒ deprecated, will be removed in a future release

Constructor of LxmlLinkExtractor (a usage sketch follows the parameter list):

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)
  • allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
  • deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (ie. not extracted). It has precedence over the allow parameter. If not given (or empty) it won’t exclude any links.
  • allow_domains (str or list) – a single value or a list of string containing domains which will be considered for extracting the links
  • deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links
  • deny_extensions
  • restrict_xpaths (str or list) – is an XPath (or list of XPath’s) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPath will be scanned for links. See examples below.
  • restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths
  • tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
  • attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',)
  • unique (boolean) – whether duplicate filtering should be applied to extracted links.
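
A small usage sketch combining some of these parameters; the regular expressions, domain and XPath below are illustrative values only, and response is assumed to be a downloaded response inside a spider callback:

from scrapy.linkextractors import LinkExtractor  # alias for LxmlLinkExtractor
 
link_extractor = LinkExtractor(
    allow=(r'/Computers/', ),                         # keep only URLs matching this regex
    deny=(r'/login', ),                               # exclude these URLs even if allowed
    allow_domains=('dmoz.org', ),
    restrict_xpaths=('//ul[@class="directory-url"]', ),
)
 
# extract_links() returns a list of scrapy.link.Link objects found in the response
for link in link_extractor.extract_links(response):
    print link.url, link.text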

How links are extracted in LxmlLinkExtractor:

class LxmlLinkExtractor(FilteringLinkExtractor):
 
    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()):
        ............................
        lx = LxmlParserLinkExtractor(tag=tag_func, attr=attr_func,
            unique=unique, process=process_value)
        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions)
        ............................
    def extract_links(self, response):
        html = Selector(response)
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [subdoc
                    for x in self.restrict_xpaths
                    for subdoc in html.xpath(x)]
        else:
            docs = [html]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            all_links.extend(self._process_links(links))
        return unique_list(all_links)
class LxmlParserLinkExtractor(object):
...................
    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        # hacky way to get the underlying lxml parsed document
        for el, attr, attr_val in self._iter_links(selector._root):
            # pseudo lxml.html.HtmlElement.make_links_absolute(base_url)
            attr_val = urljoin(base_url, attr_val)
            url = self.process_attr(attr_val)
            if url is None:
                continue
            if isinstance(url, unicode):
                url = url.encode(response_encoding)
            # to fix relative links after process_value
            url = urljoin(response_url, url)
            link = Link(url, _collect_string_content(el) or u'',
                nofollow=True if el.get('rel') == 'nofollow' else False)
            links.append(link)
 
        return unique_list(links, key=lambda link: link.url) \
                if self.unique else links
 
    def extract_links(self, response):
        html = Selector(response)
        base_url = get_base_url(response)
        return self._extract_links(html, response.url, response.encoding, base_url)
...................

Extract file links from an HTML page using tags=('script', 'img') and attrs=('src',), where sle is the link extractor alias imported in the download.py example below:

filesExtractor = sle(allow=("/*"), tags=('script', 'img'), attrs=('src',), deny_extensions=[])
links = [l for l in self.filesExtractor.extract_links(response) if l not in self.seen]
file_item = FileItem()
file_urls = []
if len(links) > 0:
    for link in links:
        self.seen.add(link)
        fullurl = getFullUrl(link.url, self.defaultBaseUrl)
        file_urls.append(fullurl)
    file_item['file_urls'] = file_urls

Scrapy Selector Package

Rule in CrawlSpider

Understanding how Rule is used in CrawlSpider:

  • Constructor of Rule:
    class Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity)
  • Using Rule in CrawlSpider:
    class CrawlSpider(Spider):
        rules = ()
        def __init__(self, *a, **kw):
        ...................
            self._compile_rules()
        def _compile_rules(self):
            def get_method(method):
                if callable(method):
                    return method
                elif isinstance(method, basestring):
                    return getattr(self, method, None)
     
            self._rules = [copy.copy(r) for r in self.rules]
            for rule in self._rules:
                rule.callback = get_method(rule.callback)
                rule.process_links = get_method(rule.process_links)
                rule.process_request = get_method(rule.process_request)    
        def _requests_to_follow(self, response):
            if not isinstance(response, HtmlResponse):
                return
            seen = set()
            for n, rule in enumerate(self._rules):
                links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
                if links and rule.process_links:
                    links = rule.process_links(links)
                for link in links:
                    seen.add(link)
                    r = Request(url=link.url, callback=self._response_downloaded)
                    r.meta.update(rule=n, link_text=link.text)
                    yield rule.process_request(r)        

The recursive algorithm in CrawlSpider:

  • The recursive function _parse_response:
    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item
     
        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item
    1. Step 1: Parse the response and collect all the links and items returned by the callback
    2. Step 2: Parse the response and collect all the links matching the rules defined by the CrawlSpider if follow=True
    3. Step 3: Download the content of all the links collected above
    4. Step 4: Call _parse_response on the content downloaded in Step 3 and go back to Step 1
    5. When does the recursion end? It ends when there are no more links to return from _parse_response
  • Overview of the processing algorithm in CrawlSpider:
    class CrawlSpider(Spider):
    ...................
        def parse(self, response):
            return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)
    1. Step 1: Download all the links defined in start_urls
    2. Step 2: Each response downloaded in Step 1 goes through parse → _parse_response with the default option follow=True, which collects from the response content all the links defined by the rules
    3. Step 3: Download all the links collected in Step 2
    4. Step 4: For the content downloaded in Step 3, if follow=True, collect all the links matching the rules and download them as well

Example: defining Rules in download.py

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
................
class DownloadSpider(CrawlSpider):
    name = "download"
    allowed_domains = ["template-help.com"]
    start_urls = [
        "http://livedemo00.template-help.com/wordpress_44396/"
    ]
    rules = [
        Rule(sle(allow=("/*")), callback='myparse', follow = False, process_request=process_request),
        Rule(sle(allow=("/*"), tags=('link'), deny_extensions = []), callback='parse_css')
    ]
................    
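
The callbacks 'myparse' and 'parse_css' and the process_request hook referenced by these rules are not shown in the excerpt. A hedged sketch of what they might look like (the bodies are assumptions, reusing the imports from the excerpt above):

def process_request(request):
    # Rule-level hook: may modify the Request built for a matched link
    # (returning None instead would drop the request)
    request.meta['from_rule'] = True
    return request

class DownloadSpider(CrawlSpider):
    # ... name, allowed_domains, start_urls and rules as in the excerpt above ...

    def myparse(self, response):
        # callback for pages matched by the first Rule
        self.log("downloaded page: %s" % response.url)

    def parse_css(self, response):
        # callback for <link> resources matched by the second Rule (e.g. stylesheets)
        self.log("downloaded asset: %s" % response.url)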

Code and Debug Scrapy with Eclipse

Scrapy Unittest with trial

Unittest with trial

Refer python unittest: Python Unittest

Create simpletest.py with the content below:

import unittest
class DemoTest(unittest.TestCase):
    def test_passes(self):
        pass
    def test_fails(self):
        self.fail("I failed.")

Run with trial:

trial simpletest.py

Output:

simpletest
  DemoTest
    test_fails ...                                                       [FAIL]
    test_passes ...                                                        [OK]

===============================================================================
[FAIL]
Traceback (most recent call last):
  File "D:\tools\Python27\lib\unittest\case.py", line 327, in run
    testMethod()
  File "D:\simpletest.py", line 6, in test_fails
    self.fail("I failed.")
  File "D:\tools\Python27\lib\unittest\case.py", line 408, in fail
    raise self.failureException(msg)
exceptions.AssertionError: I failed.

simpletest.DemoTest.test_fails
-------------------------------------------------------------------------------
Ran 2 tests in 1.540s

FAILED (failures=1, successes=1)

Scrapy Unittest

Install missing Python packages

  • Install some missing packages with pip (and vsftpd with yum):
    pip install mock
    pip install boto
    pip install django
    yum install vsftpd
  • Install missing packages for running test_pipeline_images.py:
    pip uninstall PIL
    pip install pillow

    And on Windows

    SET VS90COMNTOOLS=%VS100COMNTOOLS%
    easy_install pillow
  • Install the missing bz2 module:
    • Install bzip2-devel
      yum install bzip2-devel
    • Then rebuild the Python source and check the bz2 module:
      ./configure
      make
      ./python -c "import bz2; print bz2.__doc__"
    • Install Python if the bz2 module check is OK:
      make install
  • Fix error “Error loading either pysqlite2 or sqlite3 modules”
    • Install sqlite-devel
      yum install sqlite-devel
    • Then rebuild the Python source and check the sqlite3 module:
      ./configure
      make
      ./python -c "import sqlite3;print sqlite3.__doc__"
    • Install Python if the sqlite3 module check is OK:
      make install

Run scrapy Unittest

Run the command below to run all unit tests for the installed Scrapy:

trial scrapy

Or copy bin/runtests.sh from the Scrapy source to /usr/local/lib/python2.7/site-packages/scrapy/ and run:

./bin/runtests.sh

Go to the Scrapy source directory and run the script below to run all unit tests for the source:

./bin/runtests.sh

To run a single unit test script in \Lib\site-packages\Scrapy-0.22.2-py2.7.egg\scrapy\tests\, for example test_downloadermiddleware.py:

trial test_downloadermiddleware.py