====== Scrapy ======
refer:
* http://doc.scrapy.org/en/latest/
code examples:
* https://code.google.com/p/scrapy-tutorial/ : git clone https://code.google.com/p/scrapy-tutorial/
* https://code.google.com/p/scrapy-spider/ : svn checkout http://scrapy-spider.googlecode.com/svn/trunk/ scrapy-spider-read-only
* https://code.google.com/p/kateglo-crawler/ : svn checkout http://kateglo-crawler.googlecode.com/svn/trunk/ kateglo-crawler-read-only
* https://code.google.com/p/ina-news-crawler/ : svn checkout http://ina-news-crawler.googlecode.com/svn/trunk/ ina-news-crawler-read-only
* https://code.google.com/p/simple-dm-forums/ : hg clone https://code.google.com/p/simple-dm-forums/
* https://code.google.com/p/tanqing-web-based-scrapy/ : svn checkout http://tanqing-web-based-scrapy.googlecode.com/svn/trunk/ tanqing-web-based-scrapy-read-only
* https://code.google.com/p/scrapy/ : svn checkout http://scrapy.googlecode.com/svn/trunk/ scrapy-read-only
* https://github.com/geekan/scrapy-examples : git clone https://github.com/geekan/scrapy-examples.git
* https://github.com/php5engineer/scrapy-drupal-org : git clone https://github.com/php5engineer/scrapy-drupal-org.git
===== scrapy overview =====
==== Basic Concepts ====
* **Command line tool**: Learn about the command-line tool used to manage your Scrapy project.
* **Items**: Define the data you want to scrape.
* **Spiders**: Write the rules to crawl your websites.
* **Selectors**: Extract the data from web pages using XPath.
* **Scrapy shell**: Test your extraction code in an interactive environment.
* **Item Loaders**: Populate your items with the extracted data (a short sketch follows this list).
* **Item Pipeline**: Post-process and store your scraped data.
* **Feed exports**: Output your scraped data using different formats and storages.
* **Link Extractors**: Convenient classes to extract links to follow from pages.
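For example, a minimal Item Loader sketch. The spider and item names here are hypothetical and not part of the tutorial project; on Scrapy 0.22 the import path is scrapy.contrib.loader, in newer releases it is scrapy.loader:
  from scrapy.spider import Spider
  from scrapy.item import Item, Field
  from scrapy.contrib.loader import ItemLoader   # scrapy.loader in newer releases
  
  class DemoItem(Item):
      title = Field()
      link = Field()
  
  class DemoLoaderSpider(Spider):
      name = "demo_loader"                         # hypothetical spider name
      start_urls = ["http://www.dmoz.org/"]
  
      def parse(self, response):
          # populate the item from the response instead of assigning fields by hand
          loader = ItemLoader(item=DemoItem(), response=response)
          loader.add_xpath('title', '//title/text()')
          loader.add_value('link', response.url)
          return loader.load_item()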
==== Data Flow ====
{{:crawler:scrapy_architecture.png|}}\\
The data flow in Scrapy is controlled by the execution engine and goes like this (a spider sketch illustrating steps 7 and 8 follows the list):
- The Engine opens a domain, locates the Spider that handles that domain, and asks the spider for the first URLs to crawl.
- The Engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler, as Requests.
- The Engine asks the Scheduler for the next URLs to crawl.
- The Scheduler returns the next URLs to **crawl** to **the Engine** and the **Engine sends them to the Downloader**, passing through the Downloader Middleware (**request direction**).
- Once the page **finishes downloading** the **Downloader** generates a **Response** (with that page) and sends it to the Engine, passing through the Downloader Middleware (**response direction**).
- **The Engine** receives the Response from the Downloader and **sends it to the Spider for processing**, passing through the Spider Middleware (**input direction**).
- **The Spider** processes the Response and returns **scraped Items** and **new Requests** (to follow) to the Engine.
- **The Engine** sends **scraped Items** (returned by the Spider) to the **Item Pipeline** and **Requests** (returned by spider) **to the Scheduler**
- The process **repeats (from step 2)** until there are **no more requests from the Scheduler**, and the Engine closes the domain.
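To make steps 7 and 8 concrete, here is a minimal spider sketch (hypothetical names, not part of the tutorial project) whose callback returns both scraped Items and new Requests:
  from urlparse import urljoin
  
  from scrapy.spider import Spider
  from scrapy.http import Request
  from scrapy.item import Item, Field
  from scrapy.selector import Selector
  
  class PageItem(Item):
      url = Field()
      title = Field()
  
  class FlowSpider(Spider):
      name = "flow_demo"                       # hypothetical spider name
      start_urls = ["http://www.dmoz.org/"]
  
      def parse(self, response):
          sel = Selector(response)
          # step 7: the Spider returns scraped Items ...
          item = PageItem()
          item['url'] = response.url
          item['title'] = sel.xpath('//title/text()').extract()
          yield item
          # ... and new Requests to follow; the Engine sends the Items to the
          # Item Pipeline and the Requests back to the Scheduler (step 8)
          for href in sel.xpath('//a/@href').extract():
              yield Request(urljoin(response.url, href), callback=self.parse)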
===== Installation =====
==== Installation on linux ====
=== Pre-requisites: ===
* Python 2.7
* lxml: http://lxml.de/installation.html
== Install python 2.7 from source ==
Because the yum packages on CentOS 6 provide Python 2.6, we need to install Python 2.7 from source:
yum install bzip2-devel
yum install sqlite-devel
wget https://www.python.org/ftp/python/2.7.7/Python-2.7.7.tgz
tar xf Python-2.7.7.tgz
cd Python-2.7.7
./configure
make
make install
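A quick way to confirm the interpreter that was just installed (a minimal check; paths may differ on your system):
  # save as check_python.py and run with: python2.7 check_python.py
  import sys
  print sys.version       # should report 2.7.7
  print sys.executable    # e.g. /usr/local/bin/python2.7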
== install pip ==
* install
wget https://bootstrap.pypa.io/get-pip.py
python2.7 get-pip.py
* check pip
pip -V
pip 1.5.6 from /usr/local/lib/python2.7/site-packages (python 2.7)
== install lxml ==
* Install prerequisite packages
yum install libxml2-devel
yum install libxslt-devel
* install lxml with pip
pip install lxml==3.1.2
* check installed:
/usr/lib64/python2.6/site-packages/lxml
Or
/usr/local/lib/python2.7/site-packages/lxml
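Alternatively, verify lxml from Python itself (a minimal check; version numbers depend on what was installed):
  # save as check_lxml.py and run with: python2.7 check_lxml.py
  import lxml.etree
  print lxml.etree.LXML_VERSION      # e.g. (3, 1, 2, 0)
  print lxml.etree.LIBXML_VERSION    # the libxml2 version it was built against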
== install pyopenssl ==
* Install prerequisite packages
yum install openssl-devel
yum install libffi-devel
* Install pyopenssl
* Install from source
wget https://pypi.python.org/packages/source/p/pyOpenSSL/pyOpenSSL-0.14.tar.gz
tar xf pyOpenSSL-0.14.tar.gz
cd pyOpenSSL-0.14
python2.7 setup.py --help
python2.7 setup.py build
python2.7 setup.py install
* Or install with pip
pip install pyopenssl
== install Twisted-14.0.0 from source ==
wget http://twistedmatrix.com/Releases/Twisted/14.0/Twisted-14.0.0.tar.bz2
tar xf Twisted-14.0.0.tar.bz2
cd Twisted-14.0.0
python2.7 setup.py --help
python2.7 setup.py build
python2.7 setup.py install
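To confirm that both pyOpenSSL and Twisted are importable from the new Python 2.7 (a minimal sanity check):
  # save as check_ssl_twisted.py and run with: python2.7 check_ssl_twisted.py
  import OpenSSL
  import twisted
  print OpenSSL.__version__    # e.g. 0.14
  print twisted.version        # e.g. [Twisted, version 14.0.0]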
=== Install scrapy ===
* Install
pip install scrapy
* update
pip install scrapy --upgrade
* build and install from source:
wget https://codeload.github.com/scrapy/scrapy/legacy.tar.gz/master
tar xf master
cd scrapy-scrapy-9d57ecf
python setup.py install
Upgrade to master to fix the error below:
trial test_crawl.py
scrapy.tests.test_crawl
CrawlTestCase
test_delay ... Traceback (most recent call last):
File "/usr/local/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/usr/local/lib/python2.7/site-packages/scrapy/tests/mockserver.py", line 198, in
os.path.join(os.path.dirname(__file__), 'keys/cert.pem'),
File "/usr/local/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/internet/ssl.py", line 104, in __init__
self.cacheContext()
File "/usr/local/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/internet/ssl.py", line 113, in cacheContext
ctx.use_certificate_file(self.certificateFileName)
File "/usr/local/lib/python2.7/site-packages/OpenSSL/SSL.py", line 391, in use_certificate_file
_raise_current_error()
File "/usr/local/lib/python2.7/site-packages/OpenSSL/_util.py", line 22, in exception_from_error_queue
raise exceptionType(errors)
OpenSSL.SSL.Error: [('system library', 'fopen', 'No such file or directory'), ('BIO routines', 'FILE_CTRL', 'system lib'), ('SSL routines', 'SSL_CTX_use_certificate_file', 'system lib')]
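After Scrapy is installed by either route, a quick sanity check from Python (output depends on the installed version):
  # save as check_scrapy.py and run with: python2.7 check_scrapy.py
  import scrapy
  from scrapy.selector import Selector
  print scrapy.version_info                                       # e.g. (0, 22, 2)
  print Selector(text='<p>ok</p>').xpath('//p/text()').extract()  # [u'ok']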
==== Install on windows ====
=== Prepare environment variables ===
* set environment variable for python
set PATH=%PATH%;"D:\tools\Python27\";"D:\tools\Python27\Scripts"
* Check the current environment variables of Visual Studio:
set
output:
VS100COMNTOOLS=C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\Tools\
VS110COMNTOOLS=D:\tools\Microsoft Visual Studio 11.0\Common7\Tools\
* Set the VS90COMNTOOLS environment variable (used by Python to locate the compiler on Windows)
SET VS90COMNTOOLS=%VS100COMNTOOLS%
=> This fixes the error: Unable to find vcvarsall.bat
* upgrade setuptools:
pip install -U setuptools
=== Install pyopenssl ===
Step-by-step OpenSSL installation:
- Goto page http://slproweb.com/products/Win32OpenSSL.html
- Download Visual C++ 2008(or 2010) redistributables for your Windows and architecture
- Download OpenSSL for your Windows and architecture (the regular version, not the light one)
- Prepare environment for building pyopenssl
SET LIB
SET INCLUDE
SET VS90COMNTOOLS=%VS100COMNTOOLS%
SET INCLUDE=d:\tools\OpenSSL-Win32\include
SET LIB=d:\tools\OpenSSL-Win32\lib
And on 64-bit Windows:
SET LIB
SET INCLUDE
SET VS90COMNTOOLS=%VS110COMNTOOLS%
SET INCLUDE=d:\tools\OpenSSL-Win64\include
SET LIB=d:\tools\OpenSSL-Win64\lib
We set VS90COMNTOOLS=%VS100COMNTOOLS% because Visual C++ 2010 is registered on Windows under the VS100COMNTOOLS variable, not under the Visual C++ 2008 one. Python uses the VS90COMNTOOLS variable to locate the compiler on Windows, so we have to point VS90COMNTOOLS at the installed toolchain.
- Install pyopenssl: Go to directory d:\tools\Python27\Scripts\ and run
easy_install.exe pyopenssl
Issues for building pyopenssl:
* Issue:
cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -ID:\tools\Python27\include -ID:\tools\Python27\PC /Tccryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.c /Fobuild\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.obj
link.exe /DLL /nologo /INCREMENTAL:NO /LIBPATH:D:\tools\Python27\libs /LIBPATH:D:\tools\Python27\PCbuild libeay32.lib ssleay32.lib advapi32.lib /EXPORT:init_Cryptography_cffi_444d7397xa22f8491 build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.obj /OUT:build\lib.win32-2.7\cryptography\_Cryptography_cffi_444d7397xa22f8491.pyd /IMPLIB:build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.lib /MANIFESTFILE:build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest
mt.exe -nologo -manifest build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest -outputresource:build\lib.win32-2.7\cryptography\_Cryptography_cffi_444d7397xa22f8491.pyd;2
error: command 'mt.exe' failed with exit status 31
* Root cause of issue: link.exe /MANIFESTFILE:build\temp.win32-7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest
=> The .manifest file cannot be created
* Following the specification at http://msdn.microsoft.com/en-us/library/y0zzbyt4.aspx, we need to add the **/MANIFEST** option when invoking link.exe so that the manifest file is created
Fix: update msvc9compiler.py to add the build option /MANIFEST just before the line **ld_args.append('/MANIFESTFILE:' + temp_manifest)**:
ld_args.append('/MANIFEST ')
ld_args.append('/MANIFESTFILE:' + temp_manifest)
=== Install pywin32, twisted ===
* Install pywin32 from http://sourceforge.net/projects/pywin32/files/ or install it with easy_install:
easy_install.exe pypiwin32
* Go to the directory d:\tools\Python27\Scripts\ and run the command below to install Twisted:
easy_install.exe twisted
=== Install lxml ===
Install lxml version 2.x.x
easy_install.exe lxml==2.3
Install lxml 3.x.x:
- Step1: Download lxml MS Windows Installer packages
- Step2: Install with easy_install:
easy_install lxml-3.4.0.win32-py2.7.exe
Build lxml from source:
git clone https://github.com/lxml/lxml.git lxml
cd lxml
SET VS90COMNTOOLS=%VS100COMNTOOLS%
pip install -r requirements.txt
python setup.py build
python setup.py install
=== Install scrapy ===
Go to directory d:\tools\Python27\Scripts\ and run:
pip.exe install scrapy
==== Install and run scrapy on virtual environment with miniconda ====
- Step1: Create virtual environment with name scrapy and install package scrapy in this environment
conda create -n scrapy scrapy python=2
- Step2: Go to virtual environment scrapy:
activate scrapy
===== First Project =====
This tutorial will walk you through these tasks:
- Creating a new Scrapy project
- Writing a simple spider to crawl a site and running the crawl
- Modifying the spider to extract Items with a **Selector**
- Defining an Item class to store the extracted data and returning Items to the **Item Pipeline**
- Writing an **Item Pipeline** to store the extracted Items
==== Creating a project ====
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and then run:
scrapy startproject myscrapy
This will create a myscrapy directory with the following contents:
myscrapy/
scrapy.cfg
myscrapy/
__init__.py
items.py
pipelines.py
settings.py
spiders/
__init__.py
...
These are basically:
* scrapy.cfg: the project configuration file
* myscrapy/: the project’s python module, you’ll later import your code from here.
* myscrapy/items.py: the project’s items file.
* myscrapy/pipelines.py: the project’s pipelines file.
* myscrapy/settings.py: the project’s settings file.
* myscrapy/spiders/: a directory where you’ll later put your spiders.
==== Create simple Spider ====
This is the code for our first Spider; save it in a file named dmoz_spider.py under the **myscrapy/spiders** directory:
from scrapy.spider import Spider
class DmozSpider(Spider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
filename = response.url.split("/")[-2]
open(filename, 'wb').write(response.body)
=> This code defines a spider with the basic information below:
* Identifies the Spider:
name = "dmoz"
* The list of domains allowed for crawling:
allowed_domains = ["dmoz.org"]
* List of URLs where the Spider will begin to crawl from:
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
==== Crawling ====
To run the spider named **'dmoz'**, go to the project’s top level directory and run:
scrapy crawl dmoz
Output:
* crawling log:
2014-06-11 15:52:30+0700 [scrapy] INFO: Scrapy 0.22.2 started (bot: myscrapy)
2014-06-11 15:52:30+0700 [scrapy] INFO: Optional features available: ssl, http11, django
2014-06-11 15:52:30+0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'myscrapy.spiders', 'SPIDER_MODULES': ['myscrapy.spiders'], 'BOT_NAME': 'myscrapy'}
2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeader
ware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled item pipelines:
2014-06-11 15:52:30+0700 [dmoz] INFO: Spider opened
2014-06-11 15:52:30+0700 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-11 15:52:30+0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-11 15:52:30+0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-11 15:52:31+0700 [dmoz] DEBUG: Crawled (200) (referer: None)
2014-06-11 15:52:31+0700 [dmoz] DEBUG: Crawled (200) (referer: None)
2014-06-11 15:52:31+0700 [dmoz] INFO: Closing spider (finished)
2014-06-11 15:52:31+0700 [dmoz] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 516,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 16515,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 6, 11, 8, 52, 31, 299000),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2014, 6, 11, 8, 52, 30, 424000)}
2014-06-11 15:52:31+0700 [dmoz] INFO: Spider closed (finished)
* Output files created by the **parse function** in dmoz_spider:
Books
Resources
==== Modify spider to extract Items with Selector ====
Edit **spiders/dmoz_spider.py**:
from scrapy.spider import Spider
from scrapy.selector import Selector
class DmozSpider(Spider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
print "------response.url=",response.url
sel = Selector(response)
sites = sel.xpath('//ul/li')
for site in sites:
title = site.xpath('a/text()').extract()
link = site.xpath('a/@href').extract()
desc = site.xpath('text()').extract()
print title, link, desc
==== Defining an Item to store extracted data and returning Items for the Item Pipeline ====
* Define the Item class that stores the extracted data in **items.py** (a dict-style usage check follows these two steps):
from scrapy.item import Item, Field
class DmozItem(Item):
title = Field()
link = Field()
desc = Field()
* Modify **spiders/dmoz_spider.py** to return Items for the **Item Pipeline**:
from scrapy.spider import Spider
from scrapy.selector import Selector
from myscrapy.items import DmozItem
class DmozSpider(Spider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
print "------response.url=",response.url
filename = response.url.split("/")[-2]
open(filename, 'wb').write(response.body)
sel = Selector(response)
sites = sel.xpath('//ul/li')
items = []
for site in sites:
item = DmozItem()
item['title'] = site.xpath('a/text()').extract()
item['link'] = site.xpath('a/@href').extract()
item['desc'] = site.xpath('text()').extract()
items.append(item)
return items
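Items behave like dictionaries. A quick interactive check of the DmozItem defined above (run python from the project directory so that myscrapy is importable; the sample values are made up):
  >>> from myscrapy.items import DmozItem
  >>> item = DmozItem(title=['Example book'])
  >>> item['link'] = ['http://example.com/']
  >>> item['title']
  ['Example book']
  >>> dict(item)                  # field order may differ
  {'link': ['http://example.com/'], 'title': ['Example book']}
  >>> item['unknown'] = 'x'       # only declared fields are accepted
  Traceback (most recent call last):
      ...
  KeyError: 'DmozItem does not support field: unknown'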
==== Storing the scraped data ====
scrapy crawl dmoz -o items.json -t json
==== Item Pipeline ====
After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.
Typical uses for item pipelines are (a validation example follows the skeleton below):
* cleansing HTML data
* validating scraped data (checking that the items contain certain fields)
* checking for duplicates (and dropping them)
* storing the scraped item in a database
Follow the steps below to enable an Item Pipeline:
* Register the pipeline class in the ITEM_PIPELINES setting in settings.py (the number controls the order in which pipelines run):
ITEM_PIPELINES = {'myscrapy.pipelines.MyscrapyPipeline': 300}
* Define class MyscrapyPipeline in pipelines.py:
class MyscrapyPipeline(object):
def process_item(self, item, spider):
return item
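As an example of the validation use case above, a sketch of a pipeline that drops items without a title (the class name is hypothetical; it must also be listed in ITEM_PIPELINES to take effect):
  from scrapy.exceptions import DropItem
  
  class RequiredFieldsPipeline(object):
      """Validation example: drop items that have no title."""
  
      def process_item(self, item, spider):
          if not item.get('title'):
              raise DropItem("Missing title in %s" % item)
          return item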
==== Listing spiders and running them ====
* List all spiders:
scrapy list
* Run a spider:
scrapy crawl spidername
===== Custom APIs =====
==== Selectors ====
http://doc.scrapy.org/en/latest/topics/selectors.html\\
When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this:
* **BeautifulSoup** is a very popular screen scraping library among Python programmers which constructs a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: **it’s slow**.
* **lxml** is an XML parsing library (which also parses HTML) with **a pythonic API based on ElementTree** (which is not part of the Python standard library).
Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified **either by XPath or CSS expressions**.
How to create a Selector (a CSS variant follows these examples):
* Create Selector from text
from scrapy.http import HtmlResponse
from scrapy.selector import Selector
body = '<html><body><span>good</span></body></html>'
Selector(text=body).xpath('//span/text()').extract()
=>output: [u'good']
* Create Selector from Response
from scrapy.http import HtmlResponse
from scrapy.selector import Selector
response = HtmlResponse(url='http://example.com', body=body)
Selector(response=response).xpath('//span/text()').extract()
=>output: [u'good']
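Selectors can also use CSS expressions through the .css() method (supported in recent Scrapy versions); a small variant of the example above:
  from scrapy.selector import Selector
  
  body = '<html><body><span class="note">good</span></body></html>'
  # CSS and XPath can be mixed on the same Selector
  print Selector(text=body).css('span.note::text').extract()         # [u'good']
  print Selector(text=body).css('span').xpath('text()').extract()    # [u'good']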
==== Rule and linkextractors in CrawlSpider ====
CrawlSpider uses two basic algorithms:
- A **recursive algorithm** that finds all URLs linked from the start URLs and builds up the network of URLs linked to them
- A **rule-based link-extraction algorithm** that filters out the URLs it wants to download
==== Scrapy linkextractors package ====
refer: http://doc.scrapy.org/en/latest/topics/link-extractors.html
Link extractors are objects whose only purpose is to extract links from web pages. The only public method that every link extractor has is **extract_links**, which **receives a Response object** and **returns a list of scrapy.link.Link objects**. Link extractors are meant to be instantiated once and their extract_links method called several times with different responses to extract links to follow.
=== linkextractors classes ===
Link extractors classes bundled with Scrapy are provided in the **scrapy.linkextractors** module. Some basic classes in **scrapy.linkextractors** used to extract links:
* **scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor**. Because of the alias below:
# Top-level imports
from .lxmlhtml import LxmlLinkExtractor as LinkExtractor
=> So we can use LxmlLinkExtractor by importing LinkExtractor as shown below:
from scrapy.linkextractors import LinkExtractor
* scrapy.linkextractors.htmlparser.HtmlParserLinkExtractor => deprecated, to be removed in a future release
* scrapy.linkextractors.sgml.SgmlLinkExtractor => deprecated, to be removed in a future release
Constructor of LxmlLinkExtractor (a usage sketch follows the parameter list below):
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)
* **allow (a regular expression (or list of))** – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. **If not given (or empty), it will match all links**.
* **deny (a regular expression (or list of))** – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (ie. not extracted). It has precedence over the allow parameter. **If not given (or empty) it won’t exclude any links**.
* **allow_domains (str or list)** – a single value or a list of string containing domains which **will be considered for extracting the links**
* **deny_domains (str or list)** – a single value or a list of strings containing domains which **won’t be considered for extracting the links**
* **deny_extensions**
* **restrict_xpaths (str or list)** – is an XPath (or list of XPath’s) which defines regions inside the response where links should be extracted from. If given, **only the text selected by those XPath** will be scanned for links. See examples below.
* **restrict_css (str or list)** – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has **the same behaviour as restrict_xpaths**
* **tags (str or list)** – a tag or a list of tags to consider when extracting links. **Defaults to ('a', 'area')**.
* **attrs (list)** – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). **Defaults to ('href',)**
* **unique (boolean)** – whether **duplicate filtering** should be applied to extracted links.
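A short usage sketch combining some of these parameters (the HTML and patterns are made up for illustration):
  from scrapy.linkextractors import LinkExtractor
  from scrapy.http import HtmlResponse
  
  html = '''<html><body>
  <div id="nav"><a href="/Computers/Programming/">Programming</a></div>
  <div id="footer"><a href="/about.html">About</a></div>
  </body></html>'''
  response = HtmlResponse(url='http://www.dmoz.org/', body=html)
  
  # only keep links whose URL matches /Computers/ and only look inside the nav block
  extractor = LinkExtractor(allow=(r'/Computers/', ),
                            restrict_xpaths=('//div[@id="nav"]', ))
  for link in extractor.extract_links(response):
      print link.url, link.text    # http://www.dmoz.org/Computers/Programming/ Programming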
How links are extracted in LxmlLinkExtractor:
class LxmlLinkExtractor(FilteringLinkExtractor):
def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
tags=('a', 'area'), attrs=('href',), canonicalize=True,
unique=True, process_value=None, deny_extensions=None, restrict_css=()):
............................
lx = LxmlParserLinkExtractor(tag=tag_func, attr=attr_func,
unique=unique, process=process_value)
super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
allow_domains=allow_domains, deny_domains=deny_domains,
restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
canonicalize=canonicalize, deny_extensions=deny_extensions)
............................
def extract_links(self, response):
html = Selector(response)
base_url = get_base_url(response)
if self.restrict_xpaths:
docs = [subdoc
for x in self.restrict_xpaths
for subdoc in html.xpath(x)]
else:
docs = [html]
all_links = []
for doc in docs:
links = self._extract_links(doc, response.url, response.encoding, base_url)
all_links.extend(self._process_links(links))
return unique_list(all_links)
class LxmlParserLinkExtractor(object):
...................
def _extract_links(self, selector, response_url, response_encoding, base_url):
links = []
# hacky way to get the underlying lxml parsed document
for el, attr, attr_val in self._iter_links(selector._root):
# pseudo lxml.html.HtmlElement.make_links_absolute(base_url)
attr_val = urljoin(base_url, attr_val)
url = self.process_attr(attr_val)
if url is None:
continue
if isinstance(url, unicode):
url = url.encode(response_encoding)
# to fix relative links after process_value
url = urljoin(response_url, url)
link = Link(url, _collect_string_content(el) or u'',
nofollow=True if el.get('rel') == 'nofollow' else False)
links.append(link)
return unique_list(links, key=lambda link: link.url) \
if self.unique else links
def extract_links(self, response):
html = Selector(response)
base_url = get_base_url(response)
return self._extract_links(html, response.url, response.encoding, base_url)
...................
=== extract links with linkextractors ===
Extract file links from an HTML page using **tags=('script', 'img')** and **attrs=('src',)** (note that attrs must be a tuple or list, not a bare string):
filesExtractor = sle(allow=("/*"), tags=('script', 'img'), attrs=('src', ), deny_extensions=[])
links = [l for l in self.filesExtractor.extract_links(response) if l not in self.seen]
file_item = FileItem()
file_urls = []
if len(links) > 0:
    for link in links:
        self.seen.add(link)
        fullurl = getFullUrl(link.url, self.defaultBaseUrl)
        file_urls.append(fullurl)
file_item['file_urls'] = file_urls
==== Scrapy Selector Package ====
==== Rule in CrawlSpider ====
Understanding how Rule is used in CrawlSpider:
* Constructor of Rule:
class Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity)
* Using Rule in CrawlSpider:
class CrawlSpider(Spider):
rules = ()
def __init__(self, *a, **kw):
...................
self._compile_rules()
def _compile_rules(self):
def get_method(method):
if callable(method):
return method
elif isinstance(method, basestring):
return getattr(self, method, None)
self._rules = [copy.copy(r) for r in self.rules]
for rule in self._rules:
rule.callback = get_method(rule.callback)
rule.process_links = get_method(rule.process_links)
rule.process_request = get_method(rule.process_request)
def _requests_to_follow(self, response):
if not isinstance(response, HtmlResponse):
return
seen = set()
for n, rule in enumerate(self._rules):
links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
if links and rule.process_links:
links = rule.process_links(links)
for link in links:
seen.add(link)
r = Request(url=link.url, callback=self._response_downloaded)
r.meta.update(rule=n, link_text=link.text)
yield rule.process_request(r)
The recursive algorithm in CrawlSpider:
* The recursive function _parse_response:
def _parse_response(self, response, callback, cb_kwargs, follow=True):
if callback:
cb_res = callback(response, **cb_kwargs) or ()
cb_res = self.process_results(response, cb_res)
for requests_or_item in iterate_spider_output(cb_res):
yield requests_or_item
if follow and self._follow_links:
for request_or_item in self._requests_to_follow(response):
yield request_or_item
- Step 1: Parse the response and collect all **links returned by the callback**
- Step 2: Parse and collect all **links matching the rules** defined by the CrawlSpider, **if follow=True**
- Step 3: Download the content of all the links collected above
- Step 4: Call **_parse_response** to parse the content downloaded in Step 3 and **go back to Step 1**
- When does the recursion end? It ends when there are no more links to return in **_parse_response**
* Overview of the processing algorithm in CrawlSpider:
class CrawlSpider(Spider):
...................
def parse(self, response):
return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)
- Step 1: Download all links defined in **start_urls**
- Step 2: Each response downloaded in Step 1 is passed to **parse -> _parse_response** with the default option follow=True to collect **all the links** in the response content **that match the rules**
- Step 3: Download all the links collected in **Step 2**
- Step 4: For the content downloaded in Step 3, if **follow=True**, collect all the links matching the rules again and download them
Example of defining Rules in download.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
................
class DownloadSpider(CrawlSpider):
name = "download"
allowed_domains = ["template-help.com"]
start_urls = [
"http://livedemo00.template-help.com/wordpress_44396/"
]
rules = [
Rule(sle(allow=("/*")), callback='myparse', follow = False, process_request=process_request),
Rule(sle(allow=("/*"), tags=('link'), deny_extensions = []), callback='parse_css')
]
................
===== Code and Debug Scrapy with Eclipse =====
refer: [[python:pydeveclipse#import_scrapy_project_and_debug|Python Editor with pydev plugin on Eclipse]]
===== Scrapy Unittest with trial =====
==== Unittest with trial ====
Refer python unittest: [[script:python#python_unit_test|Python Unittest]]
Create simpletest.py with the content below:
import unittest
class DemoTest(unittest.TestCase):
def test_passes(self):
pass
def test_fails(self):
self.fail("I failed.")
Run with trial:
trial simpletest.py
Output:
simpletest
DemoTest
test_fails ... [FAIL]
test_passes ... [OK]
===============================================================================
[FAIL]
Traceback (most recent call last):
File "D:\tools\Python27\lib\unittest\case.py", line 327, in run
testMethod()
File "D:\simpletest.py", line 6, in test_fails
self.fail("I failed.")
File "D:\tools\Python27\lib\unittest\case.py", line 408, in fail
raise self.failureException(msg)
exceptions.AssertionError: I failed.
simpletest.DemoTest.test_fails
-------------------------------------------------------------------------------
Ran 2 tests in 1.540s
FAILED (failures=1, successes=1)
==== Scrapy Unittest ====
=== Install missing Python packages ===
* Install some missing packages with pip (and the vsftpd system package with yum):
pip install mock
pip install boto
pip install django
yum install vsftpd
* Install missing packages for running **test_pipeline_images.py**:
pip uninstall PIL
pip install pillow
And on Windows
SET VS90COMNTOOLS=%VS100COMNTOOLS%
easy_install pillow
* Install missing module bz2:
* Install bzip2-devel
yum install bzip2-devel
* Then rebuild source python and check bz2 module:
./configure
make
./python -c "import bz2; print bz2.__doc__"
* Install Python if the bz2 module check is OK:
make install
* Fix error "Error loading either pysqlite2 or sqlite3 modules"
* Install sqlite-devel
yum install sqlite-devel
* Then rebuild source python and check sqlite3 module:
./configure
make
./python -c "import sqlite3;print sqlite3.__doc__"
* Install Python if the sqlite3 check is OK:
make install
=== Run scrapy Unittest ===
Run the command below to run all unit tests for the installed Scrapy:
trial scrapy
Or copy bin/runtests.sh from the Scrapy source to /usr/local/lib/python2.7/site-packages/scrapy/ and run:
./bin/runtests.sh
Or go to the Scrapy source tree and run the script below to run all unit tests against the source:
./bin/runtests.sh
To run a single test module in \Lib\site-packages\Scrapy-0.22.2-py2.7.egg\scrapy\tests\, for example test_downloadermiddleware.py:
trial test_downloadermiddleware.py