Scrapy
Refer to these code examples:
- https://code.google.com/p/scrapy-tutorial/
git clone https://code.google.com/p/scrapy-tutorial/
- https://code.google.com/p/scrapy-spider/
svn checkout http://scrapy-spider.googlecode.com/svn/trunk/ scrapy-spider-read-only
- https://code.google.com/p/kateglo-crawler/
svn checkout http://kateglo-crawler.googlecode.com/svn/trunk/ kateglo-crawler-read-only
- https://code.google.com/p/ina-news-crawler/
svn checkout http://ina-news-crawler.googlecode.com/svn/trunk/ ina-news-crawler-read-only
- https://code.google.com/p/simple-dm-forums/
hg clone https://code.google.com/p/simple-dm-forums/
- https://code.google.com/p/tanqing-web-based-scrapy/
svn checkout http://tanqing-web-based-scrapy.googlecode.com/svn/trunk/ tanqing-web-based-scrapy-read-only
- https://code.google.com/p/scrapy/
svn checkout http://scrapy.googlecode.com/svn/trunk/ scrapy-read-only
- https://github.com/geekan/scrapy-examples
git clone https://github.com/geekan/scrapy-examples.git
- https://github.com/php5engineer/scrapy-drupal-org
git clone https://github.com/php5engineer/scrapy-drupal-org.git
Scrapy overview
Basic Concepts
- Command line tool: Learn about the command-line tool used to manage your Scrapy project.
- Items: Define the data you want to scrape.
- Spiders: Write the rules to crawl your websites.
- Selectors: Extract the data from web pages using XPath.
- Scrapy shell: Test your extraction code in an interactive environment.
- Item Loaders: Populate your items with the extracted data.
- Item Pipeline: Post-process and store your scraped data.
- Feed exports: Output your scraped data using different formats and storages.
- Link Extractors: Convenient classes to extract links to follow from pages.
Data Flow
The data flow in Scrapy is controlled by the execution engine, and goes like this (a minimal spider sketch follows the steps):
- The Engine opens a domain, locates the Spider that handles that domain, and asks the spider for the first URLs to crawl.
- The Engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler, as Requests.
- The Engine asks the Scheduler for the next URLs to crawl.
- The Scheduler returns the next URLs to crawl to the Engine and the Engine sends them to the Downloader, passing through the Downloader Middleware (request direction).
- Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middleware (response direction).
- The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (input direction).
- The Spider processes the Response and returns scraped Items and new Requests (to follow) to the Engine.
- The Engine sends scraped Items (returned by the Spider) to the Item Pipeline and Requests (returned by the Spider) to the Scheduler.
- The process repeats (from step 2) until there are no more requests from the Scheduler, and the Engine closes the domain.
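To make this concrete, below is a minimal spider sketch (not part of the Scrapy documentation; the domain, item fields and XPath expressions are illustrative assumptions) showing the moving parts: start_urls seed the Scheduler, and parse() returns both scraped Items (which end up in the Item Pipeline) and new Requests (which go back to the Scheduler):

# Illustrative sketch of a spider participating in the data flow described above.
# The domain, fields and XPaths are assumptions, not a real target site.
import urlparse

from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.item import Item, Field

class PageItem(Item):
    url = Field()
    title = Field()

class FlowDemoSpider(Spider):
    name = "flowdemo"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]      # steps 1-2: first URLs handed to the Scheduler

    def parse(self, response):
        sel = Selector(response)

        # steps 6-7: the Spider processes the Response and returns scraped Items...
        item = PageItem()
        item['url'] = response.url
        item['title'] = sel.xpath('//title/text()').extract()
        yield item

        # ...and new Requests to follow, which the Engine sends back to the Scheduler (step 2)
        for href in sel.xpath('//a/@href').extract():
            yield Request(urlparse.urljoin(response.url, href), callback=self.parse)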
Installation
Installation on Linux
Pre-requisites:
- Python 2.7
Install python 2.7 from source
Because the yum packages in CentOS 6 provide Python 2.6, we need to install Python 2.7 from source.
yum install bzip2-devel
yum install sqlite-devel
yum install wget
wget https://www.python.org/ftp/python/2.7.7/Python-2.7.7.tgz
tar xf Python-2.7.7.tgz
cd Python-2.7.7
./configure
make
make install
install pip
- install
wget https://bootstrap.pypa.io/get-pip.py
python2.7 get-pip.py
- check pip
pip -V
pip 1.5.6 from /usr/local/lib/python2.7/site-packages (python 2.7)
install lxml
- Install Pre-requisites packages
yum install libxml2-devel
yum install libxslt-devel
- install lxml with pip
pip install lxml==3.1.2
- Check the installed location:
/usr/lib64/python2.6/site-packages/lxml or /usr/local/lib/python2.7/site-packages/lxml
install pyopenssl
- Install Pre-requisites packages
yum install openssl-devel
yum install libffi-devel
- Install pyopenssl
- Install from source
wget https://pypi.python.org/packages/source/p/pyOpenSSL/pyOpenSSL-0.14.tar.gz
tar xf pyOpenSSL-0.14.tar.gz
cd pyOpenSSL-0.14
python2.7 setup.py --help
python2.7 setup.py build
python2.7 setup.py install
- Or Install with pip
pip install pyopenssl
install Twisted-14.0.0 from source
wget http://twistedmatrix.com/Releases/Twisted/14.0/Twisted-14.0.0.tar.bz2
tar xf Twisted-14.0.0.tar.bz2
cd Twisted-14.0.0
python2.7 setup.py --help
python2.7 setup.py build
python2.7 setup.py install
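Before installing Scrapy itself, it is worth checking that the prerequisites import cleanly under the newly built Python 2.7 (the version strings printed will depend on what you installed):

python2.7 -c "from lxml import etree; print etree.LXML_VERSION"
python2.7 -c "import OpenSSL; print OpenSSL.__version__"
python2.7 -c "import twisted; print twisted.__version__"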
Install scrapy
- Install
pip install scrapy
- update
pip install scrapy --upgrade
- build and install from source:
wget https://codeload.github.com/scrapy/scrapy/legacy.tar.gz/master
tar xf master
cd scrapy-scrapy-9d57ecf
python setup.py install
Upgrade to the master branch to fix the error below:
trial test_crawl.py
scrapy.tests.test_crawl
  CrawlTestCase
    test_delay ...
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/site-packages/scrapy/tests/mockserver.py", line 198, in <module>
    os.path.join(os.path.dirname(__file__), 'keys/cert.pem'),
  File "/usr/local/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/internet/ssl.py", line 104, in __init__
    self.cacheContext()
  File "/usr/local/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/internet/ssl.py", line 113, in cacheContext
    ctx.use_certificate_file(self.certificateFileName)
  File "/usr/local/lib/python2.7/site-packages/OpenSSL/SSL.py", line 391, in use_certificate_file
    _raise_current_error()
  File "/usr/local/lib/python2.7/site-packages/OpenSSL/_util.py", line 22, in exception_from_error_queue
    raise exceptionType(errors)
OpenSSL.SSL.Error: [('system library', 'fopen', 'No such file or directory'), ('BIO routines', 'FILE_CTRL', 'system lib'), ('SSL routines', 'SSL_CTX_use_certificate_file', 'system lib')]
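Whether Scrapy was installed with pip or built from source, a quick sanity check is to print the installed version (the version shown will depend on your installation):

python2.7 -c "import scrapy; print scrapy.__version__"
scrapy version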
Install on Windows
Prepare environment variables
- Set the environment variables for Python:
set PATH=%PATH%;"D:\tools\Python27\";"D:\tools\Python27\Scripts"
- Check the current Visual Studio environment variables:
set
output:
VS100COMNTOOLS=C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\Tools\
VS110COMNTOOLS=D:\tools\Microsoft Visual Studio 11.0\Common7\Tools\
- Set VS90COMNTOOLS (Python uses it to find the compiler on Windows):
SET VS90COMNTOOLS=%VS100COMNTOOLS%
⇒ Fix error: Unable to find vcvarsall.bat
- upgrade setuptools:
pip install -U setuptools
Install pyopenssl
Step-by-step installation of OpenSSL:
- Download Visual C++ 2008(or 2010) redistributables for your Windows and architecture
- Download OpenSSL for your Windows and architecture (the regular version, not the light one)
- Prepare environment for building pyopenssl
SET LIB
SET INCLUDE
SET VS90COMNTOOLS=%VS100COMNTOOLS%
SET INCLUDE=d:\tools\OpenSSL-Win32\include
SET LIB=d:\tools\OpenSSL-Win32\lib
And on 64-bit Windows:
SET LIB
SET INCLUDE
SET VS90COMNTOOLS=%VS110COMNTOOLS%
SET INCLUDE=d:\tools\OpenSSL-Win64\include
SET LIB=d:\tools\OpenSSL-Win64\lib
We set the environment variable VS90COMNTOOLS=%VS100COMNTOOLS% because Visual C++ 2010 is registered on Windows under VS100COMNTOOLS, not under the Visual C++ 2008 variable. Python looks for the compiler on Windows through VS90COMNTOOLS, so we point VS90COMNTOOLS at VS100COMNTOOLS.
- Install pyopenssl: Go to directory d:\tools\Python27\Scripts\ and run
easy_install.exe pyopenssl
Issues for building pyopenssl:
- Issue:
cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -ID:\tools\Python27\include -ID:\tools\Python27\PC /Tccryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.c /Fobuild\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.obj
link.exe /DLL /nologo /INCREMENTAL:NO /LIBPATH:D:\tools\Python27\libs /LIBPATH:D:\tools\Python27\PCbuild libeay32.lib ssleay32.lib advapi32.lib /EXPORT:init_Cryptography_cffi_444d7397xa22f8491 build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.obj /OUT:build\lib.win32-2.7\cryptography\_Cryptography_cffi_444d7397xa22f8491.pyd /IMPLIB:build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.lib /MANIFESTFILE:build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest
mt.exe -nologo -manifest build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest -outputresource:build\lib.win32-2.7\cryptography\_Cryptography_cffi_444d7397xa22f8491.pyd;2
error: command 'mt.exe' failed with exit status 31
- Root cause of the issue:
link.exe /MANIFESTFILE:build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest
⇒ The .manifest file cannot be created
- And, following the specification at http://msdn.microsoft.com/en-us/library/y0zzbyt4.aspx, we need to add the /MANIFEST option when calling link.exe so that the manifest file is generated
Fix: to resolve this issue, update msvc9compiler.py to add the /MANIFEST build option just before the line ld_args.append('/MANIFESTFILE:' + temp_manifest):
ld_args.append('/MANIFEST ')
ld_args.append('/MANIFESTFILE:' + temp_manifest)
Install pywin32, twisted
- Install pywin32 from http://sourceforge.net/projects/pywin32/files/ or install the pypiwin32 package:
easy_install.exe pypiwin32
- Go to the directory d:\tools\Python27\Scripts\ and run the command below to install twisted:
easy_install.exe twisted
Install lxml
Install lxml version 2.x.x
easy_install.exe lxml==2.3
Install lxml 3.x.x:
- Step1: Download lxml MS Windows Installer packages
- Step2: Install with easy_install:
easy_install lxml-3.4.0.win32-py2.7.exe
Build lxml from source
hg clone git://github.com/lxml/lxml.git lxml
cd lxml
SET VS90COMNTOOLS=%VS100COMNTOOLS%
pip install -r requirements.txt
python setup.py build
python setup.py install
Install scrapy
Go to directory d:\tools\Python27\Scripts\ and run:
pip.exe install scrapy
Install and run scrapy on virtual environment with miniconda
- Step1: Create a virtual environment named scrapy and install the scrapy package into it:
conda create -n scrapy scrapy python=2
- Step2: Activate the scrapy virtual environment (a quick check follows):
activate scrapy
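Once the environment is active, the scrapy command installed into it should be on the PATH; a quick check (the output depends on the version conda installed), and deactivate (source deactivate on Linux) leaves the environment:

scrapy version
deactivate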
First Project
This tutorial will walk you through these tasks:
- Creating a new Scrapy project
- Writing a simple spider to crawl a site and running the crawl
- Modifying the spider to extract Items with a Selector
- Defining an Item class to store the extracted data and returning Items for the Item Pipeline
- Writing an Item Pipeline to store the extracted Items
Creating a project
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and then run:
scrapy startproject myscrapy
This will create a myscrapy directory with the following contents:
myscrapy/
    scrapy.cfg
    myscrapy/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
These are basically:
- scrapy.cfg: the project configuration file
- myscrapy/: the project’s python module, you’ll later import your code from here.
- myscrapy/items.py: the project’s items file.
- myscrapy/pipelines.py: the project’s pipelines file.
- myscrapy/settings.py: the project’s settings file.
- myscrapy/spiders/: a directory where you’ll later put your spiders.
Create simple Spider
This is the code for our first Spider; save it in a file named dmoz_spider.py under the myscrapy/spiders directory:
from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
⇒ This code defines a spider with the basic information below:
- Identifies the Spider:
name = "dmoz"
- The list of domains allowed for crawling:
allowed_domains = ["dmoz.org"]
- List of URLs where the Spider will begin to crawl from:
start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
Crawling
To run the spider named 'dmoz', go to the project’s top-level directory and run:
scrapy crawl dmoz
Output:
- crawling log:
2014-06-11 15:52:30+0700 [scrapy] INFO: Scrapy 0.22.2 started (bot: myscrapy)
2014-06-11 15:52:30+0700 [scrapy] INFO: Optional features available: ssl, http11, django
2014-06-11 15:52:30+0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'myscrapy.spiders', 'SPIDER_MODULES': ['myscrapy.spiders'], 'BOT_NAME': 'myscrapy'}
2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled item pipelines:
2014-06-11 15:52:30+0700 [dmoz] INFO: Spider opened
2014-06-11 15:52:30+0700 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-11 15:52:30+0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-11 15:52:30+0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-11 15:52:31+0700 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-06-11 15:52:31+0700 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-06-11 15:52:31+0700 [dmoz] INFO: Closing spider (finished)
2014-06-11 15:52:31+0700 [dmoz] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 516,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 16515,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 6, 11, 8, 52, 31, 299000),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2014, 6, 11, 8, 52, 30, 424000)}
2014-06-11 15:52:31+0700 [dmoz] INFO: Spider closed (finished)
- Output files created by the parse function in dmoz_spider (one file per start URL, named after the URL's last path segment):
Books
Resources
Modify the spider to extract Items with a Selector
Edit spiders\dmoz_spider.py:
from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        print "------response.url=", response.url
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title, link, desc
Defining an Item class to store the extracted data and returning Items for the Item Pipeline
- Define the Item class that stores the extracted data in items.py:
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
- Modify spiders\dmoz_spider.py to return Items for the Item Pipeline:
from scrapy.spider import Spider
from scrapy.selector import Selector
from myscrapy.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        print "------response.url=", response.url
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        return items
Storing the scraped data
scrapy crawl dmoz -o items.json -t json
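The same export can also be configured in settings.py instead of on the command line; a small sketch using the classic feed-export settings (the file name is an illustrative assumption, and newer Scrapy releases configure this through a FEEDS dict instead):

# settings.py -- minimal feed export configuration (illustrative)
FEED_URI = 'items.json'   # where the scraped items are written
FEED_FORMAT = 'json'      # built-in exporters include json, jsonlines, csv, xml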
Item Pipeline
After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.
Typical uses for item pipelines are:
- cleansing HTML data
- validating scraped data (checking that the items contain certain fields)
- checking for duplicates (and dropping them)
- storing the scraped item in a database
Follow the steps below to enable an Item Pipeline (a more complete pipeline sketch follows these steps):
- Register the pipeline class in the ITEM_PIPELINES setting in settings.py:
ITEM_PIPELINES = {'myscrapy.pipelines.MyscrapyPipeline': 300}
- Define class MyscrapyPipeline in pipelines.py:
class MyscrapyPipeline(object):
    def process_item(self, item, spider):
        return item
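As a slightly more realistic sketch (my own example, not part of the tutorial), a pipeline can also validate items and drop duplicates, two of the typical uses listed above; raising DropItem is Scrapy's standard way to discard an item:

# pipelines.py -- hypothetical validation/deduplication pipeline (illustrative)
from scrapy.exceptions import DropItem

class ValidateAndDedupePipeline(object):
    def __init__(self):
        self.seen_links = set()

    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem("Missing title in %s" % item)
        link = tuple(item.get('link', []))    # extract() returns a list, so make it hashable
        if link in self.seen_links:
            raise DropItem("Duplicate item %s" % item)
        self.seen_links.add(link)
        return item

Like MyscrapyPipeline, it has to be registered in ITEM_PIPELINES before Scrapy will run it.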
List the available spiders and run one
- List all spiders:
scrapy list
- Run spider:
scrapy crawl spidername
Custom APIs
Selectors
http://doc.scrapy.org/en/latest/topics/selectors.html
When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this:
- BeautifulSoup is a very popular screen scraping library among Python programmers which constructs a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: it’s slow.
- lxml is a XML parsing library (which also parses HTML) with a pythonic API based on ElementTree (which is not part of the Python standard library).
Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.
How to create a Selector:
- Create a Selector from text
from scrapy.http import HtmlResponse
from scrapy.selector import Selector

body = '<html><body><span>good</span></body></html>'
Selector(text=body).xpath('//span/text()').extract()
⇒output:
[u'good']
- Create a Selector from a Response
from scrapy.http import HtmlResponse
from scrapy.selector import Selector

response = HtmlResponse(url='http://example.com', body=body)
Selector(response=response).xpath('//span/text()').extract()
⇒output:
[u'good']
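Selectors also accept CSS expressions (on Scrapy versions that ship CSS selector support); the same extraction with .css() returns the same [u'good']:

from scrapy.selector import Selector

body = '<html><body><span>good</span></body></html>'
Selector(text=body).css('span::text').extract()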
Rule and linkextractors in CrawlSpider
CrawlSpider uses two basic algorithms (a minimal sketch follows this list):
- A recursive algorithm that finds all URLs linked from the start URLs, building up the network of URLs connected to them
- A link-extraction algorithm that uses the rules to filter out the URLs it wants to download
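A minimal CrawlSpider sketch of those two algorithms working together (my own illustrative example; the domain, the allow pattern and the callback name are assumptions): the Rule's link extractor filters which URLs are kept, and follow=True makes CrawlSpider recursively schedule them:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class FollowAllSpider(CrawlSpider):
    name = "followall"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    # link-extraction algorithm: only URLs matching the rule are kept
    rules = [
        Rule(SgmlLinkExtractor(allow=('/category/',)), callback='parse_page', follow=True),
    ]

    def parse_page(self, response):
        # recursion: every matched link is downloaded and fed back through the rules
        self.log("visited %s" % response.url)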
Scrapy linkextractors package
refer: http://doc.scrapy.org/en/latest/topics/link-extractors.html
Link extractors are objects whose only purpose is to extract links from web pages. The only public method that every link extractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. Link extractors are meant to be instantiated once and their extract_links method called several times with different responses to extract links to follow.
linkextractors classes
Link extractor classes bundled with Scrapy are provided in the scrapy.linkextractors module. Some basic classes in scrapy.linkextractors used to extract links:
- scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor. Because of the alias below,
# Top-level imports
from .lxmlhtml import LxmlLinkExtractor as LinkExtractor
⇒ we can use LxmlLinkExtractor by importing LinkExtractor as shown below:
from scrapy.linkextractors import LinkExtractor
- scrapy.linkextractors.htmlparser.HtmlParserLinkExtractor ⇒ deprecated, to be removed in a future release
- scrapy.linkextractors.sgml.SgmlLinkExtractor ⇒ deprecated, to be removed in a future release
Constructor of LxmlLinkExtractor (a usage sketch follows the parameter list):
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)
- allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
- deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (ie. not extracted). It has precedence over the allow parameter. If not given (or empty) it won’t exclude any links.
- allow_domains (str or list) – a single value or a list of string containing domains which will be considered for extracting the links
- deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links
- deny_extensions
- restrict_xpaths (str or list) – is an XPath (or list of XPath’s) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPath will be scanned for links. See examples below.
- restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths
- tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
- attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',)
- unique (boolean) – whether duplicate filtering should be applied to extracted links.
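A small usage sketch (the HTML, the allow pattern and the restrict_xpaths value are illustrative assumptions) showing how the constructor arguments and extract_links fit together:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor   # alias for LxmlLinkExtractor, see above

body = """
<html><body>
  <div id="nav"><a href="/about">About</a></div>
  <div id="content">
    <a href="/item?id=1">Item 1</a>
    <a href="http://other.example.org/x">External</a>
  </div>
</body></html>
"""
response = HtmlResponse(url='http://www.example.com/', body=body)

extractor = LinkExtractor(allow=(r'/item\?id=\d+',),            # keep only item pages
                          restrict_xpaths=('//div[@id="content"]',))
for link in extractor.extract_links(response):
    print link.url, link.text

Each returned object is a scrapy.link.Link, so link.url and link.text are available for building new Requests.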
How links are extracted in LxmlLinkExtractor (from the Scrapy source):
class LxmlLinkExtractor(FilteringLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True,
                 process_value=None, deny_extensions=None, restrict_css=()):
        ............................
        lx = LxmlParserLinkExtractor(tag=tag_func, attr=attr_func,
            unique=unique, process=process_value)

        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions)
        ............................

    def extract_links(self, response):
        html = Selector(response)
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [subdoc
                    for x in self.restrict_xpaths
                    for subdoc in html.xpath(x)]
        else:
            docs = [html]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            all_links.extend(self._process_links(links))
        return unique_list(all_links)


class LxmlParserLinkExtractor(object):
    ...................

    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        # hacky way to get the underlying lxml parsed document
        for el, attr, attr_val in self._iter_links(selector._root):
            # pseudo lxml.html.HtmlElement.make_links_absolute(base_url)
            attr_val = urljoin(base_url, attr_val)
            url = self.process_attr(attr_val)
            if url is None:
                continue
            if isinstance(url, unicode):
                url = url.encode(response_encoding)
            # to fix relative links after process_value
            url = urljoin(response_url, url)
            link = Link(url, _collect_string_content(el) or u'',
                nofollow=True if el.get('rel') == 'nofollow' else False)
            links.append(link)

        return unique_list(links, key=lambda link: link.url) \
                if self.unique else links

    def extract_links(self, response):
        html = Selector(response)
        base_url = get_base_url(response)
        return self._extract_links(html, response.url, response.encoding, base_url)
    ...................
extract links with linkextractors
Extract the files referenced from an HTML page through tags=('script', 'img') and attrs=('src',):
filesExtractor = sle(allow=("/*"), tags=('script', 'img'), attrs=('src',), deny_extensions=[])

links = [l for l in self.filesExtractor.extract_links(response) if l not in self.seen]
file_item = FileItem()
file_urls = []
if len(links) > 0:
    for link in links:
        self.seen.add(link)
        fullurl = getFullUrl(link.url, self.defaultBaseUrl)
        file_urls.append(fullurl)
    file_item['file_urls'] = file_urls
Scrapy Selector Package
Rule in CrawlSpider
Understanding how Rule is used in CrawlSpider:
- Constructor of Rule:
Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity)
- Using Rule in CrawlSpider:
class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        ...................
        self._compile_rules()

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, basestring):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)
The recursive algorithm in CrawlSpider:
- The recursive function _parse_response:
def _parse_response(self, response, callback, cb_kwargs, follow=True):
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item

    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item
- Step1: Parse the response and collect everything returned by the callback
- Step2: Parse the response and collect all links matching the rules defined by the CrawlSpider if follow=True
- Step3: Download the content of all the links collected above
- Step4: Call _parse_response on the content downloaded in Step3 and go back to Step1
- When does the recursion end? It ends when there are no more links to return from _parse_response
- Overview of the processing algorithm in CrawlSpider:
class CrawlSpider(Spider):
    ...................
    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)
- Step1: Download all the links defined in start_urls
- Step2: Each response downloaded in Step1 is passed to parse → _parse_response with the default option follow=True, which collects from the response content all the links defined by the rules
- Step3: Download all the links collected in Step2
- Step4: For the content downloaded in Step3, if follow=True, again collect all the links matching the rules and download them
Example of defining Rules in download.py
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
................

class DownloadSpider(CrawlSpider):
    name = "download"
    allowed_domains = ["template-help.com"]
    start_urls = [
        "http://livedemo00.template-help.com/wordpress_44396/"
    ]

    rules = [
        Rule(sle(allow=("/*")), callback='myparse', follow=False, process_request=process_request),
        Rule(sle(allow=("/*"), tags=('link',), deny_extensions=[]), callback='parse_css')
    ]
................
Code and Debug Scrapy with Eclipse
Scrapy Unittest with trial
Unittest with trial
Refer python unittest: Python Unittest
Create simpletest.py with the content below:
import unittest

class DemoTest(unittest.TestCase):
    def test_passes(self):
        pass

    def test_fails(self):
        self.fail("I failed.")
Run with trial:
trial simpletest.py
Output:
simpletest
  DemoTest
    test_fails ... [FAIL]
    test_passes ... [OK]
===============================================================================
[FAIL]
Traceback (most recent call last):
  File "D:\tools\Python27\lib\unittest\case.py", line 327, in run
    testMethod()
  File "D:\simpletest.py", line 6, in test_fails
    self.fail("I failed.")
  File "D:\tools\Python27\lib\unittest\case.py", line 408, in fail
    raise self.failureException(msg)
exceptions.AssertionError: I failed.

simpletest.DemoTest.test_fails
-------------------------------------------------------------------------------
Ran 2 tests in 1.540s

FAILED (failures=1, successes=1)
Scrapy Unittest
Install missing Python packages
- Install some missing packages with pip:
pip install mock
pip install boto
pip install django
yum install vsftpd
- Install missing packages for running test_pipeline_images.py:
pip uninstall PIL
pip install pillow
And on Windows
SET VS90COMNTOOLS=%VS100COMNTOOLS%
easy_install pillow
- Install the missing bz2 module:
- Install bzip2-devel
yum install bzip2-devel
- Then rebuild Python from source and check the bz2 module:
./configure make ./python -c "import bz2; print bz2.__doc__"
- Install Python if the bz2 module check is OK:
make install
- Fix error “Error loading either pysqlite2 or sqlite3 modules”
- Install sqlite-devel
yum install sqlite-devel
- Then rebuild Python from source and check the sqlite3 module:
./configure make ./python -c "import sqlite3;print sqlite3.__doc__"
- Install Python if the sqlite3 module check is OK:
make install
Run scrapy Unittest
Run the command below to run all unit tests for the installed scrapy package:
trial scrapy
Or copy bin/runtests.sh from the scrapy source to /usr/local/lib/python2.7/site-packages/scrapy/ and run:
./bin/runtests.sh
Go to the scrapy source directory and run the script below to run all unit tests against the source:
./bin/runtests.sh
To run a single unit-test script in \Lib\site-packages\Scrapy-0.22.2-py2.7.egg\scrapy\tests\, for example test_downloadermiddleware.py:
trial test_downloadermiddleware.py