Refer to these code examples:
git clone https://code.google.com/p/scrapy-tutorial/
svn checkout http://scrapy-spider.googlecode.com/svn/trunk/ scrapy-spider-read-only
svn checkout http://kateglo-crawler.googlecode.com/svn/trunk/ kateglo-crawler-read-only
svn checkout http://ina-news-crawler.googlecode.com/svn/trunk/ ina-news-crawler-read-only
hg clone https://code.google.com/p/simple-dm-forums/
svn checkout http://tanqing-web-based-scrapy.googlecode.com/svn/trunk/ tanqing-web-based-scrapy-read-only
svn checkout http://scrapy.googlecode.com/svn/trunk/ scrapy-read-only
git clone https://github.com/geekan/scrapy-examples.git
git clone https://github.com/php5engineer/scrapy-drupal-org.git
The data flow in Scrapy is controlled by the execution engine, and goes like this: the engine gets the first URLs to crawl from the spider and schedules them in the scheduler as requests; it then asks the scheduler for the next request and sends it to the downloader (passing through the downloader middlewares); once the page is downloaded, the downloader returns a response that the engine passes to the spider (through the spider middlewares); the spider returns scraped items and new requests, the items are sent to the item pipeline and the requests back to the scheduler; the process repeats until there are no more requests in the scheduler.
Because the yum package on CentOS 6 only provides Python 2.6, we need to install Python 2.7 from source:
yum install bzip2-devel
yum install sqlite-devel
yum install wget
wget https://www.python.org/ftp/python/2.7.7/Python-2.7.7.tgz
tar xf Python-2.7.7.tgz
cd Python-2.7.7
./configure
make
make install
wget https://bootstrap.pypa.io/get-pip.py
python2.7 get-pip.py
pip -V
pip 1.5.6 from /usr/local/lib/python2.7/site-packages (python 2.7)
yum install libxml2-devel
yum install libxslt-devel
pip install lxml==3.1.2
lxml will be installed under /usr/lib64/python2.6/site-packages/lxml or /usr/local/lib/python2.7/site-packages/lxml.
yum install openssl-devel
yum install libffi-devel
wget https://pypi.python.org/packages/source/p/pyOpenSSL/pyOpenSSL-0.14.tar.gz
tar xf pyOpenSSL-0.14.tar.gz
cd pyOpenSSL-0.14
python2.7 setup.py --help
python2.7 setup.py build
python2.7 setup.py install
pip install pyopenssl
wget http://twistedmatrix.com/Releases/Twisted/14.0/Twisted-14.0.0.tar.bz2
tar xf Twisted-14.0.0.tar.bz2
cd Twisted-14.0.0
python2.7 setup.py --help
python2.7 setup.py build
python2.7 setup.py install
pip install scrapy
pip install scrapy --upgrade
wget https://codeload.github.com/scrapy/scrapy/legacy.tar.gz/master
tar xf master
cd scrapy-scrapy-9d57ecf
python setup.py install
Upgrade to master to fix the error below:
trial test_crawl.py
scrapy.tests.test_crawl
  CrawlTestCase
    test_delay ...
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/site-packages/scrapy/tests/mockserver.py", line 198, in <module>
    os.path.join(os.path.dirname(__file__), 'keys/cert.pem'),
  File "/usr/local/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/internet/ssl.py", line 104, in __init__
    self.cacheContext()
  File "/usr/local/lib/python2.7/site-packages/Twisted-14.0.0-py2.7-linux-x86_64.egg/twisted/internet/ssl.py", line 113, in cacheContext
    ctx.use_certificate_file(self.certificateFileName)
  File "/usr/local/lib/python2.7/site-packages/OpenSSL/SSL.py", line 391, in use_certificate_file
    _raise_current_error()
  File "/usr/local/lib/python2.7/site-packages/OpenSSL/_util.py", line 22, in exception_from_error_queue
    raise exceptionType(errors)
OpenSSL.SSL.Error: [('system library', 'fopen', 'No such file or directory'), ('BIO routines', 'FILE_CTRL', 'system lib'), ('SSL routines', 'SSL_CTX_use_certificate_file', 'system lib')]
set PATH=%PATH%;"D:\tools\Python27\";"D:\tools\Python27\Scripts"
set
output:
VS100COMNTOOLS=C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\Tools\
VS110COMNTOOLS=D:\tools\Microsoft Visual Studio 11.0\Common7\Tools\
SET VS90COMNTOOLS=%VS100COMNTOOLS%
⇒ Fix error: Unable to find vcvarsall.bat
pip install -U setuptools
Step-by-step installation of OpenSSL:
SET LIB
SET INCLUDE
SET VS90COMNTOOLS=%VS100COMNTOOLS%
SET INCLUDE=d:\tools\OpenSSL-Win32\include
SET LIB=d:\tools\OpenSSL-Win32\lib
And on Windows 64-bit:
SET LIB
SET INCLUDE
SET VS90COMNTOOLS=%VS110COMNTOOLS%
SET INCLUDE=d:\tools\OpenSSL-Win64\include
SET LIB=d:\tools\OpenSSL-Win64\lib
We set VS90COMNTOOLS=%VS100COMNTOOLS% because the machine has Visual C++ 2010 installed (which registers the VS100COMNTOOLS environment variable), not Visual C++ 2008. Python looks for the compiler on Windows via the VS90COMNTOOLS variable, so we point VS90COMNTOOLS at VS100COMNTOOLS.
easy_install.exe pyopenssl
Issues when building pyOpenSSL:
cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -ID:\tools\Python27\include -ID:\tools\Python27\PC /Tccryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.c /Fobuild\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.obj
link.exe /DLL /nologo /INCREMENTAL:NO /LIBPATH:D:\tools\Python27\libs /LIBPATH:D:\tools\Python27\PCbuild libeay32.lib ssleay32.lib advapi32.lib /EXPORT:init_Cryptography_cffi_444d7397xa22f8491 build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.obj /OUT:build\lib.win32-2.7\cryptography\_Cryptography_cffi_444d7397xa22f8491.pyd /IMPLIB:build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.lib /MANIFESTFILE:build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest
mt.exe -nologo -manifest build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest -outputresource:build\lib.win32-2.7\cryptography\_Cryptography_cffi_444d7397xa22f8491.pyd;2
error: command 'mt.exe' failed with exit status 31
link.exe /MANIFESTFILE:build\temp.win32-2.7\Release\cryptography\hazmat\bindings\__pycache__\_Cryptography_cffi_444d7397xa22f8491.pyd.manifest
⇒ The .manifest file cannot be created.
Fix: update msvc9compiler.py to add the build option /MANIFEST in front of the line ld_args.append('/MANIFESTFILE:' + temp_manifest):
ld_args.append('/MANIFEST ')
ld_args.append('/MANIFESTFILE:' + temp_manifest)
easy_install.exe pypiwin32
easy_install.exe twisted
Install lxml version 2.x.x
easy_install.exe lxml==2.3
Install lxml 3.x.x:
easy_install lxml-3.4.0.win32-py2.7.exe
Build lxml from source
hg clone git://github.com/lxml/lxml.git lxml
cd lxml
SET VS90COMNTOOLS=%VS100COMNTOOLS%
pip install -r requirements.txt
python setup.py build
python setup.py install
Go to directory d:\tools\Python27\Scripts\ and run:
pip.exe install scrapy
conda create -n scrapy scrapy python=2
activate scrapy
This tutorial will walk you through these tasks: creating a new Scrapy project, defining the items you will extract, writing a spider to crawl a site and extract items, and writing an item pipeline to store the extracted items.
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and then run:
scrapy startproject myscrapy
This will create a myscrapy directory with the following contents:
myscrapy/
    scrapy.cfg
    myscrapy/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
These are basically:
scrapy.cfg: the project configuration file
myscrapy/: the project's Python module, you will import your code from here
myscrapy/items.py: the project's items file
myscrapy/pipelines.py: the project's pipelines file
myscrapy/settings.py: the project's settings file
myscrapy/spiders/: a directory where you will later put your spiders
This is the code for our first Spider; save it in a file named dmoz_spider.py under the myscrapy/spiders directory:
from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
⇒ This code defines a spider with the basic information below:
name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = [ "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/", "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/" ]
To run the spider named 'dmoz', go to the project’s top-level directory and run:
scrapy crawl dmoz
Output:
2014-06-11 15:52:30+0700 [scrapy] INFO: Scrapy 0.22.2 started (bot: myscrapy)
2014-06-11 15:52:30+0700 [scrapy] INFO: Optional features available: ssl, http11, django
2014-06-11 15:52:30+0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'myscrapy.spiders', 'SPIDER_MODULES': ['myscrapy.spiders'], 'BOT_NAME': 'myscrapy'}
2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-11 15:52:30+0700 [scrapy] INFO: Enabled item pipelines:
2014-06-11 15:52:30+0700 [dmoz] INFO: Spider opened
2014-06-11 15:52:30+0700 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-11 15:52:30+0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-11 15:52:30+0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-11 15:52:31+0700 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-06-11 15:52:31+0700 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-06-11 15:52:31+0700 [dmoz] INFO: Closing spider (finished)
2014-06-11 15:52:31+0700 [dmoz] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 516, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 16515, 'downloader/response_count': 2, 'downloader/response_status_count/200': 2, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2014, 6, 11, 8, 52, 31, 299000), 'log_count/DEBUG': 4, 'log_count/INFO': 7, 'response_received_count': 2, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'start_time': datetime.datetime(2014, 6, 11, 8, 52, 30, 424000)}
2014-06-11 15:52:31+0700 [dmoz] INFO: Spider closed (finished)
Two files have been created in the current directory, Books and Resources, containing the body of the two crawled pages.
Edit spiders\dmoz_spider.py:
from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        print "------response.url=", response.url
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title, link, desc
Define the item fields in myscrapy/items.py:

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
Then update the spider to populate and return DmozItem objects:

from scrapy.spider import Spider
from scrapy.selector import Selector
from myscrapy.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        print "------response.url=", response.url
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        return items
Run the crawl again and export the scraped items to JSON:

scrapy crawl dmoz -o items.json -t json
After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.
Typical uses for item pipelines are: cleansing HTML data, validating scraped data (checking that items contain certain fields), checking for (and dropping) duplicates, and storing the scraped items in a database.
Follow the steps below to enable the item pipeline:
Register the pipeline in myscrapy/settings.py:

ITEM_PIPELINES = {'myscrapy.pipelines.MyscrapyPipeline': 300}
Implement the pipeline in myscrapy/pipelines.py (the generated default simply returns the item unchanged):

class MyscrapyPipeline(object):
    def process_item(self, item, spider):
        return item
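As an illustrative sketch (this class is not part of the generated project and its behavior is an assumption for the example), a pipeline could validate scraped items and drop the ones without a title using Scrapy's DropItem exception:

from scrapy.exceptions import DropItem

class ValidateTitlePipeline(object):
    """Hypothetical pipeline: drop any scraped item that has no title."""

    def process_item(self, item, spider):
        if item.get('title'):
            return item
        raise DropItem("Missing title in %s" % item)

Like MyscrapyPipeline, it would have to be added to ITEM_PIPELINES in settings.py to take effect.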
Finally, list the spiders in the project and run the one you want:

scrapy list
scrapy crawl spidername
refer: http://doc.scrapy.org/en/latest/topics/selectors.html
When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this, such as BeautifulSoup and lxml.
Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.
How to create a Selector:
from scrapy.http import HtmlResponse
from scrapy.selector import Selector

body = '<html><body><span>good</span></body></html>'
Selector(text=body).xpath('//span/text()').extract()
⇒ output:
[u'good']
from scrapy.http import HtmlResponse
from scrapy.selector import Selector

body = '<html><body><span>good</span></body></html>'
response = HtmlResponse(url='http://example.com', body=body)
Selector(response=response).xpath('//span/text()').extract()
⇒ output:
[u'good']
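The examples above use XPath; a Selector can also be queried with CSS expressions through its css() method. A minimal sketch reusing the same body:

from scrapy.selector import Selector

body = '<html><body><span>good</span></body></html>'
# the ::text pseudo-element selects the text nodes inside <span>
Selector(text=body).css('span::text').extract()

⇒ output:

[u'good']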
CrawlSpider relies on two basic algorithms: extracting the links to follow from each response (via link extractors and rules) and recursively parsing the downloaded responses.
refer: http://doc.scrapy.org/en/latest/topics/link-extractors.html
Link extractors are objects whose only purpose is to extract links from web pages. The only public method that every link extractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. Link extractors are meant to be instantiated once and their extract_links method called several times with different responses to extract links to follow.
Link extractor classes bundled with Scrapy are provided in the scrapy.linkextractors module. Some basic classes in scrapy.linkextractors that are used to extract links:
# Top-level imports
from .lxmlhtml import LxmlLinkExtractor as LinkExtractor
⇒ So we can use LxmlLinkExtractor by importing LinkExtractor, as in the code below:
from scrapy.linkextractors import LinkExtractor
Constructor of LxmlLinkExtractor:
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)
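As a hedged usage sketch (the HTML, URLs and filter patterns below are made up for illustration, using the LinkExtractor import shown above), a link extractor can be configured with these parameters and applied to a response with extract_links:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# build a small response by hand just for this example
body = '<html><body><div id="content"><a href="/Computers/Books/">Books</a></div><a href="/login">Login</a></body></html>'
response = HtmlResponse(url='http://www.dmoz.org/', body=body)

extractor = LinkExtractor(
    allow=(r'/Computers/', ),                    # keep only URLs matching this regex
    deny=(r'/login', ),                          # drop URLs matching this regex
    allow_domains=('dmoz.org', ),                # stay inside this domain
    restrict_xpaths=('//div[@id="content"]', ),  # only look inside this element
    tags=('a', 'area'),
    attrs=('href', ),
)
for link in extractor.extract_links(response):
    print link.url, link.text                    # each result is a scrapy.link.Link object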
How links are extracted in LxmlLinkExtractor:
class LxmlLinkExtractor(FilteringLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True,
                 process_value=None, deny_extensions=None, restrict_css=()):
        ............................
        lx = LxmlParserLinkExtractor(tag=tag_func, attr=attr_func,
            unique=unique, process=process_value)

        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions)
        ............................

    def extract_links(self, response):
        html = Selector(response)
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [subdoc
                    for x in self.restrict_xpaths
                    for subdoc in html.xpath(x)]
        else:
            docs = [html]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            all_links.extend(self._process_links(links))
        return unique_list(all_links)


class LxmlParserLinkExtractor(object):
    ...................

    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        # hacky way to get the underlying lxml parsed document
        for el, attr, attr_val in self._iter_links(selector._root):
            # pseudo lxml.html.HtmlElement.make_links_absolute(base_url)
            attr_val = urljoin(base_url, attr_val)
            url = self.process_attr(attr_val)
            if url is None:
                continue
            if isinstance(url, unicode):
                url = url.encode(response_encoding)
            # to fix relative links after process_value
            url = urljoin(response_url, url)
            link = Link(url, _collect_string_content(el) or u'',
                nofollow=True if el.get('rel') == 'nofollow' else False)
            links.append(link)

        return unique_list(links, key=lambda link: link.url) \
                if self.unique else links

    def extract_links(self, response):
        html = Selector(response)
        base_url = get_base_url(response)
        return self._extract_links(html, response.url, response.encoding, base_url)
    ...................
Extract file links from the HTML page, looking at tags=('script', 'img') and attrs=('src',):
filesExtractor = sle(allow=("/*"), tags=('script', 'img'), attrs=('src',), deny_extensions=[])

links = [l for l in self.filesExtractor.extract_links(response) if l not in self.seen]
file_item = FileItem()
file_urls = []
if len(links) > 0:
    for link in links:
        self.seen.add(link)
        fullurl = getFullUrl(link.url, self.defaultBaseUrl)
        file_urls.append(fullurl)
    file_item['file_urls'] = file_urls
Understanding how Rule is used in CrawlSpider:
class Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity)
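A hedged sketch of how these parameters are typically used (the spider name, domain, URL patterns and callbacks below are made up for illustration):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle

class RuleExampleSpider(CrawlSpider):
    # hypothetical spider used only to illustrate the Rule parameters
    name = "rule_example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    rules = [
        # follow links matching /category/ and parse each matched page with parse_item;
        # the callback is given as a string and resolved by _compile_rules()
        Rule(sle(allow=(r'/category/', )), callback='parse_item', follow=True),
        # pages matching /static/ are parsed, but their links are not followed further
        Rule(sle(allow=(r'/static/', )), callback='parse_item', follow=False),
    ]

    def parse_item(self, response):
        print "parsed", response.url

Internally, CrawlSpider compiles these rules and follows the extracted links like this: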
class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        ...................
        self._compile_rules()

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, basestring):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)
The recursive algorithm in CrawlSpider:
def _parse_response(self, response, callback, cb_kwargs, follow=True):
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item

    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item
class CrawlSpider(Spider):
    ...................
    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)
Example of defining Rules in download.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
................

class DownloadSpider(CrawlSpider):
    name = "download"
    allowed_domains = ["template-help.com"]
    start_urls = [
        "http://livedemo00.template-help.com/wordpress_44396/"
    ]
    rules = [
        Rule(sle(allow=("/*")), callback='myparse', follow=False, process_request=process_request),
        Rule(sle(allow=("/*"), tags=('link',), deny_extensions=[]), callback='parse_css')
    ]
................
Refer to Python unittest: Python Unittest
Create simpletest.py with the content below:
import unittest

class DemoTest(unittest.TestCase):
    def test_passes(self):
        pass

    def test_fails(self):
        self.fail("I failed.")
Run with trial:
trial simpletest.py
Output:
simpletest
  DemoTest
    test_fails ...                                                      [FAIL]
    test_passes ...                                                       [OK]

===============================================================================
[FAIL]
Traceback (most recent call last):
  File "D:\tools\Python27\lib\unittest\case.py", line 327, in run
    testMethod()
  File "D:\simpletest.py", line 6, in test_fails
    self.fail("I failed.")
  File "D:\tools\Python27\lib\unittest\case.py", line 408, in fail
    raise self.failureException(msg)
exceptions.AssertionError: I failed.

simpletest.DemoTest.test_fails
-------------------------------------------------------------------------------
Ran 2 tests in 1.540s

FAILED (failures=1, successes=1)
pip install mock
pip install boto
pip install django
yum install vsftpd
pip uninstall PIL
pip install pillow
And on Windows:
SET VS90COMNTOOLS=%VS100COMNTOOLS%
easy_install pillow
yum install bzip2-devel
./configure make ./python -c "import bz2; print bz2.__doc__"
make install
yum install sqlite-devel
./configure make ./python -c "import sqlite3;print sqlite3.__doc__"
make install
Run the command below to run all unit tests for the installed Scrapy:
trial scrapy
Or copy bin/runtests.sh from the Scrapy source to /usr/local/lib/python2.7/site-packages/scrapy/ and run:
./bin/runtests.sh
Go to the Scrapy source directory and run the script below to run all unit tests for the Scrapy source:
./bin/runtests.sh
To run a single unit test script in \Lib\site-packages\Scrapy-0.22.2-py2.7.egg\scrapy\tests\, for example test_downloadermiddleware.py:
trial test_downloadermiddleware.py