  * **Item Pipeline**: Post-process and store your scraped data.
  * **Feed exports**: Output your scraped data using different formats and storages.
  * **Link Extractors**: Convenient classes to extract links to follow from pages.
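As a concrete taste of the Item Pipeline component above, here is a minimal, hypothetical pipeline sketch (the class name and the **price** field are made up; a real project would register it in the **ITEM_PIPELINES** setting):<code python>
from scrapy.exceptions import DropItem

class PriceValidationPipeline(object):
    # Hypothetical pipeline: drop incomplete items, normalize the rest
    def process_item(self, item, spider):
        if not item.get('price'):
            # Discard items scraped without a price instead of storing them
            raise DropItem('Missing price in %s' % item)
        # Normalize the price to a float before it is stored/exported
        item['price'] = float(item['price'])
        return item
</code>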
==== Data Flow ====
{{:crawler:scrapy_architecture.png|}}\\
SET VS90COMNTOOLS=%VS100COMNTOOLS%
</code> => Fix error: Unable to find vcvarsall.bat
  * Upgrade setuptools:<code bat>
pip install -U setuptools
</code>
=== Install pyopenssl ===
Install openssl step by step:
ld_args.append('/MANIFESTFILE:' + temp_manifest)
</code>
=== Install pywin32, twisted ===
  * Install pywin32 from http://sourceforge.net/projects/pywin32/files/ or install the pypiwin32 package:<code bat>
easy_install.exe pypiwin32
</code>
  * Go to the directory d:\tools\Python27\Scripts\ and run the command below to install twisted:<code dos>
easy_install.exe twisted
</code>
=== Install lxml ===
Install lxml version 2.x.x:
<code bat>
easy_install.exe lxml==2.3
</code>
Install lxml version 3.x.x:
  - Step 1: Download the lxml MS Windows Installer package
  - Step 2: Install it with easy_install:<code bat>
easy_install lxml-3.4.0.win32-py2.7.exe
</code>
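Whichever version you install, a quick sanity check (not part of the original steps) confirms that lxml imports and shows its version:<code bat>
python -c "import lxml.etree; print lxml.etree.LXML_VERSION"
</code>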
Build lxml from source:<code bash>
git clone git://github.com/lxml/lxml.git lxml
cd lxml
        return item
  
</code>
==== Check and run spiders ====
  * List all spiders in the project:<code bat>
scrapy list
</code>
  * Run a spider:<code bat>
scrapy crawl spidername
</code>
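Both commands only see spiders defined inside the project; a minimal sketch of one (the name and URL are placeholders) that **scrapy list** would print and **scrapy crawl spidername** would run:<code python>
import scrapy

class ExampleSpider(scrapy.Spider):
    # 'name' is what 'scrapy list' prints and what 'scrapy crawl' expects
    name = 'spidername'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Yield something so the crawl produces visible output
        yield {'title': response.xpath('//title/text()').extract_first()}
</code>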
===== Custom APIs =====
</code> => output: <code>[u'good']</code>
==== Rule and linkextractors in CrawlSpider ====
CrawlSpider uses two basic algorithms (a minimal sketch follows this list):
  - **A recursive algorithm** that finds every URL linked from the start URL, building up the network of URLs connected to it
  - **A rule-based link-extraction algorithm** that filters out the URLs it actually wants to download
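A minimal CrawlSpider sketch showing both ideas (the domain and URL patterns are placeholders, not from the original page):<code python>
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleCrawlSpider(CrawlSpider):
    name = 'example_crawl'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        # Recursive part: keep following category links without parsing them
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        # Rule-based part: only item pages are handed to the callback
        Rule(LinkExtractor(allow=r'/item/'), callback='parse_item'),
    )

    def parse_item(self, response):
        # Placeholder callback: record which URL matched the rule
        yield {'url': response.url}
</code>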
==== Scrapy linkextractors package ====
refer: http://doc.scrapy.org/en/latest/topics/link-extractors.html
  
Link extractors are objects whose only purpose is to extract links from web pages. The only public method that every link extractor has is **extract_links**, which **receives a Response object** and **returns a list of scrapy.link.Link objects**. Link extractors are meant to be instantiated once and their extract_links method called several times with different responses to extract links to follow.
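For example, a short usage sketch, assuming **response** is the Response object passed to a spider callback (the allow pattern is a placeholder):<code python>
from scrapy.linkextractors import LinkExtractor

# Instantiate once (for example in the spider's __init__) ...
extractor = LinkExtractor(allow=r'/articles/')

# ... then call extract_links with each response; entries are scrapy.link.Link
for link in extractor.extract_links(response):
    print link.url, link.text
</code>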
=== linkextractors classes ===
Link extractor classes bundled with Scrapy are provided in the **scrapy.linkextractors** module. Some basic classes in **scrapy.linkextractors** used to extract links:
  * **scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor**, which has the alias shown below:<code python>
  * **allow_domains (str or list)** – a single value or a list of strings containing domains which **will be considered for extracting the links**
  * **deny_domains (str or list)** – a single value or a list of strings containing domains which **won’t be considered for extracting the links**
  * **deny_extensions (list)** – a list of file extensions that should be ignored when extracting links
  * **restrict_xpaths (str or list)** – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, **only the text selected by those XPaths** will be scanned for links. See examples below.
  * **restrict_css (str or list)** – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has **the same behaviour as restrict_xpaths**
...................
</code>
=== Extract links with linkextractors ===
Extract file links from an HTML page, taking links found in **tags=('script','img')** with **attrs=('src',)**:<code python>
# 'sle' is an alias for a link extractor class, e.g.
# from scrapy.linkextractors import LinkExtractor as sle
# FileItem and getFullUrl are helpers defined elsewhere in the project.
self.filesExtractor = sle(allow=('/*',), tags=('script', 'img'), attrs=('src',), deny_extensions=[])

# Inside a spider callback; self.seen is a set of already-handled links:
links = [l for l in self.filesExtractor.extract_links(response) if l not in self.seen]
file_item = FileItem()
file_urls = []
if len(links) > 0:
    for link in links:
        self.seen.add(link)
        fullurl = getFullUrl(link.url, self.defaultBaseUrl)
        file_urls.append(fullurl)
    file_item['file_urls'] = file_urls
</code>
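Note that **file_urls** is the field name Scrapy's built-in FilesPipeline looks for: when that pipeline is enabled in **ITEM_PIPELINES**, every URL collected into **file_urls** is downloaded automatically and the download results are stored in the item's **files** field.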
==== Scrapy Selector Package ====
==== Rule in CrawlSpider ====
Understand how to use Rule in CrawlSpider:
  * Rule constructor:<code python>