====== Scrapy ======
* **Item Pipeline**: Post-process and store your scraped data (a pipeline sketch follows this list).
* **Feed exports**: Output your scraped data using different formats and storages.
* **Link Extractors**: Convenient classes to extract links to follow from pages.
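A minimal sketch of the Item Pipeline idea, assuming a hypothetical item with a **price** field (the class and field names are illustrative, not from this page):<code python>
class CleanPricePipeline(object):
    """Hypothetical pipeline: tidy up an item before it is stored."""

    def process_item(self, item, spider):
        # process_item is the hook Scrapy calls for every scraped item
        if item.get('price'):
            item['price'] = item['price'].strip()
        return item  # returning the item passes it on to the next pipeline
</code>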
==== Data Flow ====
{{:...|Scrapy architecture data-flow diagram}}
SET VS90COMNTOOLS=%VS100COMNTOOLS%
</code>
* upgrade setuptools:<code bat>
pip install -U setuptools
</code>
=== Install pyopenssl ===
Step by step install openssl:
ld_args.append('/MANIFEST')
</code>
=== Install pywin32, twisted ===
* Install pywin32 (follow the link: http://...) or install it with easy_install:<code bat>
easy_install.exe pypiwin32
</code>
* Go to directory d:\... and install twisted:<code bat>
easy_install.exe twisted
</code>
=== Install lxml ===
Install lxml version 2.x.x:
<code bat>
easy_install.exe lxml==2.3
</code>
Install lxml version 3.x.x:
  - Step 1: Download the lxml MS Windows Installer package
  - Step 2: Install it with easy_install:<code bat>
easy_install lxml-3.4.0.win32-py2.7.exe
</code>
Build lxml from source:<code bat>
hg clone git://...
cd lxml
...
</code>
<code python>
...
return item
</code>
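If the **return item** above belongs to an item pipeline (the surrounding code is truncated here), the pipeline also has to be enabled in **settings.py**; a sketch with a hypothetical project and pipeline name:<code python>
# settings.py -- module path is hypothetical; the number (0-1000) sets the pipeline's run order
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}
</code>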
==== Check and run spiders ====
* List all spiders in the project:<code bash>
scrapy list
</code>
* Run a spider:<code bash>
scrapy crawl spidername
</code>
===== Custom APIs =====
==== Rule and linkextractors in CrawlSpider ====
CrawlSpider uses two basic algorithms (see the sketch after this list):
  - A **recursive algorithm** that finds every url linked from the start url, building up the network of urls connected to it
  - A **rule-based link-extraction algorithm** that filters that network down to the urls it actually wants to download
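A minimal CrawlSpider sketch showing the two ideas together; the domain and the **allow** pattern are hypothetical:<code python>
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']     # hypothetical domain
    start_urls = ['http://example.com/']

    rules = (
        # follow=True -> recurse into every matched page, building the url network;
        # the LinkExtractor filters which urls are actually downloaded
        Rule(LinkExtractor(allow=(r'/articles/',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url,
               'title': response.xpath('//title/text()').extract_first()}
</code>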
==== Scrapy linkextractors package ====
Refer to: http://...
Link extractors are objects whose only purpose is to extract links from web pages. The only public method that every link extractor has is **extract_links**, which receives a Response object and returns a list of **scrapy.link.Link** objects.
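A quick sketch of calling **extract_links** on a response you already have:<code python>
from scrapy.linkextractors import LinkExtractor

# inside a Spider subclass
def parse(self, response):
    le = LinkExtractor()
    for link in le.extract_links(response):
        # each result is a scrapy.link.Link with .url and .text attributes
        self.logger.info('%s -> %s', link.text, link.url)
</code>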
=== linkextractors classes ===
Link extractor classes bundled with Scrapy are provided in the **scrapy.linkextractors** module. Some basic classes in **scrapy.linkextractors** used to extract links:
* **scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor** – the default **LinkExtractor** points to this class, because of the alias below:<code python>
LinkExtractor = LxmlLinkExtractor
</code>
* **allow_domains (str or list)** – a single value or a list of strings containing domains which **will be considered for extracting the links**
* **deny_domains (str or list)** – a single value or a list of strings containing domains which **won’t be considered for extracting the links**
* **deny_extensions (list)** – a single value or a list of strings containing extensions that **should be ignored** when extracting links
* **restrict_xpaths (str or list)** – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, **only the text selected by those XPaths** will be scanned for links. See examples below.
* **restrict_css (str or list)** – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has **the same behaviour as restrict_xpaths**
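A sketch combining several of the parameters above (all patterns and domains are made up for illustration):<code python>
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(
    allow=(r'/category/\d+',),            # keep only urls matching this regex
    deny=(r'/login',),                    # drop urls matching this one
    allow_domains=('example.com',),       # only this domain
    deny_extensions=('pdf', 'zip'),       # skip links to these file types
    restrict_xpaths=('//div[@id="content"]',),  # scan only this region of the page
)
links = le.extract_links(response)        # response: an already-fetched Response
</code>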
=== extract links with linkextractors ===
Extract file links from an html page, limited to the **tags** and **attrs** configured on the link extractor. The **allow** pattern, the second argument of the **getFullUrl** helper and the item field name are truncated in the original; the values below are assumptions:<code python>
# sle is presumably the link extractor class, e.g.:
#   from scrapy.linkextractors import LinkExtractor as sle
filesExtractor = sle(allow=("/...",))  # allow pattern truncated in the original
links = [l for l in self.filesExtractor.extract_links(response) if l not in self.seen]
file_item = FileItem()
file_urls = []
if len(links) > 0:
    for link in links:
        self.seen.add(link)
        # getFullUrl: project helper that resolves a (possibly relative) url; second argument assumed
        fullurl = getFullUrl(link.url, response.url)
        file_urls.append(fullurl)
file_item['file_urls'] = file_urls  # field name assumed
</code>
==== Scrapy Selector Package ====
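A short sketch of what the **scrapy.selector** package provides, XPath and CSS selection over an html document:<code python>
from scrapy.selector import Selector

sel = Selector(text='<html><body><span>good</span></body></html>')
print(sel.xpath('//span/text()').extract())   # ['good']
print(sel.css('span::text').extract())        # ['good']
</code>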
==== Rule in CrawlSpider ====
Understanding how to use Rule in CrawlSpider:
* Rule constructor:<code python>
Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
</code>
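A sketch of how two Rule objects typically combine in a CrawlSpider's **rules** tuple (the regexes are hypothetical):<code python>
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

rules = (
    # pagination links: follow them, but don't parse anything on those pages
    Rule(LinkExtractor(allow=(r'/page/\d+',)), follow=True),
    # detail pages: parse with the callback; follow defaults to False when a callback is set
    Rule(LinkExtractor(allow=(r'/item/\d+',)), callback='parse_item'),
)
</code>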