====== Scrapy ======
* **Item Pipeline**: Post-process and store your scraped data (a pipeline sketch follows this list).
* **Feed exports**: Output your scraped data using different formats and storages.
* **Link Extractors**: Convenient classes to extract links to follow from pages.
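A minimal sketch of the Item Pipeline idea, assuming a hypothetical item with a **price** field (the class and field names are illustrative, not from this page):<code python>
class CleanPricePipeline(object):
    """Hypothetical pipeline: tidy up an item before it is stored."""

    def process_item(self, item, spider):
        # process_item is the hook Scrapy calls for every scraped item
        if item.get('price'):
            item['price'] = item['price'].strip()
        return item  # returning the item passes it on to the next pipeline
</code>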
==== Data Flow ====
{{:...|Scrapy architecture data-flow diagram}}
SET VS90COMNTOOLS=%VS100COMNTOOLS%
</code>
* upgrade setuptools:<code bat>
pip install -U setuptools
</code>
=== Install pyopenssl ===
Step by step install openssl:
ld_args.append('/MANIFEST')
</code>
=== Install pywin32, twisted ===
* Install pywin32 (follow the link: http://...) or install it with easy_install:<code bat>
easy_install.exe pypiwin32
</code>
* Go to directory d:\... and install twisted:<code bat>
easy_install.exe twisted
</code>
=== Install lxml ===
Install lxml version 2.x.x:
<code bat>
easy_install.exe lxml==2.3
</code>
Install lxml version 3.x.x:
  - Step 1: Download the lxml MS Windows Installer package
  - Step 2: Install it with easy_install:<code bat>
easy_install lxml-3.4.0.win32-py2.7.exe
</code>
Build lxml from source:<code bat>
hg clone git://...
cd lxml
...
</code>
<code python>
...
return item
</code>
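If the **return item** above belongs to an item pipeline (the surrounding code is truncated here), the pipeline also has to be enabled in **settings.py**; a sketch with a hypothetical project and pipeline name:<code python>
# settings.py -- module path is hypothetical; the number (0-1000) sets the pipeline's run order
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}
</code>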
==== Check and run spiders ====
* List all spiders in the project:<code bash>
scrapy list
</code>
* Run a spider:<code bash>
scrapy crawl spidername
</code>
===== Custom APIs =====
==== Rule and linkextractors in CrawlSpider ====
CrawlSpider uses two basic algorithms (see the sketch after this list):
  - A **recursive algorithm** that finds every url linked from the start url, building up the network of urls connected to it
  - A **rule-based link-extraction algorithm** that filters that network down to the urls it actually wants to download
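A minimal CrawlSpider sketch showing the two ideas together; the domain and the **allow** pattern are hypothetical:<code python>
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']     # hypothetical domain
    start_urls = ['http://example.com/']

    rules = (
        # follow=True -> recurse into every matched page, building the url network;
        # the LinkExtractor filters which urls are actually downloaded
        Rule(LinkExtractor(allow=(r'/articles/',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url,
               'title': response.xpath('//title/text()').extract_first()}
</code>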
==== Scrapy linkextractors package ====
Refer to: http://...
Link extractors are objects whose only purpose is to extract links from web pages. The only public method that every link extractor has is **extract_links**, which receives a Response object and returns a list of **scrapy.link.Link** objects.
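A quick sketch of calling **extract_links** on a response you already have:<code python>
from scrapy.linkextractors import LinkExtractor

# inside a Spider subclass
def parse(self, response):
    le = LinkExtractor()
    for link in le.extract_links(response):
        # each result is a scrapy.link.Link with .url and .text attributes
        self.logger.info('%s -> %s', link.text, link.url)
</code>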
=== linkextractors classes ===
Link extractor classes bundled with Scrapy are provided in the **scrapy.linkextractors** module. Some basic classes in **scrapy.linkextractors** used to extract links:
* **scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor** – the default **LinkExtractor** points to this class, because of the alias below:<code python>
LinkExtractor = LxmlLinkExtractor
</code>
* **allow_domains (str or list)** – a single value or a list of strings containing domains which **will be considered for extracting the links**
* **deny_domains (str or list)** – a single value or a list of strings containing domains which **won’t be considered for extracting the links**
* **deny_extensions (list)** – a single value or a list of strings containing extensions that **should be ignored** when extracting links
* **restrict_xpaths (str or list)** – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, **only the text selected by those XPaths** will be scanned for links. See examples below.
* **restrict_css (str or list)** – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has **the same behaviour as restrict_xpaths**
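A sketch combining several of the parameters above (all patterns and domains are made up for illustration):<code python>
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(
    allow=(r'/category/\d+',),            # keep only urls matching this regex
    deny=(r'/login',),                    # drop urls matching this one
    allow_domains=('example.com',),       # only this domain
    deny_extensions=('pdf', 'zip'),       # skip links to these file types
    restrict_xpaths=('//div[@id="content"]',),  # scan only this region of the page
)
links = le.extract_links(response)        # response: an already-fetched Response
</code>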
=== extract links with linkextractors ===
Extract file links from an html page, limited to the **tags** and **attrs** configured on the link extractor. The **allow** pattern, the second argument of the **getFullUrl** helper and the item field name are truncated in the original; the values below are assumptions:<code python>
# sle is presumably the link extractor class, e.g.:
#   from scrapy.linkextractors import LinkExtractor as sle
filesExtractor = sle(allow=("/...",))  # allow pattern truncated in the original
links = [l for l in self.filesExtractor.extract_links(response) if l not in self.seen]
file_item = FileItem()
file_urls = []
if len(links) > 0:
    for link in links:
        self.seen.add(link)
        # getFullUrl: project helper that resolves a (possibly relative) url; second argument assumed
        fullurl = getFullUrl(link.url, response.url)
        file_urls.append(fullurl)
file_item['file_urls'] = file_urls  # field name assumed
</code>
==== Scrapy Selector Package ====
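A short sketch of what the **scrapy.selector** package provides, XPath and CSS selection over an html document:<code python>
from scrapy.selector import Selector

sel = Selector(text='<html><body><span>good</span></body></html>')
print(sel.xpath('//span/text()').extract())   # ['good']
print(sel.css('span::text').extract())        # ['good']
</code>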
==== Rule in CrawlSpider ====
Understanding how to use Rule in CrawlSpider:
* Rule constructor:<code python>
Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
</code>
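A sketch of how two Rule objects typically combine in a CrawlSpider's **rules** tuple (the regexes are hypothetical):<code python>
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

rules = (
    # pagination links: follow them, but don't parse anything on those pages
    Rule(LinkExtractor(allow=(r'/page/\d+',)), follow=True),
    # detail pages: parse with the callback; follow defaults to False when a callback is set
    Rule(LinkExtractor(allow=(r'/item/\d+',)), callback='parse_item'),
)
</code>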