crawler:scrapy
Differences
This shows you the differences between two versions of the page.
crawler:scrapy [2016/08/17 00:50] – [extract links with linkextractors] admin | crawler:scrapy [2022/10/29 16:15] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 149: | Line 149: | ||
SET VS90COMNTOOLS=%VS100COMNTOOLS% | SET VS90COMNTOOLS=%VS100COMNTOOLS% | ||
</ | </ | ||
+ | * upgrade setuptools:< | ||
+ | pip install -U setuptools | ||
+ | </ | ||
=== Install pyopenssl === | === Install pyopenssl === | ||
Step by step install openssl: | Step by step install openssl: | ||
Line 441: | Line 444: | ||
- **Recursive algorithm** to find every URL linked from the start URL and build the network of URLs connected to it | - **Recursive algorithm** to find every URL linked from the start URL and build the network of URLs connected to it | ||
- **Rule-based link extraction algorithm** to filter out the URLs it wants to download | - **Rule-based link extraction algorithm** to filter out the URLs it wants to download | ||
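The two algorithms listed above can be sketched together in plain Python. This is only an illustration of the idea, not Scrapy's implementation: the `PAGES` dictionary stands in for the web, and the names `crawl`, `follow_rule` and `download_rule` are hypothetical.

```python
import re

# Hypothetical in-memory "web": each URL maps to the URLs its page links to.
PAGES = {
    "http://example.com/": [
        "http://example.com/a",
        "http://example.com/files/x.pdf",
    ],
    "http://example.com/a": [
        "http://example.com/",
        "http://example.com/files/y.pdf",
        "http://other.org/",
    ],
    "http://example.com/files/x.pdf": [],
    "http://example.com/files/y.pdf": [],
    "http://other.org/": [],
}


def crawl(url, follow_rule, download_rule, seen=None, to_download=None):
    """Recursively walk links from `url`.

    Links matching `download_rule` are collected for download;
    other links matching `follow_rule` are crawled in turn.
    Returns (seen URLs, URLs selected for download).
    """
    if seen is None:
        seen, to_download = set(), []
    seen.add(url)
    for link in PAGES.get(url, []):
        if link in seen:
            continue  # already handled, avoids infinite loops
        if download_rule.search(link):
            seen.add(link)
            to_download.append(link)
        elif follow_rule.search(link):
            crawl(link, follow_rule, download_rule, seen, to_download)
    return seen, to_download


if __name__ == "__main__":
    seen, files = crawl(
        "http://example.com/",
        re.compile(r"^http://example\.com/"),
        re.compile(r"\.pdf$"),
    )
    print(sorted(files))
```

The `seen` set is what keeps the recursion from looping when pages link back to each other; links that match neither rule are simply ignored.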
- | ==== linkextractors package ==== | + | ==== Scrapy |
refer: http:// | refer: http:// | ||
Line 526: | Line 529: | ||
</ | </ | ||
=== extract links with linkextractors === | === extract links with linkextractors === | ||
- | Extract file links from an HTML file:< | + | Extract file links from an HTML file, taking the links found in **tags=(' |
filesExtractor = sle(allow=("/ | filesExtractor = sle(allow=("/ | ||
links = [l for l in self.filesExtractor.extract_links(response) if l not in self.seen] | links = [l for l in self.filesExtractor.extract_links(response) if l not in self.seen] | ||
Line 537: | Line 540: | ||
file_urls.append(fullurl) | file_urls.append(fullurl) | ||
file_item[' | file_item[' | ||
- | </ | + | </ |
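As a rough, standard-library-only sketch of what the snippet above does (collect links from selected tags, keep those matching an `allow` pattern, skip the ones already seen, and build full URLs), something like the following could be used. The class `SimpleLinkExtractor`, its default `tags`/`attrs` values and the sample HTML are assumptions made for illustration; it is not Scrapy's link extractor.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class SimpleLinkExtractor(HTMLParser):
    """Hypothetical stand-in for a rule-based link extractor."""

    def __init__(self, allow, tags=("a", "area"), attrs=("href",)):
        super().__init__()
        self.allow = re.compile(allow)  # only URLs matching this are kept
        self.tags = tags                # tags to look inside
        self.attrs = attrs              # attributes that hold the link
        self.found = []

    def handle_starttag(self, tag, attributes):
        if tag not in self.tags:
            return
        for name, value in attributes:
            if name in self.attrs and value:
                self.found.append(value)

    def extract_links(self, html, base_url):
        """Return absolute URLs from `html` that match the allow rule."""
        self.reset()
        self.found = []
        self.feed(html)
        self.close()
        full_urls = [urljoin(base_url, href) for href in self.found]
        return [u for u in full_urls if self.allow.search(u)]


if __name__ == "__main__":
    html = (
        '<a href="/files/a.pdf">A</a>'
        '<a href="/about">About</a>'
        '<img src="/files/b.pdf">'
    )
    extractor = SimpleLinkExtractor(allow=r"/files/")
    seen = set()
    links = [
        u
        for u in extractor.extract_links(html, "http://example.com/index.html")
        if u not in seen
    ]
    seen.update(links)
    print(links)
```

Restricting extraction to specific tags and attributes is what keeps the `img` source out of the result here, even though its path matches the `allow` pattern.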
+ | ==== Scrapy Selector Package ==== | ||
==== Rule in CrawlSpider ==== | ==== Rule in CrawlSpider ==== | ||
Understanding how to use Rule in CrawlSpider: | Understanding how to use Rule in CrawlSpider: |
crawler/scrapy.1471395004.txt.gz · Last modified: 2022/10/29 16:15 (external edit)