crawler:scrapy
Differences
This shows you the differences between two versions of the page.
| crawler:scrapy [2016/08/17 00:42] – [Rule in CrawlSpider] admin | crawler:scrapy [2022/10/29 16:15] (current) – external edit 127.0.0.1 | ||
|---|---|---|---|
| Line 149: | Line 149: | ||
| SET VS90COMNTOOLS=%VS100COMNTOOLS% | SET VS90COMNTOOLS=%VS100COMNTOOLS% | ||
| </ | </ | ||
| + | * upgrade setuptools:< | ||
| + | pip install -U setuptools | ||
| + | </ | ||
| === Install pyopenssl === | === Install pyopenssl === | ||
| Step by step install openssl: | Step by step install openssl: | ||
| Line 441: | Line 444: | ||
| - **Recursive algorithm** to find every URL linked from the start URL and build the network of URLs connected to it | - **Recursive algorithm** to find every URL linked from the start URL and build the network of URLs connected to it | ||
| - **Rule-based link extraction algorithm** to filter out the URLs it wants to download | - **Rule-based link extraction algorithm** to filter out the URLs it wants to download | ||
| - | ==== link-extractors | + | ==== Scrapy linkextractors package |
| refer: http:// | refer: http:// | ||
| Link extractors are objects whose only purpose is to extract links from web pages. The only public method that every link extractor has is **extract_links**, | Link extractors are objects whose only purpose is to extract links from web pages. The only public method that every link extractor has is **extract_links**, | ||
| + | === linkextractors classes === | ||
| The link extractor classes bundled with Scrapy are provided in the **scrapy.linkextractors** module. The basic classes in **scrapy.linkextractors** used to extract links are: | The link extractor classes bundled with Scrapy are provided in the **scrapy.linkextractors** module. The basic classes in **scrapy.linkextractors** used to extract links are: | ||
| * **scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor**. Because alias below< | * **scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor**. Because alias below< | ||
| Line 525: | Line 528: | ||
| ................... | ................... | ||
| </ | </ | ||
| + | === extract links with linkextractors === | ||
| + | Extract files from an HTML page whose links are in **tags=('
| + | filesExtractor = sle(allow=("/ | ||
| + | links = [l for l in self.filesExtractor.extract_links(response) if l not in self.seen] | ||
| + | file_item = FileItem() | ||
| + | file_urls = [] | ||
| + | if len(links) > 0: | ||
| + | for link in links: | ||
| + | self.seen.add(link) | ||
| + | fullurl = getFullUrl(link.url, | ||
| + | file_urls.append(fullurl) | ||
| + | file_item[' | ||
| + | </ | ||
| + | ==== Scrapy Selector Package ==== | ||
| ==== Rule in CrawlSpider ==== | ==== Rule in CrawlSpider ==== | ||
| Understand about using Rule in CrawlSpider: | Understand about using Rule in CrawlSpider: | ||
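The two algorithms listed above (recursive link discovery and rule-based filtering with a ''seen'' set) can be sketched with the Python standard library alone. This is a minimal illustration, not Scrapy's implementation: the in-memory ''PAGES'' dictionary, the ''AnchorParser'' class, the ''crawl'' function and the ''\.pdf$'' rule are all hypothetical names and values chosen for the example, and no network request is made.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" (URL -> HTML body) standing in for real downloads.
PAGES = {
    "http://example.com/": '<a href="/docs/">Docs</a> <a href="/about">About</a>',
    "http://example.com/docs/": '<a href="/files/a.pdf">A</a> <a href="/">Home</a>',
    "http://example.com/about": '<a href="/files/b.pdf">B</a> <a href="/docs/">Docs</a>',
}


class AnchorParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for all anchors found in html."""
    parser = AnchorParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def crawl(url, allow, seen, wanted):
    """Recursively follow links from url; collect URLs matching the allow rule."""
    if url in seen:          # skip anything already visited
        return
    seen.add(url)
    if allow.search(url):    # rule matched: keep the URL for download, do not parse it
        wanted.append(url)
        return
    html = PAGES.get(url)
    if html is None:         # unknown page in this toy site
        return
    for link in extract_links(url, html):
        crawl(link, allow, seen, wanted)


seen = set()
wanted = []
crawl("http://example.com/", re.compile(r"\.pdf$"), seen, wanted)
print(wanted)
```

In Scrapy itself the same roles are played by a link extractor's **extract_links** method and by the rules of a CrawlSpider; the sketch only shows why a ''seen'' set is needed to stop the recursion on pages that link back to each other.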
