Differences

This shows you the differences between two versions of the page.

--- crawler:scrapy [2016/08/17 00:50] – [extract links with linkextractors] admin
+++ crawler:scrapy [2022/10/29 16:15] (current) – external edit 127.0.0.1
@@ Line 149: / Line 149: @@
 SET VS90COMNTOOLS=%VS100COMNTOOLS%
 </code> => Fix error: Unable to find vcvarsall.bat
+  * upgrade setuptools:<code bat>
+pip install -U setuptools
+</code>
 === Install pyopenssl ===
 Step-by-step OpenSSL installation:
@@ Line 441: / Line 444: @@
   - **Recursive algorithm** to find every URL linked from the start URL and build the network of URLs connected to it
   - **Rule-based link extraction algorithm** to filter out the URLs that the spider wants to download
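The two algorithms in the list above can be sketched without Scrapy. This is only an illustration under stated assumptions: the `PAGES` dictionary, the `crawl` function, and both regular expressions are hypothetical stand-ins for real HTTP responses and real crawl rules, and this is not how Scrapy itself is implemented.

```python
import re

# Hypothetical in-memory "site": each URL maps to the URLs its page links to.
# A real crawler would download the page and parse the links instead.
PAGES = {
    "http://example.com/": [
        "http://example.com/a.html",
        "http://example.com/b.html",
        "http://other.org/",
    ],
    "http://example.com/a.html": [
        "http://example.com/",
        "http://example.com/files/doc.pdf",
    ],
    "http://example.com/b.html": [
        "http://example.com/a.html",
        "http://example.com/files/img.png",
    ],
}


def crawl(url, follow_rule, seen=None):
    """Recursively collect every URL reachable from `url` that matches `follow_rule`."""
    if seen is None:
        seen = set()
    if url in seen:
        # Already visited: stop here so that link cycles do not recurse forever.
        return seen
    seen.add(url)
    for link in PAGES.get(url, []):
        if follow_rule.search(link):
            crawl(link, follow_rule, seen)
    return seen


# Assumed rule 1: only follow links that stay on example.com.
follow_rule = re.compile(r"^http://example\.com/")
# Assumed rule 2: only download PDF and PNG files.
download_rule = re.compile(r"\.(pdf|png)$")

visited = crawl("http://example.com/", follow_rule)
to_download = sorted(u for u in visited if download_rule.search(u))
print(to_download)
```

The `seen` set plays the same role as `self.seen` in the spider code on this page: it prevents the recursion from revisiting a URL.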
-==== linkextractors package ====
+==== Scrapy linkextractors package ====
 refer: http://doc.scrapy.org/en/latest/topics/link-extractors.html
@@ Line 526: / Line 529: @@
 </code>
 === extract links with linkextractors ===
-Extract files in html file:<code python>
+Extract file URLs from an HTML page, taking only the links found in **tags=('script','img')** and **attrs=('src')**:<code python>
 filesExtractor = sle(allow=("/*"), tags=('script','img'), attrs=('src'), deny_extensions = [])
 links = [l for l in self.filesExtractor.extract_links(response) if l not in self.seen]
@@ Line 537: / Line 540: @@
           file_urls.append(fullurl)
       file_item['file_urls'] = file_urls
 </code>
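To make the effect of the `tags` and `attrs` filters concrete, here is a rough standard-library sketch of the same idea. It does not use Scrapy and does not claim to mirror its internals; the `SrcExtractor` class, the sample HTML, and the base URL are all hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SrcExtractor(HTMLParser):
    """Collect attribute values from selected tags and resolve them to absolute URLs."""

    def __init__(self, base_url, tags=("script", "img"), attrs=("src",)):
        super().__init__()
        self.base_url = base_url
        self.wanted_tags = set(tags)
        self.wanted_attrs = set(attrs)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.wanted_tags:
            return
        for name, value in attrs:
            if name in self.wanted_attrs and value:
                full_url = urljoin(self.base_url, value)
                # Skip duplicates, similar in spirit to the `seen` check above.
                if full_url not in self.urls:
                    self.urls.append(full_url)


# Hypothetical page: two distinct files, one duplicate image, one plain link.
html = """
<html><head><script src="/js/app.js"></script></head>
<body><img src="img/logo.png"><a href="/about.html">About</a>
<img src="img/logo.png"></body></html>
"""

extractor = SrcExtractor("http://example.com/docs/")
extractor.feed(html)
file_urls = extractor.urls
print(file_urls)
```

The `<a href>` link is ignored because neither its tag nor its attribute is in the filter, which is the behaviour the `tags` and `attrs` arguments are meant to select.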
+==== Scrapy Selector Package ====
 ==== Rule in CrawlSpider ====
 How to use Rule in CrawlSpider: