SET VS90COMNTOOLS=%VS100COMNTOOLS%
</code> => Fix error: Unable to find vcvarsall.bat
  * upgrade setuptools:<code bat>
pip install -U setuptools
</code>
=== Install pyopenssl ===
Step by step install openssl:
  - **Recursive algorithm** to find all URLs linked from the seed URL and build the network of URLs linked to it
  - **Rule-based link-extraction algorithm** to filter out the URLs it wants to download
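The two steps above can be sketched as a minimal breadth-first crawl. This is a stdlib-only illustration, not Scrapy's implementation; **fetch_links** is a hypothetical callable standing in for the download-and-extract-by-rule step:<code python>
from collections import deque

def crawl(seed, fetch_links, max_pages=100):
    """Follow links outward from the seed URL, visiting each URL once."""
    seen = {seed}
    queue = deque([seed])
    network = {}  # url -> list of urls it links to
    while queue and len(network) < max_pages:
        url = queue.popleft()
        links = fetch_links(url)  # download the page, extract links by rule
        network[url] = links
        for link in links:
            if link not in seen:  # avoid downloading the same url twice
                seen.add(link)
                queue.append(link)
    return network
</code>
The **seen** set is what keeps the recursion finite even when pages link back to each other.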
==== Scrapy linkextractors package ====
refer: http://doc.scrapy.org/en/latest/topics/link-extractors.html
  
Link extractors are objects whose only purpose is to extract links from web pages. The only public method that every link extractor has is **extract_links**, which **receives a Response object** and **returns a list of scrapy.link.Link objects**. Link extractors are meant to be instantiated once and their extract_links method called several times with different responses to extract links to follow.
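As an illustration of that interface only (a toy stand-in, not Scrapy's real extractor), here is a stdlib-only class with a single public **extract_links** method returning link objects that carry a **url** attribute; it takes raw HTML instead of a Response object to stay self-contained:<code python>
from dataclasses import dataclass
from html.parser import HTMLParser

@dataclass(frozen=True)
class Link:
    url: str

class SimpleLinkExtractor(HTMLParser):
    """Toy stand-in: instantiate once, call extract_links per document."""
    def extract_links(self, html):
        self.reset()       # allow reuse across documents
        self._links = []
        self.feed(html)
        return self._links

    def handle_starttag(self, tag, attrs):
        # collect the href of every <a> tag that has one
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self._links.append(Link(url=href))
</code>
As with Scrapy's extractors, one instance is created once and its **extract_links** is called repeatedly with different documents.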
=== linkextractors classes ===
Link extractor classes bundled with Scrapy are provided in the **scrapy.linkextractors** module. Some basic classes in **scrapy.linkextractors** used to extract links:
  * **scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor**, because of the alias below<code python>
...................
</code>
=== extract links with linkextractors ===
Extract file links from an HTML page whose links appear in **tags=('script','img')** with **attrs=('src',)**:<code python>
filesExtractor = sle(allow=('/*',), tags=('script', 'img'), attrs=('src',), deny_extensions=[])
links = [l for l in self.filesExtractor.extract_links(response) if l not in self.seen]
file_item = FileItem()
file_urls = []
if len(links) > 0:
    for link in links:
        self.seen.add(link)
        fullurl = getFullUrl(link.url, self.defaultBaseUrl)
        file_urls.append(fullurl)
    file_item['file_urls'] = file_urls
</code>
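**getFullUrl** above is a project-specific helper; the standard library's **urllib.parse.urljoin** performs the same resolution of a possibly-relative extracted link against a base URL (the base URL below is only an example value):<code python>
from urllib.parse import urljoin

base = 'http://example.com/docs/page.html'  # example base URL
print(urljoin(base, 'img/logo.png'))                   # http://example.com/docs/img/logo.png
print(urljoin(base, '/static/app.js'))                 # http://example.com/static/app.js
print(urljoin(base, 'http://cdn.example.com/lib.js'))  # http://cdn.example.com/lib.js
</code>
Already-absolute links pass through unchanged, so it is safe to apply to every extracted **src** value.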
==== Scrapy Selector Package ====
==== Rule in CrawlSpider ====
Understand how to use Rule in CrawlSpider:
crawler/scrapy.1471394533.txt.gz · Last modified: 2022/10/29 16:15 (external edit)