Scrapy Douban search page crawler

Use the Scrapy crawler framework to crawl the search results of Douban books

Scrapy

Scrapy is an application framework written for crawling websites and extracting structured data

It can be used in a wide range of applications, including data mining, information processing, and archiving historical data

It provides base classes for various types of crawlers, such as BaseSpider and CrawlSpider

Main components

The Scrapy framework is mainly composed of five components

  1. Scheduler
    The scheduler can be thought of as a priority queue of URLs: it decides which URL to crawl next and removes duplicate URLs. Users can customize the scheduler to suit their own needs.
  2. Downloader
    The downloader carries the heaviest workload of all the components; it is used to download resources from the network at high speed.
    The downloader's code is not especially complex, but it is efficient, mainly because it is built on Twisted, an efficient asynchronous networking framework.
  3. Spider

    The spider is the part users work with most. Users write their own spiders (using regular expressions, selectors, and similar syntax) to extract the information they need, the so-called entities (Items), from specific web pages. Users can also extract links from the pages and let Scrapy continue crawling the next page.

  4. Item pipeline

    The item pipeline processes the entities (Items) extracted by spiders.

    Its main functions are to persist entities, validate them, and strip out unnecessary information (a minimal sketch follows this list).

  5. Scrapy engine

    The Scrapy engine is the core of the whole framework.
    It controls the scheduler, downloader, and spiders; in effect, the engine is the framework's CPU, driving the whole process.
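
To make the item pipeline's role concrete, here is a minimal pipeline sketch; the class name and the title/abstract fields are placeholders chosen for illustration, not part of the original project.

from scrapy.exceptions import DropItem


class MinimalValidationPipeline:
    """Illustrative pipeline: validate items and strip an unneeded field."""

    def process_item(self, item, spider):
        # Validation: drop items that are missing a required field.
        if not item.get('title'):
            raise DropItem('missing title')
        # Cleaning: remove a field we do not want to persist.
        if 'abstract' in item:
            del item['abstract']
        return item

A pipeline like this is enabled through the ITEM_PIPELINES setting in settings.py.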

Data flow

The data flow in Scrapy is controlled by the execution engine. The process is as follows:

  1. The engine opens a website, finds the spider that handles that website, and requests the first URL(s) to crawl from that spider
  2. The engine obtains the first URL to crawl from the Spider and schedules it as a Request in the scheduler
  3. The engine asks the scheduler for the next URL to crawl
  4. The scheduler returns the next URL to crawl to the engine, and the engine forwards it to the downloader through the downloader middleware (request direction)
  5. Once the page is downloaded, the downloader generates a Response for the page and sends it to the engine through the downloader middleware (response direction)
  6. The engine receives the Response from the downloader and sends it to the Spider for processing through the Spider middleware (input direction)
  7. The Spider processes the Response and returns the crawled Items and (follow-up) new Requests to the engine
  8. The engine passes the crawled Items (returned by the Spider) to the Item Pipeline and the Requests (returned by the Spider) to the scheduler
  9. The process repeats (from step 2) until there are no more Requests in the scheduler, and the engine closes the spider
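
To make steps 7 and 8 concrete, here is a minimal spider sketch that yields both items and follow-up requests; the URL and CSS selectors are placeholders, not taken from this post.

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/list?page=1']  # placeholder URL

    def parse(self, response):
        # Yielded items are routed by the engine to the Item Pipeline (step 8).
        for row in response.css('div.item'):  # placeholder selector
            yield {'title': row.css('a::text').get()}
        # Yielded Requests go back to the scheduler for later crawling (step 8).
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)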

Simple use

Create a project: scrapy startproject XXX
Create a crawler: scrapy genspider XXX(crawler name) xxx.com(domain to crawl)
Generate an output file: scrapy crawl XXX -o XXX.json (generate a json/csv file)
Run a crawler: scrapy crawl XXX
List all crawlers: scrapy list
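
Besides the command line, a spider can also be run from a Python script; below is a minimal sketch using Scrapy's CrawlerProcess (the spider, URL, and output file name are placeholders).

import scrapy
from scrapy.crawler import CrawlerProcess


class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://example.com']  # placeholder URL

    def parse(self, response):
        yield {'title': response.css('title::text').get()}


if __name__ == '__main__':
    # The FEEDS setting writes scraped items to output.json (Scrapy >= 2.1).
    process = CrawlerProcess(settings={'FEEDS': {'output.json': {'format': 'json'}}})
    process.crawl(DemoSpider)
    process.start()  # blocks until the crawl finishes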

Directory structure of a Scrapy project

Create a new project named tutorial with the command "scrapy startproject tutorial"

A tutorial directory containing the following will be created

tutorial/                        
    scrapy.cfg            # project configuration file
    tutorial/            # the project's Python module; your code goes here
        __init__.py
        items.py        # the project's item definitions
        pipelines.py    # the project's pipelines
        settings.py        # the project's settings file
        spiders/        # directory where spider code is placed
            __init__.py
            ...

Crawl the Douban search page using Scrapy

Analysis

https://search.douban.com/movie/subject_search?search_text={search_text}&cat=1002&start={start}

search_text: the search keyword

cat: the search category

start: the offset of the first result (how many entries to skip)

The same URL pattern applies to both the book and movie search pages, so the crawling steps below work for either.
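
For illustration, the search URL can be assembled like this (the keyword value and the helper name are just examples):

from urllib.parse import urlencode


def build_search_url(search_text, start=0, cat=1002, kind='movie'):
    # kind/cat follow the pattern above: the movie search uses cat=1002,
    # while the book search used later in the spider uses kind='book' and cat=1001.
    params = {'search_text': search_text, 'cat': cat, 'start': start}
    return f'https://search.douban.com/{kind}/subject_search?' + urlencode(params)


print(build_search_url('python'))
# https://search.douban.com/movie/subject_search?search_text=python&cat=1002&start=0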

After crawling, it turns out that the page information cannot be read directly; instead there is a window.__DATA__ variable, and the data is apparently encrypted into that string.

After some searching, I found that someone had already extracted the JavaScript code that decrypts it!

So here is a link to that write-up: Decryption of window.__DATA__ on the Douban book search page.

Once this problem is solved, the rest is easy to crawl.

Code

See the GitHub repository for the complete code.

The extracted JS is in third_party/main.js

import re
import pathlib

import execjs
import scrapy

# DouBanBookSearchItem comes from the project's items.py (a sketch is shown after
# this code); the relative import assumes the usual Scrapy project layout.
from ..items import DouBanBookSearchItem


class DoubanBookSearchSpider(scrapy.Spider):
    name = 'douban_book_search'
    allowed_domains = ['douban.com']

    def __init__(self,keyword=None,start=None,*args, **kwargs):
        super(DoubanBookSearchSpider, self).__init__(*args, **kwargs)
        self.keyword = keyword
        self.start = start
        self.start_urls.append(f'https://search.douban.com/book/subject_search?search_text={self.keyword}&cat=1001&start={self.start}')

    def parse(self, response):
        # Extract the encrypted window.__DATA__ string from the page source
        r = re.search('window.__DATA__ = "([^"]+)"', response.text).group(1)
        # Load the decryption JS extracted from the page (third_party/main.js)
        file_path = pathlib.Path.cwd() / 'third_party/main.js'
        with open(file_path, 'r', encoding='gbk') as f:
            decrypt_js = f.read()
        # Run the JS decrypt() function via execjs to recover the JSON payload
        ctx = execjs.compile(decrypt_js)
        data = ctx.call('decrypt', r)
        for item in data['payload']['items']:
            if item.get('rating', None):
                cover_url = item['cover_url']
                score = item['rating']['value']
                score_num = item['rating']['count']
                url = item['url']
                abstract = item['abstract']
                title = item['title']
                id = item['id']
                yield DouBanBookSearchItem(
                    cover_url=cover_url,
                    score=score,
                    score_num=score_num,
                    url=url,
                    abstract=abstract,
                    title=title,
                    id=id)
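
For completeness, here is a sketch of what the DouBanBookSearchItem definition in items.py could look like, with fields matching those used above; the actual definition in the repository may differ.

import scrapy


class DouBanBookSearchItem(scrapy.Item):
    cover_url = scrapy.Field()
    score = scrapy.Field()
    score_num = scrapy.Field()
    url = scrapy.Field()
    abstract = scrapy.Field()
    title = scrapy.Field()
    id = scrapy.Field()

The spider can then be run with, for example, scrapy crawl douban_book_search -a keyword=python -a start=0 -o result.json, where -a passes keyword and start to the spider's __init__.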

References

Crawler framework Scrapy: a personal summary (detailed)

Architecture overview

Scrapy crawler framework: an introductory example (very detailed)

Decryption of window.__DATA__ on the Douban book search page

Tags: Python crawler scrapy

Posted by seavers on Sat, 07 May 2022 00:25:22 +0300