Python crawler framework: analyzing Ajax to crawl Zhihu's Explore section

1, Web page analysis and crawling fields

1. Crawl fields

There are not many fields to crawl: only three are needed, and the "content" field has to be fetched from the detail page.

2. Web page analysis

Starting URL https://www.zhihu.com/explore

The Explore section is a typical Ajax-loaded page.

Open the page, right-click and choose Inspect, switch to the Network tab, and click XHR. In this view, every entry that appears on refresh is an Ajax request.

Next, keep scrolling down the page.

You can see that Ajax load entries keep appearing.

The params carried by these Ajax requests are as follows. The Baidu image download crawler I wrote last time worked by constructing params directly; I tried the same approach in this crawler but got 404 responses, so instead we work from the URLs of the Ajax requests themselves.

These are the URLs loaded by Ajax:

https://www.zhihu.com/node/ExploreAnswerListV2?params=%7B%22offset%22%3A10%2C%22type%22%3A%22day%22%7D

https://www.zhihu.com/node/ExploreAnswerListV2?params=%7B%22offset%22%3A15%2C%22type%22%3A%22day%22%7D

Analyzing the two URLs above, we can see that only one parameter of this URL varies: the offset.

So we just need to change this parameter.
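
For reference, here is a minimal sketch (standard library only; the offset values 10 and 15 come from the two URLs above, and the offset 20 is a hypothetical next value) that decodes the params query string and rebuilds a URL for an arbitrary offset:

from urllib.parse import quote, unquote

# Decode the percent-encoded params of the two Ajax URLs above
print(unquote("%7B%22offset%22%3A10%2C%22type%22%3A%22day%22%7D"))  # {"offset":10,"type":"day"}
print(unquote("%7B%22offset%22%3A15%2C%22type%22%3A%22day%22%7D"))  # {"offset":15,"type":"day"}

# Rebuilding a URL for any offset is just the reverse
offset = 20  # hypothetical next offset
params = quote('{"offset":%d,"type":"day"}' % offset, safe="")
print("https://www.zhihu.com/node/ExploreAnswerListV2?params=" + params)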

Meanwhile, while analyzing the page we found that Ajax loading is capped at 40 pages, and the Ajax URL of the last page is

https://www.zhihu.com/node/ExploreAnswerListV2?params=%7B%22offset%22%3A199%2C%22type%22%3A%22day%22%7D
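
Putting this together (a small sketch, assuming offsets start at 5 and step by 5, which is what the spider below uses), the full offset sequence can be generated like this:

# Offsets for the 40 loadable pages: 5, 10, ..., 195, and finally 199
offsets = [page * 5 for page in range(1, 40)] + [199]
print(len(offsets))  # 40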

So far we have finished analyzing the URL of the Ajax request. I won't analyze the field extraction in detail: each response is a very simple static HTML page, and XPath handles it.

2, Code and analysis

1. items section

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ZhihufaxianItem(scrapy.Item):
    # title
    title = scrapy.Field()
    # author
    author = scrapy.Field()
    # content
    content = scrapy.Field()

This defines the fields to crawl.

2. settings section

# -*- coding: utf-8 -*-

# Scrapy settings for zhihufaxian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zhihufaxian'

SPIDER_MODULES = ['zhihufaxian.spiders']
NEWSPIDER_MODULE = 'zhihufaxian.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'zhihufaxian.middlewares.ZhihufaxianSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'zhihufaxian.middlewares.ZhihufaxianDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'zhihufaxian.pipelines.ZhihufaxianPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Enable the ITEM_PIPELINES section and set the USER_AGENT.
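
The post doesn't show pipelines.py; the default one generated by scrapy startproject is just a pass-through, roughly like this (a minimal sketch; persistence logic would go here):

# -*- coding: utf-8 -*-


class ZhihufaxianPipeline(object):
    def process_item(self, item, spider):
        # Pass-through for now; save the item (file, database, ...) here if needed
        return item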

3. spider section

# -*- coding: utf-8 -*-
import scrapy
from zhihufaxian.items import ZhihufaxianItem


class ZhfxSpider(scrapy.Spider):
    name = 'zhfx'
    allowed_domains = ['zhihu.com']
    start_urls = ['http://zhihu.com/']

    # You can only load 40 pages with ajax
    def start_requests(self):
        base_url = "https://www.zhihu.com/node/ExploreAnswerListV2?"
        for page in range(1, 41):
            if page < 40:
                # Pages 1-39: offsets 5, 10, ..., 195
                params = "params=%7B%22offset%22%3A" + str(page * 5) + "%2C%22type%22%3A%22day%22%7D"
            else:
                # Page 40: the last page uses offset 199
                params = "params=%7B%22offset%22%3A199%2C%22type%22%3A%22day%22%7D"
            url = base_url + params
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        # Each answer sits in a top-level div of the returned HTML fragment
        divs = response.xpath("//body/div")
        for div in divs:
            item = ZhihufaxianItem()
            # title
            item["title"] = "".join(div.xpath(".//h2/a/text()").getall())
            item["title"] = item["title"].replace("\n", "")
            # author
            item["author"] = "".join(div.xpath(".//div[@class='zm-item-answer-author-info']/span[1]/span[1]/a/text()").getall())
            item["author"] = item["author"].replace("\n", "")
            details_url = "".join(div.xpath(".//div[@class='zh-summary summary clearfix']/a/@href").getall())
            details_url = "https://www.zhihu.com" + details_url
            yield scrapy.Request(
                url=details_url,
                callback=self.details,
                meta={"item": item}
                )
            
    # Get content from the details page
    def details(self, response):
        item = response.meta["item"]
        item["content"] = "".join(response.xpath("//div[@class='RichContent-inner']/span/p/text()").getall())
        # Yield the item so the enabled item pipeline actually receives it
        yield item

The start_requests method is built first to construct the complete URL addresses. These are handed to parse, and finally the detail page is visited to grab the content field.
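
To run the spider, the usual command is scrapy crawl zhfx from the project directory; equivalently, from a small launcher script (a convenience sketch, not part of the original post):

from scrapy.cmdline import execute

# Equivalent to running "scrapy crawl zhfx" inside the project directory
execute(["scrapy", "crawl", "zhfx"])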

3, Summary and understanding

Zhihu is a good website for hands-on practice. The content loaded by Ajax is a complete web page rather than JSON, which saves the trouble of parsing JSON.

You can just crawl it directly.

At last we get the full answer body!

That's all.

Feel free to leave a comment about anything you don't understand.
