1, Web page analysis and crawling fields
1. Fields to crawl
There are not many fields to crawl: only three are needed, and the "content" field has to be fetched from the answer details page.
2. Web page analysis
Starting URL https://www.zhihu.com/explore
The Explore page is a typical ajax-loaded page.
Open the page, right-click and choose Inspect, switch to the Network tab, and click XHR. In this state, every entry that appears on refresh is an ajax request.
Next, keep scrolling down the page.
You can see that new ajax entries keep appearing.
The params carried by these ajax requests are shown below. The Baidu image crawler I wrote last time worked by constructing the params; I tried the same approach here, but it returned 404, so instead we work from the url of the ajax request itself.
This is the url loaded by ajax:
https://www.zhihu.com/node/ExploreAnswerListV2?params=%7B%22offset%22%3A5%2C%22type%22%3A%22day%22%7D
Comparing two of these urls shows that only one parameter changes: the offset value inside params.
So we just need to change this parameter.
While analyzing the page we also found that ajax loading stops after 40 pages, and the url of the last page uses an offset of 199:
https://www.zhihu.com/node/ExploreAnswerListV2?params=%7B%22offset%22%3A199%2C%22type%22%3A%22day%22%7D
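The escaped params string does not have to be hard-coded; it can be built with the standard library. A small sketch of how the 40 urls are generated, using the offset values from the analysis above:

import json
from urllib.parse import urlencode

base_url = "https://www.zhihu.com/node/ExploreAnswerListV2?"
for page in range(1, 41):
    # ajax stops after 40 pages; the last page uses offset 199
    offset = page * 5 if page < 40 else 199
    params = {"params": json.dumps({"offset": offset, "type": "day"}, separators=(",", ":"))}
    print(base_url + urlencode(params))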
Well, so far we have finished analyzing the url of the ajax request. The fields themselves need no special analysis: the loaded content is a very simple static fragment, so xpath is enough.
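The selectors can be tried out interactively before writing the spider; a quick sketch using scrapy shell (the xpath expressions are the same ones used in the spider below):

# in a terminal:
#   scrapy shell "https://www.zhihu.com/node/ExploreAnswerListV2?params=%7B%22offset%22%3A5%2C%22type%22%3A%22day%22%7D"
# then, inside the shell:
for div in response.xpath("//body/div"):
    # answer title and the relative link to its details page
    print(div.xpath(".//h2/a/text()").getall())
    print(div.xpath(".//div[@class='zh-summary summary clearfix']/a/@href").getall())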
2, Code and analysis
1. items section
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ZhihufaxianItem(scrapy.Item):
    # title
    title = scrapy.Field()
    # author
    author = scrapy.Field()
    # content
    content = scrapy.Field()
This defines the three fields to crawl.
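A Scrapy item behaves like a dict, so the spider can fill it field by field. A quick illustrative check (not part of the project files):

from zhihufaxian.items import ZhihufaxianItem

item = ZhihufaxianItem()
item["title"] = "example title"   # fields are assigned like dict keys
print(dict(item))                 # {'title': 'example title'}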
2. settings section
# -*- coding: utf-8 -*-

# Scrapy settings for zhihufaxian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zhihufaxian'

SPIDER_MODULES = ['zhihufaxian.spiders']
NEWSPIDER_MODULE = 'zhihufaxian.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'zhihufaxian.middlewares.ZhihufaxianSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'zhihufaxian.middlewares.ZhihufaxianDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'zhihufaxian.pipelines.ZhihufaxianPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Compared with the default template, the ITEM_PIPELINES section is uncommented and USER_AGENT is set to a real browser user agent.
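The pipelines.py file itself is not shown in this post; a minimal sketch of what ZhihufaxianPipeline could look like (writing items as json lines is just an assumption for illustration):

import json


class ZhihufaxianPipeline(object):
    # called once when the spider starts
    def open_spider(self, spider):
        self.file = open("zhihu_explore.jl", "w", encoding="utf-8")

    # called for every item yielded by the spider
    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    # called once when the spider finishes
    def close_spider(self, spider):
        self.file.close()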
3. spider section
# -*- coding: utf-8 -*-
import scrapy

from zhihufaxian.items import ZhihufaxianItem


class ZhfxSpider(scrapy.Spider):
    name = 'zhfx'
    allowed_domains = ['zhihu.com']
    start_urls = ['http://zhihu.com/']

    # ajax can only load 40 pages
    def start_requests(self):
        base_url = "https://www.zhihu.com/node/ExploreAnswerListV2?"
        for page in range(1, 41):
            if page < 40:
                params = "params=%7B%22offset%22%3A" + str(page * 5) + "%2C%22type%22%3A%22day%22%7D"
            else:
                params = "params=%7B%22offset%22%3A" + str(199) + "%2C%22type%22%3A%22day%22%7D"
            url = base_url + params
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        divs = response.xpath("//body/div")
        for li in divs:
            item = ZhihufaxianItem()
            # title
            item["title"] = "".join(li.xpath(".//h2/a/text()").getall())
            item["title"] = item["title"].replace("\n", "")
            # author
            item["author"] = "".join(li.xpath(".//div[@class='zm-item-answer-author-info']/span[1]/span[1]/a/text()").getall())
            item["author"] = item["author"].replace("\n", "")
            details_url = "".join(li.xpath(".//div[@class='zh-summary summary clearfix']/a/@href").getall())
            details_url = "https://www.zhihu.com" + details_url
            yield scrapy.Request(
                url=details_url,
                callback=self.details,
                meta={"item": item}
            )

    # Get the content field from the details page
    def details(self, response):
        item = response.meta["item"]
        item["content"] = "".join(response.xpath("//div[@class='RichContent-inner']/span/p/text()").getall())
        print(item)
        yield item  # hand the completed item to the item pipeline
The start_requests method is used first to construct the complete url addresses; the responses are handed to parse, and finally the details page is requested to fill in the content field.
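The spider is run with scrapy crawl zhfx from the project directory. It can also be started from a plain Python script; a minimal sketch (the spider module path zhihufaxian.spiders.zhfx is assumed from the usual scrapy genspider naming):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# module path is an assumption based on the spider name 'zhfx'
from zhihufaxian.spiders.zhfx import ZhfxSpider

# load settings.py so USER_AGENT and the pipeline are applied
process = CrawlerProcess(get_project_settings())
process.crawl(ZhfxSpider)
process.start()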
3, Summary and understanding
Zhihu is a good site for hands-on practice. The content loaded by ajax is a complete web page rather than json, which saves the trouble of parsing json.
You can just crawl it directly, and at last get the body text of each answer.
That's all.
Feel free to leave a comment about anything that is unclear.