Crawling Douban movie data with Python requests + XPath

Background

My graduation project needs movie-related data, so I looked online for a tutorial on crawling movies. Almost every tutorial crawls the Douban Top 250, and I couldn't find anything else relevant. So I taught myself to combine Python's requests library with XPath to crawl movie information.

Enough talk, on to the code.

The URL I crawled is https://movie.douban.com/subject/25845392/ ; the information on this page is what we want.

How do we get this page's URL? Let's first analyze the list page.
Douban movie list

Inspecting the URL parameters suggests the page is paginated, so to see all movies we need to understand those parameters. Open the developer tools and watch the Network tab: every click on "load more" fires a new request. The start=140 parameter carried by that request is the offset of each page of data. This gives us the URL and parameters of the movie list endpoint; the start parameter increases by 20 each time.
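To make the pagination concrete, here is a small sketch that builds the query parameters for each page. The parameter names and fixed values come from the request observed in the developer tools; the helper name `list_page_params` is my own.

```python
# Fixed query parameters observed in the developer tools; tags=%E7%94%B5%E5%BD%B1
# in the URL is the URL-encoded form of "电影" (movie).
# Only start changes between pages, in steps of 20.
def list_page_params(page, step=20):
    return {
        'sort': 'U',
        'range': '0,10',
        'tags': '电影',
        'start': page * step,
    }

# Page 7 corresponds to the start=140 request seen above
print(list_page_params(7)['start'])  # 140
```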

With the page analysis done, we can start writing code.

# First, import the libraries we need
import requests
from lxml import html
import json
import time
import random
import csv
import re
class Douban:
    def __init__(self):
        self.URL = 'https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=%E7%94%B5%E5%BD%B1'
        self.start_num = []
        self.movie_url = []
        # range(0, 20, 20) yields only the offset 0 (one page); raise the stop
        # value, e.g. range(0, 200, 20), to build start offsets in steps of 20
        for start_num in range(0, 20, 20):
            self.start_num.append(start_num)
        # Define the headers to simulate a browser request
        self.header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}

All right, the preparation is done. Now let's write the part that extracts the page content via XPath.

    def get_top250(self):
        # Collect the movies from every page in one list
        movieList = []
        for start in self.start_num:
            start = str(start)
            # This request returns JSON; each item's url points to a movie's detail page
            response = requests.get(self.URL, params={'start': start}, headers=self.header)
            page_content = response.content.decode()
            # Parse the JSON response
            json_data = json.loads(page_content)
            # Loop over the movies in the JSON
            for key in range(len(json_data['data'])):
                # A dictionary holds the data of each movie
                movieDict = {}
                # Get the url from the JSON
                movie_url = json_data['data'][key]['url']
                # Extract the movie id from the url
                rex = re.compile(r'subject/(\d+)')  # compile the regular expression
                mo = rex.search(movie_url)  # search
                movie_id = mo.group(1)
                # Fetch the detail page HTML
                page_response = requests.get(movie_url, headers=self.header)
                page_result = page_response.content.decode()
                page_html = html.etree.HTML(page_result)
                # title
                movie_name = page_html.xpath('//*[@id="content"]/h1/span[1]/text()')
                # director
                director = page_html.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')
                # cast
                yanyuan = page_html.xpath('//*[@id="info"]/span[3]/span[2]')
                for yanyuanList in yanyuan:
                    yanyuanData = yanyuanList.xpath('./a/text()')
                # genre
                juqing = page_html.xpath('//*[@id="info"]/span[@property="v:genre"]/text()')
                # country (the literal must match the page's label text exactly)
                country = page_html.xpath('//*[@id="info"]/span[contains(./text(), "制片国家/地区:")]/following::text()[1]')
                # language
                language = page_html.xpath('//*[@id="info"]/span[contains(./text(), "语言:")]/following::text()[1]')
                # Release date
                push_time = page_html.xpath('//*[@id="info"]/span[11]/text()')
                # Runtime
                movie_long = page_html.xpath('//*[@id="info"]/span[13]/text()')
                # rating
                pingfen = page_html.xpath('//*[@id="interest_sectl"]/div[1]/div[2]/strong/text()')
                # Cover image URL
                conver_img = page_html.xpath('//*[@id="mainpic"]/a/img/@src')
                # Film synopsis; normalize-space strips spaces and line breaks
                describe = page_html.xpath('normalize-space(//*[@id="link-report"]/span/text())')
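As an aside, the movie-id regex used above can be checked in isolation, using the subject URL from the beginning of this post:

```python
import re

# Extract the numeric movie id from a Douban subject URL
rex = re.compile(r'subject/(\d+)')
mo = rex.search('https://movie.douban.com/subject/25845392/')
print(mo.group(1))  # 25845392
```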

Let me explain how to get a tag's content through XPath. It's simple: open the browser's developer tools and find the tag that holds the information you need, for example the production country.

We can get the location of this tag by right-clicking it and choosing Copy XPath. But then we notice the data we want is not inside the span tag; it sits after the span tag. What do we do?

# Use contains() to match the span by the label text it contains, then
# /following::text()[1] grabs the first text node that sits after the tag
'//*[@id="info"]/span[contains(./text(), "制片国家/地区:")]/following::text()[1]'
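Here is a minimal, self-contained demonstration of this trick. The HTML below is a stripped-down stand-in for Douban's #info block, not the real page: the country value is bare text sitting after the <span> label rather than inside it.

```python
from lxml import html

# Stand-in for Douban's #info block: the values are text nodes
# that come *after* each <span> label
doc = html.fromstring('''
<div id="info">
    <span class="pl">制片国家/地区:</span> 美国<br/>
    <span class="pl">语言:</span> 英语<br/>
</div>
''')

# contains() selects the span by its label text; following::text()[1]
# then returns the first text node after that span in document order
country = doc.xpath('//*[@id="info"]/span[contains(./text(), "制片国家/地区:")]/following::text()[1]')
print(country[0].strip())  # 美国
```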


That's how we read the information out of the page tags. Next comes storage; see the complete code below for details. I save the data in CSV format, which is very simple: define the writer and write the rows.
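In miniature, writing a list of dictionaries to CSV with csv.DictWriter looks like this (using an in-memory buffer instead of a file, and a shortened field list; the row values are made-up examples):

```python
import csv
import io

# One dictionary per movie, as built in get_top250 (fields shortened here)
movieList = [
    {'movie_id': '25845392', 'movie_name': 'example one'},
    {'movie_id': '123456', 'movie_name': 'example two'},
]

buf = io.StringIO()  # stand-in for the real file object
writer = csv.DictWriter(buf, fieldnames=['movie_id', 'movie_name'])
writer.writeheader()          # first row: the column names
for each in movieList:
    writer.writerow(each)

print(buf.getvalue().splitlines()[0])  # movie_id,movie_name
```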
The complete code is as follows:

import requests
from lxml import html
import json
import time
import random
import csv
import re

class Douban:

    def __init__(self):
        self.URL = 'https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=%E7%94%B5%E5%BD%B1'
        self.start_num = []
        self.movie_url = []
        # range(0, 20, 20) yields only the offset 0 (one page); raise the stop
        # value, e.g. range(0, 200, 20), to crawl more data in steps of 20
        for start_num in range(0, 20, 20):
            self.start_num.append(start_num)
        self.header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}

    def get_top250(self):
        # Collect the movies from every page in one list; defining it outside
        # the loop means pages are accumulated instead of overwritten
        movieList = []
        for start in self.start_num:
            start = str(start)
            response = requests.get(self.URL, params={'start': start}, headers=self.header)
            page_content = response.content.decode()
            # Parse the JSON response
            json_data = json.loads(page_content)
            for key in range(len(json_data['data'])):
                # A dictionary holds the data of each movie
                movieDict = {}
                movie_url = json_data['data'][key]['url']
                # Extract the movie id from the url
                rex = re.compile(r'subject/(\d+)')  # compile the regular expression
                mo = rex.search(movie_url)  # search
                movie_id = mo.group(1)
                # Fetch the detail page HTML
                page_response = requests.get(movie_url, headers=self.header)
                page_result = page_response.content.decode()
                page_html = html.etree.HTML(page_result)
                # title
                movie_name = page_html.xpath('//*[@id="content"]/h1/span[1]/text()')
                # director
                director = page_html.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')
                # cast
                yanyuan = page_html.xpath('//*[@id="info"]/span[3]/span[2]')
                for yanyuanList in yanyuan:
                    yanyuanData = yanyuanList.xpath('./a/text()')
                # genre
                juqing = page_html.xpath('//*[@id="info"]/span[@property="v:genre"]/text()')
                # country (the literal must match the page's label text exactly)
                country = page_html.xpath('//*[@id="info"]/span[contains(./text(), "制片国家/地区:")]/following::text()[1]')
                # language
                language = page_html.xpath('//*[@id="info"]/span[contains(./text(), "语言:")]/following::text()[1]')
                # Release date
                push_time = page_html.xpath('//*[@id="info"]/span[11]/text()')
                # Runtime
                movie_long = page_html.xpath('//*[@id="info"]/span[13]/text()')
                # rating
                pingfen = page_html.xpath('//*[@id="interest_sectl"]/div[1]/div[2]/strong/text()')
                # Cover image URL
                conver_img = page_html.xpath('//*[@id="mainpic"]/a/img/@src')
                # Film synopsis; normalize-space strips spaces and line breaks
                describe = page_html.xpath('normalize-space(//*[@id="link-report"]/span/text())')
                # Save data
                movieDict['movie_id'] = movie_id
                movieDict['movie_name'] = movie_name
                movieDict['director'] = director
                movieDict['yanyuanData'] = yanyuanData
                movieDict['juqing'] = juqing
                movieDict['country'] = country
                movieDict['language'] = language
                movieDict['push_time'] = push_time
                movieDict['movie_long'] = movie_long
                movieDict['pingfen'] = pingfen
                movieDict['conver_img'] = conver_img
                movieDict['describe'] = describe
                movieList.append(movieDict)
                print("Processing ---", movie_url)
                self.random_sleep(1, 0.4)
        return movieList

    # Sleep between requests to avoid getting the IP banned for crawling too fast
    def random_sleep(self, mu, sigma):
        '''Random sleep drawn from a normal distribution
        :param mu: mean sleep time in seconds
        :param sigma: standard deviation, controls the fluctuation range
        '''
        secs = random.normalvariate(mu, sigma)
        if secs <= 0:
            secs = mu  # too small; fall back to the mean
        print("Sleeping...")
        time.sleep(secs)

    # Save data
    def writeData(self, movieList):
        with open('douban.csv', 'w', encoding='utf-8', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=['movie_id', 'movie_name', 'director', 'yanyuanData', 'juqing',
                                                   'country', 'language', 'push_time', 'movie_long', 'pingfen',
                                                   'conver_img', 'describe'])
            writer.writeheader()
            for each in movieList:
                writer.writerow(each)

if __name__ == '__main__':
    movieList = []
    cls = Douban()
    movieList = cls.get_top250()
    cls.writeData(movieList)



This is my first blog post, so it may not be well written. If anything is unclear or you have questions, leave a comment or add me on QQ: 19686862360. Let's learn and improve together.

Tags: Python crawler Data Mining

Posted by chinto09 on Fri, 06 May 2022 12:53:25 +0300