Python crawler in practice: AcFun, the bullet-comment (danmaku) video site

CSDN home page: High IQ idiot
Original address: https://blog.csdn.net/qq_44700693/article/details/109124334?utm_source=app


Introduction

Some time ago I wrote up how to crawl Bilibili (station B): Python crawler: bilibili video download.

This time I'll continue with AcFun (station A). In fact, compared with Bilibili, AcFun's anti-crawling mechanism is simpler.

Single short video

Get video information

To make the analysis and explanation easier to follow, let's use an example:

[fairy UP special] AcFun Family Party - Chengdu Railway Station (today is another lsp day ~ ~)

Open the link directly in the browser and capture the network traffic. Under the XHR requests, the first request (or one of the first) loads the real request link of the video:
Although this is only an m3u8 file, we have a way to handle it. Before that, we must find out where this link comes from, that is, where we can find it.

After searching through all the XHR data, I decided to take a look at the page source:
Searching the page source for the request link of the m3u8 file shows that the link does appear there:

It is stored as JSON data in the source:

So we need to format the data to make extraction easier:

After formatting, I found that the value of one field is itself a JSON string, so after formatting this second layer of JSON we can see the following information:
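
The two-layer parsing described above can be sketched offline as follows. This is only a minimal illustration: the page text is a hypothetical, heavily trimmed stand-in for the real source, the function name extract_representations is made up for this example, and the regular expression is a slightly stricter variant of the one used in the article's code.

```python
import json
import re

# Hypothetical, heavily trimmed stand-in for the page source (assumption).
inner = {"adaptationSet": [{"representation": [
    {"qualityLabel": "1080P", "url": "https://example.com/hls/1080.m3u8"},
    {"qualityLabel": "720P", "url": "https://example.com/hls/720.m3u8"},
]}]}
outer = {"title": "demo", "currentVideoInfo": {"ksPlayJson": json.dumps(inner)}}
html = "window.pageInfo = window.videoInfo = " + json.dumps(outer) + ";"


def extract_representations(page_source):
    """Parse the outer JSON, then the JSON string stored in ksPlayJson."""
    match = re.search(r"window\.pageInfo = window\.\w+ = (\{.*?\});", page_source, re.S)
    if match is None:
        raise ValueError("video info not found in page source")
    first_layer = json.loads(match.group(1), strict=False)
    # The value of ksPlayJson is a string that contains JSON, so it is parsed again.
    second_layer = json.loads(first_layer["currentVideoInfo"]["ksPlayJson"], strict=False)
    return first_layer["title"], second_layer["adaptationSet"][0]["representation"]


title, reps = extract_representations(html)
print(title, [r["qualityLabel"] for r in reps])
```

The key point is the second json.loads call: the first one only turns ksPlayJson into a Python string, not into a dictionary.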

Even when we are not logged in and the web player will not play a definition directly, the back end has already prepared the playback links for us (Bilibili, by contrast, only loads the highest definition available to the current account or to a logged-out visitor), so we can play ultra-high-definition resources without logging in~~

class m3u8_url():
    def __init__(self, f_url):
        self.url = f_url

    def get_m3u8(self):
        # headers is the request-header dict defined in the full source code
        html = requests.get(self.url, headers=headers).text
        # First layer: the JSON object assigned to window.pageInfo in the page source
        first_json = json.loads(re.findall('window.pageInfo = window.*? = (.*?)};', html)[0] + '}', strict=False)
        name = first_json['title'].strip().replace("|",'')
        # Second layer: ksPlayJson is itself a JSON string that lists every definition
        video_info = json.loads(first_json['currentVideoInfo']['ksPlayJson'], strict=False)['adaptationSet'][0]['representation']

So that the definition can be chosen later, I also collect the definition labels:

Label = {}
num = 0
for quality in video_info:  # one entry per available definition
    num += 1
    Label[num] = quality['qualityLabel']
print(Label)
choice = int(input("Please select clarity: "))
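
As a hedged variation on the selection step, the sketch below builds the same 1-based label menu with enumerate and rejects an out-of-range choice before indexing. The helper names build_labels and pick_url and the sample data are assumptions made for illustration; they are not part of the original code.

```python
def build_labels(representations):
    """Map 1-based menu numbers to quality labels."""
    return {i: rep["qualityLabel"] for i, rep in enumerate(representations, start=1)}


def pick_url(representations, choice):
    """Return the m3u8 URL for a 1-based menu choice, rejecting invalid numbers."""
    if choice not in range(1, len(representations) + 1):
        raise ValueError("choice must be between 1 and {}".format(len(representations)))
    return representations[choice - 1]["url"]


# Hypothetical sample shaped like the 'representation' list (assumption).
sample = [
    {"qualityLabel": "1080P", "url": "https://example.com/hls/a.m3u8"},
    {"qualityLabel": "720P", "url": "https://example.com/hls/b.m3u8"},
]
labels = build_labels(sample)
print(labels)
print(pick_url(sample, 2))
```

Validating the number first avoids an IndexError (or a silently wrong pick for 0, which would index the last entry) when the user mistypes.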

Download video via m3u8 file address

Now that we can get the address of the video's m3u8 file, let's solve the small problem left over from before: how do we download the video through the m3u8 file?

First, we take an m3u8 file as a case:
For convenience, I wrote an m3u8 file by hand here as an example.

We know that the video links in the m3u8 file all point to ts segments, so we have to extract every ts link and add the prefix to assemble the real, complete link of each segment (here, assume the original prefix of the video is https://www.acfun.cn/):

urls = []  # Segment links of the video


def get_ts_urls():
    with open('123.m3u8', "r") as file:
        for line in file:
            if '.ts' in line:
                # strip() drops the trailing newline that every line carries
                urls.append("https://www.acfun.cn/" + line.strip())
    print(urls)
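
The same extraction can be tried without a file on disk. The sketch below is an assumption-laden illustration: the playlist text and segment names are invented, and it swaps plain string concatenation for urllib.parse.urljoin, which resolves each segment name against the playlist address.

```python
from urllib.parse import urljoin

# Hypothetical playlist; real segment names and tags will differ (assumption).
PLAYLIST = """#EXTM3U
#EXT-X-TARGETDURATION:4
#EXTINF:4.0,
seg-00001.ts?sign=abc
#EXTINF:4.0,
seg-00002.ts?sign=def
#EXT-X-ENDLIST
"""


def segment_urls(playlist_text, playlist_url):
    """Return absolute URLs for every non-comment line of an m3u8 playlist."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments, the rest are segment URIs.
        if line and not line.startswith("#"):
            urls.append(urljoin(playlist_url, line))
    return urls


for u in segment_urls(PLAYLIST, "https://www.acfun.cn/123.m3u8"):
    print(u)
```

Skipping lines that start with "#" instead of testing for ".ts" is a design choice of this sketch; it also keeps working if a tag line happens to contain the text ".ts".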

With the method above, we can get the link of every segment from the m3u8 file. Next, we will complete the download function:

The basic idea of downloading is the same as in my previous article: Python crawler: use the most common method to crawl ts files and combine them into mp4 format

class Download():
    def __init__(self, name, m3u8_url, path):
        '''
        :param name: Video name
        :param m3u8_url: Video m3u8 file address
        :param path: Download directory
        '''
        self.urls = []  # Segment links; kept per instance so a later download does not inherit them
        self.video_name = name
        self.path = path
        self.f_url = str(m3u8_url).split('hls/')[0] + 'hls/'
        with open(self.path + '/{}.m3u8'.format(self.video_name), 'wb')as f:
            f.write(requests.get(m3u8_url, headers={'user-agent': 'Chrome/84.0.4147.135'}).content)

    def get_ts_urls(self):
        with open(self.path + '/{}.m3u8'.format(self.video_name), "r") as file:
            lines = file.readlines()
            for line in lines:
                if '.ts' in line:
                    self.urls.append(self.f_url + line.replace('\n', ''))

    def start_download(self):
        self.get_ts_urls()
        for url in tqdm(self.urls, desc="Downloading {} ".format(self.video_name)):
            movie = requests.get(url, headers={'user-agent': 'Chrome/84.0.4147.135'})
            with open(self.path + '/{}.flv'.format(self.video_name), 'ab')as f:
                f.write(movie.content)
        os.remove(self.path + '/{}.m3u8'.format(self.video_name))

Code comments:

  • 1. So that only the video remains in the end, the m3u8 file of the current video is deleted automatically after the download finishes.
  • 2. Why line.replace('\n', '') is needed: every line read from the m3u8 file ends with a "\n".
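
The merge step itself (append every segment's bytes to one file, in playlist order) can be sketched separately from the network. In the minimal example below, merge_segments and its fetch parameter are hypothetical names; fetch is any callable that turns a URL into bytes, and a small dictionary stands in for the server so the sketch runs offline.

```python
import os
import tempfile


def merge_segments(urls, out_path, fetch):
    """Write each segment's bytes to out_path in playlist order.

    fetch is any callable mapping a URL to bytes. With requests it could be
    something like lambda u: requests.get(u, timeout=30).content (assumption).
    """
    with open(out_path, "wb") as out:  # opened once instead of once per segment
        for url in urls:
            out.write(fetch(url))
    return os.path.getsize(out_path)


# Stand-in for the network so the sketch runs offline (assumption).
fake_store = {"a.ts": b"AAAA", "b.ts": b"BB"}
target = os.path.join(tempfile.mkdtemp(), "demo.ts")
size = merge_segments(["a.ts", "b.ts"], target, fake_store.__getitem__)
print(size)
```

Unlike the 'ab' mode used in the article's code, 'wb' truncates an existing file, so rerunning a download does not append a second copy of the video to the first.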

Source code and effect

Finally, now we can integrate the code and run it:

import os
import re
import json
import requests
from tqdm import tqdm

path = './'

headers = {
    'referer': 'https://www.acfun.cn/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83'
}

class m3u8_url():
    def __init__(self, f_url):
        self.url = f_url

    def get_m3u8(self):
        global flag, qua, rel_path
        html = requests.get(self.url, headers=headers).text
        first_json = json.loads(re.findall('window.pageInfo = window.videoInfo = (.*?)};', html)[0] + '}', strict=False)
        name = first_json['title'].strip().replace("|",'')
        video_info = json.loads(first_json['currentVideoInfo']['ksPlayJson'], strict=False)['adaptationSet'][0]['representation']
        Label = {}
        num = 0
        for quality in video_info:  # definition
            num += 1
            Label[num] = quality['qualityLabel']
        print(Label)
        choice = int(input("Please select clarity: "))
        Download(name + '[{}]'.format(Label[choice]), video_info[choice - 1]['url'], path).start_download()

class Download():
    def __init__(self, name, m3u8_url, path):
        '''
        :param name: Video name
        :param m3u8_url: Video m3u8 file address
        :param path: Download directory
        '''
        self.urls = []  # Segment links, kept per instance
        self.video_name = name
        self.path = path
        self.f_url = str(m3u8_url).split('hls/')[0] + 'hls/'
        with open(self.path + '/{}.m3u8'.format(self.video_name), 'wb')as f:
            f.write(requests.get(m3u8_url, headers={'user-agent': 'Chrome/84.0.4147.135'}).content)

    def get_ts_urls(self):
        with open(self.path + '/{}.m3u8'.format(self.video_name), "r") as file:
            lines = file.readlines()
            for line in lines:
                if '.ts' in line:
                    self.urls.append(self.f_url + line.replace('\n', ''))

    def start_download(self):
        self.get_ts_urls()
        for url in tqdm(self.urls, desc="Downloading {} ".format(self.video_name)):
            movie = requests.get(url, headers={'user-agent': 'Chrome/84.0.4147.135'})
            with open(self.path + '/{}.flv'.format(self.video_name), 'ab')as f:
                f.write(movie.content)
        os.remove(self.path + '/{}.m3u8'.format(self.video_name))

url1 = input("Enter address: ")
m3u8_url(url1).get_m3u8()

Result:


Oh, take off~~

Bangumi (anime series)

Get video information

Since we are moving on to bangumi, we again need an example to illustrate:

Rent-A-Girlfriend (lsp's again ~ ~)

For this series, we can apply the experience from the single-video analysis directly and start from the page source:

Sure enough, the source contains JSON data similar to that of a single video. We continue by formatting it:

It turns out that the video is stored in the same way, with the same fields, as a single video. To reduce the final amount of code, we can handle both cases in one class:

class m3u8_url():
    def __init__(self, f_url, name=""):
        '''
        :param f_url: Link to current video
        :param name:  Bangumi episode name, which is empty by default
        '''
        self.url = f_url
        self.name = name

    def get_m3u8(self):
        global flag, qua, rel_path
        html = requests.get(self.url, headers=headers).text
        first_json = json.loads(re.findall('window.pageInfo = window.*? = (.*?)};', html)[0] + '}', strict=False)
        if self.name == '':
            # Single video: use the page title and save into the base path
            name = first_json['title'].strip().replace("|", '')
            rel_path = path
        else:
            # Bangumi episode: save into a folder named after the series
            name = self.name
            rel_path = path + first_json['bangumiTitle'].strip()
            os.makedirs(rel_path, exist_ok=True)
        video_info = json.loads(first_json['currentVideoInfo']['ksPlayJson'], strict=False)['adaptationSet'][0]['representation']
        Label = {}
        num = 0
        for quality in video_info:  # definition
            num += 1
            Label[num] = quality['qualityLabel']
        if flag:
            # First video of the run: ask once and remember the choice
            print(Label)
            choice = int(input("Please select clarity: "))
            flag = False
            qua = choice
            Download(name + '[{}]'.format(Label[choice]), video_info[choice - 1]['url'], rel_path).start_download()
        else:
            Download(name + '[{}]'.format(Label[qua]), video_info[qua - 1]['url'], rel_path).start_download()

Code comments:

  • flag: records whether the definition has already been chosen for this run.
  • qua: stores the chosen definition.
  • rel_path: the download location for a bangumi (a folder named after the series).
  • first_json = json.loads(re.findall('window.pageInfo = window.*? = (.*?)};', html)[0] + '}', strict=False): the regular expression for the video information is changed so that it matches both single-video pages and bangumi pages.
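
To see why one pattern can serve both page types, here is a small offline check. The two page strings are hypothetical one-line stand-ins, and the variable name window.bangumiData in particular is an assumption made for illustration; the pattern only requires some window.<name> between the two equals signs.

```python
import json
import re

PATTERN = re.compile(r"window\.pageInfo = window\.\w+ = (.*?)};", re.S)

# Hypothetical one-line stand-ins for the two kinds of page source (assumption).
video_page = 'window.pageInfo = window.videoInfo = {"title": "single"};'
bangumi_page = 'window.pageInfo = window.bangumiData = {"bangumiTitle": "series"};'


def page_info(html):
    """Extract the JSON object assigned to window.pageInfo on either page type."""
    found = PATTERN.findall(html)
    if not found:
        raise ValueError("pageInfo not found")
    # The pattern consumes the final '}' as part of '};', so it is added back.
    return json.loads(found[0] + "}", strict=False)


print(page_info(video_page))
print(page_info(bangumi_page))
```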

Now that we know how to download one episode, we cannot possibly enter the link of every episode by hand!!! A series with only a few episodes is fine, but what if you run into a series like this:

Would you do that by hand???

Bangumi episode links

Similarly, let's start from the page source:

Although all the information about the bangumi can be found in the source, not all of it is needed. Let's first see which information we must get:
When I click the second episode, the address in the browser address bar changes:

https://www.acfun.cn/bangumi/aa6002917_36188_1740687

We can easily find that:

  • https://www.acfun.cn/bangumi/aa6002917 : link to the home page of the bangumi.
  • 36188: a string of digits whose purpose I don't know; it seems to have no effect and is always the same:

For example:
Rent-A-Girlfriend, episode 2, Ex-Girlfriend and Girlfriend: https://www.acfun.cn/bangumi/aa6002917_36188_1740687
Rent-A-Girlfriend, episode 3, Huahai and Girlfriend: https://www.acfun.cn/bangumi/aa6002917_36188_1741409
Zhenhun Street, episode 2: https://www.acfun.cn/bangumi/aa5020166_36188_232386
...
Similarly, after clicking back to the first episode, you can see its link too, which can be written in the same form:
Zhenhun Street, episode 1: https://www.acfun.cn/bangumi/aa5020166_36188_232383
Rent-A-Girlfriend, episode 1, Rent-A-Girlfriend: https://www.acfun.cn/bangumi/aa6002917_36188_1739760
...

  • 1740687: the ID of each episode, stored in the "itemId" field in the source.

So we can write the code to get the video link of each episode:

class Pan_drama():
    def __init__(self, f_url):
        '''
        :param f_url: Link to video home page
        '''
        self.aa = len(str(f_url).split('/')[-1])
        if self.aa == 7:
            self.url = f_url
        elif self.aa > 7:
            self.url = str(f_url).split('_')[0]

    def get_info(self):
        video_info = {}
        html = requests.get(self.url, headers=headers).text
        all_item = json.loads(re.findall('window.bangumiList = (.*?);', html)[0])['items']
        for item in tqdm(all_item, desc="Preparing a play"):
            video_info[item['episodeName'] + '-' + item['title']] = self.url + '_36188_' + str(item['itemId'])
        for name in video_info.keys():
            m3u8_url(video_info[name],name).get_m3u8()

Code comments:

  • self.aa: for better adaptability, this is a simple way to handle the case where the link of a single episode is passed in, so that the whole series can still be downloaded.
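
The URL handling can also be written without checking the length of the last path segment. The sketch below is a minimal alternative under the article's observations (a fixed 36188 segment followed by the itemId); the helper names bangumi_home and episode_url are made up for this example.

```python
def bangumi_home(url):
    """Turn an episode link into the home link by dropping the '_36188_itemId' style suffix."""
    base, last = url.rsplit("/", 1)
    # split('_')[0] returns the whole segment when there is no underscore,
    # so a home link passes through unchanged.
    return base + "/" + last.split("_")[0]


def episode_url(home_url, item_id, fixed="36188"):
    """Rebuild an episode link; '36188' is the constant segment observed in the article."""
    return "{}_{}_{}".format(home_url, fixed, item_id)


home = bangumi_home("https://www.acfun.cn/bangumi/aa6002917_36188_1740687")
print(home)
print(episode_url(home, 1741409))
```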

Source code and effect

Full source code:

import os
import re
import json
import requests
from tqdm import tqdm

path = './'

headers = {
    'referer': 'https://www.acfun.cn/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83'
}

flag = True
qua = 0


class m3u8_url():
    def __init__(self, f_url, name=""):
        '''
        :param f_url: Link to current video
        :param name:  Bangumi episode name, which is empty by default
        '''
        self.url = f_url
        self.name = name

    def get_m3u8(self):
        global flag, qua, rel_path
        html = requests.get(self.url, headers=headers).text
        first_json = json.loads(re.findall('window.pageInfo = window.*? = (.*?)};', html)[0] + '}', strict=False)
        if self.name == '':
            name = first_json['title'].strip().replace("|", '')
            rel_path=path
        else:
            name = self.name
            rel_path = path + first_json['bangumiTitle'].strip()
            if os.path.exists(rel_path):
                pass
            else:
                os.makedirs(rel_path)
        video_info = json.loads(first_json['currentVideoInfo']['ksPlayJson'], strict=False)['adaptationSet'][0][
            'representation']
        Label = {}
        num = 0
        for quality in video_info:  # definition
            num += 1
            Label[num] = quality['qualityLabel']
        if flag:
            print(Label)
            choice = int(input("Please select clarity: "))
            flag = False
            qua = choice
            Download(name + '[{}]'.format(Label[choice]), video_info[choice - 1]['url'], rel_path).start_download()
        else:
            Download(name + '[{}]'.format(Label[qua]), video_info[qua - 1]['url'], rel_path).start_download()


class Pan_drama():
    def __init__(self, f_url):
        '''
        :param f_url: Link to video home page
        '''
        self.aa = len(str(f_url).split('/')[-1])
        if self.aa == 7:
            self.url = f_url
        elif self.aa > 7:
            self.url = str(f_url).split('_')[0]

    def get_info(self):
        video_info = {}
        html = requests.get(self.url, headers=headers).text
        all_item = json.loads(re.findall('window.bangumiList = (.*?);', html)[0])['items']
        for item in tqdm(all_item, desc="Preparing a play"):
            video_info[item['episodeName'] + '-' + item['title']] = self.url + '_36188_' + str(item['itemId'])
        for name in video_info.keys():
            m3u8_url(video_info[name],name).get_m3u8()


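class Download():
    def __init__(self, name, m3u8_url, path):
        '''
        :param name: Video name
        :param m3u8_url: Video m3u8 file address
        :param path: Download directory
        '''
        # Per-instance list: a class-level list would be shared by every episode,
        # so each new episode would download the previous episodes' segments again
        self.urls = []
        self.video_name = name
        self.path = path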
        self.f_url = str(m3u8_url).split('hls/')[0] + 'hls/'
        with open(self.path + '/{}.m3u8'.format(self.video_name), 'wb')as f:
            f.write(requests.get(m3u8_url, headers={'user-agent': 'Chrome/84.0.4147.135'}).content)

    def get_ts_urls(self):
        with open(self.path + '/{}.m3u8'.format(self.video_name), "r") as file:
            lines = file.readlines()
            for line in lines:
                if '.ts' in line:
                    self.urls.append(self.f_url + line.replace('\n', ''))

    def start_download(self):
        self.get_ts_urls()
        for url in tqdm(self.urls, desc="Downloading {} ".format(self.video_name)):
            movie = requests.get(url, headers={'user-agent': 'Chrome/84.0.4147.135'})
            with open(self.path + '/{}.flv'.format(self.video_name), 'ab')as f:
                f.write(movie.content)
        os.remove(self.path + '/{}.m3u8'.format(self.video_name))


url1 = input("Enter address: ")
if url1.split('/')[3] == 'v':
    m3u8_url(url1).get_m3u8()
elif url1.split('/')[3] == 'bangumi':
    Pan_drama(url1).get_info()

Effect example:

Tags: Python

Posted by swamp on Wed, 11 May 2022 05:22:58 +0300