Python crawler: cracking js encryption to scrape NetEase Cloud Music comments and generate a word cloud

js cracking process

  • preface
  • Skill points
  • Interface Overview
  • Static web page, dynamic web page
  • Page parsing
  • step1: find parameters, step2: analyze js function, step3: analyze parameters, step4: verify, step5: convert to Python code
  • Write crawler

preface


The biggest obstacle for web crawlers is encryption of all kinds: login verification codes, js obfuscation, js parameter encryption, and so on. I had known about js encryption before, but had never studied it in depth. In this exercise I worked through the encryption method used by NetEase Cloud Music.

This post uses NetEase Cloud Music's comment encryption as an example and shares what I learned along the way.

If you have questions or something is unclear, you can follow my WeChat official account and contact me.

Skill points

  • Front end: js knowledge (fairly important); Chrome DevTools debugging, packet capture and breakpoint debugging (required); familiarity with the common js encryption functions (good to have).
  • Python: basic requests, plus the Crypto.Cipher encryption and decryption module.
  • Others: Postman (for simulating requests), and good reasoning and analysis skills, since the encryption algorithm is a little messy. The other key skill is porting the js encryption code to Python.

Interface Overview

Static web page

Ordinary pages whose url changes as the page content changes still exist on NetEase Cloud Music. For those you only need to fetch the page and parse it.

Dynamic web page

However, as front-end/back-end separation became popular and the benefits of separating data from presentation became obvious, more and more data is loaded with ajax. NetEase Cloud Music's comments are no exception.
In the early days, many websites did little to protect their interfaces, so their data was easy to obtain. Many such interfaces still exist today, and crawling them takes almost no effort.


However, as front-end technology developed, interfaces became harder and harder to call. Take the NetEase Cloud Music comment interface: its parameters are very confusing.


What is this string of characters? Many people give up when they see data like this. Let me lift the veil for you.

Page parsing

step1: Find Parameters

You can see that the request has two parameters, params and encSecKey, and both are encrypted. We need to find where they come from. Press F12, open the Sources panel and search for encSecKey.

Searching for encSecKey in the js leads to the place where it is assigned. Breakpoint debugging confirms that this is where the final parameter values are produced.
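
To make the shape of the request concrete, here is a minimal sketch of how two such fields travel in a form-encoded POST body. The field values are placeholders invented for illustration; the real ones are long encrypted strings produced by the site's js.

```python
from urllib.parse import urlencode, parse_qs

# Placeholder values (assumptions for illustration, not real encrypted data).
form = {
    "params": "BASE64+TEXT/WITH=SYMBOLS",
    "encSecKey": "00ff" * 4,
}

# Form encoding percent-escapes '+', '/' and '=' so base64 text survives transport.
body = urlencode(form)
print(body)

# Decoding the body gives back exactly the two fields that were sent.
decoded = {name: values[0] for name, values in parse_qs(body).items()}
print(decoded == form)
```

The escaping matters because params is base64 text: an unescaped '+' would be read as a space on the server side.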

step2: analyze js function

This js file has more than 40,000 lines. How do we find the useful part in that much js and work out what it does?


This takes some abstraction and reverse thinking. Let's start analyzing.

 var bYc7V = window.asrsea(JSON.stringify(i3x), bkY2x(["shed tears", "strong"]), bkY2x(VM8E.md), bkY2x(["love", "girl", "terrified", "laugh"]));
 e3x.data = k4o.cz4D({
 params: bYc7V.encText,
 encSecKey: bYc7V.encSecKey
 })

The code above is the source. For now, ignore what JSON.stringify(i3x) and the other arguments are, and first find out what window.asrsea is. Not far above it you will find:


It is the d function, the root of all the data and methods. Its four parameters d, e, f and g are the arguments we set aside a moment ago.
From this function we can see that encText is produced by calling b() twice, and encSecKey is produced by calling c(). Note that the i parameter comes from a(16). Here are these functions:

 // a: build a random string of length a from letters and digits
 function a(a) {
     var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
     for (d = 0; a > d; d += 1)
         e = Math.random() * b.length,
         e = Math.floor(e),
         c += b.charAt(e);
     return c
 }
 // b: AES-CBC encrypt text a with key b and a fixed IV
 function b(a, b) {
     var c = CryptoJS.enc.Utf8.parse(b)
       , d = CryptoJS.enc.Utf8.parse("0102030405060708")
       , e = CryptoJS.enc.Utf8.parse(a)
       , f = CryptoJS.AES.encrypt(e, c, {
         iv: d,
         mode: CryptoJS.mode.CBC
     });
     return f.toString()
 }
 // c: RSA encrypt text a with exponent b and modulus c
 function c(a, b, c) {
     var d, e;
     return setMaxDigits(131),
     d = new RSAKeyPair(b,"",c),
     e = encryptedString(d, a)
 }
 // d: encText is AES applied twice, encSecKey is the RSA-encrypted random key i
 function d(d, e, f, g) {
     var h = {}
       , i = a(16);
     return h.encText = b(d, g),
     h.encText = b(h.encText, i),
     h.encSecKey = c(i, e, f),
     h
 }

a(16) generates a random 16-character string, so we do not need to care about its specific value. b is AES encryption in CBC mode, which gives us the rule for generating encText: AES-CBC encryption applied twice. The IV is fixed at 0102030405060708, and the two passes use different keys. c is RSA encryption with three parameters. The overall flow of the algorithm is now roughly clear.
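
As a rough illustration of the two helper pieces, the sketch below mirrors a() and the idea behind c() in Python. The names random_key and raw_rsa_hex are made up for this example. Two things are assumptions rather than the site's behavior: secrets is swapped in for Math.random, and the RSA part is plain textbook RSA with big-endian byte order, because the internals of encryptedString() are not shown here. It will not reproduce a real encSecKey.

```python
import secrets
import string

# Same 62 characters, in the same order, as the alphabet in the js function a().
ALPHABET = string.ascii_letters + string.digits


def random_key(length=16):
    """Python counterpart of js a(length): a random string of letters and digits."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def raw_rsa_hex(text, exponent_hex, modulus_hex):
    """Textbook RSA without padding: text -> integer -> m^e mod n -> hex string.

    Assumption: big-endian byte order, chosen only for illustration.
    """
    m = int.from_bytes(text.encode("utf-8"), "big")
    e = int(exponent_hex, 16)
    n = int(modulus_hex, 16)
    if m >= n:
        raise ValueError("message is too large for this modulus")
    return format(pow(m, e, n), "x")


key = random_key()
print(key, len(key))

# Toy key pair (n = 61 * 53 = 3233 = 0xca1, e = 17 = 0x11, d = 2753); far too small for real use.
cipher_hex = raw_rsa_hex("A", "11", "ca1")
print(cipher_hex)
```

Because encSecKey depends only on i and the fixed RSA parameters, any single pair of i and encSecKey captured from the browser stays valid, which is why the crawler later hard-codes one pair instead of porting the RSA code.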

Stop here for now. Instead of analyzing the functions any further, let's analyze the data.

step3: analyze parameters

Back to the call var bYc7V = window.asrsea(JSON.stringify(i3x), bkY2x(["shed tears", "strong"]), bkY2x(VM8E.md), bkY2x(["love", "girl", "terrified", "laugh"])). Intuitively, some of these arguments probably have nothing to do with our core parameters and at most depend on something like a timestamp.

Find the source of bkY2x:


There is no need to trace any further. For functions like this you can copy the code into VS Code and follow the definitions to find where they come from. The analysis is not laid out in detail here; just set a breakpoint and see what the function does.


In fact, you will find that the last three arguments are fixed (they do not depend on user interaction).
What we really care about is the first argument.


The first argument turns out to be about what you would expect, and it is the only one tied to our request. offset is the page number * 20, and the songid in R_SO_4_songid is the id of the current song. At this point you can record a pair of i and encSecKey. As analyzed above, i is generated randomly and encSecKey does not depend on our core parameters, only on i, so one recorded pair can be reused: i as a key for the AES encryption, and encSecKey as a field of the POST request.
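
A small sketch of how that first argument could be assembled in Python. The field names follow the plaintext used by the crawler later in this post; the helper name build_comment_payload and the page_size parameter are hypothetical.

```python
import json


def build_comment_payload(song_id, page, page_size=20):
    """Build the plaintext JSON that is encrypted into the params field."""
    payload = {
        "rid": f"R_SO_4_{song_id}",
        "offset": str(page * page_size),  # page 0 -> "0", page 1 -> "20", ...
        "total": "false",
        "limit": str(page_size),
        "csrf_token": "",
    }
    # Compact separators avoid extra spaces, similar to JSON.stringify output.
    return json.dumps(payload, separators=(",", ":"))


print(build_comment_payload("346576", 2))
```

Using json.dumps instead of string concatenation keeps the quoting correct if a value ever contains a quote or backslash.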

You are probably getting excited now, because the truth is about to surface.

step4: Verification

This step is also very important, because the encryption code is bundled inside the site's own js.


Did NetEase modify it in any way? I downloaded the original js libraries to test. The results were consistent, so there is no need to change the encryption algorithm code.

The architecture diagram is

step5: convert to python code

You need to reimplement the AES CBC-mode encryption in Python, then test it to confirm that it produces the same output as the js. The results are consistent. Nice.

Write crawler

Now let's write the crawler. First test the parameters you need with Postman.


No problem, so write the crawler. Pick a song by your favorite singer, enter its id, and generate a word cloud of its comments. The example here uses the song Glorious Years.

import requests
import urllib.parse
import base64
from wordcloud import WordCloud
import jieba.analyse
import matplotlib.pyplot as plt
from Crypto.Cipher import AES  # e.g. from the pycryptodome package
header={'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36',
        #'Postman-Token':'4cbfd1e6-63bf-4136-a041-e2678695b419',
        "origin":'https://music.163.com',
        #'referer':'https://music.163.com/song?id=1372035522',
        #'accept-encoding':'gzip,deflate,br',
        'Accept':'*/*',
        'Host':'music.163.com',
        # Content-Length is left out: requests computes it from the request body
        'Cache-Control':'no-cache',
        'content-type': 'application/x-www-form-urlencoded',
        'Connection':'keep-alive',
        #'Cookie':'iuqxldmzr_=32; _ntes_nnid=a6f29f40998c88c693bc910331bd6bea,1558011234325; _ntes_nuid=a6f29f40998c88c693bc910331bd6bea; _ga=GA1.2.2120707788.1559308501; WM_TID=pV2C%2BjTrRwBBAAERUVJojniTwk8%2B8Zta; JSESSIONID-WYYY=nvf%2BggodQRfcT%2BTvBRmANqMrsDeQCxRvqwFsxDr3eJvNNWhGYFhfCXKFkfAfOdbHhpCsMzT39mAeJ7ZamBQZbiwwtnSZD%5CPWRqKxD9t6dGKD3bTVjomjgB39DB07RNIWI32bYKa2H4fg1qQgqI%2FR%2B%2Br%2BZXJvgFg1Vh%2FA2XRj9S4p0EMu%3A1560927288799; WM_NI=DthwcEQf5Ew2NbTIZmSNhSnm%2F8VWsg5RxhkYogvs2luEwZ6m5UhdzbHYPIr654ZBWKV4o22%2BEwb9BvdLS%2BFOmOAEUG%2B8xd8az4CX%2FiAL%2BZkz3syA0onCPkhQwCtL4pkUcjg%3D; WM_NIKE=9ca17ae2e6ffcda170e2e6eed2d650989c9cd1dc4bb6b88eb2c84e979f9aaff773afb6fb83d950bcb19ecce92af0fea7c3b92a88aca898e24f93bafba6f63a8ebe9caad9679192a8b4ed67ede89ab8f26df78eb889ea53adb9ba94b168b79bb9bbb567f78ba885f96a8c87a0aaf13ef7ec96a3d64196eca1d3b12187a9aedac17ea8949dccc545af918fa6d84de9e8b885bb6bbaec8db9ae638394e5bbea72f1adb7a2b365ae9da08ceb5bb59dbcadb77ca98bad8be637e2a3'
        }

def pkcs7padding(text):
    """
    Pad the plaintext with PKCS7.
    AES encrypts whole 16-byte blocks, so the UTF-8 byte length of the
    plaintext has to be padded up to a multiple of 16.
    :param text: content to be encrypted (plaintext str)
    :return: the padded plaintext (str)
    """
    bs = AES.block_size  # 16
    # Pad by byte length, not character count: in UTF-8 a Chinese character
    # usually takes 3 bytes, while an ASCII character takes 1 byte.
    bytes_length = len(bytes(text, encoding='utf-8'))
    padding = bs - bytes_length % bs
    # tips: chr(padding) is the PKCS7 convention; some other implementations pad with '\0'
    padding_text = chr(padding) * padding
    return text + padding_text
def encrypt(key, content):
    """
    AES encryption
    mode: CBC
    iv: fixed at '0102030405060708'
    padding: PKCS7
    :param key: secret key (16-character str)
    :param content: content to encrypt (str)
    :return: base64-encoded ciphertext (str)
    """
    key_bytes = bytes(key, encoding='utf-8')
    iv = bytes('0102030405060708', encoding='utf-8')
    cipher = AES.new(key_bytes, AES.MODE_CBC, iv)
    # Pad the plaintext to a multiple of 16 bytes
    content_padding = pkcs7padding(content)
    # Encrypt
    encrypt_bytes = cipher.encrypt(bytes(content_padding, encoding='utf-8'))
    # Base64-encode the ciphertext, like f.toString() in the js
    result = str(base64.b64encode(encrypt_bytes), encoding='utf-8')
    return result
def getcomment(songid,page):
    url="https://music.163.com/weapi/v1/resource/comments/R_SO_4_"+songid+"?csrf_token="
    print(url)
    formdata = {
        "params": "",
        "encSecKey": "c81160c64a08feb6cfed91c1619d5bffd05dd278b685c94a748689edf035ee0436b66aa7019927ce0fedd26aee9a22cdc6743e58a120f9db0126ebb2e61dae3f7ee21088eb747f829bceed9a5bbb9ee7a2eecf1a358feac431acaab17c95b8491a6a955f7c17a02a3e7886390c2cb3b981f4ccbd5163a566d27ace95db073401",
    }

    aes_key = '0CoJUm6Qyw8W8jud'  # fixed key for the first pass; it never changes
    print('aes_key:' + aes_key)
    # Plaintext of the first argument: song id and paging parameters
    source_en = '{"rid":"R_SO_4_'+songid+'","offset":"'+str(page*20)+'","total":"false","limit":"20","csrf_token":""}'

    # offset = page * 20
    print(source_en)
    encrypt_en = encrypt(aes_key, source_en)  # first encryption, with the fixed key
    print(encrypt_en)
    aes_key = '3Unu7SzdXGctW1vA'  # the recorded random key i; it is paired with the encSecKey above
    encrypt_en = encrypt(aes_key, str(encrypt_en))  # second encryption, with i
    print(encrypt_en)
    formdata['params'] = encrypt_en
    print(formdata['params'])
    formdata = urllib.parse.urlencode(formdata).encode('utf-8')
    print(formdata)
    req = requests.post(url=url, data=formdata, headers=header)
    return req.json()
if __name__ == '__main__':
    songid='346576'
    text=''
    for page in range(10):  # first 10 pages, 20 comments per page
        comment=getcomment(songid,page)
        comment=comment['comments']
        for va in comment:
            print(va['content'])
            text+=va['content']
    ags = jieba.analyse.extract_tags(text, topK=50)  # jieba word segmentation; extract the top 50 keywords
    print(ags)
    text = " ".join(ags)
    backgroud_Image = plt.imread('tt.jpg')  # mask image, only needed for a custom-shaped word cloud
    wc = WordCloud(background_color="white",
                   width=1200, height=900,
                   mask=backgroud_Image,  # set the mask (background) image
                   #min_font_size=50,
                   font_path="simhei.ttf",  # a font with Chinese glyphs (SimHei here) is required, otherwise words render as small boxes
                   max_font_size=200,  # maximum font size
                   random_state=50,  # random seed, so the layout and colors are reproducible
                   )
    my_wordcloud = wc.generate(text)
    plt.imshow(my_wordcloud)
    plt.axis("off")
    plt.show()  # the script continues once the window is closed
    file = 'image/' + str("aita") + '.png'  # the image/ directory must already exist
    wc.to_file(file)
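
The crawler above indexes comment['comments'] directly, which raises a KeyError if a response comes back without that key. A hedged sketch of a more defensive extraction step follows; the helper name extract_contents and both sample dictionaries are invented for illustration and only mimic the fields the crawler reads.

```python
def extract_contents(response_json):
    """Return the comment texts from one decoded response, or [] if there are none."""
    comments = response_json.get("comments") or []
    return [c["content"] for c in comments if c.get("content")]


# Hand-made samples (assumptions): only the 'comments' and 'content' fields are modeled.
sample_ok = {"comments": [{"content": "好听"}, {"content": ""}, {"content": "classic"}]}
sample_error = {"msg": "error"}  # hypothetical body without a 'comments' key

print(extract_contents(sample_ok))
print(extract_contents(sample_error))
```

In the main loop, text += "".join(extract_contents(getcomment(songid, page))) would then skip empty or rejected pages instead of crashing.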

 

Tags: Python Programming crawler Python crawler

Posted by ds111 on Tue, 10 May 2022 10:10:57 +0300