js cracking process
- preface
- Skill points
- Interface Overview
  - Static web page
  - Dynamic web page
- Page parsing
  - step1: Find Parameters
  - step2: analyze js function
  - step3: analyze parameters
  - step4: verify
  - step5: convert to python code
- Write crawler
preface
The biggest obstacle for web crawlers is encryption of all kinds: login CAPTCHAs, js obfuscation, js parameter encryption, and so on. I had known about js encryption before but had never studied it in depth. In this exercise I worked through the encryption scheme used by NetEase Cloud Music.
This post analyzes the encryption of NetEase Cloud Music comments as a worked example and shares what I learned along the way.
If you have questions or something is unclear, you can follow my WeChat official account and contact me.
Skill points
- Front end: js knowledge (most important), debugging in the Chrome developer tools, packet capture, and breakpoint debugging (required), plus a general understanding of the common js encryption functions.
- python: basic use of requests, and the Crypto.Cipher encryption and decryption module.
- Others: Postman (for simulating requests), and good reasoning and analysis skills, since the encryption algorithm is a little messy. The last piece is porting the js encryption code to python.
Interface Overview
Static web page
On an ordinary page the url changes as the page changes, and NetEase Cloud Music still has pages like this. You only need to fetch the page and parse it.
Dynamic web page
However, as front-end/back-end separation has become popular and the benefits of separating data from presentation have become obvious, more and more data is rendered with ajax. The comments on NetEase Cloud Music are loaded this way.
In the past, many websites did little to protect their interfaces, so results were easy to obtain, and many such interfaces still exist today. Crawling this kind of website takes almost no effort.
As front-end technology has developed, however, interfaces have become harder and harder to call. Take the NetEase Cloud Music comment interface: its parameters look like gibberish.
What is this long string of characters? Many people give up when they see data like this. Let me lift the veil for you.
Page parsing
step1: Find Parameters
You can see that the request has two parameters, params and encSecKey, and both are encrypted. We need to find where they come from. Press F12 to open the sources and search for encSecKey.
Searching for encSecKey in the js leads to the place where it is set. Breakpoint debugging confirms that this is where the final parameters are produced.
step2: analyze js function
This js file has more than 40,000 lines. How do we find the useful information in 40,000 lines and get a clear picture of what is going on?
This calls for abstraction and working backwards. Let's start analyzing.
var bYc7V = window.asrsea(
    JSON.stringify(i3x),
    bkY2x(["shed tears", "strong"]),
    bkY2x(VM8E.md),
    bkY2x(["love", "girl", "terrified", "laugh"])
);
e3x.data = k4o.cz4D({
    params: bYc7V.encText,
    encSecKey: bYc7V.encSecKey
})
The code above is the source of both parameters. Ignore for now what JSON.stringify(i3x) and the other arguments are, and first find out what window.asrsea is. Not far above, you will find where it is defined.
window.asrsea is the d function, which is the root of all the data and methods. Its four parameters d, e, f and g are the arguments we just set aside.
From this function we can see that encText is the result of calling b() twice, and encSecKey is the result of calling c(). Note that the parameter i comes from a(16). Here are the four functions:
function a(a) {
    var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
    for (d = 0; a > d; d += 1)
        e = Math.random() * b.length,
        e = Math.floor(e),
        c += b.charAt(e);
    return c
}
function b(a, b) {
    var c = CryptoJS.enc.Utf8.parse(b)
      , d = CryptoJS.enc.Utf8.parse("0102030405060708")
      , e = CryptoJS.enc.Utf8.parse(a)
      , f = CryptoJS.AES.encrypt(e, c, {
        iv: d,
        mode: CryptoJS.mode.CBC
    });
    return f.toString()
}
function c(a, b, c) {
    var d, e;
    return setMaxDigits(131),
    d = new RSAKeyPair(b,"",c),
    e = encryptedString(d, a)
}
function d(d, e, f, g) {
    var h = {}
      , i = a(16);
    return h.encText = b(d, g),
    h.encText = b(h.encText, i),
    h.encSecKey = c(i, e, f),
    h
}
a(16) returns a random 16-character string, so we don't need to care about its value. b is AES encryption in CBC mode, so the rule for generating encText is clear: AES-CBC encryption applied twice. The IV is fixed at 0102030405060708; only the keys differ (g for the first pass, i for the second). c is RSA encryption and takes three parameters. At this point the overall flow of the algorithm is roughly understood.
Let's stop here. We will not go further into the functions for now and turn to the data instead.
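The structure of d() described above can be mirrored in Python. The sketch below is only an illustration of the call order, not NetEase's code: random_key plays the role of a(16) (using the secrets module instead of Math.random), and build_request_fields follows d() step by step. The arguments aes_encrypt and rsa_encrypt are hypothetical callables that you would supply yourself; they are assumptions made so the flow can be shown without any third-party dependency.

```python
import secrets
import string

# Same 62 characters, in the same order, as the alphabet in a().
ALPHABET = string.ascii_letters + string.digits


def random_key(length=16):
    """Python counterpart of a(length): a random alphanumeric string."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def build_request_fields(text, e, f, g, aes_encrypt, rsa_encrypt, i=None):
    """Mirror d(d, e, f, g): two AES passes, then RSA over the random key.

    aes_encrypt(plaintext, key) and rsa_encrypt(message, e, f) are
    hypothetical callables passed in by the caller.
    """
    i = i or random_key(16)
    enc_text = aes_encrypt(text, g)       # first pass, fixed key g
    enc_text = aes_encrypt(enc_text, i)   # second pass, random key i
    enc_sec_key = rsa_encrypt(i, e, f)    # depends only on i, e and f
    return {"params": enc_text, "encSecKey": enc_sec_key}
```

Passing i explicitly makes the output reproducible, which is handy when comparing against values captured in the browser.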
step3: analyze parameters
Back to the call var bYc7V = window.asrsea(JSON.stringify(i3x), bkY2x(["shed tears", "strong"]), bkY2x(VM8E.md), bkY2x(["love", "girl", "terrified", "laugh"])). Intuitively, some of these arguments have nothing to do with our core parameters; at most they are related to a timestamp.
Look for the source of bkY2x.
You could keep tracing such functions, copying the code into vscode and following each one back to its definition, but it isn't necessary. Rather than a tedious static analysis, just set a breakpoint and watch what values it produces.
In fact, you will find that the last three parameters are fixed (they do not depend on user interaction).
The one you really care about is the first parameter.
The first parameter turns out to be almost what you would expect, and it is the only one related to our request: offset is page * 20, and in R_SO_4_songid, songid is the id of the current song. At this point you can also record your i and encSecKey as a pair. As analyzed above, i is randomly generated, and encSecKey does not depend on our core parameters, only on i, so one recorded pair can be reused: i as the key for the second AES encryption, and encSecKey in the post request.
Aren't you excited now? The truth is about to surface.
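The first parameter can be assembled as in the sketch below. The field names and values (rid, offset, total, limit, csrf_token) are the ones used in the crawler at the end of this post; the helper name build_payload and the page_size argument are my own additions for illustration.

```python
import json


def build_payload(song_id, page, page_size=20):
    """Return the JSON string passed as the first argument of window.asrsea."""
    payload = {
        "rid": "R_SO_4_" + str(song_id),
        "offset": str(page * page_size),   # offset = page * 20
        "total": "false",
        "limit": str(page_size),
        "csrf_token": "",
    }
    # Compact separators mimic the output of JSON.stringify (no spaces).
    return json.dumps(payload, separators=(",", ":"))
```

For example, build_payload("346576", 2) yields a payload whose offset is "40", that is, the third page of comments.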
step4: Verification
This step is also very important: could NetEase have tampered with the algorithms inside its own js? Download the original js and test it. The results turn out to be identical, so there is no need to change the code of the encryption algorithm.
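The overall flow is: the JSON parameters are AES-encrypted with the fixed key, then AES-encrypted again with the random key i to give params, while i is RSA-encrypted to give encSecKey.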
step5: convert to python code
You need to reproduce the AES encryption in CBC mode with Python and test that it gives the same result as the js. The results are consistent, nice.
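One detail that is easy to get wrong when porting is padding. CryptoJS applies PKCS#7 padding by default, while the AES cipher in Crypto.Cipher expects input that is already a multiple of 16 bytes. A minimal standard-library sketch of PKCS#7 on bytes follows; the function names are my own, and padding is computed on the encoded bytes rather than on the character count, which matters for Chinese text.

```python
BLOCK_SIZE = 16  # AES block size in bytes


def pkcs7_pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Append N bytes of value N so the length is a multiple of block_size."""
    pad_len = block_size - len(data) % block_size
    return data + bytes([pad_len]) * pad_len


def pkcs7_unpad(data: bytes) -> bytes:
    """Strip PKCS#7 padding, raising ValueError if it is malformed."""
    if not data:
        raise ValueError("empty input")
    pad_len = data[-1]
    if pad_len < 1 or pad_len > len(data):
        raise ValueError("invalid PKCS#7 padding")
    if data[-pad_len:] != bytes([pad_len]) * pad_len:
        raise ValueError("invalid PKCS#7 padding")
    return data[:-pad_len]
```

Note that input whose length is already a multiple of 16 still receives a full extra block of padding, so the padded output is always longer than the input.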
Write crawler
Let's start writing the crawler. First test the parameters you need with Postman.
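When replaying the request, both fields go into an application/x-www-form-urlencoded body. Because params is Base64 text, it can contain +, / and =, which have to be percent-encoded. A minimal sketch using only the standard library; the helper name encode_form and the sample values in the usage note are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode


def encode_form(params, enc_sec_key):
    """Build the form-encoded request body from the two encrypted fields."""
    return urlencode({"params": params, "encSecKey": enc_sec_key})


def decode_form(body):
    """Inverse of encode_form, useful for checking a captured request body."""
    fields = parse_qs(body, keep_blank_values=True)
    return fields["params"][0], fields["encSecKey"][0]
```

For instance, encode_form("a+b/c=", "ab12") escapes the three Base64 special characters, and decode_form recovers the original values from the encoded body.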
No problem, so write the crawler. Pick a favorite song, enter its id, and generate a word cloud from its comments. Here is one for the song Glorious Years!
import base64
import os

import jieba.analyse
import matplotlib.pyplot as plt
import requests
from Crypto.Cipher import AES
from wordcloud import WordCloud

# Content-Length is left out on purpose: requests computes it from the body.
header = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36',
    'origin': 'https://music.163.com',
    'Accept': '*/*',
    'Host': 'music.163.com',
    'Cache-Control': 'no-cache',
    'content-type': 'application/x-www-form-urlencoded',
    'Connection': 'keep-alive',
}


def pkcs7padding(text):
    """
    Pad the plaintext with PKCS7.
    AES encrypts a byte array whose length must be a multiple of 16,
    so the plaintext has to be padded first.
    :param text: content to be encrypted (plaintext)
    :return: padded plaintext
    """
    bs = AES.block_size  # 16
    # Padding is based on the UTF-8 byte length: a Chinese character
    # takes 3 bytes, an ASCII character takes 1 byte.
    bytes_length = len(bytes(text, encoding='utf-8'))
    padding = bs - bytes_length % bs
    # chr(padding) encodes to a single byte because padding <= 16.
    # Other implementations may pad with '\0' instead, depending on the convention.
    return text + chr(padding) * padding


def encrypt(key, content):
    """
    AES encryption in CBC mode with a fixed iv and PKCS7 padding.
    :param key: secret key
    :param content: content to be encrypted
    :return: Base64-encoded ciphertext
    """
    key_bytes = bytes(key, encoding='utf-8')
    iv = bytes('0102030405060708', encoding='utf-8')
    cipher = AES.new(key_bytes, AES.MODE_CBC, iv)
    content_padding = pkcs7padding(content)
    encrypt_bytes = cipher.encrypt(bytes(content_padding, encoding='utf-8'))
    return str(base64.b64encode(encrypt_bytes), encoding='utf-8')


def getcomment(songid, page):
    url = "https://music.163.com/weapi/v1/resource/comments/R_SO_4_" + songid + "?csrf_token="
    formdata = {
        "params": "",
        # Recorded together with the random key used for the second encryption below.
        "encSecKey": "c81160c64a08feb6cfed91c1619d5bffd05dd278b685c94a748689edf035ee0436b66aa7019927ce0fedd26aee9a22cdc6743e58a120f9db0126ebb2e61dae3f7ee21088eb747f829bceed9a5bbb9ee7a2eecf1a358feac431acaab17c95b8491a6a955f7c17a02a3e7886390c2cb3b981f4ccbd5163a566d27ace95db073401",
    }
    aes_key = '0CoJUm6Qyw8W8jud'  # fixed key, never changes
    source_en = '{"rid":"R_SO_4_' + songid + '","offset":"' + str(page * 20) + '","total":"false","limit":"20","csrf_token":""}'
    encrypt_en = encrypt(aes_key, source_en)  # first encryption
    aes_key = '3Unu7SzdXGctW1vA'  # recorded random key, paired with encSecKey
    encrypt_en = encrypt(aes_key, encrypt_en)  # second encryption
    formdata['params'] = encrypt_en
    # requests form-encodes the dict, matching the content-type header.
    req = requests.post(url=url, data=formdata, headers=header)
    return req.json()


if __name__ == '__main__':
    songid = '346576'
    text = ''
    for page in range(10):
        comment = getcomment(songid, page)
        for va in comment['comments']:
            print(va['content'])
            text += va['content']
    ags = jieba.analyse.extract_tags(text, topK=50)  # jieba keyword extraction, top 50
    print(ags)
    text = " ".join(ags)
    background_image = plt.imread('tt.jpg')  # mask image for a custom-shaped word cloud
    wc = WordCloud(
        background_color="white",
        width=1200,
        height=900,
        mask=background_image,
        # Pitfall: a font with Chinese glyphs must be set,
        # otherwise a bunch of small boxes is displayed.
        font_path="simhei.ttf",
        max_font_size=200,
        random_state=50,  # seed, so the layout is reproducible
    )
    my_wordcloud = wc.generate(text)
    plt.imshow(my_wordcloud)
    plt.axis("off")
    plt.show()  # the window must be closed before the script continues
    os.makedirs('image', exist_ok=True)
    wc.to_file('image/aita.png')
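The main loop reads the comments key of each response directly, which raises KeyError if the server answers with an error body instead of a comment page. A slightly more defensive helper is sketched below. It assumes only the two fields this post already relies on, comments and content; the function name extract_contents is hypothetical.

```python
def extract_contents(response_json):
    """Collect the text of every comment in one response page.

    Returns an empty list when the response has no usable comments,
    instead of raising KeyError.
    """
    comments = response_json.get("comments") or []
    return [c["content"] for c in comments if c.get("content")]
```

With this helper the loop body could become text += "".join(extract_contents(comment)), and a page that fails to load simply contributes nothing to the word cloud.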