Scraping the Qidian novel monthly ticket ranking list

Scouting the page

  • First, open the Qidian monthly ticket ranking page at https://www.qidian.com/rank/yuepiao . Once the page is open, we need to decide what we want to extract. Here we take the novel title, author, genre, status, introduction, latest chapter, update time, and number of monthly tickets.
  • Knowing what to extract, right-click and choose Inspect (or press F12) to open the developer tools:
  • Click the element-selection button and point it at a novel title to locate it:
  • All the information we need can be found there.

Getting the page source

  • First, call requests.get() to fetch the page source and write it to a file:
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
# Fetch the ranking page and save its source to a file for inspection
response = requests.get('https://www.qidian.com/rank/yuepiao', headers=headers)
f = open("M:/a.txt", 'w')
f.write(response.text)
f.close()
  • Press Ctrl+F to search the saved text and confirm that all the information we want is present in the page source.
  • Wrap the page-fetching code in a function:
def getHtml(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None
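  • A quick usage check (just a minimal sketch, using the ranking page URL from above):
html = getHtml('https://www.qidian.com/rank/yuepiao')
if html:
    print(html[:200])  # peek at the beginning of the page source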

Extracting information with XPath

  • Looking at the tags for each piece of information, we find that one page holds 20 novels and that each novel's information sits under an li tag. We then analyze the position of each element and compare it across novels to pin down which tags hold the useful information.

  • Having identified the useful tags, we can use XPath to extract the required information:

from lxml import etree

html = getHtml('https://www.qidian.com/rank/yuepiao')
html = etree.HTML(html)        # parse the page source into an element tree
html = etree.tostring(html)    # serialize and re-parse to normalize the markup
html = etree.fromstring(html)
# Extract book title
name = html.xpath('//li//div[@class="book-mid-info"]//h4//a[@data-eid="qd_C40"]//text()')
print(len(name), name)
# Extract author
author = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C41"]//text()')
print(len(author), author)
# Extract novel types
types = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C42"]//text()')
print(len(types), types)
# Extract the current state of the novel
status = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//span/text()')
print(len(status), status)
# Extract the novel introduction
intro = html.xpath('//li//div[@class="book-mid-info"]//p[@class="intro"]//text()')
intro = [i.strip() for i in intro] # Delete spaces around text
print(len(intro), intro)
# Extract the current latest chapter
update = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//a[@data-eid="qd_C43"]//text()')
update = [i.strip() for i in update] # Delete spaces around text
print(len(update), update)
# Extract latest update time
date = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//span//text()')
print(len(date), date)
  • The printed results are shown below, which confirms that we have obtained the data we want!

Cracking the font anti-scraping

  • Next, the number of monthly tickets. But! When inspecting the monthly ticket count, I found that the characters were garbled.

  • The web page shows only small boxes, so we cannot tell what the digits are. Let's look in the page source file we just saved.

  • We see some &#xxxxxx; entities. This usually means font-based anti-scraping. I didn't expect font obfuscation on a novel ranking list, but since we've run into it, let's deal with it.
  • Scroll through the source and, huh, there it is: an @font-face rule.
  • The @font-face rule contains several URLs. Yes, this is the font. Copy the URL https://qidian.gtimg.com/qd_anti_spider/jUlcIiMg.woff , open it in a new tab and it downloads directly (or copy the one ending in .ttf, https://qidian.gtimg.com/qd_anti_spider/jUlcIiMg.ttf ; here I downloaded both).
  • After getting the font, we first go to the website http://fontstore.baidu.com/static/editor/index.html and open the .ttf file there.
  • We can see that the font covers only the digits 0-9. Next we use the Python library fontTools to process the font file; it can be installed with pip install fontTools.
from fontTools.ttLib import TTFont

font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')
  • The code above converts the .woff / .ttf file to XML format, and we can then open the XML file in a browser.
  • We find that it matches exactly what the font-parsing website showed. That's it! We use fontTools' getBestCmap() method to get the mapping.
from fontTools.ttLib import TTFont

font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')
print(font.getBestCmap())

Output:

{100293: 'eight', 100295: 'four', 100296: 'three', 100297: 'one', 100298: 'period', 100299: 'two', 100300: 'nine', 100301: 'five', 100302: 'zero', 100303: 'six', 100304: 'seven'}

  • Wondering why this differs from what the website showed? The codes in the font-parsing website and in the XML are hexadecimal (prefixed with 0x), while fontTools outputs decimal. If you don't believe it, check with a calculator (there is also a quick check after the output below).
  • With the mapping in hand, we manually convert the English number words into Arabic digits and drop the useless 100298: 'period' entry. Note that the encrypted characters appear in the page source as &#xxxxxx; entities, so for easy replacement we also rewrite the keys in that form:
font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')
print(font.getBestCmap())
# Build English to digital dictionary
camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8,
        'nine': 9}
cp = {}
for k,v in font.getBestCmap().items():
    try:  # Skip non-digit glyphs such as 100298: 'period'
        cp['&#' + str(k) + ';'] = camp[v]
    except KeyError as e:
        pass
print(cp)

Output:

{'𘟅': 8, '𘟇': 4, '𘟈': 3, '𘟉': 1, '𘟋': 2, '𘟌': 9, '𘟍': 5, '𘟎': 0, '𘟏': 6, '𘟐': 7}
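  • About the hexadecimal/decimal point above, here is a quick check (a minimal sketch; the code 100293 is taken from the getBestCmap() output, and 0x187c5 is the form the XML and the font editor show):
# Hex and decimal are the same code point written two ways
print(hex(100293))        # -> 0x187c5, the code seen in the XML / font editor
print(int('187c5', 16))   # -> 100293, the decimal value fontTools returns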

  • We now have the font mapping, so we can replace the encrypted characters in the downloaded page source with normal Arabic numerals according to it:
import re

import requests
from fontTools.ttLib import TTFont


def getHtml(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None


font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')
print(font.getBestCmap())

camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8,
        'nine': 9}
cp = {}
for k, v in font.getBestCmap().items():
    try:
        cp['&#' + str(k) + ';'] = camp[v]
    except KeyError as e:
        pass
print(cp)
# Fetch the page source and save it to a txt file
html = getHtml('https://www.qidian.com/rank/yuepiao')
f = open('M:/html.txt', 'w')
f.write(html)
f.close()

# Replace the encrypted characters in the page source with normal digits and save the result
for key in cp.keys():
    html = re.sub(key, str(cp[key]), html)
f = open('M:/html_change.txt', 'w')
f.write(html)
f.close()
  • After running this, we open the two text files to see whether the replacement succeeded.

  • Hmm, why didn't the replacement succeed? Is re.sub at fault? No: the encrypted characters here don't match the mapping keys we just obtained.
  • Looking at @font-face again, the font has changed! The one we just used was jUlcIiMg.woff, but here it has become omkqwdts.woff. It seems the font is different on every visit, so we cannot rely on a single downloaded .woff file.
  • Instead, every time we fetch the page source we use a regular expression to pull out the font URL, download it, and then parse that font file for the replacement. To this end, we change the page-fetching function as follows: after fetching the page, it extracts the font URL, downloads it, and saves it as font.woff.
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}

def getHtml(url):
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    # Pull the font URL out of the inline @font-face rule, then download the .woff file
    woff = re.search(r"format\('eot'\); src: url\('(.+?)'\) format\('woff'\)", response.text, re.S)
    fontfile = requests.get(woff.group(1), headers=headers)
    if fontfile.status_code != 200:
        return None
    f = open('M:/font.woff', 'wb')
    f.write(fontfile.content)
    f.close()
    return response.text
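  • To see what that regular expression is matching, here is a rough sketch (the exact layout of the page's inline CSS is an assumption for illustration; only the format('eot') / format('woff') markers come from the regex above, and the font URLs are the ones we saw earlier):
import re

# Rough shape of the inline @font-face declaration (layout assumed for illustration)
sample = ("@font-face { font-family: jUlcIiMg; "
          "src: url('jUlcIiMg.eot') format('eot'); "
          "src: url('https://qidian.gtimg.com/qd_anti_spider/jUlcIiMg.woff') format('woff'), "
          "url('https://qidian.gtimg.com/qd_anti_spider/jUlcIiMg.ttf') format('truetype'); }")

m = re.search(r"format\('eot'\); src: url\('(.+?)'\) format\('woff'\)", sample, re.S)
print(m.group(1))  # -> https://qidian.gtimg.com/qd_anti_spider/jUlcIiMg.woff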
  • Test again:
import re

import requests
from fontTools.ttLib import TTFont

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}

def getHtml(url):
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    # Pull the font URL out of the inline @font-face rule, then download the .woff file
    woff = re.search(r"format\('eot'\); src: url\('(.+?)'\) format\('woff'\)", response.text, re.S)
    fontfile = requests.get(woff.group(1), headers=headers)
    if fontfile.status_code != 200:
        return None
    f = open('M:/font.woff', 'wb')
    f.write(fontfile.content)
    f.close()
    return response.text


# Fetch the page first, so that font.woff matches the HTML we are about to fix
html = getHtml('https://www.qidian.com/rank/yuepiao')
f = open('M:/html.txt', 'w')
f.write(html)
f.close()

# Build the glyph-to-digit mapping from the freshly downloaded font
font = TTFont('M:/font.woff')
print(font.getBestCmap())

camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8,
        'nine': 9}
cp = {}
for k, v in font.getBestCmap().items():
    try:  # Skip non-digit glyphs such as 'period'
        cp['&#' + str(k) + ';'] = camp[v]
    except KeyError:
        pass
print(cp)

# Replace the encrypted character references with normal digits and save the result
for key in cp.keys():
    html = re.sub(key, str(cp[key]), html)
f = open('M:/html_change.txt', 'w')
f.write(html)
f.close()

Font obtained successfully

Replacement succeeded!!!

  • We wrap the font-handling code in a function to keep things tidy.
def fontProc(text):
    font = TTFont('M:/font.woff')
    camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8,
            'nine': 9}
    cp = {}
    for k, v in font.getBestCmap().items():
        try:  # Filter useless mappings
            cp['&#' + str(k) + ';'] = camp[str(v)]
        except KeyError as e:
            pass
    for key in cp.keys():
        text = re.sub(key, str(cp[key]), text)
    return text
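  • As a quick check (a sketch, assuming font.woff and html.txt were saved by the test run above), we can pass the saved page source through fontProc() and confirm that the six-digit &#xxxxxx; references are gone:
import re

with open('M:/html.txt', 'r') as f:
    raw = f.read()

fixed = fontProc(raw)
# Count the six-digit character references before and after the replacement;
# the second number should drop to zero if every encrypted digit was mapped.
print(len(re.findall(r'&#\d{6};', raw)), len(re.findall(r'&#\d{6};', fixed)))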

Getting and saving the information

  • With the font replaced, the number of monthly tickets can be extracted with XPath as well. Our information-extraction function now looks like this:
def getBook(html):
    html = etree.HTML(html)
    html = etree.tostring(html)
    html = etree.fromstring(html)
    name = html.xpath('//li//div[@class="book-mid-info"]//h4//a//text()')
    author = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@class="name"]//text()')
    types = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C42"]//text()')
    status = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//span//text()')
    intro = html.xpath('//li//div[@class="book-mid-info"]//p[@class="intro"]//text()')
    intro = [i.strip() for i in intro]
    update = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//a[@data-eid="qd_C43"]//text()')
    update = [i.strip() for i in update]

    date = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//span//text()')
    tickets = html.xpath('//li//div[@class="book-right-info"]//div[@class="total"]//p//span//span//text()')
    book = zip(name, author, types, status, intro, update, date, tickets)
    return book
  • For convenience, we write a function to save this information:
def saveInfo(url):
    html = getHtml(url)
    html = fontProc(html)
    book = getBook(html)
    for name, author, types, status, intro, update, date, tickets in book:
        with open('M:/novels.txt', 'a+') as f:
            f.write('Novel title:' + name + '\n')
            f.write('Author:' + author + '   Novel type:' + types + '   Current status:' + status + '\n')
            f.write('Introduction to the novel:' + intro + '\n')
            f.write(update + '   Update time:' + date + '\n')
            f.write('Number of monthly tickets:' + tickets + '\n')
            f.write('\n\n')
  • Let's run it and see:
saveInfo('https://www.qidian.com/rank/yuepiao')

  • We get exactly the information we want.

Getting all pages

  • Through the analysis above we have obtained all the information, but only for a single page; now let's scrape every page.
    Clicking page 2, the URL becomes https://www.qidian.com/rank/yuepiao?page=2 ,
    and clicking page 3, it becomes https://www.qidian.com/rank/yuepiao?page=3 .
    The pattern is clear: the page parameter is just the page number. Since there are only five pages in total, we write:
for page in range(1, 5 + 1):
    url = 'https://www.qidian.com/rank/yuepiao?page=%d'%page
    saveInfo(url)
  • Running it, we hit a problem caused by \xa0, an extended Latin-1 character that corresponds to the &nbsp; non-breaking space.
  • We can replace it with an ordinary space character.

In the getBook() function:

update = [i.strip() for i in update]

Replace with:

update = [i.strip().replace('\xa0', ' ') for i in update]
  • After the change we run again, and... another error:


Similarly, in the getBook function

intro = [i.strip() for i in intro]

Replace with:

intro = [i.strip().replace('\u2022', ' ') for i in intro]
  • We run it again, and yet another offending character turns up.


Replace again:

intro = [i.strip().replace('\u2022', ' ').replace('\u2003', ' ') for i in intro]
  • Run it once more and it finally succeeds! We have obtained all the information we need!
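  • As a side note, instead of chaining replace() calls character by character, a single regular-expression substitution covering all three troublesome characters would also do the job (a sketch, not the code used in the final version below):
import re

# \xa0 = no-break space (&nbsp;), \u2022 = bullet, \u2003 = em space
def cleanText(s):
    return re.sub('[\xa0\u2022\u2003]', ' ', s.strip())

intro = [cleanText(i) for i in intro]
update = [cleanText(i) for i in update]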

Complete code (hooray!)

import requests
from lxml import etree
from fontTools.ttLib import TTFont
import re

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
woffDir = './font.woff'
novelsDir = './novels.txt'


def getHtml(url):
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    # Pull the font URL out of the inline @font-face rule, then download the .woff file
    woff = re.search(r"format\('eot'\); src: url\('(.+?)'\) format\('woff'\)", response.text, re.S)
    fontfile = requests.get(woff.group(1), headers=headers)
    if fontfile.status_code != 200:
        return None
    f = open(woffDir, 'wb')
    f.write(fontfile.content)
    f.close()
    response.encoding = response.apparent_encoding
    return response.text


def fontProc(text):
    font = TTFont(woffDir)
    camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8,
            'nine': 9}
    cp = {}
    for k, v in font.getBestCmap().items():
        try:  # Filter useless mappings
            cp['&#' + str(k) + ';'] = camp[str(v)]
        except KeyError as e:
            pass
    for key in cp.keys():
        text = re.sub(key, str(cp[key]), text)
    return text


def getBook(html):
    html = etree.HTML(html)
    html = etree.tostring(html)
    html = etree.fromstring(html)
    name = html.xpath('//li//div[@class="book-mid-info"]//h4//a//text()')
    author = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@class="name"]//text()')
    types = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C42"]//text()')
    status = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//span//text()')
    intro = html.xpath('//li//div[@class="book-mid-info"]//p[@class="intro"]//text()')
    intro = [i.strip().replace('\u2022', ' ').replace('\u2003', ' ') for i in intro]
    update = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//a[@data-eid="qd_C43"]//text()')
    update = [i.strip().replace('\xa0', ' ') for i in update]
    date = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//span//text()')
    tickets = html.xpath('//li//div[@class="book-right-info"]//div[@class="total"]//p//span//span//text()')
    book = zip(name, author, types, status, intro, update, date, tickets)
    return book


def saveInfo(url):
    html = getHtml(url)
    html = fontProc(html)
    book = getBook(html)
    for name, author, types, status, intro, update, date, tickets in book:
        with open(novelsDir, 'a+') as f:
            f.write('Novel title:' + name + '\n')
            f.write('Author:' + author + '   Novel type:' + types + '   Current status:' + status + '\n')
            f.write('Introduction to the novel:' + intro + '\n')
            f.write(update + '   Update time:' + date + '\n')
            f.write('Number of monthly tickets:' + tickets + '\n')
            f.write('\n\n')


for page in range(1, 5 + 1):
    url = 'https://www.qidian.com/rank/yuepiao?page=%d' % page
    saveInfo(url)
