Python + shell backup of CSDN blog posts, part 2: optimized version

In the previous blog post, "python+shell backup csdn blog post", we successfully backed up all of our blog posts. However, one very important piece of information is missing: the date each post was last updated. The reason is that this data is not returned by the interface CSDN provides.

So I need to get this data another way. Following the approach from the previous post, I'll crawl it from my blog's homepage.

This post won't repeat the whole approach; see the previous blog post for the details.

Optimized Python script to get the IDs

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import urllib.request
from bs4 import BeautifulSoup

def getid(x):
    # Fetch one page of the article list and extract each post's ID and date.
    url = r'https://blog.csdn.net/fungleo/article/list/' + str(x)
    res = urllib.request.urlopen(url)
    html = res.read().decode('utf-8')
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('div', 'article-item-box')

    for i in links:
        idStr = i['data-articleid']
        timeStr = i.find_all('span', 'date')[0].string
        outStr = '("' + idStr + '", "' + timeStr + '"),\n'
        # "with" closes the file automatically; no explicit close() needed.
        with open('idtime.txt', 'a+') as f:
            f.write(outStr)

def do():
    # The article list pages are numbered starting from 1.
    for i in range(1, 15):
        getid(i)

do()

Here, each article's ID and date are combined into a tuple, one per line, with a trailing comma. We can then manually wrap the lines in square brackets to form a list of tuples, which will be convenient to use later.

Download JSON file

Since I had already downloaded the json files, there was no need to repeat the download here. Readers can refer to the previous blog post to see how to download them.

However, since our idtime.txt file contains not only the ID but also the time, the script of the previous blog post needs to be adjusted:

for i in $(cat idtime.txt | cut -f 2 -d '"'); do sh t.sh $i > $i.json; sleep 1; done

The main adjustment is to use the cut tool to extract the ID.
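To see why field 2 is the ID, consider how a line of idtime.txt splits on double quotes. A quick Python check (the ID and date below are made-up illustrative values):

```python
# One line of idtime.txt, as written by the getid script
# (the ID and date here are illustrative values):
line = '("80028826", "2018-06-29 10:00:00"),\n'

# cut -d '"' splits on double quotes; cut's fields are 1-indexed,
# so cut -f 2 corresponds to index 1 after Python's split():
fields = line.split('"')
print(fields[1])   # the article ID
print(fields[3])   # the date, which the shell loop ignores
```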

After downloading all the json files, create a json folder and put all these files in it.

Optimized JSON to Markdown script

First, manually edit the idtime.txt file: indent every line by one level and wrap the content in the following format:

TIME = [
    # ...the original lines of idtime.txt go here, each one indented...
]

Then rename the file to timeid.py.
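If you'd rather not do this edit by hand, the wrapping step can be scripted as well. A minimal sketch, assuming idtime.txt sits in the current directory (the sample lines here are made up so the snippet is self-contained):

```python
# Two sample lines in the format the getid script produces
# (the IDs and dates are made up for illustration):
sample = ('("80028826", "2018-06-29 10:00:00"),\n'
          '("80028827", "2018-06-30 11:30:00"),\n')
with open('idtime.txt', 'w', encoding='utf-8') as f:
    f.write(sample)

# Wrap the collected tuples in "TIME = [ ... ]" and save as timeid.py,
# so the conversion script can simply import the list.
with open('idtime.txt', encoding='utf-8') as src:
    body = src.read()

with open('timeid.py', 'w', encoding='utf-8') as dst:
    dst.write('TIME = [\n')
    for line in body.splitlines(True):
        dst.write('    ' + line)
    dst.write(']\n')
```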

OK, then there's the script below.

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import os
import json
import timeid

sourceDir = './json'

def getDate(name):
    # Look up the article's date by matching its ID against the file name.
    for i in timeid.TIME:
        if i[0] in name:
            return i[1]
    # Fallback for articles whose ID was not found.
    return '2018-06-29 00:00:00'

def readJson(filPath):
    with open(filPath, encoding='utf-8') as f:
        data = json.load(f)
        date = getDate(f.name)

    title = data['data']['title']
    # Slashes are not allowed in file names, so replace them.
    saveTitle = title.replace('/', ':')
    content = data['data']['markdowncontent']
    tags = data['data']['tags'].split(',')

    if content:
        with open('./markdown/{title}.md'.format(title=saveTitle), 'a+') as mdFile:
            mdFile.write('title: ' + title + '\n')
            mdFile.write('date: ' + date + ' +0800\n')
            mdFile.write('update: ' + date + ' +0800\n')
            mdFile.write('author: fungleo\n')
            mdFile.write('tags:\n')
            for tag in tags:
                # YAML list items need a space after the dash.
                mdFile.write('    - ' + tag + '\n')

            mdFile.write('---\n\n')
            mdFile.write(content)

def findJson():
    # Make sure the output folder exists before writing into it.
    os.makedirs('./markdown', exist_ok=True)
    for fil in os.listdir(sourceDir):
        filPath = os.path.join(sourceDir, fil)
        readJson(filPath)

findJson()

Run the script, and all the blog posts will be generated in the markdown folder.
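The fallback date in getDate deserves a note: any json file whose name does not contain a known ID gets a fixed placeholder date. The lookup can be sketched on its own like this (with the TIME list inlined and made-up values instead of importing timeid.py):

```python
# Same lookup logic as getDate in the script above, but with the
# TIME list inlined instead of imported from timeid.py:
TIME = [
    ("80028826", "2018-06-29 10:00:00"),
]

def get_date(name, fallback='2018-06-29 00:00:00'):
    # A file path like './json/80028826.json' contains the article ID,
    # so a simple substring match is enough to find the date.
    for article_id, date in TIME:
        if article_id in name:
            return date
    return fallback

print(get_date('./json/80028826.json'))  # known ID -> its recorded date
print(get_date('./json/unknown.json'))   # unknown ID -> the fallback date
```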

This article was originally created by FungLeo. Reprinting is permitted, but the link to the original must be retained.

Posted by Edwin Okli on Thu, 05 May 2022 18:02:14 +0300