Python homework - Crawler + visualization + data analysis + database (data analysis)

Python homework - Crawler + visualization + data analysis + database (brief introduction)

Python homework - Crawler + visualization + data analysis + database (crawler)

Python homework - Crawler + visualization + data analysis + database (visualization)

Python homework - Crawler + visualization + data analysis + database (database)

1, Generate a word cloud from the lyrics

First, we need to collect the lyrics of all the crawled songs and concatenate them into a single string

Then extract only the Chinese characters and join the matches into one string

text = re.findall(r'[\u4e00-\u9fa5]+', lyric, re.S)  # Extract runs of Chinese characters
text = " ".join(text)
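As a minimal, self-contained sketch of this extraction step: the helper name extract_chinese and the sample line are hypothetical and only serve to illustrate the pattern. The character class covers U+4E00 to U+9FA5, so Latin letters, digits, timestamps and punctuation are dropped.

```python
import re

# Runs of characters in the CJK range U+4E00..U+9FA5.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')


def extract_chinese(lyric: str) -> str:
    """Return only the Chinese runs of `lyric`, joined by single spaces."""
    return " ".join(CHINESE_RUN.findall(lyric))


# Hypothetical lyric line mixing a timestamp, Chinese, Latin text and digits.
sample = "[00:12.50]你好 world 123，再见"
print(extract_chinese(sample))  # prints: 你好 再见
```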

Then use jieba to segment the text into words, keeping only words whose length is at least 2

word = jieba.cut(text, cut_all=True)  # Segment the text (full mode)
new_word = []
for i in word:
    if len(i) >= 2:
        new_word.append(i)  # Only keep words of length 2 or more
final_text = " ".join(new_word)
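The filtering step on its own can be sketched without jieba, since it works on any iterable of tokens. The helper name keep_long_words and the token list below are made up for illustration; in the real program the tokens would come from jieba.cut.

```python
from typing import Iterable, List


def keep_long_words(tokens: Iterable[str], min_len: int = 2) -> List[str]:
    """Keep only tokens with at least `min_len` characters."""
    return [t for t in tokens if len(t) >= min_len]


# Hypothetical tokens standing in for the output of jieba.cut(text, cut_all=True).
tokens = ["我", "喜欢", "音乐", "的", "旋律"]
final_text = " ".join(keep_long_words(tokens))
print(final_text)  # prints: 喜欢 音乐 旋律
```

Single-character tokens are mostly particles and pronouns, so dropping them keeps the cloud focused on more meaningful words.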

Next, choose a good-looking picture to use as the mask that shapes the word cloud, and you can start generating!

import os
import numpy as np
from PIL import Image
from wordcloud import WordCloud

mask = np.array(Image.open("2.jpg"))  # Mask image that shapes the cloud
word_cloud = WordCloud(background_color="white", width=800, height=600,
                       max_words=100, max_font_size=80, contour_width=1,
                       contour_color='lightblue', mask=mask,
                       font_path="C:/Windows/Fonts/simfang.ttf").generate(final_text)
# plt.imshow(word_cloud, interpolation="bilinear")
# plt.axis("off")
# plt.show()
word_cloud.to_file(self.keyword+'Ci Yun.png')
os.startfile(self.keyword+'Ci Yun.png')  # os.startfile is available on Windows only

Among the WordCloud parameters, contour_width=1 and contour_color='lightblue' set the thickness and the color of the outline drawn around the mask image. If contour_width is not set (it defaults to 0), no outline is drawn. font_path specifies the font file; it must point to a font that contains Chinese glyphs, otherwise the Chinese words cannot be rendered

After generation, the word cloud can either be displayed with plt.show() or saved locally and then opened. The final result is as follows
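Because os.startfile exists only on Windows, the "save and open" step fails elsewhere. The following is one possible sketch of a portable opener, not part of the original program: the function names opener_command and open_file are hypothetical, and it assumes the open command on macOS and xdg-open on Linux desktops.

```python
import os
import subprocess
import sys


def opener_command(path, platform=sys.platform):
    """Return the command that opens `path`, or None on Windows,
    where os.startfile is used instead of an external command."""
    if platform.startswith("win"):
        return None
    if platform == "darwin":
        return ["open", path]      # macOS
    return ["xdg-open", path]      # assumption: a Linux desktop with xdg-utils


def open_file(path):
    """Open `path` with the default application of the current platform."""
    cmd = opener_command(path)
    if cmd is None:
        os.startfile(path)         # Windows only
    else:
        subprocess.run(cmd, check=False)
```

Calling open_file on the saved image would then replace the direct os.startfile call.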

2, Pie chart of popular singers' song counts

First, get the list of popular singers and the number of songs each of them has

Then divide each singer's song count by the total song count of all ten singers to get each singer's share of the songs

Next, you can choose which slice to highlight, such as Jay Chou in the figure

To do this, give the slice you want to pull out a non-zero offset in the explode list and leave the others at 0
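A minimal sketch of this calculation follows. The singer names and counts are invented for illustration (three singers instead of ten), and the helper name song_proportions is hypothetical.

```python
def song_proportions(counts):
    """Return each count divided by the sum of all counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("total song count is zero")
    return [c / total for c in counts]


# Hypothetical data: in the real program these come from the crawler.
name = ["Singer A", "Singer B", "Singer C"]
song_counts = [50, 30, 20]

proportion = song_proportions(song_counts)
for n, p in zip(name, proportion):
    print(f"{n}: {p:.1%}")
```

The resulting proportion list sums to 1 and is in the same order as name, which is what the pie chart call expects.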

explode = [0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
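Rather than hard-coding the position, the explode list can also be built from the singer's name. This is only a sketch; build_explode and the names below are hypothetical.

```python
def build_explode(names, highlight, offset=0.1):
    """Return an explode list with `offset` for `highlight` and 0 elsewhere."""
    return [offset if n == highlight else 0 for n in names]


names = ["Singer A", "Singer B", "Singer C"]  # hypothetical singer list
explode = build_explode(names, "Singer B")
print(explode)  # prints: [0, 0.1, 0]
```

This keeps the highlighted slice correct even if the order of the singer list changes.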

Next, you can generate a pie chart

import os
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 9))  # Set the figure size (width, height) in inches
plt.rcParams['font.sans-serif'] = ['SimHei']  # Use a font with Chinese glyphs so labels are not garbled
plt.axes(aspect=1)  # Equal aspect ratio so the pie is drawn as a circle
plt.pie(x=proportion, labels=name, explode=explode, autopct='%3.1f %%',
        shadow=True, labeldistance=1.2, startangle=0, pctdistance=0.8)
plt.title("Proportion of popular singers' songs")
# plt.show()
plt.savefig("Pie chart of the proportion of popular singers' songs.jpg")
os.startfile("Pie chart of the proportion of popular singers' songs.jpg")  # Windows only

Here x is the list of song proportions, labels is the list of corresponding labels (in this figure, the singers' names), and explode is the list of highlight offsets mentioned above. The three lists correspond element by element. autopct sets how the percentage shown on each slice is formatted: in '%3.1f %%', %3.1f prints a floating-point number with a minimum width of 3 characters (padding is added only if the number is shorter than that) and 1 digit after the decimal point, and %% prints a literal percent sign
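The format string can be tried on its own with Python's % operator, which uses the same printf-style rules. The sample values below are arbitrary.

```python
fmt = '%3.1f %%'

# The width is a minimum: longer numbers are never truncated.
labels = [fmt % v for v in (5.0, 12.345, 100.0)]
print(labels)  # prints: ['5.0 %', '12.3 %', '100.0 %']

# With a larger minimum width, short numbers are padded with spaces on the left.
padded = '%6.1f' % 5.0
print(repr(padded))  # prints: '   5.0'
```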

As before, you can either show the chart directly or save it locally and then open it

3, Bar chart of song popularity

Previously, we obtained the information of the top 500 songs with the crawler (as follows). Now we want to analyze the popularity of the songs and generate a bar chart

The result looks like this:

Originally, I wanted to generate a bar chart of how many popular songs each singer has, but the website the popular songs were crawled from does not list the singer of each song, so I would have had to fetch the singers from other websites. That was too much trouble; interested readers can implement it themselves

First, we need to get the number of songs in each popularity range

In the code below, each entry of the data list is the number of songs in the corresponding range of the x tuple

We only need to traverse the song popularity list once and, for each song, add 1 to the entry of data that matches its popularity range

x = ('0-10', '10-20', '20-30', '30-40', '40-50', '>50')
data = [0, 0, 0, 0, 0, 0]
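The traversal described above might look like the sketch below. It rests on assumptions not stated in the original: popularity values are non-negative numbers, each range includes its lower bound and excludes its upper bound, and a value of exactly 50 is counted in the last range. The function name count_by_range and the sample values are hypothetical.

```python
x = ('0-10', '10-20', '20-30', '30-40', '40-50', '>50')


def count_by_range(popularities, bin_width=10, n_bins=len(x)):
    """Count how many values fall into each range of width `bin_width`.

    Values at or beyond the start of the last range all go into the last bin.
    """
    data = [0] * n_bins
    for p in popularities:
        index = min(int(p // bin_width), n_bins - 1)
        data[index] += 1
    return data


# Hypothetical popularity values; the real ones come from the crawled song list.
popularity = [3, 10, 19.5, 42, 50, 87]
data = count_by_range(popularity)
print(data)  # prints: [1, 2, 0, 0, 1, 2]
```

Integer division by the bin width gives the index directly, which avoids a chain of if/elif comparisons.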

The next step is to create the bar chart. First, set a font with Chinese glyphs so that Chinese text is not garbled, and make sure the minus sign still renders correctly with that font

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

Then create the chart with plt.bar. The first argument is the x-axis data, the second is the y-axis data, color is the fill color of the bars, and alpha is their transparency

title, xlabel and ylabel set the chart title and the labels of the x-axis and y-axis respectively

plt.bar(x, data, color='steelblue', alpha=0.8)
plt.title("pop500 Song popularity")
plt.xlabel("Song popularity range")
plt.ylabel("Number of songs")
plt.show()

Posted by Dima on Sat, 13 Aug 2022 21:32:58 +0300