Python homework - Crawler + visualization + data analysis + database (brief introduction)
Python homework - Crawler + visualization + data analysis + database (crawler)
Python homework - Crawler + visualization + data analysis + database (visualization)
Python homework - Crawler + visualization + data analysis + database (database)
1, Generate a lyrics word cloud
First, get the lyrics of all the crawled songs and join them into a single string
Then extract the Chinese text and join the pieces into one string
import re

text = re.findall('[\u4e00-\u9fa5]+', lyric, re.S)  # Extract runs of Chinese characters
text = " ".join(text)
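As a small illustration of the two steps above, the sketch below joins a few lyrics into one string and keeps only the runs of Chinese characters. The sample lyrics and the names `lyrics` and `extract_chinese` are made up for this example. The range `\u4e00-\u9fa5` covers the common CJK ideographs, not every Chinese character.

```python
import re

# Matches runs of common CJK ideographs (U+4E00 to U+9FA5).
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')


def extract_chinese(lyric):
    """Return the Chinese runs found in lyric, joined by single spaces."""
    return " ".join(CHINESE_RUN.findall(lyric))


# Hypothetical sample data standing in for the crawled lyrics.
lyrics = ["今天天气很好, let's go!", "我们一起去唱歌 123"]
lyric = "\n".join(lyrics)      # one string holding all lyrics
text = extract_chinese(lyric)  # Latin letters, digits and punctuation are dropped
print(text)
```

Everything that is not a Chinese character acts as a separator here, so English words, digits and punctuation never reach the word segmentation step.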
Then use jieba to segment the text into words, and keep only the words whose length is 2 or more
import jieba

word = jieba.cut(text, cut_all=True)  # Segment the text into words (full mode)
new_word = []
for i in word:
    if len(i) >= 2:
        new_word.append(i)  # Only keep words with a length of at least 2
final_text = " ".join(new_word)
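The length filter can be tried without installing jieba. In the sketch below, the hard-coded `tokens` list is a hypothetical stand-in for the output of `jieba.cut`, which yields an iterable of strings. The helper name `keep_long_words` is also an assumption, not part of the original code.

```python
def keep_long_words(tokens, min_len=2):
    """Join the tokens that have at least min_len characters with spaces."""
    return " ".join(t for t in tokens if len(t) >= min_len)


# Hypothetical tokens, standing in for the output of jieba.cut(text, cut_all=True).
tokens = ["今天", "天气", "很", "好", "我们", "一起", "去", "唱歌"]
final_text = keep_long_words(tokens)
print(final_text)
```

Dropping single-character tokens removes most particles and leftover fragments, which would otherwise take up space in the word cloud.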
Next, pick a nice picture to use as the mask (shape) of the word cloud, and you can start generating!
import os
import numpy as np
from PIL import Image
from wordcloud import WordCloud

mask = np.array(Image.open("2.jpg"))  # Picture whose shape the word cloud fills
word_cloud = WordCloud(background_color="white",
                       width=800,
                       height=600,
                       max_words=100,
                       max_font_size=80,
                       contour_width=1,
                       contour_color='lightblue',
                       font_path="C:/Windows/Fonts/simfang.ttf",
                       mask=mask).generate(final_text)
# plt.imshow(word_cloud, interpolation="bilinear")
# plt.axis("off")
# plt.show()
word_cloud.to_file(self.keyword+'Ci Yun.png')
os.startfile(self.keyword+'Ci Yun.png')  # Open the saved image (Windows only)
Among the WordCloud parameters, contour_width=1 and contour_color='lightblue' set the thickness and the color of the outline drawn around the mask picture, respectively. If they are not set, no outline appears. font_path specifies the font, which has to support Chinese characters here
After generation, the word cloud can be displayed with plt.show(), or saved locally and then opened. The final result is as follows
2, Pie chart of popular singers' song counts
First, get the list of popular singers and the number of songs each of them has
Then divide each singer's number of songs by the total number of songs of all ten singers to get each singer's share of the songs
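A minimal sketch of this calculation follows. The singer names and song counts are invented for illustration, and `song_count` is an assumed variable name. `name` and `proportion` match the names used in the plotting code of this section.

```python
# Hypothetical data; the real values come from the crawler.
name = ["Singer A", "Singer B", "Singer C"]
song_count = [50, 30, 20]

total = sum(song_count)
# Share of each singer: own count divided by the total of all singers.
proportion = [count / total for count in song_count]
print(proportion)
```

The shares always add up to 1, which is a quick sanity check before plotting.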
Next, you can choose which slice to highlight, such as Jay Chou in the figure
To do so, give that slice a nonzero value in the explode list, as follows; the larger the value, the further the slice is pulled out from the center
explode = [0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
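Instead of typing the list by hand, it can be built from the singer names, so that the highlighted slice stays correct if the order of the singers changes. This is only a sketch: the names other than Jay Chou are placeholders, and `highlight` is an assumed variable name.

```python
# Hypothetical name list; only the first entry is taken from the text.
name = ["Jay Chou", "Singer B", "Singer C"]
highlight = "Jay Chou"

# 0.1 pulls the matching slice out of the pie, 0 leaves a slice in place.
explode = [0.1 if n == highlight else 0 for n in name]
print(explode)
```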
Next, you can generate a pie chart
import os
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 9))  # Set the figure size (width, height)
plt.rcParams['font.sans-serif'] = ['SimHei']  # Use a font that can display Chinese labels
plt.axes(aspect=1)  # Equal aspect ratio, so the pie is drawn as a circle
plt.pie(x=proportion, labels=name, explode=explode, autopct='%3.1f %%',
        shadow=True, labeldistance=1.2, startangle=0, pctdistance=0.8)
plt.title("Proportion of popular singers' songs")
# plt.show()
plt.savefig("Pie chart of the proportion of popular singers' songs.jpg")
os.startfile("Pie chart of the proportion of popular singers' songs.jpg")  # Windows only
Here x is the list of song proportions, labels is the list of corresponding labels (in this figure, the singers' names), and explode is the highlight list mentioned above. The values in these three lists correspond one to one. autopct sets how the percentage values are displayed: %3.1f formats a floating-point number with a minimum width of 3 characters (shorter numbers are padded with spaces) and 1 digit after the decimal point, and %% prints a literal percent sign
As before, you can either display the chart directly with plt.show(), or save it locally and then open it
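The format string can be tried on its own with Python's % operator, which is how matplotlib applies a string autopct to each percentage. The sample values below are arbitrary.

```python
fmt = '%3.1f %%'

a = fmt % 12.345         # already wider than 3 characters, so no padding
b = fmt % 5.0            # exactly 3 characters wide
c = '%6.1f %%' % 5.0     # a wider field pads with spaces on the left

print(repr(a))  # '12.3 %'
print(repr(b))  # '5.0 %'
print(repr(c))  # '   5.0 %'
```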
3, Bar chart of song popularity
Previously, we obtained the information of the top 500 songs through the crawler (as follows). Now we want to analyze the popularity of the songs and generate a bar chart
The result looks like this:
Originally, I wanted to generate a bar chart of the number of popular songs each singer has, but the site the popular songs were crawled from does not list the singer of each song. I would have to go to other websites to find the singer of every song, which is too much trouble. Interested readers can implement it themselves
First, we need to count the number of songs in each popularity range
In the code below, each entry of the data list is the number of songs in the corresponding range of the tuple x
We only need to traverse the list of song popularity values and, for each value, add 1 to the entry of data for the range it falls in. This gives the number of songs in each popularity range
x = ('0-10', '10-20', '20-30', '30-40', '40-50', '>50')
data = [0, 0, 0, 0, 0, 0]
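The counting loop described above could look like the sketch below. The `popularity` list is invented sample data. The handling of boundary values is an assumption, since the labels overlap: here a value of exactly 10 counts towards '10-20', and everything from 50 upward goes into the last range.

```python
x = ('0-10', '10-20', '20-30', '30-40', '40-50', '>50')
data = [0] * len(x)

# Hypothetical popularity values standing in for the crawled top 500 list.
popularity = [3, 12, 18, 25, 47, 50, 88]

for value in popularity:
    # Each range is 10 wide; cap the index so large values land in the last range.
    index = min(int(value // 10), len(x) - 1)
    data[index] += 1

print(data)
```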
The next step is to create the bar chart. First, set a font that can display Chinese, so that the labels are not garbled
plt.rcParams['font.sans-serif'] = ['SimHei']  # Font that can display Chinese
plt.rcParams['axes.unicode_minus'] = False  # Draw the minus sign correctly with this font
Then you can create the chart with plt.bar. The first argument is the x-axis data, the second argument is the y-axis data, color is the fill color of the bars, and alpha is their transparency
plt.title, plt.xlabel and plt.ylabel set the chart title and the labels of the x-axis and y-axis
plt.bar(x, data, color='steelblue', alpha=0.8)
plt.title("pop500 Song popularity")
plt.xlabel("Song popularity range")
plt.ylabel("Number of songs")
plt.show()