I finally wrote the last article, two days later than I said before. This series lasted about 23 days (not counting today). On average, except for this one, it basically conforms to the Flag I set up at the beginning. Although the actual learning time is less than half, such learning efficiency is still too low. My overall course progress still follows the course, and there is only this one in practical application.
overall evaluation: OK, this is the first time I have been writing a blog continuously. Although it is posted on my personal blog, I will also post all CSDN blogs after modification. The follow-up arrangement will be made after the final exam. The current idea is to review a series of C language, and the next series of python will be arranged separately. I'm sorry for my poor writing.
Personal blog address: www.cloudray.com club
Stock data analysis directional crawler
Note: this code is only for learning and reference
Requirements:
1: Get the A-share stock code through a website
2: Get some specific information about the stock on another website through the A-share stock code
Clear objectives:
1: Search the stock code list. After querying several websites, it is easier to obtain the A-share stock code from Lianban. You can query the stock code by Region:
https://www.banban.cn/gupiao/ +Address
2: There are two options for querying stock information: Oriental fortune.com or snowball.com. Snowball.com is used here.
By viewing the web page source code, we decide to get the specific information on snowball, including the current price and the rise and fall
URLs are divided into two categories, SZ or SH + stock code
https://xueqiu.com/S/SZ
https://xueqiu.com/S/SH
Technical route:
1.requests+BeautifulSoup+re
2. Scripy framework
The first technical route is temporarily adopted here
Page analysis:
1: Web page analysis on the website of obtaining stock code
Connecting plate net
https://www.banban.cn/gupiao/shanghai/ List of Listed Companies in Shanghai
https://www.banban.cn/gupiao/hebei/shijiazhuang.html List of Listed Companies in Shijiazhuang, Hebei
From the above url, you can see the switching of regions through
https://www.banban.cn/gupiao/ +Address (. html)
You can get A list of A shares in A region
Then take Shanghai as an example to view the source code, and its stock information is stored in

View robots Txt protocol. If it is found that it does not exist, it means that this website can be crawled at will:

2. Acquisition of specific stock information
Snowball net
The following is the url of individual stock information
https://xueqiu.com/S/SH688356
https://xueqiu.com/S/SZ300187
URLs are divided into two categories, SZ or SH + stock code
https://xueqiu.com/S/SZ
https://xueqiu.com/S/SH
After obtaining the stock code, you can try two kinds of links in turn. If the link returns 404, it means that this stock is another one.
View robots Txt protocol
Click any stock information and check the source code (not check the code). It is found that it is complex and inconvenient to analyze,
Therefore, I choose to directly use the program to crawl the standardized output data and combine it with the web page source code for analysis,
html = get_html(inf_url, head='') soup = BeautifulSoup(html,'html.parser') print(soup.prettify())#Standardized output to facilitate the analysis of web pages
Program structure:
Get web page source code_ html(url,head)
Get stock code Stock_Code(html,Stock_List)
Through the stock code, jump the information of individual stocks and climb the current price and Stock_inf(Stock_List,Stock_List_Inf,place)
Main function: select crawling range
Complete code (with notes)
import requests from bs4 import BeautifulSoup import bs4 import re import traceback #Output error message import time def get_html(url,head): html = requests.get(url,headers=head) #Pay attention to climb the head html.raise_for_status() # If the web page access status is wrong, an exception will be generated html.encoding = html.apparent_encoding #It can be optimized manually here return html.text #Get stock code #Returns a list of stock codes in dictionary format def Stock_Code(html,Stock_List): soup = BeautifulSoup(html,'html.parser') a = soup.find_all('a')#All 'a' tags # print(a) for i in a: try: # print(i) href = i.string#Gets its non attribute string # print(href) s = re.findall(r'.+?\(\d{6}\)',href) #Get a string in a format similar to Waigaoqiao (600648) directly #Return is a list!!! if len(s)!=0: # print(s) Stock_List[s[0].split('(')[0]] = s[0].split('(')[1].replace(')','') # print(s[0].split('(')[0],":",s[0].split('(')[1].replace(')','')) except: continue #Query all stocks at one time in the sub function #Obtain the current stock price and trading volume #Return specific information def Stock_inf(Stock_List,Stock_List_Inf,place): lst = len(Stock_List) #How many stocks did you climb in total print("Total crawling{}Total stock{}branch".format(place,lst)) count = 0 t = time.localtime() t = time.strftime("%Y_%m_%d_%H_%M", t) fpath = t+'Stock data.txt' with open(fpath, 'a', encoding='utf-8') as f: f.write('{}Stock data\n'.format(place)) f.write('Stock name\t Stock code\t Current price\t\t Ups and downs\n') #Traverse stock code for stock,code in Stock_List.items(): try: inf_url = "https://xueqiu.com/S/SZ" + code html = get_html(inf_url,head = '') except: inf_url = "https://xueqiu.com/S/SH" + code html = get_html(inf_url, head='') soup = BeautifulSoup(html,'html.parser') # print(soup.prettify())#Standardized output to facilitate the analysis of web pages # print("********************************************") # The div tag with class = stock current is in its descendant node for s in soup.body.descendants: # Note the use of the find method. Since class is a keyword and cannot be used directly, class is used_ #s.find(class_="stock-current") value = s.find(attrs={'class': "stock-current"}) up_down = s.find(attrs={'class': "stock-change"}) if up_down != None: up_down = str(up_down.string) if value != None: value = str(value.strong.string)#Obtained the price of the day if (up_down!=None) and (value!=None): Stock_List_Inf[stock] = [code,value,up_down] st = stock+'\t'+code+'\t'+value+'\t'+up_down+'\n' #Easy to write files with open(fpath, 'a',encoding='utf-8') as f: f.write(st) count = count+1 bar = '['+'*'*int(count*100/lst)+']'#The multi line progress bar won't work for the time being. It seems that there is a bug and sometimes it jumps print('\r Current progress{:.2f}%\t Crawling:{}\t{}'.format(count*100/lst,stock,bar),end='')#\r means to return the cursor position to the beginning of the line break return Stock_List_Inf def main(): place = {"Hefei":"anhui/hefei.html",'Shanghai':"shanghai/"} # place = {"Hefei": "anhui/hefei.html"} #You can choose to add it yourself # places = place ["Hefei"] for pl in list(place.keys()): Stock_List = {} # Dictionary for storing stock codes Stock_List_Inf = {} #Where the final stock information is stored print('Crawling{}Stock of'.format(pl)) Code_Url = "https://www.banban.cn/gupiao/"+ place[pl]# here only takes Shanghai as an example html = get_html(Code_Url,'')#This website does not need to add a header Stock_Code(html,Stock_List)#Get stock code and name Stock_inf(Stock_List,Stock_List_Inf,pl)#Obtain specific information of stock data print('\n{}Stock crawling completed'.format(pl)) print('Stock crawling completed') if __name__ == '__main__': main() #traceback.print_exc() #Print exception information
Operation results
The crawling range can be selected. Here, only A shares in Shanghai and Hefei are taken as examples.
Progress can be displayed as a percentage and a progress bar.
Optimization ideas
1: The structure of traceback is optimized and the exception handling module is added.
2: Add multiline progress bar mode
3: Design UI