Beginner crawler (11) -- stock data analysis; the crawler series is paused for now

  I finally finished the last article, two days later than I had promised. The series took about 23 days (not counting today), so apart from this article I roughly kept to the schedule I set at the beginning. The actual study time was less than half of that, yet the learning efficiency was still too low. Overall I followed the course's progress, and this article is the only practical application.
  Overall evaluation: OK. This is the first time I have kept a blog going continuously. The articles were first posted on my personal blog, and I will also post all of them on CSDN after revising them. Follow-up plans will be made after the final exams. The current idea is to write a C language review series and to arrange the next Python series separately. I apologize for my poor writing.
Personal blog address: www.cloudray.com club

Targeted crawler for stock data analysis

Note: this code is only for learning and reference

Requirements:

1: Get the A-share stock codes from one website
2: Use those stock codes to get specific information about each stock from another website

Clear objectives:

1: Find a stock code list. After checking several websites, banban.cn turned out to be the easiest place to get A-share stock codes. It lists stock codes by region:
https://www.banban.cn/gupiao/ + region path
2: There are two candidates for querying stock information: Eastmoney or Xueqiu (xueqiu.com, "Snowball"). Xueqiu is used here.
After viewing the page source, I decided to get the specific information from Xueqiu, including the current price and the price change.
The URLs fall into two categories, with the prefix SZ or SH followed by the stock code:
https://xueqiu.com/S/SZ + stock code
https://xueqiu.com/S/SH + stock code

Technical route:

1. requests + BeautifulSoup + re
2. Scrapy framework
The first route is used here for now.

Page analysis:

1: Analysis of the website that provides the stock codes

banban.cn
https://www.banban.cn/gupiao/shanghai/ List of listed companies in Shanghai
https://www.banban.cn/gupiao/hebei/shijiazhuang.html List of listed companies in Shijiazhuang, Hebei
As the URLs above show, switching regions only changes the path after the base URL:
https://www.banban.cn/gupiao/ + region path (some end in .html)
This gives the list of A shares in a region.
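The URL pattern above can be sketched with urljoin from the standard library. This is only an illustration: the names BASE_URL and region_list_url are made up here, and the two region paths are the ones shown above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.banban.cn/gupiao/"  # base path taken from the URLs above


def region_list_url(region_path):
    """Join the base URL and a region path such as 'shanghai/'."""
    return urljoin(BASE_URL, region_path)


print(region_list_url("shanghai/"))
print(region_list_url("hebei/shijiazhuang.html"))
```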
Taking Shanghai as an example, view the page source: the stock information is stored in a tags (the ones the complete code below collects with find_all('a')), which can be obtained by traversal.

    Check robots.txt. This site does not have one, which means it declares no crawling restrictions.
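A robots.txt check can also be done in code. The following is a minimal sketch using urllib.robotparser from the standard library; the helper name can_crawl and the sample rules are hypothetical and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules that were already downloaded
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, only to show the behaviour
sample_rules = ["User-agent: *", "Disallow: /private/"]
print(can_crawl(sample_rules, "https://example.com/private/data.html"))
print(can_crawl(sample_rules, "https://example.com/gupiao/shanghai/"))
```

For a live site, RobotFileParser also has set_url() and read(), which download the file before can_fetch() is called.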
2: Getting the specific stock information

    Xueqiu (Snowball)
    The following are URLs of individual stock pages:
    https://xueqiu.com/S/SH688356
    https://xueqiu.com/S/SZ300187
    As noted above, the URLs fall into two categories, with the prefix SZ or SH followed by the stock code:
    https://xueqiu.com/S/SZ + stock code
    https://xueqiu.com/S/SH + stock code
    After obtaining a stock code, try the two kinds of link in turn. If one link returns 404, the stock is listed under the other prefix.
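The "try both links in turn" idea can be sketched as follows. This is an illustration, not the code used later in this article: fetch_stock_page and fake_fetch are hypothetical names, and the download function is passed in so the sketch runs without network access.

```python
def fetch_stock_page(code, fetch):
    """Try the SZ link first, then SH; return (url, html) of the first that works.

    fetch is any callable that returns the page text for a URL and raises an
    exception on failure (for example on a 404 response).
    """
    last_error = None
    for prefix in ("SZ", "SH"):
        url = "https://xueqiu.com/S/" + prefix + code
        try:
            return url, fetch(url)
        except Exception as err:  # broad on purpose for this sketch; narrow it in real code
            last_error = err
    raise LookupError("no page found for stock code " + code) from last_error


# Stand-in for a real download function (assumption: it raises on 404)
def fake_fetch(url):
    if url.endswith("SH688356"):
        return "<html>page for 688356</html>"
    raise IOError("404 for " + url)


url, html = fetch_stock_page("688356", fake_fetch)
print(url)
```

In real use, fetch could be a function like get_html from the complete code, since raise_for_status() raises an exception on a 404 response.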
    Check the robots.txt file of this site as well.

    Open any stock page and view the page source (not the browser's Inspect view). The source turns out to be complex and inconvenient to analyze,

    so I chose to have the program crawl the page and print it in a normalized, indented form, then analyze that output alongside the page source:

    html = get_html(inf_url, head='')
    soup = BeautifulSoup(html,'html.parser')
    print(soup.prettify())  # Pretty-printed output makes the page structure easier to analyze
    

    Program structure:

    Get the page source: get_html(url, head)
    Get the stock codes: Stock_Code(html, Stock_List)
    Using each stock code, visit the individual stock page and crawl the current price and price change: Stock_inf(Stock_List, Stock_List_Inf, place)
    Main function: select the crawling range

    Complete code (with comments)

    import requests
    from bs4 import BeautifulSoup
    import re
    import traceback  # For printing error information
    import time
    
    def get_html(url, head):
        html = requests.get(url, headers=head)  # Pass crawler headers here if the site needs them
        html.raise_for_status()  # Raise an exception if the HTTP status indicates an error
        html.encoding = html.apparent_encoding  # The encoding could also be set manually here
        return html.text
    
    # Get the stock codes
    # Fills Stock_List, a dictionary mapping stock name to stock code
    def Stock_Code(html, Stock_List):
        soup = BeautifulSoup(html, 'html.parser')
        a = soup.find_all('a')  # All 'a' tags
        for i in a:
            text = i.string  # The tag's text; None if the tag has no single string child
            if text is None:
                continue
            # Match link text of the form Waigaoqiao(600648): a name, then a six-digit code in parentheses
            m = re.search(r'(.+?)\((\d{6})\)', text)
            if m is not None:
                Stock_List[m.group(1)] = m.group(2)
    
    # Query every stock in the list in one pass
    # Get the current price and the price change of each stock
    # Returns the collected information
    def Stock_inf(Stock_List, Stock_List_Inf, place):
        lst = len(Stock_List)  # Total number of stocks to crawl
        print("Crawling {} stocks: {} in total".format(place, lst))
        count = 0
        t = time.strftime("%Y_%m_%d_%H_%M", time.localtime())
        fpath = t + 'Stock data.txt'
        with open(fpath, 'a', encoding='utf-8') as f:
            f.write('{} stock data\n'.format(place))
            f.write('Stock name\tStock code\tCurrent price\t\tPrice change\n')
        # Traverse the stock codes
        for stock, code in Stock_List.items():
            try:
                inf_url = "https://xueqiu.com/S/SZ" + code
                html = get_html(inf_url, head='')
            except requests.RequestException:
                # The SZ link failed (for example with 404), so try the SH link
                inf_url = "https://xueqiu.com/S/SH" + code
                html = get_html(inf_url, head='')
            soup = BeautifulSoup(html, 'html.parser')
            # print(soup.prettify())  # Pretty-printed output for analyzing the page
            # class is a Python keyword, so pass it through attrs (or use class_="stock-current")
            value = soup.find(attrs={'class': "stock-current"})
            up_down = soup.find(attrs={'class': "stock-change"})
            if value is not None and up_down is not None:
                value = str(value.strong.string)  # Current price
                up_down = str(up_down.string)  # Price change
                Stock_List_Inf[stock] = [code, value, up_down]
                st = stock + '\t' + code + '\t' + value + '\t' + up_down + '\n'  # One line per stock in the file
                with open(fpath, 'a', encoding='utf-8') as f:
                    f.write(st)
                count = count + 1
                bar = '[' + '*' * int(count * 100 / lst) + ']'  # A multi-line progress bar does not work yet; there seems to be a bug and it sometimes jumps
                print('\rCurrent progress {:.2f}%\tCrawling: {}\t{}'.format(count * 100 / lst, stock, bar), end='')  # \r returns the cursor to the start of the line
        return Stock_List_Inf
    
    def main():
        place = {"Hefei": "anhui/hefei.html", 'Shanghai': "shanghai/"}
        # place = {"Hefei": "anhui/hefei.html"}  # Add or remove regions as needed
        for pl in place:
            Stock_List = {}  # Dictionary of stock names and codes
            Stock_List_Inf = {}  # Final stock information
            print('Crawling stocks of {}'.format(pl))
            Code_Url = "https://www.banban.cn/gupiao/" + place[pl]  # List page of the selected region
            html = get_html(Code_Url, '')  # This website does not need extra headers
            Stock_Code(html, Stock_List)  # Get stock names and codes
            Stock_inf(Stock_List, Stock_List_Inf, pl)  # Get the price information for each stock
            print('\n{} stock crawling completed'.format(pl))
        print('Stock crawling completed')
    
    
    if __name__ == '__main__':
        main()
    # traceback.print_exc()  # Print exception information
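
The name and code extraction inside Stock_Code can also be isolated as a small helper that is easy to check on its own. This is a sketch under assumptions: the names NAME_CODE and parse_name_code are made up here, and the pattern additionally tolerates whitespace before the parenthesis.

```python
import re

# A name, optional whitespace, then a six-digit code in parentheses
NAME_CODE = re.compile(r"(.+?)\s*\((\d{6})\)")


def parse_name_code(link_text):
    """Return (name, code) if link_text looks like 'name(600648)', else None."""
    if not link_text:
        return None
    match = NAME_CODE.search(link_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_name_code("Waigaoqiao (600648)"))
print(parse_name_code("Home"))
```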
    

    Run results

    The crawling range can be selected. Here, only A shares in Shanghai and Hefei are taken as examples.
    Progress is displayed as a percentage and a progress bar.
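
As a variation on the progress display, a fixed-width bar keeps the line the same length on every update. This is only a sketch: render_progress and its width parameter are hypothetical and are not part of the code above.

```python
def render_progress(done, total, width=50):
    """Return a fixed-width text bar followed by the percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = min(max(done / total, 0.0), 1.0)  # clamp to the range 0..1
    filled = int(fraction * width)
    bar = "[" + "*" * filled + " " * (width - filled) + "]"
    return bar + " {:6.2f}%".format(fraction * 100)


for i in range(1, 5):
    # \r returns the cursor to the start of the line, so the bar is redrawn in place
    print("\r" + render_progress(i, 4, width=20), end="")
print()
```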


    Optimization ideas

    1: Improve exception handling, for example by using the traceback module to report errors
    2: Add a multi-line progress bar mode
    3: Design a UI
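
For the first idea, one possible shape is a loop that logs the full traceback for a failed stock and continues with the next one. This is a sketch, not a finished design: crawl_all and fake_crawl are hypothetical names, and the per-stock crawl is passed in as a function so the example runs offline.

```python
import traceback


def crawl_all(codes, crawl_one):
    """Call crawl_one(code) for every code; log failures and keep going.

    Returns (results, failed), where results maps code -> return value and
    failed lists the codes whose crawl raised an exception.
    """
    results, failed = {}, []
    for code in codes:
        try:
            results[code] = crawl_one(code)
        except Exception:
            traceback.print_exc()  # full stack trace goes to stderr
            failed.append(code)
    return results, failed


# Stand-in for the real per-stock crawl (assumption: it raises on failure)
def fake_crawl(code):
    if code == "000000":
        raise ValueError("no such stock: " + code)
    return {"code": code}


results, failed = crawl_all(["600648", "000000"], fake_crawl)
print(results, failed)
```

Collecting the failed codes in a list makes it possible to retry only those stocks afterwards.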

    Tags: Python crawler

    Posted by svan_rv on Wed, 18 May 2022 17:11:54 +0300