
# 20211302 Comprehensive Practice of Chen Linfu: Experiment 4

## 1. Experimental contents

Comprehensive Python applications: crawlers, data processing, visualization, machine learning, neural networks, games, network security, etc.

## 2. Experimental process and results

(1) Experimental design

1. Function

Fetch the news pages of the target website, extract the time, title, and link of each news item, and print the information to the screen.

 

2. Robots protocol of the target website

Viewing robots.txt under the root directory of the website shows no crawling restrictions.
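
This can also be confirmed programmatically with the standard library's robotparser (a minimal sketch; it assumes robots.txt sits at the site root):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.phei.com.cn/robots.txt")
rp.read()  # fetch and parse robots.txt
# True if a generic crawler may fetch the news index page
print(rp.can_fetch("*", "https://www.phei.com.cn/xwxx/index.shtml"))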

 

3. Target website analysis

Website: https://www.phei.com.cn/xwxx/index.shtml

Viewing the web page source code shows that the relevant information is stored directly in the static HTML page, so the target information can be extracted from the HTML itself.

4. Structural design

Technical route: requests + Beautiful Soup

  1. requests submits the URL requests and obtains the HTML pages in a loop
  2. Beautiful Soup parses each HTML page and extracts the target information
  3. The extracted information is printed to the screen

(2) Implementation process

1. Get the HTML page with requests

Use the general requests framework to submit the URL request and obtain the page content.

Get the web page source code:

def getHTMLText(url):
    try:
        headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}
        r = requests.get(url, timeout=30, headers=headers, allow_redirects=False)
        print(r.status_code)
        r.raise_for_status()  # if the status is not 200, raise HTTPError
        r.encoding = r.apparent_encoding
        #print(r.text[1000:2000])
        return r.text
    except:
        print("Web page access exception!")

2. Construct the URL addresses of all pages

https://www.phei.com.cn/xwxx/index.shtml # home page address

https://www.phei.com.cn/xwxx/index_54.shtml # page 1 address

https://www.phei.com.cn/xwxx/index_53.shtml # page 2 address

......

https://www.phei.com.cn/xwxx/index_1.shtml # last page address

 

Define the get_urls() function to construct all the paging URL addresses:

def get_urls(pages):
    urls = ['https://www.phei.com.cn/xwxx/index.shtml']
    for i in range(1, pages):
        page = 55 - i  # the paging numbers count down from index_54
        url = "https://www.phei.com.cn/xwxx/index_{}.shtml".format(page)
        urls.append(url)
    #print(urls)
    return urls

3. Beautiful Soup parses the HTML pages and extracts information

Use the .find_all() method of bs4 to locate the target information:

def parsePage(html):
    soup = BeautifulSoup(html, 'html.parser')
    #for tag in soup.find_all(True):
        #print(tag.name)
    p = soup.find_all('li', 'li_b60')
    #print(p)
    print(len(p))
    for i in range(len(p)):
        print(i)
        text = p[i]
        #print(text.prettify())

Then use find_all() again to precisely extract each target field from the located items:

#bs4 extract information

def parsePage(html):
    soup = BeautifulSoup(html, 'html.parser')
    #for tag in soup.find_all(True):
        #print(tag.name)
    p = soup.find_all('li', 'li_b60')
    #print(p)
    print(len(p))
    for i in range(len(p)):
        print(i)
        text = p[i]
        #print(text.prettify())  #pretty-print the tag tree

        #Get the news time
        time = text.find_all('span')
        print(time)

        #Get the news headline
        title = text.find_all('p', 'li_news_title')
        print(title)

        #Get the news link
        for link in text.find_all('a'):
            link_part = link.get('href')
            print(link_part)

4. Main program: loop over the pages and process each one

pages = 2  # number of pages to crawl
urls = get_urls(pages)  # call get_urls() to construct the url addresses
for url in urls:  # loop: fetch each page and process it
    print(url)
    html = getHTMLText(url)  # get the web page source code
    parsePage(html)  # parse and extract the page information

Complete code:

import requests
from bs4 import BeautifulSoup

#Get the web page source code
def getHTMLText(url):
    try:
        headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}
        r = requests.get(url, timeout=30, headers=headers, allow_redirects=False)
        print(r.status_code)
        r.raise_for_status()  # if the status is not 200, raise HTTPError
        r.encoding = r.apparent_encoding
        #print(r.text[1000:2000])
        return r.text
    except:
        print("Web page access exception!")

#bs4: extract information
def parsePage(html):
    soup = BeautifulSoup(html, 'html.parser')
    #for tag in soup.find_all(True):
        #print(tag.name)
    p = soup.find_all('li', 'li_b60')
    #print(p)
    print(len(p))
    for i in range(len(p)):
        print(i)
        text = p[i]
        #print(text.prettify())  #pretty-print the tag tree

        #Get the news time
        time = text.find_all('span')
        time_bs = BeautifulSoup(str(time), 'html.parser')
        time_text = time_bs.get_text()
        print(time_text)

        #Get the news headline
        title = text.find_all('p', 'li_news_title')
        title_bs = BeautifulSoup(str(title), 'html.parser')
        title_text = title_bs.get_text()
        print(title_text)

        #Get the news link
        for link in text.find_all('a'):
            link_part = link.get('href')
            html_url = 'https://www.phei.com.cn' + str(link_part)
            print(html_url)

#Construct all paging url addresses
def get_urls(pages):
    urls = ['https://www.phei.com.cn/xwxx/index.shtml']
    for i in range(1, pages):
        page = 55 - i
        url = "https://www.phei.com.cn/xwxx/index_{}.shtml".format(page)
        urls.append(url)
    #print(urls)
    return urls

pages = 3
urls = get_urls(pages)
for url in urls:
    print(url)
    html = getHTMLText(url)  #Get the web page source code
    parsePage(html)  #Parse and extract the page information

## 3. Course summary and thoughts

(1) Course summary

Lesson 1 Getting to know Python

Python is a simple, easy-to-learn scripting language that combines interpreted, interactive, and object-oriented features. Python provides high-level data structures, and its syntax, dynamic typing, and interpreted nature make it a preferred programming language for many developers.

  • Python is an interpreted language: there is no separate compilation step in the development process, similar to PHP and Perl.

  • Python is an interactive language: you can execute code directly at the Python prompt (>>>).

  • Python is an object-oriented language: Python supports the object-oriented style, in which code is encapsulated in objects.

In short, Python is an object-oriented (OOP) scripting language.

Object orientation is a method of building a model around the concept of objects (entities) in order to simulate the objective world and to analyze, design, and implement software. The object-oriented approach combines data and the methods that operate on it into a whole, and then analyzes and models the system on that basis. The core of the Python programming mindset is understanding functional logic.

Lesson 2 Python language foundations

Operators:

(The operator table in the original notes was an image and is not preserved.)
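
A few representative operators, with results shown in comments:

a, b = 7, 3
print(a + b, a - b, a * b)   # arithmetic: 10 4 21
print(a / b, a // b, a % b)  # division, floor division, modulus: ~2.33 2 1
print(a ** b)                # exponentiation: 343
print(a > b, a == b)         # comparison: True False
print(a > 0 and b < 0)       # logical operators: False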

Indentation:

Indentation applies to logical lines, so first distinguish between physical lines and logical lines in the code.

  • Physical line: a line of code as displayed in the editor; each displayed line is one physical line.
  • Logical line: a line of code as the Python interpreter interprets it; each statement is one logical line.
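
For example, one logical line may span several physical lines (using a backslash or brackets), and several logical lines may share one physical line (separated by semicolons):

# one logical line spread over two physical lines
total = 1 + 2 + \
        3 + 4

# two logical lines on one physical line
x = 1; y = 2
print(total, x, y)  # 10 1 2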

Lesson 3 Process control statements

  • Sequential structure: statements simply execute from top to bottom.

  • Conditions and branches

Single branch:

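The original screenshots for these examples are not preserved; minimal sketches follow. A single branch runs a block only when the condition holds:

age = 20
if age >= 18:
    print("adult")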

Double branch:

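A double branch chooses between two blocks:

age = 16
if age >= 18:
    print("adult")
else:
    print("minor")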

Multi branch:
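
A multi branch tests several conditions in order:

score = 85
if score >= 90:
    print("excellent")
elif score >= 60:
    print("pass")
else:
    print("fail")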

  • Loops

while loop:

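The while loop repeats as long as its condition holds:

i = 0
while i < 3:
    print(i)  # prints 0, 1, 2
    i += 1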

for loop:

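The for loop iterates over a sequence:

for ch in "abc":
    print(ch)  # prints a, b, c on separate lines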

Lesson 4 Applications of sequences

Lists, dictionaries, and sets.
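
Minimal sketches of each type:

fruits = ['apple', 'banana', 'cherry']  # list: ordered and mutable
prices = {'apple': 3.5, 'banana': 2.0}  # dictionary: key-value mapping
unique = {1, 2, 3, 3}                   # set: unordered, duplicates removed

fruits.append('pear')   # append an element to the list
prices['cherry'] = 8.8  # add a key-value pair to the dictionary
print(len(unique))      # 3, because the duplicate 3 was dropped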

Lesson 5 Strings and regular expressions

The string is a built-in data type in Python. Quotation marks are used to delimit strings: either double quotation marks (") or single quotation marks (').

Example of string output and slicing (in the original Chinese string, the two slices selected the words for "rock sugar" and "sugar gourd"):

tang_hu_lu = "It's said that ice sugar gourd is sour"  # declare a string
print(tang_hu_lu)       # print the whole string
print(tang_hu_lu[2:4])  # print the slice of characters at indices 2-3
print(tang_hu_lu[3:5])  # print the slice of characters at indices 3-4

 

Lesson 6 Functions

In programming, using functions improves the reuse rate and maintainability of code.

Improving reuse: some pieces of code perform the same operation on different data. Such an operation can be written once as a function module; to use it, you only need to call that function.
Improving maintainability: once functions enable code reuse, checking or modifying an operation only requires checking or modifying the corresponding function, and the change takes effect for every module that calls it at the same time. This greatly improves the maintainability of the code.

The general format is as follows:

def function_name(parameter_list):
    function_body

  • Formal parameters ("formal parameters", also called dummy variables) are not actual variables. They are the parameters used when defining the function name and body, and they receive the values passed in when the function is called.

  • Actual parameters ("arguments") are the parameters passed to the function at the call site. Arguments can be constants, variables, expressions, function calls, and so on. Whatever their type, they must have definite values at the time of the call so that those values can be passed to the formal parameters.
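
A small sketch: below, a and b are formal parameters, while x and 4 (and the expression x + 1) are the actual parameters:

def add(a, b):        # a and b are formal parameters
    return a + b

x = 3
print(add(x, 4))      # x and 4 are actual parameters -> 7
print(add(x + 1, 5))  # an expression can also be an argument -> 9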

 

Lesson 7 Object-oriented programming

Object orientation means organizing data structures together with the methods that process them into objects, grouping objects with the same behavior into classes, hiding the internal details of a class through encapsulation, generalizing classes through inheritance, and responding dynamically by object type through polymorphism.

Class: represents a group (or class) of objects; each object belongs to a specific class and is called an instance of that class. In object-oriented programming, you write classes that represent things and situations in the real world and create objects based on those classes. When writing a class, you define the common behavior of a whole category of objects; when you create objects from the class, each object automatically has this common behavior, and you can then give each object its own individual traits as needed. Creating an object from a class is called instantiation, which lets you work with instances of the class.

In object-oriented programming, the term object roughly means a collection of data (attributes) together with a set of methods for accessing and operating on that data; objects consist of attributes and methods. Attributes are simply variables belonging to the object, while methods are functions stored as attributes.

Class members mainly include: fields, methods, and properties

The three elements of object orientation:

  • Encapsulation: generally understood as binding data together with the methods that operate on it, so that the data can only be accessed through a defined interface.
  • Inheritance: the process of creating a new class from an existing class, reusing its information. The class that provides the inherited information is called the parent class (superclass, base class); the class that receives it is called the subclass (derived class).
  • Polymorphism: objects of different subtypes are allowed to respond differently to the same message. Put simply, the same method call can do different things depending on the actual object behind the reference.
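
A minimal sketch showing all three elements at once:

class Animal:
    def __init__(self, name):
        self._name = name  # encapsulation: private by convention

    def speak(self):
        return self._name + " makes a sound"

class Dog(Animal):    # inheritance: Dog derives from Animal
    def speak(self):  # polymorphism: same message, different behavior
        return self._name + " says woof"

class Cat(Animal):
    def speak(self):
        return self._name + " says meow"

for pet in (Dog("Rex"), Cat("Tom"), Animal("Generic")):
    print(pet.speak())  # each object responds in its own way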

 

Lesson 8 File operations and exception handling

Reading files in Python:

When Python reads or writes a file, the first step is to open it with the open function; the file can then be read all at once or line by line.

To read the entire contents of a file, open it with open and then call read. This reads the whole file into one string in the program at once, which is very convenient.

# File path: create a test.txt file in the current directory in advance
file = "test.txt"
# Open the file
f = open(file, encoding="utf-8")
# Read the entire contents of the file
read_str = f.read()
# Close the file
f.close()
print(read_str)

Writing files: this generally means writing to the local disk.

# File path: mode "w" creates test.txt in the current directory if it does not exist
file = "test.txt"
# Open the file for writing
with open(file, mode="w", encoding="utf-8") as f:
    # Write the file contents
    f.write("I am the content to be written")

File copy: use the copy function of the shutil module to copy files.

import shutil

shutil.copy("test.txt", "aaa.txt")
shutil.copy("test.txt", "../aaa.txt")  # copy into a different directory

Directory copy: copytree has the same call format as copy, except that it copies directories; any subdirectories and files under the directory are copied along with it.

import shutil
# The first parameter is the old directory, and the second parameter is the new directory
shutil.copytree("../1","a4") 

Lesson 9 Python working with databases

At present, there are two mainstream kinds of databases:

  • Relational databases, such as MySQL
  • Non-relational databases, such as MongoDB
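
The course's database examples are not reproduced here; as a stand-in, Python's standard library includes sqlite3, a small relational database that illustrates the usual connect/execute/commit pattern:

import sqlite3

conn = sqlite3.connect("demo.db")  # create or open a database file
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))  # parameterized query
conn.commit()  # persist the changes
for row in cur.execute("SELECT id, name FROM users"):
    print(row)
conn.close()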

Lesson 10 Python web programming and crawlers

The urllib module is part of the Python standard library; its value lies in fetching URL resources from the network.

The urllib module in Python 3 includes the following submodules:

  • urllib.request: the request module, used to open and read URLs;
  • urllib.error: the exception handling module, which catches exceptions raised by urllib.request;
  • urllib.parse: URL parsing, used to process URL addresses in crawler programs;
  • urllib.robotparser: parses robots.txt files to determine which content of a target site may be crawled and which may not.
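
A minimal fetch with urllib (example.com is just a placeholder URL):

from urllib import request, error

try:
    with request.urlopen("https://www.example.com", timeout=10) as resp:
        html = resp.read().decode("utf-8")
        print(resp.status, len(html))
except error.URLError as e:  # raised when the request fails
    print("Request failed:", e.reason)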

Beautiful Soup is a Python parsing library, mainly used to convert HTML tags into a Python object tree so that data can be extracted from that tree.

The lxml library is a Python data parsing library.

The requests library accepts the following request parameters:

    • url: the request address;
    • params: the query string to send; can be a dictionary, list, tuple, or bytes;
    • data: the parameters to send in the request body; can be a dictionary, list, tuple, bytes, or file object;
    • json: a JSON-serializable object to send;
    • headers: the request headers, in dictionary format;
    • cookies: cookies to pass, as a dictionary or CookieJar;
    • files: the most complex parameter, which generally appears in POST requests, e.g. {'name': file_object}; several files can be sent in one request, though this is rarely needed in crawler scenarios;
    • auth: the authentication mechanism to use;
    • timeout: how long to wait for the server's response; it can also be a (connect timeout, read timeout) tuple;
    • allow_redirects: whether redirects are allowed;
    • proxies: proxy servers;
    • verify: SSL certificate verification;
    • stream: streaming request, mainly for streaming APIs;
    • cert: the client certificate.
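
A small sketch combining several of these parameters (httpbin.org is a common public test endpoint, used here only for illustration):

import requests

resp = requests.get(
    "https://httpbin.org/get",
    params={"q": "python"},                  # appended to the URL as ?q=python
    headers={"User-Agent": "demo-crawler"},  # request headers
    timeout=(3.05, 27),                      # (connect timeout, read timeout)
    allow_redirects=True,
)
print(resp.status_code)
print(resp.json()["args"])                   # {'q': 'python'}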

(2) Reflections

Python is a high-level language with great potential; after years of development it plays an increasingly important role in programming.

Since I also studied C this semester, I compared Python with C from time to time. Compared with C, Python offers programmers great convenience and has far more libraries; thanks to these powerful libraries, programming is no longer so difficult. Python's strengths in crawling and similar tasks reflect how capable it is. Although Python is more convenient than C in many respects, it is weaker in others, such as the execution speed of for loops. Although, at the end of one semester, my study of Python has covered only the basics, its power still holds a strong attraction for me.

A semester of studying Python has come to an end. Through the course I have gained a basic understanding of Python. Because practice time was limited, some content is not yet firmly mastered, but learning Python does not stop at this stage; there is much more waiting for me to learn, and I firmly believe that my love of programming will keep me going.

## 4. Reference

https://edu.csdn.net/skill/python/python-3-147?category=8
