First assignment

Assignment ① UniversitiesRanking:

Requirements: use the requests and BeautifulSoup libraries to crawl the given web address (http://www.shanghairanking.cn/rankings/bcur/2020) and print the crawled university ranking information to the screen.

Code

import urllib.request
from bs4 import BeautifulSoup

def getHtmlText(url):
    # Fetch the page and decode the response bytes into a string
    html = urllib.request.urlopen(url)
    html = html.read()
    html = html.decode()
    return html

url="http://www.shanghairanking.cn/rankings/bcur/2020"
html=getHtmlText(url) #Get html document
soup = BeautifulSoup(html, "html.parser") #Traversing html documents

print("ranking\t School name\t Provinces and cities\t School type\t Total score")

for tr in soup.find('tbody').find_all('tr'): #Iterate over the table rows only (children would also yield stray text nodes)
    r = tr.find_all("td") #Find all td cells in this row
    #Print the text of each cell: rank, name, province/city, type, total score
    print(r[0].text.strip()+"\t"+r[1].text.strip()+"\t"+r[2].text.strip()+"\t"+r[3].text.strip()+"\t"+r[4].text.strip())
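
The assignment text asks for the requests library, while the code above uses urllib.request. For reference, a minimal sketch of the same crawl written with requests (untested against the live page; it assumes the table structure is unchanged and that the first five columns are the ones we want):

import requests
from bs4 import BeautifulSoup

url = "http://www.shanghairanking.cn/rankings/bcur/2020"
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # raise an error on a non-2xx response
resp.encoding = resp.apparent_encoding  # let requests guess the text encoding
soup = BeautifulSoup(resp.text, "html.parser")

for tr in soup.find("tbody").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    print("\t".join(cells[:5]))  # rank, name, province/city, type, total score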

Practical results

Experience

Through this experiment, we gained a better understanding of the HTML document tree, learned the general crawling workflow and how to locate elements, and deepened our grasp of the theory.
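
For example, the common element-lookup styles behave as follows on a toy document (a minimal illustration, not taken from the assignment page):

from bs4 import BeautifulSoup

doc = "<ul><li class='hit'>a</li><li>b</li></ul>"
soup = BeautifulSoup(doc, "html.parser")

print(soup.find("li").text)                     # first match: a
print([li.text for li in soup.find_all("li")])  # every match: ['a', 'b']
print(soup.select_one("li.hit").text)           # CSS selector match: a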

Assignment ② GoodPrice:

Requirements: use the requests and re libraries to design a directional price-comparison crawler for an online mall of your choice; crawl the mall's search results page for the keyword "schoolbag" and extract the product names and prices.

Code

import urllib.request
from bs4 import BeautifulSoup
import re
import w3lib.html
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362"}
url="https://search.jd.com/Search?keyword=%E4%B9%A6%E5%8C%85&enc=utf-8&wq=%E4%B9%A6%E5%8C%85&pvid=9bd25fcc2bda47b1bc33a822f0ec8e11"

def getHtmlText(url, headers):
    # Attach a browser User-Agent so the request is less likely to be blocked
    req = urllib.request.Request(url, headers=headers)
    html = urllib.request.urlopen(req)
    html = html.read()
    html = html.decode()
    return html

html=getHtmlText(url,headers)
#Remove all span tags and the text content between them from the html document
html=w3lib.html.remove_tags_with_content(html,which_ones=('span',))
soup = BeautifulSoup(html, "html.parser")
i=0
lis = soup.find_all("li", {"data-sku": re.compile(r"\d+")}) #Every search result is an <li> with a numeric data-sku
print("No.\tProduct name\t\t\t\t\t\t\t\t\t\tPrice")
#I originally wanted to locate the fields directly instead of level by level,
#but the direct lookup kept failing (the key value was always 0), so for now I search level by level.
for li in lis:
    price1 = li.find("div", attrs={"class": "p-price"}).find("strong").find("i")
    price = price1.text
    name1 = li.find("div", attrs={"class": "p-name"}).find("a").find("em")
    name = name1.text.strip() #Product name, with leading and trailing whitespace removed
    i = i + 1
    print(i, '\t', name, '\t', price)
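
The requirement also names the re library. As an alternative to the BeautifulSoup lookups above, here is a hedged sketch that pulls the same fields with regular expressions, reusing the span-stripped html string from above (the patterns are guesses at JD's markup and may need adjusting):

import re

# Hypothetical patterns for the price <i> and name <em> elements; adjust to the real markup.
price_pat = re.compile(r'<div class="p-price">.*?<i[^>]*>([\d.]+)</i>', re.S)
name_pat = re.compile(r'<div class="p-name[^"]*">.*?<em>(.*?)</em>', re.S)

prices = price_pat.findall(html)
names = [re.sub(r"<[^>]+>", "", n).strip() for n in name_pat.findall(html)]

# This pairing assumes both patterns matched the items in the same order.
for no, (name, price) in enumerate(zip(names, prices), start=1):
    print(no, '\t', name, '\t', price)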

Practical results

Experience

After this experiment, we have a preliminary understanding of anti-crawler mechanisms. We also understand better how to search for and extract information in HTML, whether by direct fuzzy keyword matching or by level-by-level lookup, and we know how to strip out element information we don't need.
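
The tag-stripping step in the code relies on w3lib. A minimal sketch of what it does, on a made-up snippet:

import w3lib.html

# The price element contains a <span> we don't want to keep.
fragment = '<i>99.00<span>promo</span></i>'

# remove_tags_with_content drops the listed tags together with everything inside them.
print(w3lib.html.remove_tags_with_content(fragment, which_ones=('span',)))
# expected output: <i>99.00</i>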

Assignment ③ JPGFileDownload:

Requirement: crawl the given web page (http://xcb.fzu.edu.cn/html/2019ztjy), or a web page of your choice, and download all files in JPG format from it.

Code

'''
1. Fetch the web page HTML.
2. Traverse the document elements and use a regular expression to match all image sources ending in jpg/png.
3. Extract the image path information and store it in a list.
4. Traverse the list, fetch each image by its path, and write it to a file.
'''
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import re

def getHtmlText(url):
    html=urllib.request.urlopen(url)
    html=html.read()
    html=html.decode()
    return html

url="http://xcb.fzu.edu.cn/"

headers={"User-Agent":" Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362"}
html=getHtmlText(url)
soup = BeautifulSoup(html, "html.parser")
imglist = soup.find_all("img",{"src": re.compile('(jpg|png)$')})
final_img_url=[]
for imgurl in imglist:
    src = imgurl.attrs['src']
    #urljoin resolves a relative path against the base url and leaves an absolute URL unchanged
    final_src = urllib.parse.urljoin(url, src)
    #Skip paths that have already been added to the list
    if final_src not in final_img_url:
        final_img_url.append(final_src)
        print(final_src)
#print(final_img_url)

i = 0
for img_url in final_img_url:
    response = urllib.request.urlopen(img_url)
    img = response.read()
    path = 'D:/11/' + str(i) + '.jpg' #The target directory must already exist
    with open(path, 'wb') as f: #'with' closes the file even if writing fails
        f.write(img)
    i = i + 1

Practical results

Experience

Through these three crawler experiments, we have roughly understood the crawling process. The overall workflow is much the same each time; the differences lie in selecting the specific information and matching what is needed. In addition, this practice taught us to pay attention to absolute versus relative paths.
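
For instance, urllib.parse.urljoin handles both kinds of path uniformly (a quick illustration with made-up image paths):

from urllib.parse import urljoin

base = "http://xcb.fzu.edu.cn/"

# Absolute URLs pass through unchanged; relative ones are resolved against the base.
print(urljoin(base, "http://other.site/a.jpg"))  # http://other.site/a.jpg
print(urljoin(base, "images/logo.png"))          # http://xcb.fzu.edu.cn/images/logo.png
print(urljoin(base, "/images/logo.png"))         # http://xcb.fzu.edu.cn/images/logo.png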
