Assignment ① Universities Ranking:
Requirements: use the requests and BeautifulSoup library methods to crawl the given web address (http://www.shanghairanking.cn/rankings/bcur/2020) and print the crawled university ranking information to the screen.
Code
import urllib.request
from bs4 import BeautifulSoup
from bs4.element import Tag

def getHtmlText(url):
    html = urllib.request.urlopen(url)
    html = html.read()
    html = html.decode()
    return html

url = "http://www.shanghairanking.cn/rankings/bcur/2020"
html = getHtmlText(url)                    # Get the HTML document
soup = BeautifulSoup(html, "html.parser")  # Parse the HTML document
print("Ranking\tSchool name\tProvince/City\tSchool type\tTotal score")
for tr in soup.find('tbody').children:     # Traverse the table rows
    if not isinstance(tr, Tag):            # Skip the whitespace text nodes between <tr> tags
        continue
    r = tr.find_all("td")                  # Find all td cells in the row
    # Print the text in each cell, stripped of surrounding whitespace
    print(r[0].text.strip(), r[1].text.strip(), r[2].text.strip(), r[3].text.strip(), r[4].text.strip(), sep="\t")
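Since the requirement names the requests library while the code above uses urllib, here is a minimal sketch of the same fetch done with requests instead (assuming the ranking page keeps serving server-rendered HTML); the parsing part stays exactly as above:

import requests
from bs4 import BeautifulSoup

def get_html_text(url):
    # requests performs the request and exposes the decoded body in one object
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()                 # Raise on 4xx/5xx instead of parsing an error page
    resp.encoding = resp.apparent_encoding  # Guess the encoding from the body
    return resp.text

soup = BeautifulSoup(get_html_text("http://www.shanghairanking.cn/rankings/bcur/2020"), "html.parser")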
Practical results
Experience
Through this experiment, we gained a better understanding of the HTML document tree, learned the general crawling workflow and how to locate elements, and deepened our grasp of the theory.
Assignment ② GoodPrice:
Requirements: use the requests and re library methods to design a price-comparison directional crawler for a shopping site of your choice; crawl the site's search results page for the keyword "schoolbag" and extract the product names and prices.
Code
import urllib.request
from bs4 import BeautifulSoup
import re
import w3lib.html

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362"}
url = "https://search.jd.com/Search?keyword=%E4%B9%A6%E5%8C%85&enc=utf-8&wq=%E4%B9%A6%E5%8C%85&pvid=9bd25fcc2bda47b1bc33a822f0ec8e11"

def getHtmlText(url, headers):
    req = urllib.request.Request(url, headers=headers)  # Send a browser User-Agent to get past the basic anti-crawler check
    html = urllib.request.urlopen(req)
    html = html.read()
    html = html.decode()
    return html

html = getHtmlText(url, headers)
# Remove all span tags, together with the text between them, from the HTML document
html = w3lib.html.remove_tags_with_content(html, which_ones=('span',))
soup = BeautifulSoup(html, "html.parser")
lis = soup.find_all("li", {"data-sku": re.compile(r"\d+")})  # Each product is a li with a numeric data-sku
print("No.\tProduct name\t\t\t\t\t\t\t\t\t\tPrice")
# A direct keyword search was the original plan, but it kept failing (the key value
# was always 0), so for now the elements are located level by level.
i = 0
for li in lis:
    price = li.find("div", attrs={"class": "p-price"}).find("strong").find("i").text
    name1 = li.find("div", attrs={"class": "p-name"}).find("a").find("em")
    name = name1.text.strip()  # Product name, with leading and trailing whitespace removed
    i = i + 1
    print(i, name, price, sep='\t')
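As a quick illustration of the span-stripping step, here is a tiny self-contained example of w3lib.html.remove_tags_with_content (the sample HTML is made up for the demo):

import w3lib.html

sample = '<i>99.00</i><span>promo text</span>'
# The named tags are deleted together with everything inside them
print(w3lib.html.remove_tags_with_content(sample, which_ones=('span',)))  # -> <i>99.00</i>

The code strips span tags first, presumably so that extra span content nested inside the price and name nodes does not pollute the extracted text.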
Practical results
Experience
This experiment gave us a preliminary understanding of anti-crawler mechanisms and a deeper understanding of the two ways to search for and extract information from HTML, direct fuzzy keyword matching versus level-by-level traversal, and we learned how to remove element information we don't need.
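For reference, since the requirement asks for the re library, a regex version of the name/price extraction might look like the sketch below. The patterns are guesses modeled on the p-price/p-name markup the BeautifulSoup version relies on, and would need adjusting whenever JD changes its page:

import re

# Hypothetical patterns; they assume the span-stripped search-page HTML from above
price_pat = re.compile(r'<div class="p-price">.*?<i>([\d.]+)</i>', re.S)
name_pat  = re.compile(r'<div class="p-name[^"]*">.*?<em>(.*?)</em>', re.S)

def extract(html):
    prices = price_pat.findall(html)
    # Strip any leftover inner tags from the matched names
    names = [re.sub(r'<[^>]+>', '', n).strip() for n in name_pat.findall(html)]
    return list(zip(names, prices))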
Assignment ③ JPGFileDownload:
Requirement: crawl the given web page (http://xcb.fzu.edu.cn/html/2019ztjy), or a web page of your own choice, and download all of its JPG-format files.
Code
'''
1. Fetch the web page's HTML
2. Traverse the document elements and use a regular expression to match all paths ending in jpg/png
3. Extract the image path information and store it in a list
4. Traverse the list, fetch each picture by its path, and save it to a file
'''
import urllib.request
from bs4 import BeautifulSoup
import re

def getHtmlText(url):
    html = urllib.request.urlopen(url)
    html = html.read()
    html = html.decode()
    return html

url = "http://xcb.fzu.edu.cn/"
html = getHtmlText(url)
soup = BeautifulSoup(html, "html.parser")
imglist = soup.find_all("img", {"src": re.compile(r'(jpg|png)$')})  # img tags whose src ends in jpg/png
final_img_url = []
for imgurl in imglist:
    src = imgurl.attrs['src']
    # Decide whether src is an absolute or a relative path
    if src.find("http://") == 0:
        final_img_url.append(src)
    else:
        final_src = url + src
        # Skip paths that were already added to the list
        if final_src in final_img_url:
            continue
        final_img_url.append(final_src)
        print(final_src)
i = 0
for img_url in final_img_url:  # Use a new name so the page url is not overwritten
    response = urllib.request.urlopen(img_url)
    img = response.read()
    with open('D:/11/' + str(i) + '.jpg', 'wb') as f:  # Save each image as <i>.jpg and close the file
        f.write(img)
    i = i + 1
Practical results
Experience
Through these three crawler experiments, we got a rough picture of the crawling process. The overall workflow is much the same each time; the differences lie in how the specific information is selected or matched. This practice also showed that attention must be paid to absolute versus relative paths.
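On the absolute-versus-relative-path point, urllib.parse.urljoin from the standard library handles both cases uniformly, so the manual "http://" check in the code above could be replaced with something like the following (the example paths are made up):

from urllib.parse import urljoin

base = "http://xcb.fzu.edu.cn/"
for src in ("http://example.com/a.jpg", "images/b.jpg", "/uploads/c.png"):
    # urljoin returns absolute URLs unchanged and resolves relative ones against the base
    print(urljoin(base, src))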