The Third Assignment of Data Collection

Homework One

Homework ①:

1. Requirements: Specify a website and crawl all of the images on it, for example the China Meteorological Network ( http://www.weather.com.cn ). Crawl with a single-threaded method and with a multi-threaded method, respectively.

2. Output information: print the URL of each downloaded image in the console, store the downloaded images in the images subfolder, and provide a screenshot.

(1) The code is as follows:

# -*- coding = utf-8 -*-
# @Time: 2020/10/13 18:36
# @Author:CaoLanying
# @File:the_third_project1.py
# @Software:PyCharm
from bs4 import BeautifulSoup
from bs4 import UnicodeDammit
import urllib.request
import threading
import os
#image scraping
def imageSpider(start_url):
    global threads
    global count #count
    try:
        urls=[]
        req = urllib.request.Request(start_url,headers=headers)
        data = urllib.request.urlopen(req)
        data = data.read() # read the raw page bytes

        dammit = UnicodeDammit(data,["utf-8","gbk"])
        data = dammit.unicode_markup
        soup = BeautifulSoup(data,"html.parser")
        images = soup.select("img") #get image tag
        for image in images:
            try:
                src=image["src"]
                url=urllib.request.urljoin(start_url,src)
                if url not in urls:
                    urls.append(url) # remember this URL so duplicates are skipped
                    #print(url) #Crawled image address
                    count = count+1
                    T = threading.Thread(target=download,args=(url,count)) # download this image in its own thread
                    T.setDaemon(False) # non-daemon thread, so the main program waits for it
                    T.start()
                    threads.append(T) #Add the thread to the thread array
            except Exception as err:
                print(err)
    except Exception as err:
        print(err)

def download(url,count):
    try:
        # keep the last four characters as the file extension (e.g. ".jpg") if they start with "."
        if(url[len(url)-4]=="."):
            ext = url[len(url)-4:]
        else:
            ext=""

        req = urllib.request.Request(url,headers=headers)
        data = urllib.request.urlopen(req,timeout=100)
        data = data.read()
        fobj = open("D:\\pythonProject\\Wulin'course\\images\\" + str(count) + ext, "wb")
        fobj.write(data)
        fobj.close()
        print("downloaded " + str(count) + ext)
    except Exception as err:
        print(err)


#start_url="http://www.weather.com.cn/weather/101280601.shtml"
#start_url="http://www.sziit.edu.cn"
start_url="http://xcb.fzu.edu.cn/#"
headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre)Gecko/2008072421 Minefield/3.0.2pre"}
count=0
threads=[]
imageSpider(start_url)
for t in threads:
    t.join()
print("The End")

(2) Result screenshot:

(3) Experience:


1. Problem encountered:

The images failed to save because the output path was not written correctly.

2. Solution:

I replaced it with an absolute path:
fobj = open("D:\\pythonProject\\Wulin'course\\images\\" + str(count) + ext, "wb")
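
As an alternative to hard-coding an absolute path, the save location can also be built relative to the script itself. A minimal sketch (the save_image helper is hypothetical and not part of the original code):

import os

def save_image(data, count, ext):
    # Build <script folder>/images/<count><ext> and create the folder if it does not exist yet.
    save_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "images")
    os.makedirs(save_dir, exist_ok=True)
    with open(os.path.join(save_dir, str(count) + ext), "wb") as fobj:
        fobj.write(data)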

Homework Two

Requirement: Use the Scrapy framework to reproduce Homework ①.

Output information: same as Homework ①.

(1) Steps and code:

1. Write spider

import scrapy

from ..items import GetimagItem
class WeatherSpider(scrapy.Spider):
    name = 'weather'
    allowed_domains = ['p.weather.com.cn']
    start_urls = ['http://p.weather.com.cn/']

    def parse(self, response):
        img_url_list = response.xpath('//img/@src')

        for url in img_url_list.extract():

            item = GetimagItem()
            item["url"] = url
            print(url)
            yield item

        print("ok")

2. Write pipelines

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import urllib.request
class GetimagPipeline(object):
    count = 0  # Number of calls to process_item
    def process_item(self,item,spider):
        GetimagPipeline.count += 1
        try:
            url = item["url"] #get url address
            # keep the last four characters as the file extension (e.g. ".jpg") if they start with "."
            if (url[len(url) - 4] == "."):
                ext = url[len(url) - 4:]
            else:
                ext = ""
            req = urllib.request.Request(url)
            data = urllib.request.urlopen(req, timeout=100)
            data = data.read()
            fobj = open("D:\\pythonProject\\Wulin'course\\images2\\" + str(GetimagPipeline.count) + ext, "wb")  # open a file, this
            fobj.write(data) # data input
            fobj.close() # close file
            print("downloaded " + str(GetimagPipeline.count) + ext)
        except Exception as err:
            print(err)
        return item

3. Set settings
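
The exact settings are not reproduced here. A minimal sketch of what settings.py needs, namely registering the pipeline (the project name "getimag" is an assumption) and, if the site's robots.txt blocks the request, relaxing robots.txt obedience:

ITEM_PIPELINES = {
    'getimag.pipelines.GetimagPipeline': 300,  # assumed project name; the number is the pipeline priority
}
ROBOTSTXT_OBEY = False  # only if robots.txt blocks the crawl; otherwise leave the default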

(2) Result screenshot:


(3) Experience:

I learned that Scrapy is a fast, high-level, Python-based web-crawling framework for crawling websites and extracting structured data from their pages. The first time I used it, though, I was at a loss and did not know which module the code should go in. After carefully reading the examples in the book I got some inspiration, and once I understood the workflow and how the parts of the framework relate to each other, programming with it felt very convenient.

Why use the Scrapy framework? Because it makes it easier to build large-scale scraping projects, it handles requests asynchronously so crawling is fast, and it can adjust the crawling speed automatically. In a word, Scrapy is very powerful!

Homework Three

Requirements: Use the Scrapy framework to crawl stock-related information.

Candidate websites: Eastmoney (Oriental Fortune Network): https://www.eastmoney.com/

Sina Stock: http://finance.sina.com.cn/stock/

Output information:

No. | Stock Code | Stock Name | Latest Price | Change (%) | Change Amount | Volume | Turnover | Amplitude | Highest | Lowest | Open Today | Previous Close
1 | 688093 | N Shihua | 28.47 | 62.22% | 10.92 | 261,300 | 760 million | 22.34 | 32.0 | 28.08 | 30.2 | 17.55
2 | ......

(1) The code is as follows:

General framework:

1. Write items

# Define here the models for your scraped items
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class StocksItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    i = scrapy.Field()    # serial number
    f12 = scrapy.Field()  # stock code
    f14 = scrapy.Field()  # stock name
    f2 = scrapy.Field()   # latest price (yuan)
    f3 = scrapy.Field()   # quote change (%)
    f4 = scrapy.Field()   # price change (yuan)
    f5 = scrapy.Field()   # volume
    f6 = scrapy.Field()   # turnover (yuan)
    f7 = scrapy.Field()   # amplitude

2. Write spider

import scrapy
import json
from ..items import StocksItem
class MystockSpider(scrapy.Spider):
    name = 'mystock'
    start_urls = ["http://75.push2.eastmoney.com/api/qt/clist/get?cb=jQuery112406817237975028352_1601466960670&pn=1&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f3&fs=m:0+t:6,m:0+t:13,m:0+t:80,m:1+t:2,m:1+t:23&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1601466960895"]
    #start_urls = ["http://quote.eastmoney.com/center/gridlist.html#hs_a_board"]
    def parse(self, response):
        # The API returns JSONP: a JSON payload wrapped in a jQuery callback, so the callback
        # name and the trailing ");" must be stripped before json.loads can parse it.
        count = 0
        result = response.text
        result = result.replace('''jQuery112406817237975028352_1601466960670(''',"").replace(');','')
        result = json.loads(result)
        for f in result['data']['diff']:
            count += 1
            item = StocksItem()
            item["i"] = str(count)
            item["f12"] = f['f12']
            item["f14"] = f['f14']
            item["f2"] = f['f2']
            item["f3"] = f['f3']
            item["f4"] = f['f4']
            item["f5"] = f['f5']
            item["f6"] = f['f6']
            item["f7"] = f['f7']
            yield item
        print("ok")

3. Write pipelines

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from openpyxl import Workbook

class StocksPipeline(object):
    wb = Workbook()
    ws = wb.active  # Activate worksheet
    ws.append(["serial number","code","name","Latest price (yuan)","Quote change","Decrease and increase (yuan)", "volume","Turnover(Yuan)","gain"])
    def process_item(self, item, spider):
        line = [item['i'], item['f12'], item['f14'], item['f2'], item['f3'],item['f4'],item['f5'], item['f6'],item['f7']]  # Sort out each item in the data
        self.ws.append(line)  # Add data to xlsx as rows
        self.wb.save(r'C:\Users\Administrator\Desktop\stock.xlsx')  # save xlsx file

        return item

4. Set settings
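
As in Homework ②, the settings are not reproduced here. A minimal sketch of what settings.py needs (the project name "stocks" is an assumption):

ITEM_PIPELINES = {
    'stocks.pipelines.StocksPipeline': 300,  # assumed project name; the number is the pipeline priority
}
ROBOTSTXT_OBEY = False  # assumption: relax robots.txt obedience if the API request is otherwise filtered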

(2) Result screenshot:

(3) Experience:

With the experience of the second assignment, the third one was not so difficult. The teacher mentioned that the data output should look nicer, so I looked into how to store the data in an Excel sheet, and found it very convenient and practical.

I mainly used Scrapy's pipeline.py together with the open-source Python library OpenPyxl. I will try other storage methods later.

1. Problem encountered:
An error was reported when the crawler started.


2. Solution:

This setting was changed to True.

Attached is how Scrapy can save data to Excel:

About pipeline

The pipeline is a module in Scrapy. After the data is scraped by the spider, it is processed by the pipeline. A pipeline usually consists of several "processing steps", and the data passes through these steps in order; if an item fails a certain step, it is discarded.


Pipelines generally serve a few purposes:

1. Cleaning HTML data (for example, stripping useless tags)

2. Validating the scraped data (for example, checking that it contains certain fields)

3. Checking for duplicates (filtering out duplicate data)

4. Saving the scraped data to a database or file

What we use here is the last one: saving the data, in our case as an xlsx file.
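
As an illustration of use 3, a minimal duplicate-filtering pipeline might look like the sketch below (it assumes each item carries a unique "url" field):

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.seen_urls = set()  # URLs already processed in this crawl

    def process_item(self, item, spider):
        if item['url'] in self.seen_urls:
            raise DropItem("duplicate item: %s" % item['url'])  # discard the duplicate
        self.seen_urls.add(item['url'])
        return item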

About OpenPyxl

OpenPyxl is a Python library for reading and writing Excel 2007 xlsx/xlsm files. Without further ado, here is an example:

from openpyxl import Workbook

wb = Workbook()  # class instantiation
ws = wb.active  # Activate worksheet

ws['A1'] = 42  # write a value into cell A1
ws.append(['Kobe', '1997', 'guard', 'out for the season'])  # append a row of data

wb.save('/home/alexkh/nba.xlsx')  # save document
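
Reading the file back is just as simple; a quick sketch using OpenPyxl's load_workbook:

from openpyxl import load_workbook

wb = load_workbook('/home/alexkh/nba.xlsx')  # open the workbook saved above
ws = wb.active
print(ws['A1'].value)  # -> 42
for row in ws.iter_rows(values_only=True):  # iterate over the rows as tuples of values
    print(row)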

Saving Scrapy data as Excel


Saving Scrapy data to Excel is handled in pipeline.py. The specific code is as follows:

#coding=utf-8
from openpyxl import Workbook

class TuniuPipeline(object):  # the pipeline class
    wb = Workbook()
    ws = wb.active
    ws.append(['News headline', 'News link', 'Source website', 'Release time', 'Similar news', 'Contains website name?'])  # set the header row

    def process_item(self, item, spider):  # process each item
        line = [item['title'], item['link'], item['source'], item['pub_date'], item['similar'], item['in_title']]  # collect the fields of one item
        self.ws.append(line)  # append the data to the worksheet as a row
        self.wb.save('/home/alexkh/tuniu.xlsx')  # save the xlsx file
        return item


For pipeline.py to take effect, you also need to register it in the settings.py file, as follows:

ITEM_PIPELINES = {
    'tuniunews.pipelines.TuniuPipeline': 200,  # 200 is the pipeline priority (lower numbers run earlier)
}


References

1. The Item Pipeline section of the Scrapy documentation
2. The OpenPyxl official documentation
