A detailed guide to the Scrapy crawler framework (with a hands-on project)

Official documentation: https://scrapy.org/

1, Introduction to the Scrapy framework

Writing a crawler involves a lot of work, for example: sending network requests, parsing data, storing data, handling anti-crawler mechanisms (rotating IP proxies, setting request headers, and so on), making asynchronous requests, and more. Writing all of this from scratch every time is a waste of time. Scrapy therefore encapsulates these basic pieces, so writing crawlers on top of it becomes more efficient, both in crawl efficiency and in development efficiency. That is why, in real companies, crawlers that have to be produced in quantity are usually built with the Scrapy framework.


2, Scrapy architecture diagram


3, Scrapy workflow

4, Scrapy framework module functions

Scrapy Engine: the core of the Scrapy framework. It is responsible for communication and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.

Spider: sends the links that need to be crawled to the engine; the engine eventually hands back the data fetched by the other modules, and the spider parses out the data we want. This part is written by us developers, because it is the programmer who decides which links to crawl and which data on the page we need.

Scheduler: receives the requests sent by the engine, queues and orders them according to some strategy, and schedules when each request is dispatched.

Downloader: receives download requests from the engine, downloads the corresponding data from the network, and returns it to the engine.

Item Pipeline: saves the data passed on by the Spider. Where to save it depends on the developer's own needs.

Downloader middleware: middleware that extends the communication between the downloader and the engine, for example to set proxies or rotate request headers (see the sketch below).

Spider middleware: middleware that extends the communication between the engine and the spider.
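To make the middleware idea concrete, here is a minimal sketch of a downloader middleware that rotates the User-Agent header on outgoing requests. The class name RandomUserAgentMiddleware and the (truncated) User-Agent strings are purely illustrative, and such a middleware only takes effect after being registered in DOWNLOADER_MIDDLEWARES in settings.py:

# middlewares.py (illustrative sketch, not part of the project built below)
import random

class RandomUserAgentMiddleware:
    # hypothetical pool of User-Agent strings; use full, real values in practice
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
        'Mozilla/5.0 (X11; Linux x86_64) ...',
    ]

    def process_request(self, request, spider):
        # called for every request travelling from the engine to the downloader
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None  # returning None lets the request continue through the chain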


5, Scrapy crawler project practice

Crawling movie information (titles, genres, release dates, and so on) from Maoyan (maoyan.com).

1. Install Scrapy

pip install scrapy

2. Create project

The command is scrapy startproject followed by our project name:

$ scrapy startproject maoyan
New Scrapy project 'maoyan', using template directory 'c:\users\administrator\anaconda3\lib\site-packages\scrapy\templates\project', created in:
    D:\work\Python001-class01\week01\maoyan

You can start your first spider with:
    cd maoyan
    scrapy genspider example example.com

The output tells us that a new project has been created, prompts us to cd into it to create our crawler, and even gives us the command for creating the crawler.


3. Create crawler

$ cd maoyan
$ ls
maoyan/  scrapy.cfg
$ scrapy genspider movies maoyan.com  
Created spider 'movies' using template 'basic' in module:
  maoyan.spiders.movies

movies is the name of our spider, and maoyan.com is the domain the spider will crawl. The spider name is used later to start the crawler and is referenced in several places in the generated configuration files, so both the name and the domain should be filled in accurately.

The output tells us that the spider was created from the basic template and can be imported as a module; its module path is maoyan.spiders.movies.
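At this point the newly generated movies.py contains only a skeleton produced from the basic template; it looks roughly like the following (the placeholder start URL may differ slightly between Scrapy versions), and we will fill in the actual parsing logic later:

import scrapy


class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['maoyan.com']
    start_urls = ['http://maoyan.com/']

    def parse(self, response):
        pass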


4. Directory structure of Scrapy

Administrator@DESKTOP-MT9BAAK MINGW64 /d/work/Python001-class01/week01/maoyan/maoyan (master)
$ ll
total 10
-rw-r--r-- 1 Administrator 197121    0 Jun 28 00:11 __init__.py
drwxr-xr-x 1 Administrator 197121    0 Jun 28 23:09 __pycache__/
-rw-r--r-- 1 Administrator 197121  357 Jul  5 17:13 items.py
-rw-r--r-- 1 Administrator 197121 3648 Jun 28 00:21 middlewares.py
-rw-r--r-- 1 Administrator 197121  462 Jul  5 17:37 pipelines.py
-rw-r--r-- 1 Administrator 197121 3055 Jul  5 17:06 settings.py
drwxr-xr-x 1 Administrator 197121    0 Jun 28 23:09 spiders/

items.py and pipelines.py are responsible for saving the data passed on by the Spider; where to save it depends on the developer's own needs.

middlewares.py is where middleware is defined and implemented.

settings.py is the configuration file for the whole project.

The spiders/ directory holds our crawler files.

All of this will become clearer once the whole crawler project is finished.


$ cd spiders/
Administrator@DESKTOP-MT9BAAK MINGW64 /d/work/Python001-class01/week01/maoyan/maoyan/spiders (master)
$ ls
__init__.py  __pycache__/  movies.py

movies.py here is our actual crawler file.

A single project can contain several spiders for different tasks on the same website, and those crawl tasks can also cooperate with each other; see the command sketched below.
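For instance, a second spider for another task on the same site could be generated like this (the spider name comments is purely hypothetical and is not created in this article):

$ scrapy genspider comments maoyan.com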

5. The settings.py configuration file

Here I explain some common configuration items, which are also the settings used in this project. For a more detailed analysis of the configuration file, dedicated write-ups can be found online.

cat settings.py

# Scrapy settings for maoyan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'maoyan'

SPIDER_MODULES = ['maoyan.spiders']
NEWSPIDER_MODULE = 'maoyan.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maoyan (+http://www.yourdomain.com)'  # Scrapy generates this line for the project by itself, so it does not need to be written by hand; here it is simply used with its default value.

#To deal with anti-crawler measures, you can instead write something like this:
#USER_AGENT_LIST = [
#     "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36",
#     "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
#     "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
#     "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
#     "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
#     "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
#     "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
#     "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
#     "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
# ]
# import random
# USER_AGENT = random.choice(USER_AGENT_LIST)
#First define a list with several User-Agent strings, then use random.choice to pick one at random.
#An even simpler approach is to let a third-party library generate a random USER_AGENT for us (a later article will cover it); a commented sketch follows.
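# Illustrative sketch only, assuming the third-party package fake-useragent
# has been installed (pip install fake-useragent):
# from fake_useragent import UserAgent
# USER_AGENT = UserAgent().random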

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1  # Controls the crawl pace: how long to pause between requests
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'maoyan.middlewares.MaoyanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'maoyan.middlewares.MaoyanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'maoyan.pipelines.MaoyanPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


6. Crawler writing

movies.py

# -*- coding: utf-8 -*-
import scrapy
from lxml import etree
from scrapy.selector import Selector  # imported but not used in the end; the Selector-based lines below are commented out
from maoyan.items import MaoyanItem


class MoviesSpider(scrapy.Spider):
    name = 'movies'   # the name used when running the crawler
    allowed_domains = ['maoyan.com']   # domain the crawl is restricted to
    start_urls = ['https://maoyan.com/films?showType=3']  # the first address to request (Scrapy starts the Twisted asynchronous framework and fetches it)

    def parse(self, response):     # response is the reply to the request made to the address in start_urls
        # tree = Selector(response=response)
        tree = etree.HTML(response.text)
        #Parse main page
        link = tree.xpath('//*[@class="channel-detail movie-item-title"]/a')
        # items = []
        for i in range(0, 10):  # take only the first 10 movies
            item = MaoyanItem()
            # Get details page address
            page_url = "https://maoyan.com" + link[i].attrib['href']
            # item['page_url'] = page_url
            # items.append(item)
        # return items
            yield scrapy.Request(url=page_url, callback=self.parse2)    # hand the request for page_url to parse2; scrapy.Request makes the request for us

    def parse2(self, response):
        # tree = Selector(response=response)
        tree = etree.HTML(response.text)
        # Get details page content
        page_tree = tree.xpath('//*[@class="movie-brief-container"]')

        # Movie title
        name_page = page_tree[0].xpath('h1')[0].text

        # Film type
        type_page = page_tree[0].xpath('ul/li[1]/a')[0].text


        # Release time
        time_page = page_tree[0].xpath('ul/li[3]')[0].text
         
        #storage     
        item = MaoyanItem()
        item['name_page'] = name_page
        item['type_page'] = type_page
        item['time_page'] = time_page

        yield item

Crawler logic:

When the crawler runs, it first sends a request to the URL specified in start_urls and receives the response, then parse runs to parse the main page and extract the detail-page URLs, and finally parse2 is called back to parse each detail page. The parsed fields are decoupled into an Item, and pipelines.py takes care of the output.
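A side note on the commented-out Selector import above: Scrapy's built-in selectors could do the same parsing without lxml. A minimal sketch of parse2 written that way is shown below; it uses the same XPath expressions as above but has not been verified against the live page:

    def parse2(self, response):
        # response.xpath uses Scrapy's own selectors instead of lxml's etree
        brief = response.xpath('//*[@class="movie-brief-container"]')
        item = MaoyanItem()
        item['name_page'] = brief.xpath('h1/text()').get()           # movie title
        item['type_page'] = brief.xpath('ul/li[1]/a/text()').get()   # genre
        item['time_page'] = brief.xpath('ul/li[3]/text()').get()     # release time
        yield item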

start_requests():
When the crawler starts, the engine calls this method automatically, and only once, to generate the initial Request objects. The start_requests() method reads the URLs in the start_urls list, builds a Request object for each of them and hands them to the engine; the engine then directs the other components to request the pages from the website server and download them.

But start_urls is just a fixed list of URLs. Sometimes we need to page through the target website, and in that case we override the start_requests() method ourselves.

Example:

    def start_requests(self):
        for i in range(0, 10):
            url = f'https://movie.douban.com/top250?start={i*25}'
            yield scrapy.Request(url=url, callback=self.parse)
            # url: the address to request
            # callback: the engine passes the downloaded page (a Response object) to this method for parsing
            # callback can point to any function; parse is only the default

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MaoyanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # pass

    # fixed format: declare each field as scrapy.Field()
    name_page = scrapy.Field()
    type_page = scrapy.Field()
    time_page = scrapy.Field()


pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

class MaoyanPipeline:
    def process_item(self, item, spider):
        return item

With items and pipelines we can define how the data is stored (a text file, MySQL, and so on); in this project I simply let the scraped items be printed to the screen/log.
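For reference, here is a minimal sketch of a pipeline that appends every item to a CSV file. The class name MaoyanCsvPipeline and the file name movies.csv are illustrative, and the pipeline only takes effect once it is registered in ITEM_PIPELINES in settings.py:

import csv

class MaoyanCsvPipeline:
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('movies.csv', 'a', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # called for every item yielded by the spider
        self.writer.writerow([item['name_page'], item['type_page'], item['time_page']])
        return item

    def close_spider(self, spider):
        # called once when the spider closes
        self.file.close()

# In settings.py:
# ITEM_PIPELINES = {
#     'maoyan.pipelines.MaoyanCsvPipeline': 300,
# }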

That covers all the files I modified, each explained briefly; the above is the general workflow I followed.

7. Run crawler

(base) D:\work\Python001-class01\week01\maoyan\maoyan>scrapy  crawl movies
2020-08-01 10:01:35 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: maoyan)
2020-08-01 10:01:35 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, May  6 2020, 11:
45:54) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.18362-SP0
2020-08-01 10:01:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-08-01 10:01:35 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'maoyan',
 'DOWNLOAD_DELAY': 1,
 'NEWSPIDER_MODULE': 'maoyan.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['maoyan.spiders'],
 'USER_AGENT': 'maoyan (+http://www.yourdomain.com)'}
2020-08-01 10:01:35 [scrapy.extensions.telnet] INFO: Telnet Password: ded8971554c0d368
2020-08-01 10:01:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-08-01 10:01:36 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-01 10:01:36 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-01 10:01:36 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-08-01 10:01:36 [scrapy.core.engine] INFO: Spider opened
2020-08-01 10:01:36 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-01 10:01:36 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-01 10:01:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maoyan.com/robots.txt> (referer: None)
2020-08-01 10:01:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maoyan.com/films?showType=3> (referer: None)
2020-08-01 10:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maoyan.com/films/1203734> (referer: https://maoyan.com/films?showType=3)
2020-08-01 10:01:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://maoyan.com/films/1203734>
{'name_page': 'Dorrit's fantasy adventure', 'time_page': '2020-07-24 Released in Chinese Mainland', 'type_page': ' plot '}
2020-08-01 10:01:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maoyan.com/films/1332034> (referer: https://maoyan.com/films?showType=3)
2020-08-01 10:01:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://maoyan.com/films/1332034>
{'name_page': 'Most beautiful retrograde', 'time_page': '2020-06-28 Released in Chinese Mainland', 'type_page': ' plot '}
2020-08-01 10:01:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maoyan.com/films/158> (referer: https://maoyan.com/films?showType=3)
2020-08-01 10:01:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://maoyan.com/films/158>
{'name_page': 'Talk about the wedding of the great sage of the journey to the West', 'time_page': '2020-07-24 Re screening in Chinese Mainland', 'type_page': ' comedy '}
2020-08-01 10:01:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maoyan.com/films/1301444> (referer: https://maoyan.com/films?showType=3)
2020-08-01 10:01:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://maoyan.com/films/1301444>
{'name_page': 'Mr. Miao', 'time_page': '2020-07-31 Released in Chinese Mainland', 'type_page': ' animation '}
2020-08-01 10:01:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maoyan.com/films/1250952> (referer: https://maoyan.com/films?showType=3)
2020-08-01 10:01:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://maoyan.com/films/1250952>
{'name_page': 'Son of the weather', 'time_page': '2019-11-01 Released in Chinese Mainland', 'type_page': ' love '}
2020-08-01 10:01:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maoyan.com/films/344990> (referer: https://maoyan.com/films?showType=3)
2020-08-01 10:01:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://maoyan.com/films/344990>
{'name_page': 'Chinatown detective 2', 'time_page': '2018-02-16 Released in Chinese Mainland', 'type_page': ' comedy '}
2020-08-01 10:01:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maoyan.com/films/342485> (referer: https://maoyan.com/films?showType=3)
2020-08-01 10:01:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://maoyan.com/films/342485>
{'name_page': 'The mystery of arrival', 'time_page': '2020-07-31 Released in Chinese Mainland', 'type_page': ' plot '}
2020-08-01 10:01:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maoyan.com/films/1218273> (referer: https://maoyan.com/films?showType=3)
2020-08-01 10:01:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://maoyan.com/films/1218273>
{'name_page': 'Manslaughter', 'time_page': '2020-07-20 Re screening in Chinese Mainland', 'type_page': ' plot '}
2020-08-01 10:01:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maoyan.com/films/1285808> (referer: https://maoyan.com/films?showType=3)
2020-08-01 10:01:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://maoyan.com/films/1285808>
{'name_page': 'Alive', 'time_page': '2020-06-24 Released in Korea', 'type_page': ' action '}
2020-08-01 10:01:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://maoyan.com/films/1297466> (referer: https://maoyan.com/films?showType=3)
2020-08-01 10:01:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://maoyan.com/films/1297466>
{'name_page': 'Busan line 2: Peninsula', 'time_page': '2020-07-15 Released in Korea', 'type_page': ' action '}
2020-08-01 10:01:48 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-01 10:01:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4874,
 'downloader/request_count': 12,
 'downloader/request_method_count/GET': 12,
 'downloader/response_bytes': 291383,
 'downloader/response_count': 12,
 'downloader/response_status_count/200': 12,
 'elapsed_time_seconds': 12.738743,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 8, 1, 2, 1, 48, 821792),
 'item_scraped_count': 10,
 'log_count/DEBUG': 22,
 'log_count/INFO': 10,
 'request_depth_max': 1,
 'response_received_count': 12,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 11,
 'scheduler/dequeued/memory': 11,
 'scheduler/enqueued': 11,
 'scheduler/enqueued/memory': 11,
 'start_time': datetime.datetime(2020, 8, 1, 2, 1, 36, 83049)}
2020-08-01 10:01:48 [scrapy.core.engine] INFO: Spider closed (finished)

(base) D:\work\Python001-class01\week01\maoyan\maoyan>
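Besides the scrapy crawl command, the spider can also be launched from an ordinary Python script; a minimal sketch (the file name run.py is arbitrary, and it assumes the script is run from the directory containing scrapy.cfg):

# run.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('movies')   # the spider name defined in movies.py
process.start()           # blocks until the crawl is finished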
