Crawling MV download addresses from a website with Scrapy

Scrapy is a well-known crawling framework. A certain website has some good MVs, and downloading them by hand is too tedious, so we use Scrapy to grab them.

The basic idea is to study the site's home page, the movie list pages, and the play pages; extract the structured information on each page and the links from one page to the next; and finally obtain the download address.

For background on Scrapy's architecture, see Introduction to the principle of Scrapy architecture or the official documentation's Introduction to Scrapy architecture. Now let's build it step by step.

1, Create the project and start writing code

Scrapy provides a command-line tool for creating projects. Run the following commands in a terminal to create the project and the crawler file:

# Create the project
1) scrapy startproject projectname
# Enter the project directory
2) cd projectname
# Create the crawler file; Scrapy generates some boilerplate code automatically
3) scrapy genspider spidername www.xxx.com

2, Modify the code: the spider is the key part

1) Open the crawler file spidername.py. Scrapy calls the parse method by default, which is the entry point of the processing logic. First parse the home page to get the addresses of the movie categories, then jump to the list pages. Note the use of yield scrapy.Request(url=url, callback=self.parseListPage) to request the next page; callback names the parsing method that will handle the response.

class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['xxx.com']
    start_urls = ['https://xxx.com']

    def parse(self, response):
        #result = response.xpath('/html/body/title/text()').extract_first()   /html/body/div/div/div/ul/li/div/div/
        result = response.xpath('//div[@class="item-menu"]/a/@href').extract()    #ul/li/div/div/div[@class="item-menu"]/a
        #print('-' * 60 )
        #print( result )
        #print('-' * 60)
        for url in result:
            url = self.fullUrl( url )
            yield scrapy.Request(url=url, callback=self.parseListPage )
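
The fullUrl() helper used above is not a Scrapy method; it is a small helper of the author's that is not shown. A minimal sketch, assuming it only joins a relative href against the site root (response.urljoin() inside the callbacks would achieve the same thing), could be:

    def fullUrl(self, url):
        # Hypothetical helper (a sketch, not the original code): turn a relative
        # href into an absolute URL against the site root.
        from urllib.parse import urljoin
        return urljoin('https://xxx.com/', url)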

2) Parse the list page. The code is as follows; the core is the xpath expressions and some string handling. At the end, the movie play-page address is extracted and requested.

Note that the parsed data is packed into an item object (see the items.py file), which makes it convenient to process or store the data through a pipeline. When the item is yielded, the pipeline receives the object and processes it.

    def parseListPage(self, response ):
        #Parse the list to get mv information and url
        result = response.xpath(
            '//div[@id="list_videos_common_videos_list_items"]/div[@class="item"]')
        print('-' * 60)
        print(len( result ))
        print('-' * 60)
        #pageUrl = response.url
        #mvs = []
        for e in result:
            data = Mv05Item()
            #print( e.xpath('./a[0]/@href').extract_first() )
            # data-rt="1:2c8d63ec93028cf593fa06c9ab7db742:0:164936:1:"
            data["id"] = e.xpath('./a/@data-rt').extract_first().split(":")[3]

            data["url"] = e.xpath('./a/@href').extract_first()
            data["channel"] = getChannelFromURL(response.url)
            yield  data
            yield scrapy.Request(url=self.fullUrl( data["url"] ), callback=self.parsePlayPage )
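
Two names above come from outside the spider: Mv05Item is defined in the project's items.py, and getChannelFromURL() is another helper of the author's that is not shown. A rough sketch of both, assuming the fields used here and stored by the pipeline later, might be:

# items.py -- sketch of the item, assuming the fields used by the spider and the pipeline
import scrapy

class Mv05Item(scrapy.Item):
    id = scrapy.Field()
    name = scrapy.Field()
    url = scrapy.Field()
    durat = scrapy.Field()
    rating = scrapy.Field()
    channel = scrapy.Field()

# Hypothetical helper: derive the category ("channel") from the page URL,
# e.g. https://xxx.com/categories/pop/ -> 'pop'
def getChannelFromURL(url):
    parts = [p for p in url.rstrip('/').split('/') if p]
    return parts[-1] if parts else ''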

3) Parse the movie play page and obtain the address of the movie file. The code is not listed here.
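
A minimal sketch of what that callback (parsePlayPage) might look like, assuming the play page embeds the file address in a <video> or <source> tag (the real selector depends on how the site builds its player):

    def parsePlayPage(self, response):
        # Sketch only: pull the media file address out of the player markup.
        download_url = response.xpath('//source/@src | //video/@src').extract_first()
        if download_url:
            print(response.url, '->', download_url)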

4) Verify that the parsing logic is correct. I downloaded three typical pages and wrote a small test program to check them, as sketched below.
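
One simple way to do such an offline check, assuming the sample pages are saved locally, is to wrap the saved HTML in an HtmlResponse and call the parse methods directly (file names and module paths here are illustrative):

# test_parse.py -- offline check of parseListPage against a saved page
from scrapy.http import HtmlResponse
from mv05.spiders.movie import MovieSpider   # adjust to the real module path

with open('samples/listpage.html', 'rb') as f:
    body = f.read()

# Build a fake response around the saved HTML
response = HtmlResponse(url='https://xxx.com/list', body=body, encoding='utf-8')

spider = MovieSpider()
for output in spider.parseListPage(response):
    print(output)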

3, Modify the configuration file

The options in the settings file are very important: they provide the configuration parameters for key features and the configuration of extension points.

# Limit the crawl depth; links beyond this depth are not followed
DEPTH_LIMIT = 3

# Setting a User-Agent is essential for dealing with anti-crawling measures
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'
# Whether to obey the robots.txt protocol (disabled here)
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

# When enabled, Scrapy waits a random delay between requests to the same site,
# a random value between 0.5 and 1.5 times DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY = True

# Maximum number of concurrent requests to a single domain; the default is 8
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
# Maximum number of concurrent requests to a single IP. If non-zero, the
# CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored and this per-IP limit is used instead.
#CONCURRENT_REQUESTS_PER_IP = 16


Several parameters related to automatic throttling (the AutoThrottle extension):

    AUTOTHROTTLE_ENABLED
    True: enabled; False: disabled (the default).

    AUTOTHROTTLE_START_DELAY
    The initial download delay for each spider, in seconds. The default is 5.0.

    AUTOTHROTTLE_MAX_DELAY
    The maximum download delay allowed, in seconds, so that high latency does not push the delay ever higher. The default is 60.0.

    AUTOTHROTTLE_TARGET_CONCURRENCY
    The average number of concurrent requests per site; the default is 1.0. Because it is an average, the actual concurrency at any moment may be higher or lower than this value.

    AUTOTHROTTLE_DEBUG
    In debug mode, the log prints the latency of every response together with the current download delay, so the adjustment of the delay can be observed in real time. True: debugging on; False: debugging off (the default).

    CONCURRENT_REQUESTS_PER_DOMAIN
    The number of concurrent requests per domain.

    CONCURRENT_REQUESTS_PER_IP
    The number of concurrent requests per IP.

    DOWNLOAD_DELAY
    The interval between requests to the same website, in seconds. The default is 0.

    RANDOMIZE_DOWNLOAD_DELAY
    True: use a random interval between 0.5*DOWNLOAD_DELAY and 1.5*DOWNLOAD_DELAY; False: do not. The default is True, but because DOWNLOAD_DELAY defaults to 0, DOWNLOAD_DELAY must also be set for it to have any real effect.
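
Put together, a throttling block in settings.py might look like this (the values are illustrative, not tuned for any particular site):

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = False

DOWNLOAD_DELAY = 1                  # base interval between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # actual delay becomes 0.5x to 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0      # 0 means the per-domain limit applies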

4, Write a pipeline and store the data in an SQLite database. A pipeline provides open_spider, process_item, close_spider and other methods; for specific usage see Pipeline project pipeline usage.

import sqlite3

# Pipeline that stores the data in an SQLite database
class Mv05SQLitePipeline(object):

    def open_spider(self, spider):
        # Runs once when the spider starts: open the database connection.
        # SQLITE_FILE (the database path) needs to be defined in the settings
        # configuration file and imported here.
        self.db = sqlite3.connect(SQLITE_FILE)
        self.cursor = self.db.cursor()
        if not self.checkExist():
            self.initTable()

    # Insert each item into the table
    def process_item(self, item, spider):
        ins = 'insert into movieinfo (id, name, url, durat, rating, channel) values(?,?,?,?,?,?)'
        L = [
            item['id'], item['name'], item['url'], item['durat'], item['rating'], item['channel']
        ]
        # self.cursor.execute("BEGIN TRANSACTION")
        self.cursor.execute(ins, L)
        self.db.commit()
        return item

    def close_spider(self, spider):
        # Runs only once, after all the data has been fetched: close the connection.
        self.cursor.close()
        self.db.close()
        print('close_spider called, the project has been closed')
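
checkExist() and initTable() are the author's helpers and are not listed; a minimal sketch of what they might do (check for the movieinfo table and create it on first run) is:

    def checkExist(self):
        # Does the movieinfo table already exist in this SQLite file?
        self.cursor.execute(
            "select name from sqlite_master where type='table' and name='movieinfo'")
        return self.cursor.fetchone() is not None

    def initTable(self):
        # Create the table that process_item() inserts into.
        self.cursor.execute(
            "create table movieinfo (id text primary key, name text, url text, "
            "durat text, rating text, channel text)")
        self.db.commit()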

5, Run the project

You can run the project by executing the scrapy command on the command line:

scrapy crawl spidername

To avoid typing the command in the terminal each time, you can create a runner file main.py (note: this file lives in the same directory as scrapy.cfg) and put the following code in it:

    from scrapy import cmdline
    # Note: cmdline.execute() only saves typing the command by hand; its argument must be a list.
    # Run the crawler to start the project
    cmdline.execute('scrapy crawl spidername'.split())

6, Problems encountered

1. The error KeyError: 'Spider not found' appears. The fix: the spidername passed to scrapy crawl spidername must match the name attribute defined in the spider class.

class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['xxx.com']
    start_urls = ['https://xxx.com']

2. The pipeline does not run because the custom pipeline is not registered in the settings configuration file. The configuration is as follows.

# Enable the pipeline, otherwise it will not run; when there are several pipelines, items pass through them in order from the lowest value to the highest
ITEM_PIPELINES = {
#    'mv05.pipelines.Mv05Pipeline': 100,
    'mv05.pipelines.Mv05SQLitePipeline': 300,
}

3. You need some knowledge of xpath. For the relevant background, see python xpath syntax or the xpath Selector guide for Scrapy.
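
A quick way to experiment with the selectors used above, without running a crawl, is scrapy shell or a plain Selector; for example (the HTML snippet here is made up):

# Trying an xpath expression on a snippet of HTML
from scrapy.selector import Selector

html = '<div class="item-menu"><a href="/categories/pop/">pop</a></div>'
sel = Selector(text=html)
print(sel.xpath('//div[@class="item-menu"]/a/@href').extract())   # ['/categories/pop/']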

The overall approach follows the article Practical application of Python Scrapy crawler framework.
