Crawler practice platform for Scrapy learning 4

Preface

The last article covered how to crawl data with Scrapy and Selenium combined. This article covers how to use Selenium to crawl websites that load data via Ajax and how to bypass their anti-crawling measures.

Environment configuration

All the environments used in this article were configured in the previous article. If you don't know how to set them up, please refer back to it.

Start crawling

antispider1

antispider1 is described as follows:

WebDriver-based anti-crawling: if WebDriver is detected, the page will not be displayed. Suitable for practicing against WebDriver anti-crawling.

WebDriver anti-crawling means that the use of Selenium will be detected.

First try the method mentioned in the previous article.

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


class AntiSpider(scrapy.Spider):
    name = 'antispider1'

    def start_requests(self):
        urls = ['https://antispider1.scrape.center/']
        for a in urls:
            yield SeleniumRequest(url=a, callback=self.parse, wait_time=8, wait_until=EC.presence_of_element_located(
                (By.CLASS_NAME, 'm-b-sm')))

    def parse(self, response, **kwargs):
        print(response.text)
        input()

When the code runs, Selenium throws a timeout exception, because the specified element is not found within the wait time.

selenium.common.exceptions.TimeoutException: Message: 

Looking at the page at the same time, it is obvious that we have been detected: the page content has been removed by JS. Next, find the detection point and bypass it.

First, remove the Selenium element-wait code so the browser does not exit when an exception is thrown, and let the program block on the input() call. The crawler will not exit and the browser will stay open, which makes debugging easier.

Since there is no obvious entry point for this detection, open the browser console and do a global search for the string "Webdriver Forbidden". Only one match is found.

It looks like a ternary expression: the value of window.navigator.webdriver decides whether to show the anti-crawling page or load the data normally.

Evaluating window.navigator.webdriver in the console shows that its value is true. There are two ways to modify its return value:

  • Direct assignment: window.navigator.webdriver = undefined
  • Redefine the property: Object.defineProperties(navigator, {webdriver: {get: () => undefined}});

Tested on the latest version of Chrome, both methods fail. The direct assignment executes without error but does not actually change the return value. Redefining the property does change the return value, but window.navigator.webdriver reverts to true on every page load, so the code would have to run again before each page loads. The problem therefore becomes: how to execute custom code before the page loads.

In Selenium, this can be solved with CDP (Chrome DevTools Protocol). CDP can inject custom code before each page is loaded; the relevant command is Page.addScriptToEvaluateOnNewDocument.

The CDP command is executed through the execute_cdp_cmd function:

driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
    Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined
    })"""
})

This way, the command only needs to be issued once, and Chrome will automatically run the injected script before every page load.
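
As a quick standalone check (a minimal sketch outside of Scrapy, assuming a local chromedriver is on PATH or otherwise resolvable by Selenium), you can open a plain Selenium session, register the script, and confirm that navigator.webdriver no longer returns true:

from selenium import webdriver

# Minimal sketch: verify the CDP injection works, outside of Scrapy.
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)

# Register the script once; Chrome runs it before every page load.
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
    Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined
    })"""
})

driver.get("https://antispider1.scrape.center/")
# Should print None (i.e. undefined) instead of True if the patch took effect.
print(driver.execute_script("return navigator.webdriver"))
driver.quit()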

Now that we have the method, how do we integrate it into the existing framework? Since Selenium is integrated through a third-party package (scrapy-selenium), we cannot modify the package's code directly, and the driver object is owned by that package, so we may not have direct access to it. How can we execute the CDP command then?

At this point we need to read the third-party package's code to see where the author stores the driver object and how to get hold of it.

First look at the first file (scrapy_selenium/http.py):

class SeleniumRequest(Request):
    """Scrapy ``Request`` subclass providing additional arguments"""

    def __init__(self, wait_time=None, wait_until=None, screenshot=False, script=None, *args, **kwargs):
        # Comments have been deleted for ease of viewing
        self.wait_time = wait_time
        self.wait_until = wait_until
        self.screenshot = screenshot
        self.script = script

        super().__init__(*args, **kwargs)

It simply subclasses Scrapy's Request class so that four extra parameters can be passed along to the driver. Now look at the other file (scrapy_selenium/middlewares.py):

class SeleniumMiddleware:
    """Scrapy middleware handling the requests using selenium"""

    def __init__(self, driver_name, driver_executable_path, driver_arguments,
        browser_executable_path):
        # Comments have been deleted for ease of viewing
        webdriver_base_path = f'selenium.webdriver.{driver_name}'

        driver_klass_module = import_module(f'{webdriver_base_path}.webdriver')
        driver_klass = getattr(driver_klass_module, 'WebDriver')

        driver_options_module = import_module(f'{webdriver_base_path}.options')
        driver_options_klass = getattr(driver_options_module, 'Options')

        driver_options = driver_options_klass()
        if browser_executable_path:
            driver_options.binary_location = browser_executable_path
        for argument in driver_arguments:
            driver_options.add_argument(argument)

        driver_kwargs = {
            'executable_path': driver_executable_path,
            f'{driver_name}_options': driver_options
        }

        self.driver = driver_klass(**driver_kwargs)

    @classmethod
    def from_crawler(cls, crawler):
        """Initialize the middleware with the crawler settings"""

        driver_name = crawler.settings.get('SELENIUM_DRIVER_NAME')
        driver_executable_path = crawler.settings.get('SELENIUM_DRIVER_EXECUTABLE_PATH')
        browser_executable_path = crawler.settings.get('SELENIUM_BROWSER_EXECUTABLE_PATH')
        driver_arguments = crawler.settings.get('SELENIUM_DRIVER_ARGUMENTS')

        if not driver_name or not driver_executable_path:
            raise NotConfigured(
                'SELENIUM_DRIVER_NAME and SELENIUM_DRIVER_EXECUTABLE_PATH must be set'
            )

        middleware = cls(
            driver_name=driver_name,
            driver_executable_path=driver_executable_path,
            driver_arguments=driver_arguments,
            browser_executable_path=browser_executable_path
        )

        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)

        return middleware

    def process_request(self, request, spider):
        """Process a request using the selenium driver if applicable"""

        if not isinstance(request, SeleniumRequest):
            return None

        self.driver.get(request.url)

        for cookie_name, cookie_value in request.cookies.items():
            self.driver.add_cookie(
                {
                    'name': cookie_name,
                    'value': cookie_value
                }
            )

        if request.wait_until:
            WebDriverWait(self.driver, request.wait_time).until(
                request.wait_until
            )

        if request.screenshot:
            request.meta['screenshot'] = self.driver.get_screenshot_as_png()

        if request.script:
            self.driver.execute_script(request.script)

        body = str.encode(self.driver.page_source)

        # Expose the driver via the "meta" attribute
        request.meta.update({'driver': self.driver})

        return HtmlResponse(
            self.driver.current_url,
            body=body,
            encoding='utf-8',
            request=request
        )

    def spider_closed(self):
        """Shutdown the driver when spider is closed"""

        self.driver.quit()

This is a downloader middleware class. The code is relatively long, so let's go through it piece by piece.

Start with the from_crawler method: it reads the defined settings from the configuration file, creates an instance of the class, and connects the spider_closed signal to the spider_closed method, so that quit runs and the browser closes when the spider shuts down.

Next, the initialization method: it accepts four parameters, uses import_module to import the right WebDriver class based on them, then builds the driver object with a few arguments and assigns it to self.driver. This is where the driver lives, so this is where we can look for a way to execute the CDP command.

Finally, the process_request method: it calls driver.get to fetch the page, adds the request's cookies to the driver, and performs different actions depending on the parameters, such as waiting, taking a screenshot, or executing a script. The driver object is exposed through the meta attribute so that other middleware or callbacks can click, scroll, or paginate after the page has been requested, and an HtmlResponse object is returned at the end.
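
For example (a minimal sketch, not part of the original spider), a callback could pick up the exposed driver from response.meta and drive the page further; the '.load-more' selector here is purely hypothetical:

from selenium.webdriver.common.by import By

def parse(self, response, **kwargs):
    # The middleware put the driver into request.meta, so it is reachable here.
    driver = response.meta['driver']
    # Hypothetical example: click a "load more" style button if the page has one.
    buttons = driver.find_elements(By.CSS_SELECTOR, '.load-more')
    if buttons:
        buttons[0].click()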

However, the driver is exposed through the meta attribute only after the page source has already been fetched, while we need to run the CDP command before the page loads.

To execute the command before any page loads, we define our own downloader middleware that inherits from SeleniumMiddleware and extends the parent's initialization method.

Add the following to the middlewares.py file (don't forget to import SeleniumMiddleware):

class MyDownloadMiddleware(SeleniumMiddleware):
    def __init__(self, driver_name, driver_executable_path, driver_arguments,
                 browser_executable_path):
        super(MyDownloadMiddleware, self).__init__(driver_name, driver_executable_path, driver_arguments,
                                                   browser_executable_path)
        self.driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
            "source": """
            Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
            })"""
        })

First call the super method to initialize the parent class, then use the driver object created by the parent to execute the CDP command. Don't forget to update the downloader middleware setting in settings.py:

DOWNLOADER_MIDDLEWARES = {
    # Use a relatively large value here: once this middleware returns a response object,
    # the remaining downloader middlewares' process_request methods will not be called
    'learnscrapy.middlewares.MyDownloadMiddleware': 800
}

Re-run the crawler: the page should now load normally and the page source can be retrieved, so we can add the actual parsing code.

class AntiSpider(scrapy.Spider):
    name = 'antispider1'

    def start_requests(self):
        urls = ['https://antispider1.scrape.center/']
        for a in urls:
            yield SeleniumRequest(url=a, callback=self.parse, wait_time=3, wait_until=EC.presence_of_element_located(
                (By.CLASS_NAME, 'm-b-sm')))

    def parse(self, response, **kwargs):
        result = response.xpath('//div[@class="el-card item m-t is-hover-shadow"]')
        for a in result:
            item = SSR1ScrapyItem()
            item['title'] = a.xpath('.//h2[@class="m-b-sm"]/text()').get()
            item['fraction'] = a.xpath('.//p[@class="score m-t-md m-b-n-sm"]/text()').get().strip()
            item['country'] = a.xpath('.//div[@class="m-v-sm info"]/span[1]/text()').get()
            item['time'] = a.xpath('.//div[@class="m-v-sm info"]/span[3]/text()').get()
            item['date'] = a.xpath('.//div[@class="m-v-sm info"][2]/span/text()').get()
            url = a.xpath('.//a[@class="name"]/@href').get()
            print(response.urljoin(url))
            yield SeleniumRequest(url=response.urljoin(url), callback=self.parse_person, meta={'item': item},
                                  wait_time=3, wait_until=EC.presence_of_element_located((By.CLASS_NAME, 'm-b-sm')))

    def parse_person(self, response):
        item = response.meta['item']
        item['director'] = response.xpath(
            '//div[@class="directors el-row"]//p[@class="name text-center m-b-none m-t-xs"]/text()').get()
        yield item

See https://github.com/libra146/learnscrapy/tree/antispider1 for the complete code.

There is actually another way to execute custom JS code before page load: a Chrome extension, similar to what the Tampermonkey plugin does. Due to space limitations, it is not demonstrated here.

antispider2

antispider2 is described as follows:

User-Agent anti-crawling: the server refuses to respond when a common crawler User-Agent is detected. Suitable for User-Agent anti-crawling practice.

Since this is User-Agent anti-crawling, a normal browser User-Agent is enough; Selenium is not needed for now.

I originally wanted to use fake-useragent, but the project has not been updated for more than two years and the UAs are not generated randomly anyway, so I simply downloaded some UAs from the internet and pick one at random. This avoids introducing another dependency: just collect the UAs and choose one randomly each time.

Add the following code to the download middleware:

import json
import random


class Antispider2DownloaderMiddleware(LearnscrapyDownloaderMiddleware):
    def __init__(self):
        super(Antispider2DownloaderMiddleware, self).__init__()
        # Load the pool of User-Agent strings once at start-up
        with open('ua.json', 'r') as f:
            self.ua = json.load(f)

    def process_request(self, request, spider):
        # Pick a random UA for every outgoing request
        request.headers.update({'User-Agent': random.choice(self.ua)})

It reads the local file, and then in the process_request function a UA is picked at random each time to replace the default one.
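
The format of ua.json is not shown above; a minimal sketch of what it is assumed to look like (a plain JSON array of User-Agent strings) and how such a file could be generated:

import json

# Assumed format for ua.json: a plain JSON list of User-Agent strings.
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15",
]

with open('ua.json', 'w') as f:
    json.dump(user_agents, f)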

See https://github.com/libra146/learnscrapy/tree/antispider2 for the complete code.

antispider3

antispider3 is described as follows:

Text-offset anti-crawling: the order of the text you see is not necessarily the order in the source code. Suitable for text-offset anti-crawling practice.

The website uses text offsets as its anti-crawling measure; presumably CSS is used to control the position of the text on the page.

Looking at the rendered page source confirms that the characters are offset by changing the style value. The way to handle this is to grab each character together with its style attribute, then sort by the style value in ascending order to recover the correct text, e.g. "Thinking Changes Life". Scrolling further down, some titles have offsets and some do not, so the code needs to distinguish the two cases (see the sketch below).
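
A minimal sketch of the reordering idea, using a made-up HTML fragment (the offsets and characters here are assumptions for illustration only):

from parsel import Selector

# Hypothetical fragment: each span carries its target position in the style attribute.
html = '''
<h3 class="m-b-sm name">
  <span style="left:24px">d</span>
  <span style="left:0px">w</span>
  <span style="left:16px">r</span>
  <span style="left:8px">o</span>
  <span style="left:32px">s</span>
</h3>
'''
sel = Selector(text=html)
chars = {}
for span in sel.xpath('//span'):
    # Key by the numeric offset, keep the character as the value.
    offset = int(span.xpath('./@style').re_first(r'\d+'))
    chars[offset] = span.xpath('./text()').get()

# Sort by offset and join the characters back together.
title = ''.join(value for _, value in sorted(chars.items()))
print(title)  # -> "words"

Note that the sketch converts the offset to an int before sorting, so that offsets with different digit counts still compare numerically.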

Now write the code: parse the HTML, extract the data along with its style attributes, and reassemble it in the correct order.

class AntiSpider(scrapy.Spider):
    name = 'antispider3'

    def start_requests(self):
        urls = ['https://antispider3.scrape.center/']
        for a in urls:
            yield SeleniumRequest(url=a, callback=self.parse, wait_time=3, wait_until=EC.presence_of_element_located(
                (By.CLASS_NAME, 'm-b-sm')))

    def parse(self, response, **kwargs):
        result = response.xpath('//div[@class="el-card__body"]')
        for a in result:
            item = Antispider3ScrapyItem()
            chars = {}
            # Anti-crawling branch: the title is split into offset spans
            if r := a.xpath('.//h3[@class="m-b-sm name"]//span'):
                for b in r:
                    chars[b.xpath('.//@style').re(r'\d\d?')[0]] = b.xpath('.//text()').get().strip()
                # Sort the dict items with sorted(), using a lambda on index 0 so they are ordered by key (the offset);
                # zip(*) then groups all the characters into one tuple, list() materializes the generator, index [1]
                # selects the tuple of characters, and join() concatenates them into the title string
                item['title'] = ''.join(list(zip(*sorted(chars.items(), key=lambda i: i[0])))[1])
            else:
                # No anti-crawling: the title is plain text
                item['title'] = a.xpath('.//h3[@class="name whole"]/text()').get()
            item['author'] = a.xpath('.//p[@class="authors"]/text()').get().strip()
            url = a.xpath('.//a/@href').get()
            print(response.urljoin(url))
            yield SeleniumRequest(url=response.urljoin(url), callback=self.parse_person, meta={'item': item},
                                  wait_time=3, wait_until=EC.presence_of_element_located((By.CLASS_NAME, 'm-b-sm')))

    def parse_person(self, response):
        item = response.meta['item']
        item['price'] = response.xpath('//p[@class="price"]/span/text()').get()
        item['time'] = response.xpath('//p[@class="published-at"]/text()').get()
        item['press'] = response.xpath('//p[@class="publisher"]/text()').get()
        item['page'] = response.xpath('//p[@class="page-number"]/text()').get()
        item['isbm'] = response.xpath('//p[@class="isbn"]/text()').get()
        yield item

Whether the title is protected by the anti-crawling measure is determined by whether the corresponding span elements exist, and the two cases are handled in separate branches.

Due to time constraints, the code only crawls one page of data, which is enough to prove that the approach works.

A side note: this website also loads its data through Ajax, which means the data could be fetched directly from the API without dealing with the anti-crawling measure at all. The data is parsed from the HTML here purely to practice text-offset anti-crawling.

See https://github.com/libra146/learnscrapy/tree/antispider3 for the complete code.

antispider4

antispider4 is described as follows:

The displayed content is not in the HTML but hidden in a font file, with a character mapping table. Suitable for font anti-crawling practice.

In this case you would need to find the font mapping table and work out the correspondence between the characters and their codes before crawling normally.

But after looking at the site, it doesn't actually seem to be font anti-crawling 😂 (although the website does have a separate font file); the digits are placed in the CSS stylesheet instead. This is the first time I have seen this measure, and it seems more reasonable to call it CSS anti-crawling.

I don't know whether the site author made a mistake; let's treat it as CSS anti-crawling for now.

To handle this measure, we need to grab the class names of the number placeholders in the HTML source and then substitute the corresponding values from the CSS file. So the first thing to process is the CSS file, not the HTML.

After checking, this technique is called implicit Style-CSS anti-crawling.

In CSS, ::before creates a pseudo-element that becomes the first child of the selected element; the content property is often used to add decorative content to it.
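
The regex in the spider below assumes CSS rules of roughly this shape (the class names and digits here are illustrative, not copied from the real stylesheet); a minimal sketch of how the mapping from class name to character is built:

import re

# Hypothetical excerpt of the stylesheet; the real file uses rules of the same shape.
css = '.icon-201:before{content:"0"}.icon-202:before{content:"9"}.icon-203:before{content:"."}'

# Map each icon class name to the character injected via ::before.
mapping = dict(re.findall(r'\.(icon-\d*?):before{content:"(.*?)"}', css))
print(mapping)  # -> {'icon-201': '0', 'icon-202': '9', 'icon-203': '.'}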

class AntiSpider(scrapy.Spider):
    name = 'antispider4'
    css = {}

    def start_requests(self):
        urls = ['https://antispider4.scrape.center/css/app.654ba59e.css']
        for a in urls:
            # Parsing css
            yield Request(url=a, callback=self.parse_css)

    def parse_css(self, response):
        # Based on the pattern, use a regex to extract all the rules we need. Since only the scores are obfuscated here, we only need to match digits and the decimal point.
        result = re.findall(r'\.(icon-\d*?):before{content:"(.*?)"}', response.text)
        for key, value in result:
            self.css[key] = value
        print(self.css)
        # Visit the home page
        yield SeleniumRequest(url='https://antispider4.scrape.center/', callback=self.parse_data,
                              wait_time=3, wait_until=EC.presence_of_element_located((By.CLASS_NAME, 'm-b-sm')))

    def parse_data(self, response):
        result = response.xpath('//div[@class="el-card item m-t is-hover-shadow"]')
        for a in result:
            item = Antispider4ScrapyItem()
            item['title'] = a.xpath('.//h2[@class="m-b-sm"]/text()').get()
            if r := a.xpath('.//p[@class="score m-t-md m-b-n-sm"]//i'):
                item['fraction'] = ''.join([self.css.get(b.xpath('.//@class').get()[5:], '') for b in r])
            item['country'] = a.xpath('.//div[@class="m-v-sm info"]/span[1]/text()').get()
            item['time'] = a.xpath('.//div[@class="m-v-sm info"]/span[3]/text()').get()
            item['date'] = a.xpath('.//div[@class="m-v-sm info"][2]/span/text()').get()
            url = a.xpath('.//a[@class="name"]/@href').get()
            print(response.urljoin(url))
            yield SeleniumRequest(url=response.urljoin(url), callback=self.parse, meta={'item': item},
                                  wait_time=3, wait_until=EC.presence_of_element_located((By.CLASS_NAME, 'm-b-sm')))

    def parse(self, response, **kwargs):
        item = response.meta['item']
        item['director'] = response.xpath(
            '//div[@class="directors el-row"]//p[@class="name text-center m-b-none m-t-xs"]/text()').get()
        yield item

The approach is to use a regex to pull the needed content out of the CSS file and substitute it wherever a placeholder class appears, which yields the correct score data.

See https://github.com/libra146/learnscrapy/tree/antispider4 for the complete code.

Summary

This article only covers anti-crawling measures handled with Selenium; anti-crawling based on IP address or account will be covered in the next article.

There are only a few ways for a web page to carry its data: HTML, JS, CSS, Ajax, and so on. So when you run into anti-crawling, first figure out how the data is rendered; what remains is to process the data in a way suited to its source.
