You want to write a web crawler, but you say you don't understand Python regular expressions? I believe you, so come and take a look.

Foreword

A regular expression is a special sequence of characters that can help you easily check whether a string matches a certain pattern.

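The methods of a compiled pattern object (match, search, findall and so on) are also available as functions of the re module with exactly the same behavior; the module-level functions take the pattern string as their first argument.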
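A minimal sketch of this equivalence, assuming a made-up sample string and pattern (neither comes from the original article): the module-level re.search call and the search method of a compiled pattern find the same match.

```python
import re

text = "order 66 shipped"      # hypothetical sample string
pattern_string = r"\d+"        # one or more digits

# Module-level function: the pattern string is the first argument.
m1 = re.search(pattern_string, text)

# The same operation as a method of a compiled pattern object.
compiled = re.compile(pattern_string)
m2 = compiled.search(text)

print(m1.group(), m2.group())
print(m1.span() == m2.span())
```

Compiling first is mostly worthwhile when the same pattern is reused many times.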

re.match function

re.match attempts to match a pattern at the beginning of the string. If the pattern does not match at the beginning, match() returns None.

Function syntax:

re.match(pattern, string, flags=0)

Function parameter description:

Parameter   Description
pattern     The regular expression to match.
string      The string to match against.
flags       Flags that control how the regular expression is matched, such as case-insensitive or multi-line matching.

The re.match method returns a match object if the match is successful; otherwise it returns None.

We can use the match object's group(num) or groups() method to retrieve the matched text.

Match object method   Description
group(num=0)          Returns the string matched by the whole expression. group() can also take several group numbers at once, in which case it returns a tuple containing the values of those groups.
groups()              Returns a tuple containing the strings of all groups, from group 1 up to the last group in the pattern.
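As a small sketch of the two methods described above, reusing the article's sample sentence: group() with no argument returns the whole match, group(1, 2) returns a tuple, and groups() returns every group.

```python
import re

line = "I really like you yesterday"
m = re.match(r'(.*) really (.*?) .*', line)

if m is not None:
    whole = m.group()          # same as m.group(0): the entire match
    pair = m.group(1, 2)       # several group numbers give a tuple
    everything = m.groups()    # all groups, starting from group 1
    print(whole)
    print(pair)
    print(everything)
```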

Example:

import re

line = "I really like you yesterday"

matchObj = re.match(r'(.*) really (.*?) .*', line)

print("matchObj.group() : ", matchObj.group())
print("matchObj.group(1) : ", matchObj.group(1))
print("matchObj.group(2) : ", matchObj.group(2))

The execution result of the above example is as follows:

matchObj.group() :  I really like you yesterday
matchObj.group(1) :  I
matchObj.group(2) :  like
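Because re.match returns None when nothing matches, calling group() on the result without a check raises AttributeError. The sketch below shows one way to guard against that and also passes a flag; the helper name word_after_really is hypothetical and only serves the illustration.

```python
import re

def word_after_really(line):
    """Return the word following 'really', or None when the line does not match."""
    # re.IGNORECASE (same as re.I) makes the literal 'really' case-insensitive.
    m = re.match(r'(.*) really (.*?) .*', line, flags=re.IGNORECASE)
    if m is None:              # no match at the start of the string
        return None
    return m.group(2)

print(word_after_really("I REALLY like you yesterday"))
print(word_after_really("no keyword here"))
```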

re.search method

re.search scans through the whole string and returns the first match it finds.

Function syntax:

re.search(pattern, string, flags=0)

Function parameter description:

Parameter   Description
pattern     The regular expression to match.
string      The string to match against.
flags       Flags that control how the regular expression is matched, such as case-insensitive or multi-line matching.

The re.search method returns a match object if the match is successful; otherwise it returns None.

We can use the match object's group(num) or groups() method to retrieve the matched text.

Match object method   Description
group(num=0)          Returns the string matched by the whole expression. group() can also take several group numbers at once, in which case it returns a tuple containing the values of those groups.
groups()              Returns a tuple containing the strings of all groups, from group 1 up to the last group in the pattern.

Example:

import re

line = "I really like you yesterday"

searchObj = re.search(r'(.*) really (.*?) .*', line)

print("searchObj.group() : ", searchObj.group())
print("searchObj.group(1) : ", searchObj.group(1))
print("searchObj.group(2) : ", searchObj.group(2))

The execution result of the above example is as follows:

searchObj.group() :  I really like you yesterday
searchObj.group(1) :  I
searchObj.group(2) :  like

The difference between re.match and re.search

re.match only matches at the beginning of the string. If the beginning of the string does not match the regular expression, the match fails and the function returns None. re.search, in contrast, scans the entire string until it finds a match.

Example:

import re

line = "I really like you yesterday"

# re.match only tries the start of the string, so 'like' is not found
matchObj = re.match(r'like', line)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

# re.search scans the whole string and finds 'like'
matchObj = re.search(r'like', line)
if matchObj:
    print("search --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

The result of running the above example is as follows:

No match!!
search --> matchObj.group() :  like

Search and replace

Python's re module provides re.sub for replacing matches in strings.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Returns the string obtained by replacing the leftmost non-overlapping matches of pattern in string with repl. If the pattern is not found, the string is returned unchanged.

The optional parameter count is the maximum number of matches to replace; count must be a non-negative integer. The default value of 0 replaces all matches.

Example:

import re

phone = "2004-959-559 # This is a foreign phone number"

# Remove the Python-style comment at the end of the string
num = re.sub(r'#.*$', "", phone)
print("phone number is: ", num)

# Remove every character that is not a digit
num = re.sub(r'\D', "", phone)
print("phone number is : ", num)

The execution result of the above example is as follows:

phone number is:  2004-959-559
phone number is :  2004959559
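The count parameter described earlier can be sketched as follows, using a shortened version of the phone string (the variable names are only for illustration):

```python
import re

phone = "2004-959-559"

# count=1 replaces only the first (leftmost) hyphen.
first_only = re.sub(r'-', "", phone, count=1)

# count=0, the default, replaces every match.
all_removed = re.sub(r'-', "", phone)

print(first_only)
print(all_removed)
```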

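Using a function as the repl parameter

When repl is a function, it is called once for every match with the match object as its argument, and its return value is used as the replacement text. The following example multiplies each matched number in the string by 2: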

Example:

import re
 
# Multiply matching numbers by 2
def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)
 
s = 'A23G4HFD567'
print(re.sub(r'(?P<value>\d+)', double, s))

The output of the execution is:

A46G8HFD1134

re.compile function

The compile function compiles a regular expression and returns a regular expression (Pattern) object, whose match() and search() methods can then be used.

The syntax format is:

re.compile(pattern[, flags])

Parameters:

  • pattern : a regular expression in string form

  • flags : optional, selects the matching mode, such as ignoring case or multi-line mode. The available flags are:

    1. re.I ignores case
    2. re.L makes \w, \W, \b, \B and case-insensitive matching depend on the current locale (only allowed with bytes patterns)
    3. re.M multi-line mode: ^ and $ also match at the start and end of each line
    4. re.S makes . match any character including newlines (by default . does not match a newline)
    5. re.U makes \w, \W, \b, \B, \d, \D, \s, \S follow the Unicode character properties database (already the default for str patterns in Python 3)
    6. re.X ignores whitespace and comments after # inside the pattern, for readability
Example

>>> import re
>>> pattern = re.compile(r'\d+')                    # matches one or more digits
>>> m = pattern.match('one12twothree34four')        # match at the start of the string: no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 2, 10) # match starting at the position of 'e': no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 3, 10) # match starting at the position of '1': success
>>> print(m)                                        # a Match object is returned
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0)   # 0 can be omitted
'12'
>>> m.start(0)   # 0 can be omitted
3
>>> m.end(0)     # 0 can be omitted
5
>>> m.span(0)    # 0 can be omitted
(3, 5)

In the above, a Match object is returned when the match is successful, where:

  • The group([group1, …]) method is used to obtain one or more groups of matched strings. When you want to obtain the entire matched substring, you can use group() or group(0) directly;
  • The start([group]) method is used to obtain the starting position (the index of the first character of the substring) of the substring matched by the group in the whole string, and the default value of the parameter is 0;
  • The end([group]) method is used to obtain the end position of the substring matched by the group in the whole string (the index of the last character of the substring + 1), and the default value of the parameter is 0;
  • The span([group]) method returns (start(group), end(group)).
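A minimal sketch of how these positions relate to the original string, reusing the sample text from the session above: slicing the string with the values from span() reproduces the text returned by group().

```python
import re

text = 'one12twothree34four'
pattern = re.compile(r'\d+')

m = pattern.search(text)        # first run of digits anywhere in the string
start, end = m.span()

# Slicing with the span gives back exactly the matched text.
print(start, end, text[start:end], m.group())

# pos and endpos limit the region; match() must succeed exactly at pos.
m2 = pattern.match(text, 13, 19)
print(m2.group(), m2.span())
```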

Let's look at another example:

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)   # re.I means ignore case
>>> m = pattern.match('Hello World Wide Web')
>>> print(m)                              # the match succeeds, so a Match object is returned
<re.Match object; span=(0, 11), match='Hello World'>
>>> m.group(0)                            # the entire matched substring
'Hello World'
>>> m.span(0)                             # the index range of the entire matched substring
(0, 11)
>>> m.group(1)                            # the substring matched by the first group
'Hello'
>>> m.span(1)                             # the index range of the first group's substring
(0, 5)
>>> m.group(2)                            # the substring matched by the second group
'World'
>>> m.span(2)                             # the index range of the second group's substring
(6, 11)
>>> m.groups()                            # equivalent to (m.group(1), m.group(2), ...)
('Hello', 'World')
>>> m.group(3)                            # there is no third group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

findall

Finds all substrings in the string matched by the regular expression and returns a list, or an empty list if no matches are found.

Note: match and search return at most one match, while findall returns all matches.

The syntax of the compiled pattern's findall method is:

findall(string[, pos[, endpos]])

Parameters:

  • string : the string to search.
  • pos : optional, the position in the string where the search starts; the default is 0.
  • endpos : optional, the position in the string where the search stops; the default is the length of the string.

Find all numbers in a string:

import re
 
pattern = re.compile(r'\d+')   # find numbers
result1 = pattern.findall('school 123 google 456')
result2 = pattern.findall('sch88ool123google456', 0, 10) 
print(result1)
print(result2)

Output result:

['123', '456']
['88', '12']
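One detail worth knowing is that the shape of the returned list depends on the capture groups in the pattern. The sketch below uses the module-level re.findall on a made-up sample string to show the three cases.

```python
import re

text = 'width=80 height=24'   # hypothetical sample string

# No groups: each item is the whole match.
no_groups = re.findall(r'\w+=\d+', text)

# One group: each item is the text of that group.
one_group = re.findall(r'\w+=(\d+)', text)

# Several groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\w+)=(\d+)', text)

print(no_groups)
print(one_group)
print(two_groups)
```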

re.finditer

Similar to findall, finditer finds all substrings in a string that match the regular expression, but returns them as an iterator of match objects.

re.finditer(pattern, string, flags=0)

Parameters:

Parameter   Description
pattern     The regular expression to match.
string      The string to match against.
flags       Flags that control how the regular expression is matched, such as case-insensitive or multi-line matching.

Example:

import re
 
it = re.finditer(r"\d+","12a32bc43jf3")
for match in it:
    print (match.group() )

Output result:

12
32
43
3
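Because finditer yields match objects rather than plain strings, the position of every match is available as well. A small sketch on the same input:

```python
import re

text = "12a32bc43jf3"

# Collect the matched text together with its start and end index.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

for value, start, end in positions:
    print(value, start, end)
```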

re.split

The split method splits the string at every match of the pattern and returns the pieces as a list. Its usage is as follows:

re.split(pattern, string[, maxsplit=0, flags=0])

Parameters:

Parameter   Description
pattern     The regular expression to match.
string      The string to split.
maxsplit    The maximum number of splits; maxsplit=1 splits once. The default is 0, which means no limit.
flags       Flags that control how the regular expression is matched, such as case-insensitive or multi-line matching.

Example:

>>> import re
>>> re.split(r'\W+', 'school, school, school.')
['school', 'school', 'school', '']
>>> re.split(r'(\W+)', ' school, school, school.')    # a capture group keeps the separators in the result
['', ' ', 'school', ', ', 'school', ', ', 'school', '.', '']
>>> re.split(r'\W+', ' school, school, school.', 1)   # split at most once
['', 'school, school, school.']

>>> re.split(r'a+', 'hello world')   # no match is found, so the string is not split
['hello world']
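A typical use is splitting on several different separators at once, which str.split cannot do. The sketch below assumes a made-up record string and also shows maxsplit.

```python
import re

record = "alice; bob,carol  dave"   # hypothetical input with mixed separators

# Split on any run of semicolons, commas or whitespace.
names = re.split(r'[;,\s]+', record)

# maxsplit=1 splits only at the first separator and keeps the rest intact.
first, rest = re.split(r'[;,\s]+', record, maxsplit=1)

print(names)
print(first)
print(rest)
```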


Tags: Python crawler

Posted by cahva on Wed, 25 May 2022 09:02:15 +0300