Regular expressions complete summary, collection

1. General

A regular expression is a way of describing text patterns (i.e. the structure of strings).

There are two ways to create one:

One is to use a literal, with slashes marking the beginning and end.

var regex = /xyz/

The other is to use the RegExp constructor.

var regex = new RegExp('xyz');

The main difference is that a literal is turned into a regular expression when the engine compiles the code, while the constructor creates the regular expression at runtime, so the former is more efficient. The literal form is also more convenient and intuitive, so in practice regular expressions are almost always defined with literals.
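The constructor remains useful when the pattern is only known at runtime. The following is a minimal sketch of that case; makeKeywordMatcher is a hypothetical helper, not part of any library, and it assumes the keyword contains no special regex characters.

```javascript
// Both forms produce an equivalent pattern.
const literal = /xyz/;
const constructed = new RegExp('xyz');
console.log(literal.source === constructed.source); // true

// Hypothetical helper: the keyword is only known at runtime, so a literal
// cannot be used. Assumes the keyword has no special regex characters.
function makeKeywordMatcher(keyword) {
  return new RegExp(keyword, 'i'); // second argument: modifiers
}

const matcher = makeKeywordMatcher('regex');
console.log(matcher.test('Learning RegEx is useful')); // true
console.log(matcher.test('Nothing to see here'));      // false
```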

2. Instance properties

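Modifiers follow the closing slash of a literal (or are passed as the second argument of RegExp). Each one is also exposed as a read-only Boolean property of the instance:

  • i: ignore case (property ignoreCase)
  • m: multiline mode (property multiline)
  • g: global search (property global)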
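A short sketch of what each of the three modifiers changes; the sample strings are made up for illustration.

```javascript
// i: letter case is ignored.
const caseInsensitive = /abc/i.test('ABC');   // true

// m: ^ and $ also match at the start and end of each line.
const withoutM = /^b/.test('a\nb');           // false
const withM = /^b/m.test('a\nb');             // true

// g: find every match instead of stopping at the first one.
const allMatches = 'a1a2a3'.match(/a/g);      // ["a", "a", "a"]

// Each modifier is reflected in a Boolean property of the instance.
const re = /abc/gi;
console.log(re.global, re.ignoreCase, re.multiline); // true true false
```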

3. Instance methods

3.1 RegExp.prototype.test()

The test method of a regular expression instance returns a Boolean indicating whether the pattern matches the string passed as its argument.

/Xiaozhi/.test('Xiaozhi lifelong learning executor') // true
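One caveat worth knowing: when the regex has the g modifier, test keeps its position in the lastIndex property, so repeated calls on the same string do not always return the same answer. A minimal sketch:

```javascript
const re = /x/g;
const results = [];

// Call test three times on the same string and record the state each time.
for (let i = 0; i < 3; i++) {
  results.push({ matched: re.test('_x_x'), lastIndex: re.lastIndex });
}

console.log(results);
// [ { matched: true, lastIndex: 2 },
//   { matched: true, lastIndex: 4 },
//   { matched: false, lastIndex: 0 } ]
```

Without the g modifier, lastIndex is not used and every call returns true here.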

3.2 RegExp.prototype.exec()

  • reg.exec(str) returns an array containing the match result, or null if there is no match. With the g modifier, each call to exec continues the search further along the string.
var s = '_x_x';
var r1 = /x/;
var r2 = /y/;
r1.exec(s) // ["x"]
r2.exec(s) // null
  • If the expression contains parentheses (), this is called group matching. In the returned array, the first element is the overall match, followed by the match of each pair of parentheses.
var s = '_x_x';
var r = /_(x)/;
r.exec(s) // ["_x", "x"]

The array returned by exec also has the following two properties:

  • input: the entire original string.

  • index: the position at which the overall match starts (counting from 0).

var r = /a(b+)a/;
var arr = r.exec('_abbba_aba_');
arr // ["abbba", "bbb"]
arr.index // 1
arr.input // "_abbba_aba_"
  • If the expression has the g modifier, exec can be called repeatedly, and each search starts right after the previous match (the position is stored in lastIndex).
var reg = /a/g;
var str = 'abc_abc_abc'
var r1 = reg.exec(str);
r1 // ["a"]
r1.index // 0
reg.lastIndex // 1
var r2 = reg.exec(str);
r2 // ["a"]
r2.index // 4
reg.lastIndex // 5
var r3 = reg.exec(str);

r3 // ["a"]
r3.index // 8
reg.lastIndex // 9
var r4 = reg.exec(str);

r4 // null
reg.lastIndex // 0
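The repeated calls above are usually written as a loop that runs until exec returns null. A minimal sketch, where findAll is a hypothetical helper name:

```javascript
// Hypothetical helper: collect every match of a global regex with its position.
function findAll(re, str) {
  if (!re.global) {
    throw new Error('findAll expects a regex with the g modifier');
  }
  re.lastIndex = 0; // always start from the beginning of the string
  const hits = [];
  let m;
  while ((m = re.exec(str)) !== null) {
    hits.push({ text: m[0], index: m.index });
    // An empty match does not advance lastIndex, so step forward manually.
    if (m[0] === '') re.lastIndex++;
  }
  return hits;
}

console.log(findAll(/a/g, 'abc_abc_abc'));
// [ { text: 'a', index: 0 }, { text: 'a', index: 4 }, { text: 'a', index: 8 } ]
```

The guard on re.global matters: without the g modifier, exec would return the first match forever and the loop would never end.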

4. String instance methods

4.1 str.match(reg)

Similar to reg.exec, but with the g modifier str.match returns all matches at once.

var s = 'abba';
var r = /a/g;
s.match(r) // ["a", "a"]
r.exec(s) // ["a"]

4.2 str.search(reg)

Returns the position of the first successful match. If there is no match, it returns -1.

'_x_x'.search(/x/) // 1

4.3 str.replace(reg,newstr)

Matches with the first parameter reg and replaces each match with the second parameter newstr. Without the g modifier, only the first match is replaced; with it, every match is replaced.

'aaa'.replace('a', 'b')// "baa"

'aaa'.replace(/a/, 'b') // "baa"

'aaa'.replace(/a/g, 'b') // "bbb" 

4.4 str.split(reg[,maxLength])

Splits the string at every match of reg and returns the pieces as an array. The optional second parameter limits the maximum number of pieces returned.
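A small sketch of split with a regex separator and with a limit; the input string is made up for illustration.

```javascript
const csv = 'a,  b,c, d';

// Split on a comma followed by any number of spaces.
const parts = csv.split(/, */);        // ["a", "b", "c", "d"]

// The second argument caps the number of pieces returned.
const firstTwo = csv.split(/, */, 2);  // ["a", "b"]

console.log(parts, firstTwo);
```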

5. Matching rules

5.1 Literal characters and metacharacters

Most characters in a regular expression are literal: /a/ matches a and /b/ matches b. A character that stands only for itself (like a and b above) is called a "literal character". Some characters instead have a special meaning and do not stand for themselves. They are called "metacharacters", and the main ones are as follows.

  • (1) Dot character (.)
    The dot character (.) matches any character except carriage return (\r), line feed (\n), line separator (\u2028), and paragraph separator (\u2029).
/c.t/

In the above code, c.t matches any single character between c and t, as long as the three characters are on the same line: cat, c2t and c-t all match, but coot does not.

  • (2) Position characters
    ^ represents the start of the string

    $ represents the end of the string

// test must appear at the beginning
/^test/.test('test123')   // true

// test must appear at the end
/test$/.test('new test') // true

// The string must contain only test, from start to end
/^test$/.test('test') // true
/^test$/.test('test test') // false
  • (3) Selector (|)
    The vertical bar (|) means OR in a regular expression: cat|dog matches cat or dog.
/11|22/.test('911') // true

In the above code, the regular expression matches any string that contains 11 or 22.

5.2 Escape character

For metacharacters that have a special meaning in regular expressions, matching the character itself requires a preceding backslash. For example, to match +, write \+.

/1+1/.test('1+1')// false
/1\+1/.test('1+1')// true

In regular expressions, 12 characters need to be escaped with a backslash: ^, ., [, $, (, ), |, *, +, ?, { and \. Note that when the RegExp constructor is used, the escape needs two backslashes, because the string literal consumes one level of escaping first.

(new RegExp('1\+1')).test('1+1')// false

(new RegExp('1\\+1')).test('1+1')// true
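When the text to match comes from a variable, escaping by hand is not practical. The following sketch shows a commonly used approach; escapeRegExp is a hypothetical helper name, not a built-in.

```javascript
// Hypothetical helper: escape regex metacharacters in arbitrary text.
// It escapes the 12 characters listed above plus ] and }, which is harmless.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // $& is the matched character
}

const userInput = '1+1';
const re = new RegExp(escapeRegExp(userInput));

console.log(escapeRegExp(userInput)); // 1\+1
console.log(re.test('1+1'));          // true
console.log(re.test('11'));           // false
```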

5.3 Character class

A character class offers a set of characters to choose from; matching any one of them is enough. All candidate characters are placed in square brackets. For example, [xyz] matches any one of x, y and z.

/[abc]/.test('hello world') // false

/[abc]/.test('apple') // true

Two characters have special meanings inside a character class.

  • (1) Caret (^)
    If the first character inside the square brackets is ^, the class matches any character except those listed. For example, [^xyz] matches any character other than x, y and z:
/[^abc]/.test('hello world') // true

/[^abc]/.test('bbc') // false

If the square brackets contain nothing but the caret, i.e. [^], it matches every character, including line breaks. By contrast, the dot metacharacter (.) does not match line breaks.

var s = 'Please yes\nmake my day!';
s.match(/yes.*day/) // null
s.match(/yes[^]*day/) // [ 'yes\nmake my day']

In the above code, the string s contains a newline character. The dot does not match newlines, so the first regular expression fails to match; [^] in the second regular expression matches every character, so the match succeeds.

  • (2) Hyphen (-)
    For a continuous sequence of characters, the hyphen (-) provides a shorthand for the whole range. For example, [abc] can be written as [a-c], [0123456789] can be written as [0-9], and likewise [A-Z] represents the 26 uppercase letters.
/a-z/.test('b') // false

/[a-z]/.test('b') // true 

The following are legal abbreviations of character classes.

[0-9.,]
[0-9a-fA-F]
[a-zA-Z0-9-]
[1-31]

The last character class [1-31] in the above code does not represent 1 to 31, but only 1 to 3.
In addition, don't use a hyphen to set an overly large range, or it is likely to include unexpected characters. The most typical example is [A-z]. On the surface it selects the 52 letters from uppercase A to lowercase z, but because ASCII places other characters between the uppercase and lowercase letters, unexpected results appear.

/[A-z]/.test('\\') // true

In the above code, the ASCII code of the backslash (\) lies between the uppercase and lowercase letters, so it is matched.
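To see exactly which non-letter characters [A-z] lets through, one can walk the character codes from A to z. A minimal sketch:

```javascript
const extras = [];

for (let code = 'A'.charCodeAt(0); code <= 'z'.charCodeAt(0); code++) {
  const ch = String.fromCharCode(code);
  // Keep characters that [A-z] accepts but that are not letters.
  if (/[A-z]/.test(ch) && !/[A-Za-z]/.test(ch)) {
    extras.push(ch);
  }
}

console.log(extras.join(' ')); // [ \ ] ^ _ `
```

Writing the range as [A-Za-z] avoids all six of them.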

5.4 Predefined patterns

Predefined patterns are shorthand for some common patterns.

\d matches any digit from 0 to 9, equivalent to [0-9].

\D matches any character except 0-9, equivalent to [^0-9].

\w matches any letter, digit or underscore, equivalent to [A-Za-z0-9_].

\W matches any character other than letters, digits and underscores, equivalent to [^A-Za-z0-9_].

\s matches whitespace (including line breaks, tabs, spaces, etc.), equivalent to [ \t\r\n\v\f].

\S matches non-whitespace characters, equivalent to [^ \t\r\n\v\f].

\b matches a word boundary.

\B matches a non-word boundary, that is, a position inside a word.

// \s example
/\s\w*/.exec('hello world') // [" world"]

// \b example
/\bworld/.test('hello world') // true
/\bworld/.test('hello-world') // true
/\bworld/.test('helloworld') // false

// \B example
/\Bworld/.test('hello-world') // false
/\Bworld/.test('helloworld') // true

Normally, the dot stops matching when it reaches a newline character (\n).

var html = "<b>Hello</b>\n<i>world!</i>";
/.*/.exec(html)[0]// "<b>Hello</b>"

In the above code, the html string contains a newline character. Because the dot character (.) does not match newlines, the result may not be what was intended. In that case, a character class that includes \s can be used to cover line breaks.

var html = "<b>Hello</b>\n<i>world!</i>";
/[\S\s]*/.exec(html)[0]// "<b>Hello</b>\n<i>world!</i>"

In the above code, [\S\s] matches every character.
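In engines that support ES2018, the s modifier (dotAll) is another way to make the dot match newlines. The sketch below compares it with the character class approach, assuming such an engine is available.

```javascript
const html = '<b>Hello</b>\n<i>world!</i>';

const defaultDot = /.*/.exec(html)[0];      // "<b>Hello</b>"
const classTrick = /[\S\s]*/.exec(html)[0]; // whole string, newline included
const dotAll = /.*/s.exec(html)[0];         // whole string, newline included

console.log(classTrick === html, dotAll === html); // true true
```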

5.5 Repetition

The exact number of repetitions of a pattern is expressed in curly braces ({}). {n} means exactly n repetitions, {n,} means at least n repetitions, and {n,m} means no fewer than n and no more than m repetitions.

/lo{2}k/.test('look') // true
/lo{2,5}k/.test('looook') // true

In the above code, the first pattern requires o to occur exactly twice in a row, and the second pattern requires o to occur 2 to 5 times in a row.

5.6 Quantifiers

  • ?: The question mark indicates that a pattern occurs 0 or 1 times, equivalent to {0,1}.
  • *: The asterisk indicates that a pattern occurs 0 or more times, equivalent to {0,}.
  • +: The plus sign indicates that a pattern occurs 1 or more times, equivalent to {1,}.
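A quick sketch of the three quantifiers in action; the words used are only illustrative.

```javascript
// ?: the u is optional
console.log(/^colou?r$/.test('color'), /^colou?r$/.test('colour')); // true true

// *: zero or more o's
console.log(/^go*gle$/.test('ggle'), /^go*gle$/.test('gooogle'));   // true true

// +: at least one o
console.log(/^go+gle$/.test('ggle'), /^go+gle$/.test('google'));    // false true
```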

5.7 Greedy mode

By default, the three quantifiers in the previous section match as many characters as possible, that is, they keep going until the next character no longer satisfies the pattern. This is called greedy mode.

var s = 'aaa';
s.match(/a+/) // ["aaa"]

In the above code, the pattern is /a+/, which matches one or more a's. How many will it match? Because greedy mode is the default, it keeps matching until no further a appears, so the result is all 3 a's.

To switch from greedy mode to non-greedy mode, add a question mark after the quantifier.

var s = 'aaa';
s.match(/a+?/) // ["a"]

Besides the non-greedy plus sign, there are also a non-greedy asterisk (*?) and a non-greedy question mark (??).

  • +?: Indicates that a pattern occurs 1 or more times, matched in non-greedy mode.

  • *?: Indicates that a pattern occurs 0 or more times, matched in non-greedy mode.

  • ??: Indicates that a pattern occurs 0 or 1 times, matched in non-greedy mode.
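The difference between the two modes is easiest to see on a string with several possible end points. A minimal sketch with a made-up HTML fragment:

```javascript
const html = '<b>x</b><i>y</i>';

// Greedy: .+ runs to the last > in the string.
const greedy = html.match(/<.+>/)[0];  // "<b>x</b><i>y</i>"

// Non-greedy: .+? stops at the first > that completes the match.
const lazy = html.match(/<.+?>/)[0];   // "<b>"

console.log(greedy, lazy);
```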

5.8 Group matching

  • (1) Overview
    Parentheses in a regular expression form a group, and the pattern inside the parentheses is matched as a unit.
/fred+/.test('fredd') // true
/(fred)+/.test('fredfred') // true

In the above code, the first pattern has no parentheses, so + applies only to the letter d. The second pattern has parentheses, so + applies to the whole word fred.

Here is another example of group capture.

var m = 'abcabc'.match(/(.)b(.)/);
m// ['abc', 'a', 'c']   

In the above code, the regular expression /(.)b(.)/ uses two pairs of parentheses. The first captures a and the second captures c.

Note that when using group matching, the g modifier should not be used at the same time, otherwise the match method will not capture the contents of the group.

var m = 'abcabc'.match(/(.)b(.)/g);
m // ['abc', 'abc']
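If both global matching and captured groups are needed, one option is String.prototype.matchAll (ES2020), which yields an exec-style result for every match. A minimal sketch, assuming an engine that supports it:

```javascript
const matches = [...'abcabc'.matchAll(/(.)b(.)/g)];

// Each entry is an exec-style array: full match first, then the captured groups.
const summary = matches.map(m => ({
  full: m[0],
  groups: [m[1], m[2]],
  index: m.index,
}));

console.log(summary);
// [ { full: 'abc', groups: [ 'a', 'c' ], index: 0 },
//   { full: 'abc', groups: [ 'a', 'c' ], index: 3 } ]
```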

Inside the regular expression, \n can also be used to refer back to the content matched by a pair of parentheses. Here n is a natural number starting from 1 that identifies the parentheses by their order.

/(.)b(.)\1b\2/.test("abcabc")// true

In the above code, \1 refers to the content matched by the first pair of parentheses (i.e. a), and \2 refers to the content matched by the second pair (i.e. c).
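A typical use of such a back-reference is spotting an accidentally repeated word. A minimal sketch with a made-up sentence:

```javascript
// \1 must repeat exactly what the first group captured.
const doubled = /\b(\w+)\s+\1\b/;

const hit = doubled.exec('this is is a test');
console.log(hit[0], '|', hit[1], '|', hit.index); // is is | is | 5
console.log(doubled.test('this is a test'));      // false
```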

  • (2) Non-capturing group
    (?:x) is called a non-capturing group. The content matched by the group is not returned, that is, this pair of parentheses does not appear in the match result.

Consider what non-capturing groups are for. To match foo or foofoo, the regular expression would be written as /(foo){1,2}/, but this uses up a capture group. A non-capturing group changes it to /(?:foo){1,2}/, which matches the same strings but does not return the content of the parentheses separately.

var m = 'abc'.match(/(?:.)b(.)/);
m // ["abc", "c"]

The pattern in the above code uses a total of two parentheses. The first bracket is a non capture group, so there is no first bracket in the final returned result, only the content matched by the second bracket.
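The foo example described above can be checked side by side. A minimal sketch:

```javascript
const capturing = 'foofoo'.match(/(foo){1,2}/);      // ["foofoo", "foo"]
const nonCapturing = 'foofoo'.match(/(?:foo){1,2}/); // ["foofoo"]

// Same overall match, but only the capturing version adds an extra entry.
console.log(capturing.length, nonCapturing.length);  // 2 1
```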

  • (3) Lookahead assertion
    x(?=y) is called a positive lookahead. x matches only if it is followed by y, and y is not included in the returned result. For example, to match a number followed by a percent sign, write /\d+(?=%)/.
    In a lookahead assertion, the part in parentheses is not returned.
var m = 'abc'.match(/b(?=c)/);
m // ["b"]

The above code uses a lookahead assertion. b is matched because it is followed by c, but the c inside the parentheses is not returned.
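The percent-sign pattern mentioned above can be tried on a sample string. A minimal sketch; the text is made up for illustration.

```javascript
const text = '50% of 200';

// The % is checked by the lookahead but is not part of the match.
const percent = text.match(/\d+(?=%)/);
console.log(percent[0], percent.index); // 50 0

// Because the % is not consumed, a replacement leaves it in place.
console.log(text.replace(/\d+(?=%)/, 'N')); // N% of 200
```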

  • (4) Negative lookahead assertion
    x(?!y) is called a negative lookahead. x matches only if it is not followed by y, and y is not included in the returned result. For example, to match a number that is not followed by a percent sign, write /\d+(?!%)/.
/\d+(?!\.)/.exec('3.14')// ["14"]

In the above code, the regular expression matches only digits that are not followed by a decimal point, so the result returned is 14.
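One thing to watch with negative lookahead after a quantifier is backtracking: the engine may give back characters until the assertion passes. The sketch below shows this on a made-up string, along with one possible way to guard against it.

```javascript
const text = '50% of 200';

// Pitfall: \d+ backtracks from "50" to "5", and "5" is followed by "0", not "%".
console.log(text.match(/\d+(?!%)/)[0]);     // 5

// Also forbidding a following digit prevents the partial match.
console.log(text.match(/\d+(?![\d%])/)[0]); // 200
```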

6. Practical examples

6.1 Remove spaces at the beginning and end of a string

var str = '  #id div.class  ';
str.replace(/^\s+|\s+$/g, '')   // "#id div.class"

6.2 Verification of mobile phone number

// ^ and $ anchor the pattern so that exactly 11 digits are required
var reg = /^1[24578]\d{9}$/;
reg.test('15455456899'); // true
reg.test('23455456899'); // false

6.3 Replace the mobile phone number with ***

var reg = /1[24578]\d{9}/;
var str = 'Name: Zhang San mobile phone: 18210999999 gender: Male';
str.replace(reg, '***') // "Name: Zhang San mobile phone: *** gender: Male"

6.4 Matching page tags

var strHtlm = 'Xiaozhi Xiaozhi<div>222222@.qq.com</div>Xiaozhi Xiaozhi';
var reg = /<(.+)>.+<\/\1>/;
strHtlm.match(reg); // ["<div>222222@.qq.com</div>", "div"]

6.5 Replace sensitive words

let str = "The Communist Party of China, the people's Liberation Army, the people's Republic of China";
// The i modifier is needed so that army also matches Army
let r = str.replace(/China|army/gi, input => {
    let t = '';
    for (let i = 0; i < input.length; i++) {
        t += '*';
    }
    return t;
});
console.log(r); // The Communist Party of *****, the people's Liberation ****, the people's Republic of *****

6.6 Thousands separator

let str = '100002003232322';
let r = str.replace(/(\d)(?=(?:\d{3})+$)/g, '$1,');
console.log(r); //100,002,003,232,322
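The pattern above anchors on the end of the string, so it assumes the input consists of digits only; a decimal point would change the result. The following sketch wraps it in a hypothetical helper, formatThousands, that groups only the integer part.

```javascript
// Hypothetical helper: group the integer part and leave any decimals untouched.
function formatThousands(value) {
  const [intPart, fracPart] = String(value).split('.');
  const grouped = intPart.replace(/(\d)(?=(?:\d{3})+$)/g, '$1,');
  return fracPart === undefined ? grouped : grouped + '.' + fracPart;
}

console.log(formatThousands('100002003232322')); // 100,002,003,232,322
console.log(formatThousands(1234567.891));       // 1,234,567.891
console.log(formatThousands(-1234));             // -1,234
console.log(formatThousands(999));               // 999
```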

Tags: Javascript regex

Posted by mikeq on Thu, 05 May 2022 08:30:53 +0300