An in-depth study of regular expressions

Contents

Chapter 1: Introduction to regular expression character matching

Chapter 2: Introduction to regular expression position matching

Chapter 3: The role of parentheses in regular expressions

What does each chapter cover?

A regular expression is a matching pattern: it matches either characters or positions.

Chapters 1 and 2 explain the basics of regular expressions from this point of view.

Parentheses let a regular expression capture data, which can then be used either through group references in the API or through back references inside the regular expression itself.

This is the topic of Chapter 3, which explains the role of parentheses.

=====================================================================================

Chapter 1: Introduction to regular expression character matching

Regular expressions are matching patterns that match either characters or positions. Please remember this sentence.

Contents:

  1. Two kinds of fuzzy matching
  2. Character classes
  3. Quantifiers
  4. Alternation (branch structure)
  5. Case analysis

1. Two kinds of fuzzy matching

A regular expression that could only match exactly would not be very useful. For example, /hello/ can only match the substring "hello" in a string.

var regex = /hello/;
console.log( regex.test("hello") ); //true

Regular expressions are powerful because they can perform fuzzy matching.

Fuzzy matching is "fuzzy" in two directions: horizontal and vertical.

1.1 Horizontal fuzzy matching

Horizontal fuzziness means that the length of the matched string is not fixed; it can vary.

This is achieved with quantifiers. For example, {m,n} means at least m and at most n consecutive occurrences.

For example, /ab{2,5}c/ matches a string whose first character is "a", followed by 2 to 5 "b" characters, and finally a "c". The test is as follows:

var regex = /ab{2,5}c/g;
var string = "abc abbc abbbc abbbbc abbbbbc abbbbbbc";
console.log(string.match(regex));
//["abbc", "abbbc", "abbbbc", "abbbbbc"]

Without the g flag, only the first match is returned:

var regex = /ab{2,5}c/;
var string = "abc abbc abbbc abbbbc abbbbbc abbbbbbc";
console.log(string.match(regex));
//["abbc", index: 4, input: "abc abbc abbbc abbbbc abbbbbc abbbbbbc", groups: undefined]

1.2 Vertical fuzzy matching

Vertical fuzziness means that, at a given position in the matched string, the character is not fixed; several different characters are possible.

This is achieved with character classes. For example, [abc] means the character can be any one of "a", "b" and "c".

For example, /a[123]b/ matches the following three strings: "a1b", "a2b" and "a3b". The test is as follows:

var regex = /a[123]b/g;
var string = "a0b a1b a2b a3b a4b";
console.log( string.match(regex) );
// ["a1b", "a2b", "a3b"]

2. Character classes

It should be emphasized that although it is called a character group (character class), it matches only a single character. For example, [abc] matches one character, which can be "a", "b" or "c".

2.1 Range notation

What if the character class contains many characters? Range notation can be used.

For example, [123456abcdefGHIJKLM] can be written as [1-6a-fG-M], using the hyphen - to abbreviate each run.

Because the hyphen has this special meaning, how do we match any one of "a", "-" and "z"?

It cannot be written as [a-z], because that means any lowercase letter.

It can be written as [-az], [az-] or [a\-z]: put the hyphen at the beginning, put it at the end, or escape it. In short, anything that stops the engine from reading it as a range will do.
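The three spellings can be checked side by side. This is a minimal sketch; the variable names and the sample string are illustrative and not part of the original text.

```javascript
// Three ways to put a literal hyphen in a character class.
const atStart = /[-az]/g;   // hyphen first
const atEnd = /[az-]/g;     // hyphen last
const escaped = /[a\-z]/g;  // hyphen escaped
const range = /[a-z]/g;     // a range: any lowercase letter, but not "-"

const sample = "a-z m";
console.log(sample.match(atStart)); // ["a", "-", "z"]
console.log(sample.match(atEnd));   // ["a", "-", "z"]
console.log(sample.match(escaped)); // ["a", "-", "z"]
console.log(sample.match(range));   // ["a", "z", "m"]
```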

2.2 Negated character classes

Another case of vertical fuzzy matching is when a character can be anything except, say, "a", "b" or "c".

This is what a negated character class (also called an excluded character group) is for. For example, [^abc] matches any character except "a", "b" and "c". A ^ (caret) in the first position of the character class expresses negation.

Range notation works here as well, for example [^a-z].
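A short sketch of a negated range, with illustrative sample strings. It also shows that the caret only negates when it comes first; anywhere else in the class it is an ordinary character.

```javascript
// Negated range: anything that is not a lowercase letter.
const notLowercase = /[^a-z]/g;
// A caret that is not in first position is a literal "^".
const caretInside = /[a^]/g;

console.log("ab1-Cd".match(notLowercase)); // ["1", "-", "C"]
console.log("a^b".match(caretInside));     // ["a", "^"]
```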

2.3 Common shorthands

With the concept of character classes in hand, we can understand some common symbols, because they are built-in shorthands.

\d is [0-9]. It matches a digit. Mnemonic: d stands for "digit".

\D is [^0-9]. It matches any character other than a digit.

\w is [0-9a-zA-Z_]. It matches digits, uppercase and lowercase letters, and the underscore. Mnemonic: w stands for "word"; these are also called word characters.

\W is [^0-9a-zA-Z_]. It matches non-word characters.

\s is roughly [ \t\v\n\r\f]. It matches whitespace characters, including space, horizontal tab, vertical tab, line feed, carriage return and form feed (JavaScript also counts some Unicode whitespace, such as \u00a0, \u2028 and \u2029). Mnemonic: s is the first letter of "space".

\S is the negation of \s, roughly [^ \t\v\n\r\f]. It matches non-whitespace characters.

. is [^\n\r\u2028\u2029]. It is the wildcard and matches almost any character, except line feed, carriage return, line separator and paragraph separator. Mnemonic: think of an ellipsis ..., where each dot can be read as a placeholder for anything.

What if you want to match any character at all? You can use any of [\d\D], [\w\W], [\s\S] and [^].
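The difference between the wildcard and a true "any character" class shows up as soon as the input contains a line break. A minimal sketch with an illustrative two-line string; the s (dotAll) flag in the last line is an ES2018 addition that the text above does not cover.

```javascript
const text = "a\nb";

console.log(/a.b/.test(text));      // false: "." does not match "\n"
console.log(/a[\s\S]b/.test(text)); // true
console.log(/a[\d\D]b/.test(text)); // true
console.log(/a[\w\W]b/.test(text)); // true
console.log(/a[^]b/.test(text));    // true
console.log(/a.b/s.test(text));     // true: with the s flag, "." matches "\n" too (ES2018)
```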

3. Quantifiers

A quantifier expresses repetition. Once you have mastered the exact meaning of {m,n}, you only need to remember a few shorthands.

3.1 Shorthands

{m,} means at least m occurrences.

{m} is equivalent to {m,m} and means exactly m occurrences.

? is equivalent to {0,1} and means present or absent. Mnemonic: the question mark asks "is it there?"

+ is equivalent to {1,} and means at least one occurrence. Mnemonic: the plus sign means adding more, and you have to have one before you can add.

* is equivalent to {0,} and means any number of occurrences, including none. Mnemonic: look at the stars in the sky. There may be none, there may be a few scattered ones, and there may be too many to count.
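The equivalences can be spot-checked by running each shorthand and its {m,n} form against the same inputs. This is only a sketch over a handful of illustrative samples, not a proof.

```javascript
// Each pair holds a shorthand and its explicit {m,n} spelling.
const pairs = [
  [/^a?$/, /^a{0,1}$/],
  [/^a+$/, /^a{1,}$/],
  [/^a*$/, /^a{0,}$/],
  [/^a{2}$/, /^a{2,2}$/],
];
const samples = ["", "a", "aa", "aaa", "b"];

// true when both spellings agree on every sample
const allAgree = pairs.every(([shorthand, explicit]) =>
  samples.every((s) => shorthand.test(s) === explicit.test(s))
);
console.log(allAgree); // true
```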

3.2 Greedy matching and lazy matching

Take the following example:

var regex = /\d{2,5}/g;
var string = "123 1234 12345 123456";
console.log( string.match(regex) ); 
// ["123", "1234", "12345", "12345"]

Here the regular expression /\d{2,5}/ means that a digit appears 2 to 5 times in a row. It matches runs of 2, 3, 4 or 5 consecutive digits.

But it is greedy: it matches as many as it can. Offer it six and it takes five; offer it three and it takes three. As long as it stays within its limits, the more the better.

Sometimes greed is not a good thing. Lazy matching, by contrast, matches as few as possible:

var regex = /\d{2,5}?/g;
var string = "123 1234 12345 123456";
console.log( string.match(regex) );
// ["12", "12", "34", "12", "34", "12", "34", "56"]

Here /\d{2,5}?/ says that although 2 to 5 occurrences are allowed, it stops as soon as 2 are enough and does not try for more.

A quantifier is made lazy by adding a question mark after it, so the lazy forms are:

{m,n}?
{m,}?
??
+?
*?

Mnemonic for lazy matching: the question mark after the quantifier asks, "Are you satisfied yet, or still greedy?"
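A common place where greed gets in the way is matching delimited fragments. The sketch below uses an illustrative HTML-like string; it is not meant as a robust way to parse markup, only as a contrast between + and +?.

```javascript
const html = "<b>bold</b> and <i>italic</i>";

// Greedy: .+ runs to the end and backtracks only as far as the last ">".
const greedy = html.match(/<.+>/g);
// Lazy: .+? stops at the first ">" it can reach.
const lazy = html.match(/<.+?>/g);

console.log(greedy); // ["<b>bold</b> and <i>italic</i>"]
console.log(lazy);   // ["<b>", "</b>", "<i>", "</i>"]
```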

4. Alternation (multiple branches)

A single pattern can do horizontal and vertical fuzzy matching. Alternation additionally lets the regular expression match any one of several sub-patterns.

The general form is (p1|p2|p3), where p1, p2 and p3 are sub-patterns separated by | (the pipe character), meaning any one of them.

For example, to match "good" or "nice", you can use /good|nice/. The test is as follows:

var regex = /good|nice/g;
var string = "good idea, nice try.";
console.log( string.match(regex) ); 
// ["good", "nice"]

There is one thing to watch out for, though. For example, when /good|goodbye/ is used to match the string "goodbye", the result is "good":

var regex = /good|goodbye/g;
var string = "goodbye";
console.log( string.match(regex) ); 
//["good"]

Change the regular expression to /goodbye|good/, and the result is:

var regex = /goodbye|good/g;
var string = "goodbye";
console.log( string.match(regex) ); 
//["goodbye"]

In other words, alternation is also lazy: once an earlier branch matches, the later branches are not tried.
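"Not tried" holds only as long as the rest of the pattern is satisfied. If something after the alternation fails, the engine backtracks and tries the next branch. A small sketch, using anchors (covered in Chapter 2) to force that situation; the variable names are illustrative.

```javascript
const word = "goodbye";

// Unanchored: the first branch matches, so the second is never tried.
const first = /good|goodbye/.exec(word)[0];
console.log(first); // "good"

// Anchored: "good" matches, then $ fails, so the engine backtracks
// and tries "goodbye", which succeeds.
const anchored = /^(good|goodbye)$/.test(word);
console.log(anchored); // true
```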

Chapter 2: Introduction to regular expression position matching

Regular expressions are matching patterns that match either characters or positions. Please remember this sentence.

However, most people pay little attention to position matching when learning regular expressions.

This chapter covers the essentials of matching positions.

The contents include:

  1. What is a position?
  2. How to match a position
  3. Properties of positions
  4. Analysis of several application examples

1. What is a position?

A position is the spot between two adjacent characters, including the spots before the first character and after the last one.

2. How to match a position

In ES5, there are 6 anchors:

^ $ \b \B (?=p) (?!p)

2.1 ^ and $

^ (caret) matches the start of the input; in multiline mode it also matches the start of each line.

$ (dollar sign) matches the end of the input; in multiline mode it also matches the end of each line.

For example, we can replace the start and end positions of a string with "#" (a position can be replaced with characters!):

var result = "hello".replace(/^|$/g, '#');
console.log(result);
// #hello#

In multiline mode (the m flag), the two anchors work line by line, which deserves our attention:

var result = "I\nlove\njavascript".replace(/^|$/gm, '#');
console.log(result);
/**
*#I#
*#love#
*#javascript#
*/

2.2 \b and \B

\b is a word boundary: the position between \w and \W, including the position between \w and ^ and the position between \w and $.

For example, the \b positions in the file name "[JS] Lesson_01.mp4" are as follows:

var result = "[JS] Lesson_01.mp4".replace(/\b/g, '#');
console.log(result);
//[#JS#] #Lesson_01#.#mp4#

Once the concept of \b is clear, \B is easy to understand.

\B is the opposite of \b: a position that is not a word boundary. Take all the positions in the string, remove the \b positions, and what remains are the \B positions.

Specifically, it is the position between \w and \w, between \W and \W, between ^ and \W, or between \W and $.

For example, replacing every \B in the example above with "#":

var result = "[JS] Lesson_01.mp4".replace(/\B/g, '#');
console.log(result);
// #[J#S]# L#e#s#s#o#n#_#0#1.m#p#4
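In practice \b is most often used to match whole words. A minimal sketch with an illustrative sentence: the plain pattern finds "cat" inside other words, \b on both sides keeps only the standalone word, and \B on the left keeps only a "cat" that ends a longer word.

```javascript
const sentence = "cat concat cats category";

const anywhere = sentence.match(/cat/g);
const wholeWord = sentence.match(/\bcat\b/g);
const wordEnding = /\Bcat\b/.exec(sentence);

console.log(anywhere.length);  // 4
console.log(wholeWord);        // ["cat"]
console.log(wordEnding.index); // 7, the "cat" at the end of "concat"
```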

2.3 (?=p) and (?!p)

(?=p), where p is a sub-pattern, matches the position immediately before something that matches p.

For example, (?=l) is the position before an "l" character:

var result = "hello".replace(/(?=l)/g, '#');
console.log(result); 
// he#l#lo

(?!p) is the opposite of (?=p): it matches every position that is not immediately before something that matches p. For example:

var result = "hello".replace(/(?!l)/g, '#');
console.log(result);
// #h#ell#o#

3. Properties of positions

A position can be thought of as the empty string "".

For example, the string "hello" is equivalent to the following form:

"hello" == "" + "h" + "" + "e" + "" + "l" + "" + "l" + "" + "o" + "";

It is also equivalent to:

"hello" == "" + "" + "hello"

Therefore, there is no problem in writing /^hello$/ as /^^hello$$$/:

var result = /^^hello$$$/.test("hello");
console.log(result); 
// true
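Because positions consume no characters, any number of position assertions can be stacked at the same spot. The sketch below pushes this to an extreme on purpose; the pattern is contrived and only meant to illustrate the point. The last lines use an empty pattern, which matches at every position, to make all six positions of "hello" visible.

```javascript
// Several assertions piled onto the same positions; all of them hold.
const stacked = /(?=he)^^he(?=\w)llo$\b\b$/.test("hello");
console.log(stacked); // true

// An empty pattern matches at every position in the string.
const everyPosition = "hello".replace(new RegExp("", "g"), "#");
console.log(everyPosition); // #h#e#l#l#o#
```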

4. Related cases

4.1 A regular expression that matches nothing

/.^/

This requires a character to be followed by the start of the input, which is impossible.
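A quick way to convince yourself is to run the pattern against a few inputs. The list below is illustrative rather than exhaustive.

```javascript
const never = /.^/;
const inputs = ["", "a", "hello", "a\nb", "^"];

// "." must consume a character first, so ^ is never at the start of the input.
const anyMatch = inputs.some((s) => never.test(s));
console.log(anyMatch); // false
```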

4.2 Thousands separators in numbers

For example, change "12345678" into "12,345,678".

That is, the appropriate positions need to be replaced with ",".

How do we approach this?

4.2.1 Produce the last comma

Use (?=\d{3}$):

var result = "12345678".replace(/(?=\d{3}$)/g, ',')
console.log(result); 
// 12345,678

4.2.2 Produce all the commas

A comma belongs wherever the digits that follow form groups of three, that is, \d{3} occurs at least once before the end.

The quantifier + expresses this:

var result = "12345678".replace(/(?=(\d{3})+$)/g, ',')
console.log(result);
// 12,345,678

4.2.3 Handling other cases

After writing the regular expression, we should verify it against a few more cases. This is where a problem shows up:

var result = "123456789".replace(/(?=(\d{3})+$)/g, ',')
console.log(result);
// ,123,456,789

The regular expression above only says: if the number of digits from this position to the end is a multiple of 3, replace this position with a comma. The start of the string also qualifies, which is why the problem arises.

How do we solve it? We require that the matched position is not the start.

We know that ^ matches the start, but how do we require that a position is not the start?

Easy: (?!^). Did you think of it? The test is as follows:

var string1 = "12345678",
    string2 = "123456789",
    reg = /(?!^)(?=(\d{3})+$)/g;

var result = string1.replace(reg, ',');
var result2 = string2.replace(reg, ',');
console.log(result); // 12,345,678
console.log(result2); // 123,456,789
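The final pattern can be wrapped in a small helper. The function names below are hypothetical, and the first helper assumes its input consists only of digits (no sign, no decimal part). The second variant is a sketch for text containing several numbers: it swaps ^ and $ for the word-boundary anchors from section 2.2, using \B for "not at the start of a number" and \b for "end of a number".

```javascript
// Assumes `digits` is a non-negative integer or a string of digits only.
function formatThousands(digits) {
  return String(digits).replace(/(?!^)(?=(\d{3})+$)/g, ",");
}

// Variant for several whitespace-separated numbers in one string.
function formatAllNumbers(text) {
  return text.replace(/\B(?=(\d{3})+\b)/g, ",");
}

console.log(formatThousands("123"));       // 123
console.log(formatThousands(1234));        // 1,234
console.log(formatThousands("123456789")); // 123,456,789
console.log(formatAllNumbers("12345678 123456789")); // 12,345,678 123,456,789
```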

Chapter 3: The role of parentheses in regular expressions

How comfortably someone uses parentheses is a good indirect measure of how well they have mastered regular expressions.

The role of parentheses can be summed up in one sentence: parentheses provide groups that we can refer to.

A group can be referred to in two ways: from JavaScript, or from within the regular expression itself.

Although the content of this chapter is relatively simple, it is worth covering at some length.

The contents include:

  1. Grouping and branch structure
  2. Capturing groups
  3. Back references
  4. Non-capturing groups
  5. Related cases

1. Grouping and branch structure

These two are the most intuitive and basic functions of parentheses.

1.1 Grouping

We know that /a+/ matches consecutive "a" characters; to match consecutive "ab" sequences, we need /(ab)+/.

The parentheses provide grouping, so that the quantifier + applies to "ab" as a whole. The test is as follows:

var regex = /(ab)+/g;
var string = "ababa abbb ababab";
console.log( string.match(regex) );
// ["abab", "ab", "ababab"]

1.2 Branch structure

In the alternation (p1|p2), the role of the parentheses is also self-evident: they delimit all the possible sub-expressions.

For example, to match either of the following strings:

I love JavaScript

I love Regular Expression

the regular expression can be written as:

var regex = /^I love (JavaScript|Regular Expression)$/;
console.log( regex.test("I love JavaScript") ); // true
console.log( regex.test("I love Regular Expression") ); // true

If you remove the parentheses, giving /^I love JavaScript|Regular Expression$/, the two branches become "^I love JavaScript" and "Regular Expression$", so the regular expression matches any string that starts with "I love JavaScript" or ends with "Regular Expression". Of course, this is not what we want.
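The difference is easy to see with inputs that should be rejected. A minimal sketch; the test strings are illustrative.

```javascript
const loose = /^I love JavaScript|Regular Expression$/;
const strict = /^I love (JavaScript|Regular Expression)$/;

// Without parentheses, each anchor belongs to only one branch.
console.log(loose.test("I love JavaScript and more")); // true
console.log(loose.test("I hate Regular Expression"));  // true

// With parentheses, both anchors apply to both branches.
console.log(strict.test("I love JavaScript and more")); // false
console.log(strict.test("I hate Regular Expression"));  // false
```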

2. Referencing groups

This is an important role of parentheses. With it, we can extract data and perform more powerful replacements.

To take advantage of it, you must use the API of the host environment.

2.1 Extracting data

Take dates as an example. Assuming the format is yyyy-mm-dd, we can first write a simple regular expression:

var regex = /\d{4}-\d{2}-\d{2}/;

Then add parentheses:

var regex = /(\d{4})-(\d{2})-(\d{2})/;

Why use this regular expression?

For example, it lets you extract the year, month and day:

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
console.log( string.match(regex) ); 
// ["2017-06-12", "2017", "06", "12", index: 0, input: "2017-06-12", groups: undefined]

match returns an array. The first element is the overall match, followed by the content matched by each group (each pair of parentheses); the array also carries the index of the match and the input text as properties. (Note: if the regular expression has the g flag, the array returned by match has a different format.)
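The difference with the g flag can be sketched as follows. With g, match returns only the overall matches and drops the groups; String.prototype.matchAll (ES2020, so not available in older engines) is one way to keep the groups for every match. The sample string and variable names are illustrative.

```javascript
const dates = "2017-06-12 and 2018-01-02";
const re = /(\d{4})-(\d{2})-(\d{2})/g;

// With g: only the full matches, no groups, no index.
const fullMatches = dates.match(re);
console.log(fullMatches); // ["2017-06-12", "2018-01-02"]

// matchAll yields one match array per match, groups and index included.
const details = Array.from(dates.matchAll(re), (m) => ({
  year: m[1],
  month: m[2],
  day: m[3],
  index: m.index,
}));
console.log(details);
```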

In addition, you can also use the exec method of the regular expression object:

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
console.log( regex.exec(string) );
// ["2017-06-12", "2017", "06", "12", index: 0, input: "2017-06-12", groups: undefined]

You can also read the groups from the static properties $1 to $9 of the RegExp constructor (a legacy feature that is best avoided in new code):

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
 
regex.test(string); // any regex operation will do, for example:
//regex.exec(string);
//string.match(regex);
 
console.log(RegExp.$1); // "2017"
console.log(RegExp.$2); // "06"
console.log(RegExp.$3); // "12"
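Since RegExp.$1 to RegExp.$9 are shared global state, reading the groups directly from the match array is usually the safer choice. A minimal sketch; parseDate is a hypothetical helper name, not part of any API.

```javascript
// Returns the captured parts, or null when the text does not match.
function parseDate(text) {
  const match = /(\d{4})-(\d{2})-(\d{2})/.exec(text);
  if (match === null) {
    return null;
  }
  // Index 0 is the whole match; the groups start at index 1.
  const [, year, month, day] = match;
  return { year, month, day };
}

console.log(parseDate("2017-06-12")); // { year: '2017', month: '06', day: '12' }
console.log(parseDate("not a date")); // null
```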

2.2 Replacement

For example, how do we convert the yyyy-mm-dd format to mm/dd/yyyy?

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
var result = string.replace(regex, "$2/$3/$1");
console.log(result); 
//  06/12/2017

In the second argument of replace, $1, $2 and $3 refer to the corresponding groups. This is equivalent to the following form:

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
var result = string.replace(regex, function() {
    return RegExp.$2 + "/" + RegExp.$3 + "/" + RegExp.$1;
});
console.log(result); 
// 06/12/2017

Also equivalent to:

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
var result = string.replace(regex, function(match, year, month, day) {
    return month + "/" + day + "/" + year;
});
console.log(result);
// 06/12/2017
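The "groups: undefined" entry in the match output shown earlier belongs to named capture groups, an ES2018 feature that this article does not otherwise cover. As a brief sketch, the same replacement can refer to groups by name instead of by number; the group names here are chosen for illustration.

```javascript
const named = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

const m = "2017-06-12".match(named);
console.log(m.groups.year); // "2017"

// In the replacement string, $<name> refers to a named group.
const swapped = "2017-06-12".replace(named, "$<month>/$<day>/$<year>");
console.log(swapped); // 06/12/2017
```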

3. Back references

In addition to referring to groups through the API, you can also refer to groups within the regular expression itself. However, you can only refer to a group that appears earlier, hence the name back reference.

Let's take dates as the example again.

Suppose we want a regular expression that supports the following three formats:

2016-06-12

2016/06/12

2016.06.12

The first thing you might think of is:

var regex = /\d{4}(-|\/|\.)\d{2}(-|\/|\.)\d{2}/;
var string1 = "2017-06-12";
var string2 = "2017/06/12";
var string3 = "2017.06.12";
var string4 = "2016-06/12";
console.log( regex.test(string1) ); // true
console.log( regex.test(string2) ); // true
console.log( regex.test(string3) ); // true
console.log( regex.test(string4) ); // true

Here / and . need to be escaped. Although this matches the required formats, it also matches data such as "2016-06/12".

What if we want the two separators to be the same? Then we need a back reference:

var regex = /\d{4}(-|\/|\.)\d{2}\1\d{2}/;
var string1 = "2017-06-12";
var string2 = "2017/06/12";
var string3 = "2017.06.12";
var string4 = "2016-06/12";
console.log( regex.test(string1) ); // true
console.log( regex.test(string2) ); // true
console.log( regex.test(string3) ); // true
console.log( regex.test(string4) ); // false

Note the \1 in it, which refers to the earlier group (-|\/|\.). Whatever that group matched (for example -), \1 matches that same specific character.

Once the meaning of \1 is clear, \2 and \3 are easy to understand: they refer to the second and third groups respectively.
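Another typical use of a back reference is finding a word that is accidentally repeated. A minimal sketch with an illustrative sentence: \1 must match exactly what (\w+) captured, and the \b anchors keep the match on whole words.

```javascript
const doubled = /\b(\w+) \1\b/g;
const sentenceWithRepeats = "this is is a test test";

const repeats = sentenceWithRepeats.match(doubled);
console.log(repeats); // ["is is", "test test"]
```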

At this point, you will probably have three questions.

3.1 What about nested parentheses?

Groups are numbered by the order of their opening parentheses. For example:

var regex = /^((\d)(\d(\d)))\1\2\3\4$/;
var string = "1231231233";
console.log( regex.test(string) ); // true
console.log( RegExp.$1 ); // 123
console.log( RegExp.$2 ); // 1
console.log( RegExp.$3 ); // 23
console.log( RegExp.$4 ); // 3

3.2 What does \10 mean?

Another question is whether \10 means the 10th group, or \1 followed by 0.

The answer is the former (provided the regular expression has at least 10 groups), although \10 is rarely seen in practice. The test is as follows:

var regex = /(1)(2)(3)(4)(5)(6)(7)(8)(9)(#) \10+/;
var string = "123456789# ######"
console.log( regex.test(string) );//true

3.3 What happens if you reference a group that does not exist?

A back reference refers to an earlier group, but when we reference a nonexistent group, the regular expression does not report an error; it simply matches the escaped character itself. For example, \2 matches "\2". Note that "\2" here is the escape sequence for the character with code 2, not the digit "2". (The example relies on legacy octal escapes, so it only runs in non-strict code.)

var regex = /\1\2\3\4\5\6\7\8\9/;
console.log( regex.test("\1\2\3\4\5\6\7\8\9") ); // true
console.log( "\1\2\3\4\5\6\7\8\9".split("") );
// ["\u0001", "\u0002", "\u0003", "\u0004", "\u0005", "\u0006", "\u0007", "8", "9"]
// (consoles may display the control characters as "")

4. Non-capturing groups

The groups discussed so far capture the data they match for later reference, so they are called capturing groups.

If you only want the basic grouping function of parentheses and never refer to the group, neither in the API nor through a back reference, you can use a non-capturing group (?:p). For example, the first example in this chapter can be rewritten as:

var regex = /(?:ab)+/g;
var string = "ababa abbb ababab";
console.log( string.match(regex) );
// ["abab", "ab", "ababab"]
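With the g flag the two versions return the same array, so the difference is easier to see without it. In this sketch (variable names are illustrative), the capturing version stores the last "ab" it matched as group 1, while the non-capturing version stores nothing.

```javascript
const capturing = "ababab".match(/(ab)+/);
const nonCapturing = "ababab".match(/(?:ab)+/);

console.log(capturing.length, capturing[1]); // 2 ab
console.log(nonCapturing.length);            // 1
```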

 

 

 

 

Tags: regex

Posted by Fabis94 on Thu, 05 May 2022 10:04:09 +0300