How to write a JSON parser?

Preface

Some time ago at work I ran into the following problem. The back end sent me JSON in which the id field was a number, but its value fell outside the safe integer range of -(2^53 - 1) to 2^53 - 1, so it lost precision when parsed with JSON.parse. The back-end team was unwilling to change the interface. In the end I used the json-bigint library to parse the JSON instead of JSON.parse, so that large numbers come through as strings.

I was curious about how json-bigint works, so I read its source code. The principle turned out not to be complicated, so I wrote this article for your reference.

Use of json-bigint

First, let's look at the basic usage of json-bigint.

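// Install first (in a shell): npm install json-bigint
import JSONbig from 'json-bigint';

const json = '{ "value" : 9223372036854775807, "v2": 123 }';
// 9223372036854776000: precision is lost
JSON.parse(json).value;
// By default json-bigint returns a BigNumber object for large numbers;
// with the storeAsString option it returns the string '9223372036854775807'
JSONbig({ storeAsString: true }).parse(json).value;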
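
The precision loss itself can be reproduced without any library. The following is a minimal sketch using only built-ins; the payload and variable names are made up for illustration.

```javascript
// Largest integer a JavaScript number can represent exactly: 2^53 - 1.
const maxSafe = Number.MAX_SAFE_INTEGER;

// Hypothetical payload whose id lies far beyond the safe range.
const payload = '{ "id": 9223372036854775807 }';
const parsedId = JSON.parse(payload).id;

// The parsed value has been rounded to the nearest representable double.
const isSafe = Number.isSafeInteger(parsedId);
const roundTrip = JSON.stringify({ id: parsedId });

console.log(maxSafe, parsedId, isSafe, roundTrip);
```

Serializing the parsed value again does not give back the digits that were sent, which is why the id has to be kept as something other than a plain number.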

The principle of json-bigint

json-bigint works by reading the JSON text one character at a time and parsing each value as an object, array, number, string, boolean or null according to the rules for that type.

Directory structure of json-bigint

json-bigint mainly exposes two APIs, JSONbig.parse and JSONbig.stringify. Here we focus on JSONbig.parse, whose code lives mainly in the parse.js file.

index.js

From the entry file index.js we can see that json_parse is a function which, when called, returns another function. That returned function is what gets exposed as the parse API.

var json_parse = require('./lib/parse.js');

// Call json_parse and expose the return value to the parse attribute
module.exports.parse = json_parse();
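
The "function that returns a function" pattern can be sketched as follows. This is only an illustration: makeParse and its integer-only body are invented here, and while storeAsString mirrors the name of a json-bigint option, the logic below is not the library's.

```javascript
// Hypothetical stand-in for lib/parse.js: a factory that receives options
// once and returns the function that does the actual parsing.
function makeParse(options) {
  // The options are captured by the closure of the returned function.
  const settings = Object.assign({ storeAsString: true }, options);

  return function parse(source) {
    const text = String(source).trim();
    // Toy behavior: only bare integers are understood in this sketch.
    if (!/^-?\d+$/.test(text)) {
      throw new SyntaxError('Only integers are supported in this sketch');
    }
    return settings.storeAsString && text.length > 15 ? text : Number(text);
  };
}

const parseDefault = makeParse();
const parseAsNumber = makeParse({ storeAsString: false });
console.log(parseDefault('9223372036854775807'), parseAsNumber('42'));
```

Calling the factory once with no arguments and exporting the result is what gives callers a ready-to-use parse function with default settings.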

parse.js

parse.js is where the core code of JSONbig.parse lives. I removed some of the special-case handling and kept only the core of the source to make it easier to follow. Let's walk through it.

Entry function

First, let's go over the parameter and variables of the entry function:

  • source: the JSON string to parse.
  • at: the index of the next character to read. The JSON is parsed character by character from beginning to end, so it starts at 0.
  • ch: the character currently being examined. It starts out as a single space.
  • text: the source parameter converted to a string.

function (source) {
    var result;
    text = source + '';
    at = 0;
    ch = ' ';
    result = value();
    white();
    if (ch) {
        error('Syntax error');
    }
    return result;
};

The entry function then calls value, which starts parsing the text variable and returns the parsed result. If any non-whitespace characters remain after that, the JSON is invalid and an error is thrown; otherwise the result from value is returned.
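
The rule about leftover characters is not specific to this library; the built-in JSON.parse enforces it too. A small sketch, where isValidJson is a helper name made up for this example:

```javascript
// Returns true when JSON.parse accepts the text and false when it throws.
function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (err) {
    return false;
  }
}

// Surrounding whitespace is allowed.
const padded = isValidJson(' {"a": 1} \n');
// A stray character after a complete value is a syntax error.
const trailing = isValidJson('{"a": 1} x');
console.log(padded, trailing);
```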

value

Because JSON is a string before it is parsed, we can tell which kind of value we are looking at from its first non-whitespace character.

  • If it is {, the value is an object.
  • If it is [, the value is an array.
  • If it is ", the value is a string (the JSON standard only allows double quotes).
  • If it is -, the value is a number, specifically a negative one.
  • If it is a digit from 0 to 9, the value is a number. Otherwise it is handled as a boolean or null.
value = function () {
    white();
    switch (ch) {
      case '{':
        return object();
      case '[':
        return array();
      case '"':
        return string();
      case '-':
        return number();
      default:
        return ch >= '0' && ch <= '9' ? number() : word();
    }
  };
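
To see the dispatch rule in isolation, here is a minimal sketch that only names the type implied by the first non-whitespace character. classify is a hypothetical helper, not part of json-bigint.

```javascript
// Hypothetical helper: report which kind of value a JSON text starts with,
// using the same first-character rules as value().
function classify(text) {
  let i = 0;
  // Skip leading whitespace, using the same comparison as white().
  while (i < text.length && text[i] <= ' ') i += 1;
  const ch = text.charAt(i);
  switch (ch) {
    case '{': return 'object';
    case '[': return 'array';
    case '"': return 'string';
    case '-': return 'number';
    default:
      // Digits start a number; anything else is left to word().
      return ch >= '0' && ch <= '9' ? 'number' : 'word';
  }
}

console.log(classify('  {"a": 1}'), classify('-5'), classify('true'));
```

Note that 'word' only means "not any of the other types"; whether the text really is true, false or null is decided later, when the characters are checked one by one.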

The first statement in value is a call to white. What does white do?

white reads the JSON string character by character, skipping whitespace, and leaves the first non-whitespace character in the ch variable. Based on ch and the rules above, value then hands off to the matching parsing function.

white & next

The white function skips over insignificant whitespace in the JSON. It runs a while loop: as long as ch is a whitespace character, the condition ch && ch <= ' ' is true and the loop keeps calling next, until ch is no longer whitespace or the end of the text is reached.

// white
white = function () {
    while (ch && ch <= ' ') {
        next();
    }
},

The next function fetches the character at index at from the JSON string, stores it in ch and then increments at. If it is passed an argument, it first checks that the argument equals the current ch and throws an error if it does not.

// next function
next = function (c) {
    if (c && c !== ch) {
      error("Expected '" + c + "' instead of '" + ch + "'");
    }
    ch = text.charAt(at);
    at += 1;
    return ch;
},
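
The interplay of white and next is easier to follow when it can be run on its own. Below is a sketch that keeps at and ch in a closure instead of shared variables; makeCursor and current are names invented for this example.

```javascript
// Hypothetical wrapper around the same two functions.
function makeCursor(text) {
  let at = 0;
  let ch = ' ';

  function next(expected) {
    if (expected && expected !== ch) {
      throw new SyntaxError("Expected '" + expected + "' instead of '" + ch + "'");
    }
    ch = text.charAt(at); // '' once the end of the text is reached
    at += 1;
    return ch;
  }

  function white() {
    // Tab, line feed and carriage return all compare as <= ' ' because
    // their character codes are below 32, the code of the space.
    while (ch && ch <= ' ') next();
  }

  return { next, white, current: () => ch };
}

const cursor = makeCursor(' \n\t{ }');
cursor.white();
const first = cursor.current();  // first non-whitespace character
cursor.next('{');                // consume it, checking that it is "{"
cursor.white();
const second = cursor.current(); // next non-whitespace character
console.log(first, second);
```

Because charAt returns an empty string past the end of the text, ch becomes falsy there, which is what stops the while (ch ...) loops throughout the parser.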

object

Let's first look at what an object looks like in JSON: {"key": value} or {}. The first character is {, and the next non-whitespace character must be either " or }. Otherwise the JSON is invalid.

If it is }, the JSON is an empty object, so we simply return an empty object.

If it is ", a property follows. The text between this " and the next " is a string, which is the object's first key.

If it is neither, the object is invalid and an error must be thrown.

As for the value, its type is not known in advance: it may be an object, array, boolean, string, number or null.

object = function () {
    var key,
    object = Object.create(null);

    if (ch === '{') {
      // Check that ch is "{" and move on to the next character
      next('{');
      // Skip any whitespace so that ch holds
      // the first non-whitespace character after "{"
      white();
      // If that character is "}", the object is empty and can be returned as is
      if (ch === '}') {
        next('}');
        return object;
      }
      // Otherwise, if the JSON is valid, the current character must be '"'
      // What sits between "{" and the colon is the key, in the form "key"
      while (ch) {
        // string() reads the content between the two double quotes
        key = string();
        // After reading the key, keep reading forward
        white();
        // The first non-whitespace character after the key must be ":", otherwise the JSON is invalid
        next(':');
        // The colon is followed by the value that belongs to the key
        // Its type is not fixed, so value() is used again to detect the type and parse it accordingly
        // value() returns the parsed value,
        // which we store on the object under key
        object[key] = value();
        // After the value, keep reading forward
        // If "}" is read, the object is complete and is returned
        white();
        if (ch === '}') {
          next('}');
          return object;
        }
        // If a comma is read, there are more properties, so go around the loop again
        next(',');
        white();
      }
    }
    error('Bad object');
  };
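
One detail in this function is easy to overlook: the result is created with Object.create(null) rather than {}. A short sketch of the difference, using only built-ins (the variable names are made up):

```javascript
// An ordinary literal inherits from Object.prototype...
const plain = {};
// ...while Object.create(null) yields an object with no prototype at all.
const bare = Object.create(null);

const plainHasToString = 'toString' in plain; // inherited member is visible
const bareHasToString = 'toString' in bare;   // nothing is inherited

// On a prototype-less object, "__proto__" is stored like any other key.
bare['__proto__'] = 'just data';
const bareKeys = Object.keys(bare);

console.log(plainHasToString, bareHasToString, bareKeys);
```

This means keys coming from untrusted JSON cannot collide with inherited members. The flip side is that the parsed object has no hasOwnProperty or toString methods of its own, so code that consumes it should rely on Object.keys or Object.prototype.hasOwnProperty.call instead.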

string

In JSON, string content must be enclosed in double quotes, for example {"key":"value"}.

The string function reads the content between two double quotes and returns it. If the end of the text is reached without finding the closing quote, the string is unterminated and an error is thrown. Note that this simplified excerpt does not handle escape sequences such as \".

string = function () {
    var string = '';
    // If ch is '"', a string starts here
    // The while loop keeps reading until it reaches the closing '"',
    // then takes the content between the two quotes as the string
    // and returns it
    if (ch === '"') {
      var startAt = at;
      while (next()) {
        if (ch === '"') {
          if (at - 1 > startAt) string += text.substring(startAt, at - 1);
          next();
          return string;
        }
      }
    }
    error('Bad string');
  },
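
The same "slice between the two quotes" idea can be written as a standalone function. This sketch uses indexOf to find the closing quote instead of stepping with next, and readString is a hypothetical name; like the excerpt, it does not handle escape sequences.

```javascript
// Hypothetical standalone version: given the text and the index of an
// opening quote, return the contents and the index just past the closing
// quote. An escaped quote (\") would end the string early here.
function readString(text, start) {
  if (text.charAt(start) !== '"') {
    throw new SyntaxError('Bad string');
  }
  const end = text.indexOf('"', start + 1);
  if (end === -1) {
    throw new SyntaxError('Bad string'); // no closing quote found
  }
  return { value: text.substring(start + 1, end), end: end + 1 };
}

const key = readString('{"key":"value"}', 1);
console.log(key.value, key.end);
```

Returning the end index plays the role that the shared at variable plays in the original code: it tells the caller where to continue reading.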

array

Let's first look at what an array looks like in JSON: [value, value]. After the first character [, the next non-whitespace character is either the start of the first element or ].

If it is ], the array is empty, so we simply return an empty array.

If not, the array has elements. Since their types are not known in advance, value is used again to detect the type of each element and parse it accordingly. This continues until ] is read, at which point the whole array is returned.

array = function () {

    var array = [];
    // An array must start with "[". If it does not, the array is invalid
    if (ch === '[') {
      next('[');
      // Skip whitespace to reach the next non-whitespace character
      white();
      // If that character is "]", the array is empty, so return the empty array directly
      if (ch === ']') {
        next(']');
        return array;
      }
      // Otherwise the array has at least one element
      // Element types are not known in advance, so value() reads each one and returns it
      while (ch) {
        array.push(value());
        white();
        // After an element, "]" means the array is complete and can be returned
        if (ch === ']') {
          next(']');
          return array;
        }
        // After an element, a comma means more elements follow, so go around the loop again
        next(',');
        white();
      }
    }
    error('Bad array');
  },
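
The loop structure (read an element, then expect either ] or a comma) can be tried out in a reduced form. In this sketch the elements are limited to unsigned integers so that the full value function is not needed; parseIntList and its inner helpers are made-up names.

```javascript
// Hypothetical, reduced version of array() for lists such as "[1, 22, 3]".
function parseIntList(text) {
  let at = 0;
  const peek = () => text.charAt(at);
  const skipWhite = () => { while (peek() && peek() <= ' ') at += 1; };
  const expect = (c) => {
    if (peek() !== c) throw new SyntaxError("Expected '" + c + "' at index " + at);
    at += 1;
  };
  const digits = () => {
    const start = at;
    while (peek() >= '0' && peek() <= '9') at += 1;
    if (at === start) throw new SyntaxError('Expected a digit at index ' + at);
    return Number(text.slice(start, at));
  };

  const items = [];
  skipWhite();
  expect('[');
  skipWhite();
  if (peek() === ']') { at += 1; return items; }   // empty array
  while (peek()) {
    items.push(digits());
    skipWhite();
    if (peek() === ']') { at += 1; return items; } // end of the array
    expect(',');                                   // otherwise a comma must follow
    skipWhite();
  }
  throw new SyntaxError('Bad array');              // text ended before "]"
}

console.log(parseIntList('[1, 22 ,3]'));
```

Because an element is required after every comma, a trailing comma as in [1,] is rejected, which matches the JSON grammar.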

number

number = function () {
    var number,
      string = '';

    // If the first character is -, the number is negative. Keep reading
    if (ch === '-') {
      string = '-';
      next('-');
    }

    // While ch is a digit between 0 and 9, append it to the string
    while (ch >= '0' && ch <= '9') {
      string += ch;
      next();
    }

    // Handle the decimal point and the fractional digits
    if (ch === '.') {
      string += '.';
      while (next() && ch >= '0' && ch <= '9') {
        string += ch;
      }
    }

    // Handle scientific notation (the exponent part)
    if (ch === 'e' || ch === 'E') {
      string += ch;
      next();
      if (ch === '-' || ch === '+') {
        string += ch;
        next();
      }
      while (ch >= '0' && ch <= '9') {
        string += ch;
        next();
      }
    }
    // Convert the string to a number and assign it to the number variable
    number = +string;

    // isFinite returns false if number is NaN or positive/negative Infinity
    // For example, a lone '-' converts to NaN, so isFinite returns false
    // In that case an error is thrown
    if (!isFinite(number)) {
      error('Bad number');
    } else {
      // If the string is longer than 15 characters, the number may be too large to represent exactly, so return the string
      if (string.length > 15)
        return string;
      else
        // Otherwise (15 characters or fewer) return the numeric type
        return number;
    }
  },
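
The final decision (number or string) can be isolated into a small function to see what the length check does in practice. convertNumberToken is a hypothetical helper written for this article; it only mirrors the last step of the simplified excerpt.

```javascript
// Hypothetical helper: decide whether an already-scanned numeric token is
// returned as a number or kept as a string.
function convertNumberToken(token) {
  const asNumber = +token;
  if (!isFinite(asNumber)) {
    throw new SyntaxError('Bad number');
  }
  // The check counts characters, not digits: a sign, a decimal point and
  // an exponent all contribute to the length.
  return token.length > 15 ? token : asNumber;
}

console.log(convertNumberToken('123'));
console.log(convertNumberToken('9223372036854775807'));
console.log(convertNumberToken('0.1234567890123456'));
```

The rule is a conservative shortcut rather than an exact range test: a long decimal such as 0.1234567890123456 is also kept as a string, even though it is nowhere near 2^53.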

word

The word function handles the boolean values and null.

Let's first look at how booleans and null appear in JSON: {"key1":true,"key2":false,"key3":null}. They are bare words that are not wrapped in double quotes.

word = function () {
    switch (ch) {
      // If the first character is t, the next characters must be t r u e in turn, otherwise an error will be thrown
      case 't':
        next('t');
        next('r');
        next('u');
        next('e');
        // Return true
        return true;
      // If the first character is f, the next characters must be f a l s e in turn, otherwise an error will be thrown
      case 'f':
        next('f');
        next('a');
        next('l');
        next('s');
        next('e');
        // Return false
        return false;
      // If the first character is n, the next characters must be n u l l in turn, otherwise an error will be thrown
      case 'n':
        next('n');
        next('u');
        next('l');
        next('l');
        // Return null
        return null;
    }
    error("Unexpected '" + ch + "'");
  },
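
To tie the pieces together, here is a compact runnable sketch assembled from the excerpts above. It is not the library's code: makeParser, the digits helper and the literals table are choices made for this example, strings have no escape handling, and any numeric token longer than 15 characters is returned as a string.

```javascript
// Compact sketch assembled from the excerpts in this article (assumptions:
// no string escapes; numeric tokens over 15 characters stay strings).
function makeParser() {
  let at, ch, text;
  const error = (message) => { throw new SyntaxError(message + ' at position ' + at); };
  const next = (c) => {
    if (c && c !== ch) error("Expected '" + c + "' instead of '" + ch + "'");
    ch = text.charAt(at);
    at += 1;
    return ch;
  };
  const white = () => { while (ch && ch <= ' ') next(); };
  const digits = () => {
    let run = '';
    while (ch >= '0' && ch <= '9') { run += ch; next(); }
    return run;
  };
  const number = () => {
    let token = '';
    if (ch === '-') { token = '-'; next('-'); }
    token += digits();
    if (ch === '.') { token += '.'; next(); token += digits(); }
    if (ch === 'e' || ch === 'E') {
      token += ch; next();
      if (ch === '-' || ch === '+') { token += ch; next(); }
      token += digits();
    }
    const n = +token;
    if (!isFinite(n)) error('Bad number');
    return token.length > 15 ? token : n;
  };
  const string = () => {
    if (ch !== '"') error('Bad string');
    const startAt = at;
    while (next()) {
      if (ch === '"') {
        const content = text.substring(startAt, at - 1);
        next();
        return content;
      }
    }
    error('Bad string');
  };
  const word = () => {
    const literals = { t: ['true', true], f: ['false', false], n: ['null', null] };
    const entry = literals[ch];
    if (!entry) error("Unexpected '" + ch + "'");
    for (const c of entry[0]) next(c); // each character must match in turn
    return entry[1];
  };
  const array = () => {
    const result = [];
    next('['); white();
    if (ch === ']') { next(']'); return result; }
    while (ch) {
      result.push(value());
      white();
      if (ch === ']') { next(']'); return result; }
      next(','); white();
    }
    error('Bad array');
  };
  const object = () => {
    const result = Object.create(null);
    next('{'); white();
    if (ch === '}') { next('}'); return result; }
    while (ch) {
      const key = string();
      white();
      next(':');
      result[key] = value();
      white();
      if (ch === '}') { next('}'); return result; }
      next(','); white();
    }
    error('Bad object');
  };
  function value() {
    white();
    switch (ch) {
      case '{': return object();
      case '[': return array();
      case '"': return string();
      case '-': return number();
      default: return ch >= '0' && ch <= '9' ? number() : word();
    }
  }
  return function parse(source) {
    text = source + ''; at = 0; ch = ' ';
    const result = value();
    white();
    if (ch) error('Syntax error');
    return result;
  };
}

const parse = makeParser();
const parsed = parse('{ "id": 9223372036854775807, "tags": ["a", "b"], "ok": true, "x": -1.5e2 }');
console.log(parsed.id, parsed.tags, parsed.ok, parsed.x);
```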


Posted by new_to_php2004 on Sat, 30 Apr 2022 10:33:04 +0300