Author: Wen Shaojin (alias Gaotie, "high-speed rail")
Source: Alibaba developer official account
1 common string encoding
Common string encodings are:
- LATIN1, also known as ISO-8859-1. A single-byte encoding that can represent only the code points U+0000 to U+00FF, that is, ASCII plus the Western European characters.
- UTF-8, a variable-length encoding. A character takes 1 to 4 bytes: ASCII takes 1 byte, most Chinese characters take 3 bytes, and supplementary characters such as emoji take 4 bytes. Because Chinese usually needs 3 bytes, UTF-8 takes more space for Chinese text; the alternatives are GBK/GB2312/GB18030.
- UTF-16, which uses 2 bytes for characters in the Basic Multilingual Plane and 4 bytes (a surrogate pair) for supplementary characters. Its fixed 2-byte predecessor is UCS-2 (2-byte Universal Character Set). Depending on byte order, UTF-16 comes in two forms, UTF-16BE and UTF-16LE; without a byte order mark, UTF-16 defaults to UTF-16BE. A Java char is one UTF-16 code unit, and on little-endian hardware such as x86 a char[] is laid out in memory as UTF-16LE.
- GB18030, a variable-length encoding. A character takes 1, 2 or 4 bytes. As in UTF-8, ASCII takes 1 byte, but common Chinese characters take only 2 bytes, so Chinese text is smaller. The disadvantage is that it is rarely used outside China.
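The size differences described in the list can be checked with the public String.getBytes API. The following is a minimal sketch; the class name EncodedLengths and its methods are illustrative and not part of the original article. GBK and GB18030 are left out because those charsets are not guaranteed to be present in every Java runtime.

```java
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    // Number of bytes the text occupies in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies in UTF-16LE (no byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    // True if every char fits in ISO-8859-1 (code point at most 0xFF).
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }
}
```

With this sketch, an ASCII letter takes 1 byte in UTF-8, U+00E9 takes 2, the Chinese character U+4E2D takes 3, and the emoji U+1F600 takes 4 bytes in both UTF-8 and UTF-16.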
2 encoding conversion performance
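Converting between UTF-16 and UTF-8 is comparatively complex, because the encoder has to branch on the value of every char and handle surrogate pairs, so it usually performs poorly. For example: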

static int encodeUTF8(char[] utf16, int off, int len, byte[] dest, int dp) {
    int sl = off + len, last_offset = sl - 1;

    while (off < sl) {
        char c = utf16[off++];
        if (c < 0x80) {
            // Have at most seven bits
            dest[dp++] = (byte) c;
        } else if (c < 0x800) {
            // 2 dest, 11 bits
            dest[dp++] = (byte) (0xc0 | (c >> 6));
            dest[dp++] = (byte) (0x80 | (c & 0x3f));
        } else if (c >= '\uD800' && c < '\uE000') {
            int uc;
            if (c < '\uDC00') {
                if (off > last_offset) {
                    dest[dp++] = (byte) '?';
                    return dp;
                }

                char d = utf16[off];
                if (d >= '\uDC00' && d < '\uE000') {
                    uc = (c << 10) + d + 0xfca02400;
                } else {
                    throw new RuntimeException("encodeUTF8 error", new MalformedInputException(1));
                }
            } else {
                uc = c;
            }
            dest[dp++] = (byte) (0xf0 | ((uc >> 18)));
            dest[dp++] = (byte) (0x80 | ((uc >> 12) & 0x3f));
            dest[dp++] = (byte) (0x80 | ((uc >> 6) & 0x3f));
            dest[dp++] = (byte) (0x80 | (uc & 0x3f));
            off++; // 2 utf16
        } else {
            // 3 dest, 16 bits
            dest[dp++] = (byte) (0xe0 | ((c >> 12)));
            dest[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
            dest[dp++] = (byte) (0x80 | (c & 0x3f));
        }
    }
    return dp;
}

Relevant code address [1].
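The constant 0xfca02400 in the encoder above may look arbitrary. A small sketch (the class SurrogateMath is hypothetical, written only for this illustration) shows that it is the 32-bit value of 0x10000 - (0xD800 << 10) - 0xDC00, so the expression gives the same result as Character.toCodePoint:

```java
class SurrogateMath {
    // Same arithmetic as the encoder above: combine a high and a low surrogate into a code point.
    static int combine(char high, char low) {
        return (high << 10) + low + 0xfca02400;
    }

    // How the constant is derived; the int result wraps to the same 32-bit pattern.
    static int constant() {
        return 0x010000 - ('\uD800' << 10) - '\uDC00';
    }
}
```

For the surrogate pair D83D DE00, the sum is 0x361D200 - 0x35FDC00 = 0x1F600, which is the code point of the emoji encoded by that pair.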
Since a Java char[] is laid out as UTF-16LE on little-endian platforms, converting a char[] to a UTF-16LE encoded byte[] can be done with a fast copy using sun.misc.Unsafe#copyMemory. For example:
static int writeUtf16LE(char[] chars, int off, int len, byte[] dest, int dp) {
    UNSAFE.copyMemory(chars
            , CHAR_ARRAY_BASE_OFFSET + off * 2L
            , dest
            , BYTE_ARRAY_BASE_OFFSET + dp
            , len * 2L
    );
    dp += len * 2;
    return dp;
}
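For comparison, the same conversion can be sketched with public APIs only, using a little-endian CharBuffer view over the destination array. This is an illustrative alternative, not the author's method; the class name Utf16LeWriter is an assumption, and no claim is made here about how its speed compares with Unsafe.copyMemory.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class Utf16LeWriter {
    // Writes len chars starting at off into dest at dp as UTF-16LE; returns the new dp.
    static int writeUtf16LE(char[] chars, int off, int len, byte[] dest, int dp) {
        ByteBuffer.wrap(dest, dp, len * 2)
                .order(ByteOrder.LITTLE_ENDIAN)
                .asCharBuffer()          // char view starting at position dp
                .put(chars, off, len);   // bulk copy, writes through to dest
        return dp + len * 2;
    }
}
```

Unlike the Unsafe version, this sketch produces UTF-16LE regardless of the byte order of the hardware it runs on.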

3 coding of Java string
Different JDK versions implement String differently, which leads to different performance. A char is always a UTF-16 code unit, but from JDK 9 on a String may store its content internally as LATIN1.
3.1. String implementation in JDK 6 and earlier
static class String { final char[] value; final int offset; final int count; }

In JDK 6 and earlier, the String returned by String.substring shares the char[] value of the original String. As long as the substring is referenced, the char[] of the original, possibly much larger, String cannot be reclaimed by GC. Many libraries therefore avoid substring on JDK 6 and below.
3.2. String implementation of JDK 7 / 8
static class String { final char[] value; }

From JDK 7 (update 6) on, the offset and count fields are removed from String, and value.length takes the place of count. This avoids the problem of substring keeping a large char[] alive and makes optimization easier, so String operations in JDK 7/8 generally perform better than in Java 6.
3.3. Implementation of JDK 9 / 10 / 11
static class String { final byte coder; final byte[] value; static final byte LATIN1 = 0; static final byte UTF16 = 1; }

From JDK 9 on, the type of value changes from char[] to byte[], and a coder field is added. If every character fits in LATIN1 (code point at most 0xFF), value holds one byte per character and coder is LATIN1; if any character does not fit, the whole string is encoded as UTF16. This mixed encoding makes English text occupy less memory. The disadvantage is that some String APIs in Java 9 may be slower than in JDK 8. In particular, constructing a String from a char[] has to compress it into a LATIN1 encoded byte[], which can reduce performance by about 10% in some scenarios.
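The compression step can be pictured with a short sketch. It only illustrates the idea of deciding between one byte and two bytes per char; it is not the JDK's actual implementation, and the class name Latin1Compress is hypothetical.

```java
class Latin1Compress {
    // Returns one byte per char if every char is at most 0xFF, otherwise null.
    static byte[] compress(char[] chars) {
        byte[] out = new byte[chars.length];
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c > 0xFF) {
                return null; // does not fit: the string needs UTF16, two bytes per char
            }
            out[i] = (byte) c;
        }
        return out;
    }
}
```

The extra pass over the char[] and the newly allocated byte[] are the kind of work that constructing a String from a char[] has to do on JDK 9 and later.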
4 method of quickly constructing string
To keep strings immutable, constructing a String copies its input. To reduce the overhead of constructing strings, this copy should be avoided.
For example, the following is one of the constructors of String in JDK 8:
public final class String { public String(char value[]) { this.value = Arrays.copyOf(value, value.length); } }
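The effect of this defensive copy can be observed from outside: changing the array after construction does not change the String. A minimal sketch (the class ConstructorCopy is illustrative only):

```java
class ConstructorCopy {
    // Builds a String from the array, then mutates the array; the String keeps the old content.
    static String buildThenMutate(char[] chars) {
        String s = new String(chars); // the public constructor copies chars
        chars[0] = 'X';               // modifies only the caller's array
        return s;
    }
}
```

A constructor that shares the array instead of copying it saves this work, but any later write to the array would then silently change the String, which is why the non-copying variants need care.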

JDK 8 also has a constructor that does not copy, but it is not public. It can be called through a trick: binding it with MethodHandles.Lookup and LambdaMetafactory. The code for this technique appears later in the article.
public final class String {
    String(char[] value, boolean share) {
        // assert share : "unshared not supported";
        this.value = value;
    }
}

There are three ways to construct strings quickly:
- Bind reflection with MethodHandles.Lookup & LambdaMetafactory
- Use the related methods of JavaLangAccess
- Construct directly with Unsafe
Of the three, methods 1 and 2 perform about the same and method 3 is slightly slower, but all of them are much faster than calling new String directly (roughly 5.4 to 5.6 times the throughput in the data below). The JMH results on JDK 8 are as follows:
Benchmark                          Mode  Cnt       Score       Error   Units
StringCreateBenchmark.invoke      thrpt    5  784869.350 ±  1936.754  ops/ms
StringCreateBenchmark.langAccess  thrpt    5  784029.186 ±  2734.300  ops/ms
StringCreateBenchmark.unsafe      thrpt    5  761176.319 ± 11914.549  ops/ms
StringCreateBenchmark.newString   thrpt    5  140883.533 ±  2217.773  ops/ms

From JDK 9 on, when all characters are ASCII, constructing the String directly from a byte[] gives even better results.
4.1 fast string construction by binding reflection with MethodHandles.Lookup & LambdaMetafactory
Relevant code address [2].
4.1.1 JDK8 fast construction string
public static BiFunction<char[], Boolean, String> getStringCreatorJDK8() throws Throwable {
    Constructor<MethodHandles.Lookup> constructor = MethodHandles.Lookup.class.getDeclaredConstructor(Class.class, int.class);
    constructor.setAccessible(true);
    MethodHandles.Lookup lookup = constructor.newInstance(
            String.class
            , -1 // Lookup.TRUSTED
    );

    MethodHandles.Lookup caller = lookup.in(String.class);

    MethodHandle handle = caller.findConstructor(
            String.class, MethodType.methodType(void.class, char[].class, boolean.class)
    );

    CallSite callSite = LambdaMetafactory.metafactory(
            caller
            , "apply"
            , MethodType.methodType(BiFunction.class)
            , handle.type().generic()
            , handle
            , handle.type()
    );

    return (BiFunction) callSite.getTarget().invokeExact();
}
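The binding step itself can be tried without touching JDK internals. The sketch below applies the same LambdaMetafactory call to the public method String.length, using an ordinary MethodHandles.lookup() instead of a trusted Lookup. The class name LambdaBinding is an assumption made for this illustration, and the sketch does not bypass any access checks.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.LambdaMetafactory;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.function.ToIntFunction;

class LambdaBinding {
    // Binds the public method String.length() to a ToIntFunction<String>.
    @SuppressWarnings("unchecked")
    static ToIntFunction<String> bindLength() {
        try {
            MethodHandles.Lookup caller = MethodHandles.lookup();
            MethodHandle handle = caller.findVirtual(
                    String.class, "length", MethodType.methodType(int.class));
            CallSite callSite = LambdaMetafactory.metafactory(
                    caller,
                    "applyAsInt",                                   // interface method name
                    MethodType.methodType(ToIntFunction.class),     // factory type: () -> ToIntFunction
                    MethodType.methodType(int.class, Object.class), // erased signature of applyAsInt
                    handle,                                         // implementation method
                    handle.type());                                 // specialized signature: (String)int
            return (ToIntFunction<String>) callSite.getTarget().invokeExact();
        } catch (Throwable e) {
            throw new IllegalStateException("binding failed", e);
        }
    }
}
```

The returned object behaves like a lambda written in source code, so the JIT can inline calls through it, which is the reason this route tends to be faster than calling Method.invoke on every call.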

4.1.2 JDK 11 fast string construction method
The listing below binds the non-public String.coder() accessor in the same way. getStringCreatorJDK11, which is used in the usage example after it, is not listed in this article (see [2]).
public static ToIntFunction<String> getStringCode11() throws Throwable {
    Constructor<MethodHandles.Lookup> constructor = MethodHandles.Lookup.class.getDeclaredConstructor(Class.class, int.class);
    constructor.setAccessible(true);
    MethodHandles.Lookup lookup = constructor.newInstance(
            String.class
            , -1 // Lookup.TRUSTED
    );

    MethodHandles.Lookup caller = lookup.in(String.class);

    MethodHandle handle = caller.findVirtual(
            String.class, "coder", MethodType.methodType(byte.class)
    );

    CallSite callSite = LambdaMetafactory.metafactory(
            caller
            , "applyAsInt"
            , MethodType.methodType(ToIntFunction.class)
            , MethodType.methodType(int.class, Object.class)
            , handle
            , handle.type()
    );

    return (ToIntFunction<String>) callSite.getTarget().invokeExact();
}

if (JDKUtils.JVM_VERSION == 11) { Function< byte[], String> stringCreator = JDKUtils.getStringCreatorJDK11(); byte[] bytes = new byte[]{'a', 'b', 'c'}; String apply = stringCreator.apply(bytes); assertEquals("abc", apply); }

4.1.3 JDK 17 fast string construction method
In JDK 17, MethodHandles.Lookup uses Reflection.registerFieldsToFilter to protect lookupClass and allowedModes, so the approaches found on the Internet that modify allowedModes no longer work.
In JDK 17, this MethodHandles technique only works when the JVM is started with the following parameter:
--add-opens java.base/java.lang.invoke=ALL-UNNAMED

public static BiFunction<byte[], Charset, String> getStringCreatorJDK17() throws Throwable {
    Constructor<MethodHandles.Lookup> constructor = MethodHandles.Lookup.class.getDeclaredConstructor(Class.class, Class.class, int.class);
    constructor.setAccessible(true);
    MethodHandles.Lookup lookup = constructor.newInstance(
            String.class
            , null
            , -1 // Lookup.TRUSTED
    );

    MethodHandles.Lookup caller = lookup.in(String.class);

    MethodHandle handle = caller.findStatic(
            String.class, "newStringNoRepl1", MethodType.methodType(String.class, byte[].class, Charset.class)
    );

    CallSite callSite = LambdaMetafactory.metafactory(
            caller
            , "apply"
            , MethodType.methodType(BiFunction.class)
            , handle.type().generic()
            , handle
            , handle.type()
    );

    return (BiFunction<byte[], Charset, String>) callSite.getTarget().invokeExact();
}

if (JDKUtils.JVM_VERSION == 17) { BiFunction< byte[], Charset, String> stringCreator = JDKUtils.getStringCreatorJDK17(); byte[] bytes = new byte[]{'a', 'b', 'c'}; String apply = stringCreator.apply(bytes, StandardCharsets.US_ASCII); assertEquals("abc", apply); }

4.2 rapid construction based on JavaLangAccess
With the JavaLangAccess provided by SharedSecrets, a String can also be constructed without copying, but this is more troublesome: the APIs differ between JDK 8, 11 and 17, which makes it inconvenient for one code base to support several JDK versions, so it is not recommended.
JavaLangAccess javaLangAccess = SharedSecrets.getJavaLangAccess(); javaLangAccess.newStringNoRepl(b, StandardCharsets.US_ASCII);

4.3 fast string construction based on Unsafe
public static final Unsafe UNSAFE;
static {
    Unsafe unsafe = null;
    try {
        Field theUnsafeField = Unsafe.class.getDeclaredField("theUnsafe");
        theUnsafeField.setAccessible(true);
        unsafe = (Unsafe) theUnsafeField.get(null);
    } catch (Throwable ignored) {}
    UNSAFE = unsafe;
}

////////////////////////////////////////////

Object str = UNSAFE.allocateInstance(String.class);
UNSAFE.putObject(str, valueOffset, chars);

Note: after JDK 9, the implementation is different, for example:
Object str = UNSAFE.allocateInstance(String.class); UNSAFE.putByte(str, coderOffset, (byte) 0); UNSAFE.putObject(str, valueOffset, (byte[]) bytes);

4.4 application of techniques for fast string construction:
The following method formats the date as a string, and the performance will be very good.
public String formatYYYYMMDD(Calendar calendar) throws Throwable { int year = calendar.get(Calendar.YEAR); int month = calendar.get(Calendar.MONTH) + 1; int dayOfMonth = calendar.get(Calendar.DAY_OF_MONTH); byte y0 = (byte) (year / 1000 + '0'); byte y1 = (byte) ((year / 100) % 10 + '0'); byte y2 = (byte) ((year / 10) % 10 + '0'); byte y3 = (byte) (year % 10 + '0'); byte m0 = (byte) (month / 10 + '0'); byte m1 = (byte) (month % 10 + '0'); byte d0 = (byte) (dayOfMonth / 10 + '0'); byte d1 = (byte) (dayOfMonth % 10 + '0'); if (JDKUtils.JVM_VERSION >= 9) { byte[] bytes = new byte[] {y0, y1, y2, y3, m0, m1, d0, d1}; if (JDKUtils.JVM_VERSION == 17) { return JDKUtils.getStringCreatorJDK17().apply(bytes, StandardCharsets.US_ASCII); } if (JDKUtils.JVM_VERSION <= 11) { return JDKUtils.getStringCreatorJDK11().apply(bytes); } return new String(bytes, StandardCharsets.US_ASCII); } char[] chars = new char[]{ (char) y0, (char) y1, (char) y2, (char) y3, (char) m0, (char) m1, (char) d0, (char) d1 }; if (JDKUtils.JVM_VERSION == 8) { return JDKUtils.getStringCreatorJDK8().apply(chars, true); } return new String(chars); }
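The digit arithmetic in this method can be tried on its own with the public String constructor. The sketch below is a portable fallback rather than the zero-copy version: it assumes a year between 0 and 9999, the class name DateDigits is illustrative, and the constructor it calls still copies the byte[].

```java
import java.nio.charset.StandardCharsets;

class DateDigits {
    // Formats a date as yyyyMMdd by computing each ASCII digit directly.
    static String formatYYYYMMDD(int year, int month, int dayOfMonth) {
        byte[] bytes = {
                (byte) (year / 1000 + '0'),
                (byte) ((year / 100) % 10 + '0'),
                (byte) ((year / 10) % 10 + '0'),
                (byte) (year % 10 + '0'),
                (byte) (month / 10 + '0'),
                (byte) (month % 10 + '0'),
                (byte) (dayOfMonth / 10 + '0'),
                (byte) (dayOfMonth % 10 + '0')
        };
        // Public constructor: copies the bytes, unlike the zero-copy creators above.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Because every byte is an ASCII digit, the resulting byte[] is exactly what a LATIN1 coded String stores on JDK 9 and later, which is why the zero-copy creators can adopt it without any conversion.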

5 fast traversal of strings
Regardless of the JDK version, String.charAt carries a noticeable overhead: the JIT does not optimize it well, and the cost of the bounds check on the index parameter cannot be eliminated. Operating directly on the value array inside the String is faster.
public final class String { private final char value[]; public char charAt(int index) { if ((index < 0) || (index >= value.length)) { throw new StringIndexOutOfBoundsException(index); } return value[index]; } }

In JDK 9 and later, charAt is even more expensive, because every call also has to check the coder:
public final class String { private final byte[] value; private final byte coder; public char charAt(int index) { if (isLatin1()) { return StringLatin1.charAt(value, index); } else { return StringUTF16.charAt(value, index); } } }
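When access to the internal array is not an option, one portable alternative is to copy the characters out once with String.getChars and scan the copy. The sketch below is an assumption-level illustration (the class CharTraversal is hypothetical). It trades one bulk copy for the per-character charAt calls, and whether that is a net gain depends on the JDK version and the string length, which is not measured here.

```java
class CharTraversal {
    // Counts ASCII digits by copying the chars out once, then scanning the array.
    static int countDigits(String s) {
        char[] buf = new char[s.length()];
        s.getChars(0, s.length(), buf, 0); // one bulk copy instead of a charAt call per char
        int count = 0;
        for (char c : buf) {
            if (c >= '0' && c <= '9') {
                count++;
            }
        }
        return count;
    }
}
```

In a real parser the buffer would typically be reused across calls so that the allocation is not paid for every string.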

5.1 getting String.value
String.value can be obtained in the following ways:
- Use Field reflection
- Use Unsafe
The JMH comparison of Unsafe and Field reflection on JDK 8 is as follows (Unsafe reaches about 3 times the throughput of reflection):
Benchmark                         Mode  Cnt        Score       Error   Units
StringGetValueBenchmark.reflect  thrpt    5   438374.685 ±  1032.028  ops/ms
StringGetValueBenchmark.unsafe   thrpt    5  1302654.150 ± 59169.706  ops/ms

5.1.1 using reflection to get string value
static Field valueField;
static {
    try {
        valueField = String.class.getDeclaredField("value");
        valueField.setAccessible(true);
    } catch (NoSuchFieldException ignored) {}
}

////////////////////////////////////////////

char[] chars = (char[]) valueField.get(str);

5.1.2 using Unsafe to get string value
static long valueFieldOffset;
static {
    try {
        Field valueField = String.class.getDeclaredField("value");
        valueFieldOffset = UNSAFE.objectFieldOffset(valueField);
    } catch (NoSuchFieldException ignored) {}
}

////////////////////////////////////////////

char[] chars = (char[]) UNSAFE.getObject(str, valueFieldOffset);

In JDK 9 and later, value is a byte[] and the coder field has to be read as well:
static long valueFieldOffset;
static long coderFieldOffset;
static {
    try {
        Field valueField = String.class.getDeclaredField("value");
        valueFieldOffset = UNSAFE.objectFieldOffset(valueField);

        Field coderField = String.class.getDeclaredField("coder");
        coderFieldOffset = UNSAFE.objectFieldOffset(coderField);
    } catch (NoSuchFieldException ignored) {}
}

////////////////////////////////////////////

// coder is a primitive byte field, so it is read with getByte rather than getObject
byte coder = UNSAFE.getByte(str, coderFieldOffset);
byte[] bytes = (byte[]) UNSAFE.getObject(str, valueFieldOffset);

6 faster encodeUTF8 method
When String.value is directly accessible, you can run encodeUTF8 on it directly, which performs much better than String.getBytes(StandardCharsets.UTF_8).
6.1 JDK8 high performance encodeUTF8 method
public static int encodeUTF8(char[] src, int offset, int len, byte[] dst, int dp) {
    int sl = offset + len;
    int dlASCII = dp + Math.min(len, dst.length);

    // ASCII only optimized loop
    while (dp < dlASCII && src[offset] < '\u0080') {
        dst[dp++] = (byte) src[offset++];
    }

    while (offset < sl) {
        char c = src[offset++];
        if (c < 0x80) {
            // Have at most seven bits
            dst[dp++] = (byte) c;
        } else if (c < 0x800) {
            // 2 bytes, 11 bits
            dst[dp++] = (byte) (0xc0 | (c >> 6));
            dst[dp++] = (byte) (0x80 | (c & 0x3f));
        } else if (c >= '\uD800' && c < ('\uDFFF' + 1)) { //Character.isSurrogate(c) but 1.7
            final int uc;
            int ip = offset - 1;
            if (c >= '\uD800' && c < ('\uDBFF' + 1)) { // Character.isHighSurrogate(c)
                if (sl - ip < 2) {
                    uc = -1;
                } else {
                    char d = src[ip + 1];
                    // d >= '\uDC00' && d < ('\uDFFF' + 1)
                    if (d >= '\uDC00' && d < ('\uDFFF' + 1)) { // Character.isLowSurrogate(d)
                        uc = ((c << 10) + d) + (0x010000 - ('\uD800' << 10) - '\uDC00'); // Character.toCodePoint(c, d)
                    } else {
                        dst[dp++] = (byte) '?';
                        continue;
                    }
                }
            } else {
                //
                if (c >= '\uDC00' && c < ('\uDFFF' + 1)) { // Character.isLowSurrogate(c)
                    dst[dp++] = (byte) '?';
                    continue;
                } else {
                    uc = c;
                }
            }

            if (uc < 0) {
                dst[dp++] = (byte) '?';
            } else {
                dst[dp++] = (byte) (0xf0 | ((uc >> 18)));
                dst[dp++] = (byte) (0x80 | ((uc >> 12) & 0x3f));
                dst[dp++] = (byte) (0x80 | ((uc >> 6) & 0x3f));
                dst[dp++] = (byte) (0x80 | (uc & 0x3f));
                offset++; // 2 chars
            }
        } else {
            // 3 bytes, 16 bits
            dst[dp++] = (byte) (0xe0 | ((c >> 12)));
            dst[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
            dst[dp++] = (byte) (0x80 | (c & 0x3f));
        }
    }
    return dp;
}

Example of using encodeUTF8 method
char[] chars = (char[]) UNSAFE.getObject(str, valueFieldOffset);
// ensureCapacity(chars.length * 3)
byte[] bytes = ...; //
int bytesLength = IOUtils.encodeUTF8(chars, 0, chars.length, bytes, bytesOffset);

This way, the encodeUTF8 operation involves no redundant array copy, which improves performance.
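The ensureCapacity(chars.length * 3) comment in the usage example relies on the fact that a single char never needs more than 3 UTF-8 bytes, since a surrogate pair uses 4 bytes for 2 chars. The following sketch counts the exact size so the bound can be checked. The class Utf8Size is hypothetical, and a lone surrogate is counted as one byte, matching the '?' replacement used by the encoder above.

```java
class Utf8Size {
    // Exact number of UTF-8 bytes for the given chars; a lone surrogate counts as 1 byte ('?').
    static int utf8Length(char[] chars) {
        int n = 0;
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c < 0x80) {
                n += 1;
            } else if (c < 0x800) {
                n += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < chars.length
                    && Character.isLowSurrogate(chars[i + 1])) {
                n += 4; // two chars become four bytes
                i++;
            } else if (Character.isSurrogate(c)) {
                n += 1; // unpaired surrogate, replaced by '?'
            } else {
                n += 3;
            }
        }
        return n;
    }
}
```

Reserving chars.length * 3 bytes up front wastes some space for ASCII-heavy text but avoids a second pass over the input, which is usually the cheaper choice when the buffer is reused.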
6.1.1 performance test comparison
Test code
public class EncodeUTF8Benchmark { static String STR = "01234567890ABCDEFGHIJKLMNOPQRSTUVWZYZabcdefghijklmnopqrstuvwzyz one two three four five six seven eight nine ten"; static byte[] out; static long valueFieldOffset; static { out = new byte[STR.length() * 3]; try { Field valueField = String.class.getDeclaredField("value"); valueFieldOffset = UnsafeUtils.UNSAFE.objectFieldOffset(valueField); } catch (NoSuchFieldException e) { e.printStackTrace(); } } @Benchmark public void unsafeEncodeUTF8() throws Exception { char[] chars = (char[]) UnsafeUtils.UNSAFE.getObject(STR, valueFieldOffset); int len = IOUtils.encodeUTF8(chars, 0, chars.length, out, 0); } @Benchmark public void getBytesUTF8() throws Exception { byte[] bytes = STR.getBytes(StandardCharsets.UTF_8); System.arraycopy(bytes, 0, out, 0, bytes.length); } public static void main(String[] args) throws RunnerException { Options options = new OptionsBuilder() .include(EncodeUTF8Benchmark.class.getName()) .mode(Mode.Throughput) .timeUnit(TimeUnit.MILLISECONDS) .forks(1) .build(); new Runner(options).run(); } }

test result
Benchmark                              Mode  Cnt      Score      Error   Units
EncodeUTF8Benchmark.getBytesUTF8      thrpt    5  20690.960 ± 5431.442  ops/ms
EncodeUTF8Benchmark.unsafeEncodeUTF8  thrpt    5  34508.606 ±   55.510  ops/ms

Judging from the results, reading the value with Unsafe and calling encodeUTF8 directly costs about 60% of the time of String.getBytes(StandardCharsets.UTF_8), which is about 1.67 times the throughput.
6.2 JDK9/11/17 high performance encodeUTF8 method
public static int encodeUTF8(byte[] src, int offset, int len, byte[] dst, int dp) {
    int sl = offset + len;
    while (offset < sl) {
        // src holds UTF-16LE: low byte first, then high byte
        byte b0 = src[offset++];
        byte b1 = src[offset++];

        if (b1 == 0 && b0 >= 0) {
            dst[dp++] = b0;
        } else {
            char c = (char) (((b0 & 0xff) << 0) | ((b1 & 0xff) << 8));
            if (c < 0x800) {
                // 2 bytes, 11 bits
                dst[dp++] = (byte) (0xc0 | (c >> 6));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            } else if (c >= '\uD800' && c < ('\uDFFF' + 1)) { //Character.isSurrogate(c) but 1.7
                final int uc;
                int ip = offset - 1;
                if (c >= '\uD800' && c < ('\uDBFF' + 1)) { // Character.isHighSurrogate(c)
                    if (sl - ip < 2) {
                        uc = -1;
                    } else {
                        b0 = src[ip + 1];
                        b1 = src[ip + 2];
                        char d = (char) (((b0 & 0xff) << 0) | ((b1 & 0xff) << 8));
                        // d >= '\uDC00' && d < ('\uDFFF' + 1)
                        if (d >= '\uDC00' && d < ('\uDFFF' + 1)) { // Character.isLowSurrogate(d)
                            uc = ((c << 10) + d) + (0x010000 - ('\uD800' << 10) - '\uDC00'); // Character.toCodePoint(c, d)
                        } else {
                            return -1;
                        }
                    }
                } else {
                    //
                    if (c >= '\uDC00' && c < ('\uDFFF' + 1)) { // Character.isLowSurrogate(c)
                        return -1;
                    } else {
                        uc = c;
                    }
                }

                if (uc < 0) {
                    dst[dp++] = (byte) '?';
                } else {
                    dst[dp++] = (byte) (0xf0 | ((uc >> 18)));
                    dst[dp++] = (byte) (0x80 | ((uc >> 12) & 0x3f));
                    dst[dp++] = (byte) (0x80 | ((uc >> 6) & 0x3f));
                    dst[dp++] = (byte) (0x80 | (uc & 0x3f));
                    offset += 2; // skip the low surrogate, which occupies 2 bytes in src
                }
            } else {
                // 3 bytes, 16 bits
                dst[dp++] = (byte) (0xe0 | ((c >> 12)));
                dst[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            }
        }
    }
    return dp;
}

Example of using encodeUTF8 method
byte coder = UNSAFE.getByte(str, coderFieldOffset);
byte[] value = (byte[]) UNSAFE.getObject(str, valueFieldOffset);

if (coder == 0) {
    // LATIN1: a plain arraycopy is enough when every byte is ASCII (non-negative)
} else {
    // UTF16: value holds 2 bytes per char, so ensureCapacity((value.length / 2) * 3)
    byte[] bytes = ...; //
    int bytesLength = IOUtils.encodeUTF8(value, 0, value.length, bytes, bytesOffset);
}

This way, the encodeUTF8 operation involves no redundant array copy, which improves performance.
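One detail of the coder == 0 branch deserves care: coder 0 means LATIN1, not ASCII, so a value array may contain bytes from 0x80 to 0xFF, and each of those needs two bytes in UTF-8. A minimal sketch of that branch follows. The class Latin1ToUtf8 is an assumption made for this illustration and is not taken from fastjson2; dst is assumed to have room for up to 2 bytes per input byte.

```java
class Latin1ToUtf8 {
    // Encodes LATIN1 bytes (one byte per char) as UTF-8 into dst at dp; returns the new dp.
    static int encodeUTF8(byte[] latin1, int offset, int len, byte[] dst, int dp) {
        int end = offset + len;
        while (offset < end) {
            byte b = latin1[offset++];
            if (b >= 0) {
                dst[dp++] = b; // ASCII: copied unchanged
            } else {
                // U+0080 to U+00FF: two bytes
                dst[dp++] = (byte) (0xc0 | ((b & 0xff) >> 6));
                dst[dp++] = (byte) (0x80 | (b & 0x3f));
            }
        }
        return dp;
    }
}
```

When the input is known to be pure ASCII, the loop degenerates into a byte-for-byte copy, which is the case where a single System.arraycopy is enough.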
7 important reminders
The above skills are not for novices. Improper use will easily lead to bugs. If you don't understand them thoroughly, please don't use them!
Reference link:
[1]fastjson2/IOUtils.java at 2.0.3 · alibaba/fastjson2 · GitHub
[2]fastjson2/JDKUtils.java at 2.0.3 · alibaba/fastjson2 · GitHub
This article is the original content of Alibaba cloud and cannot be reproduced without permission.