The chief difference between simdjson and systems such as RapidJSON and sajson lies in simdjson's “Stage 1”: the part of the system that detects the locations of structural and pseudo-structural characters in the input. This stage works with large-scale, regular operations over 64 bytes of the input at a time, and can take advantage of both SIMD operations (when examining characters and character classes) and 64-bit bitwise operations (when transforming the masks obtained from the SIMD operations). As such, it can achieve “economies of scale” compared with a step-by-step approach.
So it reads in 64 bytes of data at once and then operates on all of them simultaneously?
Stage 1 proceeds as follows:
- Validation of UTF-8 across the whole input.
- Detection of odd-length sequences of backslashes (which escape the subsequent character).
- Detection of which characters are “inside” quotes, by filtering out the escaped quote characters found in the previous step, then doing a parallel prefix sum over XOR (using the PCLMULQDQ instruction with an argument of -1 as the multiplier) to turn our unescaped quote mask into a mask showing which bits are between a pair of quotes.
- Detection of structural characters and whitespace via table-based lookup (implemented with a pair of VPSHUFB instructions).
- Detection of pseudo-structural characters (the characters I talked about in the summary of Stage 1, which need to be exposed to the subsequent Stage 2 for error detection and atom handling).
- Conversion of the bitmask containing structural and pseudo-structural characters into a series of indexes.
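
To make the mask manipulation concrete, here is a rough scalar sketch in C++ of the backslash, quote, pseudo-structural and index-extraction steps for a single block of up to 64 bytes. It is not simdjson's code. All function names (`class_mask`, `escaped_mask`, `prefix_xor`, `structural_indexes`) are made up for illustration, the SIMD compares and VPSHUFB lookups are replaced by a plain loop, the PCLMULQDQ multiply by -1 is replaced by an equivalent shift-and-XOR cascade, and the backslash step uses a simple loop rather than a branch-free bit trick. It also assumes that a pseudo-structural character is a non-whitespace character outside a string that directly follows a structural character or whitespace, and that opening quotes are reported while closing quotes are not. UTF-8 validation and carrying state between blocks are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Bit i is set when block[i] is one of the characters in `set`.
// Scalar stand-in for a SIMD compare (or table lookup) followed by a movemask.
uint64_t class_mask(const std::string& block, const std::string& set) {
    uint64_t m = 0;
    for (std::size_t i = 0; i < block.size() && i < 64; ++i)
        if (set.find(block[i]) != std::string::npos) m |= uint64_t{1} << i;
    return m;
}

// Bit i is set when the character at i is escaped, i.e. it follows an
// odd-length run of backslashes. Plain loop, not a branch-free version.
uint64_t escaped_mask(uint64_t backslash) {
    uint64_t escaped = 0;
    bool escape_next = false;
    for (int i = 0; i < 64; ++i) {
        bool is_backslash = (backslash >> i) & 1;
        if (escape_next) {
            escaped |= uint64_t{1} << i;  // this character is escaped
            escape_next = false;
        } else if (is_backslash) {
            escape_next = true;           // an unescaped backslash escapes the next one
        }
    }
    return escaped;
}

// Prefix XOR: bit i of the result is the XOR of bits 0..i of x.
// Same result as a carry-less multiply of x by an all-ones word.
uint64_t prefix_xor(uint64_t x) {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    return x;
}

// Mask of positions inside a string: from each opening quote up to,
// but not including, the matching closing quote.
uint64_t in_string_mask(const std::string& block) {
    uint64_t escaped = escaped_mask(class_mask(block, "\\"));
    uint64_t quotes = class_mask(block, "\"") & ~escaped;
    return prefix_xor(quotes);
}

// Positions of structural and pseudo-structural characters in one block.
std::vector<uint32_t> structural_indexes(const std::string& block) {
    std::size_t n = block.size() < 64 ? block.size() : 64;
    uint64_t valid = (n == 64) ? ~uint64_t{0} : (uint64_t{1} << n) - 1;

    uint64_t escaped = escaped_mask(class_mask(block, "\\"));
    uint64_t quotes = class_mask(block, "\"") & ~escaped;  // unescaped quotes only
    uint64_t in_string = prefix_xor(quotes);
    uint64_t whitespace = class_mask(block, " \t\n\r");

    uint64_t structurals = class_mask(block, "{}[]:,") & ~in_string;
    structurals |= quotes;
    // Assumed rule: a pseudo-structural character follows a structural
    // character or whitespace, is not whitespace, and is outside strings.
    uint64_t follows = (structurals | whitespace) << 1;
    structurals |= follows & ~whitespace & ~in_string & valid;
    structurals &= ~(quotes & ~in_string);  // drop closing quotes

    // Flatten the bitmask into indexes. Real implementations typically use a
    // count-trailing-zeros instruction here instead of testing every bit.
    std::vector<uint32_t> out;
    for (uint32_t i = 0; structurals != 0; ++i, structurals >>= 1)
        if (structurals & 1) out.push_back(i);
    return out;
}
```

For the 19-byte input `{"a\"b": [1, true]}`, the in-string mask covers positions 1 to 5 (the escaped quote at position 4 does not close the string), and `structural_indexes` returns 0, 1, 7, 9, 10, 11, 13, 17, 18: the braces, brackets, colon and comma, the opening quote, and the first characters of the atoms `1` and `true`.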