Fast UTF-8 sequence validation
Nelson H. F. Beebe
beebe at math.utah.edu
Tue May 18 18:20:38 CDT 2021
I recently recorded a BibTeX entry in
http://www.math.utah.edu/pub/tex/bib/unicode.html#Keiser:2021:VUL
for a new paper that has just been published in a Wiley journal:
Validating UTF-8 in less than one instruction per byte
Software --- Practice and Experience 51(5) 950--964 May 2021
https://doi.org/10.1002/spe.2920
A preprint is available at
https://arxiv.org/abs/2010.03090
The authors exploit vector (SIMD) instructions in recent AMD/Intel x86_64
and ARM v7 NEON processors to achieve high throughput.  For mostly ASCII
sequences, their validator in some cases runs faster than the Standard C
library function memcpy(); for random UTF-8 sequences, it runs at 1/4 to
1/2 the speed of memcpy().
C++ code implementing their work is freely available at
https://github.com/lemire/validateutf8-experiments
and the paper's references contain links to earlier papers on fast
validation and transformation of Unicode character sequences.
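
For readers who have not looked at what validation involves, here is a
minimal scalar sketch in C++ of the checks that a UTF-8 validator has to
make: well-formed lead and continuation bytes, no overlong encodings, no
surrogates, and nothing above U+10FFFF.  It is NOT the authors' vectorized
algorithm, only a byte-at-a-time baseline of the kind that such work is
measured against, and the function name is_valid_utf8 is my own
hypothetical choice, not something taken from their repository.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar baseline (not the paper's SIMD method): returns true
// if the n bytes at s form a well-formed UTF-8 sequence.
bool is_valid_utf8(const unsigned char* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = s[i];
        if (b < 0x80) {            // ASCII: one byte, always valid
            ++i;
            continue;
        }

        std::size_t len;           // total bytes in this encoded character
        std::uint32_t cp;          // code point being decoded
        std::uint32_t min;         // smallest code point allowed for len
        if ((b & 0xE0) == 0xC0) {
            len = 2; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            len = 4; cp = b & 0x07; min = 0x10000;
        } else {
            return false;          // stray continuation byte, or 0xF8..0xFF
        }

        if (n - i < len) return false;               // truncated at the end

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;    // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                  // overlong encoding
        if (cp > 0x10FFFF) return false;             // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate

        i += len;
    }
    return true;
}
```

A loop like this takes a data-dependent branch on nearly every byte, which
is the per-byte cost that a vectorized validator avoids by examining a
whole block of bytes with each instruction.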
-------------------------------------------------------------------------------
- Nelson H. F. Beebe                  Tel: +1 801 581 5254                    -
- University of Utah                  FAX: +1 801 581 4148                    -
- Department of Mathematics, 110 LCB  Internet e-mail: beebe at math.utah.edu -
- 155 S 1400 E RM 233                 beebe at acm.org  beebe at computer.org -
- Salt Lake City, UT 84112-0090, USA  URL: http://www.math.utah.edu/~beebe/   -
-------------------------------------------------------------------------------