Fast UTF-8 sequence validation

Nelson H. F. Beebe beebe at math.utah.edu
Tue May 18 18:20:38 CDT 2021


I recently recorded a BibTeX entry in

	http://www.math.utah.edu/pub/tex/bib/unicode.html#Keiser:2021:VUL

for a paper that has just been published in a Wiley journal:

	Validating UTF-8 in less than one instruction per byte
	Software --- Practice and Experience 51(5) 950--964 May 2021
	https://doi.org/10.1002/spe.2920

A preprint is available at

	https://arxiv.org/abs/2010.03090

The authors exploit vector instructions in recent AMD/Intel x86_64 and
ARM v7 NEON processors to achieve high validation throughput.  For
mostly ASCII sequences, it in some cases exceeds that of the Standard C
library function memcpy(); for random UTF-8 sequences, it runs at 1/4
to 1/2 the speed of memcpy().
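
For readers who want a reminder of what "validation" has to check, here
is a minimal scalar sketch in C++.  It is NOT the vectorized algorithm
from the paper; it is only a byte-at-a-time baseline of the kind such
work is measured against.  The function name is_valid_utf8() is my own
hypothetical choice, and the rules it enforces (no overlong forms, no
surrogates, nothing above U+10FFFF, no truncated sequences) are the
usual well-formedness rules as I understand them.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar baseline, not the paper's SIMD code.
// Returns true if buf[0..len) is well-formed UTF-8.
bool is_valid_utf8(const unsigned char *buf, std::size_t len) {
    std::size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        if (b < 0x80) {              // ASCII: one byte, always valid
            ++i;
            continue;
        }
        std::size_t n;               // number of continuation bytes expected
        std::uint32_t cp;            // code point being assembled
        std::uint32_t min;           // smallest code point legal at this length
        if ((b & 0xE0) == 0xC0) {
            n = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            n = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            n = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return false;            // stray continuation or invalid lead byte
        }
        if (len - i <= n) {
            return false;            // sequence truncated at end of input
        }
        for (std::size_t k = 1; k <= n; ++k) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) {
                return false;        // not a 10xxxxxx continuation byte
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += n + 1;
    }
    return true;
}
```

A loop like this takes a data-dependent branch on every byte, which is
the per-byte cost that a vectorized validator is meant to avoid.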

C++ code implementing their work is freely available at

	https://github.com/lemire/validateutf8-experiments

and the paper's references contain links to earlier papers on fast
validation and transformation of Unicode character sequences.

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: beebe at math.utah.edu  -
- 155 S 1400 E RM 233                       beebe at acm.org  beebe at computer.org -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------