<html><head>
<title>Even faster UTF-8 character counting</title>
</head><body>
<div class="banner">
<h1><a href="/blog/">Daemonic Dispatches</a></h1>
Musings from Colin Percival
</div>
<div class="nonbanner">
<div class="content">
<h2>Even faster UTF-8 character counting</h2>
I recently came across two articles,
"<a href="http://canonical.org/~kragen/strlen-utf8.html">Counting
characters in UTF-8 strings is fast</a>" by Kragen Sitaker, and
"<a href="http://porg.es/blog/counting-characters-in-utf-8-strings-is-faster">Counting
characters in UTF-8 strings is fast(er)</a>" by George Pollard, which
provide a series of successively faster ways of (as the article names
suggest) counting the number of UTF-8 characters in a NUL-terminated
string. We can do better.
<p>
Kragen takes the approach of examining each byte in sequence and
asking if (a) is it the terminating NUL, and (b) is it the first of a
UTF-8 character. This last test is quite easy: Bytes <tt>0x01</tt>
through <tt>0x7F</tt> in UTF-8 represent the corresponding ASCII
characters, bytes <tt>0xC0</tt> through <tt>0xFF</tt> are the first
byte of a multi-byte character, and bytes <tt>0x80</tt> through
<tt>0xBF</tt> appear only as the trailing bytes of multi-byte
characters; so a byte starts a character exactly when
<tt>(b &amp; 0xC0) != 0x80</tt>. This results in the
following inner loop (modulo some style changes to make it easier to
compare this against later versions):
<pre>
while (s[i]) {
	if ((s[i] &amp; 0xC0) != 0x80)
		j++;
	i++;
}
return (j);
</pre>
<p>
Kragen continues by comparing this to an optimized version written
in x86 assembly language by Aristotle Pagaltzis; Aristotle's
version cleverly takes advantage of the <tt>shl</tt> instruction
setting the sign, carry, and zero flags, but otherwise applies
exactly the same algorithm:
<pre>
loopa:	dec %ecx
loopb:	lodsb
	shl $1, %al
	js loopa
	jc loopb
	jnz loopa
</pre>
However, this assembly language version, like Kragen's C version,
inspects each of the bytes one by one, which inherently limits
the performance.
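<p>
To make that flag trick concrete, here is a short C rendition (my
annotation, not code from either article) of what the flags mean for
one byte <tt>b</tt> after <tt>shl $1, %al</tt>:
<pre>
/* Annotation only: the flags left by "shl $1, %al" for one byte b. */
unsigned char b = *s++;                    /* lodsb                      */
int carry = (b &amp; 0x80) != 0;               /* CF = old bit 7             */
int sign  = (b &amp; 0x40) != 0;               /* SF = new bit 7 = old bit 6 */
int zero  = (unsigned char)(b &lt;&lt; 1) == 0;  /* ZF: b was 0x00 or 0x80     */

/* js:  SF set (0x40-0x7F, 0xC0-0xFF)      -&gt; count a character  */
/* jc:  CF set (0x80-0xBF, a continuation) -&gt; skip, do not count */
/* jnz: ZF clear (0x01-0x3F)               -&gt; count a character  */
/* fall through: b == 0x00, the NUL        -&gt; done               */
</pre>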
<p>
George Pollard makes the assumption that the input string is valid
UTF-8, and notices that by looking at the first byte of a multibyte
character, we can determine the length of the character: If the
first byte is between <tt>0xC0</tt> and <tt>0xDF</tt>, the UTF-8
character has two bytes; if it is between <tt>0xE0</tt> and
<tt>0xEF</tt>, the UTF-8 character has three bytes; and if it is
between <tt>0xF0</tt> and <tt>0xFF</tt>, the UTF-8 character has four bytes.
After reading the first byte of a multibyte character, George
skips over the trailing bytes. He also fast-paths the handling
of ASCII characters, treating characters as signed bytes in order
to distinguish between ASCII and non-ASCII characters, while
giving a wonderful example of using a <tt>goto</tt> to jump from
the middle of one loop into the middle of another:
<pre>
while (s[i] &gt; 0) {
ascii:
	i++;
}
count += i - iBefore;
while (s[i]) {
	if (s[i] &gt; 0) {
		iBefore = i;
		goto ascii;
	} else {
		switch (0xF0 &amp; s[i]) {
		case 0xE0:
			i += 3;
			break;
		case 0xF0:
			i += 4;
			break;
		default:
			i += 2;
			break;
		}
	}
	count++;
}
</pre>
While this code is considerably faster than both Kragen's C code
and Aristotle's assembly code, it suffers from two performance-limiting
factors: First, it uses conditional branches which will
only be consistently predicted correctly if all of the characters
encountered have the same length; and second, it still inspects
characters one by one.
<p>
This can be improved in three ways:
<ul><li>
Instead of using conditional branches, identify the initial bytes
of UTF-8 characters using logical operations only.
</li><li>
Instead of handling one character at a time, vectorize: handle many
bytes in parallel.
</li><li>
In order to reduce the cost of waiting for memory, prefetch data
if possible.
</li></ul>
Making these improvements gave me the following code:
<pre>
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define ONEMASK ((size_t)(-1) / 0xFF)

static size_t
cp_strlen_utf8(const char * _s)
{
	const char * s;
	size_t count = 0;
	size_t u;
	unsigned char b;

	/* Handle any initial misaligned bytes. */
	for (s = _s; (uintptr_t)(s) &amp; (sizeof(size_t) - 1); s++) {
		b = *s;

		/* Exit if we hit a zero byte. */
		if (b == '\0')
			goto done;

		/* Is this byte NOT the first byte of a character? */
		count += (b &gt;&gt; 7) &amp; ((~b) &gt;&gt; 6);
	}

	/* Handle complete blocks. */
	for (; ; s += sizeof(size_t)) {
		/* Prefetch 256 bytes ahead. */
		__builtin_prefetch(&amp;s[256], 0, 0);

		/* Grab 4 or 8 bytes of UTF-8 data. */
		u = *(size_t *)(s);

		/* Exit the loop if there are any zero bytes. */
		if ((u - ONEMASK) &amp; (~u) &amp; (ONEMASK * 0x80))
			break;

		/* Count bytes which are NOT the first byte of a character. */
		u = ((u &amp; (ONEMASK * 0x80)) &gt;&gt; 7) &amp; ((~u) &gt;&gt; 6);
		count += (u * ONEMASK) &gt;&gt; ((sizeof(size_t) - 1) * 8);
	}

	/* Take care of any left-over bytes. */
	for (; ; s++) {
		b = *s;

		/* Exit if we hit a zero byte. */
		if (b == '\0')
			break;

		/* Is this byte NOT the first byte of a character? */
		count += (b &gt;&gt; 7) &amp; ((~b) &gt;&gt; 6);
	}

done:
	return ((s - _s) - count);
}
</pre>
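<p>
The two word-size tricks in that inner loop are dense, so here is a
small standalone demonstration (my addition, not from the original
post) of how they behave on one concrete little-endian 8-byte word:
<pre>
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

#define ONEMASK ((uint64_t)(-1) / 0xFF)	/* 0x0101010101010101 */

int
main(void)
{
	/* The bytes of "na\xc3\xafve\0\0" loaded as a little-endian word. */
	uint64_t u = 0x00006576afc3616eULL;
	uint64_t v;

	/*
	 * Zero-byte detection: the result is nonzero exactly when some
	 * byte of u is 0x00 (a borrow can only start at a zero byte).
	 */
	v = (u - ONEMASK) &amp; (~u) &amp; (ONEMASK * 0x80);
	printf("zero byte present: %s\n", v ? "yes" : "no");

	/*
	 * Continuation-byte counting: put 0x01 in each byte whose top
	 * two bits are 10, then sum all eight bytes at once; multiplying
	 * by ONEMASK accumulates the byte sums into the top byte.
	 */
	v = ((u &amp; (ONEMASK * 0x80)) &gt;&gt; 7) &amp; ((~u) &gt;&gt; 6);
	printf("continuation bytes: %llu\n",
	    (unsigned long long)((v * ONEMASK) &gt;&gt; 56));
	return (0);
}
</pre>
For this word the zero-byte test fires (the two padding bytes are zero)
and the count is 1, matching the single continuation byte <tt>0xAF</tt>
in the two-byte sequence for "&#xef;".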
<p>
How much faster is this? I put together
<a href="strlentest.c">a slightly improved version of Kragen's
benchmark code</a>, using a buffer filled with valid UTF-8 text
instead of his more artificial test cases, and ran it on an
Opteron 848 @ 2.2 GHz running FreeBSD 7.0-RELEASE-p1 after compiling
with gcc 4.2.1 with the -O3 flag set. Some notes to help decipher
the output:
<ul><li>
The function names and their meanings are
<ul><li>
<tt>gcc_strlen</tt> = the strlen() function as compiled by gcc;
</li><li>
<tt>kjs_strlen</tt> = Kragen's non-UTF-8 strlen function;
</li><li>
<tt>cp_strlen</tt> = my non-UTF-8 strlen function (not shown here, but
see the source code if you're interested);
</li><li>
<tt>kjs_strlen_utf8</tt> = Kragen's UTF-8 character counter;
</li><li>
<tt>gp_strlen_utf8</tt> = George's UTF-8 character counter; and
</li><li>
<tt>cp_strlen_utf8</tt> = my UTF-8 character counter.
</li></ul>
</li><li>
The test strings are "hello, world", "na&#xef;ve", and
"&#x3053;&#x3093;&#x306b;&#x3061;&#x306f;".
</li><li>
The values printed on each line before the colon are the result
computed -- the number of bytes for <tt>strlen</tt> functions, and the
number of UTF-8 characters for <tt>strlen_utf8</tt> functions; the
values after the colon are the mean and standard deviation time taken
in seconds.
</li></ul>
<p>
The improvement is striking:
<pre>
testing 33554424 bytes of repeated "hello, world":
gcc_strlen = 33554424: 0.034169 +/- 0.000090
kjs_strlen = 33554424: 0.049529 +/- 0.000280
cp_strlen = 33554424: 0.011357 +/- 0.000030
kjs_strlen_utf8 = 33554424: 0.060930 +/- 0.000031
gp_strlen_utf8 = 33554424: 0.049675 +/- 0.000294
cp_strlen_utf8 = 33554424: 0.014049 +/- 0.000047
testing 33554430 bytes of repeated "na?ve":
gcc_strlen = 33554430: 0.034168 +/- 0.000069
kjs_strlen = 33554430: 0.049544 +/- 0.000287
cp_strlen = 33554430: 0.011348 +/- 0.000021
kjs_strlen_utf8 = 27962025: 0.061020 +/- 0.000291
gp_strlen_utf8 = 27962025: 0.059726 +/- 0.000029
cp_strlen_utf8 = 27962025: 0.014041 +/- 0.000043
testing 33554430 bytes of repeated "?????":
gcc_strlen = 33554430: 0.034157 +/- 0.000088
kjs_strlen = 33554430: 0.049437 +/- 0.000018
cp_strlen = 33554430: 0.011438 +/- 0.000286
kjs_strlen_utf8 = 11184810: 0.060919 +/- 0.000032
gp_strlen_utf8 = 11184810: 0.027454 +/- 0.000031
cp_strlen_utf8 = 11184810: 0.014133 +/- 0.000287
</pre>
Not only is vectorized character counting faster than the "look at a
byte, skip a few" approach, it isn't even close: Even when the
characters are 3 bytes each (as in the case of
"&#x3053;&#x3093;&#x306b;&#x3061;&#x306f;"), the vectorized approach
wins by a factor of 2; and its lead is larger when the skipping
approach can't skip as many bytes. Moreover, vectorized character
counting is only 30% slower than a vectorized strlen and more than
twice as fast as a non-vectorized strlen -- although given that
character counting runs at slightly faster than one byte per clock
cycle, it's not surprising that non-vectorized code can't keep up!
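(Working the numbers: 33554424 bytes / 0.014049 s &#x2248; 2.4 &#xd7; 10<sup>9</sup>
bytes per second, which at 2.2 GHz comes to roughly 1.09 bytes per
clock cycle.)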
<p>
Can we do better? I think so. My code uses 64-bit integer registers
to manipulate 8 bytes at once; this is the same size as MMX registers,
so those probably won't be very useful, but with SSE2, 16 bytes can be
manipulated at once, which could provide another doubling of the
performance.
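<p>
To make that concrete, here is a minimal SSE2 sketch (my own
illustration, with a hypothetical name <tt>sse2_count_utf8</tt>, and not
benchmarked here) of the counting step, applied to a buffer of known
length; working from a known length rather than a NUL terminator also
sidesteps the question of reading past the end of the string:
<pre>
#include &lt;emmintrin.h&gt;	/* SSE2 intrinsics */
#include &lt;stddef.h&gt;

/*
 * Sketch only: count UTF-8 characters in a buffer of known length,
 * 16 bytes at a time.  Continuation bytes are 0x80 through 0xBF,
 * which as signed bytes are exactly the values less than -64 (0xC0).
 */
static size_t
sse2_count_utf8(const char * buf, size_t len)
{
	const __m128i thresh = _mm_set1_epi8(-64);
	const __m128i zero = _mm_setzero_si128();
	size_t count = 0;
	size_t i;

	for (i = 0; i + 16 &lt;= len; i += 16) {
		__m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));

		/* 0xFF in each continuation-byte lane, 0x00 elsewhere. */
		__m128i cont = _mm_cmplt_epi8(chunk, thresh);

		/*
		 * Negate each lane (0xFF becomes 0x01), then sum all 16
		 * lanes with psadbw; the two 8-byte halves produce sums
		 * in the low bits of each 64-bit half of the result.
		 */
		__m128i sums = _mm_sad_epu8(_mm_sub_epi8(zero, cont), zero);
		count += (size_t)_mm_cvtsi128_si32(sums) +
		    (size_t)_mm_extract_epi16(sums, 4);
	}

	/* Scalar tail for the last few bytes. */
	for (; i &lt; len; i++)
		count += ((unsigned char)buf[i] &amp; 0xC0) == 0x80;

	return (len - count);
}
</pre>
A real replacement for <tt>cp_strlen_utf8</tt> would still need the
aligned loads and the zero-byte test from the scalar version; this
sketch only shows how the per-byte classification and the horizontal
sum vectorize.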
<p>
Beyond a doubling? Well, the first rule of optimization is to start
by finding a good algorithm -- and any algorithm in which the critical
path involves counting UTF-8 characters in a 32 megabyte NUL-terminated
string is doing something wrong. This is very much a toy problem; but
the lesson it teaches is worth remembering: Vectorization is good!
<p>
<div class="trailer">
Posted at 2008-06-05 09:20 | <a href="2008-06-05-faster-utf8-strlen.html">Permanent link</a> |
<a href="2008-06-05-faster-utf8-strlen.html#disqus_thread">Comments</a>
</div>
<span class="post-byline">
<span class="author publisher-anchor-color"><a href="#" data-action="profile" data-user="25571647" data-role="username">Philippe Verdy</a></span>
</span>
<span class="bullet time-ago-bullet" aria-hidden="true"></span>
<a href="#comment-787106245" data-role="relative-time" class="time-ago" title="Sunday, February 3 2013 5:49 AM">a year ago</a>
<p>
There's a potential bug in the vectorized version, because it reads data
past the end of the string in this line: <tt>u = *(size_t *)(s);</tt>
<p>
If the final zero byte does not fall at the end of a 4-byte or 8-byte
aligned block, this will read extra bytes (up to 7 extra bytes when
compiling with a 64-bit <tt>size_t</tt>). In cases where memory is
strictly protected, any attempt to read past this zero byte may mean
reading past the end of the buffer, causing an exception (SEGV).
<p>
As there's no exception handler to detect this, the function will cause
the caller not to receive the expected result (it would be safe to catch
the exception at this point as another way to break the loop, instead of
depending only on the presence of the zero byte). One case where this
exception will occur is when the function is used to get the length of a
string stored in a protected area, or in a memory-mapped I/O area (where
the extra reads may not cause an exception but could have other side
effects, or could cause a different exception); so the function is only
usable on strings stored in ordinary allocated memory.
<p>
But even in ordinary memory (allocated by <tt>malloc()</tt>, or strings
in buffers on the caller's stack), there could also be protection fences
enforced by hardware.
<p>
The issue is even more severe with the prefetch instruction, which
attempts to read 256 bytes ahead without knowing whether those 256 bytes
lie in a segment valid for the current process; a SEGV fault is very
likely for strings allocated on the heap or near the top of the caller's
stack.
<p>
For this reason, there is in fact NO safe way to parallelize the
<tt>strlen()</tt> function, as the ONLY stopping condition is the first
occurrence of the zero byte. You HAVE TO read the bytes one after the
other, or make sure that your program allocates all its strings in
memory segments that are <tt>size_t</tt>-aligned (and there is no safe
way to use the prefetch unless all your strings are allocated from
buffers with at least 256 bytes of valid memory past the end of the
string).
<p>
The role of the prefetch is not to optimize the CPU L3 cache (a 32-byte
prefetch distance would be enough in most cases with current processors)
but to anticipate the page load when the memory has been paged out and
must be read back from disk. In that case, though, the page-in will take
far longer than processing the 256 bytes in 32 iterations of the
parallelized code handling 8 bytes each: 32 iterations take just a few
hundred nanoseconds, while a page-in from disk can take a dozen
milliseconds or more (other memory may first need to be paged out), so
the wait is not avoided.
<p>
In my opinion the prefetch is unnecessary, as the gain is almost
nonexistent (except for a single line within the CPU L1 data cache,
where it helps accelerate loading that line from the L2 data cache, and
possibly from the unified L3 cache or from a larger external memory
cache on the system bus, if supported by the north bridge between the
CPU and the physical RAM; but such external caches almost never exist
today, except on costly systems built with north bridges hosting
multiple CPUs sharing the same external RAM).
<p>
Note that hardware memory fences are now used in strict systems tuned
for security, and by some debuggers trying to detect illegal accesses
past the end of buffers; this is a reality today. Some antivirus
solutions (or system settings) may even place such fences on all buffers
allocated on the stack or in the process's local heap. But most memory
allocation routines (<tt>malloc()</tt> and so on) now return memory in
aligned blocks, because this greatly simplifies the implementation of
<tt>free()</tt> and of garbage collectors (like the "oops" used in Java
VMs to extend the VM size beyond 3 GB while still using 32-bit
pointers): it is considerably faster to allocate a few more bytes while
preserving the alignment. (Of course this does not protect memory
accesses between multiple threads of the same process, as some debuggers
try to do for stress tests; but in general these extra fences only
protect against illegal writes, not against read-only accesses. As
<tt>strlen()</tt> is read-only, it remains safe to assume that reads
within the same 4-byte or 8-byte aligned block are possible.)
<p>
If you want to use prefetch, it is only safe if the prefetches stay
within the same memory-allocator block, and useful ONLY if the allocator
block size is larger than the L1 or L2 CPU cache line (for example, if
the allocator only returns 32-byte-aligned blocks, and the CPU L1 cache
line is only 4 or 8 bytes, then you save a few nanoseconds by
prefetching the end of the current allocator block):
<tt>__builtin_prefetch((char*)((size_t)s | (size_t)31), 0, 0);</tt>
<p>
Used this way, the prefetch fetches nothing new when the loop is already
processing the last 4-byte or 8-byte word of the current 32-byte-aligned
block, but care should be taken that it never reads past the top of the
stack if the string was allocated near the top of the caller's stack. To
ensure the top of the stack is never reached, the <tt>strlen()</tt>
routine could allocate its own dummy local 32-byte buffer (and make sure
an optimizing compiler does not discard this unused local buffer).
<p>
Another note: the gcc documentation says that the prefetch will not
fault if the given address is invalid, as long as the expression used to
compute that address is valid. However, validity does not mean that such
preparation (here for read access only, per the second parameter = 0,
with the third parameter = 0 tuning the persistence or locality of the
prefetch in the cache) has no undesired side effects (for example, when
processing strings stored in memory-mapped hardware I/O space). If the
default implementation of <tt>strlen()</tt> in the gcc standard C
library does not use prefetch, it is because no such assumption is made:
the standard library is built to be usable in all cases, for all types
of memory blocks.
</div>
</div>
</div>
</body></html>