Even faster UTF-8 character counting
I recently came across two articles, "Counting characters in UTF-8 strings is fast" by Kragen Sitaker, and "Counting characters in UTF-8 strings is fast(er)" by George Pollard, which provide a series of successively faster ways of (as the article names suggest) counting the number of UTF-8 characters in a NUL-terminated string. We can do better.

Kragen takes the approach of examining each byte in sequence and asking (a) is it the terminating NUL, and (b) is it the first byte of a UTF-8 character. The second test is quite easy: bytes 0x01 through 0x7F in UTF-8 represent the corresponding ASCII characters, while bytes 0xC0 through 0xFF are the first byte of a multi-byte character. This results in the following inner loop (modulo some style changes to make it easier to compare this against later versions):
	while (s[i]) {
		if ((s[i] & 0xC0) != 0x80)
			j++;
		i++;
	}
	return (j);
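For experimentation, the loop can be wrapped into a complete function; the wrapper and its name below are mine, and only the inner loop comes from Kragen's code:

```c
#include <stddef.h>

/* Kragen's byte-by-byte algorithm; the surrounding function is a
 * sketch added for testing, not part of the original article. */
static size_t
kjs_count_utf8(const char *s)
{
	size_t i = 0, j = 0;

	while (s[i]) {
		/* Count every byte which is not a continuation byte. */
		if ((s[i] & 0xC0) != 0x80)
			j++;
		i++;
	}
	return (j);
}
```

Note that the test works whether char is signed or unsigned, since the masked comparison only looks at the top two bits of the byte.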
Kragen continues by comparing this to an optimized version written in x86 assembly language by Aristotle Pagaltzis; Aristotle's version cleverly takes advantage of the shl instruction setting the sign, carry, and zero flags, but otherwise applies exactly the same algorithm:
loopa:	dec %ecx
loopb:	lodsb
	shl $1, %al
	js loopa
	jc loopb
	jnz loopa

However, this assembly language version, like Kragen's C version, inspects each of the bytes one by one, which inherently limits the performance.
George Pollard makes the assumption that the input string is valid UTF-8, and notices that by looking at the first byte of a multibyte character, we can determine the length of the character: if the first byte is between 0xC0 and 0xDF, the UTF-8 character has two bytes; if it is between 0xE0 and 0xEF, the UTF-8 character has three bytes; and if it is between 0xF0 and 0xFF, the UTF-8 character has four bytes. After reading the first byte of a multibyte character, George skips over the trailing bytes. He also fast-paths the handling of ASCII characters, treating characters as signed bytes in order to distinguish between ASCII and non-ASCII characters, while giving a wonderful example of using a goto to jump from the middle of one loop into the middle of another:
	while (s[i] > 0) {
ascii:
		i++;
	}
	count += i - iBefore;

	while (s[i]) {
		if (s[i] > 0) {
			iBefore = i;
			goto ascii;
		} else {
			switch (0xF0 & s[i]) {
			case 0xE0:
				i += 3;
				break;
			case 0xF0:
				i += 4;
				break;
			default:
				i += 2;
				break;
			}
		}
		count++;
	}

While this code is considerably faster than both Kragen's C code and Aristotle's assembly code, it suffers from two performance-limiting factors: first, it uses conditional branches which will only be consistently predicted correctly if all of the characters encountered have the same length; and second, it still inspects characters one by one.
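Assembled into a self-contained function, George's fragment might look like the sketch below; the declarations and wrapper are my reconstruction, and it assumes char is signed (as on x86), which is exactly the signed-byte trick described above:

```c
#include <stddef.h>

/* George's skip-ahead algorithm, reconstructed into a complete
 * function (a sketch; only the two loops are from the quoted code).
 * Assumes plain char is signed, so ASCII bytes satisfy s[i] > 0. */
static size_t
gp_count_utf8(const char *s)
{
	size_t i = 0, iBefore = 0, count = 0;

	/* ASCII fast path; entered from below via goto. */
	while (s[i] > 0) {
ascii:
		i++;
	}
	count += i - iBefore;

	while (s[i]) {
		if (s[i] > 0) {
			iBefore = i;
			goto ascii;
		} else {
			/* Lead byte: skip the whole character. */
			switch (0xF0 & s[i]) {
			case 0xE0:
				i += 3;
				break;
			case 0xF0:
				i += 4;
				break;
			default:
				i += 2;
				break;
			}
		}
		count++;
	}
	return (count);
}
```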
This can be improved in three ways:
- Instead of using conditional branches, identify the initial bytes of UTF-8 characters using logical operations only.
- Instead of handling one character at once, vectorize: Handle lots of bytes in parallel.
- In order to reduce the cost of waiting for memory, prefetch data if possible.
#define ONEMASK ((size_t)(-1) / 0xFF)

static size_t
cp_strlen_utf8(const char * _s)
{
	const char * s;
	size_t count = 0;
	size_t u;
	unsigned char b;

	/* Handle any initial misaligned bytes. */
	for (s = _s; (uintptr_t)(s) & (sizeof(size_t) - 1); s++) {
		b = *s;

		/* Exit if we hit a zero byte. */
		if (b == '\0')
			goto done;

		/* Is this byte NOT the first byte of a character? */
		count += (b >> 7) & ((~b) >> 6);
	}

	/* Handle complete blocks. */
	for (; ; s += sizeof(size_t)) {
		/* Prefetch 256 bytes ahead. */
		__builtin_prefetch(&s[256], 0, 0);

		/* Grab 4 or 8 bytes of UTF-8 data. */
		u = *(size_t *)(s);

		/* Exit the loop if there are any zero bytes. */
		if ((u - ONEMASK) & (~u) & (ONEMASK * 0x80))
			break;

		/* Count bytes which are NOT the first byte of a character. */
		u = ((u & (ONEMASK * 0x80)) >> 7) & ((~u) >> 6);
		count += (u * ONEMASK) >> ((sizeof(size_t) - 1) * 8);
	}

	/* Take care of any left-over bytes. */
	for (; ; s++) {
		b = *s;

		/* Exit if we hit a zero byte. */
		if (b == '\0')
			break;

		/* Is this byte NOT the first byte of a character? */
		count += (b >> 7) & ((~b) >> 6);
	}

done:
	return ((s - _s) - count);
}
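Two bit tricks in the middle loop of cp_strlen_utf8 deserve a closer look: the zero-byte test and the byte-summing multiply. A minimal sketch isolating them (the helper names are mine):

```c
#include <stddef.h>

#define ONEMASK ((size_t)(-1) / 0xFF)	/* 0x0101...01 */

/*
 * Nonzero iff some byte of u is zero: subtracting ONEMASK borrows into
 * the high bit of any byte which was zero, and "& ~u" discards bytes
 * whose high bit was already set in u to begin with.
 */
static int
has_zero_byte(size_t u)
{

	return (((u - ONEMASK) & (~u) & (ONEMASK * 0x80)) != 0);
}

/*
 * Sum of the bytes of u, provided that sum fits in a single byte:
 * multiplying by ONEMASK accumulates the byte sum into the top byte
 * of the product, which the shift then extracts.
 */
static size_t
byte_sum(size_t u)
{

	return ((u * ONEMASK) >> ((sizeof(size_t) - 1) * 8));
}
```

In cp_strlen_utf8, the multiply is applied to a word whose bytes are each 0 or 1 (one per continuation byte), so the per-byte sums can never carry.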
How much faster is this? I put together a slightly improved version of Kragen's benchmark code, using a buffer filled with valid UTF-8 text instead of his more artificial test cases, and ran it on an Opteron 848 @ 2.2 GHz running FreeBSD 7.0-RELEASE-p1 after compiling with gcc 4.2.1 with the -O3 flag set. Some notes to help decipher the output:
- The function names and their meanings are
- gcc_strlen = the strlen() function as compiled by gcc;
- kjs_strlen = Kragen's non-UTF-8 strlen function;
- cp_strlen = my non-UTF-8 strlen function (not shown here, but see the source code if you're interested);
- kjs_strlen_utf8 = Kragen's UTF-8 character counter;
- gp_strlen_utf8 = George's UTF-8 character counter; and
- cp_strlen_utf8 = my UTF-8 character counter.
- The test strings are "hello, world", "naïve", and "こんにちは".
- The values printed on each line before the colon are the result computed -- the number of bytes for strlen functions, and the number of UTF-8 characters for strlen_utf8 functions; the values after the colon are the mean and standard deviation time taken in seconds.
The improvement is striking:
testing 33554424 bytes of repeated "hello, world":
gcc_strlen = 33554424: 0.034169 +/- 0.000090
kjs_strlen = 33554424: 0.049529 +/- 0.000280
cp_strlen = 33554424: 0.011357 +/- 0.000030
kjs_strlen_utf8 = 33554424: 0.060930 +/- 0.000031
gp_strlen_utf8 = 33554424: 0.049675 +/- 0.000294
cp_strlen_utf8 = 33554424: 0.014049 +/- 0.000047

testing 33554430 bytes of repeated "na?ve":
gcc_strlen = 33554430: 0.034168 +/- 0.000069
kjs_strlen = 33554430: 0.049544 +/- 0.000287
cp_strlen = 33554430: 0.011348 +/- 0.000021
kjs_strlen_utf8 = 27962025: 0.061020 +/- 0.000291
gp_strlen_utf8 = 27962025: 0.059726 +/- 0.000029
cp_strlen_utf8 = 27962025: 0.014041 +/- 0.000043

testing 33554430 bytes of repeated "?????":
gcc_strlen = 33554430: 0.034157 +/- 0.000088
kjs_strlen = 33554430: 0.049437 +/- 0.000018
cp_strlen = 33554430: 0.011438 +/- 0.000286
kjs_strlen_utf8 = 11184810: 0.060919 +/- 0.000032
gp_strlen_utf8 = 11184810: 0.027454 +/- 0.000031
cp_strlen_utf8 = 11184810: 0.014133 +/- 0.000287

Not only is vectorized character counting faster than the "look at a byte, skip a few" approach, it isn't even close: even when the characters are 3 bytes each (as in the case of "こんにちは"), the vectorized approach wins by a factor of 2; and its lead is larger when the skipping approach can't skip as many bytes. Moreover, vectorized character counting is only 30% slower than a vectorized strlen and more than twice as fast as a non-vectorized strlen -- although given that character counting runs at slightly faster than one byte per clock cycle, it's not surprising that non-vectorized code can't keep up!
Can we do better? I think so. My code uses 64-bit integer registers to manipulate 8 bytes at once; this is the same size as MMX registers, so those probably won't be very useful, but with SSE2, 16 bytes can be manipulated at once, which could provide another doubling of the performance.
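As a rough sketch of what such an SSE2 version might look like (the function name and the explicit-length interface are my own choices, made partly to sidestep the NUL-scanning question; assumes an x86 target with SSE2 and gcc builtins):

```c
#include <emmintrin.h>	/* SSE2 intrinsics */
#include <stddef.h>

/*
 * Count UTF-8 characters in a buffer of known length, 16 bytes at a
 * time.  A sketch only: this is not the benchmarked code above.
 */
static size_t
sse2_count_utf8(const char *s, size_t len)
{
	const __m128i mask_c0 = _mm_set1_epi8((char)0xC0);
	const __m128i mask_80 = _mm_set1_epi8((char)0x80);
	size_t cont = 0;	/* continuation bytes seen so far */
	size_t i = 0;

	/* Handle 16 bytes per iteration. */
	for (; i + 16 <= len; i += 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)(s + i));

		/* A byte is a continuation byte iff (b & 0xC0) == 0x80. */
		__m128i c = _mm_cmpeq_epi8(_mm_and_si128(v, mask_c0),
		    mask_80);

		/* One mask bit per matching byte; count them. */
		cont += (size_t)__builtin_popcount(
		    (unsigned)_mm_movemask_epi8(c));
	}

	/* Take care of any left-over bytes, one at a time. */
	for (; i < len; i++) {
		unsigned char b = (unsigned char)s[i];

		cont += (b >> 7) & ((~b) >> 6);
	}

	return (len - cont);
}
```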
Beyond a doubling? Well, the first rule of optimization is to start by finding a good algorithm -- and any algorithm in which the critical path involves counting UTF-8 characters in a 32 megabyte NUL-terminated string is doing something wrong. This is very much a toy problem; but the lesson it teaches is worth remembering: Vectorization is good!
There's a potential bug in the vectorized version: it reads data past the end of the string in this line:
u = *(size_t *)(s);
Effectively, if the final zero byte does not fall at the end of a 4-byte or 8-byte aligned block, this will read extra bytes (up to 7 extra bytes when compiling with a 64-bit size_t). In cases where memory is strictly protected, any attempt to read past this zero byte may mean reading past the end of the buffer, which will cause an exception (SIGSEGV).

As there's no exception handler to detect this, the caller will not receive the expected result (it would be safe to catch the exception at this point as another way to break out of the loop, instead of depending *only* on the presence of the zero byte). One case where this exception will occur is when the function is used to get the length of a string stored in a protected area, or in a memory-mapped I/O area (where the extra reads may not cause an exception but could have other side effects, or would cause a different exception); so the function is only usable to measure the length of strings in standard memory.

But even in standard memory (allocated by malloc(), or strings in buffers on the caller's stack), there could also exist protection fences enforced by hardware.
The issue is even more severe with the prefetch instruction, which attempts to read 256 bytes ahead without knowing whether those 256 bytes lie in a segment valid for the current process; a SIGSEGV fault is quite likely for strings allocated on the heap, or on the caller's stack near the top of the stack.

For this reason, there's in fact NO safe way to parallelize the strlen() function when the ONLY stopping condition is the first location of the zero byte. You HAVE TO read bytes one after the other, or make sure that your program allocates all its strings in memory segments that are size_t-aligned (and there's no safe way to use the prefetch unless all your strings are allocated from buffers that have at least 256 bytes of valid memory after the end of the string).
The role of the prefetch is not to optimize the CPU L3 cache (a 32-byte-ahead prefetch would be enough in most cases on current processors) but to anticipate the page load when the segment of memory has been paged out and must be read from disk. In that case, however, the prefetch will require much more time than the time to process 256 bytes in only 32 iterations of the parallelized code, which handles 8 bytes per iteration: 32 iterations take just a few hundred nanoseconds, but a page-in from disk can take a dozen milliseconds or more (it may first need to page out some other memory), so you won't avoid the wait.

In my opinion the prefetch is absolutely unnecessary, as the gain is almost nonexistent (except for a *single* line within the CPU L1 data cache, where it will help accelerate loading that line from the L2 data cache, and possibly from a unified L3 cache or from a larger external memory cache on the system bus, if supported by the north bridge between the CPU and the physical RAM; but such external caches almost never exist today, except on costly systems built with north bridges hosting multiple CPUs sharing the same external RAM).
Note that hardware memory fences are now used in strict systems tuned for security, and in some debuggers trying to detect illegal accesses past the ends of buffers. This is now a reality. Some antivirus solutions (or system settings) may even implement these secure fences on all buffers allocated on the stack or in the process-local heap. But most memory allocation routines (malloc and so on) now allocate memory within aligned blocks, because alignment greatly helps the implementation of free() and of garbage collectors (like the "oops" used in Java VMs to extend the VM size beyond 3 GB while still using 32-bit pointers): it is considerably faster to allocate a few more bytes while preserving the alignment. Of course this does not help protect memory accesses between multiple threads of the same process, as some debuggers try to do for stress tests; but in general these extra fences only protect against illegal writes, not against read-only accesses. As strlen() is read-only, it is still safe to assume that read accesses within the same 4-byte or 8-byte block are possible.
If you want to use prefetch, it is only safe if the prefetches stay within the same memory-allocator block, and useful ONLY if the allocator's block size is larger than the L1 or L2 CPU cache line. For example, if the memory allocator only returns blocks that are 32-byte aligned in size, and the CPU L1 cache line is only 4 or 8 bytes, then you'll save a few nanoseconds by prefetching the memory at the end of the allocator block:
	__builtin_prefetch((char *)((size_t)s | (size_t)31), 0, 0);
Used this way, the prefetch will not always fetch anything new: if the current iteration is already processing the last 4-byte or 8-byte block within the current 32-byte-aligned block returned by the memory allocator, it prefetches a line already being read. Care should also be taken that it does not attempt to read past the top of the stack if the string was in fact allocated near the top of the caller's stack. To ensure that this top-of-stack access is never reached, the strlen() routine could allocate its own dummy local 32-byte buffer (and make sure that an optimizing compiler does not discard this unused local buffer).
Another note: the gcc documentation says that the prefetch will not fault if the given address is invalid, as long as the expression used to compute the address is valid. However, validity does not mean that such preparation (here for read access only, according to the second parameter = 0; the third parameter = 0 tunes the persistence or locality of the prefetch in the cache) will have no undesired side effects (for example, when processing strings stored in memory-mapped hardware I/O space). If the default implementation of strlen() in the gcc standard C library does not use prefetch, it is because no such assumption is made: the standard library is built to be usable in all cases, for all types of memory blocks.