Counting Characters in UTF-8 Strings Is Fast(er)
by George Pollard
‘Counting Characters in UTF-8 Strings Is Fast’ by Kragen Sitaker shows several ways to count the characters in a UTF-8 string, using both assembly and C. But, with a few assumptions, we can go faster.
Assumption One: We are dealing with a valid UTF-8 string
Making this assumption means that once we hit the start of a multi-byte character we can skip forward a few places. It also means we don’t check for invalid characters: run on non-valid input, the algorithm can be sent into an infinite loop, and it is possible to make it run past the end of the buffer by supplying malformed data.
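The skip distance can be read straight off the lead byte’s high bits. As a minimal sketch (my own illustration, not part of the code below), the mapping looks like this:

```c
#include <assert.h>

// Length of a UTF-8 sequence, assuming `lead` really is a valid lead byte:
// 0xxxxxxx -> 1 byte, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)          return 1; // ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // two-byte starter
    if ((lead & 0xF0) == 0xE0) return 3; // three-byte starter
    return 4;                            // four-byte starter
}
```

With valid input guaranteed, the counter never needs to look at the continuation bytes at all; it just jumps from one lead byte to the next.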
Assumption Two: Most strings are ASCII
Therefore, run a simple ASCII count routine beforehand, and switch into counting UTF-8 as soon as we hit a non-ASCII character.
The code
Note: The current code relies on chars being signed bytes.
int porges_strlen2(char *s)
{
    int i = 0;
    // Go fast if string is only ASCII.
    // Loop while not at end of string,
    // and not reading anything with highest bit set.
    // If highest bit is set, number is negative.
    while (s[i] > 0)
        i++;

    if (s[i] <= -65) // all follower bytes have values below -65
        return -1;   // invalid
    // Note, however, that the following code does *not*
    // check for invalid characters.
    // The above is just included to bail out on the tests :)

    int count = i;
    while (s[i])
    {
        // if ASCII just go to next character
        if (s[i] > 0)
            i += 1;
        else
            // select amongst multi-byte starters
            switch (0xF0 & s[i])
            {
                case 0xE0: i += 3; break;
                case 0xF0: i += 4; break;
                default:   i += 2; break;
            }
        ++count;
    }
    return count;
}
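If your chars aren’t signed, the same idea still works by casting to unsigned char and testing the high bits directly. Here is a sketch of that variant (my own rewrite for illustration, with the same no-validity-checking caveat as above):

```c
#include <stddef.h>

// Signedness-independent variant of the same counting idea:
// cast to unsigned char and branch on the lead byte's range.
// Assumes valid UTF-8, just like the signed version.
static size_t utf8_count_unsigned(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0, count = 0;
    while (p[i])
    {
        if (p[i] < 0x80)      i += 1; // ASCII
        else if (p[i] < 0xE0) i += 2; // 110xxxxx starter
        else if (p[i] < 0xF0) i += 3; // 1110xxxx starter
        else                  i += 4; // 11110xxx starter
        ++count;
    }
    return count;
}
```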
Results
I used Kragen’s testing code, but removed all the strlens that didn’t do UTF-8 counting, and added one test for valid UTF-8 text (just the phrase ‘こんにちは’ repeated). The result is twice as fast on both the ASCII-only and UTF-8 tests. The improvement on ASCII is due to the ASCII-only routine, and the improvement on UTF-8 is due to skipping bytes.
"": 0 0 0 0 0
"hello, world": 12 12 12 12 12
"naïve": 5 5 5 5 5
"こんにちは": 5 5 5 5 5
1: all 'a':
1: porges_strlen2(string) = 33554431: 0.034672
1: ap_strlen_utf8_s(string) = 33554431: 0.068210
1: my_strlen_utf8_c(string) = 33554431: 0.071038
1: my_strlen_utf8_s(string) = 33554431: 0.135856
2: all '\xe3':
2: porges_strlen2(string) = 11184811: 0.032115
2: ap_strlen_utf8_s(string) = 33554431: 0.068228
2: my_strlen_utf8_c(string) = 33554431: 0.071050
2: my_strlen_utf8_s(string) = 33554431: 0.152513
3: all '\x81':
3: porges_strlen2(string) = -1: 0.000001
3: my_strlen_utf8_s(string) = 0: 0.068339
3: ap_strlen_utf8_s(string) = 0: 0.068547
3: my_strlen_utf8_c(string) = 0: 0.071039
4: all konichiwa:
4: porges_strlen2(string) = 11184810: 0.032143
4: ap_strlen_utf8_s(string) = 11184810: 0.068271
4: my_strlen_utf8_c(string) = 11184810: 0.071036
4: my_strlen_utf8_s(string) = 11184810: 0.089478
Note also that the invalid UTF-8 gives strange results; this is because the algorithm isn’t meant to work on it! (The first invalid sequence is a list of 3-byte starters, so the result is divided by 3 due to skipping, and the second is a list of follower bytes, so the code bails out.)
Going faster
By dropping back to the ASCII counter whenever we hit ASCII again, we go even faster. This will handle the cases (such as in English) where there are many ASCII characters and only a few multibyte ones.
int porges_strlen2(char *s)
{
    int i = 0;
    int iBefore = 0;
    int count = 0;

    while (s[i] > 0)
ascii:
        i++;

    count += i - iBefore;

    while (s[i])
    {
        if (s[i] > 0)
        {
            iBefore = i;
            goto ascii;
        }
        else
            switch (0xF0 & s[i])
            {
                case 0xE0: i += 3; break;
                case 0xF0: i += 4; break;
                default:   i += 2; break;
            }
        ++count;
    }
    return count;
}
But on the ‘konichiwa’ test the speed improvement happens even though we’re counting pure multibyte, and I’m not sure exactly why… probably something to do with branch prediction or another arcane CPU topic I don’t understand.
4: all konichiwa:
4: porges_strlen2(string) = 11184810: 0.026017
4: ap_strlen_utf8_s(string) = 11184810: 0.068320
4: my_strlen_utf8_c(string) = 11184810: 0.071035
4: my_strlen_utf8_s(string) = 11184810: 0.089464
5: mixed english:
5: porges_strlen2(string) = 32435949: 0.040342
5: my_strlen_utf8_c(string) = 32435949: 0.071035
5: ap_strlen_utf8_s(string) = 32435949: 0.078233
5: my_strlen_utf8_s(string) = 32435949: 0.160676
Without the drop-back-to-ASCII modification:
5: mixed english:
5: porges_strlen2(string) = 32435949: 0.067753
BTW, his name is Kragen, not Ragen.
Whoops
I think the URL must have tripped me up; I’m so used to Bob Smith being /~bsmith/…
while(*s) cnt += tbl[*s++ >> 4]; return cnt;
Setting up tbl is left as an exercise for the reader. If your chars are signed you also need an AND mask.
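For concreteness, here is one way the exercise could come out (my sketch, not the commenter’s code): index the table by the high nibble, count 1 for every byte that starts a character, and 0 for continuation bytes, whose high nibble is always 8 through B.

```c
#include <stddef.h>

// Nibble-indexed table: 1 for bytes that start a character,
// 0 for continuation bytes (0x80-0xBF, high nibble 8-11).
static const int tbl[16] = {1,1,1,1,1,1,1,1, 0,0,0,0, 1,1,1,1};

static size_t count_utf8_tbl(const char *s)
{
    size_t cnt = 0;
    // Cast to unsigned char so the shift is well-defined even when
    // `char` is signed (this is the AND-mask caveat in effect).
    while (*s)
        cnt += tbl[((unsigned char)*s++) >> 4];
    return cnt;
}
```

This touches every byte rather than skipping, which matches the timing observation above: it avoids branches but can’t beat a skipping counter on multibyte-heavy input.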
Hi Savvu, I implemented this as:
It is consistently the slowest or second-slowest.
I tried implementing it with byte-skipping:
This version is only faster on the byte-skipping tests, and is still about half the speed of what I posted.
I’ve done even better.
Vectorization yields a 2-4x speedup over your code: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
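The core idea behind that linked post can be sketched word-at-a-time in portable C (a simplified illustration of the technique, not Colin Percival’s actual code, and it assumes the length is known up front): count total bytes, then subtract the continuation bytes, i.e. those matching 10xxxxxx, eight at a time.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

// Characters = total bytes minus continuation bytes (10xxxxxx).
// Processes 8 bytes per iteration; assumes valid UTF-8.
static size_t utf8_count_swar(const char *s, size_t n)
{
    size_t count = n, i = 0;
    for (; i + 8 <= n; i += 8)
    {
        uint64_t w;
        memcpy(&w, s + i, 8);
        // A byte is a continuation byte iff bit 7 is set and bit 6 is clear.
        uint64_t hi = w & 0x8080808080808080ULL;         // bit 7 of each byte
        uint64_t b6 = (w << 1) & 0x8080808080808080ULL;  // bit 6, moved to bit 7
        uint64_t cont = hi & ~b6;                        // marker per continuation byte
        while (cont) { count--; cont &= cont - 1; }      // popcount of markers
    }
    for (; i < n; i++) // leftover tail, byte by byte
        if (((unsigned char)s[i] & 0xC0) == 0x80)
            count--;
    return count;
}
```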