
Fastest reimplementation of PHP’s htmlspecialchars function in C

I need the default behaviour (i.e. ENT_COMPAT | ENT_HTML401) of the PHP function htmlspecialchars() in C. What is the best (meaning fastest) way to do this in C?

I don’t need the input string afterwards, so an in-place solution is possible.

It is a really simple function, it just converts these characters:

'&' -> '&amp;'
'"' -> '&quot;'
'<' -> '&lt;'
'>' -> '&gt;'

What strategy would be fastest? Looping over each character and building the output buffer byte by byte, overwriting the input string in place, or some other solution?

Answer

This code assumes that input and output are buffers and that input contains the NUL-terminated input string. It also assumes that output is an array large enough to hold the result; if it is not, the result is truncated:

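#include <string.h>  /* memcpy */

/* input:  NUL-terminated source string.
 * output: destination declared as a real array of at least 7 bytes
 *         (e.g. char output[4096]); sizeof(output) below gives the wrong
 *         size if output is only a pointer. */
size_t i = 0;  /* read position in input */
size_t j = 0;  /* write position in output */

while (input[i])
{
    if (input[i] == '<')
    {
        memcpy(&output[j], "&lt;", 4);
        j += 4;
    } else if (input[i] == '>')
    {
        memcpy(&output[j], "&gt;", 4);
        j += 4;
    } else if (input[i] == '"')
    {
        memcpy(&output[j], "&quot;", 6);
        j += 6;
    } else if (input[i] == '&')
    {
        memcpy(&output[j], "&amp;", 5);
        j += 5;
    } else
    {
        /* Every other character is copied unchanged. */
        output[j++] = input[i];
    }
    /* Stop once fewer than 7 bytes are free, so the longest entity
       (6 bytes) plus the terminating NUL always fits. */
    if (j > sizeof(output)-7)
    {
        break;
    }
    i++;
}
output[j] = 0;  /* terminate the result */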
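
If output is only available as a pointer, sizeof(output) no longer gives the buffer size, so the capacity has to be passed in explicitly. The following is a minimal sketch of the same loop wrapped in a function; the name html_escape, its signature and its return value are assumptions made for illustration and are not part of the original answer. Unlike the loop above, it checks the space needed by each character before writing it, so it truncates exactly where the next character or entity would no longer fit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (name and signature are assumptions).
 * Escapes `in` into `out`, which has room for `cap` bytes including the
 * terminating NUL. Returns the number of bytes written, not counting the
 * NUL. Output is truncated before the first character that does not fit. */
size_t html_escape(const char *in, char *out, size_t cap)
{
    assert(in != NULL && out != NULL && cap > 0);

    size_t j = 0;
    for (size_t i = 0; in[i] != '\0'; i++) {
        const char *rep;
        size_t len;

        switch (in[i]) {
        case '<': rep = "&lt;";   len = 4; break;
        case '>': rep = "&gt;";   len = 4; break;
        case '"': rep = "&quot;"; len = 6; break;
        case '&': rep = "&amp;";  len = 5; break;
        default:  rep = &in[i];   len = 1; break;  /* copy as-is */
        }

        /* j < cap always holds here; stop if len bytes plus the NUL
           would not fit into the remaining cap - j bytes. */
        if (len >= cap - j)
            break;

        memcpy(&out[j], rep, len);
        j += len;
    }
    out[j] = '\0';
    return j;
}
```

A caller would use it as html_escape(src, dst, sizeof dst) when dst is an array, or pass the allocated size when dst comes from malloc. Since the longest entity is 6 bytes, a destination of 6 * strlen(src) + 1 bytes can never be truncated.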

In C, ugly code is often the fastest.

An in-place solution would only pay off if the characters to be replaced were very rare. Every replacement makes the string longer, so the rest of the string has to be shifted (very expensive) each time one of these characters is found. With normal HTML data, where these four characters appear often, an in-place solution would be much slower.
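
To make the cost of shifting concrete, here is a sketch of a naive in-place variant. The function name html_escape_in_place and its return convention are assumptions for illustration only; it is not code from the answer. Each replacement calls memmove on the whole remaining tail, so a string with many special characters costs roughly quadratic time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical naive in-place variant (illustration only).
 * `buf` holds a NUL-terminated string inside a buffer of `cap` bytes.
 * Returns 0 on success, or -1 as soon as the next replacement would not
 * fit; earlier replacements stay applied in that case. */
int html_escape_in_place(char *buf, size_t cap)
{
    assert(buf != NULL);

    size_t len = strlen(buf);
    assert(len < cap);

    size_t i = 0;
    while (i < len) {
        const char *rep = NULL;
        switch (buf[i]) {
        case '<': rep = "&lt;";   break;
        case '>': rep = "&gt;";   break;
        case '"': rep = "&quot;"; break;
        case '&': rep = "&amp;";  break;
        }
        if (rep == NULL) {
            i++;
            continue;
        }

        size_t rlen = strlen(rep);
        size_t grow = rlen - 1;          /* one byte becomes rlen bytes */
        if (len + grow + 1 > cap)
            return -1;

        /* Shift everything after buf[i], including the NUL, to the right.
           This is the expensive step: it touches the whole tail. */
        memmove(&buf[i + rlen], &buf[i + 1], len - i);
        memcpy(&buf[i], rep, rlen);

        len += grow;
        i += rlen;                       /* skip the entity just written */
    }
    return 0;
}
```

Skipping past the inserted entity matters: without it, the '&' that starts each entity would be escaped again on the next iteration.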

User contributions licensed under: CC BY-SA