PHP – how to count the number of leading spaces in a multi-byte / UTF-8 string correctly

I have UTF-8 strings such as those below:

            21st century 

      Other languages 

         General collections 

         Ancient languages 

         Medieval languages 

            Several authors (Two or more languages) 

As you can see, the strings contain alphanumeric characters as well as leading and trailing spaces.

I’d like to use PHP to retrieve the number of leading spaces (not trailing spaces) in each string. Note that the spaces might not be plain ASCII spaces. I tried using:

var_dump(mb_ord($space_char, "UTF-8"));

where $space_char contains a sample space character I copied from one of the above strings, and I got 160 (the no-break space, U+00A0) rather than 32.

I have tried:

strspn($string,$cmask); // $cmask contains a string with two space characters with 160 and 32 as their Unicode code points.

but I get a very unpredictable value.

The values should be:

(1) 12
(2) 6
(3) 9
(4) 9
(5) 9
(6) 12

What am I doing wrong?

Answer

I would go the regular expression route:

<?php
function count_leading_spaces($str) {
    // \p{Zs} matches any Unicode space separator: a character that is
    // invisible but takes up space, such as U+0020 or U+00A0
    if (mb_ereg('^\p{Zs}+', $str, $regs) === false)
        return 0;
    return mb_strlen($regs[0]);
}

$samples = [
'            21st century ',
'      Other languages ',
'         General collections ',
'         Ancient languages ',
'         Medieval languages ',
'            Several authors (Two or more languages) ',
];

foreach ($samples as $i => $sample) {
    printf("(%d) %d\n", $i + 1, count_leading_spaces($sample));
}

Output:

(1) 12
(2) 6
(3) 9
(4) 9
(5) 9
(6) 12
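
As for why strspn() misbehaves: it compares bytes, not characters, and a no-break space (U+00A0) is two bytes in UTF-8, so each one is likely being counted twice. The sketch below illustrates that and shows the same count done with preg_match() and the /u modifier instead of mb_ereg(). The helper name count_leading_spaces_pcre and the sample string are made up for illustration, and the sketch assumes PHP 7 or later with the mbstring extension loaded.

```php
<?php
// Hypothetical helper (not part of the answer above): counts leading
// Unicode space separators using PCRE in UTF-8 mode.
function count_leading_spaces_pcre(string $str): int {
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \p{Zs} matches U+0020, U+00A0 and the other space separators.
    if (preg_match('/^\p{Zs}+/u', $str, $m) !== 1) {
        return 0;
    }
    // Count characters, not bytes, in the matched run of spaces.
    return mb_strlen($m[0], 'UTF-8');
}

$nbsp   = "\u{00A0}";  // encoded as two bytes in UTF-8: 0xC2 0xA0
$sample = str_repeat($nbsp, 6) . 'Other languages ';

// strspn() works byte by byte, so every no-break space counts twice.
var_dump(strspn($sample, $nbsp . ' '));        // int(12)
var_dump(count_leading_spaces_pcre($sample));  // int(6)
```

If the leading whitespace can also include tabs or line breaks, \p{Zs} will not match them; widening the character class would be a separate decision.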
User contributions licensed under: CC BY-SA