After quite a bit of searching and testing, the simplest method I’ve found for a Unicode-compatible alternative to the PHP ord()
function is this:
$utf8Character = 'Ą';
list(, $ord) = unpack('N', mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8'));
echo $ord; # 260
I found this here. However, it has been mentioned that this method is rather slow. Does anyone know of a more efficient method which is nearly as simple? And what does UCS-4BE mean?
Answer
You might also be able to implement this function using iconv(), but the mb_convert_encoding method you’ve got looks reasonable to me. Just make sure that $utf8Character is a single character, not a long string, and it’ll perform reasonably well.
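For what it’s worth, here is a rough sketch of what the iconv() route could look like. The function name utf8_ord_iconv is made up for illustration, and it assumes your iconv build knows the 'UCS-4BE' encoding name (glibc and GNU libiconv do). I haven’t benchmarked it against the mb_convert_encoding version, so treat it as an alternative rather than a guaranteed speed-up.

```php
<?php
// Hypothetical helper (name is illustrative, not from the question).
// Returns the Unicode code point of a single UTF-8 character.
function utf8_ord_iconv(string $utf8Character): int
{
    // Convert to UCS-4BE: exactly 4 big-endian bytes per code point.
    $ucs4 = iconv('UTF-8', 'UCS-4BE', $utf8Character);

    // iconv() returns false on invalid input; anything other than
    // 4 bytes means the input was empty or more than one character.
    if ($ucs4 === false || strlen($ucs4) !== 4) {
        throw new InvalidArgumentException('Expected exactly one valid UTF-8 character.');
    }

    // 'N' reads an unsigned 32-bit big-endian integer;
    // unpack() returns an array indexed from 1.
    return unpack('N', $ucs4)[1];
}

echo utf8_ord_iconv("\u{0104}"), "\n"; // "\u{0104}" is 'Ą'; prints 260
```

The length check also enforces the single-character requirement. If you are on PHP 7.2 or later with mbstring loaded, the built-in mb_ord($utf8Character, 'UTF-8') does the same job without any conversion step.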
UCS-4BE is a Unicode encoding which stores each character as a 32-bit (4-byte) integer. That accounts for the “UCS-4”; the “BE” suffix indicates that the integers are stored in big-endian order, most significant byte first. The reason for using this encoding is that, unlike the variable-width encodings (UTF-8 or UTF-16), it needs no multi-byte sequences or surrogate pairs: every character has the same fixed size, so unpack('N', ...) can read the code point directly as one unsigned 32-bit big-endian integer.
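To make the fixed-size point concrete, here is a small sketch that dumps the bytes of three characters in each encoding. The helper name hex_in is made up for this example, and the "\u{...}" escapes assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper: hex dump of one UTF-8 character after
// converting it to the given target encoding.
function hex_in(string $utf8Character, string $encoding): string
{
    return bin2hex(mb_convert_encoding($utf8Character, $encoding, 'UTF-8'));
}

// 'A' (U+0041), 'Ą' (U+0104) and an emoji outside the BMP (U+1F600).
foreach (["A", "\u{0104}", "\u{1F600}"] as $ch) {
    printf(
        "UTF-8: %-8s  UTF-16BE: %-8s  UCS-4BE: %s\n",
        hex_in($ch, 'UTF-8'),
        hex_in($ch, 'UTF-16BE'),
        hex_in($ch, 'UCS-4BE')
    );
}
// UTF-8: 41        UTF-16BE: 0041      UCS-4BE: 00000041
// UTF-8: c484      UTF-16BE: 0104      UCS-4BE: 00000104
// UTF-8: f09f9880  UTF-16BE: d83dde00  UCS-4BE: 0001f600
```

UTF-8 takes 1, 2 and 4 bytes here, and UTF-16 needs a surrogate pair (d83d de00) for the emoji, while the UCS-4BE column is always 4 bytes holding the code point itself.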