Let’s say (for simplicity’s sake) that I have a multibyte, UTF-8 encoded string variable with 3 letters (consisting of 4 bytes):
$original = 'Fön';
Since it’s UTF-8, the bytes’ hex values are (excluding the BOM):
46 C3 B6 6E
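A quick way to confirm which bytes a string really holds is to dump it with bin2hex(). The sketch below is only an illustration. It assumes the string is built from explicit byte escapes, so the result does not depend on the encoding the script file was saved in. The variable name $hex is introduced here for the example.

```php
<?php
// Build "Fön" from explicit UTF-8 byte escapes so the source file's
// own encoding cannot change what ends up in the string.
$original = "F\xC3\xB6n";

// strlen() counts bytes, not characters.
echo strlen($original), "\n";

// bin2hex() returns lowercase hex digits; uppercase them to match the listing above.
$hex = strtoupper(bin2hex($original));
echo $hex, "\n";
```

If the same check on a string literal typed directly into the script reports fewer bytes, the file itself is probably not saved as UTF-8.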
As the $original variable is user-defined, I will need to handle two things:
- Get the exact number of bytes (not UTF-8 characters) used in the string, and
- Access each individual byte (not UTF-8 character).
I would tend to use strlen() to handle the first point, and access the $original variable’s bytes with a simple $original[$byteposition], like this:
<?php
header('Content-Type: text/html; charset=UTF-8');
$original = 'Fön';
$totalbytes = strlen($original);
for ($byteposition = 0; $byteposition < $totalbytes; $byteposition++) {
    $currentbyte = $original[$byteposition];
    /* Doesn't work since var_dump shows 3 bytes. */
    var_dump($currentbyte);
    /* Fails too since "ord" only works on ASCII chars. It returns "46 F6 6E" */
    printf("%02X", ord($currentbyte));
    echo('<br>');
}
exit();
?>
This proves my initial idea is not working:
- var_dump shows 3 bytes
- printf fails too since “ord” only works on ASCII chars
How can I get the single bytes from a multibyte PHP string variable in a binary-safe way?
What I am looking for is a binary-safe way to convert UTF-8 string(s) into byte-array(s).
Answer
You can get a byte array by unpacking the utf8_encoded string $a:
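$a = utf8_encode('Fön'); // ISO-8859-1 to UTF-8; skip if already UTF-8 (deprecated as of PHP 8.2)
$b = unpack('C*', $a);   // byte values, indexed from 1
var_dump($b);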
The format C* means “unsigned char”, so every byte of the string is returned as an integer from 0 to 255.
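As a minimal sketch of the same idea without utf8_encode(): the example below assumes the string is already valid UTF-8 (it is built from explicit byte escapes to make that certain) and shows both unpack() and a str_split()/ord() alternative. The variable names $bytes and $sameBytes are made up for this illustration.

```php
<?php
// Assumption: the input is already UTF-8. Explicit escapes guarantee that
// here, whatever encoding the script file is saved in.
$original = "F\xC3\xB6n";

// unpack('C*', ...) yields one integer per byte, with keys starting at 1.
// array_values() re-indexes from 0 so positions match $original[$i].
$bytes = array_values(unpack('C*', $original));

foreach ($bytes as $position => $byte) {
    printf("%d: %02X\n", $position, $byte);
}

// Alternative: str_split() cuts the string into 1-byte pieces and ord()
// returns the value of each piece. ord() handles any byte from 0 to 255.
$sameBytes = array_map('ord', str_split($original));
```

Both approaches are byte-oriented. If ord() prints F6 for the “ö”, as in the question, the literal most likely reached PHP as a single ISO-8859-1 byte, which is the case utf8_encode() covers.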