I am trying to pass a UTF-8 string as a command line parameter from PHP to a Java program. When I view the string in the PHP debugger, it shows correctly: Présentation
Yet when I look at the args[0] data in the Java debugger (and at the returned value passed back to the PHP program) I see: Pr??sentation
I have tried the Java code below, and neither ISO_8859_1 nor UTF_8 returns the proper result.
I’ve looked here on stackoverflow (Translate UTF-8 character encoding function from PHP to Java) as well as on other sites and still cannot make sense of what I am doing wrong.
Everything seems to work fine in PHP, yet Java appears to mangle the data right from the start, so perhaps it needs additional processing before or after I call the code below.
This is my first go at dealing with international characters. Any help is greatly appreciated. Thank you!
Edit: I am debugging remotely from Windows, but the PHP and Java are both run on an Ubuntu system. Since the PHP code and the Java code it calls reside on the Linux-based system, there should not be any issue with the Windows command line, Java, and UTF-8. I had read here on stackoverflow that this was an issue for some in the recent past.
byte[] test_str_1 = args[0].getBytes(StandardCharsets.ISO_8859_1);
System.out.println(test_str_1);
byte[] test_str_2 = args[0].getBytes(StandardCharsets.UTF_8);
System.out.println(test_str_2);
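One detail that may make the snippet above harder to debug: passing a byte[] straight to System.out.println prints the array's type and hash code rather than its contents. A minimal sketch of printing the actual byte values is shown here; the class name PrintBytes and the helper signedBytes are hypothetical and only for illustration, and the fallback string is used only when no argument is supplied.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PrintBytes {
    // Hypothetical helper: encode the string with the given charset and
    // return the byte values as readable text (Java bytes are signed).
    static String signedBytes(String s, Charset cs) {
        return Arrays.toString(s.getBytes(cs));
    }

    public static void main(String[] args) {
        // Fall back to a sample string when no argument is given.
        String s = args.length > 0 ? args[0] : "Pr\u00e9sentation";

        // If args[0] arrived intact, the "é" is one byte in ISO-8859-1
        // and two bytes in UTF-8.
        System.out.println(signedBytes(s, StandardCharsets.ISO_8859_1));
        System.out.println(signedBytes(s, StandardCharsets.UTF_8));
    }
}
```

If the argument was already damaged before main() ran, both lines will show question-mark bytes (63) where the "é" should be, which suggests the problem lies upstream of getBytes().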
Answer
The problem has been solved using the solution provided here:
Everyone’s help got me on the right track. It was indeed a locale issue, but not at the OS level. Instead it was with PHP’s locale.
Another user had a similar issue, and it was fixed by adding the following code to the PHP script before executing the command line that calls the Java program:
$locale = 'en_US.utf-8';
setlocale(LC_ALL, $locale);
putenv('LC_ALL='.$locale);
So now, in the Java code, when I view the args[0] parameter, it is displayed correctly, and the processed text is stored in a file, sent back to the PHP script, and received properly. It took a bit of looking up byte values and their corresponding UTF-8 encodings before I could see that PHP was translating a string that was correct just before exec() into a different string during the exec() call. During this call, the UTF-8 bytes 0xc3 0xa9 for “é” (Unicode U+00E9) were converted into 0x3f 0x3f (two ASCII question mark characters).
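For anyone who wants to reproduce the byte values described above, here is a small Java sketch. It is not the code from the original program: the class name ArgBytes and the helpers toHex and throughAscii are hypothetical. It assumes, as one plausible explanation, that the two UTF-8 bytes get interpreted with an ASCII-only charset somewhere along the way, which is what a non-UTF-8 locale can cause.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ArgBytes {
    // Hypothetical helper: render bytes as space-separated lowercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hypothetical helper: decode UTF-8 bytes as if they were US-ASCII,
    // then encode the result back to US-ASCII. Bytes above 0x7f cannot
    // be decoded, and the resulting replacement characters cannot be
    // encoded, so each one ends up as a question mark.
    static byte[] throughAscii(byte[] utf8Bytes) {
        String decoded = new String(utf8Bytes, StandardCharsets.US_ASCII);
        return decoded.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The charset Java picked up from its environment; on newer JDKs
        // this may not be the charset used to decode command line arguments.
        System.out.println("default charset: " + Charset.defaultCharset());

        byte[] eAcute = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 bytes:   " + toHex(eAcute));
        System.out.println("through ASCII: " + toHex(throughAscii(eAcute)));

        // Dump whatever actually arrived on the command line.
        if (args.length > 0) {
            System.out.println("args[0] bytes: "
                    + toHex(args[0].getBytes(StandardCharsets.UTF_8)));
        }
    }
}
```

Running it once with and once without the locale set in the calling environment should show whether args[0] still contains c3 a9 or has already been reduced to 3f 3f by the time main() starts.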
During my searching here on stackoverflow I saw a warning not to use literals (e.g. “Présentation”), and once I traced the data back to the caller it became evident that the issue involved the actual call to exec().
Hopefully someone else who is new to Unicode processing can benefit from this information.
Thanks for everyone’s input which pointed me in the right direction.