For example, take an ISO-8859-1 encoded XML document that contains characters outside that encoding's character set, say the € (euro) symbol. This is possible in XML if the symbol is written as a numeric character reference, in this case the string &#8364; (euro):

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <foo>
      <bar>&#8364;</bar>
    </foo>
I need to obtain the bar element as a string in the same encoding as the document, i.e. ISO-8859-1. That also means preserving the character references for characters that are not part of this encoding. In other words, I want the ISO-8859-1 string <bar>&#8364;</bar>.
I couldn't achieve this with the saveXML method of the DOMDocument class, since it always dumps individual elements in UTF-8 (whereas whole documents are always dumped in the encoding given in their XML declaration):

    $DD = new DOMDocument;
    $DD->load('foo.xml');
    $dump = $DD->saveXML($DD->getElementsByTagName('bar')->item(0));
The $dump variable ended up holding the UTF-8 string <bar>€</bar>.
Notice that elements are also dumped with their character references converted to actual UTF-8 characters.
So, how would I get the ISO-8859-1 string <bar>&#8364;</bar>? Are XML parsers meant for this sort of task, or should I just use regular expressions or something else?
Answer
Yes, XML parsers decode character references. If you save only part of a document, the result is UTF-8, because a fragment has no XML declaration in which to specify the encoding, so the output falls back to UTF-8.
Here is a demo:
    $xml = <<<'XML'
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <foo>
      <bar>&#8364;</bar>
    </foo>
    XML;

    $source = new DOMDocument();
    $source->loadXML($xml);

    echo "Document Part:\n";
    echo $source->saveXML($source->getElementsByTagName('bar')->item(0));
    echo "\n\n";

    echo "Whole Document:\n";
    echo $source->saveXML();
    echo "\n\n";
Output:
    Document Part:
    <bar>€</bar>

    Whole Document:
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <foo>
      <bar>&#8364;</bar>
    </foo>
You could copy the node into a new document, so that it is saved as a whole document in that document's encoding. However, the output will then include the XML declaration with the encoding:
    $target = new DOMDocument('1.0', 'ASCII');
    $target->appendChild(
        $target->importNode($source->getElementsByTagName('bar')->item(0), true)
    );

    echo "Separated Node:\n";
    echo $target->saveXML();
Output:
    Separated Node:
    <?xml version="1.0" encoding="ASCII"?>
    <bar>&#8364;</bar>
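If the XML declaration is the only thing in the way, one possible workaround is to copy the node into a new document that uses the source document's encoding, save that document, and strip the declaration from the resulting string. The following is a minimal sketch of that idea, not a definitive solution: the variable names ($fragment and so on) and the regular expression are illustrative assumptions, and it relies on saveXML() emitting the declaration at the very start of its output.

```php
<?php
$xml = <<<'XML'
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
  <bar>&#8364;</bar>
</foo>
XML;

$source = new DOMDocument();
$source->loadXML($xml);

// Create the target document with the same encoding as the source
// (falling back to UTF-8 if the source declares none).
$target = new DOMDocument('1.0', $source->encoding ?? 'UTF-8');
$target->appendChild(
    $target->importNode($source->getElementsByTagName('bar')->item(0), true)
);

// Save the whole target document, then drop the leading XML declaration
// and the surrounding whitespace.
$dump = $target->saveXML();
$fragment = trim(preg_replace('/^<\?xml[^>]*\?>\s*/', '', $dump));

echo $fragment, "\n";
```

Since the string comes from a whole-document save, it is in the target document's encoding, and characters that encoding cannot represent (such as the euro sign in ISO-8859-1) stay as numeric character references.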