
How to dump an XML document’s element as a string that has the same encoding as the document?

For example, take an ISO-8859-1 encoded XML document that contains characters outside the character set of that encoding, say the € (euro) symbol. XML allows this if the symbol is written as a numeric character reference, in this case the string &#8364;:

<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
    <bar>&#8364;</bar>
</foo>

I need to obtain the bar element as a string with the same encoding as the document, that is, encoded in ISO-8859-1 (which also means preserving the character references for characters that are not part of this encoding), i.e. the ISO-8859-1 string <bar>&#8364;</bar>.

I couldn’t achieve this with the saveXML method of the DOMDocument class, since it always dumps individual elements in UTF-8 (whereas whole documents are always dumped in the encoding given in their XML declaration):

$DD = new DOMDocument;
$DD -> load('foo.xml');
$dump = $DD -> saveXML($DD -> getElementsByTagName('bar') -> item(0));

The $dump variable ends up holding the UTF-8 string <bar>€</bar>.

Notice that elements are also dumped with their character references translated into actual UTF-8 characters.

So, how would I get the ISO-8859-1 string <bar>&#8364;</bar>? Are XML parsers meant for this sort of task, or should I just use regular expressions or something else?


Answer

Yes, XML parsers will decode entities, and if you save only part of a document the result will be UTF-8: a fragment has no XML declaration in which to specify the encoding, so the output falls back to UTF-8.

Here is a demo:

$xml = <<<'XML'
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
    <bar>&#8364;</bar>
</foo>
XML;

$source = new DOMDocument();
$source->loadXML($xml);

echo "Document Part:\n";
echo $source->saveXML($source->getElementsByTagName('bar')->item(0));
echo "\n\n";

echo "Whole Document:\n";
echo $source->saveXML();
echo "\n\n";

Output:

Document Part:
<bar>€</bar>

Whole Document:
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
    <bar>&#8364;</bar>
</foo>
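
Since the dumped fragment comes back as UTF-8, one option is to re-encode it yourself after saveXML. The following is only a sketch, assuming the mbstring extension is available; the function name toLatin1WithReferences is made up for illustration. It first turns every character above U+00FF (that is, outside ISO-8859-1) into a decimal character reference and then converts what remains to ISO-8859-1:

```php
<?php
// Hypothetical helper: re-encode a UTF-8 XML fragment as ISO-8859-1.
function toLatin1WithReferences(string $utf8Fragment): string
{
    // Map entry: [first code point, last code point, offset, mask].
    // Everything above U+00FF cannot be represented in ISO-8859-1.
    $map = [0x100, 0x10FFFF, 0, 0x1FFFFF];

    // Characters in the mapped range become decimal references;
    // characters up to U+00FF are left alone.
    $withReferences = mb_encode_numericentity($utf8Fragment, $map, 'UTF-8');

    // What remains is representable in ISO-8859-1.
    return mb_convert_encoding($withReferences, 'ISO-8859-1', 'UTF-8');
}

$source = new DOMDocument();
$source->loadXML('<?xml version="1.0" encoding="ISO-8859-1"?><foo><bar>&#8364;</bar></foo>');

// saveXML() with a node argument returns UTF-8, as shown in the demo.
$utf8 = $source->saveXML($source->getElementsByTagName('bar')->item(0));
$latin1 = toLatin1WithReferences($utf8);
echo $latin1, "\n";
```

Note that this treats the fragment as plain text. It would produce invalid XML if an element name, attribute name, comment or CDATA section contained characters outside ISO-8859-1, because character references are not allowed there.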

You could copy the node into a new document. However, the output will then include the XML declaration with the encoding:

$target = new DOMDocument('1.0', 'ASCII');
$target->appendChild($target->importNode($source->getElementsByTagName('bar')->item(0), true));

echo "Separated Node:\n";
echo $target->saveXML();

Output:

Separated Node:
<?xml version="1.0" encoding="ASCII"?>
<bar>&#8364;</bar>
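
If only the element is wanted, the declaration can be stripped afterwards, and the target document can take its encoding from the source document instead of a hard-coded value. This is a minimal sketch, not a definitive solution: the helper name dumpNodeInDocumentEncoding and the regular expression are my own and not part of the DOM API, and it assumes the node belongs to a document (it is not the DOMDocument itself).

```php
<?php
// Hypothetical helper: dump $node as a string in its document's encoding.
// Assumes $node is owned by a DOMDocument (not the document node itself).
function dumpNodeInDocumentEncoding(DOMNode $node): string
{
    // DOMDocument::$encoding holds the encoding from the XML declaration;
    // fall back to UTF-8 when the document declares none.
    $encoding = $node->ownerDocument->encoding ?? 'UTF-8';

    // Copy the node into a fresh document that uses the same encoding.
    $target = new DOMDocument('1.0', $encoding);
    $target->appendChild($target->importNode($node, true));

    // saveXML() without arguments honours the document encoding ...
    $xml = $target->saveXML();

    // ... but prepends an XML declaration, which is removed here.
    return rtrim(preg_replace('/^<\?xml[^>]*\?>\s*/', '', $xml));
}

$source = new DOMDocument();
$source->loadXML('<?xml version="1.0" encoding="ISO-8859-1"?><foo><bar>&#8364;</bar></foo>');

$dump = dumpNodeInDocumentEncoding($source->getElementsByTagName('bar')->item(0));
echo $dump, "\n";
```

Because the serializer itself does the encoding here, characters that ISO-8859-1 cannot represent are written as character references only where XML allows them.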