
php/simplexml – LIBXML options ignored?

I’m trying to use the LIBXML_* constants for the 2nd parameter of the SimpleXMLElement constructor, but they don’t change anything at all.

$xml = '<root><empty_tag/><foo></foo></root>';
$simpleXml = new SimpleXMLElement($xml, LIBXML_NOENT|LIBXML_NOXMLDECL|LIBXML_NOEMPTYTAG);

$simpleXml->foo = 'Ņ';

echo $simpleXml->asXML();

Expected:

<root><empty_tag></empty_tag><foo>Ņ</foo></root>

Actual:

<?xml version="1.0"?>
<root><empty_tag/><foo>&#x145;</foo></root>

As you can see, not a single one of those flags does anything: the character is still escaped as a numeric reference (even though XML should only escape "'&>< according to https://www.w3.org/TR/xml/#syntax), the XML declaration is still there, and the empty tag is still self-closing. Is there a way to achieve the desired result using SimpleXML? Or at the very least make it escape only the 5 special characters? addChild() is not an option here, I’m assigning existing nodes.


Answer

These constants might be a bit cryptic in their naming. So what actually is supported?

LIBXML_NOENT

This controls whether entities are kept in the document as entity references or are expanded. It has to be specified when loading the document:

<?php

$xml = '<!DOCTYPE test [<!ENTITY c "TEST">]>
<test>&c;</test>';

echo (new SimpleXMLElement($xml))->asXML(), "\n";
echo (new SimpleXMLElement($xml, LIBXML_NOENT))->asXML(), "\n";

The first echo outputs:

<?xml version="1.0"?>
<!DOCTYPE test [
<!ENTITY c "TEST">
]>
<test>&c;</test>

The entity is preserved. And for the second echo, with LIBXML_NOENT:

<?xml version="1.0"?>
<!DOCTYPE test [
<!ENTITY c "TEST">
]>
<test>TEST</test>

The XML is borrowed from a related Q&A: What does LIBXML_NOENT do (and why isn’t it called LIBXML_ENT)?

By the way, this is not related to the non-US-ASCII character you’ve got in your document. If you need that character written literally in the output, set the document encoding, to UTF-8 for example:

$xml = '<root><empty_tag/><foo></foo></root>';
$simpleXml = new SimpleXMLElement($xml);

dom_import_simplexml($simpleXml)->ownerDocument->encoding = 'UTF-8';

$simpleXml->foo = 'Ņ';

echo $simpleXml->asXML();

The trick here is to set the encoding on the underlying DOMDocument; this is the only way I know of for a SimpleXMLElement (and DOMDocument). Here is the output:

<?xml version="1.0" encoding="UTF-8"?>
<root><empty_tag/><foo>Ņ</foo></root>

You can see there is no more &#x145; character reference, just Ņ in Unicode (UTF-8 encoded). The XML declaration now also shows the encoding.

From your question I assume this is what you were looking for with LIBXML_NOENT.

LIBXML_NOXMLDECL

The second one in the list. I never got it to work. It’s buggy and/or has some specific version requirements, but honestly I don’t even know if or where it is meant to be applied.

You can either strip the first line (always "\n" terminated), which contains the XML declaration, from the output.

Or you can again turn to the underlying DOMDocument and output only the document element, so that it is not the complete document and hence has no XML declaration:

$dom = dom_import_simplexml($simpleXml)->ownerDocument;
echo $dom->saveXML($dom->documentElement);

Output:

<root><empty_tag/><foo>Ņ</foo></root>

This is basically what is suggested in: remove xml version tag when a xml is created in php.
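The first option above, stripping the declaration line from the asXML() output, could be sketched as follows. This is only a minimal sketch: the regular expression and the variable names are my own assumptions, not something taken from the linked answer, and the sample document uses plain ASCII content to keep the example independent of encoding.

```php
<?php
// Sample document, same shape as the one in the question.
$simpleXml = new SimpleXMLElement('<root><empty_tag/><foo>bar</foo></root>');

// asXML() without a filename returns the whole document as a string,
// starting with the XML declaration on its own line.
$xml = $simpleXml->asXML();

// Assumption: drop a leading "<?xml ... ?>" declaration plus the
// whitespace (the newline) that follows it. Anything else is left alone.
$withoutDecl = preg_replace('/^<\?xml[^>]*\?>\s*/', '', $xml);

echo $withoutDecl;
```

Anchoring the pattern at the start of the string means a document that has no declaration passes through unchanged.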

LIBXML_NOEMPTYTAG

The third and last one in the list. I could quote the PHP manual here, but that has already been done elsewhere on this site. The real question is how to get this behaviour with a SimpleXMLElement, given that the constant has no effect there.

One way would be to provide the option via DOMDocument again:

$dom = dom_import_simplexml($simpleXml)->ownerDocument;
echo $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

Output:

<root><empty_tag></empty_tag><foo>Ņ</foo></root>

Or, to do this in “pure” SimpleXML, insert an empty text node into every empty element:

$xml = '<?xml version="1.0" encoding="UTF-8"?><root><empty_tag/><foo></foo></root>';
$simpleXml = new SimpleXMLElement($xml);
$simpleXml->foo = 'Ņ';

foreach ($simpleXml->xpath('//*[not(*) and string() = ""]') as $empty) {
    $empty[0] = '';
}

echo $simpleXml->asXML();

That is, the foreach obtains all empty elements via the xpath query and sets the text content of each one to an empty string, which inserts an (empty) text node if there isn’t one yet. Output:

<?xml version="1.0" encoding="UTF-8"?>
<root><empty_tag></empty_tag><foo>Ņ</foo></root>
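The DOMDocument-based options shown above can also be combined in one small helper. This is a sketch under assumptions: the function name simplexmlToString and its $expandEmptyTags parameter are hypothetical names of mine, and it only wraps the dom_import_simplexml() and saveXML() calls already used in this answer.

```php
<?php
// Hypothetical helper (name and parameter are illustrative only):
// serialise a SimpleXMLElement without the XML declaration and,
// optionally, with empty elements written as <tag></tag>.
function simplexmlToString(SimpleXMLElement $element, bool $expandEmptyTags = true): string
{
    // SimpleXML and DOM share the same underlying document.
    $dom = dom_import_simplexml($element)->ownerDocument;

    // Same step as in the encoding example earlier in this answer.
    $dom->encoding = 'UTF-8';

    // Saving only the document element leaves out the XML declaration.
    $options = $expandEmptyTags ? LIBXML_NOEMPTYTAG : 0;

    return $dom->saveXML($dom->documentElement, $options);
}

$simpleXml = new SimpleXMLElement('<root><empty_tag/><foo></foo></root>');
$simpleXml->foo = 'Ņ';

$out = simplexmlToString($simpleXml);
echo $out, "\n";
```

Because the helper goes through the shared DOM document, it works for elements that were assigned as existing nodes, which is the constraint stated in the question.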

I hope this gives you the options you were looking for.

User contributions licensed under: CC BY-SA