I have a .txt file. Using the following code I read it:
while (!feof($handle)) { yield trim(utf8_encode(fgets($handle))); }
Now, from the retrieved string, I want to remove not only the HTML tags but also the content inside them. I found many solutions that remove the tags, but none that remove both the tags and their content.
Sample string – Hey my name is <b>John</b>. I am a <i>coder</i>!
Required output string – Hey my name is . I am a !
How can I achieve this?
Answer
One way to achieve this is by using DOMDocument and DOMXPath. My solution assumes that the provided HTML string has no container node or that the container node contents are not meant to be stripped (as this would result in a completely empty string).
$string = 'Hey my name is <b>John</b>. I am a <i>coder</i>!';

// create a DOMDocument (an XML/HTML parser)
$dom = new DOMDocument('1.0', 'UTF-8');

// load the HTML string without adding a <!DOCTYPE ...> and <html><body> tags
// and with error/warning reports turned off
// if loading fails, there's something seriously wrong with the HTML
if ($dom->loadHTML($string, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR | LIBXML_NOWARNING)) {
    // create a DOMXPath instance for the loaded document
    $xpath = new DOMXPath($dom);

    // remember the root node; DOMDocument automatically adds a <p> container if one is not present
    $rootNode = $dom->documentElement;

    // fetch all descendant elements (children, grandchildren, etc.) of the root node;
    // the leading dot makes the query relative to $rootNode instead of the whole document
    $childNodes = $xpath->query('.//*', $rootNode);

    // with each of these descendants...
    foreach ($childNodes as $childNode) {
        // ...remove them from their parent node
        $childNode->parentNode->removeChild($childNode);
    }

    // echo the sanitized HTML
    echo $rootNode->nodeValue . "\n";
}
If you do want to strip a potential container node as well, it's going to be a bit harder, because it's difficult to differentiate between an original container node and a container node that's automatically added by DOMDocument.
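One possible way around the container ambiguity, sketched here as an assumption rather than as part of the original solution, is to wrap the input in a container you control before parsing. The root element is then always your own wrapper, so any container in the original string is stripped like every other element. The function name stripElementsWithContent and the <div> wrapper are hypothetical choices; the sketch also assumes the input contains no stray </div> and leaves character-encoding handling aside.

```php
<?php
/**
 * Remove every element (tag plus content) from an HTML fragment and
 * return only the text that sits outside any tag.
 * Hypothetical helper; the name and the <div> wrapper are illustrative.
 */
function stripElementsWithContent(string $html): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');

    // Wrap the input in a container we control, so the root element is
    // always ours and never one the parser had to add on its own.
    $wrapped = '<div>' . $html . '</div>';

    $flags = LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR | LIBXML_NOWARNING;
    if (!$dom->loadHTML($wrapped, $flags)) {
        throw new RuntimeException('Could not parse the HTML fragment.');
    }

    $root  = $dom->documentElement;
    $xpath = new DOMXPath($dom);

    // Removing the direct child elements also removes all their descendants.
    foreach ($xpath->query('./*', $root) as $element) {
        $root->removeChild($element);
    }

    // What is left are the text nodes that were outside every element.
    return $root->textContent;
}

echo stripElementsWithContent('Hey my name is <b>John</b>. I am a <i>coder</i>!'), "\n";
// prints "Hey my name is . I am a !"
```

With this variant, an input such as '<p>Hello <b>x</b></p>' comes back as an empty string, because the original <p> container is treated like any other element.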
Also, if an unintended unclosed tag is found, it can lead to unexpected results: everything up to the next closing tag will be stripped, because DOMDocument automatically adds a closing tag for unclosed tags.
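To see the unclosed-tag caveat in action, here is a small comparison sketch. The helper textOutsideElements is hypothetical (it wraps the fragment in a <div> of its own, which is an assumption on my part and not part of the solution above), and the second sample string is the question's example with the </b> deliberately left out.

```php
<?php
// Hypothetical helper: wraps the fragment in a <div> we control and
// removes every element child of that wrapper, keeping only loose text.
function textOutsideElements(string $html): string
{
    $dom   = new DOMDocument('1.0', 'UTF-8');
    $flags = LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR | LIBXML_NOWARNING;
    if (!$dom->loadHTML('<div>' . $html . '</div>', $flags)) {
        return '';
    }

    $root = $dom->documentElement;

    // Copy the live child list first, because it changes as nodes are removed.
    foreach (iterator_to_array($root->childNodes) as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $root->removeChild($child);
        }
    }

    return $root->textContent;
}

$closed   = 'Hey my name is <b>John</b>. I am a <i>coder</i>!';
$unclosed = 'Hey my name is <b>John. I am a <i>coder</i>!'; // </b> is missing

var_dump(textOutsideElements($closed));
var_dump(textOutsideElements($unclosed));
```

In the second call the parser has to close the <b> itself, so the text that follows it ends up inside the element and is removed along with it. If that is a concern, it may be worth validating or repairing the markup before stripping.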