I have some content in string form that includes unwanted HTML tags along with their contents. I am looking for a way to remove them, but I have not found a solution that works well.
Method 1
Normally we use strip_tags to remove tags, but it preserves the text content inside them.
Method 2
Then I tried preg_replace to remove the tags along with their content, using a pattern like /<font[\s\S].*?<\/font>/.
In real content, however, a tag sometimes wraps another tag of the same name, for example:
text to keep <span xxx="xxxx"><span xxx="xxx">unwanted text</span> unwanted text </span> text to keep ...<some other tags>
Neither method gives the desired result.
Method 1 Output
unwanted text unwanted text text to keep
Method 2 Output
unwanted text </span> text to keep
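To make the failure concrete, here is a minimal sketch that runs both methods on the sample string. The variable names are only illustrative, and the pattern is a simplified variant of the one quoted above rather than the exact original.

```php
<?php
// Sample input from the question: one <span> nested inside another.
$content = 'text to keep <span xxx="xxxx"><span xxx="xxx">unwanted text</span> unwanted text </span> text to keep';

// Method 1: strip_tags() drops the markup but keeps every text node,
// so the unwanted text is still present.
$method1 = strip_tags($content);

// Method 2: the lazy quantifier stops at the FIRST closing tag, which
// belongs to the inner span, so the outer span's tail and its </span> survive.
$method2 = preg_replace('/<span[\s\S]*?<\/span>/', '', $content);

echo $method1, PHP_EOL;
echo $method2, PHP_EOL;
```

A regular expression of this kind has no notion of nesting depth, which is why the leftover closing tag appears as soon as the same tag is nested.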
What is the best way to achieve this?
I am looking for a solution like this (I know this pattern is not working):
$remove_arr = ['span', 'div', 'strong'];
foreach ($remove_arr as $remove) {
    // Does not handle nested tags of the same name
    $content = preg_replace("/<$remove.*?<\/$remove>/", '', $content);
}
Thanks in advance!
UPDATE
The code is in a file called Collect.php
<?php
namespace app\common\model;

use think\Db;
use think\Cache;
use app\common\util\Pinyin;
use think\Request;

class Collect extends Base
{
    // the code
}
Answer
strip_tags() has its limitations, and regular expressions are simply not the right tool for working with arbitrary HTML strings (see this question).
You need something that understands HTML, i.e. DOMDocument. You could recurse down the tree to find the nodes you want, but fortunately PHP has DOMXPath, which will do it for you.
This snippet loads a string into DOMDocument, searches for and removes all the <span> elements, and returns the remainder as a string:
$str = <<<HTML
<!doctype html>
<html>
<body>
text to keep <span xxx="xxxx"><span xxx="xxx">unwanted text</span> unwanted text </span> text to keep
</body>
</html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);

// Remove every <span> element together with its contents
foreach ($xpath->evaluate("//span") as $node) {
    echo 'Removing: ' . $node->nodeValue . "<br>";
    $node->parentNode->removeChild($node);
}

$output = $doc->saveHTML();
echo htmlspecialchars($output);
Output:
Removing: unwanted text unwanted text
Removing: unwanted text
<!DOCTYPE html>
<html>
<body>
text to keep text to keep
</body>
</html>
Demo: https://3v4l.org/6m0e2
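The question asks for a list of tags and works on an HTML fragment rather than a full document. The same idea could be adapted roughly as follows. This is only a sketch: removeElements() and its parameters are hypothetical names, not part of the snippet above, and it assumes the tag names are plain element names and the input is ASCII (non-ASCII input may additionally need an encoding hint for loadHTML()).

```php
<?php
/**
 * Hypothetical helper: removes every element whose tag name is listed in
 * $tags, together with its contents, from an HTML fragment.
 */
function removeElements(string $html, array $tags): string
{
    if (count($tags) === 0) {
        return $html;
    }

    // Silence libxml warnings about imperfect markup, then restore the setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Wrap the fragment in a <div> so bare text is not re-parented by the
    // parser and the result can be read back without <html>/<body>.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $doc->getElementsByTagName('body')->item(0)->firstChild;

    // Build ".//span | .//div | ..." so only descendants of the wrapper
    // match; the wrapper itself is never selected, even if 'div' is listed.
    $parts = [];
    foreach ($tags as $tag) {
        $parts[] = './/' . $tag;
    }
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query(implode(' | ', $parts), $wrapper) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize the wrapper's children, not the wrapper or the whole document.
    $result = '';
    foreach ($wrapper->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }
    return $result;
}

$content = 'text to keep <span xxx="xxxx"><span xxx="xxx">unwanted text</span> unwanted text </span> text to keep';
echo removeElements($content, ['span', 'div', 'strong']);
```

Because the XPath result is a static list, removing an outer element before an inner one is harmless: the inner node is simply detached from an already detached parent.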