
How to remove specific html tags with contents in php?

I have some content in string format that includes unwanted HTML tags and their contents. I am looking for a way to remove them, but I have not found a solution that works.

Method 1

Normally, we use strip_tags to remove tags, but it preserves the text content inside them.
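
A minimal sketch of that behaviour. The sample string and the variable names are made up for illustration and are not taken from the real content:

```php
<?php
// Hypothetical sample input.
$html = 'keep <font color="red">drop me</font> keep';

// strip_tags() removes the <font> markup but leaves its inner text behind.
$result = strip_tags($html);

echo $result, PHP_EOL; // prints: keep drop me keep
```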

Method 2

Then I tried to use preg_replace to remove the tags along with their content, using a pattern like /<font[\s\S].*?<\/font>/


Test

But in the real content, tags are sometimes nested inside tags of the same name, like:

text to keep <span xxx="xxxx"><span xxx="xxx">unwanted text</span> unwanted text </span> text to keep ...<some other tags>

Neither method gives the desired result.

Method 1 Output

unwanted text unwanted text text to keep

Method 2 Output

unwanted text </span> text to keep
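
The Method 2 result can be reproduced with a short sketch like the one below. It assumes the same lazy pattern applied to span instead of font; the variable names are only illustrative:

```php
<?php
// The nested example from the question, with its placeholder attributes.
$content = 'text to keep <span xxx="xxxx"><span xxx="xxx">unwanted text</span>'
         . ' unwanted text </span> text to keep';

// The lazy quantifier stops at the FIRST </span>, which closes the inner span,
// so the tail of the outer span (and its closing tag) is left in the string.
$result = preg_replace('/<span[\s\S]*?<\/span>/', '', $content);

echo $result, PHP_EOL;
```

The regex has no notion of nesting depth, so it cannot tell which closing tag belongs to which opening tag.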

What is the best way to achieve this?

I am looking for a solution like this (I know this pattern is not working):

$remove_arr = [
   'span',
   'div',
   'strong'
];
foreach ($remove_arr as $remove) {
  $content = preg_replace("/<$remove.*?<\/$remove>/", '', $content);
}

Thank you guys in advance!

UPDATE

The code is in a file called Collect.php

<?php
namespace app\common\model;
use think\Db;
use think\Cache;
use app\common\util\Pinyin;
use think\Request;

class Collect extends Base {
   //the code
}


Answer

strip_tags() has its limitations, and regular expressions are just not the right tool for working with arbitrary HTML strings (see this question).

You need to work with something that understands HTML, i.e. DOMDocument. You could recurse down the tree to find the nodes you want, but fortunately PHP has DOMXPath, which will do it for you.
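
For comparison, a hand-written recursive walk might look roughly like this. The function name removeTagRecursive and the sample markup are hypothetical and only illustrate the idea:

```php
<?php
// Hypothetical helper: walk the tree by hand and drop every element named $tag.
function removeTagRecursive(DOMNode $node, string $tag): void
{
    // Copy the live childNodes list first; removing while iterating it skips nodes.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child instanceof DOMElement && strtolower($child->nodeName) === $tag) {
            $node->removeChild($child);        // drops the element and its whole subtree
        } elseif ($child->hasChildNodes()) {
            removeTagRecursive($child, $tag);  // keep looking further down
        }
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>keep <span>drop <span>drop</span></span> keep</p></body></html>');
removeTagRecursive($doc, 'span');
$html = $doc->saveHTML();
echo $html;
```

It works, but it is more code than an XPath query.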

This snippet will load a string into DOMDocument, search for and remove all the <span> elements, and return the remainder as a string:

$str = <<<HTML
<!doctype html>
<html>
<body>
text to keep <span xxx="xxxx"><span xxx="xxx">unwanted text</span> unwanted text </span> text to keep
</body>
</html>
HTML;
$doc = new DOMDocument();
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);

foreach($xpath->evaluate("//span") as $node) {
    echo 'Removing: '.$node->nodeValue."<br>";
    $node->parentNode->removeChild($node);
}
$output = $doc->saveHTML();
echo htmlspecialchars($output);

Output:

Removing: unwanted text unwanted text
Removing: unwanted text
<!DOCTYPE html> <html> <body> text to keep text to keep </body> </html> 

Demo: https://3v4l.org/6m0e2
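
To cover the list of tags from the question (span, div, strong) and an HTML fragment rather than a full document, the same idea could be wrapped in a helper along these lines. This is only a sketch: the function name removeTagsWithContent, the wrapper id fragment-root, and the sample input are assumptions made for illustration.

```php
<?php
// Hypothetical helper: remove every listed tag, together with its contents,
// from an HTML fragment and return the remaining fragment.
function removeTagsWithContent(string $html, array $tags): string
{
    // Accept simple tag names only, so nothing unexpected ends up in the XPath query.
    $tags = array_filter($tags, function ($t) {
        return is_string($t) && preg_match('/^[a-z][a-z0-9]*$/i', $t) === 1;
    });
    if (!$tags) {
        return $html;
    }

    $previous = libxml_use_internal_errors(true); // collect parser warnings silently
    $doc = new DOMDocument();
    // Wrap the fragment in a known container so it can be read back afterwards.
    $doc->loadHTML('<html><body><div id="fragment-root">' . $html . '</div></body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="fragment-root"]')->item(0);

    // ".//span | .//div | .//strong" matches descendants of the wrapper only,
    // so the wrapper itself survives even when "div" is on the list.
    $query = './/' . implode(' | .//', $tags);
    foreach ($xpath->query($query, $root) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize the wrapper's children only, leaving out html/body/wrapper markup.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$content = 'text to keep <span xxx="xxxx"><span xxx="xxx">unwanted text</span> unwanted text </span>'
         . ' text to keep <div>gone</div><strong>gone</strong><em>stays</em>';
$clean = removeTagsWithContent($content, ['span', 'div', 'strong']);
echo htmlspecialchars($clean), PHP_EOL;
```

One caveat: without a charset hint, loadHTML() does not treat the input as UTF-8, so content with non-ASCII characters would need extra handling that this sketch leaves out.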

User contributions licensed under: CC BY-SA