PHP DOMDocument: preg_replace fails to skip excluded tags

Basically, I want to replace a keyword with a hyperlink wherever a matching keyword tag is detected in the content. The replacement must happen outside any existing caption/image/figure/figcaption/iframe/a element, because putting a hyperlink inside these breaks the formatting.

My PHP:

 $html_content= '税务调查。

<img src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="拜登与儿子。" width="100" height="100" class="size-full wp-image" /> 拜登与儿子。

他在声明中说：“我会非常认真地调查，往来。”

<img src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="拜登总统" width="100" height="100" class="aligncenter size-full wp-image" />

<div style="position:relative; overflow:hidden"> <iframe src="https://cdn.google.com/players/VM.html" width="100" height="100" frameborder="0" scrolling="auto" title="大促销 拜登的美国" style="position:absolute;"></iframe> </div>

<iframe style="border: none; overflow: hidden;" src="https://www.facebook.com/plugins/video.php?height=100&amp;href=https%3A%2F%2Fwww.facebook.com;width=100&amp;t=0" width="100" height="100" frameborder="0" allowfullscreen="allowfullscreen"></iframe>

<iframe src="https://www.facebook.com/plugins/video.php?height=400&href=https%3A%2F%2&show_text=false&width=100&t=0" width="100" height="100" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowfullscreen="true" allow="autoplay; clipboard-write; encrypted-media; picture-in-picture; web-share" allowFullScreen="true"></iframe>

<b>更多热点</b>

<p>halo拜登也指美国经济不会衰退</p>

<figure id="attachment_279" style="width: 100px" class="wp-caption alignnone"><img class="size-full wp-imag" src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="修理厂商总会拜登城" width="100" height="100" /><figcaption class="wp-caption-text">修理厂商总会拜登城</figcaption></figure>

<a href="http://google.com">go to google</a>

<span style="color: #ff6600;"><strong>另外，拜登声明中说</strong></span>';


function preg_replace_dom($regex, $replacement, DOMNode $dom, array $excludeParents = array()) {
  if (!empty($dom->childNodes)) {
    foreach ($dom->childNodes as $node) {
        //echo $node->parentNode->nodeName . "<Br>";
      if ($node instanceof DOMText && !in_array($node->parentNode->nodeName, $excludeParents)) {
        $node->nodeValue = preg_replace($regex, $replacement, $node->nodeValue);
      } 
      else{
        preg_replace_dom($regex, $replacement, $node, $excludeParents);
      }
    }
  }
}


$dom = new DOMDocument;
$internalErrors = libxml_use_internal_errors(true);
$dom->loadHTML( mb_convert_encoding($html_content, 'HTML-ENTITIES', "UTF-8"), LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED );


$tags = array("拜登","认真");
foreach($tags as $tag){
    $tagurl= '<span class="article-tag"><a class="mytag" href="http://outside.com" >'.$tag.'</a></span>';
    preg_replace_dom('/'.$tag.'/i', $tagurl, $dom->documentElement, array('a','image','iframe','figure','figcaption','caption'));

    $test_tag = '['.$tag.']';
    //preg_replace_dom('/'.$tag.'/i', $test_tag, $dom->documentElement, array('a','image','iframe','figure','figcaption','caption'));
}     


function getLink($tag){
    $arr = array(
        "拜登"=>"http://bai.com",
        "认真"=>"http://ren.com",      
        );
    return $arr[$tag];    
}

 $output = mb_substr($dom->saveHTML(), 0, null, "UTF-8");
//echo $output;
echo html_entity_decode($output);

Now I am facing these issues:

1. I want to avoid inserting the hyperlink tag into …, but the regex fails to exclude it. Currently it displays like this…

2. The DOMDocument loadHTML method adds extra paragraph tags at seemingly random places. I could post-process the output by removing ALL paragraph tags, but then the final content no longer matches the original: some input content already contains paragraph tags, so that step would strip the existing p tags too.

3. (solved) I want the preg_replace result to show as a clickable hyperlink in the browser, but echo $output shows the raw hyperlink markup as plain text, so it cannot be clicked.

Update on the solved issue: values assigned to $node->nodeValue are escaped, which turns the markup into plain text. I added echo html_entity_decode($output); to unescape it, and it now displays correctly.
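The escaping behavior described above can be reproduced in isolation. The following is a minimal sketch, using hypothetical sample markup and variable names that are not part of the original post. It shows that markup assigned to nodeValue is stored as literal text, while nodes built with createElement are serialized as real elements. Note that decoding the whole output with html_entity_decode also unescapes any entities in the content that were meant to stay escaped.

```php
<?php
// Hypothetical sample markup, not taken from the question.
$dom = new DOMDocument();
$dom->loadHTML('<p>hello world</p>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$text = $dom->documentElement->firstChild; // the DOMText node "hello world"

// 1) Markup assigned to nodeValue is treated as text and escaped on output.
$text->nodeValue = 'hello <b>world</b>';
$escaped = $dom->saveHTML();

// 2) Building real nodes instead yields an actual <b> element.
$text->nodeValue = 'hello ';
$bold = $dom->createElement('b');
$bold->appendChild($dom->createTextNode('world'));
$dom->documentElement->appendChild($bold);
$real = $dom->saveHTML();

echo $escaped, $real;
```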

Desired output

 $output= '税务调查。

<img src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="拜登与儿子。" width="100" height="100" class="size-full wp-image" /> 拜登与儿子。

他在声明中说：“我会非常<span class="article-tag"><a class="mytag" href="http://outside.com" >认真</a></span>地调查，往来。”

<img src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="拜登总统" width="100" height="100" class="aligncenter size-full wp-image" />

<div style="position:relative; overflow:hidden"> <iframe src="https://cdn.google.com/players/VM.html" width="100" height="100" frameborder="0" scrolling="auto" title="大促销 拜登的美国" style="position:absolute;"></iframe> </div>

<iframe style="border: none; overflow: hidden;" src="https://www.facebook.com/plugins/video.php?height=100&amp;href=https%3A%2F%2Fwww.facebook.com;width=100&amp;t=0" width="100" height="100" frameborder="0" allowfullscreen="allowfullscreen"></iframe>

<iframe src="https://www.facebook.com/plugins/video.php?height=400&href=https%3A%2F%2&show_text=false&width=100&t=0" width="100" height="100" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowfullscreen="true" allow="autoplay; clipboard-write; encrypted-media; picture-in-picture; web-share" allowFullScreen="true"></iframe>

<b>更多热点</b>

<p>halo<span class="article-tag"><a class="mytag" href="http://outside.com" >拜登</a></span>也指美国经济不会衰退</p>

<figure id="attachment_279" style="width: 100px" class="wp-caption alignnone"><img class="size-full wp-imag" src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="修理厂商总会拜登城" width="100" height="100" /><figcaption class="wp-caption-text">修理厂商总会拜登城</figcaption></figure>

<a href="http://google.com">go to google</a>

<span style="color: #ff6600;"><strong>另外，<span class="article-tag"><a class="mytag" href="http://outside.com" >拜登</a></span>声明中说</strong></span>';

Answer

I tried very, VERY hard to implement a DOMDocument+Xpath solution, but I came unstuck while trying to disqualify the text node within the square-tagged caption block. I couldn’t manage to isolate the whole caption block to be able to exclude it. In the end, here is a caveman’s regex approach to serve as a band-aid until someone smarter can solve this problem properly.

The regex matches the blacklisted tags in the text and discards them; it only replaces text that is not disqualified.
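As a small illustration of that skip-and-fail idea, here is a sketch on a hypothetical one-line input (the sample string and the bracket replacement are made up for the example). The left alternative consumes a whole a element and is then rejected with (*SKIP)(*FAIL), so only keywords outside it reach the replacement.

```php
<?php
// Hypothetical sample input, not taken from the question.
$sample = 'foo <a href="#">foo</a> foo';

// Left alternative: a complete <a>...</a> element, matched and then discarded.
// Right alternative: the keyword, the only thing that actually gets replaced.
$result = preg_replace('~<a[ >].+?</a>(*SKIP)(*FAIL)|foo~us', '[$0]', $sample);

echo $result; // [foo] <a href="#">foo</a> [foo]
```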

Code: (Demo)

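// Keywords to turn into links.
$tags = ["拜登", "认真"];
// Build one alternative per blacklisted element, e.g. <a ...>...</a> or <img ... />.
$blacklisted = implode(
    '|',
    array_map(
        fn($tag) => "<{$tag}[ >].+?" . ($tag === 'img' ? "/>" : "</$tag>"),
        ['a', 'img', 'iframe', 'figure', 'figcaption']
    )
);
// Anything matched before (*SKIP)(*FAIL) is consumed and discarded;
// only the keywords in the last alternative are wrapped in a link.
// $html: the input markup (presumably $html_content from the question).
echo preg_replace(
    sprintf('~(?:].+?|%s)(*SKIP)(*FAIL)|%s~us', $blacklisted, implode('|', $tags)),
    '<span class="article-tag"><a class="mytag" href="http://outside.com">$0</a></span>',
    $html
);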
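For comparison, here is a rough DOM-based sketch of the same idea. It is not the approach used above, and it does not address the square-tagged caption block. The function name linkKeywordsInDom, the sample markup, and the wrapper div are all assumptions made for illustration. It selects text nodes that have no excluded element among their ancestors, then splits each one into real text and element nodes, so nothing is escaped. Wrapping the fragment in a single div before loading gives the parser one root element, a common workaround when LIBXML_HTML_NOIMPLIED is used. Encoding handling is left out and would stay as in the question.

```php
<?php
// Hypothetical helper: wraps keywords in <span><a>...</a></span> inside $dom,
// skipping text that sits anywhere below one of the excluded elements.
function linkKeywordsInDom(DOMDocument $dom, array $keywords, array $excludedTags, string $href): void
{
    $xpath = new DOMXPath($dom);
    // Assumes $excludedTags contains plain element names only.
    $conditions = array_map(fn($t) => "ancestor::{$t}", $excludedTags);
    $query = '//text()[not(' . implode(' or ', $conditions) . ')]';

    $quoted = array_map(fn($k) => preg_quote($k, '~'), $keywords);
    $pattern = '~(' . implode('|', $quoted) . ')~u';

    // Snapshot the node list before the tree is modified.
    foreach (iterator_to_array($xpath->query($query)) as $textNode) {
        // With one capture group, odd indexes hold the matched keywords.
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no keyword in this text node
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 0) {
                if ($part !== '') {
                    $fragment->appendChild($dom->createTextNode($part));
                }
                continue;
            }
            $a = $dom->createElement('a');
            $a->setAttribute('class', 'mytag');
            $a->setAttribute('href', $href);
            $a->appendChild($dom->createTextNode($part));
            $span = $dom->createElement('span');
            $span->setAttribute('class', 'article-tag');
            $span->appendChild($a);
            $fragment->appendChild($span);
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }
}

// Usage on hypothetical ASCII markup.
$html = '<p>foo bar</p><a href="#"><strong>foo</strong></a>'
      . '<figure><figcaption>foo</figcaption></figure>';

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true); // silence warnings about HTML5 tags
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors($previous);

linkKeywordsInDom($dom, ['foo'], ['a', 'iframe', 'figure', 'figcaption'], 'http://outside.com');

// Serialize only the children of the wrapper <div>.
$result = '';
foreach ($dom->documentElement->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result;
```

Unlike the recursive function in the question, which checks only the direct parent of each text node, the XPath ancestor test also skips text nested more deeply, such as the strong element inside the link in this sample.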


Answer