I have HTML markup as below:
<body>
  <div>......</div>
  ............
  <div class="entry-content">
    <div class="code1 code2">(ads.....);</div>
    <p><img src="https://www..."></img></p>
    <h2> title </h2>
    <div class="code1-block code2">(ads.....);</div>
    <div class="data1 dta-ta1">
      <ul>
        <li><p> text</p></li>
        <li><span> text2 </span></li>
        <li><span> text3 </span></li>
        <div class="codex1 code-block"><span>(ads ....); </span></div>
        <li><span> text4 </span></li>
        <div class="codex1 code-block"><span>(ads ....); </span></div>
      </ul>
    </div>
    <div class="codex2-block code2">(ads.....);</div>
    <div class="data2-entry dta-ta2">
      <p> <span> text5</span> </p>
      <p> text6 </p>
      <p> text7 </p
      <div class="codex1 code-block"><span>(ads ....); </span></div>
      <li><span> text8 </span></li>
      <div class="codex1 code-block"><span>(ads ....); </span></div>
    </div>
  </div>
</body>
I'm trying to go into the div with class="entry-content" and get all the text from its child nodes, excluding child nodes whose class is "code1", "code2", "codex1", or "codex2".
My code below just goes to the div and gets all the text from its child nodes. However, I cannot remove the text that comes from the child nodes with code1 and code2. I would appreciate your help. Thanks.
$classname = 'entry-content';
$a = new DOMXPath($dom);
// Match any element whose class attribute contains the token $classname.
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";
$list = $a->query($query);
$bodyContent = '';
if ($list->length > 0) {
    foreach ($list as $element) {
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            // Collapse line breaks into a single space.
            $bodytext = trim(preg_replace('/[\r\n]+/', ' ', $node->nodeValue));
            $bodyContent .= '<p>' . $bodytext . '</p>';
        }
    }
}
My expected output:
title
text2
text3
text4
text5
text6
text7
text8
Answer
Your input document is not well-formed: a > is missing for </p, and one div is not closed properly. With the input document fixed, a working path expression is:
XPath expression
//div[@class='entry-content']//text()[not(ancestor::div/@class[contains(., 'code')])][normalize-space()]
It selects all text nodes, but only if they do not have an ancestor div element that has a class attribute whose value contains "code". In addition, the selected text nodes cannot be whitespace-only.
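For illustration, here is a minimal PHP sketch of how this expression could be evaluated with DOMXPath. The markup is a shortened, repaired stand-in for the document in the question, the image URL is a hypothetical placeholder, and the variable names ($texts, $bodyContent) are my own choices rather than anything required by the API.

```php
<?php
// Shortened, well-formed stand-in for the question's markup (an assumption).
$html = <<<'HTML'
<body>
<div class="entry-content">
  <div class="code1 code2">(ads);</div>
  <p><img src="https://example.com/a.png"></p>
  <h2> title </h2>
  <div class="data1 dta-ta1">
    <ul>
      <li><p> text</p></li>
      <li><span> text2 </span></li>
      <div class="codex1 code-block"><span>(ads); </span></div>
    </ul>
  </div>
  <div class="codex2-block code2">(ads);</div>
  <div class="data2-entry dta-ta2">
    <p> text6 </p>
  </div>
</div>
</body>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Text nodes under entry-content that have no div ancestor whose class
// contains "code", and that are not whitespace-only.
$query = "//div[@class='entry-content']//text()"
       . "[not(ancestor::div/@class[contains(., 'code')])]"
       . "[normalize-space()]";

$texts = [];
foreach ($xpath->query($query) as $textNode) {
    $texts[] = trim($textNode->nodeValue);
}

// Wrap each result in <p>, as the question's code does.
$bodyContent = '';
foreach ($texts as $text) {
    $bodyContent .= '<p>' . $text . '</p>';
}
echo $bodyContent, PHP_EOL;
```

Note that contains(., 'code') is a substring test, so it would also exclude a div whose class merely includes "code" somewhere in a longer name.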
Output
Individual results are separated by ------:
title
-----------------------
text
-----------------------
text2
-----------------------
text3
-----------------------
text4
-----------------------
text5
-----------------------
text6
-----------------------
text7
-----------------------
text8
Update
I tried your answer. It works; however, I still need the source from the img tag. How can I get it?
It's possible to also select the src attribute of an img element, but this would make the XPath expression even more complicated. You should just add another line of PHP to evaluate a separate path expression, such as:
//div[@class='entry-content']/p/img/@src
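As a rough sketch of that "separate path expression" approach, the attribute can be read with a second DOMXPath query. The markup and the image URL here are hypothetical placeholders, and $sources is just an illustrative variable name.

```php
<?php
// Minimal stand-in markup; the URL is a placeholder, not from the question.
$html = '<div class="entry-content">'
      . '<p><img src="https://example.com/a.png"></p>'
      . '<h2>title</h2>'
      . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Each result is an attribute node; nodeValue holds the attribute's value.
$sources = [];
foreach ($xpath->query("//div[@class='entry-content']/p/img/@src") as $attr) {
    $sources[] = $attr->nodeValue;
}
print_r($sources);
```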
Update 2
While I absolutely do not recommend using this expression (because it obfuscates your code), here is how to combine both expressions into a single one with a union operator:
//div[@class='entry-content']//text()[not(ancestor::div/@class[contains(., 'code')])][normalize-space()] | //div[@class='entry-content']//p/img/@src
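If you do use the union, the result mixes text nodes and attribute nodes, so the PHP side has to tell them apart. A minimal sketch, assuming a shortened well-formed document and a placeholder image URL; the $texts and $images arrays are illustrative names only.

```php
<?php
// Shortened, well-formed stand-in markup (an assumption for this sketch).
$html = <<<'HTML'
<div class="entry-content">
  <div class="code1 code2">(ads);</div>
  <p><img src="https://example.com/a.png"></p>
  <h2> title </h2>
  <div class="data2-entry dta-ta2"><p> text6 </p></div>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$query = "//div[@class='entry-content']//text()"
       . "[not(ancestor::div/@class[contains(., 'code')])]"
       . "[normalize-space()]"
       . " | //div[@class='entry-content']//p/img/@src";

$texts = [];
$images = [];
foreach ($xpath->query($query) as $node) {
    if ($node instanceof DOMAttr) {
        $images[] = $node->value;            // came from the @src branch
    } else {
        $texts[] = trim($node->nodeValue);   // came from the text() branch
    }
}
print_r($texts);
print_r($images);
```

Sorting the results into two arrays by node type avoids relying on the order in which the union returns them.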