
Complex Xpath get all values excluding some specific class attributes

I have HTML markup as below:

<body>
    <div>......</div>
    ............
    <div class="entry-content">
        <div class="code1 code2">(ads.....);</div>
        <p><img src="https://www..."></img></p>
        <h2> title </h2>
        <div class="code1-block code2">(ads.....);</div>
        <div class="data1 dta-ta1">
              <ul><li><p> text</p></li>
                  <li><span> text2 </span></li>
                  <li><span> text3 </span></li>
                  <div class="codex1 code-block"><span>(ads ....); </span></div>
                  <li><span> text4 </span></li>
                  <div class="codex1 code-block"><span>(ads ....); </span></div>
              </ul>
        </div> 
        <div class="codex2-block code2">(ads.....);</div>
        <div class="data2-entry dta-ta2">
              <p>
                <span> text5</span>
              </p>
              <p> text6 </p>
              <p> text7 </p
              <div class="codex1 code-block"><span>(ads ....); </span></div>
              <li><span> text8 </span></li>
              <div class="codex1 code-block"><span>(ads ....); </span></div>
        </div>
  </div>
</body>

I've tried to go into the div with class="entry-content" and get all text from its child nodes, excluding child nodes whose class contains "code1", "code2", "codex1", or "codex2".

My code below just goes to the div and gets all text from its child nodes. However, I cannot remove the text from the child nodes with code1 and code2. I would appreciate your help. Thanks.

 $classname = 'entry-content';
 $a = new DOMXPath($dom);
 $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";

 $list = $a->query($query);
 $bodyContent = '';

 if ($list->length > 0) {
    foreach ($list as $element) {
        $nodes = $element->childNodes;

          // Iterate over the child nodes, not over the element itself
          foreach ($nodes as $node) {
             // Collapse line breaks into single spaces
             $bodytext = trim(preg_replace('/[\r\n]+/', ' ', $node->nodeValue));
             $bodyContent .= '<p>' . $bodytext . '</p>';
          }
    }
 }

My expected output:

https://www

title

text2

text3

text4

text5

text6

text7

text8

Answer

Your input document is not well-formed: a > is missing for </p, and one div is not closed properly. With the input document fixed, a working path expression is:

XPath expression

//div[@class='entry-content']//text()[not(ancestor::div/@class[contains(., 'code')])][normalize-space()]

It selects all text nodes, but only if they do not have an ancestor div element that has a class attribute whose value contains “code”, and also, the text nodes selected cannot be whitespace-only.
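
As a minimal sketch, the expression could be evaluated from PHP roughly as follows. The markup in $html is a trimmed, well-formed sample modeled on the question (an assumption, not the original page), and the variable names are illustrative only.

```php
<?php
// Trimmed, well-formed sample modeled on the question's markup (assumption).
$html = <<<'HTML'
<html><body>
<div class="entry-content">
  <div class="code1 code2">(ads);</div>
  <h2> title </h2>
  <div class="data1 dta-ta1">
    <ul>
      <li><span> text2 </span></li>
      <li><div class="codex1 code-block"><span>(ads);</span></div></li>
    </ul>
  </div>
  <div class="data2-entry dta-ta2">
    <p> text6 </p>
  </div>
</div>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$query = "//div[@class='entry-content']//text()"
       . "[not(ancestor::div/@class[contains(., 'code')])]"
       . "[normalize-space()]";

$texts = [];
$bodyContent = '';
foreach ($xpath->query($query) as $textNode) {
    $text = trim($textNode->nodeValue);
    $texts[] = $text;
    // Wrap each surviving text node in a paragraph, as in the question
    $bodyContent .= '<p>' . htmlspecialchars($text) . '</p>';
}

echo $bodyContent, PHP_EOL;
// For this sample: <p>title</p><p>text2</p><p>text6</p>
```

Note that contains(., 'code') is a substring test, so it would also exclude a div whose class merely happens to include the letters "code".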

Output

Individual results are separated by ------:

 title 
-----------------------
 text
-----------------------
 text2 
-----------------------
 text3 
-----------------------
 text4 
-----------------------
 text5
-----------------------
 text6 
-----------------------
 text7 
-----------------------
 text8 

Update

I tried your answer. It works; however, I still need the source from the img tag. How can I get it?

It's possible to also select the src attribute of an img element, but this would make the XPath expression even more complicated. You should just add another line of PHP to evaluate a separate path expression, such as:

//div[@class='entry-content']/p/img/@src
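
The extra PHP step might look like the sketch below, which reads the img element's src attribute as it appears in the markup. The sample HTML and the example.com URL are placeholders (assumptions), since the real address is elided in the question.

```php
<?php
// Small sample; the image URL is a hypothetical placeholder.
$html = <<<'HTML'
<html><body>
<div class="entry-content">
  <p><img src="https://www.example.com/a.png"></p>
  <h2> title </h2>
</div>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$sources = [];
foreach ($xpath->query("//div[@class='entry-content']/p/img/@src") as $attr) {
    // Each result is a DOMAttr; its text is available as ->value
    $sources[] = $attr->value;
}

print_r($sources);
```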

Update 2

While I absolutely do not recommend using this expression (because it obfuscates your code), here is how to combine both expressions into a single one with a union operator:

//div[@class='entry-content']//text()[not(ancestor::div/@class[contains(., 'code')])][normalize-space()] | //div[@class='entry-content']//p/img/@src
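
If the combined expression is used anyway, the result list mixes text nodes and attribute nodes, so the PHP side has to tell them apart. One possible way, sketched here on a hypothetical sample (the markup and URL are assumptions), is an instanceof check:

```php
<?php
// Hypothetical sample markup; the image URL is a placeholder.
$html = <<<'HTML'
<html><body>
<div class="entry-content">
  <div class="code1 code2">(ads);</div>
  <p><img src="https://www.example.com/a.png"></p>
  <h2> title </h2>
</div>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$query = "//div[@class='entry-content']//text()"
       . "[not(ancestor::div/@class[contains(., 'code')])]"
       . "[normalize-space()]"
       . " | //div[@class='entry-content']//p/img/@src";

$texts = [];
$sources = [];
foreach ($xpath->query($query) as $node) {
    if ($node instanceof DOMAttr) {
        $sources[] = $node->value;          // the img src attribute
    } else {
        $texts[] = trim($node->nodeValue);  // a non-empty text node
    }
}
```
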
User contributions licensed under: CC BY-SA