I want to use the PHP Simple HTML DOM Parser to scrape a website. The source markup is messy, like this:
<font face="Arial" color="#ff0000">
<p>Parameters</p>
</font><font face="Arial" size="2" color="#ff0000">
<p>Param1</p>
</font><font face="Arial" size="2" color="#0000ff">
<p>Details. (Lob., </font><i><font face="Arial"
size="2" color="#ff0000">Co v</font><font face="Arial" size="2"
color="#0000ff">.)</p>
Instead of putting “Details. (Lob., Co v.)” directly inside <p></p>, the page splits it across <font> and <i> tags. When I use this code:
foreach($html->find('p') as $p)
{
echo $p->plaintext.'<br>';
}
I only get “Details. (Lob.,” because the output stops at the first <i> or <font> tag. How can I extract the whole line “Details. (Lob., Co v.)”?
Thank you for your answer
Answer
You can use the strip_tags() function to remove the unnecessary tags. After removing them, you can use the DOM parser.
The strip_tags() function strips HTML, XML, and PHP tags from a string. Its optional second argument lists the tags to keep.
string strip_tags ( string $str [, string $allowable_tags ] )
You can read more about the strip_tags() function on php.net.
Example:
$html = '<font face="Arial" color="#ff0000">
<p>Parameters</p>
</font><font face="Arial" size="2" color="#ff0000">
<p>Param1</p>
</font><font face="Arial" size="2" color="#0000ff">
<p>Details. (Lob., </font><i><font face="Arial"
size="2" color="#ff0000">Co v</font><font face="Arial" size="2"
color="#0000ff">.)</p>';
// Keep only the <p> tags and strip everything else.
$html = strip_tags($html, '<p>');
echo $html;
Result:
<p>Parameters</p>
<p>Param1</p>
<p>Details. (Lob., Co v.)</p>
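Once the markup is reduced to plain <p> elements, the text of each paragraph can be read with a DOM parser. The following is only a minimal sketch: it assumes PHP's built-in DOMDocument (the dom extension) rather than Simple HTML DOM, and the variable names $clean and $texts are made up for illustration.

```php
<?php
// Assumed input: the cleaned markup produced by strip_tags($html, '<p>').
$clean = "<p>Parameters</p>\n<p>Param1</p>\n<p>Details. (Lob., Co v.)</p>";

// Collect libxml warnings internally instead of printing them,
// since loadHTML() can be noisy on HTML fragments.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($clean);
libxml_clear_errors();

// Gather the text content of every <p> element.
$texts = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $texts[] = trim($p->textContent);
}

foreach ($texts as $text) {
    echo $text . "<br>\n";
}
```

If you prefer to stay with Simple HTML DOM, passing the cleaned string to str_get_html() and running the original foreach over find('p') should work along the same lines, since each <p> now contains only text.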