PHP Scrape Article Excerpt like Readability

I’ve seen this question, but it doesn’t really cover what I’m looking for. Its answers either lift the text from the meta description tag or generate an excerpt from an article body you already have.

What I want is to actually get the first few sentences of an article, like Readability does. What’s the best method for this? HTML parsing? Here’s what I’m currently using, but it is not very reliable.

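function guessExcerpt($url) {
    $html = file_get_contents_curl($url);
    if ($html === false || $html === '') {
        return null; // request failed or returned an empty body
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from malformed markup

    $description = null; // stays null when the page has no description tag
    $metas = $doc->getElementsByTagName('meta');

    for ($i = 0; $i < $metas->length; $i++)
    {
        $meta = $metas->item($i);
        // Attribute values vary in case across sites ("Description", "description").
        if (strtolower($meta->getAttribute('name')) == 'description')
            $description = $meta->getAttribute('content');
    }

    return $description;
}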

function file_get_contents_curl($url) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}


Answer

Here is a port of Readability in PHP: https://github.com/andreskrey/readability.php. Just try it. The extraction result will be similar to Readability (because it implements Readability’s algorithm).

require 'lib/Readability.inc.php';

$html = file_get_contents_curl($url);

$Readability     = new Readability($html, $html_input_charset); // default charset is utf-8
$ReadabilityData = $Readability->getContent();

$title   = $ReadabilityData['title'];
$content = $ReadabilityData['content'];


Then you can use some sentences from $content as the excerpt.
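As a rough illustration of that last step, here is a minimal sketch that turns the extracted HTML into a short plain-text excerpt. The function name `excerptFromHtml` and the `$maxSentences` parameter are made up for this example and are not part of any Readability port. The sentence splitting is deliberately naive (it breaks after `.`, `!` or `?` followed by whitespace), so abbreviations such as "e.g." will cut a sentence short. It also assumes the input is valid UTF-8.

```php
<?php
/**
 * Hypothetical helper: build a plain-text excerpt from an HTML fragment,
 * such as the $content value returned by a Readability port.
 */
function excerptFromHtml(string $contentHtml, int $maxSentences = 3): string
{
    // Put a space in front of every tag so that "</p><p>" does not glue
    // two paragraphs together once the tags are stripped.
    $spaced = str_replace('<', ' <', $contentHtml);

    // Strip the tags and decode entities such as &amp; into characters.
    $text = html_entity_decode(strip_tags($spaced), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse whitespace runs (including newlines) into single spaces.
    // preg_replace returns null on invalid UTF-8, hence the fallback.
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? '');
    if ($text === '') {
        return '';
    }

    // Naive sentence split: break on whitespace that follows ., ! or ?
    $sentences = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($sentences === false) {
        return $text; // splitting failed, fall back to the whole text
    }

    return implode(' ', array_slice($sentences, 0, $maxSentences));
}

// Example with a hard-coded fragment standing in for $content.
$sample = '<p>First sentence. Second one!</p><p>Third?   Fourth.</p>';
echo excerptFromHtml($sample, 2), PHP_EOL; // prints: First sentence. Second one!
```

With the snippet from the answer, the call would be `excerptFromHtml($content)`. If you need a hard length limit as well, truncating the result with `mb_substr` is one option.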
