PHP Scrape Article Excerpt like Readability

I’ve seen this question, but it doesn’t really cover what I’m looking for. Its answers either lift the text from the meta description tag or generate an excerpt from an article body you already have.

What I want is to get the first few sentences of an article, the way Readability does. What’s the best method for this? HTML parsing? Here’s what I’m currently using, but it isn’t very reliable.

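function guessExcerpt($url) {
    $html = file_get_contents_curl($url);
    if ($html === false || $html === '') {
        return null; // fetch failed or returned an empty body
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from malformed markup

    // Stays null when the page has no <meta name="description"> tag.
    $description = null;
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $description = $meta->getAttribute('content');
            break;
        }
    }

    return $description;
}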

function file_get_contents_curl($url) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}


Answer

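There is a PHP port of Readability: https://github.com/andreskrey/readability.php. Its extraction results should be close to Readability’s, since it implements the same algorithm. The snippet below assumes a port that ships lib/Readability.inc.php; class and method names can differ between ports, so check the README of the one you install.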

require 'lib/Readability.inc.php';

$html = file_get_contents_curl($url);
$html_input_charset = 'utf-8'; // charset of the fetched page; utf-8 is assumed here

$Readability     = new Readability($html, $html_input_charset); // default charset is utf-8
$ReadabilityData = $Readability->getContent();

$title   = $ReadabilityData['title'];
$content = $ReadabilityData['content'];

You can then take the first few sentences of $content as the excerpt.
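
As a rough sketch of that last step, the helper below strips the markup from the extracted content and keeps the first few sentences. The function name firstSentences and its $maxSentences parameter are made up for illustration and are not part of any Readability port. The sentence splitting is deliberately naive (it breaks after ".", "!" or "?" followed by whitespace), so abbreviations such as "Dr. Smith" will be split in the wrong place.

```php
<?php
// Hypothetical helper: return the first $maxSentences sentences of an
// HTML fragment as plain text. Not part of any Readability library.
function firstSentences(string $contentHtml, int $maxSentences = 3): string
{
    // Put a space in front of common block-level tags and <br> so that
    // "<p>One.</p><p>Two.</p>" does not collapse into "One.Two.".
    $spaced = preg_replace(
        '#<(?:/?(?:p|div|li|h[1-6]|blockquote)|br)\b[^>]*>#i',
        ' $0',
        $contentHtml
    ) ?? $contentHtml;

    // Remove tags first, then decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($spaced), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? '');

    // Naive split: a sentence ends at ., ! or ? followed by whitespace.
    $sentences = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    return implode(' ', array_slice($sentences, 0, $maxSentences));
}
```

With the variables from the snippet above, $excerpt = firstSentences($content, 2); would give a two-sentence excerpt. If the content is too short, the function simply returns whatever sentences it found.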

User contributions licensed under: CC BY-SA