
How to get Wikipedia page HTML with absolute URLs using the API?

I’m trying to retrieve articles through the Wikipedia API using this code:

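// Ask the MediaWiki API for the parsed HTML of the "example" page.
$url = 'https://en.wikipedia.org/w/api.php?action=parse&page=example&format=json&prop=text';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$c = curl_exec($ch);
$json = json_decode($c);
// The rendered HTML is stored under parse -> text -> "*".
$content = $json->{'parse'}->{'text'}->{'*'};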

The content displays on my website and everything works, but I have a problem with the links inside the retrieved article. If you open the URL, you can see that all the links start with href="/, so when someone clicks any related link in the article, it sends them to www.mysite.com/wiki/.. (error 404) instead of en.wikipedia.org/wiki/.. Is there any code I can add to the existing snippet to fix this?

Answer

This seems to be a shortcoming in the MediaWiki action=parse API. In fact, someone already filed a feature request asking for an option to make action=parse return full URLs.

As a workaround, you could either rewrite the links yourself (as adil suggests) or request the page through index.php?action=render instead of the API.

This will only give you the page HTML with no API wrapper, but if that’s all you want anyway then it should be fine. (For example, this is the method used internally by InstantCommons to show remote file description pages.)
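
Both workarounds can be sketched roughly as follows. This is only an illustration, assuming the usual Wikipedia URL layout (/w/index.php for the script, /wiki/... for article links). The helper names wikiRenderUrl, absolutizeLinks and fetchUrl are hypothetical, not part of MediaWiki, and the User-Agent string is a placeholder. The link rewriting is a plain regex replacement rather than a real HTML parse, so treat it as a starting point.

```php
<?php
// Hypothetical helper: build the index.php?action=render URL for a page title.
function wikiRenderUrl(string $title, string $base = 'https://en.wikipedia.org'): string
{
    $query = http_build_query(['action' => 'render', 'title' => $title]);
    return $base . '/w/index.php?' . $query;
}

// Hypothetical helper: prefix root-relative href/src values with the wiki host.
// Protocol-relative URLs ("//upload.wikimedia.org/...") are left untouched.
function absolutizeLinks(string $html, string $base = 'https://en.wikipedia.org'): string
{
    return preg_replace('#\b(href|src)="/(?!/)#', '$1="' . $base . '/', $html);
}

// Fetch a URL with cURL; returns the response body, or null on failure.
function fetchUrl(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Placeholder User-Agent (an assumption); replace with your own contact details.
    curl_setopt($ch, CURLOPT_USERAGENT, 'MySiteWikiFetcher/0.1 (admin@example.com)');
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

With these in place, fetchUrl(wikiRenderUrl('Example')) would retrieve the rendered page directly, while absolutizeLinks($content) could be applied to the $content string from the action=parse code in the question.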

User contributions licensed under: CC BY-SA