How do I download txt web content using perl

I am trying to download data from this data page. I have tried a number of scripts I found by googling. On the data page I have to select the countries I want, one at a time. The one script which comes closest to what I want is:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $url = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';
my $file = 'Zamb.txt';
getstore($url, $file);

However, this script gives me the whole HTML page, not just the data. I would appreciate help downloading the data, if this is possible. I would also be happy to do it in PHP if that is an easier alternative.

Answer

The link returns the data as text wrapped in HTML. The simplest approach is to use HTML::TreeBuilder and HTML::FormatText to extract a text-only version.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;
use HTML::FormatText;

my $url = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';

# Fetch and parse the page, then render it as plain text.
# The huge rightmargin keeps the formatter from wrapping the long SYNOP lines.
my $text = HTML::FormatText->new(leftmargin => 0, rightmargin => 100000000000)
    ->format(HTML::TreeBuilder->new_from_url($url));

my $file = 'Zamb.txt';
open(my $fh, '>', $file) or die "Cannot open $file: $!";
print $fh $text;
close($fh);

This is the content of Zamb.txt afterwards.

 $ cat Zamb.txt
##########################################################
# Query made at 02/29/2020 18:15:54 UTC
##########################################################

##########################################################
# latest SYNOP reports from Zambia before 02/29/2020 18:15:54 UTC
##########################################################
202002291200 AAXX 29124 67855 42775 51401 10310 20168 3//// 48/// 85201
                   333 5//// 85850 83080=

My PHP fu isn't up to date, but I think the following should work:

<?php
$url = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';
// Fetch the page and strip the HTML tags.
$content = strip_tags(file_get_contents($url));
// Print everything from the first row of '#' characters onwards.
echo substr($content, strpos($content, '###############'));

Note: I seem to recall that there is a configuration option (allow_url_fopen) that can disable fetching URLs via file_get_contents, so YMMV.

However, on the same page there is a note:

NOTE: If you want to get simply files with synop reports in CSV format without HTML tags consider to use the binary getsynop

This would get you the same data in an easier-to-use format:

$ wget "https://www.ogimet.com/cgi-bin/getsynop?begin=$(date +%Y%m%d0000)&state=Zambia" -o /dev/null -O - | tail -1
67855,2020,02,29,12,00,AAXX 29124 67855 42775 51401 10310 20168 3//// 48/// 85201 333 5//// 85850 83080=
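
Since the question asked for Perl, here is a minimal sketch of fetching the same getsynop CSV with LWP::Simple. It is only a sketch: the URL and the begin/state parameters are copied from the wget example above, Zamb.csv is an arbitrary output filename, and UTC midnight is assumed for the begin timestamp (the shell example uses the local date instead).

#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple;
use POSIX qw(strftime);

# Build the begin=YYYYMMDD0000 parameter, i.e. midnight of the current day.
# gmtime (UTC) is an assumption here; the wget example uses the local date.
my $begin = strftime('%Y%m%d0000', gmtime);

my $url  = "https://www.ogimet.com/cgi-bin/getsynop?begin=${begin}&state=Zambia";
my $file = 'Zamb.csv';   # arbitrary output filename

# getstore returns the HTTP status code of the request.
my $status = getstore($url, $file);
die "Download failed with HTTP status $status\n" unless is_success($status);

print "Saved SYNOP CSV reports to $file\n";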