I am trying to download data from this data page. I have tried a number of scripts I googled. On the data page I have to select the countries I want, one at a time. The one script which gets close to what I want is:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $url  = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';
my $file = 'Zamb.txt';
getstore($url, $file);
However, this script gives me the whole page, not just the data. I would appreciate help downloading only the data, if this is possible. I would also be happy to do it in PHP if that is an easier alternative.
Answer
The link returns the text wrapped in HTML. The simplest approach is to use HTML::TreeBuilder and HTML::FormatText to extract a text-only version.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

my $url  = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';
my $text = HTML::FormatText->new(leftmargin => 0, rightmargin => 100000000000)
                           ->format(HTML::TreeBuilder->new_from_url($url));
my $file = 'Zamb.txt';
open(my $fh, '>', $file);
print $fh $text;
close($fh);
HTML::TreeBuilder->new_from_url($url)
– downloads and parses the HTML.

HTML::FormatText->new(leftmargin => 0, rightmargin => 100000000000)
– initializes the HTML formatter; the right margin is set to a very large value to prevent line wrapping.
This is the content of Zamb.txt afterwards.
$ cat Zamb.txt
##########################################################
# Query made at 02/29/2020 18:15:54 UTC
##########################################################
##########################################################
# latest SYNOP reports from Zambia before 02/29/2020 18:15:54 UTC
##########################################################
202002291200 AAXX 29124 67855 42775 51401 10310 20168 3//// 48/// 85201 333 5//// 85850 83080=
My PHP fu isn't up to date, but I think you can use the following:
<?php
$url = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';
$content = strip_tags(file_get_contents($url));
echo substr($content, strpos($content, '###############'));
Note: fetching URLs via file_get_contents can be disabled by the allow_url_fopen configuration directive, so YMMV.
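For comparison, the same strip-the-tags idea can be sketched in Python's standard library. This is a minimal sketch only; the sample HTML below is illustrative and not the actual ogimet page.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only text nodes, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

# Illustrative stand-in for the downloaded page content.
sample = "<html><body><pre># Query made at ...\n202002291200 AAXX 29124</pre></body></html>"
p = TextExtractor()
p.feed(sample)
print(p.text())  # the report text with the tags removed
```

In real use you would feed it the downloaded page body instead of the sample string.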
However, on the same page there is a note:
NOTE: If you want to get simply files with synop reports in CSV format without HTML tags consider to use the binary getsynop
This would get you the same data in an easy-to-use format:
$ wget "https://www.ogimet.com/cgi-bin/getsynop?begin=$(date +%Y%m%d0000)&state=Zambia" -o /dev/null -O - | tail -1
67855,2020,02,29,12,00,AAXX 29124 67855 42775 51401 10310 20168 3//// 48/// 85201 333 5//// 85850 83080=
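Once you have the CSV output, each line is straightforward to split into fields. The column layout (station, year, month, day, hour, minute, report) is inferred from the sample line above; a minimal Python sketch:

```python
# One line of getsynop CSV output, taken from the sample above.
line = "67855,2020,02,29,12,00,AAXX 29124 67855 42775 51401 10310 20168 3//// 48/// 85201 333 5//// 85850 83080="

# Split on at most 6 commas so the SYNOP report itself stays in one piece.
station, year, month, day, hour, minute, report = line.split(",", 6)
print(station, f"{year}-{month}-{day} {hour}:{minute}", report)
```

The maxsplit argument keeps the report intact even if it ever contained a comma.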