I'm having trouble parsing a text file in CodeIgniter. For each line in the file I need to capture groups of data. The fields are:
- progressive number
- operator
- manufacturer
- model
- registration
- type
Here is an example of the file lines:
8 SIRIO S.P.A. BOMBARDIER INC. BD-100-1A10 I-FORZ STANDARD
9 ESERCENTE PRIVATO PIAGGIO AERO INDUSTRIES S.P.A. P.180 AVANTI II I-FXRJ SPECIALE/STANDARD
10 MIGNINI & PETRINI S.P.A. ROBINSON HELICOPTER COMPANY R44 II I-HIKE SPECIALE/STANDARD
11 MIGNINI & PETRINI S.P.A. ROBINSON HELICOPTER COMPANY R44 II I-HIKE STANDARD
12 BLUE PANORAMA AIRLINES S.P.A. THE BOEING COMPANY 737-86N I-LCFC STANDARD
To parse each line I’m using the following code:
if ($fh = fopen($filePath, 'r')) {
    while (!feof($fh)) {
        $line = trim(fgets($fh));
        if (preg_match('/^(\d{1,})\s+(\w{1,})\s+(\w{1,})\s+(\w{1,})\s+(\w{1,})\s+(\w{1,})$/i', $line, $matches))
        {
            $regs[] = array(
                'Operator' => $matches[1],
                'Manufacturer' => $matches[2],
                'Model' => $matches[3],
                'Registration' => $matches[4],
                'Type' => $matches[5]
            );
            $this->data['error'] = FALSE;
        }
    }
    fclose($fh);
}
The code above doesn't work. I think it's because some groups of data are made up of more than one word, for example "SIRIO S.P.A.". Any hint to fix this? Thanks a lot for any help.
Answer
You should not use \w to capture the data, because some of the characters in your text, such as &, ., - and /, are not word characters. Moreover, some fields contain several space-separated words, so you should replace \w{1,} with \S+(?: \S+)*, which will capture your text properly into the groups you have defined.
Try changing your regex to this and it should work:
^\s*(\d+)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)$
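As a rough sketch of how this pattern could be applied to one line, the snippet below runs it with preg_match. It assumes that fields in the file are separated by two or more spaces (or a tab) while words inside a field are separated by exactly one space; the sample line is written with two spaces between fields to reflect that assumption, and the 'Number' key is added here only for illustration.

```php
<?php
// Assumption: fields are separated by 2+ spaces (or a tab);
// words inside a field are separated by exactly one space.
$pattern = '/^\s*(\d+)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)$/';

// Sample line with two spaces between fields (assumed layout).
$line = "8  SIRIO S.P.A.  BOMBARDIER INC.  BD-100-1A10  I-FORZ  STANDARD";

$reg = null;
if (preg_match($pattern, $line, $matches)) {
    // $matches[0] is the whole line; the capture groups start at index 1.
    $reg = array(
        'Number'       => $matches[1],
        'Operator'     => $matches[2],
        'Manufacturer' => $matches[3],
        'Model'        => $matches[4],
        'Registration' => $matches[5],
        'Type'         => $matches[6],
    );
}

print_r($reg);
```

Note that group 1 holds the progressive number, so the operator ends up in $matches[2] rather than $matches[1] as in the array from the question.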
Explanation of what \S+(?: \S+)* does in the above regex:
\S+ : \S is the opposite of \s, meaning it matches any non-whitespace character (it won't match a space, tab, newline or any other whitespace). Hence \S+ matches one or more visible characters.
(?: \S+)* : Here ?: only turns the group into a non-capturing group. Inside the group there is a space followed by \S+, and the whole group is enclosed in parentheses with a * quantifier. So this means: match a space followed by one or more non-whitespace characters, and repeat the whole of that zero or more times.
So \S+(?: \S+)* will match abc or abc xyz or abc pqr xyz and so on, but the moment more than one space appears, the match stops, because the regex allows only a single space before each \S+.
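A small check of that behaviour, as a sketch: the sub-pattern is anchored with ^ here (an addition for this demo only) so that each match starts at the beginning of the test string.

```php
<?php
// The field sub-pattern on its own, anchored at the start for the demo.
$sub = '/^\S+(?: \S+)*/';

preg_match($sub, 'abc', $m1);
preg_match($sub, 'abc pqr xyz', $m2);
preg_match($sub, 'abc xyz  pqr', $m3); // two spaces before "pqr"

echo $m1[0], "\n"; // abc
echo $m2[0], "\n"; // abc pqr xyz
echo $m3[0], "\n"; // abc xyz  (stops at the double space)
```

This is why the pattern relies on the gap between fields being wider than the gap between words inside a field.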
Hope my explanation is clear. If still any doubt, please feel free to ask.