I ran into trouble parsing a text file in CodeIgniter. For each line in the file I need to capture groups of data. The fields are:
– progressive number
– operator
– manufacturer
– model
– registration
– type
Here is an example of the file lines:
8 SIRIO S.P.A. BOMBARDIER INC. BD-100-1A10 I-FORZ STANDARD
9 ESERCENTE PRIVATO PIAGGIO AERO INDUSTRIES S.P.A. P.180 AVANTI II I-FXRJ SPECIALE/STANDARD
10 MIGNINI & PETRINI S.P.A. ROBINSON HELICOPTER COMPANY R44 II I-HIKE SPECIALE/STANDARD
11 MIGNINI & PETRINI S.P.A. ROBINSON HELICOPTER COMPANY R44 II I-HIKE STANDARD
12 BLUE PANORAMA AIRLINES S.P.A. THE BOEING COMPANY 737-86N I-LCFC STANDARD
To parse each line I’m using the following code:
if ($fh = fopen($filePath, 'r')) {
    while (!feof($fh)) {
        $line = trim(fgets($fh));
        if (preg_match('/^(\d{1,})\s+(\w{1,})\s+(\w{1,})\s+(\w{1,})\s+(\w{1,})\s+(\w{1,})$/i', $line, $matches)) {
            $regs[] = array(
                'Operator' => $matches[1],
                'Manufacturer' => $matches[2],
                'Model' => $matches[3],
                'Registration' => $matches[4],
                'Type' => $matches[5]
            );
            $this->data['error'] = FALSE;
        }
    }
    fclose($fh);
}
The code above doesn’t work. I think it is because some fields consist of more than one word, for example “SIRIO S.P.A.”. Any hints on how to fix this? Thanks a lot for any help.
Answer
You should not use \w for capturing the data, because some of the characters in your text, like &, ., - and /, are not word characters. Moreover, some fields contain several words separated by a space, so you should replace \w{1,} with \S+(?: \S+)* which will capture your text properly into the groups you have defined.
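To make the difference concrete, here is a minimal sketch that tests a few values from the question's sample data against both sub-patterns. The variable names are only illustrative.

```php
<?php
// Field values copied from the sample lines in the question.
$fields = ['SIRIO', 'S.P.A.', 'BD-100-1A10', 'SPECIALE/STANDARD', 'MIGNINI & PETRINI S.P.A.'];

$results = [];
foreach ($fields as $field) {
    $results[$field] = [
        // \w+ accepts only letters, digits and underscore.
        'word'      => preg_match('/^\w+$/', $field) === 1,
        // \S+(?: \S+)* accepts runs of non-whitespace joined by single spaces.
        'non_space' => preg_match('/^\S+(?: \S+)*$/', $field) === 1,
    ];
}

foreach ($results as $field => $r) {
    echo $field, ' => \w+: ', var_export($r['word'], true),
         ', \S+(?: \S+)*: ', var_export($r['non_space'], true), PHP_EOL;
}
```

Only SIRIO passes the \w+ check, while every value passes the \S+(?: \S+)* check.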
Try changing your regex to this, and it should work:
^\s*(\d+)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)$
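As a rough sketch of how this pattern could be plugged into the parsing loop: the helper name parseAircraftLine is hypothetical, and the sample line assumes that the columns in the real file are separated by two or more spaces, since words inside a field are separated by exactly one. Note also that $m[1] holds the progressive number, so the operator is $m[2], not $matches[1] as in the question's code.

```php
<?php
// Hypothetical helper; the array keys follow the field list in the question.
function parseAircraftLine(string $line): ?array
{
    // One field: non-whitespace runs joined by single spaces.
    $field = '(\S+(?: \S+)*)';
    // Same pattern as in the answer, built by repeating the field group 5 times.
    $pattern = '/^\s*(\d+)\s+' . implode('\s+', array_fill(0, 5, $field)) . '$/';

    if (preg_match($pattern, trim($line), $m) !== 1) {
        return null; // line does not look like a record
    }

    return [
        'Number'       => (int) $m[1],
        'Operator'     => $m[2],
        'Manufacturer' => $m[3],
        'Model'        => $m[4],
        'Registration' => $m[5],
        'Type'         => $m[6],
    ];
}

// Assumed input format: at least two spaces between columns.
$line = '10  MIGNINI & PETRINI S.P.A.  ROBINSON HELICOPTER COMPANY  R44 II  I-HIKE  SPECIALE/STANDARD';
$record = parseAircraftLine($line);
print_r($record);
```

If the columns in the file are separated by only a single space, the pattern cannot tell where one field ends and the next begins, so it is worth checking the raw file first.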
Explanation of what \S+(?: \S+)* does in the above regex:
\S+ – \S is the opposite of \s, meaning it matches any non-whitespace character (it won’t match a space, tab, newline or any other whitespace). Hence \S+ matches one or more non-whitespace characters.
(?: \S+)* – Here ?: only turns the group into a non-capturing group. Inside it are a space and \S+, and the whole group is followed by the * quantifier. So this means: match a space followed by one or more non-whitespace characters, and repeat that zero or more times.
So \S+(?: \S+)* will match abc, or abc xyz, or abc pqr xyz, and so on, but the moment more than one space appears the match stops, because the regex allows only a single space before each \S+.
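The stopping behaviour can be checked with a small sketch like the following. The pattern is anchored at the start of the string only, so the matched text shows how far the sub-pattern extends. The sample strings are made up for illustration.

```php
<?php
$pattern = '/^\S+(?: \S+)*/';

// The last sample contains two consecutive spaces.
$samples = ['abc', 'abc xyz', 'abc pqr xyz', 'abc  xyz'];

$matched = [];
foreach ($samples as $sample) {
    preg_match($pattern, $sample, $m);
    $matched[$sample] = $m[0]; // the text consumed by the sub-pattern
}

print_r($matched);
```

For the last sample the match is just abc, because the second space is not followed directly by a non-whitespace character.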
I hope my explanation is clear. If you still have any doubts, please feel free to ask.