
Capturing groups in string using preg_match

I'm having trouble parsing a text file in CodeIgniter. For each line in the file I need to capture these groups of data: progressive number, operator, manufacturer, model, registration, and type.

Here is an example of the file lines:

 8  SIRIO S.P.A.                                             BOMBARDIER INC.                                       BD-100-1A10             I-FORZ              STANDARD

 9  ESERCENTE PRIVATO                                        PIAGGIO AERO INDUSTRIES S.P.A.                        P.180 AVANTI II         I-FXRJ              SPECIALE/STANDARD

10  MIGNINI & PETRINI S.P.A.                                 ROBINSON HELICOPTER COMPANY                           R44 II                  I-HIKE              SPECIALE/STANDARD

11  MIGNINI & PETRINI S.P.A.                                 ROBINSON HELICOPTER COMPANY                           R44 II                  I-HIKE              STANDARD

12  BLUE PANORAMA AIRLINES S.P.A.                            THE BOEING COMPANY                                    737-86N                 I-LCFC              STANDARD

To parse each line I’m using the following code:

if ($fh = fopen($filePath, 'r')) {
    while (!feof($fh)) {
        $line = trim(fgets($fh));

        if(preg_match('/^(\d{1,})\s+(\w{1,})\s+(\w{1,})\s+(\w{1,})\s+(\w{1,})\s+(\w{1,})$/i', $line, $matches))
       {
             $regs[] = array(
             'Operator'     => $matches[1],
             'Manufacturer' => $matches[2],
             'Model'        => $matches[3],
             'Registration' => $matches[4],
             'Type'         => $matches[5]
             );
             $this->data['error'] = FALSE;
        }
    }
    fclose($fh);
 }

The code above doesn't work. I think it's because some groups of data consist of more than one word, for example “SIRIO S.P.A.” Any hint on how to fix this? Thanks a lot for any help.


Answer

You should not use \w to capture the data, because some characters in your text, such as &, ., - and /, are not word characters. Moreover, some fields contain several words separated by single spaces, so you should replace \w{1,} with \S+(?: \S+)*, which captures each field into the groups you have defined.

Try changing your regex to this and it should work,

^\s*(\d+)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)$

Check this demo
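As a rough sketch of how this regex could be applied in PHP: the helper name parseRegistryLine and the 'Number' key are hypothetical, not from the question. It assumes columns are always separated by at least two spaces, as in the sample lines. Note that group 1 holds the progressive number, so the text fields start at index 2 rather than 1.

```php
<?php
// Hypothetical helper, not from the original post: parses one line of the
// file into named fields, or returns null when the line does not match.
function parseRegistryLine(string $line): ?array
{
    // One field: runs of non-whitespace joined by single spaces.
    $field = '(\S+(?: \S+)*)';
    // Number, then five fields, each separated by one or more whitespace chars.
    $pattern = '/^\s*(\d+)\s+' . implode('\s+', array_fill(0, 5, $field)) . '$/';

    if (!preg_match($pattern, trim($line), $m)) {
        return null;
    }

    // $m[1] is the progressive number, so the text fields start at $m[2].
    return [
        'Number'       => (int) $m[1],
        'Operator'     => $m[2],
        'Manufacturer' => $m[3],
        'Model'        => $m[4],
        'Registration' => $m[5],
        'Type'         => $m[6],
    ];
}

$row = parseRegistryLine(' 9  ESERCENTE PRIVATO     PIAGGIO AERO INDUSTRIES S.P.A.     P.180 AVANTI II     I-FXRJ     SPECIALE/STANDARD');
print_r($row);
```

Inside the question's loop, the same function could be called on each line returned by fgets, skipping lines for which it returns null. If two columns were ever separated by a single space, they would be merged into one field, so this approach depends on the spacing assumption above.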

Explanation of what \S+(?: \S+)* does in the above regex:

  • \S+: \S is the opposite of \s, meaning it matches any non-whitespace character (it won't match a space, tab, newline, or any other whitespace). Hence \S+ matches one or more non-whitespace characters.
  • (?: \S+)*: here ?: only turns the group into a non-capturing group. Inside it is a single space followed by \S+, and the whole group carries the * quantifier. So it matches a space followed by one or more non-whitespace characters, repeated zero or more times.

So \S+(?: \S+)* will match abc, abc xyz, abc pqr xyz and so on, but the moment more than one space appears the match stops, because the regex allows only a single space before each \S+.
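A small sketch of that behaviour, using illustrative input strings; the last input contains a double space to show where the match stops:

```php
<?php
// The sub-pattern on its own: non-whitespace runs joined by single spaces.
$pattern = '/\S+(?: \S+)*/';

// Illustrative inputs; the last one has two spaces before "xyz".
$inputs = ['abc', 'abc xyz', 'abc pqr xyz', 'abc pqr  xyz'];

$results = [];
foreach ($inputs as $input) {
    preg_match($pattern, $input, $m);
    $results[$input] = $m[0]; // first (leftmost) match only
}

print_r($results);
// 'abc pqr  xyz' yields 'abc pqr': the double space ends the match.
```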

Hope my explanation is clear. If still any doubt, please feel free to ask.

User contributions licensed under: CC BY-SA