
Can’t figure out a solution to this regex

I have the following list of strings:

$list = array(
 'c1' => '{sometext...} 1tb hdd 1tb hdd {sometext...}',
 'c2' => '{sometext...} 1tb hdd 1tb {sometext...}',
 'c3' => '{sometext...} hdd 1tb hdd 1tb {sometext...}',
 'c4' => '{sometext...} hdd 1tb hdd 1tb hdd {sometext...}'
);

and the following regular expression, which should run against every string and return true if a match is found and false otherwise:

/(?<!hdd\s)(\dtb hdd \dtb){1,}(?!\shdd)/

As of now, my result set looks something like this:

'c1' => false,
'c2' => true,
'c3' => false,
'c4' => false

However, the correct result would mark c4 as true as well. How can I change my regex to achieve the desired result?

USE CASE: the goal is to correctly identify ambiguous attributes in product titles. In c1 and c3 it is easy to decide which capacity belongs to which storage device. In the other two cases it cannot be decided programmatically, because there is an hdd without a capacity value.

NOTE: Counting the number of hdd instances in the string is not a good solution, because the {sometext...} parts of the string may contain other instances of that text as noise.


Answer

You can use

(?<=(hdd\s)|)\dtb hdd \dtb(?(1)(?=\shdd)|(?!\shdd))
(?:hdd\s+\dtb hdd \dtb(?!\s+hdd)|(?<!hdd\s)\dtb hdd \dtb\s+hdd)(*SKIP)(*F)|\dtb hdd \dtb

See the regex demo #1 and regex demo #2.

Details #1:

  • (?<=(hdd\s)|) – checks if there is hdd + whitespace (captured into Group 1) or an empty string immediately to the left of the current location
  • \dtb hdd \dtb – matches digit + tb hdd + digit + tb
  • (?(1)(?=\shdd)|(?!\shdd)) – if Group 1 matched, requires a whitespace and hdd immediately to the right of the current location; otherwise, requires that this pattern cannot be found at the same location.

Details #2:

  • (?:hdd\s+\dtb hdd \dtb(?!\s+hdd)|(?<!hdd\s)\dtb hdd \dtb\s+hdd)(*SKIP)(*F) – matches either hdd\s+\dtb hdd \dtb that is not immediately followed by 1+ whitespaces + hdd, or \dtb hdd \dtb\s+hdd that is not immediately preceded by hdd + whitespace; it then fails these matches and goes on to search for the next match from the failure location
  • | – or
  • \dtb hdd \dtb – matches digit, tb hdd , digit, tb.
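
As a rough illustration of the second pattern, here is a minimal sketch that runs it over the four sample strings from the question. The variable names ($pattern, $titles, $ambiguous) are made up for this example, and the pattern is written with its backslashes (\d for a digit, \s for whitespace).

```php
<?php
// Pattern #2: the first group matches the two unambiguous layouts and
// discards them with (*SKIP)(*F); the last alternative matches what is left.
$pattern = '~(?:hdd\s+\dtb hdd \dtb(?!\s+hdd)|(?<!hdd\s)\dtb hdd \dtb\s+hdd)(*SKIP)(*F)|\dtb hdd \dtb~';

// Sample titles copied from the question (hypothetical variable name).
$titles = array(
    'c1' => '{sometext...} 1tb hdd 1tb hdd {sometext...}',
    'c2' => '{sometext...} 1tb hdd 1tb {sometext...}',
    'c3' => '{sometext...} hdd 1tb hdd 1tb {sometext...}',
    'c4' => '{sometext...} hdd 1tb hdd 1tb hdd {sometext...}'
);

// preg_grep keeps only the entries that match, preserving their keys.
$ambiguous = preg_grep($pattern, $titles);

print_r(array_keys($ambiguous)); // keys of the titles flagged as ambiguous
```

With these inputs the sketch should flag c2 and c4, the same two entries the first pattern selects.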

See the PHP demo:

$list = array(
 'c1' => '{sometext...} 1tb hdd 1tb hdd {sometext...}',
 'c2' => '{sometext...} 1tb hdd 1tb {sometext...}',
 'c3' => '{sometext...} hdd 1tb hdd 1tb {sometext...}',
 'c4' => '{sometext...} hdd 1tb hdd 1tb hdd {sometext...}'
);
print_r(preg_grep('~(?<=(hdd\s)|)\dtb hdd \dtb(?(1)(?=\shdd)|(?!\shdd))~', $list));
// => Array
//   (
//     [c2] => {sometext...} 1tb hdd 1tb {sometext...}
//     [c4] => {sometext...} hdd 1tb hdd 1tb hdd {sometext...}
//   )
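
If you need the true/false map from the question rather than the filtered array, one possible sketch is to call preg_match per entry. The names $pattern, $titles and $flags are assumptions made for this example, not part of the original code.

```php
<?php
// Pattern #1, with \d (digit) and \s (whitespace) written out.
$pattern = '~(?<=(hdd\s)|)\dtb hdd \dtb(?(1)(?=\shdd)|(?!\shdd))~';

// Sample titles copied from the question (hypothetical variable name).
$titles = array(
    'c1' => '{sometext...} 1tb hdd 1tb hdd {sometext...}',
    'c2' => '{sometext...} 1tb hdd 1tb {sometext...}',
    'c3' => '{sometext...} hdd 1tb hdd 1tb {sometext...}',
    'c4' => '{sometext...} hdd 1tb hdd 1tb hdd {sometext...}'
);

// preg_match returns 1 on a match, 0 on no match, and false on error,
// so comparing with === 1 gives a strict boolean per key.
$flags = array();
foreach ($titles as $key => $title) {
    $flags[$key] = preg_match($pattern, $title) === 1;
}

var_export($flags);
```

For the sample data this should give false for c1 and c3 and true for c2 and c4, which is the result set the question asks for.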
User contributions licensed under: CC BY-SA