Regex to find repeated sentences from more to less

I have a string like this:

$string = "hello this is a string and hello but this is a string hello but this is a string and ";

It contains repeated words and repeated sentences, but I only want the sentences, so I expect

hello but this is a string to be captured.

I tried the regex (.{10,}).*?\1 but it gave me this is a string and

But I want to get hello but this is a string, because it is the longest repeat of 10+ characters, and I don't want to raise the quantifier to {25,} just to make it match more.

The regex is also very slow.


  • Cary Swoveland: my plan is to capture the longest repeated string and remove the repeat from the string, leaving only one copy, so in my example the result would be

hello this is a string and hello but this is a string and

Answer

Collect all repeated substrings that start with a word char at a word boundary position, then pick the longest one in an extra step (a plain regex cannot select the longest of several candidates on its own):

$string = "hello this is a string and hello but this is a string hello but this is a string and ";
if (preg_match_all('~(?=(\b(\w.{9,})(?=.*?\b\2)))~u', $string, $m)) {
    echo array_reduce($m[1], function ($a, $b) { return strlen($a ?? '') > strlen($b ?? '') ? $a : $b; });
}
// => hello but this is a string 

See the PHP demo. See the regex demo.

Note: if you plan to limit the length of the matches to 25 chars, use '~(?=(\b(\w.{9,24})(?=.*?\b\2)))~u'.

Details:

  • (?= – start of a positive lookahead:
    • ( – Group 1:
      • \b – a word boundary
      • (\w.{9,}) – Group 2: a word char and then nine or more chars other than line break chars, as many as possible
      • (?=.*?\b\2) – a positive lookahead that requires any zero or more chars other than line break chars, as few as possible, and then the same string as captured in Group 2, preceded with a word boundary
    • ) – end of Group 1
  • ) – end of lookahead.
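
To see why the whole pattern sits inside a lookahead, here is a minimal sketch on a shorter input. The sample string and the lowered minimum length (.{2,} instead of .{9,}) are assumptions chosen only to keep the output small. Since the lookahead consumes no text, preg_match_all retries at every position and can return candidates that overlap each other:

```php
<?php
// Shortened sample input (an assumption for illustration, not the asker's string).
$sample = "one two one two";

// Same pattern shape as above, with the minimum lowered to 3 chars in total.
preg_match_all('~(?=(\b(\w.{2,})(?=.*?\b\2)))~u', $sample, $m);

// $m[1] holds one candidate per start position where a repeat exists.
print_r($m[1]); // "one two" (from offset 0) and "two" (from offset 4)
```

Note that "two" is reported even though it lies inside the earlier "one two" candidate, which is what the extra array_reduce step then sorts out.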

We only get the longest string from the $m[1] array using array_reduce($m[1], function ($a, $b) { return strlen($a ?? '') > strlen($b ?? '') ? $a : $b; }).
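
If the goal is the one described in the comment under the question (keep one copy of the longest repeat and drop the other), the two steps could be wrapped up roughly as below. This is only a sketch: longestRepeated and dropSecondOccurrence are hypothetical helper names, not built-in PHP functions, and the code assumes that only the second occurrence should be removed.

```php
<?php
// Hypothetical helper: longest repeated substring of at least $min chars,
// or null when the pattern finds no repeat.
function longestRepeated(string $s, int $min = 10): ?string
{
    $rest = $min - 1; // \w already accounts for the first char
    if (!preg_match_all('~(?=(\b(\w.{' . $rest . ',})(?=.*?\b\2)))~u', $s, $m)) {
        return null;
    }
    return array_reduce($m[1], function ($a, $b) {
        return strlen($a ?? '') > strlen($b ?? '') ? $a : $b;
    });
}

// Hypothetical helper: remove the second occurrence of $needle, keep the first.
function dropSecondOccurrence(string $s, string $needle): string
{
    $first = strpos($s, $needle);
    if ($first === false) {
        return $s;
    }
    $second = strpos($s, $needle, $first + strlen($needle));
    if ($second === false) {
        return $s;
    }
    return substr_replace($s, '', $second, strlen($needle));
}

$string = "hello this is a string and hello but this is a string hello but this is a string and ";
$longest = longestRepeated($string);
$result = $longest !== null ? dropSecondOccurrence($string, $longest) : $string;
echo $result;
// => hello this is a string and hello but this is a string and 
```

strpos and substr_replace work on bytes, just like the strlen comparison above, so the offsets stay consistent with each other even for multibyte input.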

User contributions licensed under: CC BY-SA