I have a string like this:
$string = "hello this is a string and hello but this is a string hello but this is a string and ";
It contains repeated words and repeated sentences, but I only want the sentences, so I expect
hello but this is a string
to be captured.
I tried this regex: (.{10,}).*?\1
but it gave me "this is a string and",
whereas I want "hello but this is a string", because it is the longest repeated run of 10 or more characters.
I would like to get it without raising the minimum to {25,} just to force the longer match.
The regex is also very slow.
- Cary Swoveland: my plan is to capture the longest repeated string and remove its duplicates from the string, leaving only one copy, so in my example the result would be
hello this is a string and hello but this is a string and
Answer
Collect all substrings that start with a word char at a word boundary position and get the longest one using an extra step (as it is impossible to do with plain regex):
$string = "hello this is a string and hello but this is a string hello but this is a string and ";
if (preg_match_all('~(?=(\b(\w.{9,})(?=.*?\b\2)))~u', $string, $m)) {
    // Keep the longest of the captured candidates
    echo array_reduce($m[1], function ($a, $b) {
        return strlen($a ?? '') > strlen($b ?? '') ? $a : $b;
    });
}
// => hello but this is a string
See the PHP demo. See the regex demo.
Note: if you plan to limit the length of the matches to 25 chars, use '~(?=(\b(\w.{9,24})(?=.*?\b\2)))~u'.
Details:
(?= – start of a positive lookahead:
( – start of Group 1:
\b – word boundary
(\w.{9,}) – Group 2: a word char and then nine or more chars other than line break chars
(?=.*?\b\2) – a positive lookahead that requires any zero or more chars other than line break chars, as few as possible, and then the same string as captured in Group 2 preceded with a word boundary
) – end of Group 1
) – end of lookahead.
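To make the role of the outer lookahead more concrete, here is a minimal sketch that lists every candidate the pattern collects in $m[1]. The variable name $candidates and the sorting by length are additions for illustration only and are not part of the answer's code.

```php
<?php
$string = "hello this is a string and hello but this is a string hello but this is a string and ";

// The whole pattern sits inside a lookahead, so every match is zero-width
// and the engine retries at each position: overlapping candidates are kept.
preg_match_all('~(?=(\b(\w.{9,})(?=.*?\b\2)))~u', $string, $m);

// Illustrative name: the Group 1 captures, de-duplicated.
$candidates = array_values(array_unique($m[1]));

// Longest first, so the wanted substring ends up at index 0.
usort($candidates, function ($a, $b) {
    return strlen($b) <=> strlen($a);
});

foreach ($candidates as $candidate) {
    // Brackets make trailing spaces in a candidate visible.
    printf("%2d  [%s]\n", strlen($candidate), $candidate);
}
```

Printing the candidates in brackets shows whether a capture includes a trailing space, which matters if the substring is later used for search and replace.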
We only get the longest string from the $m[1] array using array_reduce($m[1], function ($a, $b) { return strlen($a ?? '') > strlen($b ?? '') ? $a : $b; }).
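The question's follow-up comment asks to keep only one copy of the longest repeated substring. One possible way to do that on top of the regex above is sketched below. The function name collapseLongestRepeat and its $minLength parameter are hypothetical, and collapsing leftover runs of whitespace is an assumption about the desired output rather than something the answer specifies.

```php
<?php
/**
 * Hypothetical helper (not from the answer): keep the first occurrence of the
 * longest repeated substring and drop the later ones.
 */
function collapseLongestRepeat(string $string, int $minLength = 10): string
{
    // Same pattern as in the answer, with the minimum length made configurable.
    $pattern = '~(?=(\b(\w.{' . ($minLength - 1) . ',})(?=.*?\b\2)))~u';
    if (!preg_match_all($pattern, $string, $m)) {
        return $string; // nothing is repeated, leave the input alone
    }

    // Same reduction as in the answer: pick the longest candidate.
    $longest = array_reduce($m[1], function ($a, $b) {
        return strlen($a ?? '') > strlen($b ?? '') ? $a : $b;
    });

    // Split right after the first occurrence, then remove repeats from the rest.
    $cut  = strpos($string, $longest) + strlen($longest);
    $head = substr($string, 0, $cut);
    $tail = str_replace($longest, '', substr($string, $cut));

    // Assumption: runs of whitespace left behind by the removal are unwanted.
    return trim(preg_replace('~\s{2,}~', ' ', $head . $tail));
}

$string = "hello this is a string and hello but this is a string hello but this is a string and ";
echo collapseLongestRepeat($string);
// => hello this is a string and hello but this is a string and
```

This still relies on the same backtracking-heavy pattern, so it does not address the speed concern from the question; it only shows the removal step.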