I have a text variable that contains multiple img tags with relative or absolute paths. I need to check each src attribute: if it starts with http or https, leave it alone; if it starts with / or with something like abc/, prepend a base URL.
This is what I tried:
<?php
$html = <<<HTML
<img src="docs/relative/url/img.jpg" />
<img src="/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
HTML;

$base = 'https://example.com/';
$pattern = "/<img src=\"[^http|https]([^\"]*)\"/";
$replace = "<img src=\"" . $base . "\${1}\"";
echo $text = preg_replace($pattern, $replace, $html);
My output is:
<img src="https://example.com/ocs/relative/url/img.jpg" />
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
The issue: the result is almost correct, but when the src attribute starts with something like docs/, its first letter is cut off (see the first img src in the output above).
The output I need is:
<img src="https://example.com/docs/relative/url/img.jpg" /><!--check this and compare with current result, you will get the difference -->
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
Could anyone help me fix it?
Answer
The following pattern targets src attributes that do not start with http (which also covers https). For relative paths that begin with a forward slash, the leading slash is removed before the $base string is prepended to the src value.
Code: (Demo)
$base = 'https://example.com/';
echo preg_replace('~ src="(?!http)\K/?~', $base, $html);
Output:
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
Breakdown:
~          #starting pattern delimiter
 src="     #match space, s, r, c, =, then "
(?!http)   #only continue matching if not https or http
\K         #forget any previously matched characters so they are not destroyed by the replacement string
/?         #optionally match a forward slash
~          #ending pattern delimiter
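To make the effect of \K concrete, here is a minimal, self-contained sketch that runs the pattern against the four sample tags from the question and compares it with a capture-group version written only for illustration. The variable names $withK and $withCapture are made up for this example.

```php
<?php
// Sample input mirroring the four cases from the question.
$html = <<<HTML
<img src="docs/relative/url/img.jpg" />
<img src="/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
HTML;

$base = 'https://example.com/';

// With \K: everything matched before \K is dropped from the reported match,
// so only the optional leading slash is replaced by $base.
$withK = preg_replace('~ src="(?!http)\K/?~', $base, $html);

// Without \K (illustrative alternative): capture the prefix and
// write it back by hand in the replacement string.
$withCapture = preg_replace('~( src=")(?!http)/?~', '${1}' . $base, $html);

var_dump($withK === $withCapture);
echo $withK, PHP_EOL;
```

Both calls should produce the same string; \K simply saves you from capturing the prefix and writing it back in the replacement.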
As for your pattern, /<img src="[^http|https]([^"]*)"/:
- [^http|https] actually means "match a single character that is not from this list: |, h, t, p, and s". It could be simplified to [^|hpst] because the order of the listed characters in a negated character class is irrelevant and duplicating characters is meaningless. So you see, [^...] is not how you say "a string starts with something or somethingelse". That single character is also consumed by the match, which is why the d of docs/ went missing from your output.
- Capturing all remaining characters up to the next double quote with the intent to use them again in the replacement is unnecessary. This is why I use \K to pinpoint where $base should be injected instead of ([^"]*).
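The difference between the negated character class and a negative lookahead can be checked in isolation. The sketch below uses shortened patterns and made-up sample strings (src="docs/img.jpg" and src="pics/img.jpg" are not from the question); treat it as an illustration rather than a drop-in fix.

```php
<?php
// The negated class consumes exactly one character that is not h, t, p, s, or |.
$classPattern = '~src="[^http|https]([^"]*)"~';

preg_match($classPattern, 'src="docs/img.jpg"', $m);
echo $m[1], PHP_EOL; // the "d" was eaten by the class, so the capture starts at "ocs"

// A path whose first character is in the class (here "p") does not match at all.
$skipped = preg_match($classPattern, 'src="pics/img.jpg"');
var_dump($skipped);

// A negative lookahead only asserts; it consumes nothing, so the path stays whole.
$lookaheadPattern = '~src="(?!http)([^"]*)"~';

preg_match($lookaheadPattern, 'src="docs/img.jpg"', $m2);
echo $m2[1], PHP_EOL;

preg_match($lookaheadPattern, 'src="pics/img.jpg"', $m3);
echo $m3[1], PHP_EOL;
```

This shows two separate failure modes of the character class: it swallows the first character of paths it accepts, and it silently skips relative paths that happen to begin with h, t, p, or s.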
Furthermore, I always recommend the stability of a DOM parser when dealing with a valid HTML document. You can use DOMDocument with XPath to target the qualifying elements and modify their src attributes without regex.
Code: (Demo)
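$dom = new DOMDocument;
// The flags stop libxml from wrapping the fragment in <html>/<body> and from adding a doctype
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
// Select only the img elements whose src does not start with "http"
foreach ($xpath->query("//img[not(starts-with(@src, 'http'))]") as $node) {
    // Strip any leading slash, then prepend the base URL
    $node->setAttribute('src', $base . ltrim($node->getAttribute('src'), '/'));
}
echo $dom->saveHTML();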
A related answer: https://stackoverflow.com/a/48837947/2943403