I have a text variable that contains multiple img tags with relative or absolute paths. If a src attribute starts with http or https, it should be left alone; if it starts with / or with something like abc/, a base URL should be prepended.
Here is what I tried:
<?php
$html = <<<HTML
<img src="docs/relative/url/img.jpg" />
<img src="/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
HTML;
$base = 'https://example.com/';
$pattern = "/<img src=\"[^http|https]([^\"]*)\"/";
$replace = "<img src=\"" . $base . "\${1}\"";
echo $text = preg_replace($pattern, $replace, $html);
My output is:
<img src="https://example.com/ocs/relative/url/img.jpg" />
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
The issue: the result is almost correct, but when the src attribute starts with something like docs/, its first letter is cut off (see the first img src in the output).
Output I needed is:
<img src="https://example.com/docs/relative/url/img.jpg" /><!--check this and compare with current result, you will get the difference -->
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
Could anyone help me fix this?
Answer
The following pattern will seek src attributes that do not start with http or https. Then for relative paths that begin with a forward slash, the leading slash will be removed before prepending the $base string to the src value.
Code: (Demo)
$base = 'https://example.com/';
echo preg_replace('~ src="(?!http)\K/?~', $base, $html);
Output:
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
Breakdown:
~          #starting pattern delimiter
 src="     #match space, s, r, c, =, then "
(?!http)   #only continue matching if the value does not start with http (this also covers https)
\K         #forget any previously matched characters so they are not destroyed by the replacement string
/?         #optionally match a forward slash
~          #ending pattern delimiter
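To see the \K step in isolation, here is a minimal sketch that compares it with the capture-group equivalent. The sample string and the variable names $sample, $withK and $withGroup are made up for illustration; only the first pattern comes from the answer.

```php
<?php
$base = 'https://example.com/';

// Hypothetical sample: one root-relative, one relative, one absolute URL.
$sample = '<img src="/a.jpg" /><img src="b.jpg" /><img src="https://x.test/c.jpg" />';

// With \K: everything matched before \K is dropped from the reported match,
// so only the optional leading slash (possibly empty) is replaced by $base.
$withK = preg_replace('~ src="(?!http)\K/?~', $base, $sample);

// Without \K: the prefix must be captured and written back in the replacement.
$withGroup = preg_replace('~( src=")(?!http)/?~', '$1' . $base, $sample);

echo $withK, "\n";
var_dump($withK === $withGroup);
```

Both calls should give the same string. The \K version just avoids the capture group and the back-reference. Note that in the second call $base is interpreted as part of the replacement string, so a base containing $ or \ would need escaping there.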
As for your pattern, /<img src="[^http|https]([^"]*)"/:
- [^http|https] actually means "match a single character that is not from this list: |, h, t, p, and s". It could be simplified to [^|hpst] because the order of the listed characters in the "negated character class" is irrelevant and duplicating characters is meaningless. So you see, [^...] is not how you say "a string starts with something or somethingelse".
- Capturing all remaining characters in a substring until the next double quote with the intent to use it again in the replacement is unnecessary. This is why I use \K to pinpoint where $base should be injected instead of ([^"]*).
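The character-class point can be checked directly. The sketch below runs the original class, the simplified class, and a negative lookahead against the first character of a few made-up src values. The values in $inputs and the $results array are assumptions for illustration only.

```php
<?php
// Hypothetical sample values; the two character classes only look at the first character.
$inputs = ['docs/img.jpg', '/docs/img.jpg', 'static/img.jpg', 'https://example.com/img.jpg'];

$results = [];
foreach ($inputs as $src) {
    $results[$src] = [
        'original'   => preg_match('/^[^http|https]/', $src), // class from the question
        'simplified' => preg_match('/^[^|hpst]/', $src),      // same excluded characters
        'lookahead'  => preg_match('/^(?!http)/', $src),      // "does not start with http"
    ];
    printf(
        "%-28s original=%d simplified=%d lookahead=%d\n",
        $src,
        $results[$src]['original'],
        $results[$src]['simplified'],
        $results[$src]['lookahead']
    );
}
```

The two character classes agree on every input, and both reject static/img.jpg simply because it begins with s. The lookahead accepts it and rejects only the value that really starts with http.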
Furthermore, I always recommend the stability of a DOM parser when dealing with a valid HTML document. You can use DOMDocument with XPath to target the qualifying elements and modify the src attributes without regex.
Code: (Demo)
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//img[not(starts-with(@src, 'http'))]") as $node) {
    $node->setAttribute('src', $base . ltrim($node->getAttribute('src'), '/'));
}
echo $dom->saveHTML();
A related answer: https://stackoverflow.com/a/48837947/2943403