I’ve finally managed to make a function which does the following:
- Takes a string as input. This can be either an entire HTML document or an HTML “snippet” (even a broken one).
- Creates a DOMDocument from this and loops through all nodes.
- Whenever it encounters any node whose element is outside of a whitelist of basic structural elements, it “marks it for deletion”. For example, `<script>` is not whitelisted.
- Whenever any node has ANY attribute starting with “on”, this is immediately removed with `removeAttribute`. The same goes for any “style” attribute, and any “href” attribute whose value starts with “javascript:”.
- When all nodes are looped through, the ones marked for deletion are looped over and deleted (`$node->parentNode->removeChild($node)`). This isn’t done in the first loop because the parser becomes confused if you do that.
- This document is now `saveHTML`ed and returned as a string, now representing a cleaned/secured HTML document/snippet.
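The steps above can be sketched roughly as follows. This is a minimal illustration of the described approach, not the actual function from the question: the name `sanitize_html` and the contents of `ALLOWED_TAGS` are assumptions, since the post does not show its code or list its whitelist.

```php
<?php
// Hypothetical whitelist; the original post does not say which elements it allows.
const ALLOWED_TAGS = ['html', 'head', 'body', 'p', 'br', 'b', 'i', 'em',
                      'strong', 'ul', 'ol', 'li', 'a', 'div', 'span'];

function sanitize_html(string $html): string
{
    $doc = new DOMDocument();

    // Broken markup makes libxml emit warnings; collect them silently instead.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $toDelete = [];

    // Snapshot the live node list so later changes cannot disturb the iteration.
    $nodes = iterator_to_array($doc->getElementsByTagName('*'), false);

    foreach ($nodes as $node) {
        if (!in_array(strtolower($node->nodeName), ALLOWED_TAGS, true)) {
            $toDelete[] = $node;   // mark for deletion, remove after the loop
            continue;
        }

        // Collect attribute names first; removing while iterating the
        // live attribute map can skip entries.
        $names = [];
        foreach ($node->attributes as $attr) {
            $names[] = $attr->nodeName;
        }

        foreach ($names as $name) {
            $lower = strtolower($name);
            $value = $node->getAttribute($name);

            $isHandler = strncmp($lower, 'on', 2) === 0;
            $isStyle   = $lower === 'style';
            // The plain prefix check described in the post.
            $isJsHref  = $lower === 'href' && stripos($value, 'javascript:') === 0;

            if ($isHandler || $isStyle || $isJsHref) {
                $node->removeAttribute($name);
            }
        }
    }

    // Second pass: delete everything that was marked.
    foreach ($toDelete as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    return $doc->saveHTML();
}
```

Note that `loadHTML` called without extra flags wraps a snippet in a doctype and `<html><body>` elements, so the returned string is a full document even when the input was a fragment.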
As far as I can tell, there is no way to abuse this, unless there is some bug in the DOM parser, which is out of my hands (and off my conscience).
But maybe there is another “onsomething” attribute or something else I haven’t thought of?
I feel pretty confident in outputting any HTML from any untrusted external/user-provided source after it’s been mangled through this function of mine, but perhaps I’m being cocky?
(I truly wish that `strip_tags` would do this on its own so that I didn’t have to code my own thing.)
Answer
If you want to prevent XSS, all of the `on*` attributes are candidates for removal. `style` can also carry JavaScript in various ways in some browsers, as can `href` (`javascript:`). SVG can, I think, include scripts, and so on.
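As one concrete illustration of how a check like “`href` starts with `javascript:`” can be sidestepped: browsers strip leading whitespace and control characters from a URL and ignore tabs and newlines inside it, so the values below are still treated as `javascript:` URLs even though a plain prefix test does not match them. The sketch compares that prefix test with a scheme allowlist. Both function names and the list of allowed schemes are assumptions chosen for the example, and the allowlist covers only this one bypass class, not the others mentioned above.

```php
<?php
// The prefix test described in the question (hypothetical name).
function naive_is_js_href(string $value): bool
{
    return stripos($value, 'javascript:') === 0;
}

// A stricter sketch: allow only known schemes instead of blocking one.
// The allowed schemes here are an assumption for illustration.
function is_allowed_href(string $value): bool
{
    // Drop spaces and C0 control characters before looking at the scheme.
    // This is deliberately more aggressive than what browsers do.
    $clean = preg_replace('/[\x00-\x20]+/', '', $value);

    // No scheme at all: treat as a relative URL and allow it.
    if (!preg_match('/^([a-z][a-z0-9+.\-]*):/i', $clean, $m)) {
        return true;
    }

    return in_array(strtolower($m[1]), ['http', 'https', 'mailto'], true);
}

$samples = [
    "javascript:alert(1)",
    " javascript:alert(1)",      // leading space
    "java\tscript:alert(1)",     // embedded tab
    "https://example.com/",
];

foreach ($samples as $s) {
    printf(
        "%-28s naive blocks: %-3s allowlist permits: %s\n",
        json_encode($s),
        naive_is_js_href($s) ? 'yes' : 'no',
        is_allowed_href($s) ? 'yes' : 'no'
    );
}
```

The allowlist direction is the safer default because new or obscure schemes are rejected unless someone explicitly permits them.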
Look here for a non-comprehensive list of how these sanitizers would be bypassed, and why it’s very hard to build a sanitizer yourself.
Why not just use a known-good sanitizer like Google Caja instead of reinventing one? It’s a lot harder than you seem to think.