Invalid characters in XML tag name

We have a huge database where users can create custom fields, and any UTF-8 character is allowed in a field name. Until a few weeks ago, the only invalid characters that showed up in field names when users exported their data as XML were the slash (/) and whitespace, and we replaced those with underscores.

Now I see that some users who need an XML export are using characters such as * and ! in their field names. So if a field is named, for example, invalid*name! instead of valid_name, the export script breaks.

Part of code used for defining tag name:

JavaScript

Sample of valid XML:

JavaScript

Users don’t need to see characters like ! or * in their element names. I need to know which characters aren’t allowed in an element name so that I can replace them, probably with an underscore. I am also open to better suggestions than replacing them with an underscore.

Answer

@Quentin suggests the better way. Using dynamic node names means that you cannot define an XSD/Schema, so your XML files can only be checked for well-formedness and you will not be able to make full use of validators. A <field name="..."/> element is therefore the better solution from a machine-readability and maintenance point of view.
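As a rough illustration of the <field name="..."/> approach, here is a minimal sketch. The helper names `escapeXml` and `fieldElement` are made up for this example and are not taken from the question's code. It only escapes the five predefined XML entities and does not filter control characters that XML forbids outright.

```javascript
// Hypothetical helper: escape the five predefined XML entities.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical helper: keep the element name fixed and carry the
// user-defined field name in an attribute, where * and ! are legal.
function fieldElement(name, value) {
  return `<field name="${escapeXml(name)}">${escapeXml(value)}</field>`;
}

console.log(fieldElement('invalid*name!', 'a < b'));
// prints <field name="invalid*name!">a &lt; b</field>
```

With this shape the original field name survives the export unchanged, and a schema only has to describe a single field element.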

However, NCNames (non-colonized names) allow quite a lot of characters. Here is what I implemented in my library for converting JSON.

$nameStartChar defines letters and several Unicode ranges. $nameChar adds some more characters to that definition (like the digits).

The first RegExp removes every character that is NOT a name character. The second removes any leading characters that are NOT defined in $nameStartChar. If the result is empty, a default name is returned.
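The two steps described above can be sketched in JavaScript roughly as follows. This is not the library's code. The function name `toNCName` and the `'_'` fallback are assumptions for illustration, and the character ranges are copied from the NameStartChar and NameChar productions of the XML 1.0 specification with the colon left out.

```javascript
// NameStartChar from XML 1.0, without ':' (which is what makes it an NCName).
const nameStartChar =
  'A-Z_a-z\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2FF}\\u{370}-\\u{37D}' +
  '\\u{37F}-\\u{1FFF}\\u{200C}-\\u{200D}\\u{2070}-\\u{218F}\\u{2C00}-\\u{2FEF}' +
  '\\u{3001}-\\u{D7FF}\\u{F900}-\\u{FDCF}\\u{FDF0}-\\u{FFFD}\\u{10000}-\\u{EFFFF}';

// NameChar adds '-', '.', digits and a few combining characters.
const nameChar =
  nameStartChar + '\\-.0-9\\u{B7}\\u{300}-\\u{36F}\\u{203F}-\\u{2040}';

// Hypothetical helper; the fallback name '_' is an assumption.
function toNCName(name, fallback = '_') {
  const cleaned = String(name)
    // Step 1: drop every character that is not a name character.
    .replace(new RegExp(`[^${nameChar}]`, 'gu'), '')
    // Step 2: drop leading characters that may not start a name.
    .replace(new RegExp(`^[^${nameStartChar}]+`, 'u'), '');
  return cleaned === '' ? fallback : cleaned;
}

console.log(toNCName('invalid*name!')); // prints invalidname
console.log(toNCName('123abc'));        // prints abc
console.log(toNCName('***'));           // prints _
```

If you would rather replace than remove, passing `'_'` as the replacement in step 1 should work too, since the underscore is a valid start character. Note that two different field names can collapse to the same tag name either way.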

JavaScript

A qualified XML node name can consist of two NCNames separated by ‘:’. The first part is the namespace prefix.
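To show how a qualified name splits into its two parts, here is a small sketch. Both function names are hypothetical, and `asciiNCName` is a deliberately simplified ASCII-only stand-in for a full NCName sanitizer, used only to keep the example short.

```javascript
// Simplified ASCII-only stand-in for an NCName sanitizer (assumption for brevity).
function asciiNCName(part, fallback = '_') {
  const cleaned = part
    .replace(/[^A-Za-z0-9_.-]/g, '') // drop anything that is not a name character
    .replace(/^[^A-Za-z_]+/, '');    // drop leading digits, dots and hyphens
  return cleaned === '' ? fallback : cleaned;
}

// Hypothetical helper: treat the first ':' as the prefix separator and
// sanitize the prefix and the local name independently.
function toQName(name) {
  const index = name.indexOf(':');
  if (index === -1) return asciiNCName(name);
  const prefix = asciiNCName(name.slice(0, index));
  const local = asciiNCName(name.slice(index + 1));
  return `${prefix}:${local}`;
}

console.log(toQName('my ns:field*name')); // prints myns:fieldname
console.log(toQName('a:b:c'));            // prints a:bc
console.log(toQName('1field'));           // prints field
```

Keep in mind that a prefix is only meaningful if it is bound to a namespace in the document, so for user-defined field names it is usually safer to strip colons entirely.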

JavaScript

Output:

JavaScript
User contributions licensed under: CC BY-SA