MS SQL generated XML – cleanup and sanitize
All recent versions of MS SQL are able to generate the XML representation of a SQL query result on the fly:
>select * from dbo.foo id foo 1 dead 2 beef >select * from dbo.foo for XML RAW, ELEMENTS <row> <id>1</id> <foo>dead</foo> </row> <row> <id>2</id> <foo>beef</foo> </row>
This is nothing new and special (in the age of NO-SQL and document centric databases) and works well. If you go to use the SQL server as a data source in a business integration orchestration the XML representation of the SQL results turns the mapping of the data between different entities into a point and click game…
While playing with the example databases of Microsoft NAV I recognised some funny things.
UTF-16 conversion of column names
Some smart guy decided to use spaces and braces within the column names. To turn these column names into valid XML element names Microsoft decided to convert these characters into their UTF-16 representation surrounded by underscores. By applying this rule a column named ‚Ali Baba‘ will turn into ‚Ali_x0020_Baba‘. It turned out that the conversation is applied to nearly every special character used within column names:
<test.dbo.foo xmlns:colon="uri"> <space_x0020_sign>1</space_x0020_sign> <exclamation_x0021_sign>1</exclamation_x0021_sign> <doublequote_x0022_sign>1</doublequote_x0022_sign> <hash_x0023_sign>1</hash_x0023_sign> <dollar_x0024_sign>1</dollar_x0024_sign> <percent_x0025_sign>1</percent_x0025_sign> <amp_x0026_sign>1</amp_x0026_sign> <singlequote_x0027_sign>1</singlequote_x0027_sign> <openparentheses_x0028_sign>1</openparentheses_x0028_sign> <closingparentheses_x0029_sign>1</closingparentheses_x0029_sign> <mul_x002A_sign>1</mul_x002A_sign> <plus_x002B_sign>1</plus_x002B_sign> <comma_x002C_sign>1</comma_x002C_sign> <minus-sign>1</minus-sign> <dot.sign>1</dot.sign> <slash_x002F_sign>1</slash_x002F_sign> <colon:sign>1</colon:sign> <semicolon_x003B_sign>1</semicolon_x003B_sign> <lesserthen_x003C_sign>1</lesserthen_x003C_sign> <equal_x003D_sign>1</equal_x003D_sign> <biggerthe_x003E_sign>1</biggerthe_x003E_sign> <questionmark_x003F_sign>1</questionmark_x003F_sign> <at_x0040_sign>1</at_x0040_sign> <backslash_x005C_sign>1</backslash_x005C_sign> <caret_x005E_sign>1</caret_x005E_sign> <underscore_sign>1</underscore_sign> <pipe_x007C_sign>1</pipe_x007C_sign> <opencurly_x007B_sign>1</opencurly_x007B_sign> <closingcurly_x007D_sign>1</closingcurly_x007D_sign> <tilde_x007E_sign>1</tilde_x007E_sign> </test.dbo.foo>
This is a minor problem cause it just creates ugly element names. If the data is only exported (and no ‚translation‘ back in the database is needed), these encoded entities can be replaced with anything looking … better – like the underscore. This is just for cosmetic purposes – nothing a machine will take a notice of…
IF OBJECT_ID (N'dbo.cleanXMLElements', N'FN') IS NOT NULL DROP FUNCTION dbo.cleanXMLElements GO -- replace any occurence of _x00<VALUE@REPLACE>_ with the string given by the second argument -- warning: this is a pure pattern matching - if the sequence _x00<VALUE@REPLACE>_ is part of your payload - good luck! -- note: to increase performance - remove any values from @REPLACE not needed -- param_0: xml document -- param_1: replacement string -- return: cleaned xml document CREATE FUNCTION dbo.cleanXMLElements (@XMLDoc XML, @REPLACE_WIDTH VARCHAR(10)) RETURNS xml AS BEGIN DECLARE @XMLString NVARCHAR(MAX) DECLARE @REPLACE VARCHAR(100) -- hex-values of elements that should be replaced, match is done on _x00<value>_ SET @REPLACE = '20,21,22,23,24,25,26,27,28,29,2a,2b,2c,2f,3b,3c,3d,3e,3f,40,5c,5e,7c,7b,7d,7e' SET @XMLString = CAST(@XMLDoc as NVARCHAR(MAX)) WHILE LEN(@REPLACE) > 0 BEGIN SET @XMLString = REPLACE(@XMLString, '_x00' + LEFT(@REPLACE, CHARINDEX(',', @REPLACE + ',')-1) + '_', @REPLACE_WIDTH) SET @REPLACE = STUFF(@REPLACE, 1, CHARINDEX(',', @REPLACE + ','), '') END RETURN CAST(@XMLString as XML) END GO
XML name spaces
If a column name contains a colon the XML export will fail:
>SELECT * FROM dbo.foo for XML AUTO, ELEMENTS, TYPE Msg 6846, Level 16, State 1, Line 4 XML name space prefix 'colon' declaration is missing for FOR XML column name 'colon:sign'.
In XML the construct A:B defines B as an element living in the name space of A (here: sign is part of the name space colon). Since the generated XML representation lacks the needed name space definition so the document is not well formed…
Unhappily it is very likely that the colon is part of the values contained in the query/document so the simple approach to just replace/strip all colons is very error prone. It is also not possible to generate the name space definition on the fly (eg. by parsing the table definition). The only remaining solution for that kind of problem is to add the needed name space definition manually:
WITH XMLNAMESPACES ('uri' as colon) SELECT * FROM dbo.foo for XML AUTO, ELEMENTS, TYPE
This eliminates the option to export large databases automatically – if there is a colon in a column name. Of course its still possible to generate the needed code…
invalid characters in XML document
Back to the NAV example database… I tried to insert the rows of the Item table as XML into another table (trigger based logging) – and the command failed caused by some illegal characters contained into the values of some fields (example):
USE test GO IF OBJECT_ID (N'dbo.log', N'U') IS NOT NULL DROP TABLE dbo.log GO CREATE TABLE dbo.log ( id int IDENTITY(1,1), tab_name VARCHAR(20), data XML ); GO insert into dbo.log values('Item', (SELECT * FROM [Demo Database NAV (8-0)].[dbo].[CRONUS AG$Item] for xml raw, elements, BINARY BASE64)) Msg 9420, Level 16, State 1, Line 10 XML parsing: line 1, character 5458, illegal xml character
I dumped the XML and found  at the position given in the error message. The ASCII code 0x04 represents the special character EOT – End Of Transmission (I have no idea how the guy that created the example table was able to put that special char into the table data). Since 0x04 is not allowed in XML documents the conversion fails. Maybe I’m to demanding – but why the hell is the freaking tool that converts the SQL result to its XML representation not able to handle/filter that kind of special char? Hu? Have a look at http://www.w3.org/TR/xml/#charsets to get an idea what is allowed – and what not.
To workaround this misbehaviour just remove these illegal characters from the XML document before any further processing. Since the generated XML is not valid it is not possible to handover the converted SQL result directly to the sanitizer function so an explicit cast to nvarchar is needed here.
USE test GO IF OBJECT_ID (N'dbo.sanitizeXML', N'FN') IS NOT NULL DROP FUNCTION dbo.sanitizeXML GO -- remove any occurrence of &#x<VALUE@REMOVE>; -- warning: this is a pure pattern matching - if the sequence &#x<VALUE@REMOVE> is part of your payload - good luck! -- note: to increase performance - remove any values from @REMOVE not needed -- param_0: text to be sanitized -- return: sanitized string --CREATE FUNCTION dbo.sanitizeXML (@XMLDoc XML) CREATE FUNCTION dbo.sanitizeXML (@XMLString NVARCHAR(max)) RETURNS XML AS BEGIN --DECLARE @XMLString NVARCHAR(MAX) DECLARE @REMOVE VARCHAR(100) -- hex-values of elements that should be removed, match is done on &#x<value>; -- 0x05 -> ENQ -- 0x04 -> EOT -- full set: 00,01,02,03,04,05,06,07,08,0b,0c,0e,0f,10,11,12,13,14,15,16,17,18,19,1a,1b,1c,1d,1e,1f SET @REMOVE = '04,05' WHILE LEN(@REMOVE) > 0 BEGIN SET @XMLString = REPLACE(@XMLString, '&#x' + LEFT(@REMOVE, CHARINDEX(',', @REMOVE + ',') - 1) + ';', '') SET @REMOVE = STUFF(@REMOVE, 1, CHARINDEX(',', @REMOVE + ','), '') END RETURN CAST(@XMLString as XML) END GO insert into dbo.log (tab_name, data) values('Item', (select test.dbo.sanitizeXML(CAST((SELECT * FROM [Demo Database NAV (8-0)].[dbo].[CRONUS AG$Item] for xml raw, elements, BINARY BASE64) AS NVARCHAR(MAX)))))
For performance reasons the sanitizeXML currently only handles the special chars I have to deal with. In theory every char from 0x00 up to 0x1f (except 0x09, 0x0a, 0x0d) must be removed.
Scripts and examples: sanitize_and_cleanup_xml_sql.zip