MS SQL generated XML – cleanup and sanitize

posted by on 2015.02.22, under sql
22:

All recent versions of MS SQL are able to generate the XML representation of a SQL query result on the fly:

>select * from dbo.foo
id  foo
1  dead
2  beef

>select * from dbo.foo for XML RAW, ELEMENTS

<row>
  <id>1</id>
  <foo>dead</foo>
</row>
<row>
  <id>2</id>
  <foo>beef</foo>
</row>

This is nothing new and special (in the age of NO-SQL and document centric databases) and works well. If you go to use the SQL server as a data source in a business integration orchestration the XML representation of the SQL results turns the mapping of the data between different entities into a point and click game…

While playing with the example databases of Microsoft NAV I recognised some funny things.

UTF-16 conversion of column names

Some smart guy decided to use spaces and braces within the column names. To turn these column names into valid XML element names Microsoft decided to convert these characters into their UTF-16 representation surrounded by underscores. By applying this rule a column named ‚Ali Baba‘ will turn into ‚Ali_x0020_Baba‘. It turned out that the conversation is applied to nearly every special character used within column names:

<test.dbo.foo xmlns:colon="uri">
  <space_x0020_sign>1</space_x0020_sign>
  <exclamation_x0021_sign>1</exclamation_x0021_sign>
  <doublequote_x0022_sign>1</doublequote_x0022_sign>
  <hash_x0023_sign>1</hash_x0023_sign>
  <dollar_x0024_sign>1</dollar_x0024_sign>
  <percent_x0025_sign>1</percent_x0025_sign>
  <amp_x0026_sign>1</amp_x0026_sign>
  <singlequote_x0027_sign>1</singlequote_x0027_sign>
  <openparentheses_x0028_sign>1</openparentheses_x0028_sign>
  <closingparentheses_x0029_sign>1</closingparentheses_x0029_sign>
  <mul_x002A_sign>1</mul_x002A_sign>
  <plus_x002B_sign>1</plus_x002B_sign>
  <comma_x002C_sign>1</comma_x002C_sign>
  <minus-sign>1</minus-sign>
  <dot.sign>1</dot.sign>
  <slash_x002F_sign>1</slash_x002F_sign>
  <colon:sign>1</colon:sign>
  <semicolon_x003B_sign>1</semicolon_x003B_sign>
  <lesserthen_x003C_sign>1</lesserthen_x003C_sign>
  <equal_x003D_sign>1</equal_x003D_sign>
  <biggerthe_x003E_sign>1</biggerthe_x003E_sign>
  <questionmark_x003F_sign>1</questionmark_x003F_sign>
  <at_x0040_sign>1</at_x0040_sign>
  <backslash_x005C_sign>1</backslash_x005C_sign>
  <caret_x005E_sign>1</caret_x005E_sign>
  <underscore_sign>1</underscore_sign>
  <pipe_x007C_sign>1</pipe_x007C_sign>
  <opencurly_x007B_sign>1</opencurly_x007B_sign>
  <closingcurly_x007D_sign>1</closingcurly_x007D_sign>
  <tilde_x007E_sign>1</tilde_x007E_sign>
</test.dbo.foo>

This is a minor problem cause it just creates ugly element names. If the data is only exported (and no ‚translation‘ back in the database is needed), these encoded entities can be replaced with anything looking … better – like the underscore. This is just for cosmetic purposes – nothing a machine will take a notice of…

IF OBJECT_ID (N'dbo.cleanXMLElements', N'FN') IS NOT NULL
    DROP FUNCTION dbo.cleanXMLElements
GO

-- replace any occurence of _x00<VALUE@REPLACE>_ with the string given by the second argument
-- warning: this is a pure pattern matching - if the sequence _x00<VALUE@REPLACE>_ is part of your payload - good luck!
-- note: to increase performance - remove any values from @REPLACE not needed
-- param_0: xml document
-- param_1: replacement string
-- return: cleaned xml document
CREATE FUNCTION dbo.cleanXMLElements (@XMLDoc XML, @REPLACE_WIDTH VARCHAR(10))
RETURNS xml
AS
BEGIN
  DECLARE @XMLString NVARCHAR(MAX)
  DECLARE @REPLACE VARCHAR(100)
  -- hex-values of elements that should be replaced, match is done on _x00<value>_
  SET @REPLACE = '20,21,22,23,24,25,26,27,28,29,2a,2b,2c,2f,3b,3c,3d,3e,3f,40,5c,5e,7c,7b,7d,7e'
  SET @XMLString = CAST(@XMLDoc as NVARCHAR(MAX))

  WHILE LEN(@REPLACE) > 0
  BEGIN
    SET @XMLString = REPLACE(@XMLString, '_x00' + LEFT(@REPLACE, CHARINDEX(',', @REPLACE + ',')-1) + '_', @REPLACE_WIDTH)
    SET @REPLACE = STUFF(@REPLACE, 1, CHARINDEX(',', @REPLACE + ','), '')
  END
  RETURN CAST(@XMLString as XML)
END
GO

XML name spaces

If a column name contains a colon the XML export will fail:

>SELECT * FROM dbo.foo for XML AUTO, ELEMENTS, TYPE

Msg 6846, Level 16, State 1, Line 4
XML name space prefix 'colon' declaration is missing for FOR XML column name 'colon:sign'.

In XML the construct A:B defines B as an element living in the name space of A (here: sign is part of the name space colon). Since the generated XML representation lacks the needed name space definition so the document is not well formed…

Unhappily it is very likely that the colon is part of the values contained in the query/document so the simple approach to just replace/strip all colons is very error prone. It is also not possible to generate the name space definition on the fly (eg. by parsing the table definition). The only remaining solution for that kind of problem is to add the needed name space definition manually:

WITH XMLNAMESPACES ('uri' as colon)
SELECT * FROM dbo.foo for XML AUTO, ELEMENTS, TYPE

This eliminates the option to export large databases automatically – if there is a colon in a column name. Of course its still possible to generate the needed code…

invalid characters in XML document

Back to the NAV example database… I tried to insert the rows of the Item table as XML into another table (trigger based logging) – and the command failed caused by some illegal characters contained into the values of some fields (example):

USE test
GO

IF OBJECT_ID (N'dbo.log', N'U') IS NOT NULL
    DROP TABLE dbo.log
GO

CREATE TABLE dbo.log (
  id int IDENTITY(1,1),
  tab_name VARCHAR(20),
  data XML
);
GO
insert into dbo.log values('Item', (SELECT * FROM [Demo Database NAV (8-0)].[dbo].[CRONUS AG$Item] for xml raw, elements, BINARY BASE64))

Msg 9420, Level 16, State 1, Line 10
XML parsing: line 1, character 5458, illegal xml character

I dumped the XML and found &#x04 at the position given in the error message. The ASCII code 0x04 represents the special character EOT – End Of Transmission (I have no idea how the guy that created the example table was able to put that special char into the table data). Since 0x04 is not allowed in XML documents the conversion fails. Maybe I’m to demanding – but why the hell is the freaking tool that converts the SQL result to its XML representation not able to handle/filter that kind of special char? Hu? Have a look at http://www.w3.org/TR/xml/#charsets to get an idea what is allowed – and what not.

To workaround this misbehaviour just remove these illegal characters from the XML document before any further processing. Since the generated XML is not valid it is not possible to handover the converted SQL result directly to the sanitizer function so an explicit cast to nvarchar is needed here.

USE test
GO

IF OBJECT_ID (N'dbo.sanitizeXML', N'FN') IS NOT NULL
    DROP FUNCTION dbo.sanitizeXML
GO

-- remove any occurrence of &#x<VALUE@REMOVE>;
-- warning: this is a pure pattern matching - if the sequence &#x<VALUE@REMOVE> is part of your payload - good luck!
-- note: to increase performance - remove any values from @REMOVE not needed
-- param_0: text to be sanitized
-- return: sanitized string
--CREATE FUNCTION dbo.sanitizeXML (@XMLDoc XML)
CREATE FUNCTION dbo.sanitizeXML (@XMLString NVARCHAR(max))
RETURNS XML
AS
BEGIN
  --DECLARE @XMLString NVARCHAR(MAX)
  DECLARE @REMOVE VARCHAR(100)
  -- hex-values of elements that should be removed, match is done on &#x<value>;
  -- 0x05 -> ENQ
  -- 0x04 -> EOT
  -- full set: 00,01,02,03,04,05,06,07,08,0b,0c,0e,0f,10,11,12,13,14,15,16,17,18,19,1a,1b,1c,1d,1e,1f
  SET @REMOVE = '04,05'

  WHILE LEN(@REMOVE) > 0
  BEGIN
    SET @XMLString = REPLACE(@XMLString, '&#x' + LEFT(@REMOVE, CHARINDEX(',', @REMOVE + ',') - 1) + ';', '')
    SET @REMOVE = STUFF(@REMOVE, 1, CHARINDEX(',', @REMOVE + ','), '')
  END
  RETURN CAST(@XMLString as XML)
END
GO

insert into dbo.log (tab_name, data) values('Item', (select test.dbo.sanitizeXML(CAST((SELECT * FROM [Demo Database NAV (8-0)].[dbo].[CRONUS AG$Item] for xml raw, elements, BINARY BASE64) AS NVARCHAR(MAX)))))

For performance reasons the sanitizeXML currently only handles the special chars I have to deal with. In theory every char from 0x00 up to 0x1f (except 0x09, 0x0a, 0x0d) must be removed.

Scripts and examples: sanitize_and_cleanup_xml_sql.zip

pagetop