Share via


How do I escape invalid xml character 0x02?

Question

Wednesday, March 17, 2010 4:16 PM

In one of my tables an end user managed to insert a character 0x02 into a varchar(255) column. At some point in time, from this table xml documents are generated using the for xml path() syntax. For the record with the 0x02 character in it this now generates an error:

".Net SqlClient Data Provider: Msg 6841, Level 16, State 1, Line 22

FOR XML could not serialize the data for node 'v' because it contains a character (0x0002) which is not allowed in XML. To retrieve this data using FOR XML, convert it to binary, varbinary or image data type and use the BINARY BASE64 directive."

I've now  hit this error on char(2), but the same error occurs for a lot more characters.

The example below illustrates this by inserting 2 records in a table and creating an xml document for both records. The record with id = 1 is no problem, a valid xml is generated; even the '&' in it is properly escaped in the xml and translated back into a '&' when I retrieve the varchar value for the element. The record with id = 2 can however not be generated into an xml document.

I would expect an escape sequence like '' to be generated for the char(2) character, analogous like the '&' is escaped as '&'.

use tempdb
go

declare @tbl table (
    id int not null,
    v varchar(255) not null
);

insert @tbl(id, v) 
select 1, 'DECANTER & Sons'
union all
select 2, char(0x02) + 'DECANTER & Sons'

declare @xml xml;

select @xml = (
        select t.v from @tbl t where t.id = 1
        for xml path('xml'), type
    );

select @xml.value('xml[1]/v[1]', 'varchar(255)'), @xml


select @xml = (
        select t.v from @tbl t where t.id = 2
        for xml path('xml'), type
    );

select @xml.value('xml[1]/v[1]', 'varchar(255)'), @xml

I need the actual contents of the column in the xml document or the process depending on the document will not function correctly, so I can not simply replace any invalid characters by an empty string. 

I would expect this output for id = 2:

<xml>

<v>DECANTER & Sons</v>

</xml>

But even trying to store this perfectly valid xml in an xml variable results in the same error!

declare @xml xml;
select @xml = N'
<xml>
    <v>&#02;DECANTER &amp; Sons</v>
</xml>'

Please help me so that I can store all characters that exist in the varchar(255) column in my xml document?

@@version :

Microsoft SQL Server 2005 - 9.00.4266.00 (X64) 

Oct  7 2009 17:38:17 

Copyright (c) 1988-2005 Microsoft Corporation

Standard Edition (64-bit) on Windows NT 5.2 (Build 3790: Service Pack 2)

All replies (10)

Thursday, March 18, 2010 9:19 AM ✅Answered | 1 vote

There was space problem, I have gave you sample query and you can convert to for xml path which i think would not be more complicate for you.
If i get solution i will post .

use tempdb
go

declare @tbl table (
    id int not null,
    v varchar(255) not null
);

insert @tbl(id, v)
--select 1, 'DECANTER & Sons'
--union all
select 2, 'DECANTER & Sons'

declare @xml xml;

select v from @tbl-- for xml auto

SELECT  1 as Tag,
        0 as Parent,
        v as [v!1!!xml] -- no attribute name so ELEMENT assumed*/
FROM   @tbl
FOR XML EXPLICIT, ROOT('xml')


Thursday, March 18, 2010 5:53 AM

Hi,

All text in an XML document will be parsed by the parser.
But text inside a CDATA section will be ignored by the parser.
Characters like "<" and "&" are illegal in XML elements.
"<" will generate an error because the parser interprets it as the start of a new element.
"&" will generate an error because the parser interprets it as the start of an character entity.
Everything inside a CDATA section is ignored by the parser.
A CDATA section starts with "<![CDATA[" and ends with "]]>":

Thanks,
Mayur


Thursday, March 18, 2010 6:08 AM

use tempdb
go

declare @tbl table (
    id int not null,
    v varchar(255) not null
);

insert @tbl(id, v)
--select 1, 'DECANTER & Sons'
--union all
select 2, 'DECANTER & Sons'

declare @xml xml;

select v from @tbl-- for xml auto

SELECT  1 as Tag,
        0 as Parent,
        v 
            as [v!1!!cdata] -- no attribute name so ELEMENT assumed*/
FROM   @tbl
FOR XML EXPLICIT, ROOT('xml')


Thursday, March 18, 2010 8:26 AM

I agree, putting the data in a cdata section is a way to get the values into the xml document. But I need the values in the elements so that the rest of the processes can process the exact same data that is in the table from the xml. So yours is not the answer I'm looking for, MayurGupta, sorry.

Putting a '&' in an element is not valid xml. I know that. That is why we have escape sequences -as I demonstrated above- : Put '&' in your element value and a single '&' will come out when you properly retrieve the value using .value(). The same way: '<' for '<', '>' for '>'. And there is one more escape sequence possible, for example: 'A' for 'A'. 65 being the unicode id for the capital 'A'. This is called a numeric character reference escape (or NCR-escape) by the w3c. In SQLXML this works for all character codes, except for a small set of codes: 0, 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31. These happen to be all the non-printable characters, plus all codes over 55296, but the latter are of less concern to me. 

w3c states at their site: http://www.w3.org/International/questions/qa-escapes: that using '', or '' would be the proper way to encode a string with a char(2) in it.

So in conclusion: In SQLXML I can escape all characters, except for those characters that actually need escaping! I don't think that's right. Does anyone know about this being reported as a bug yet? I my opinion there are 2 bugs:

1 - the xml parser does not allow '' as an escape sequence for char(2), and

2 - the for xml path engine does not generate the escape sequence '' when a char(2) is encountered in a (n)varchar value.

Note: due to the large amount of code on our sql servers, they are still at compatibility level 80, it may be that at level 90 there is no issue at all. I have not been able to test this yet.


Thursday, March 18, 2010 8:34 AM | 1 vote

Ok Try below

use tempdb
go

declare @tbl table (
    id int not null,
    v varchar(255) not null
);

insert @tbl(id, v)
--select 1, 'DECANTER & Sons'
--union all
select 2, 'DECANTER & Sons'

declare @xml xml;

select v from @tbl-- for xml auto

SELECT  1 as Tag,
        0 as Parent,
        v as [v!1!!xml ] -- no attribute name so ELEMENT assumed*/
FROM   @tbl
FOR XML EXPLICIT, ROOT('xml')


Thursday, March 18, 2010 8:47 AM

Thanks for your attention, but that results in another error message:

.Net SqlClient Data Provider: Msg 6824, Level 16, State 1, Line 15

In the FOR XML EXPLICIT clause, mode 'xml ' in a column name is invalid.

*
*

But why do you use 'for xml explicit' in your examples while I am asking for 'for xml path'? Do you suggest that 'for xml explicit' supports features that 'for xml path' does not? Having to rewrite my entire code into for xml explicit will be huge job... (plus that for xml explicit performs much worse).


Thursday, March 18, 2010 9:58 AM

I've added the assignment for setting the value of your query's result into the @xml variable to your example:

use tempdb
go

declare @tbl table (
    id int not null,
    v varchar(255) not null
);

insert @tbl(id, v)
--select 1, 'DECANTER & Sons'
--union all
select 2, '&#02;DECANTER & Sons'

declare @xml xml;

select v from @tbl-- for xml auto

select @xml = (
    SELECT  1 as Tag,
              0 as Parent,
              v as [v!1!!xml] -- no attribute name so ELEMENT assumed*/
    FROM   @tbl
    FOR XML EXPLICIT, ROOT('xml')
);

This outputs again the same 'illegal xml character' problem:

(1 row(s) affected)

(1 row(s) affected)

.Net SqlClient Data Provider: Msg 9420, Level 16, State 1, Line 16

XML parsing: line 1, character 13, illegal xml character

Your example does not hold the same data as mine: your's already has the escape sequence in the data, mine has an actual char(2) in the data. So to get to the situation you describe I would have to first replace all illegal characters by their escape sequence in all affected columns (i.e. all varchar, char, nvarchar, nchar, text and ntext columns). And, even after I would have gone through the pain of converting all illegal characters into their appropriate escape sequences I would still not be able to store/process the resulting xml document.


Thursday, March 18, 2010 11:05 AM | 1 vote

The simplest thing might be to clean your data?


Thursday, March 18, 2010 1:08 PM

Yeah, that's what I did for now (our business critical merge replication got into a replication conflict because of this issue, so I had to act). But what's going to happen next time someone finds a loophole and gets invalid characters in any of the tables? I'd like my applications to be robust enough to handle that and currently sqlxml is the weakest link in this chain.

Thanks MayurGupta for thinking with me (I'll reward you with the answer points for that;)). I'm going to contact MS to see if they have an answer for this issue. Any option I can think of violates either the xml format or makes us loose data. And MSSQL's behavior is clearly wrong in this, I think I have a chance at getting MS to do something about it.


Friday, April 2, 2010 1:20 PM

I reported the issue at Microsoft Connect, Please all follow the link and vote +1.

 

if a simple statement as 

 

select name
from dbo.MyTable
for xml path('client'), type

can fail because of a single character in the table, I think that needs to be fixed a.s.a.p.

http://connect.microsoft.com/SQLServer/feedback/details/547817/xml-type-does-not-accept-all-character-escape-sequences