c-dev@xerces.apache.org

Subject: [jira] Commented: (XERCESC-1846) We should not let the ICU use a replacement character that we know will result in a document that's not well-formed.
From: "Jesse Pelton (JIRA)"
Date: Wed, 17 Dec 2008 05:13:44 -0800 PST
    [ 
https://issues.apache.org/jira/browse/XERCESC-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657377#action_12657377
 ] 

Jesse Pelton commented on XERCESC-1846:
---------------------------------------

If this is indeed a bug, it affects more than the ICU transcoder. The following 
Xerces transcoder classes hard-code the use of 0x1A as a replacement character:

 XML88591Transcoder
 XML88591Transcoder390
 XMLASCIITranscoder
 XMLASCIITranscoder390

The following classes use Windows' WideCharToMultiByte function, specifying 
that the system default value should be used as a replacement character. I'm 
not sure whether there's any guarantee that this character is legal in an XML 
document.

 Win32Transcoder
 CygwinTranscoder

I don't know enough to determine whether there's any danger that the following 
classes might cause an illegal character to be included in a serialized 
document:

 MacOSTranscoder
 MacOSLCPTranscoder

> We should not let the ICU use a replacement character that we know will 
> result in a document that's not well-formed.
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: XERCESC-1846
>                 URL: https://issues.apache.org/jira/browse/XERCESC-1846
>             Project: Xerces-C++
>          Issue Type: Bug
>          Components: Non-Validating Parser, Utilities
>         Environment: ICU 4.0, 3.8
> Xerces 3.0, 2.8. Probably a common problem in earlier versions too.
> HP-UX B.11.23 U 9000/800
> aCC compiler, cc
>            Reporter: Jan Suchy
>
> Jan Suchý wrote:
> > Hi Jesse,
> > thank you for your answer and ideas.
> > I have found one kind of solution: patching the transcoder wrapper class in
> > src\xercesc\util\Transcoders\ICU\ICUTransService.cpp
> >
> > by adding these lines to the ICUTranscoder::ICUTranscoder constructor:
> >
> > UErrorCode uerr = U_ZERO_ERROR;
> > ucnv_setSubstChars(toAdopt, "?", 1, &uerr);
> > ...
> >
> > Then the "?" character is used as the replacement character when using ICU.
> > This is an ICU-specific solution and is not clean, because it requires
> > rebuilding the Xerces library. I would like to see some kind of switch on
> > the XMLFormatter class, but the formatter does not know which ICU UConverter
> > will be used, because nothing indicates which transcoder will be called
> > later.
> Please create a Jira issue because this is a bug. We should not let the
> ICU use a replacement character that we know will result in a document
> that's not well-formed.
> Dave
> HISTORY:
> ------------------------------------------------------------------------------------------------------------------------
> Hi Jesse,
> thank you for your answer and ideas.
> I have found one kind of solution: patching the transcoder wrapper class in
> src\xercesc\util\Transcoders\ICU\ICUTransService.cpp
> by adding these lines to the ICUTranscoder::ICUTranscoder constructor:
>     UErrorCode uerr = U_ZERO_ERROR;
>     ucnv_setSubstChars(toAdopt, "?", 1, &uerr);
> ...
> Then the "?" character is used as the replacement character when using ICU.
> This is an ICU-specific solution and is not clean, because it requires
> rebuilding the Xerces library. I would like to see some kind of switch on
> the XMLFormatter class, but the formatter does not know which ICU UConverter
> will be used, because nothing indicates which transcoder will be called
> later.
> Not optimal, but works.
> thank you,
> jan
> > ------------ Original message ------------
> > From: Jesse Pelton <jsp@xxxxxxx>
> > Subject: RE: xerces/ICU unicode alias for weak encoding when
> > serializing/converting to CP
> > Date: 16.12.2008 15:31:01
> > ----------------------------------------
> > I'm not an expert on this area, but the transcoders included with Xerces do
> > not provide any way to specify the replacement character, and ICU may be the
> > same. Even if ICU gives you a way to do so, I'm not sure how you'd get access
> > to a transcoder instance to alter.
> >
> > Note, though, that 0x1A is not a legal character in an XML document.
> > (Oracle's parser is correct in rejecting it.) I think it's safe to assume in
> > your scenario that any such characters in a serialized document are
> > replacements for unrepresentable characters. You should therefore be able to
> > post-process the serialization output and replace 0x1A with one or more
> > characters of your choosing. If you don't want to post-process the whole
> > document, you could derive an XMLFormatTarget that replaces the replacement
> > character in each chunk of data handed to it. Neither option is exactly
> > elegant, but I'd probably do the latter; it'll work regardless of your format
> > target, where the former approach requires serializing to memory.
> >
> >
> > -----Original Message-----
> > From: Jan Suchý [mailto:zuchy@xxxxxxx]
> > Sent: Tuesday, December 16, 2008 5:37 AM
> > To: c-users@xxxxxxxxxxxxxxxxx
> > Subject: RE: xerces/ICU unicode alias for weak encoding when
> > serializing/converting to CP
> >
> > Hello again,
> > I have tried to use the class:
> >
> > http://xerces.apache.org/xerces-c/apiDocs-2/classXMLFormatter.html#_details
> >
> > with the flags NoEscapes and UnRep_Replace,
> >
> > and the problematic character was replaced by:
> > ^Z
> >
> > But this still does not solve the problem: the Oracle DB XML parser cannot
> > parse this XML. I get this error:
> >
> > ORA-31011: XML parsing failed
> > ORA-19202: Error occurred in XML processing
> > LPX-00216: invalid character 26 (0x1A)
> > Error at line 22
> >
> > I would like to replace the unrepresentable character with a character of my
> > own choosing that will be parseable (for example "?" or "_").
> > How can I change the replacement character that is used by default?
> >
> > Thanks to anybody for any ideas.
> >
> > Have a nice day,
> > Jan
> >
> >
> > > ------------ Original message ------------
> > > From: Jan Suchý <zuchy@xxxxxxx>
> > > Subject: RE: xerces/ICU unicode alias for weak encoding when
> > > serializing/converting to CP
> > > Date: 16.12.2008 09:35:40
> > > ----------------------------------------
> > > Hello Jesse,
> > > thank you for your answer :-) it seems to be promising. I'll look at it.
> > > Jan
> > >
> > >
> > > > ------------ Original message ------------
> > > > From: Jesse Pelton <jsp@xxxxxxx>
> > > > Subject: RE: xerces/ICU unicode alias for weak encoding when
> > > > serializing/converting to CP
> > > > Date: 15.12.2008 18:15:49
> > > > ----------------------------------------
> > > > The constructors for the Xerces XMLFormatter object all take an UnRepFlags
> > > > argument that allows you to specify how to handle unrepresentable
> > > > characters. So does XMLFormatter::formatBuf(). It appears that the
> > > > transcoder gets to decide what character to replace unrepresentable
> > > > characters with.
> > > >
> > > > Hope that helps.
> > > >
> > > > -----Original Message-----
> > > > From: Jan Suchý [mailto:zuchy@xxxxxxx]
> > > > Sent: Monday, December 15, 2008 4:25 AM
> > > > To: c-users@xxxxxxxxxxxxxxxxx
> > > > Subject: xerces/ICU unicode alias for weak encoding when
> > > > serializing/converting to CP
> > > >
> > > > Hello all,
> > > > I need to produce output XML in iso-8859-2 encoding.
> > > > I am using UTF-8 as the input encoding.
> > > > The UTF-8 XML contains a character that is not representable in
> > > > iso-8859-2.
> > > > I am using ICU 3.8, Xerces 2.8 and XQilla svn 702.
> > > >
> > > > After serializing the XML to iso-8859-2, the problematic character is
> > > > serialized by ICU/Xerces/XQilla as:
> > > >
> > > > &#x2013;
> > > >
> > > > The problem is that if I send a message in iso-8859-2 containing the
> > > > character &#x2013; to Oracle DB, the Oracle parser does not like this
> > > > character and this error is returned:
> > > >
> > > > ORA-31011: XML parsing failed, LPX-00217: invalid character 8211 (U+2013)
> > > >
> > > > So what I am looking for is some way to tell ICU, Xerces or XQilla
> > > > that the Unicode character must not be included in the result and must
> > > > instead be replaced, for example by the character "?", so that the
> > > > Oracle parser never has to process it.
> > > >
> > > > I would like to find a clean solution, such as telling ICU not to call
> > > > the callback function, or defining my own alias or behavior for this
> > > > situation. Is it possible?
> > > > Any ideas?
> > > > Thank you
> > > > Jan Suchy
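
The chunk-filtering workaround suggested in the quoted thread (replace 0x1A in each chunk of serialized data) might look roughly like the sketch below. replaceSubstituteBytes is a hypothetical helper, not part of Xerces-C++. It assumes a single-byte output encoding such as iso-8859-2, and it assumes, as the thread does, that every 0x1A byte in the output is a substitute for an unrepresentable character. In a multi-byte encoding such as UTF-16 a 0x1A byte can be part of a legitimate code unit, so this would not be safe there.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copies one serialized chunk, replacing every 0x1A byte
// with `replacement`. Assumes a single-byte output encoding (see note above).
inline std::string replaceSubstituteBytes(const unsigned char* chunk,
                                          std::size_t count,
                                          char replacement = '?')
{
    assert(chunk != nullptr || count == 0);

    std::string out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
    {
        // 0x1A is the substitute character reported in this thread.
        out.push_back(chunk[i] == 0x1A ? replacement
                                       : static_cast<char>(chunk[i]));
    }
    return out;
}
```

In a class derived from XMLFormatTarget, a function like this could presumably be applied to each buffer received in writeChars() before forwarding the result to the real target. That wiring is not shown here because it depends on the Xerces version in use.
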

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.