content/udk/cpp/man/spec/textconversion.html - openoffice-org - Git at Google

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 <HTML>
 <head>
     <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1"/>
     <TITLE>Text Conversion Functions</TITLE>
 <style type="text/css">
 <!--
 h1 { text-align:center; margin-top: 0.2cm; text-decoration: none; color: #ffffff; font-size: 6; margin-top: 0.2cm}
 h2 { margin-top: 0.2cm; margin-bottom=0.1cm; color: #ffffff; background-color: #666699 }
 li {margin-bottom: 0.2cm;}
 dl {margin-bottom: 0.2cm;}
 dd {margin-bottom: 0.2cm;}
 dt {margin-bottom: 0.2cm;}
 -->
 </style>
 </head>
 <body>
 <TABLE WIDTH="100%" BORDER="0" CELLSPACING="0" CELLPADDING="4" bgcolor=#666699
    summary=header>
     <TR>
         <TD>
             <h1> Text Conversion Functions </h1>
 		</td><td>
 			<a href="http://www.openoffice.org"><img src="../../../images/open_office_org_logo.gif" alt="OpenOffice.org" align="right" BORDER="0"/></a>
         </TD>
     </TR>
 </TABLE>

 <P>This text describes the functions <CODE>rtl_convertTextToUnicode()</CODE>
 and <CODE>rtl_convertUnicodeToText()</CODE>, the meaning of all the
 accompanying <CODE>RTL_TEXTTOUNICODE_FLAGS_<VAR>XXX</VAR></CODE>,
 <CODE>RTL_TEXTTOUNICODE_INFO_<VAR>XXX</VAR></CODE>,
 <CODE>RTL_UNICODETOTEXT_FLAGS_<VAR>XXX</VAR></CODE> and
 <CODE>RTL_UNICODETOTEXT_INFO_<VAR>XXX</VAR></CODE> flags, and the conversion
 context conventions.</P>

 <H2>Conversion Context</H2>

 <P>It is valid to pass a null pointer instead of an
 <CODE>rtl_TextToUnicodeContext</CODE> or <CODE>rtl_UnicodeToTextContext</CODE>
 to the conversion functions.  In that case, the functions behave as if they
 received an initial context, as obtained by
 <CODE>rtl_createTextToUnicodeContext()</CODE>,
 <CODE>rtl_resetTextToUnicodeContext()</CODE>,
 <CODE>rtl_createUnicodeToTextContext()</CODE>, or
 <CODE>rtl_resetUnicodeToTextContext()</CODE>, and simply do not return any
 context information (which is effectively lost).  This implies that you should
 always specify the <CODE>FLAGS_FLUSH</CODE> flag when using a null context,
 for otherwise it is not possible in general to find out whether the input
 buffer has been completely converted.</P>

 <H2>Handling of Undefined Codes</H2>

 <P>An <DFN>undefined code</DFN> is any of the following:
 <UL>
     <LI>A code from the source encoding that is valid (see
     <A HREF="#invalid">&ldquo;invalid code&rdquo;</A>), but not (yet) assigned
     a character.  Examples are <CODE>0xA5</CODE> in ISO 8859-3,
     <CODE>0xA2A1</CODE> in EUC-CN, and <CODE>0x167F</CODE> in Unicode.</LI>

     <LI>A code from the source encoding that is assigned a character that
     cannot be mapped to the destination encoding.  Examples are
     <CODE>0x0100</CODE> in Unicode, which cannot be mapped to ISO 8859-1; and
     <CODE>0xA698</CODE> in HangulTalk, which cannot be mapped to
     Unicode.</LI>

     <LI>A code from the source encoding that is reserved for private use, and
     thus cannot be mapped to the destination encoding.  (Even if the
     destination encoding also has private-use codes, a higher-level protocol
     would be needed to map between these private-use areas.)</LI>
 </UL>

 <P>In the text-to-Unicode direction, the conversion functions distinguish
 between single-byte and multi-byte undefined codes (<CODE>0xA5</CODE> in
 ISO 8859-3 and <CODE>0x80</CODE> in GB-18030 are single-byte undefined codes,
 while <CODE>0xA2A1</CODE> in EUC-CN and <CODE>0xFE39FE39</CODE> in GB-18030
 are multi-byte undefined codes.)</P>

 <P>When encountering an undefined code, the conversion functions allow any of
 the following behaviours (which are mutually exclusive):
 <DL>
     <DT><CODE>FLAGS_UNDEFINED_ERROR</CODE></DT>
     <DT><CODE>FLAGS_MBUNDEFINED_ERROR</CODE></DT>
     <DD>Read past the undefined code in the input buffer, set both the
     <CODE>INFO_UNDEFINED</CODE> or <CODE>INFO_MBUNDEFINED</CODE> and the
     <CODE>INFO_ERROR</CODE> flags, and immediately quit the conversion
     (ignoring any <CODE>FLAGS_FLUSH</CODE> flag).</DD>

     <DT><CODE>FLAGS_UNDEFINED_IGNORE</CODE></DT>
     <DT><CODE>FLAGS_MBUNDEFINED_IGNORE</CODE></DT>
     <DD>Read past the undefined code in the input buffer, set the
     <CODE>INFO_UNDEFINED</CODE> or <CODE>INFO_MBUNDEFINED</CODE> flag, and
     continue with the conversion.</DD>

     <DT><CODE>FLAGS_UNDEFINED_MAPTOPRIVATE</CODE></DT>
     <DD>If there is not enough space left in the output buffer,
     <A HREF="#exhausted">act accordingly</A>.  Otherwise, read past the
     undefined code in the input buffer, set the <CODE>INFO_UNDEFINED</CODE>
     flag, write <CODE>U+F1<VAR>xx</VAR></CODE> into the output buffer (where
     <CODE><VAR>xx</VAR></CODE> is the single-byte undefined code), and
     continue with the conversion.</DD>

     <DT><CODE>FLAGS_UNDEFINED_0</CODE></DT>
     <DD>If there is not enough space left in the output buffer,
     <A HREF="#exhausted">act accordingly</A>.  Otherwise, read past the
     undefined code in the input buffer, set the <CODE>INFO_UNDEFINED</CODE>
     flag, write an (appropriately encoded) ASCII <CODE>NUL</CODE> character
     (<CODE>0x00</CODE>) into the output buffer, and continue with the
     conversion.</DD>

     <DT><CODE>FLAGS_UNDEFINED_QUESTIONMARK</CODE></DT>
     <DD>If there is not enough space left in the output buffer,
     <A HREF="#exhausted">act accordingly</A>.  Otherwise, read past the
     undefined code in the input buffer, set the <CODE>INFO_UNDEFINED</CODE>
     flag, write an (appropriately encoded) ASCII &ldquo;<CODE>?</CODE>&rdquo;
     character (<CODE>0x3F</CODE>) into the output buffer, and continue with
     the conversion.</DD>

     <DT><CODE>FLAGS_UNDEFINED_UNDERLINE</CODE></DT>
     <DD>If there is not enough space left in the output buffer,
     <A HREF="#exhausted">act accordingly</A>.  Otherwise, read past the
     undefined code in the input buffer, set the <CODE>INFO_UNDEFINED</CODE>
     flag, write an (appropriately encoded) ASCII &ldquo;<CODE>_</CODE>&rdquo;
     character (<CODE>0x5F</CODE>) into the output buffer, and continue with
     the conversion.</DD>

     <DT><CODE>FLAGS_UNDEFINED_DEFAULT</CODE></DT>
     <DD>If there is not enough space left in the output buffer,
     <A HREF="#exhausted">act accordingly</A>.  Otherwise, read past the
     undefined code in the input buffer, set the <CODE>INFO_UNDEFINED</CODE>
     flag, write some output-encoding&ndash;specific character (currently
     <CODE>U+FFFD</CODE> for Unicode and &ldquo;<CODE>?</CODE>&rdquo; for all
     other encodings) into the output buffer, and continue with the
     conversion.</DD>
 </DL>

 <P>In the Unicode-to-text direction, the conversion functions also allow any
 of the following extra flags (of which an arbitrary number can be specified).
 In all cases, the usual checks for an <A HREF="#exhausted">exhausted output
 buffer</A> are made, and otherwise the <CODE>INFO_UNDEFINED</CODE> flag is
 set.
 <DL>
     <DT><CODE>FLAGS_UNDEFINED_REPLACE</CODE></DT>
     <DD>Some Unicode characters that have no direct mapping to the destination
     encoding are mapped to similar single characters in the destination
     encoding.  For example, <CODE>U+00A0</CODE> <SMALL>(NO-BREAK
     SPACE)</SMALL> could be mapped to <CODE>0x20</CODE> <SMALL>(SPACE)</SMALL>
     in ASCII.  Expect this to be poorly supported by the current
     implementation.</DD>

     <DT><CODE>FLAGS_UNDEFINED_REPLACESTR</CODE></DT>
     <DD>Some Unicode characters that have no direct mapping to the destination
     encoding are mapped to similar strings of characters in the destination
     encoding.  For example, <CODE>U+00A9</CODE> <SMALL>(COPYRIGHT
     SIGN)</SMALL> could be mapped to the three-character string
     &ldquo;<CODE>(C)</CODE>&rdquo; in ASCII.  Expect this to be poorly
     supported by the current implementation.</DD>

     <DT><CODE>FLAGS_PRIVATE_MAPTO0</CODE></DT>
     <DD>Private-use characters (<CODE>U+E000</CODE>&ndash;<CODE>U+F8FF</CODE>,
     <CODE>U+F0000</CODE>&ndash;<CODE>U+FFFFD</CODE>, and
     <CODE>U+100000</CODE>&ndash;<CODE>U+10FFFD</CODE>) are mapped to an
     (appropriately encoded) ASCII <CODE>NUL</CODE> character
     (<CODE>0x00</CODE>) in the output buffer.</DD>

     <DT><CODE>FLAGS_NONSPACING_IGNORE</CODE></DT>
     <DD>Certain non-spacing characters, like <CODE>U+200B</CODE> <SMALL>(ZERO
     WIDTH SPACE)</SMALL> and <CODE>U+FEFF</CODE> <SMALL>(ZERO WIDTH NO-BREAK
     SPACE)</SMALL>, are ignored.  Expect some uncertainty in the current
     implementation as to which characters are affected.</DD>

     <DT><CODE>FLAGS_CONTROL_IGNORE</CODE></DT>
     <DD>Control characters (<CODE>U+0000</CODE>&ndash;<CODE>U+001F</CODE> and
     <CODE>U+007F</CODE>&ndash;<CODE>U+009F</CODE>) are ignored.</DD>

     <DT><CODE>FLAGS_PRIVATE_IGNORE</CODE></DT>
     <dd>Private-use characters (<CODE>U+E000</CODE>&ndash;<CODE>U+F8FF</CODE>,
     <CODE>U+F0000</CODE>&ndash;<CODE>U+FFFFD</CODE>, and
     <CODE>U+100000</CODE>&ndash;<CODE>U+10FFFD</CODE>) are ignored.</DD>
 </DL>

 <P>There is also a <CODE>FLAGS_NOCOMPOSITE</CODE> flag, of which I am not sure
 what it should be used for.</P>

 <H2>Handling of Invalid Codes</H2>

 <P>An <A NAME="invalid"><DFN>invalid code</DFN></A> is a string of one or more
 units in the input buffer that is not valid according to the input encoding:
 <UL>
     <LI>It is not valid because it may never appear in the input encoding
     (e.g., <CODE>0x80</CODE> in ASCII, or <CODE>0xFF</CODE> in GB-18030).</LI>

     <LI>It is not valid because it is only the prefix of a valid string of
     units, with further units missing (e.g., the single high-surrogate
     <CODE>0xD800</CODE> in Unicode, with a following low-surrogate missing, or
     <CODE>0xA1</CODE> in EUC-CN, with a second byte in the range
     <CODE>0xA1</CODE>&ndash;<CODE>0xFE</CODE> missing).</LI>
 </UL>

 <P>Invalid codes of the second category (that are potentially prefixes of
 valid strings) are handled specially at the end of the input buffer.  If the
 <CODE>FLAGS_FLUSH</CODE> flag is specified, they are handled like all other
 invalid codes.  Otherwise, the <CODE>INFO_SRCBUFFERTOSMALL</CODE> flag is set
 to indicate that the input buffer possibly ended in the middle of an input
 character (and the prefix is either not yet read, or is stored in the
 conversion context, or is partly read and partly stored in the conversion
 context).</P>

 <P>When encountering an invalid code (other than the special cases at the end
 of the input buffer), the conversion functions allow any of the following
 behaviours (which are mutually exclusive):
 <DL>
     <DT><CODE>FLAGS_INVALID_ERROR</CODE></DT>
     <DD>Read past the invalid code in the input buffer, set both the
     <CODE>INFO_INVALID</CODE> and the <CODE>INFO_ERROR</CODE> flags, and
     immediately quit the conversion (ignoring any <CODE>FLAGS_FLUSH</CODE>
     flag).</DD>

     <DT><CODE>FLAGS_INVALID_IGNORE</CODE></DT>
     <DD>Read past the invalid code in the input buffer, set the
     <CODE>INFO_INVALID</CODE> flag, and continue with the conversion.</DD>

     <DT><CODE>FLAGS_INVALID_0</CODE></DT>
     <DD>If there is not enough space left in the output buffer,
     <A HREF="#exhausted">act accordingly</A>.  Otherwise, read past the
     invalid code in the input buffer, set the <CODE>INFO_INVALID</CODE> flag,
     write an (appropriately encoded) ASCII <CODE>NUL</CODE> character
     (<CODE>0x00</CODE>) into the output buffer, and continue with the
     conversion.</DD>

     <DT><CODE>FLAGS_INVALID_QUESTIONMARK</CODE></DT>
     <DD>If there is not enough space left in the output buffer,
     <A HREF="#exhausted">act accordingly</A>.  Otherwise, read past the
     invalid code in the input buffer, set the <CODE>INFO_INVALID</CODE> flag,
     write an (appropriately encoded) ASCII &ldquo;<CODE>?</CODE>&rdquo;
     character (<CODE>0x3F</CODE>) into the output buffer, and continue with
     the conversion.</DD>

     <DT><CODE>FLAGS_INVALID_UNDERLINE</CODE></DT>
     <DD>If there is not enough space left in the output buffer,
     <A HREF="#exhausted">act accordingly</A>.  Otherwise, read past the
     invalid code in the input buffer, set the <CODE>INFO_INVALID</CODE> flag,
     write an (appropriately encoded) ASCII &ldquo;<CODE>_</CODE>&rdquo;
     character (<CODE>0x5F</CODE>) into the output buffer, and continue with
     the conversion.</DD>

     <DT><CODE>FLAGS_INVALID_DEFAULT</CODE></DT>
     <DD>If there is not enough space left in the output buffer,
     <A HREF="#exhausted">act accordingly</A>.  Otherwise, read past the
     invalid code in the input buffer, set the <CODE>INFO_INVALID</CODE> flag,
     write some output-encoding&ndash;specific character (currently
     <CODE>U+FFFD</CODE> for Unicode and &ldquo;<CODE>?</CODE>&rdquo; for all
     other encodings) into the output buffer, and continue with the
     conversion.</DD>
 </DL>

 <H2><A NAME="exhausted">Handling of Destination Buffer Exhaustion</A></H2>

 <P>If, in the course of conversion, there is not enough space left in the
 output buffer (either for a normal character mapping or for a special mapping
 of undefined or invalid codes), the <CODE>INFO_DESTBUFFERTOSMALL</CODE> flag
 is set, and the conversion is quit immediately (ignoring any
 <CODE>FLAGS_FLUSH</CODE> flag).  It is unspecified whether the input units
 that would overflow the output buffer are already read (and stored in the
 conversion context) or not, but the number of processed input buffer units
 returned by the conversion function will be correct in either case.</P>

 <TABLE WIDTH="100%" BORDER="0" CELLSPACING="0" CELLPADDING="4" summary=footer>
     <TR>
         <TD BGCOLOR="#666699">
             <FONT COLOR="White">Author: <A HREF="mailto:stephan.bergmann@sun.com"><FONT COLOR="White">Stephan Bergmann</FONT></A> (Last modification $Date: 2004/12/08 14:22:01 $).<br/>
 			Copyright 2001 <A HREF="http://www.openoffice.org"><FONT COLOR="White">OpenOffice.org</FONT></A> Foundation.  All Rights Reserved.</FONT>
         </TD>
     </TR>
 </TABLE>
 </body>
 </HTML>
	<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
	<HTML>
	<head>
	<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1"/>
	<TITLE>Text Conversion Functions</TITLE>
	<style type="text/css">
	<!--
	h1 { text-align:center; margin-top: 0.2cm; text-decoration: none; color: #ffffff; font-size: 6; margin-top: 0.2cm}
	h2 { margin-top: 0.2cm; margin-bottom=0.1cm; color: #ffffff; background-color: #666699 }
	li {margin-bottom: 0.2cm;}
	dl {margin-bottom: 0.2cm;}
	dd {margin-bottom: 0.2cm;}
	dt {margin-bottom: 0.2cm;}
	-->
	</style>
	</head>
	<body>
	<TABLE WIDTH="100%" BORDER="0" CELLSPACING="0" CELLPADDING="4" bgcolor=#666699
	summary=header>
	<TR>
	<TD>
	<h1> Text Conversion Functions </h1>
	</td><td>
	<a href="http://www.openoffice.org"><img src="../../../images/open_office_org_logo.gif" alt="OpenOffice.org" align="right" BORDER="0"/></a>
	</TD>
	</TR>
	</TABLE>

	<P>This text describes the functions <CODE>rtl_convertTextToUnicode()</CODE>
	and <CODE>rtl_convertUnicodeToText()</CODE>, the meaning of all the
	accompanying <CODE>RTL_TEXTTOUNICODE_FLAGS_<VAR>XXX</VAR></CODE>,
	<CODE>RTL_TEXTTOUNICODE_INFO_<VAR>XXX</VAR></CODE>,
	<CODE>RTL_UNICODETOTEXT_FLAGS_<VAR>XXX</VAR></CODE> and
	<CODE>RTL_UNICODETOTEXT_INFO_<VAR>XXX</VAR></CODE> flags, and the conversion
	context conventions.</P>

	<H2>Conversion Context</H2>

	<P>It is valid to pass a null pointer instead of an
	<CODE>rtl_TextToUnicodeContext</CODE> or <CODE>rtl_UnicodeToTextContext</CODE>
	to the conversion functions. In that case, the functions behave as if they
	received an initial context, as obtained by
	<CODE>rtl_createTextToUnicodeContext()</CODE>,
	<CODE>rtl_resetTextToUnicodeContext()</CODE>,
	<CODE>rtl_createUnicodeToTextContext()</CODE>, or
	<CODE>rtl_resetUnicodeToTextContext()</CODE>, and simply do not return any
	context information (which is effectively lost). This implies that you should
	always specify the <CODE>FLAGS_FLUSH</CODE> flag when using a null context,
	for otherwise it is not possible in general to find out whether the input
	buffer has been completely converted.</P>

	<H2>Handling of Undefined Codes</H2>

	<P>An <DFN>undefined code</DFN> is any of the following:
	<UL>
	<LI>A code from the source encoding that is valid (see
	<A HREF="#invalid">“invalid code”</A>), but not (yet) assigned
	a character. Examples are <CODE>0xA5</CODE> in ISO 8859-3,
	<CODE>0xA2A1</CODE> in EUC-CN, and <CODE>0x167F</CODE> in Unicode.</LI>

	<LI>A code from the source encoding that is assigned a character that
	cannot be mapped to the destination encoding. Examples are
	<CODE>0x0100</CODE> in Unicode, which cannot be mapped to ISO 8859-1; and
	<CODE>0xA698</CODE> in HangulTalk, which cannot be mapped to
	Unicode.</LI>

	<LI>A code from the source encoding that is reserved for private use, and
	thus cannot be mapped to the destination encoding. (Even if the
	destination encoding also has private-use codes, a higher-level protocol
	would be needed to map between these private-use areas.)</LI>
	</UL>

	<P>In the text-to-Unicode direction, the conversion functions distinguish
	between single-byte and multi-byte undefined codes (<CODE>0xA5</CODE> in
	ISO 8859-3 and <CODE>0x80</CODE> in GB-18030 are single-byte undefined codes,
	while <CODE>0xA2A1</CODE> in EUC-CN and <CODE>0xFE39FE39</CODE> in GB-18030
	are multi-byte undefined codes.)</P>

	<P>When encountering an undefined code, the conversion functions allow any of
	the following behaviours (which are mutually exclusive):
	<DL>
	<DT><CODE>FLAGS_UNDEFINED_ERROR</CODE></DT>
	<DT><CODE>FLAGS_MBUNDEFINED_ERROR</CODE></DT>
	<DD>Read past the undefined code in the input buffer, set both the
	<CODE>INFO_UNDEFINED</CODE> or <CODE>INFO_MBUNDEFINED</CODE> and the
	<CODE>INFO_ERROR</CODE> flags, and immediately quit the conversion
	(ignoring any <CODE>FLAGS_FLUSH</CODE> flag).</DD>

	<DT><CODE>FLAGS_UNDEFINED_IGNORE</CODE></DT>
	<DT><CODE>FLAGS_MBUNDEFINED_IGNORE</CODE></DT>
	<DD>Read past the undefined code in the input buffer, set the
	<CODE>INFO_UNDEFINED</CODE> or <CODE>INFO_MBUNDEFINED</CODE> flag, and
	continue with the conversion.</DD>

	<DT><CODE>FLAGS_UNDEFINED_MAPTOPRIVATE</CODE></DT>
	<DD>If there is not enough space left in the output buffer,
	<A HREF="#exhausted">act accordingly</A>. Otherwise, read past the
	undefined code in the input buffer, set the <CODE>INFO_UNDEFINED</CODE>
	flag, write <CODE>U+F1<VAR>xx</VAR></CODE> into the output buffer (where
	<CODE><VAR>xx</VAR></CODE> is the single-byte undefined code), and
	continue with the conversion.</DD>

	<DT><CODE>FLAGS_UNDEFINED_0</CODE></DT>
	<DD>If there is not enough space left in the output buffer,
	<A HREF="#exhausted">act accordingly</A>. Otherwise, read past the
	undefined code in the input buffer, set the <CODE>INFO_UNDEFINED</CODE>
	flag, write an (appropriately encoded) ASCII <CODE>NUL</CODE> character
	(<CODE>0x00</CODE>) into the output buffer, and continue with the
	conversion.</DD>

	<DT><CODE>FLAGS_UNDEFINED_QUESTIONMARK</CODE></DT>
	<DD>If there is not enough space left in the output buffer,
	<A HREF="#exhausted">act accordingly</A>. Otherwise, read past the
	undefined code in the input buffer, set the <CODE>INFO_UNDEFINED</CODE>
	flag, write an (appropriately encoded) ASCII “<CODE>?</CODE>”
	character (<CODE>0x3F</CODE>) into the output buffer, and continue with
	the conversion.</DD>

	<DT><CODE>FLAGS_UNDEFINED_UNDERLINE</CODE></DT>
	<DD>If there is not enough space left in the output buffer,
	<A HREF="#exhausted">act accordingly</A>. Otherwise, read past the
	undefined code in the input buffer, set the <CODE>INFO_UNDEFINED</CODE>
	flag, write an (appropriately encoded) ASCII “<CODE>_</CODE>”
	character (<CODE>0x5F</CODE>) into the output buffer, and continue with
	the conversion.</DD>

	<DT><CODE>FLAGS_UNDEFINED_DEFAULT</CODE></DT>
	<DD>If there is not enough space left in the output buffer,
	<A HREF="#exhausted">act accordingly</A>. Otherwise, read past the
	undefined code in the input buffer, set the <CODE>INFO_UNDEFINED</CODE>
	flag, write some output-encoding–specific character (currently
	<CODE>U+FFFD</CODE> for Unicode and “<CODE>?</CODE>” for all
	other encodings) into the output buffer, and continue with the
	conversion.</DD>
	</DL>

	<P>In the Unicode-to-text direction, the conversion functions also allow any
	of the following extra flags (of which an arbitrary number can be specified).
	In all cases, the usual checks for an <A HREF="#exhausted">exhausted output
	buffer</A> are made, and otherwise the <CODE>INFO_UNDEFINED</CODE> flag is
	set.
	<DL>
	<DT><CODE>FLAGS_UNDEFINED_REPLACE</CODE></DT>
	<DD>Some Unicode characters that have no direct mapping to the destination
	encoding are mapped to similar single characters in the destination
	encoding. For example, <CODE>U+00A0</CODE> <SMALL>(NO-BREAK
	SPACE)</SMALL> could be mapped to <CODE>0x20</CODE> <SMALL>(SPACE)</SMALL>
	in ASCII. Expect this to be poorly supported by the current
	implementation.</DD>

	<DT><CODE>FLAGS_UNDEFINED_REPLACESTR</CODE></DT>
	<DD>Some Unicode characters that have no direct mapping to the destination
	encoding are mapped to similar strings of characters in the destination
	encoding. For example, <CODE>U+00A9</CODE> <SMALL>(COPYRIGHT
	SIGN)</SMALL> could be mapped to the three-character string
	“<CODE>(C)</CODE>” in ASCII. Expect this to be poorly
	supported by the current implementation.</DD>

	<DT><CODE>FLAGS_PRIVATE_MAPTO0</CODE></DT>
	<DD>Private-use characters (<CODE>U+E000</CODE>–<CODE>U+F8FF</CODE>,
	<CODE>U+F0000</CODE>–<CODE>U+FFFFD</CODE>, and
	<CODE>U+100000</CODE>–<CODE>U+10FFFD</CODE>) are mapped to an
	(appropriately encoded) ASCII <CODE>NUL</CODE> character
	(<CODE>0x00</CODE>) in the output buffer.</DD>

	<DT><CODE>FLAGS_NONSPACING_IGNORE</CODE></DT>
	<DD>Certain non-spacing characters, like <CODE>U+200B</CODE> <SMALL>(ZERO
	WIDTH SPACE)</SMALL> and <CODE>U+FEFF</CODE> <SMALL>(ZERO WIDTH NO-BREAK
	SPACE)</SMALL>, are ignored. Expect some uncertainty in the current
	implementation as to which characters are affected.</DD>

	<DT><CODE>FLAGS_CONTROL_IGNORE</CODE></DT>
	<DD>Control characters (<CODE>U+0000</CODE>–<CODE>U+001F</CODE> and
	<CODE>U+007F</CODE>–<CODE>U+009F</CODE>) are ignored.</DD>

	<DT><CODE>FLAGS_PRIVATE_IGNORE</CODE></DT>
	<dd>Private-use characters (<CODE>U+E000</CODE>–<CODE>U+F8FF</CODE>,
	<CODE>U+F0000</CODE>–<CODE>U+FFFFD</CODE>, and
	<CODE>U+100000</CODE>–<CODE>U+10FFFD</CODE>) are ignored.</DD>
	</DL>

	<P>There is also a <CODE>FLAGS_NOCOMPOSITE</CODE> flag, of which I am not sure
	what it should be used for.</P>

	<H2>Handling of Invalid Codes</H2>

	<P>An <A NAME="invalid"><DFN>invalid code</DFN></A> is a string of one or more
	units in the input buffer that is not valid according to the input encoding:
	<UL>
	<LI>It is not valid because it may never appear in the input encoding
	(e.g., <CODE>0x80</CODE> in ASCII, or <CODE>0xFF</CODE> in GB-18030).</LI>

	<LI>It is not valid because it is only the prefix of a valid string of
	units, with further units missing (e.g., the single high-surrogate
	<CODE>0xD800</CODE> in Unicode, with a following low-surrogate missing, or
	<CODE>0xA1</CODE> in EUC-CN, with a second byte in the range
	<CODE>0xA1</CODE>–<CODE>0xFE</CODE> missing).</LI>
	</UL>

	<P>Invalid codes of the second category (that are potentially prefixes of
	valid strings) are handled specially at the end of the input buffer. If the
	<CODE>FLAGS_FLUSH</CODE> flag is specified, they are handled like all other
	invalid codes. Otherwise, the <CODE>INFO_SRCBUFFERTOSMALL</CODE> flag is set
	to indicate that the input buffer possibly ended in the middle of an input
	character (and the prefix is either not yet read, or is stored in the
	conversion context, or is partly read and partly stored in the conversion
	context).</P>

	<P>When encountering an invalid code (other than the special cases at the end
	of the input buffer), the conversion functions allow any of the following
	behaviours (which are mutually exclusive):
	<DL>
	<DT><CODE>FLAGS_INVALID_ERROR</CODE></DT>
	<DD>Read past the invalid code in the input buffer, set both the
	<CODE>INFO_INVALID</CODE> and the <CODE>INFO_ERROR</CODE> flags, and
	immediately quit the conversion (ignoring any <CODE>FLAGS_FLUSH</CODE>
	flag).</DD>

	<DT><CODE>FLAGS_INVALID_IGNORE</CODE></DT>
	<DD>Read past the invalid code in the input buffer, set the
	<CODE>INFO_INVALID</CODE> flag, and continue with the conversion.</DD>

	<DT><CODE>FLAGS_INVALID_0</CODE></DT>
	<DD>If there is not enough space left in the output buffer,
	<A HREF="#exhausted">act accordingly</A>. Otherwise, read past the
	invalid code in the input buffer, set the <CODE>INFO_INVALID</CODE> flag,
	write an (appropriately encoded) ASCII <CODE>NUL</CODE> character
	(<CODE>0x00</CODE>) into the output buffer, and continue with the
	conversion.</DD>

	<DT><CODE>FLAGS_INVALID_QUESTIONMARK</CODE></DT>
	<DD>If there is not enough space left in the output buffer,
	<A HREF="#exhausted">act accordingly</A>. Otherwise, read past the
	invalid code in the input buffer, set the <CODE>INFO_INVALID</CODE> flag,
	write an (appropriately encoded) ASCII “<CODE>?</CODE>”
	character (<CODE>0x3F</CODE>) into the output buffer, and continue with
	the conversion.</DD>

	<DT><CODE>FLAGS_INVALID_UNDERLINE</CODE></DT>
	<DD>If there is not enough space left in the output buffer,
	<A HREF="#exhausted">act accordingly</A>. Otherwise, read past the
	invalid code in the input buffer, set the <CODE>INFO_INVALID</CODE> flag,
	write an (appropriately encoded) ASCII “<CODE>_</CODE>”
	character (<CODE>0x5F</CODE>) into the output buffer, and continue with
	the conversion.</DD>

	<DT><CODE>FLAGS_INVALID_DEFAULT</CODE></DT>
	<DD>If there is not enough space left in the output buffer,
	<A HREF="#exhausted">act accordingly</A>. Otherwise, read past the
	invalid code in the input buffer, set the <CODE>INFO_INVALID</CODE> flag,
	write some output-encoding–specific character (currently
	<CODE>U+FFFD</CODE> for Unicode and “<CODE>?</CODE>” for all
	other encodings) into the output buffer, and continue with the
	conversion.</DD>
	</DL>

	<H2><A NAME="exhausted">Handling of Destination Buffer Exhaustion</A></H2>

	<P>If, in the course of conversion, there is not enough space left in the
	output buffer (either for a normal character mapping or for a special mapping
	of undefined or invalid codes), the <CODE>INFO_DESTBUFFERTOSMALL</CODE> flag
	is set, and the conversion is quit immediately (ignoring any
	<CODE>FLAGS_FLUSH</CODE> flag). It is unspecified whether the input units
	that would overflow the output buffer are already read (and stored in the
	conversion context) or not, but the number of processed input buffer units
	returned by the conversion function will be correct in either case.</P>

	<TABLE WIDTH="100%" BORDER="0" CELLSPACING="0" CELLPADDING="4" summary=footer>
	<TR>
	<TD BGCOLOR="#666699">
	<FONT COLOR="White">Author: <A HREF="mailto:stephan.bergmann@sun.com"><FONT COLOR="White">Stephan Bergmann</FONT></A> (Last modification $Date: 2004/12/08 14:22:01 $).<br/>
	Copyright 2001 <A HREF="http://www.openoffice.org"><FONT COLOR="White">OpenOffice.org</FONT></A> Foundation. All Rights Reserved.</FONT>
	</TD>
	</TR>
	</TABLE>
	</body>
	</HTML>