diff options
Diffstat (limited to 'intl/icu/source/extra/uconv/uconv.1.in')
-rw-r--r-- | intl/icu/source/extra/uconv/uconv.1.in | 445 |
1 files changed, 445 insertions, 0 deletions
diff --git a/intl/icu/source/extra/uconv/uconv.1.in b/intl/icu/source/extra/uconv/uconv.1.in new file mode 100644 index 000000000..3636025aa --- /dev/null +++ b/intl/icu/source/extra/uconv/uconv.1.in @@ -0,0 +1,445 @@ +.\" Hey, Emacs! This is -*-nroff-*- you know... +.\" +.\" uconv.1: manual page for the uconv utility. +.\" +.\" Copyright (C) 2016 and later: Unicode, Inc. and others. +.\" License & terms of use: http://www.unicode.org/copyright.html +.\" Copyright (C) 2000-2013 IBM, Inc. and others. +.\" +.\" Manual page by Yves Arrouye <yves@realnames.com>. +.\" +.TH UCONV 1 "2005-jul-1" "ICU MANPAGE" "ICU @VERSION@ Manual" +.SH NAME +.B uconv +\- convert data from one encoding to another +.SH SYNOPSIS +.B uconv +[ +.BR "\-h\fP, \fB\-?\fP, \fB\-\-help" +] +[ +.BI "\-V\fP, \fB\-\-version" +] +[ +.BI "\-s\fP, \fB\-\-silent" +] +[ +.BI "\-v\fP, \fB\-\-verbose" +] +[ +.BI "\-l\fP, \fB\-\-list" +| +.BI "\-l\fP, \fB\-\-list\-code" " code" +| +.BI "\-\-default-code" +| +.BI "\-L\fP, \fB\-\-list\-transliterators" +] +[ +.BI "\-\-canon" +] +[ +.BI "\-x" " transliteration +] +[ +.BI "\-\-to\-callback" " callback" +| +.B "\-c" +] +[ +.BI "\-\-from\-callback" " callback" +| +.B "\-i" +] +[ +.BI "\-\-callback" " callback" +] +[ +.BI "\-\-fallback" +| +.BI "\-\-no\-fallback" +] +[ +.BI "\-b\fP, \fB\-\-block\-size" " size" +] +[ +.BI "\-f\fP, \fB\-\-from\-code" " encoding" +] +[ +.BI "\-t\fP, \fB\-\-to\-code" " encoding" +] +[ +.BI "\-\-add\-signature" +] +[ +.BI "\-\-remove\-signature" +] +[ +.BI "\-o\fP, \fB\-\-output" " file" +] +[ +.IR file .\|.\|. +] +.SH DESCRIPTION +.B uconv +converts, or transcodes, each given +.I file +(or its standard input if no +.I file +is specified) from one +.I encoding +to another. +The transcoding is done using Unicode as a pivot encoding +(i.e. the data are first transcoded from their original encoding to +Unicode, and then from Unicode to the destination encoding). +.PP +If an +.I encoding +is not specified or is +.BR - , +the default encoding is used. Thus, calling +.B uconv +with no +.I encoding +provides an easy way to validate and sanitize data files for +further consumption by tools requiring data in the default encoding. +.PP +When calling +.BR uconv , +it is possible to specify callbacks that are used to handle invalid +characters in the input, or characters that cannot be transcoded to +the destination encoding. Some encodings, for example, offer a default +substitution character that can be used to represent the occurence of +such characters in the input. Other callbacks offer a useful visual +representation of the invalid data. +.PP +.B uconv +can also run the specified +.IR transliteration +on the transcoded data, +in which case transliteration will happen as an intermediate step, +after the data have been transcoded to Unicode. +The +.I transliteration +can be either a list of semicolon-separated transliterator names, +or an arbitrarily complex set of rules in the ICU transliteration +rules format. +.PP +For transcoding purposes, +.B uconv +options are compatible with those of +.BR iconv (1), +making it easy to replace it in scripts. It is not necessarily the case, +however, that the encoding names used by +.B uconv +and ICU are the same as the ones used by +.BR iconv (1). +Also, options that provide informational data, such as the +.B \-l\fP, \fB\-\-list +one offered by some +.BR iconv (1) +variants such as GNU's, produce data in a slightly different and +easier to parse format. +.SH OPTIONS +.TP +.BR "\-h\fP, \fB\-?\fP, \fB\-\-help" +Print help about usage and exit. +.TP +.BR "\-V\fP, \fB\-\-version" +Print the version of +.B uconv +and exit. +.TP +.BI "\-s\fP, \fB\-\-silent" +Suppress messages during execution. +.TP +.BI "\-v\fP, \fB\-\-verbose" +Display extra informative messages during execution. +.TP +.BI "\-l\fP, \fB\-\-list" +List all the available encodings and exit. +.TP +.BI "\-l\fP, \fB\-\-list\-code" " code" +List only the +.I code +encoding and exit. If +.I code +is not a proper encoding, exit with an error. +.TP +.BI "\-\-default-code" +List only the name of the default encoding and exit. +.TP +.BI "\-L\fP, \fB\-\-list\-transliterators" +List all the available transliterators and exit. +.TP +.BI "\--canon" +If used with +.BI "\-l\fP, \fB\-\-list" +or +.BR "\-\-default-code" , +the list of encodings is produced in a format compatible with +.BR convrtrs.txt (5). +If used with +.BR "\-L\fP, \fB\-\-list\-transliterators" , +print only one transliterator name per line. +.TP +.BI "\-x" " transliteration" +Run the given +.IR transliteration +on the transcoded Unicode data, +and use the transliterated data as input for the transcoding to +the the destination encoding. +.TP +.BI "\-\-to\-callback" " callback" +Use +.I callback +to handle characters that cannot be transcoded to the destination +encoding. See section +.B CALLBACKS +for details on valid callbacks. +.TP +.B "\-c" +Omit invalid characters from the output. +Same as +.BR "\-\-to\-callback skip" . +.TP +.BI "\-\-from\-callback" " callback" +Use +.I callback +to handle characters that cannot be transcoded from the original +encoding. See section +.B CALLBACKS +for details on valid callbacks. +.TP +.B "\-i" +Ignore invalid sequences in the input. +Same as +.BR "\-\-from\-callback skip" . +.TP +.BI "\-\-callback" " callback" +Use +.I callback +to handle both characters that cannot be transcoded from the original +encoding and characters that cannot be transcoded to the destination +encoding. See section +.B CALLBACKS +for details on valid callbacks. +.TP +.BI "\-\-fallback" +Use the fallback mapping when transcoding from +Unicode to the destination encoding. +.TP +.BI "\-\-no\-fallback" +Do not use the fallback mapping when transcoding from Unicode to the +destination encoding. +This is the default. +.TP +.BI "\-b\fP, \fB\-\-block\-size" " size" +Read input in blocks of +.I size +bytes at a time. The default block size is +4096. +.TP +.BI "\-f\fP, \fB\-\-from\-code" " encoding" +Set the original encoding of the data to +.IR encoding . +.TP +.BI "\-t\fP, \fB\-\-to\-code" " encoding" +Transcode the data to +.IR encoding . +.TP +.BI "\-\-add\-signature" +Add a U+FEFF Unicode signature character (BOM) if the output charset +supports it and does not add one anyway. +.TP +.BI "\-\-remove\-signature" +Remove a U+FEFF Unicode signature character (BOM). +.TP +.BI "\-o\fP, \fB\-\-output" " file" +Write the transcoded data to +.IR file . +.SH CALLBACKS +.B uconv +supports specifying callbacks to handle invalid data. Callbacks can be +set for both directions of transcoding: from the original encoding to +Unicode, with the +.BR "\-\-from\-callback" +option, and from Unicode to the destination encoding, with the +.BR "\-\-to\-callback" +option. +.PP +The following is a list of valid +.I callback +names, along with a description of their behavior. The list of +callbacks actually supported by +.B uconv +is displayed when it is called with +.BR "\-h\fP, \fB\-\-help" . +.PP +.TP \w'\fBescape-unicode'u+3n +.B substitute +Write the the encoding's substitute sequence, or the Unicode +replacement character +.B U+FFFD +when transcoding to Unicode. +.TP +.B skip +Ignore the invalid data. +.TP +.B stop +Stop with an error when encountering invalid data. +This is the default callback. +.TP +.B escape +Same as +.BR escape-icu . +.TP +.B escape-icu +Replace the missing characters with a string of the format +.BR %U\fIhhhh\fP +for plane 0 characters, and +.BR %U\fIhhhh\fP%U\fIhhhh\fP +for planes 1 and above characters, +where +.I hhhh +is the hexadecimal value of one of the UTF-16 code units representing the +character. Characters from planes 1 and above are written as a pair of +UTF-16 surrogate code units. +.TP +.B escape-java +Replace the missing characters with a string of the format +.BR \eu\fIhhhh\fP +for plane 0 characters, and +.BR \eu\fIhhhh\fP\eu\fIhhhh\fP +for planes 1 and above characters, +where +.I hhhh +is the hexadecimal value of one of the UTF-16 code units representing the +character. Characters from planes 1 and above are written as a pair of +UTF-16 surrogate code units. +.TP +.B escape-c +Replace the missing characters with a string of the format +.BR \eu\fIhhhh\fP +for plane 0 characters, and +.BR \eU\fIhhhhhhhh\fP +for planes 1 and above characters, +where +.I hhhh +and +.I hhhhhhhh +are the hexadecimal values of the Unicode codepoint. +.TP +.B escape-xml +Same as +.BR escape-xml-hex . +.TP +.B escape-xml-hex +Replace the missing characters with a string of the format +.BR &#x\fIhhhh\fP; , +where +.I hhhh +is the hexadecimal value of the Unicode codepoint. +.TP +.B escape-xml-dec +Replace the missing characters with a string of the format +.BR &#\fInnnn\fP; , +where +.I nnnn +is the decimal value of the Unicode codepoint. +.TP +.B escape-unicode +Replace the missing characters with a string of the format +.BR {U+\fIhhhh\fP} , +where +.I hhhh +is the hexadecimal value of the Unicode codepoint. +That hexadecimal string is of variable length and can use from 4 to +6 digits. +This is the format universally used to denote a Unicode codepoint in +the litterature, delimited by curly braces for easy recognition of those +substitutions in the output. +.SH EXAMPLES +Convert data from a given +.I encoding +to the platform encoding: + +.RS 4 +.B \fR$ \fPuconv \-f \fIencoding\fP +.RE +.PP +Check if a +.I file +contains valid data for a given +.IR encoding : + +.RS 4 +.B \fR$ \fPuconv \-f \fIencoding\fP \-c \fIfile\fP >/dev/null +.RE +.PP +Convert a UTF-8 +.I file +to a given +.I encoding +and ensure that the resulting text is good for any version of HTML: + +.RS 4 +.B \fR$ \fPuconv \-f utf-8 \-t \fIencoding\fP \e +.br +.B " \-\-callback escape-xml-dec \fIfile\fP" +.RE +.PP +Display the names of the Unicode code points in a UTF-file: + +.RS 4 +.B \fR$ \fPuconv \-f utf-8 \-x any-name \fIfile\fP +.RE +.PP +Print the name of a Unicode code point whose value is known (\fBU+30AB\fP +in this example): + +.RS 4 +.B \fR$ \fPecho '\eu30ab' | uconv \-x 'hex-any; any-name'; echo +.br +{KATAKANA LETTER KA}{LINE FEED} +.br +$ +.RE + +(The names are delimited by curly braces. +Also, the name of the line terminator is also displayed.) +.PP +Normalize UTF-8 data using Unicode NFKC, remove all control characters, +and map Katakana to Hiragana: + +.RS 4 +.B \fR$ \fPuconv \-f utf-8 \-t utf-8 \e +.br +.B " \-x '::nfkc; [:Cc:] >; ::katakana-hiragana;'" +.SH CAVEATS AND BUGS +.B uconv +does report errors as occuring at the first invalid byte +encountered. This may be confusing to users of GNU +.BR iconv (1), +which reports errors as occuring at the first byte of an invalid +sequence. For multi-byte character sets or encodings, this means that +.BR uconv +error positions may be at a later offset in the input stream than +would be the case with GNU +.BR iconv (1). +.PP +The reporting of error positions when a transliterator is used may be +inaccurate or unavailable, in which case +.BR uconv +will report the offset in the output stream at which the error +occured. +.SH AUTHORS +Jonas Utterstroem +.br +Yves Arrouye +.SH VERSION +@VERSION@ +.SH COPYRIGHT +Copyright (C) 2000-2005 IBM, Inc. and others. +.SH SEE ALSO +.BR iconv (1) |