#
# $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
#
CHARACTER DATA
==============
This package generates some data files that contain character properties useful
for text processing.
CHARACTER PROPERTIES
====================
The first data file is called "ctype.dat" and contains a compressed form of
the character properties found in the Unicode Character Database (UCDB).
Additional properties can be specified in limited UCDB format in another file
to avoid modifying the original UCDB.
The following is a property name and code table to be used with the character
data:
NAME CODE DESCRIPTION
---------------------
Mn   0    Mark, Non-Spacing
Mc   1    Mark, Spacing Combining
Me   2    Mark, Enclosing
Nd   3    Number, Decimal Digit
Nl   4    Number, Letter
No   5    Number, Other
Zs   6    Separator, Space
Zl   7    Separator, Line
Zp   8    Separator, Paragraph
Cc   9    Other, Control
Cf   10   Other, Format
Cs   11   Other, Surrogate
Co   12   Other, Private Use
Cn   13   Other, Not Assigned
Lu   14   Letter, Uppercase
Ll   15   Letter, Lowercase
Lt   16   Letter, Titlecase
Lm   17   Letter, Modifier
Lo   18   Letter, Other
Pc   19   Punctuation, Connector
Pd   20   Punctuation, Dash
Ps   21   Punctuation, Open
Pe   22   Punctuation, Close
Po   23   Punctuation, Other
Sm   24   Symbol, Math
Sc   25   Symbol, Currency
Sk   26   Symbol, Modifier
So   27   Symbol, Other
L    28   Left-To-Right
R    29   Right-To-Left
EN   30   European Number
ES   31   European Number Separator
ET   32   European Number Terminator
AN   33   Arabic Number
CS   34   Common Number Separator
B    35   Block Separator
S    36   Segment Separator
WS   37   Whitespace
ON   38   Other Neutrals
Pi   47   Punctuation, Initial
Pf   48   Punctuation, Final
#
# Implementation specific properties.
#
Cm   39   Composite
Nb   40   Non-Breaking
Sy   41   Symmetric (characters which are part of open/close pairs)
Hd   42   Hex Digit
Qm   43   Quote Mark
Mr   44   Mirroring
Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
Cp   46   Defined character
The actual binary data is formatted as follows:
Assumptions: unsigned short is at least 16 bits in size and unsigned long
is at least 32 bits in size.
  unsigned short ByteOrderMark
  unsigned short OffsetArraySize
  unsigned long  Bytes
  unsigned short Offsets[OffsetArraySize + 1]
  unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]
The Bytes field provides the total byte count used for the Offsets[] and
Ranges[] arrays. The Offsets[] array is aligned on a 4-byte boundary and
there is always one extra node on the end to hold the final index of the
Ranges[] array. The Ranges[] array contains pairs of 4-byte values
representing a range of Unicode characters. The pairs are arranged in
increasing order by the first character code in the range.
Determining whether a character has a particular property requires a simple
binary search over the ranges recorded for that property.
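The search described above might be sketched in C as follows. This is only
an illustration: the function name prop_lookup and its parameters are
hypothetical, fixed-width types stand in for the 16-bit and 32-bit fields,
and the sketch assumes that Offsets[p] and Offsets[p + 1] bracket the
Ranges[] values belonging to property code p (real data files may mark
unused properties differently).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns 1 if `code` lies inside one of the ranges recorded for the
 * property with table code `prop`, 0 otherwise.
 * Assumption: offsets[prop] is the index in ranges[] of the first value
 * for the property and offsets[prop + 1] is one past its last value, so
 * a property with no ranges has two equal offsets.
 */
int prop_lookup(uint32_t code, unsigned prop,
                const uint16_t *offsets, unsigned offset_array_size,
                const uint32_t *ranges)
{
    size_t lo, hi;

    assert(offsets != NULL && ranges != NULL);
    if (prop >= offset_array_size)
        return 0;

    /* Work in units of pairs: pair i is ranges[2*i] .. ranges[2*i + 1]. */
    lo = offsets[prop] / 2;
    hi = offsets[prop + 1] / 2;          /* exclusive upper bound */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (code < ranges[2 * mid])
            hi = mid;                    /* code is below this range */
        else if (code > ranges[2 * mid + 1])
            lo = mid + 1;                /* code is above this range */
        else
            return 1;                    /* start <= code <= end */
    }
    return 0;
}
```

For example, prop_lookup(0x0301, 0, offsets, size, ranges) would test
whether U+0301 is listed under property code 0 (Mn).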
If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
machine with a different endian order and the values must be byte-swapped.
To swap a 16-bit value:
  c = (c >> 8) | ((c & 0xff) << 8)
To swap a 32-bit value:
  c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
      (((c >> 16) & 0xff) << 8) | (c >> 24)
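As a rough sketch, the two swaps can be wrapped in small C helpers. The
names swap16, swap32, and header_field16 are hypothetical, and the sketch
assumes that a ByteOrderMark read in native order has the value 0xFEFF,
which the text above does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Exchange the two bytes of a 16-bit value. */
uint16_t swap16(uint16_t c)
{
    return (uint16_t)((c >> 8) | ((c & 0xff) << 8));
}

/* Reverse the four bytes of a 32-bit value. */
uint32_t swap32(uint32_t c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | (c >> 24);
}

/*
 * Returns a 16-bit header field in host byte order, given the
 * ByteOrderMark exactly as it was read from the file.
 * Assumption: 0xFEFF means "no swap needed"; 0xFFFE means "swap".
 */
uint16_t header_field16(uint16_t bom, uint16_t value)
{
    assert(bom == 0xFEFF || bom == 0xFFFE);
    return bom == 0xFFFE ? swap16(value) : value;
}
```

Fixed-width types are used here so that the 32-bit swap behaves the same
on platforms where unsigned long is wider than 32 bits.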
CASE MAPPINGS
=============
The next data file is called "case.dat" and contains three case mapping tables
in the following order: upper, lower, and title case. Each table is in
increasing order by character code and each mapping contains 3 unsigned longs
which represent the possible mappings.
The format for the binary form of these tables is:
  unsigned short ByteOrderMark
  unsigned short NumMappingNodes, count of all mapping nodes
  unsigned short CaseTableSizes[2], upper and lower mapping node counts
  unsigned long  CaseTables[NumMappingNodes]
The starting indexes of the case tables are calculated as follows:
  UpperIndex = 0;
  LowerIndex = CaseTableSizes[0] * 3;
  TitleIndex = LowerIndex + CaseTableSizes[1] * 3;
The order of the fields for the three tables is:
  Upper case
  ----------
  unsigned long upper;
  unsigned long lower;
  unsigned long title;
  Lower case
  ----------
  unsigned long lower;
  unsigned long upper;
  unsigned long title;
  Title case
  ----------
  unsigned long title;
  unsigned long upper;
  unsigned long lower;
If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.
Because the tables are in increasing order by character code, locating a
mapping requires a simple binary search on one of the 3 codes that make up
each node.
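A minimal C sketch of such a lookup is shown below. The names case_lookup
and CASE_UPPER/CASE_LOWER/CASE_TITLE are hypothetical. The sketch assumes
that each table is sorted by the first field of its nodes, that every node
occupies three consecutive 32-bit values, and that the title table holds
whatever nodes remain after the upper and lower tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CASE_UPPER = 0, CASE_LOWER = 1, CASE_TITLE = 2 };

/*
 * Searches the table selected by `which` for the node whose first field
 * equals `code`. Returns a pointer to the node's three fields, or NULL
 * if the character has no entry in that table.
 * Assumption: the first field of each node is the search key.
 */
const uint32_t *case_lookup(uint32_t code, int which,
                            const uint32_t *case_tables,
                            unsigned num_mapping_nodes,
                            const uint16_t case_table_sizes[2])
{
    size_t upper_count, lower_count, first, count, lo, hi;

    assert(case_tables != NULL && case_table_sizes != NULL);
    upper_count = case_table_sizes[0];
    lower_count = case_table_sizes[1];
    assert(upper_count + lower_count <= num_mapping_nodes);

    /* `first` and `count` are measured in nodes, not in longs. */
    if (which == CASE_UPPER) {
        first = 0;
        count = upper_count;
    } else if (which == CASE_LOWER) {
        first = upper_count;
        count = lower_count;
    } else {
        first = upper_count + lower_count;
        count = num_mapping_nodes - first;
    }

    lo = 0;
    hi = count;                          /* exclusive upper bound */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint32_t *node = case_tables + (first + mid) * 3;
        if (code < node[0])
            hi = mid;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node;
    }
    return NULL;
}
```

With the field orders listed above, a node found in the upper case table
would give the lower case mapping in node[1] and the title case mapping in
node[2].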
It is important to note that there can be only 65536 mapping nodes in total,
which, divided into 3 portions, allows 21845 nodes for each case mapping
table. An individual table may hold more or fewer than 21845 mappings, but
only 65536 are allowed overall.
DECOMPOSITIONS
==============
The next data file is called "decomp.dat" and contains the decomposition data
for all characters whose decompositions contain more than one character and
are *not* compatibility decompositions. Compatibility decompositions are
signaled in the UCDB format by the use of the <compat> tag in the
decomposition field. Each list of character codes represents a full
decomposition of a composite character. The nodes are arranged in increasing
order by character code.
The format for the binary form of this table is:
  unsigned short ByteOrderMark
  unsigned short NumDecompNodes, count of all decomposition nodes
  unsigned long  Bytes
  unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
  unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]
If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.
The DecompNodes[] array consists of pairs of unsigned longs, the first of
which is the character code and the second is the initial index of the list
of character codes representing the decomposition.
Locating the decomposition of a composite character requires a binary search
for a character code in the DecompNodes[] array and using its index to
locate the start of the decomposition. The length of the decomposition list
is the index stored in the following element of DecompNodes[] minus the
current index.
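The lookup and length calculation might look like the following C sketch.
The name decomp_lookup is hypothetical, and the sketch assumes that the one
extra value at the end of DecompNodes[], at position NumDecompNodes * 2,
holds the index just past the final decomposition so that the length of
the last entry can be computed the same way as the others.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Looks up the decomposition of `code`. On success, returns a pointer
 * to the first character code of the decomposition and stores the
 * number of codes in *length. Returns NULL (and length 0) otherwise.
 * Assumption: decomp_nodes[2 * num_decomp_nodes] is a trailing end index.
 */
const uint32_t *decomp_lookup(uint32_t code,
                              const uint32_t *decomp_nodes,
                              unsigned num_decomp_nodes,
                              const uint32_t *decomp,
                              size_t *length)
{
    size_t lo = 0, hi = num_decomp_nodes;    /* hi is exclusive */

    assert(decomp_nodes != NULL && decomp != NULL && length != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t key = decomp_nodes[2 * mid];
        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            uint32_t start = decomp_nodes[2 * mid + 1];
            /* Next node's index, or the trailing end index for the last. */
            uint32_t end = (mid + 1 < num_decomp_nodes)
                               ? decomp_nodes[2 * (mid + 1) + 1]
                               : decomp_nodes[2 * num_decomp_nodes];
            *length = end - start;
            return decomp + start;
        }
    }
    *length = 0;
    return NULL;
}
```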
COMBINING CLASSES
=================
The fourth data file is called "cmbcl.dat" and contains the characters with
non-zero combining classes.
The format for the binary form of this table is:
  unsigned short ByteOrderMark
  unsigned short NumCCLNodes
  unsigned long  Bytes
  unsigned long  CCLNodes[NumCCLNodes * 3]
If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.
The CCLNodes[] array consists of groups of three unsigned longs. The first
and second are the beginning and ending of a range and the third is the
combining class of that range.
If a character is not found in this table, then the combining class is
assumed to be 0.
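A lookup over the CCLNodes[] triples could be sketched in C as shown below.
The name combining_class is hypothetical, and the sketch assumes that the
ranges are stored in increasing order and do not overlap, which is what
makes a binary search possible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns the combining class of `code`, or 0 when the character is not
 * covered by any range in the table.
 * Assumption: the triples (first, last, class) are sorted by `first`
 * and the ranges do not overlap.
 */
uint32_t combining_class(uint32_t code,
                         const uint32_t *ccl_nodes,
                         unsigned num_ccl_nodes)
{
    size_t lo = 0, hi = num_ccl_nodes;       /* hi is exclusive */

    assert(ccl_nodes != NULL || num_ccl_nodes == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint32_t *node = ccl_nodes + mid * 3;
        if (code < node[0])
            hi = mid;                        /* below the range start */
        else if (code > node[1])
            lo = mid + 1;                    /* above the range end */
        else
            return node[2];                  /* inside: third value */
    }
    return 0;
}
```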
It is important to note that only 65536 distinct range and combining class
nodes can be specified because NumCCLNodes is usually a 16-bit number.
NUMBER TABLE
============
The final data file is called "num.dat" and contains the characters that have
a numeric value associated with them.
The format for the binary form of the table is:
  unsigned short ByteOrderMark
  unsigned short NumNumberNodes
  unsigned long  Bytes
  unsigned long  NumberNodes[NumNumberNodes]
  unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
                            / sizeof(short)]
If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.
The NumberNodes array contains pairs of values, the first of which is the
character code and the second an index into the ValueNodes array. The
ValueNodes array contains pairs of integers which represent the numerator
and denominator of the numeric value of the character. If the character
happens to map to an integer, both the values in ValueNodes will be the
same.
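A numeric value lookup might be sketched in C as follows. The names
number_lookup and struct number are hypothetical. The sketch assumes that
NumNumberNodes counts the individual values in NumberNodes[], so that the
array holds NumNumberNodes / 2 pairs, that the pairs are sorted by
character code, and that the stored index points at the numerator in
ValueNodes[] with the denominator immediately after it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct number {
    uint16_t numerator;
    uint16_t denominator;
};

/*
 * Looks up the numeric value of `code`. Returns 1 and fills in *out
 * when the character has a numeric value, 0 otherwise.
 * Assumption: number_nodes[] holds (code, index) pairs sorted by code.
 */
int number_lookup(uint32_t code,
                  const uint32_t *number_nodes,
                  unsigned num_number_nodes,
                  const uint16_t *value_nodes,
                  struct number *out)
{
    size_t lo = 0, hi = num_number_nodes / 2;    /* counted in pairs */

    assert(number_nodes != NULL && value_nodes != NULL && out != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t key = number_nodes[2 * mid];
        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            uint32_t idx = number_nodes[2 * mid + 1];
            out->numerator = value_nodes[idx];
            out->denominator = value_nodes[idx + 1];
            return 1;
        }
    }
    return 0;
}
```

Following the description above, a caller could treat a result whose
numerator and denominator are equal as an integer with that value.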