package uucd

Unicode character database decoder.

Uucd decodes the data of the Unicode character database from its XML representation. It provides high-level (but not necessarily efficient) access to the data so that efficient representations can be extracted.

Uucd decodes the representation described in Annex #42 of Unicode 12.0.0. Subsequent versions may be decoded as long as no new cases are introduced in parsed enumerated properties.

Consult the basics.

Note. All strings returned by the module are UTF-8 encoded.

Release v12.0.0 — Unicode version 12.0.0 — homepage

References

Code points

type cp = int

The type for Unicode code points, ranges from 0x0000 to 0x10_FFFF.

val is_cp : int -> bool

is_cp n is true iff n is a Unicode code point.

val is_scalar_value : int -> bool

is_scalar_value n is true iff n is a Unicode scalar value.
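
For reference, both predicates amount to simple range checks. The following is a minimal sketch based on the definitions in the Unicode standard, not Uucd's actual source; the `_sketch` names are hypothetical and exist only for illustration.

```ocaml
(* A code point is any integer in the range 0x0000 to 0x10FFFF. *)
let is_cp_sketch n = 0x0000 <= n && n <= 0x10_FFFF

(* A scalar value is any code point except the surrogates 0xD800..0xDFFF. *)
let is_scalar_value_sketch n =
  (0x0000 <= n && n <= 0xD7FF) || (0xE000 <= n && n <= 0x10_FFFF)
```

Under these definitions a surrogate such as 0xD800 is a code point but not a scalar value.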

module Cpmap : Map.S with type key = cp

Code point maps.

Properties

Properties are referenced by their name and property values by their abbreviated name. To understand their semantics refer to the standard.

type props

The type for sets of properties.

type 'a prop

The type for properties with property value of type 'a.

val find : props -> 'a prop -> 'a option

find ps p is the value of property p in ps, if any.

val unknown_prop : (string * string) -> string prop

unknown_prop (ns, n) is a property read from an XML attribute whose expanded name is (ns, n). This can be used to access a property unknown to the module.

Non Unihan properties

In alphabetical order.

val age : [ `Version of int * int | `Unassigned ] prop
val alphabetic : bool prop
val ascii_hex_digit : bool prop
val bidi_class : [ `AL | `AN | `B | `BN | `CS | `EN | `ES | `ET | `FSI | `L | `LRE | `LRI | `LRO | `NSM | `ON | `PDF | `PDI | `R | `RLE | `RLI | `RLO | `S | `WS ] prop
val bidi_control : bool prop
val bidi_mirrored : bool prop
val bidi_mirroring_glyph : cp option prop
val bidi_paired_bracket : [ `Self | `Cp of cp ] prop
val bidi_paired_bracket_type : [ `O | `C | `N ] prop
val block : [ `ASCII | `Adlam | `Aegean_Numbers | `Ahom | `Alchemical | `Alphabetic_PF | `Anatolian_Hieroglyphs | `Ancient_Greek_Music | `Ancient_Greek_Numbers | `Ancient_Symbols | `Arabic | `Arabic_Ext_A | `Arabic_Math | `Arabic_PF_A | `Arabic_PF_B | `Arabic_Sup | `Armenian | `Arrows | `Avestan | `Balinese | `Bamum | `Bamum_Sup | `Bassa_Vah | `Batak | `Bengali | `Bhaiksuki | `Block_Elements | `Bopomofo | `Bopomofo_Ext | `Box_Drawing | `Brahmi | `Braille | `Buginese | `Buhid | `Byzantine_Music | `CJK | `CJK_Compat | `CJK_Compat_Forms | `CJK_Compat_Ideographs | `CJK_Compat_Ideographs_Sup | `CJK_Ext_A | `CJK_Ext_B | `CJK_Ext_C | `CJK_Ext_D | `CJK_Ext_E | `CJK_Ext_F | `CJK_Radicals_Sup | `CJK_Strokes | `CJK_Symbols | `Carian | `Caucasian_Albanian | `Chakma | `Cham | `Cherokee | `Cherokee_Sup | `Chess_Symbols | `Compat_Jamo | `Control_Pictures | `Coptic | `Coptic_Epact_Numbers | `Counting_Rod | `Cuneiform | `Cuneiform_Numbers | `Currency_Symbols | `Cypriot_Syllabary | `Cyrillic | `Cyrillic_Ext_A | `Cyrillic_Ext_B | `Cyrillic_Ext_C | `Cyrillic_Sup | `Deseret | `Devanagari | `Devanagari_Ext | `Diacriticals | `Diacriticals_Ext | `Diacriticals_For_Symbols | `Diacriticals_Sup | `Dingbats | `Dogra | `Domino | `Duployan | `Early_Dynastic_Cuneiform | `Egyptian_Hieroglyph_Format_Controls | `Egyptian_Hieroglyphs | `Elbasan | `Elymaic | `Emoticons | `Enclosed_Alphanum | `Enclosed_Alphanum_Sup | `Enclosed_CJK | `Enclosed_Ideographic_Sup | `Ethiopic | `Ethiopic_Ext | `Ethiopic_Ext_A | `Ethiopic_Sup | `Geometric_Shapes | `Geometric_Shapes_Ext | `Georgian | `Georgian_Ext | `Georgian_Sup | `Glagolitic | `Glagolitic_Sup | `Gothic | `Grantha | `Greek | `Greek_Ext | `Gujarati | `Gunjala_Gondi | `Gurmukhi | `Half_And_Full_Forms | `Half_Marks | `Hangul | `Hanifi_Rohingya | `Hanunoo | `Hatran | `Hebrew | `High_PU_Surrogates | `High_Surrogates | `Hiragana | `IDC | `IPA_Ext | `Ideographic_Symbols | `Imperial_Aramaic | `Indic_Number_Forms | `Indic_Siyaq_Numbers | `Inscriptional_Pahlavi | 
`Inscriptional_Parthian | `Jamo | `Jamo_Ext_A | `Jamo_Ext_B | `Javanese | `Kaithi | `Kana_Ext_A | `Kana_Sup | `Kanbun | `Kangxi | `Kannada | `Katakana | `Katakana_Ext | `Kayah_Li | `Kharoshthi | `Khmer | `Khmer_Symbols | `Khojki | `Khudawadi | `Lao | `Latin_1_Sup | `Latin_Ext_A | `Latin_Ext_Additional | `Latin_Ext_B | `Latin_Ext_C | `Latin_Ext_D | `Latin_Ext_E | `Lepcha | `Letterlike_Symbols | `Limbu | `Linear_A | `Linear_B_Ideograms | `Linear_B_Syllabary | `Lisu | `Low_Surrogates | `Lycian | `Lydian | `Mahajani | `Mahjong | `Makasar | `Malayalam | `Mandaic | `Manichaean | `Marchen | `Masaram_Gondi | `Math_Alphanum | `Math_Operators | `Mayan_Numerals | `Medefaidrin | `Meetei_Mayek | `Meetei_Mayek_Ext | `Mende_Kikakui | `Meroitic_Cursive | `Meroitic_Hieroglyphs | `Miao | `Misc_Arrows | `Misc_Math_Symbols_A | `Misc_Math_Symbols_B | `Misc_Pictographs | `Misc_Symbols | `Misc_Technical | `Modi | `Modifier_Letters | `Modifier_Tone_Letters | `Mongolian | `Mongolian_Sup | `Mro | `Multani | `Music | `Myanmar | `Myanmar_Ext_A | `Myanmar_Ext_B | `NB | `NKo | `Nabataean | `Nandinagari | `New_Tai_Lue | `Newa | `Number_Forms | `Nushu | `Nyiakeng_Puachue_Hmong | `OCR | `Ogham | `Ol_Chiki | `Old_Hungarian | `Old_Italic | `Old_North_Arabian | `Old_Permic | `Old_Persian | `Old_Sogdian | `Old_South_Arabian | `Old_Turkic | `Oriya | `Ornamental_Dingbats | `Osage | `Osmanya | `Ottoman_Siyaq_Numbers | `PUA | `Pahawh_Hmong | `Palmyrene | `Pau_Cin_Hau | `Phags_Pa | `Phaistos | `Phoenician | `Phonetic_Ext | `Phonetic_Ext_Sup | `Playing_Cards | `Psalter_Pahlavi | `Punctuation | `Rejang | `Rumi | `Runic | `Samaritan | `Saurashtra | `Sharada | `Shavian | `Shorthand_Format_Controls | `Siddham | `Sinhala | `Sinhala_Archaic_Numbers | `Small_Forms | `Small_Kana_Ext | `Sogdian | `Sora_Sompeng | `Soyombo | `Specials | `Sundanese | `Sundanese_Sup | `Sup_Arrows_A | `Sup_Arrows_B | `Sup_Arrows_C | `Sup_Math_Operators | `Sup_PUA_A | `Sup_PUA_B | `Sup_Punctuation | `Sup_Symbols_And_Pictographs | 
`Super_And_Sub | `Sutton_SignWriting | `Syloti_Nagri | `Symbols_And_Pictographs_Ext_A | `Syriac | `Syriac_Sup | `Tagalog | `Tagbanwa | `Tags | `Tai_Le | `Tai_Tham | `Tai_Viet | `Tai_Xuan_Jing | `Takri | `Tamil | `Tamil_Sup | `Tangut | `Tangut_Components | `Telugu | `Thaana | `Thai | `Tibetan | `Tifinagh | `Tirhuta | `Transport_And_Map | `UCAS | `UCAS_Ext | `Ugaritic | `VS | `VS_Sup | `Vai | `Vedic_Ext | `Vertical_Forms | `Wancho | `Warang_Citi | `Yi_Radicals | `Yi_Syllables | `Yijing | `Zanabazar_Square ] prop
val canonical_combining_class : int prop
val cased : bool prop
val case_folding : [ `Self | `Cps of cp list ] prop
val case_ignorable : bool prop
val changes_when_casefolded : bool prop
val changes_when_casemapped : bool prop
val changes_when_lowercased : bool prop
val changes_when_nfkc_casefolded : bool prop
val changes_when_titlecased : bool prop
val changes_when_uppercased : bool prop
val composition_exclusion : bool prop
val dash : bool prop
val decomposition_mapping : [ `Self | `Cps of cp list ] prop
val decomposition_type : [ `Can | `Com | `Enc | `Fin | `Font | `Fra | `Init | `Iso | `Med | `Nar | `Nb | `Sml | `Sqr | `Sub | `Sup | `Vert | `Wide | `None ] prop
val default_ignorable_code_point : bool prop
val deprecated : bool prop
val diacritic : bool prop
val east_asian_width : [ `A | `F | `H | `N | `Na | `W ] prop
val equivalent_unified_ideograph : cp option prop
val expands_on_nfc : bool prop
val expands_on_nfd : bool prop
val expands_on_nfkc : bool prop
val expands_on_nfkd : bool prop
val extender : bool prop
val fc_nfkc_closure : [ `Self | `Cps of cp list ] prop
val full_composition_exclusion : bool prop
val general_category : [ `Lu | `Ll | `Lt | `Lm | `Lo | `Mn | `Mc | `Me | `Nd | `Nl | `No | `Pc | `Pd | `Ps | `Pe | `Pi | `Pf | `Po | `Sm | `Sc | `Sk | `So | `Zs | `Zl | `Zp | `Cc | `Cf | `Cs | `Co | `Cn ] prop
val grapheme_base : bool prop
val grapheme_cluster_break : [ `CN | `CR | `EB | `EBG | `EM | `EX | `GAZ | `L | `LF | `LV | `LVT | `PP | `RI | `SM | `T | `V | `XX | `ZWJ ] prop
val grapheme_extend : bool prop
val hangul_syllable_type : [ `L | `LV | `LVT | `T | `V | `NA ] prop
val hex_digit : bool prop
val hyphen : bool prop
val id_continue : bool prop
val id_start : bool prop
val ideographic : bool prop
val ids_binary_operator : bool prop
val ids_trinary_operator : bool prop
val indic_syllabic_category : [ `Avagraha | `Bindu | `Brahmi_Joining_Number | `Cantillation_Mark | `Consonant | `Consonant_Dead | `Consonant_Final | `Consonant_Head_Letter | `Consonant_Initial_Postfixed | `Consonant_Killer | `Consonant_Medial | `Consonant_Placeholder | `Consonant_Preceding_Repha | `Consonant_Prefixed | `Consonant_Repha | `Consonant_Subjoined | `Consonant_Succeeding_Repha | `Consonant_With_Stacker | `Gemination_Mark | `Invisible_Stacker | `Joiner | `Modifying_Letter | `Non_Joiner | `Nukta | `Number | `Number_Joiner | `Other | `Pure_Killer | `Register_Shifter | `Syllable_Modifier | `Tone_Letter | `Tone_Mark | `Virama | `Visarga | `Vowel | `Vowel_Dependent | `Vowel_Independent ] prop
val indic_matra_category : [ `Right | `Left | `Visual_Order_Left | `Left_And_Right | `Top | `Bottom | `Top_And_Bottom | `Top_And_Right | `Top_And_Left | `Top_And_Left_And_Right | `Bottom_And_Right | `Top_And_Bottom_And_Right | `Overstruck | `Invisible | `NA ] prop
val indic_positional_category : [ `Bottom | `Bottom_And_Right | `Left | `Left_And_Right | `NA | `Overstruck | `Right | `Top | `Top_And_Bottom | `Top_And_Bottom_And_Right | `Top_And_Left | `Top_And_Left_And_Right | `Top_And_Right | `Visual_Order_Left ] prop
val iso_comment : string prop
val jamo_short_name : string prop
val join_control : bool prop
val joining_group : [ `African_Feh | `African_Noon | `African_Qaf | `Ain | `Alaph | `Alef | `Alef_Maqsurah | `Beh | `Beth | `Burushaski_Yeh_Barree | `Dal | `Dalath_Rish | `E | `Farsi_Yeh | `Fe | `Feh | `Final_Semkath | `Gaf | `Gamal | `Hah | `Hanifi_Rohingya_Kinna_Ya | `Hanifi_Rohingya_Pa | `Hamza_On_Heh_Goal | `He | `Heh | `Heh_Goal | `Heth | `Kaf | `Kaph | `Khaph | `Knotted_Heh | `Lam | `Lamadh | `Malayalam_Bha | `Malayalam_Ja | `Malayalam_Lla | `Malayalam_Llla | `Malayalam_Nga | `Malayalam_Nna | `Malayalam_Nnna | `Malayalam_Nya | `Malayalam_Ra | `Malayalam_Ssa | `Malayalam_Tta | `Manichaean_Aleph | `Manichaean_Ayin | `Manichaean_Beth | `Manichaean_Daleth | `Manichaean_Dhamedh | `Manichaean_Five | `Manichaean_Gimel | `Manichaean_Heth | `Manichaean_Hundred | `Manichaean_Kaph | `Manichaean_Lamedh | `Manichaean_Mem | `Manichaean_Nun | `Manichaean_One | `Manichaean_Pe | `Manichaean_Qoph | `Manichaean_Resh | `Manichaean_Sadhe | `Manichaean_Samekh | `Manichaean_Taw | `Manichaean_Ten | `Manichaean_Teth | `Manichaean_Thamedh | `Manichaean_Twenty | `Manichaean_Waw | `Manichaean_Yodh | `Manichaean_Zayin | `Meem | `Mim | `No_Joining_Group | `Noon | `Nun | `Nya | `Pe | `Qaf | `Qaph | `Reh | `Reversed_Pe | `Rohingya_Yeh | `Sad | `Sadhe | `Seen | `Semkath | `Shin | `Straight_Waw | `Swash_Kaf | `Syriac_Waw | `Tah | `Taw | `Teh_Marbuta | `Teh_Marbuta_Goal | `Teth | `Waw | `Yeh | `Yeh_Barree | `Yeh_With_Tail | `Yudh | `Yudh_He | `Zain | `Zhain ] prop
val joining_type : [ `U | `C | `T | `D | `L | `R ] prop
val line_break : [ `AI | `AL | `B2 | `BA | `BB | `BK | `CB | `CJ | `CL | `CM | `CP | `CR | `EX | `GL | `H2 | `H3 | `HL | `HY | `ID | `IN | `IS | `JL | `JT | `JV | `LF | `NL | `NS | `NU | `OP | `PO | `PR | `QU | `RI | `SA | `SG | `SP | `SY | `WJ | `XX | `ZW | `EB | `EM | `ZWJ ] prop
val logical_order_exception : bool prop
val lowercase : bool prop
val lowercase_mapping : [ `Self | `Cps of cp list ] prop
val math : bool prop
val name : [ `Pattern of string | `Name of string ] prop

In the `Pattern case, occurrences of the character '#' (U+0023) in the string must be replaced by the value of the code point as four to six uppercase hexadecimal digits (the minimal needed). E.g. the pattern "CJK UNIFIED IDEOGRAPH-#" associated with code point U+3400 gives the name "CJK UNIFIED IDEOGRAPH-3400".
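
The substitution can be sketched as follows. This is an illustrative helper, not part of Uucd: `expand_pattern` and `resolve_name` are hypothetical names, and the function only assumes a value shaped like the `name` property.

```ocaml
(* Replace every '#' in [pat] by [cp] written in uppercase hexadecimal,
   zero-padded to at least four digits. *)
let expand_pattern pat cp =
  String.concat (Printf.sprintf "%04X" cp) (String.split_on_char '#' pat)

(* Resolve a value shaped like the [name] property for code point [cp]. *)
let resolve_name cp = function
  | `Name n -> n
  | `Pattern pat -> expand_pattern pat cp
```

For example, `resolve_name 0x3400 (`Pattern "CJK UNIFIED IDEOGRAPH-#")` evaluates to "CJK UNIFIED IDEOGRAPH-3400".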

val name_alias : (string * [ `Abbreviation | `Alternate | `Control | `Correction | `Figment ]) list prop
val nfc_quick_check : [ `True | `False | `Maybe ] prop
val nfd_quick_check : [ `True | `False | `Maybe ] prop
val nfkc_quick_check : [ `True | `False | `Maybe ] prop
val nfkc_casefold : [ `Self | `Cps of cp list ] prop
val nfkd_quick_check : [ `True | `False | `Maybe ] prop
val noncharacter_code_point : bool prop
val numeric_type : [ `None | `De | `Di | `Nu ] prop
val numeric_value : [ `NaN | `Frac of int * int | `Num of int64 ] prop
val other_alphabetic : bool prop
val other_default_ignorable_code_point : bool prop
val other_grapheme_extend : bool prop
val other_id_continue : bool prop
val other_id_start : bool prop
val other_lowercase : bool prop
val other_math : bool prop
val other_uppercase : bool prop
val pattern_syntax : bool prop
val pattern_white_space : bool prop
val prepended_concatenation_mark : bool prop
val quotation_mark : bool prop
val radical : bool prop
val regional_indicator : bool prop
type script = [
  | `Adlm
  | `Aghb
  | `Ahom
  | `Arab
  | `Armi
  | `Armn
  | `Avst
  | `Bali
  | `Bamu
  | `Bass
  | `Batk
  | `Beng
  | `Bhks
  | `Bopo
  | `Brah
  | `Brai
  | `Bugi
  | `Buhd
  | `Cakm
  | `Cans
  | `Cari
  | `Cham
  | `Cher
  | `Copt
  | `Cprt
  | `Cyrl
  | `Deva
  | `Dogr
  | `Dsrt
  | `Dupl
  | `Egyp
  | `Elba
  | `Elym
  | `Ethi
  | `Geor
  | `Glag
  | `Gong
  | `Gonm
  | `Goth
  | `Gran
  | `Grek
  | `Gujr
  | `Guru
  | `Hang
  | `Hani
  | `Hano
  | `Hatr
  | `Hebr
  | `Hira
  | `Hluw
  | `Hmng
  | `Hmnp
  | `Hrkt
  | `Hung
  | `Ital
  | `Java
  | `Kali
  | `Kana
  | `Khar
  | `Khmr
  | `Khoj
  | `Knda
  | `Kthi
  | `Lana
  | `Laoo
  | `Latn
  | `Lepc
  | `Limb
  | `Lina
  | `Linb
  | `Lisu
  | `Lyci
  | `Lydi
  | `Mahj
  | `Maka
  | `Mand
  | `Mani
  | `Marc
  | `Medf
  | `Mend
  | `Merc
  | `Mero
  | `Mlym
  | `Modi
  | `Mong
  | `Mroo
  | `Mtei
  | `Mult
  | `Mymr
  | `Nand
  | `Narb
  | `Nbat
  | `Newa
  | `Nkoo
  | `Nshu
  | `Ogam
  | `Olck
  | `Orkh
  | `Orya
  | `Osge
  | `Osma
  | `Palm
  | `Pauc
  | `Perm
  | `Phag
  | `Phli
  | `Phlp
  | `Phnx
  | `Plrd
  | `Prti
  | `Qaai
  | `Rjng
  | `Rohg
  | `Runr
  | `Samr
  | `Sarb
  | `Saur
  | `Sgnw
  | `Shaw
  | `Shrd
  | `Sidd
  | `Sind
  | `Sinh
  | `Sogd
  | `Sogo
  | `Sora
  | `Soyo
  | `Sund
  | `Sylo
  | `Syrc
  | `Tagb
  | `Takr
  | `Tale
  | `Talu
  | `Taml
  | `Tang
  | `Tavt
  | `Telu
  | `Tfng
  | `Tglg
  | `Thaa
  | `Thai
  | `Tibt
  | `Tirh
  | `Ugar
  | `Vaii
  | `Wara
  | `Wcho
  | `Xpeo
  | `Xsux
  | `Yiii
  | `Zanb
  | `Zinh
  | `Zyyy
  | `Zzzz
]
val script : script prop
val script_extensions : script list prop
val sentence_break : [ `AT | `CL | `CR | `EX | `FO | `LE | `LF | `LO | `NU | `SC | `SE | `SP | `ST | `UP | `XX ] prop
val simple_case_folding : [ `Self | `Cp of cp ] prop
val simple_lowercase_mapping : [ `Self | `Cp of cp ] prop
val simple_titlecase_mapping : [ `Self | `Cp of cp ] prop
val simple_uppercase_mapping : [ `Self | `Cp of cp ] prop
val soft_dotted : bool prop
val sterm : bool prop
val terminal_punctuation : bool prop
val titlecase_mapping : [ `Self | `Cps of cp list ] prop
val uax_42_element : [ `Reserved | `Noncharacter | `Surrogate | `Char ] prop

Not normative, artefact of Uucd. Corresponds to the XML element name that describes the code point.

val unicode_1_name : string prop
val unified_ideograph : bool prop
val uppercase : bool prop
val uppercase_mapping : [ `Self | `Cps of cp list ] prop
val variation_selector : bool prop
val vertical_orientation : [ `U | `R | `Tu | `Tr ] prop
val white_space : bool prop
val word_break : [ `CR | `DQ | `EB | `EBG | `EM | `EX | `Extend | `FO | `GAZ | `HL | `KA | `LE | `LF | `MB | `ML | `MN | `NL | `NU | `RI | `SQ | `WSegSpace | `XX | `ZWJ ] prop
val xid_continue : bool prop
val xid_start : bool prop
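
The variant-typed values listed above are meant to be consumed by pattern matching. As a small sketch, a value shaped like the `numeric_value` property could be converted to a float as follows; `float_of_numeric_value` is a hypothetical helper, not part of Uucd.

```ocaml
(* Convert a value shaped like the [numeric_value] property to a float.
   [`NaN] (no numeric value) maps to [None]. *)
let float_of_numeric_value = function
  | `NaN -> None
  | `Frac (num, den) -> Some (float_of_int num /. float_of_int den)
  | `Num n -> Some (Int64.to_float n)
```

Returning an option keeps the "no numeric value" case distinct from any actual number.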

Unihan properties

In alphabetical order. For now Unihan properties are always represented as strings.

val kAccountingNumeric : string prop
val kAlternateHanYu : string prop
val kAlternateJEF : string prop
val kAlternateKangXi : string prop
val kAlternateMorohashi : string prop
val kBigFive : string prop
val kCCCII : string prop
val kCNS1986 : string prop
val kCNS1992 : string prop
val kCangjie : string prop
val kCantonese : string prop
val kCheungBauer : string prop
val kCheungBauerIndex : string prop
val kCihaiT : string prop
val kCompatibilityVariant : string prop
val kCowles : string prop
val kDaeJaweon : string prop
val kDefinition : string prop
val kEACC : string prop
val kFenn : string prop
val kFennIndex : string prop
val kFourCornerCode : string prop
val kFrequency : string prop
val kGB0 : string prop
val kGB1 : string prop
val kGB3 : string prop
val kGB5 : string prop
val kGB7 : string prop
val kGB8 : string prop
val kGSR : string prop
val kGradeLevel : string prop
val kHDZRadBreak : string prop
val kHKGlyph : string prop
val kHKSCS : string prop
val kHanYu : string prop
val kHangul : string prop
val kHanyuPinlu : string prop
val kHanyuPinyin : string prop
val kIBMJapan : string prop
val kIICore : string prop
val kIRGDaeJaweon : string prop
val kIRGDaiKanwaZiten : string prop
val kIRGHanyuDaZidian : string prop
val kIRGKangXi : string prop
val kIRG_GSource : string prop
val kIRG_HSource : string prop
val kIRG_JSource : string prop
val kIRG_KPSource : string prop
val kIRG_KSource : string prop
val kIRG_MSource : string prop
val kIRG_TSource : string prop
val kIRG_USource : string prop
val kIRG_VSource : string prop
val kJHJ : string prop
val kJIS0213 : string prop
val kJa : string prop
val kJapaneseKun : string prop
val kJapaneseOn : string prop
val kJinmeiyoKanji : string prop
val kJis0 : string prop
val kJis1 : string prop
val kJoyoKanji : string prop
val kKPS0 : string prop
val kKPS1 : string prop
val kKSC0 : string prop
val kKSC1 : string prop
val kKangXi : string prop
val kKarlgren : string prop
val kKorean : string prop
val kKoreanEducationHanja : string prop
val kKoreanName : string prop
val kLau : string prop
val kMainlandTelegraph : string prop
val kMandarin : string prop
val kMatthews : string prop
val kMeyerWempe : string prop
val kMorohashi : string prop
val kNelson : string prop
val kOtherNumeric : string prop
val kPhonetic : string prop
val kPrimaryNumeric : string prop
val kPseudoGB1 : string prop
val kRSAdobe_Japan1_6 : string prop
val kRSJapanese : string prop
val kRSKanWa : string prop
val kRSKangXi : string prop
val kRSKorean : string prop
val kRSMerged : string prop
val kRSTUnicode : string prop
val kRSUnicode : string prop
val kReading : string prop
val kSBGY : string prop
val kSemanticVariant : string prop
val kSimplifiedVariant : string prop
val kSpecializedSemanticVariant : string prop
val kSrc_NushuDuben : string prop
val kTGH : string prop
val kTGT_MergedSrc : string prop
val kTaiwanTelegraph : string prop
val kTang : string prop
val kTotalStrokes : string prop
val kTraditionalVariant : string prop
val kVietnamese : string prop
val kWubi : string prop
val kXHC1983 : string prop
val kXerox : string prop
val kZVariant : string prop

Unicode character databases

type block = (cp * cp) * string

The type for blocks. Code point range, name of the block.
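
As an illustration of how this representation can be used, the sketch below looks up the block containing a code point. The types are redeclared locally so the snippet stands alone, `block_name_of` is a hypothetical helper, the sample list is hand-written rather than decoded from a database, and the ranges are assumed to be inclusive.

```ocaml
type cp = int
type block = (cp * cp) * string

(* Name of the first block whose range (assumed inclusive) contains [cp]. *)
let block_name_of (blocks : block list) (cp : cp) : string option =
  List.find_map
    (fun ((first, last), name) ->
       if first <= cp && cp <= last then Some name else None)
    blocks

(* Hand-written sample data, for illustration only. *)
let sample_blocks : block list =
  [ (0x0000, 0x007F), "Basic Latin";
    (0x0080, 0x00FF), "Latin-1 Supplement" ]
```

A linear scan is enough for a sketch; an application doing many lookups would likely build a sorted array or interval map from the list first.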

type named_sequence = string * cp list

The type for named sequences. Sequence name, code point sequence.

type normalization_correction = cp * cp list * cp list * (int * int * int)

The type for normalization corrections. Code point, old normalization, new normalization, version.

type standardized_variant = cp list * string * [ `Isolate | `Initial | `Medial | `Final ] list

The type for standardized variants. Code point sequence, description, when.

type cjk_radical = string * cp * cp

The type for CJK radicals. Radical number, CJK radical character, CJK unified ideograph.

type emoji_source = cp list * int option * int option * int option

The type for emoji sources. Unicode, docomo, kddi, softbank.

type t = {
  description : string;
  repertoire : props Cpmap.t;
  blocks : block list;
  named_sequences : named_sequence list;
  provisional_named_sequences : named_sequence list;
  normalization_corrections : normalization_correction list;
  standardized_variants : standardized_variant list;
  cjk_radicals : cjk_radical list;
  emoji_sources : emoji_source list;
}

The type for Unicode character databases.

Note. Absence of an optional top-level field in the database is denoted by the neutral element of its type (empty string, empty list, Cpmap.empty). The module therefore does not distinguish between the absence of a field and its presence with empty data, which causes no problems in this context.

val cp_prop : t -> cp -> 'a prop -> 'a option

cp_prop ucd cp p is the property p of the code point cp in ucd's repertoire, if cp is in the repertoire and the property exists for cp.
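
cp_prop covers single lookups. To extract a compact representation of a whole property, one option is to fold over the repertoire map and collapse consecutive code points into ranges. The sketch below uses a stand-in map built with Map.Make (Int) so that it runs without Uucd; `ranges_where` is a hypothetical helper, and the Uucd usage shown in the comment is an assumption based on the signatures documented on this page.

```ocaml
module Cpmap = Map.Make (Int) (* stand-in for Uucd.Cpmap *)

(* Collapse the code points whose binding satisfies [pred] into inclusive
   ranges. Relies on Map.fold visiting keys in increasing order. *)
let ranges_where pred m =
  let add cp v acc =
    if not (pred v) then acc else
    match acc with
    | (lo, hi) :: rest when cp = hi + 1 -> (lo, cp) :: rest
    | _ -> (cp, cp) :: acc
  in
  List.rev (Cpmap.fold add m [])

(* Assumed usage with a decoded database [ucd]:
     ranges_where
       (fun ps -> Uucd.find ps Uucd.white_space = Some true)
       ucd.Uucd.repertoire *)
```

The resulting range list is typically much smaller than the per-code-point map and can be turned into a lookup table of the application's choosing.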

Decode

type src = [
  | `Channel of in_channel
  | `String of string
]

The type for input sources.

type decoder

The type for Unicode character database decoders.

val decoder : [< src ] -> decoder

decoder src is a decoder that inputs from src.

val decode : decoder -> [ `Ok of t | `Error of string ]

decode d decodes a database from d or returns an error.

val decoded_range : decoder -> (int * int) * (int * int)

decoded_range d is the range of characters spanning the `Error decoded by d, given as a pair of (line, column) positions. Line numbers are one-based and column numbers are zero-based.

Basics

The database and subsets of it for Unicode 12.0.0 are available here. Databases with groups should be preferred: they maximize value sharing and improve parsing performance.

A database is decoded as follows:

let ucd_or_die inf = try
  let ic = if inf = "-" then stdin else open_in inf in
  let d = Uucd.decoder (`Channel ic) in
  match Uucd.decode d with
  | `Ok db -> db
  | `Error e ->
    let (l0, c0), (l1, c1) = Uucd.decoded_range d in
    Printf.eprintf "%s:%d.%d-%d.%d: %s\n%!" inf l0 c0 l1 c1 e;
    exit 1
with Sys_error e -> Printf.eprintf "%s\n%!" e; exit 1

let ucd = ucd_or_die "/tmp/ucd.all.grouped.xml"

The convenience function cp_prop can be used to query the property of a given code point. For example the general category of U+1F42B is given by:

let u_1F42B_gc = Uucd.cp_prop ucd 0x1F42B Uucd.general_category
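
The result of cp_prop is an option, so callers need to handle the case where no value is found. A minimal sketch of consuming such a result follows; `describe_gc` is a hypothetical helper that spells out only a few general categories.

```ocaml
(* Describe a [general_category] lookup result. Only a few categories are
   spelled out; all others fall through to a generic message. *)
let describe_gc = function
  | None -> "no general category found"
  | Some `Lu -> "uppercase letter"
  | Some `Ll -> "lowercase letter"
  | Some `Nd -> "decimal number"
  | Some `So -> "other symbol"
  | Some _ -> "another general category"
```

Applied to the u_1F42B_gc value computed above, describe_gc returns one of these strings.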