TRON code

From Just Solve the File Format Problem
File Format
Name TRON code
Ontology

This article describes the character encoding used in TRON. Unlike Unicode, it does not apply Han unification, so it can clearly distinguish Japanese text from Chinese text.

Character codes are two-byte codes and are split into four zones:

  • A zone: High byte and low byte are both in range 0x21 to 0x7E.
  • B zone: High byte in range 0x80 to 0xFD and low byte in range 0x21 to 0x7E.
  • C zone: High byte in range 0x21 to 0x7E and low byte in range 0x80 to 0xFD.
  • D zone: High byte and low byte are both in range 0x80 to 0xFD.
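
As a small illustration, the zone of a two-byte code can be worked out directly from the ranges listed above. This is only a sketch; the function name `tron_zone` is made up for the example.

```python
def tron_zone(hi, lo):
    """Return 'A', 'B', 'C' or 'D' for a two-byte character code."""
    def low_range(b):
        return 0x21 <= b <= 0x7E

    def high_range(b):
        return 0x80 <= b <= 0xFD

    if low_range(hi) and low_range(lo):
        return "A"
    if high_range(hi) and low_range(lo):
        return "B"
    if low_range(hi) and high_range(lo):
        return "C"
    if high_range(hi) and high_range(lo):
        return "D"
    raise ValueError("0x%02X%02X is not a character code in any zone" % (hi, lo))
```

The A zone therefore holds 94 x 94 codes, the B and C zones 126 x 94 each, and the D zone 126 x 126.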

The character codes are grouped into planes. A plane (the "language selection") is chosen by a two-byte code whose first byte is 0xFE and whose second byte is the plane number plus 0x20 (for example, plane 1 is selected by the code 0xFE21). The default plane, if none is specified, is usually plane 1.
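
The selection code can be computed as in the following sketch. The function names are hypothetical, and the sketch deliberately covers only planes 1 to 94 (second byte 0x21 to 0x7E), since the rule above does not spell out how plane numbers continue past the unused byte 0x7F.

```python
def plane_selector(plane):
    """Return the two-byte selection code for a plane (planes 1 to 94 only)."""
    if not 1 <= plane <= 94:
        raise ValueError("this sketch only covers planes 1 to 94")
    return 0xFE00 | (plane + 0x20)


def plane_from_selector(code):
    """Inverse of plane_selector, again limited to second bytes 0x21 to 0x7E."""
    hi, lo = code >> 8, code & 0xFF
    if hi != 0xFE or not 0x21 <= lo <= 0x7E:
        raise ValueError("0x%04X is not a plane selection handled by this sketch" % code)
    return lo - 0x20
```

For example, `plane_selector(1)` gives 0xFE21 and `plane_selector(16)` gives 0xFE30.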

List of planes:

  • 1 = JIS, GB2312, KS X 1001, and Braille
  • 2,3 = GT
  • 6 = Big5
  • 8,9 = Dai-Kan-Wa-Jiten, hentaigana, etc
  • 10 = Dongba symbols

Conversion from other formats into TRON code is described below.

Plane 1

JIS X 0208, and first plane of JIS X 0213:

hi = ku+0x20
lo = ten+0x20

JIS X 0212:

hi = ku+0xA0
lo = ten+0x20
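
The two JIS mappings above can be sketched in Python as follows. The function names are made up for this example, and ku and ten are assumed to be the usual 1-based row and cell numbers (1 to 94).

```python
def _check_kuten(ku, ten):
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must both be in 1..94")


def jisx0208_to_tron(ku, ten):
    """JIS X 0208 (or JIS X 0213 plane 1) ku/ten -> (hi, lo) in TRON plane 1."""
    _check_kuten(ku, ten)
    return ku + 0x20, ten + 0x20


def jisx0212_to_tron(ku, ten):
    """JIS X 0212 ku/ten -> (hi, lo) in TRON plane 1."""
    _check_kuten(ku, ten)
    hi = ku + 0xA0
    if hi > 0xFD:
        # The high byte has to stay inside the 0x80..0xFD range of the B zone.
        raise ValueError("row %d does not fit in the B zone" % ku)
    return hi, ten + 0x20
```

For example, the hiragana あ (ku 4, ten 2 in JIS X 0208) comes out as hi 0x24, lo 0x22.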

Second plane of JIS X 0213:

hi = numbers 0x87 to 0xA0, contiguous by valid rows of JIS X 0213 (1,3-5,8,12-15,78-94)
lo = ten+0x20
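
A sketch of the row packing for the second plane of JIS X 0213: the 26 valid rows are listed in ascending order and mapped onto the 26 high bytes 0x87 to 0xA0. That the rows are assigned in ascending order is an assumption of this example (it is the natural reading of "contiguous"), and the names are hypothetical.

```python
# Valid rows of JIS X 0213 plane 2, as listed above: 1, 3-5, 8, 12-15, 78-94.
JISX0213_PLANE2_ROWS = [1, 3, 4, 5, 8, 12, 13, 14, 15] + list(range(78, 95))


def jisx0213_plane2_to_tron(ku, ten):
    """JIS X 0213 plane 2 ku/ten -> (hi, lo) in TRON plane 1."""
    if ku not in JISX0213_PLANE2_ROWS:
        raise ValueError("row %d is not a valid row of JIS X 0213 plane 2" % ku)
    if not 1 <= ten <= 94:
        raise ValueError("ten must be in 1..94")
    # Assumed: rows are packed in ascending order starting at 0x87.
    hi = 0x87 + JISX0213_PLANE2_ROWS.index(ku)
    return hi, ten + 0x20
```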

GB 2312 (/ denotes integer division and % the remainder; the same applies to KS X 1001 below):

hi = ((ku-1)*94+ten-1)/126+0x21
lo = ((ku-1)*94+ten-1)%126+0x80

KS X 1001:

hi = ((ku-1)*94+ten-1)/126+0xB7
lo = ((ku-1)*94+ten-1)%126+0x80
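
Both formulas first flatten the 94 x 94 grid into a linear index and then repack it into rows of 126 cells, which is the width of the 0x80 to 0xFD low-byte range. A minimal sketch, with hypothetical function names and ku and ten assumed to be 1-based:

```python
def _linear_index(ku, ten):
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must both be in 1..94")
    return (ku - 1) * 94 + (ten - 1)


def gb2312_to_tron(ku, ten):
    """GB 2312 ku/ten -> (hi, lo) in TRON plane 1."""
    quotient, remainder = divmod(_linear_index(ku, ten), 126)
    return 0x21 + quotient, 0x80 + remainder


def ksx1001_to_tron(ku, ten):
    """KS X 1001 ku/ten -> (hi, lo) in TRON plane 1."""
    quotient, remainder = divmod(_linear_index(ku, ten), 126)
    return 0xB7 + quotient, 0x80 + remainder
```

The last KS X 1001 position (ku 94, ten 94) lands on high byte 0xFD, so the set exactly fills the available high-byte range starting at 0xB7.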

Six-dot Braille:

hi = 0x80
lo = 0x21 + x, where x is the sum of the bit values (in hexadecimal) of the raised dots, laid out as in the cell:
  01 08
  02 10
  04 20
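
A sketch of the six-dot mapping. It assumes that the grid above shows the physical cell, so that with the standard Braille dot numbering (dots 1 to 3 down the left column, dots 4 to 6 down the right) dot 1 is bit 0x01 and dot 6 is bit 0x20. The names are hypothetical.

```python
# Bit value of each dot, read off the grid above (assumed to match cell positions).
SIX_DOT_BITS = {1: 0x01, 2: 0x02, 3: 0x04, 4: 0x08, 5: 0x10, 6: 0x20}


def six_dot_braille_to_tron(dots):
    """Set of raised dot numbers (1..6) -> (hi, lo) in TRON plane 1."""
    x = 0
    for dot in set(dots):
        if dot not in SIX_DOT_BITS:
            raise ValueError("dot numbers must be in 1..6")
        x |= SIX_DOT_BITS[dot]
    return 0x80, 0x21 + x
```

The blank cell maps to low byte 0x21 and the full cell to 0x60.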

Eight-dot Braille:

x = the sum of the bit values (in hexadecimal) of the raised dots, laid out as in the cell:
  01 10
  02 20
  04 40
  08 80
hi = if x<0x5E then 0x81 else if x<0xBC then 0x82 else 0x83
lo = x+0x21-(if hi=0x83 then 0xBC else if hi=0x82 then 0x5E else 0x00)
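
The eight-dot rule splits the 256 patterns over three rows of 94 codes. A sketch under the same assumption as before, namely that the grid shows the physical cell, using the standard numbering in which dots 7 and 8 are the bottom-left and bottom-right dots. Note that this bit order differs from the six-dot one. The names are hypothetical.

```python
# Left column top to bottom: dots 1, 2, 3, 7; right column: dots 4, 5, 6, 8.
EIGHT_DOT_BITS = {1: 0x01, 2: 0x02, 3: 0x04, 7: 0x08,
                  4: 0x10, 5: 0x20, 6: 0x40, 8: 0x80}


def eight_dot_braille_to_tron(dots):
    """Set of raised dot numbers (1..8) -> (hi, lo) in TRON plane 1."""
    x = 0
    for dot in set(dots):
        if dot not in EIGHT_DOT_BITS:
            raise ValueError("dot numbers must be in 1..8")
        x |= EIGHT_DOT_BITS[dot]
    # Each high byte covers 0x5E (94) consecutive pattern values.
    if x < 0x5E:
        hi, base = 0x81, 0x00
    elif x < 0xBC:
        hi, base = 0x82, 0x5E
    else:
        hi, base = 0x83, 0xBC
    return hi, x + 0x21 - base
```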

Plane 8

  • All codes are Dai-Kan-Wa-Jiten characters 1 to 48055 (presumably according to "linear2hilo" function?).

Plane 9

  • Codes 0x8021 to 0x8230 are uncommon variants of hiragana/katakana, such as small forms and voicing marks on letters that do not commonly take them.
  • Codes 0x8321 to 0x846A are hentaigana.
  • Codes 0x9621 to 0x967E apparently have something to do with Chinese elements (it is not certain that this is correct, nor what the specific encoding is).
  • Codes 0x9721 to 0x972A are the Chinese/Japanese numerals one to ten enclosed in a square.
  • Codes 0x972B to 0x975A are the katakana in parentheses. (They seem to follow the usual modern "grid order" of the kana, excluding small letters and dakuten/handakuten forms, but including "wi" and "we".)
  • Codes 0x975B to 0x9766 are the lowercase Roman numerals i to xii enclosed in a circle.
  • Codes 0x9767 to 0x977A are the numbers 1 to 20 enclosed in a triangle.
  • Codes 0x9830 to 0x9839 are the Baronh digits 0 to 9.
  • Codes 0x9840 to 0x985B are the Baronh letters: a e i ï u ü é o c s t l n h p f m ai y ÿ œ r au eu g z d b

Planes 16 and 17

These planes are used for the Basic Multilingual Plane of Unicode 2.0 (even the ASCII control codes are mapped, for some reason), except for the CJK characters that are subject to Han unification in Unicode; those are mapped elsewhere in TRON without Han unification.

The codes are assigned linearly, starting with the A zone, followed by the B zone, the C zone, and the D zone.
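
One plausible reading of "linear" is sketched below. Everything beyond the zone order is an assumption of this example: the index is taken to start at 0, each zone is walked with the high byte as the slow-moving part, and the function name `linear_to_hilo` is made up. The actual assignment may differ in any of these details.

```python
LOW = range(0x21, 0x7F)    # 94 byte values: 0x21..0x7E
HIGH = range(0x80, 0xFE)   # 126 byte values: 0x80..0xFD

# (high-byte range, low-byte range) for the A, B, C and D zones, in that order.
ZONES = [(LOW, LOW), (HIGH, LOW), (LOW, HIGH), (HIGH, HIGH)]


def linear_to_hilo(index):
    """Map a 0-based index within one plane to (hi, lo). Assumed ordering."""
    if index < 0:
        raise ValueError("index must not be negative")
    for hi_range, lo_range in ZONES:
        size = len(hi_range) * len(lo_range)
        if index < size:
            row, col = divmod(index, len(lo_range))
            return hi_range[row], lo_range[col]
        index -= size
    raise ValueError("index does not fit in a single plane")
```

Under this reading one plane holds 8836 + 11844 + 11844 + 15876 = 48400 codes, which would explain why the Basic Multilingual Plane needs two planes.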

Planes 22 and 23

These planes are used for GB 18030.

Encodings

The most common encodings are probably TRON-16BE (called "TADTextBE" in the Ruby programming language) and TRON-16LE. Both use 16-bit code units. A code unit with 0xFE as the high byte is a plane selection, with the plane given by the low byte (0x21 to 0xFD, except 0x7F). The unit 0xFEFE instead moves on to a later volume: 0xFEFE followed by 0x0021 to 0x00FD (except 0x007F) selects a plane in the second volume, 0xFEFE followed by 0xFE21 to 0xFEFD selects a plane in the third volume, 0xFEFE 0xFEFE followed by a further unit selects a plane in the fourth volume, and so on. Control characters are represented as 0x0000 to 0x0020 and 0x007F. Codes 0xFF21 to 0xFF7E are special codes used by some applications, and TAD files also use 0xFF80 to 0xFFFD for segments.
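
To make the roles of the code units concrete, here is a rough sketch of a TRON-16BE reader that tracks the current plane. It is not a full decoder: it handles only the first volume, only plane selections with a low byte of 0x21 to 0x7E, and it lumps every 0xFFxx unit together as "special". The function name and the tuple layout are made up for the example.

```python
import struct


def _is_code_byte(b):
    # Bytes that may appear in a character code: 0x21..0x7E or 0x80..0xFD.
    return 0x21 <= b <= 0x7E or 0x80 <= b <= 0xFD


def decode_tron16be(data, default_plane=1):
    """Split TRON-16BE bytes into (kind, plane, unit) tuples (first volume only)."""
    if len(data) % 2:
        raise ValueError("TRON-16 data must have an even number of bytes")
    plane = default_plane
    items = []
    for (unit,) in struct.iter_unpack(">H", data):
        hi, lo = unit >> 8, unit & 0xFF
        if unit <= 0x0020 or unit == 0x007F:
            items.append(("control", None, unit))
        elif hi == 0xFE:
            if not 0x21 <= lo <= 0x7E:
                # 0xFEFE (volume escape) and selectors above 0x7E are left out.
                raise NotImplementedError("selector 0x%04X not handled" % unit)
            plane = lo - 0x20
        elif hi == 0xFF:
            items.append(("special", None, unit))
        elif _is_code_byte(hi) and _is_code_byte(lo):
            items.append(("char", plane, unit))
        else:
            raise ValueError("unexpected code unit 0x%04X" % unit)
    return items
```

For TRON-16LE, the same loop works with the format string "<H" in place of ">H".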

A less common (and apparently unofficial) encoding is TRON-8, which is TRON-16BE encoded without leading zeros. (As a result, it can be used for null-terminated strings in C.)
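
A minimal sketch of the conversion, taking "without leading zeros" to mean that a zero high byte of a 16-bit unit is simply not written. The function name is hypothetical.

```python
def tron16be_to_tron8(data):
    """Convert TRON-16BE bytes to TRON-8 by dropping zero high bytes."""
    if len(data) % 2:
        raise ValueError("TRON-16 data must have an even number of bytes")
    out = bytearray()
    for i in range(0, len(data), 2):
        hi, lo = data[i], data[i + 1]
        if hi != 0:
            out.append(hi)  # keep the high byte only when it carries information
        out.append(lo)
    return bytes(out)
```

When the input contains no null character (the unit 0x0000), this gives the same result as deleting every zero byte, because a low byte is never zero in that case.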

TRON-32BE (called "stateless-TADTextBE" in Ruby) and TRON-32LE work as follows: the low 16 bits are the code within the plane, the high 8 bits select the volume (where zero means the first volume), and the remaining 8 bits (bits 16 to 23) select the plane within that volume (0x21 to 0xFD, except 0x7F).
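
The bit layout can be sketched as follows. The function names are hypothetical, and `selector` here means the plane byte as it appears in the code (0x21 to 0xFD, except 0x7F), not the plane number.

```python
import struct


def tron32_pack(volume, selector, code):
    """Combine volume (0 = first), plane selector byte and 16-bit code."""
    if not 0 <= volume <= 0xFF:
        raise ValueError("volume must fit in 8 bits")
    if not 0x21 <= selector <= 0xFD or selector == 0x7F:
        raise ValueError("selector must be 0x21..0xFD, except 0x7F")
    if not 0 <= code <= 0xFFFF:
        raise ValueError("code must fit in 16 bits")
    return (volume << 24) | (selector << 16) | code


def tron32_unpack(value):
    """Split a TRON-32 value into (volume, selector, code)."""
    return value >> 24, (value >> 16) & 0xFF, value & 0xFFFF


def tron32be_bytes(value):
    """Serialize one TRON-32 value as big-endian bytes (use "<I" for TRON-32LE)."""
    return struct.pack(">I", value)
```

For example, code 0x2422 in plane 1 of the first volume becomes 0x00212422. Since the selector byte is at least 0x21, every packed value is at least 0x210000, which is above the UTF-32 maximum of 0x10FFFF.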

There is also the "&T" code, which is similar to the character entities in HTML and XML. It starts with &T, followed by the same hexadecimal code as in TRON-32 (but usually without leading zeros), and ends with a semicolon.
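
A sketch of writing and reading "&T" codes. The use of uppercase hexadecimal digits on output is an assumption of this example (the reader accepts both cases), and the function names are hypothetical.

```python
import re

# "&T", one to eight hex digits, then ";".
_T_CODE = re.compile(r"&T([0-9A-Fa-f]{1,8});")


def to_t_code(value):
    """Format a TRON-32 value as an &T code without leading zeros."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    return "&T%X;" % value


def from_t_code(text):
    """Parse a single &T code back into its TRON-32 value."""
    match = _T_CODE.fullmatch(text)
    if match is None:
        raise ValueError("not an &T code: %r" % text)
    return int(match.group(1), 16)
```

For example, the TRON-32 value 0x00212422 is written as `&T212422;`.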

Notes:

  • Usually "TRON code" means the TRON-16 encoding, if none other is specified.
  • The names "TRON-8", "TRON-16", and "TRON-32" do not seem to be official parts of the TRON project; the TRON-16 encoding itself is official, though not under that name.
  • The range of valid codepoints in TRON-32 does not overlap that of UTF-32 at all, so the two can be told apart unambiguously and can even be mixed, if this is somehow desirable.
  • If you have a TRON-16BE text without null characters, you can convert to TRON-8 by stripping out all null bytes. (If it is TRON-16LE, you can byte swap first and then do that.)
  • Text that explicitly specifies the plane can be distinguished from (but not necessarily mixed with) Unicode with byte order marks.
  • Because TRON-8 text without control characters never contains bytes in the ASCII control ranges 0x00 to 0x1F and 0x7F, it can be used in contexts that use those control codes for other purposes, without interference.

External resources

Commentary

Implementations

  • UTFTOVLQ (includes source code in C) - can convert between TRON-8, TRON-16, TRON-32, EUC-TRON (a superset of EUC-JP), and Shift-JIS.