{
    "mode": "man",
    "parameter": "charsets",
    "section": "7",
    "url": "https://www.chedong.com/phpMan.php/man/charsets/7/json",
    "generated": "2026-06-03T04:25:40Z",
    "sections": {
        "NAME": {
            "content": "charsets - character set standards and internationalization\n",
            "subsections": []
        },
        "DESCRIPTION": {
            "content": "This  manual  page  gives  an overview of different character set standards and how they were\nused on Linux before Unicode became ubiquitous.  Some of this information  is  still  helpful\nfor people working with legacy systems and documents.\n\nStandards discussed include ASCII, GB 2312, ISO 8859, JIS, KOI8-R, KS, and Unicode.\n\nThe  primary  emphasis is on character sets that were actually used as locale character sets,\nnot the myriad others that could be found in data from other systems.\n\nASCII\nASCII (American Standard Code for Information Interchange) is the  original  7-bit  character\nset,  originally designed for American English.  Also known as US-ASCII.  It is currently de‐\nscribed by the ISO 646:1991 IRV (International Reference Version) standard.\n\nVarious ASCII variants emerged that replaced the dollar sign with other currency symbols  and\nreplaced punctuation with non-English alphabetic characters to cover German, French, Spanish,\nand other languages in 7 bits.  All are deprecated; glibc does not support  locales  whose\ncharacter sets are not true supersets of ASCII.\n\nBecause UTF-8 is ASCII-compatible, plain ASCII text still renders properly on modern  systems\nthat use UTF-8.\n\nISO 8859\nISO 8859 is a series of 15 8-bit character sets, all of which have ASCII in their low (7-bit)\nhalf,  invisible  control  characters in positions 128 to 159, and 96 fixed-width graphics in\npositions 160–255.\n\nOf these, the most important is ISO 8859-1 (\"Latin Alphabet No. 1\" / Latin-1).  
It was widely\nadopted  and  supported  by  different systems, and is gradually being replaced with Unicode.\nThe ISO 8859-1 characters are also the first 256 characters of Unicode.\n\nConsole support for the other 8859 character sets is available under Linux through  user-mode\nutilities  (such  as setfont(8)) that modify keyboard bindings and the EGA graphics table and\nemploy the \"user mapping\" font table in the console driver.\n\nHere are brief descriptions of each set:\n\n8859-1 (Latin-1)\nLatin-1 covers many West European languages such as Albanian, Basque, Danish, English,\nFaroese,  Galician,  Icelandic,  Irish,  Italian,  Norwegian, Portuguese, Spanish, and\nSwedish.  The lack of the ligatures Dutch Ĳ/ĳ, French œ, and old-style „German“ quota‐\ntion marks was considered tolerable.\n\n8859-2 (Latin-2)\nLatin-2  supports  many  Latin-written  Central  and  East  European languages such as\nBosnian, Croatian, Czech, German, Hungarian, Polish, Slovak, and  Slovene.   Replacing\nRomanian ș/ț with ş/ţ was considered tolerable.\n\n8859-3 (Latin-3)\nLatin-3 was designed to cover Esperanto, Maltese, and Turkish, but  8859-9  later  su‐\nperseded it for Turkish.\n\n8859-4 (Latin-4)\nLatin-4 introduced letters for North European languages such as Estonian, Latvian, and\nLithuanian, but was superseded by 8859-10 and 8859-13.\n\n8859-5 Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian, Russian, Serbian, and\n(almost completely) Ukrainian.  It was  never  widely  used;  see  the  discussion  of\nKOI8-R/KOI8-U below.\n\n8859-6 Was  created  for  Arabic.   The 8859-6 glyph table is a fixed font of separate letter\nforms, but a proper display engine should combine these using the proper initial,  me‐\ndial, and final forms.\n\n8859-7 Was created for Modern Greek in 1987, updated in 2003.\n\n8859-8 Supports Modern Hebrew without niqud (vowel points).  
Niqud and full-fledged Bib‐\nlical Hebrew were outside the scope of this character set.\n\n8859-9 (Latin-5)\nThis is a variant of Latin-1 that replaces Icelandic letters with Turkish ones.\n\n8859-10 (Latin-6)\nLatin-6 added the Inuit (Greenlandic) and Sami (Lappish) letters that were missing  in\nLatin-4 to cover the entire Nordic area.\n\n8859-11\nSupports the Thai alphabet and is nearly identical to the TIS-620 standard.\n\n8859-12\nThis set does not exist.\n\n8859-13 (Latin-7)\nSupports  the  Baltic Rim languages; in particular, it includes Latvian characters not\nfound in Latin-4.\n\n8859-14 (Latin-8)\nThis is the Celtic character set, covering Old Irish, Manx,  Gaelic,  Welsh,  Cornish,\nand Breton.\n\n8859-15 (Latin-9)\nLatin-9  is  similar  to the widely used Latin-1 but replaces some less common symbols\nwith the Euro sign and French and Finnish letters that were missing in Latin-1.\n\n8859-16 (Latin-10)\nThis set covers many Southeast European languages, and most importantly supports Roma‐\nnian more completely than Latin-2.\n\nKOI8-R / KOI8-U\nKOI8-R is a non-ISO character set popular in Russia before Unicode.  The lower half is ASCII;\nthe upper is a Cyrillic character set somewhat better  designed  than  ISO  8859-5.   KOI8-U,\nbased  on  KOI8-R, has better support for Ukrainian.  Neither of these sets is ISO 2022  com‐\npatible, unlike the ISO 8859 series.\n\nConsole support for KOI8-R is available under Linux through user-mode utilities  that  modify\nkeyboard bindings and the EGA graphics table, and employ the \"user mapping\" font table in the\nconsole driver.\n\nGB 2312\nGB 2312 is a mainland Chinese national standard character set used to express simplified Chi‐\nnese.   Just like JIS X 0208, characters are mapped into a 94x94 two-byte matrix used to con‐\nstruct EUC-CN.  EUC-CN is the most important encoding for Linux and  includes  ASCII  and  GB\n2312.  Note that EUC-CN is often called GB, GB 2312, or CN-GB.\n",
            "subsections": [
                {
                    "name": "Big5",
                    "content": "Big5  was  a popular character set in Taiwan to express traditional Chinese.  (Big5 is both a\ncharacter set and an encoding.)  It is a superset of ASCII.   Non-ASCII  characters  are  ex‐\npressed  in  two  bytes.   Bytes 0xa1–0xfe are used as leading bytes for two-byte characters.\nBig5 and its extension were widely used in Taiwan and Hong Kong.  It is not ISO 2022  compli‐\nant.\n\nJIS X 0208\nJIS  X  0208 is a Japanese national standard character set.  Though there are some more Japa‐\nnese national standard character sets (like JIS X 0201, JIS X 0212, and JIS X 0213), this  is\nthe  most important one.  Characters are mapped into a 94x94 two-byte matrix, each byte of which\nis in the range 0x21–0x7e.  Note that JIS X 0208 is a character set, not an  encoding.   This\nmeans  that  JIS X 0208 itself is not used for expressing text data.  JIS X 0208 is used as a\ncomponent to construct encodings such as EUC-JP, ShiftJIS, and ISO-2022-JP.  EUC-JP  is  the\nmost  important  encoding for Linux and includes ASCII and JIS X 0208.  In EUC-JP, JIS X 0208\ncharacters are expressed in two bytes, each of which is the JIS X 0208 code plus 0x80.\n\nKS X 1001\nKS X 1001 is a Korean national standard character set.  Just as with JIS X 0208,  characters  are\nmapped  into  a  94x94 two-byte matrix.  KS X 1001 is used like JIS X 0208, as a component to\nconstruct encodings such as EUC-KR, Johab, and ISO-2022-KR.  EUC-KR is the most important en‐\ncoding for Linux and includes ASCII and KS X 1001.  KS C 5601 is an older name for KS X 1001.\n"
                },
                {
                    "name": "ISO 2022 and ISO 4873",
                    "content": "The  ISO 2022 and 4873 standards describe a font-control model based on VT100 practice.  This\nmodel is (partially) supported by the Linux kernel and by xterm(1).  Several  ISO  2022-based\ncharacter encodings have been defined, especially for Japanese.\n\nThere are 4 graphic character sets, called G0, G1, G2, and G3, and one of them is the current\ncharacter set for codes with high bit zero (initially G0), and one of  them  is  the  current\ncharacter  set for codes with high bit one (initially G1).  Each graphic character set has 94\nor 96 characters, and is essentially a 7-bit character set.  It uses  codes  either  040–0177\n(041–0176) or 0240–0377 (0241–0376).  G0 always has size 94 and uses codes 041–0176.\n\nSwitching  between character sets is done using the shift functions ^N (SO or LS1), ^O (SI or\nLS0), ESC n (LS2), ESC o (LS3), ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R),  ESC  |\n(LS3R).   The  function  LSn  makes  character set Gn the current one for codes with high bit\nzero.  The function LSnR makes character set Gn the current one for codes with high bit  one.\nThe  function  SSn  makes  character set Gn (n=2 or 3) the current one for the next character\nonly (regardless of the value of its high order bit).\n\nA 94-character set is designated as Gn character set by an escape sequence ESC ( xx (for G0),\nESC  )  xx  (for G1), ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol or a pair of\nsymbols found in the ISO 2375 International Register of Coded Character Sets.   For  example,\nESC  (  @  selects the ISO 646 character set as G0, ESC ( A selects the UK standard character\nset (with pound instead of number sign), ESC ( B selects ASCII (with dollar instead  of  cur‐\nrency  sign),  ESC  (  M selects a character set for African languages, ESC ( ! A selects the\nCuban character set, and so on.\n\nA 96-character set is designated as Gn character set by an escape sequence ESC - xx (for G1),\nESC  . 
xx (for G2) or ESC / xx (for G3).  For example, ESC - G selects the Hebrew alphabet as\nG1.\n\nA multibyte character set is designated as Gn character set by an escape sequence ESC $ xx or\nESC  $ ( xx (for G0), ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3).  For ex‐\nample, ESC $ ( C selects the Korean character set for G0.  The  Japanese  character  set  se‐\nlected by ESC $ B has a more recent version selected by ESC & @ ESC $ B.\n\nISO  4873  stipulates  a narrower use of character sets, where G0 is fixed (always ASCII), so\nthat G1, G2 and G3 can be invoked only for codes with the high order bit set.  In particular,\n^N  and ^O are not used anymore, ESC ( xx can be used only with xx=B, and ESC ) xx, ESC * xx,\nESC + xx are equivalent to ESC - xx, ESC . xx, ESC / xx, respectively.\n\nTIS-620\nTIS-620 is a Thai national standard character set and a superset of ASCII.  In the same fash‐\nion as the ISO 8859 series, Thai characters are mapped into 0xa1–0xfe.\n"
                },
                {
                    "name": "Unicode",
                    "content": "Unicode  (ISO  10646)  is a standard which aims to unambiguously represent every character in\nevery human language.  Unicode's structure permits  20.1  bits  to  encode  every  character.\nSince  most  computers  don't include 20.1-bit integers, Unicode is usually encoded as 32-bit\nintegers internally and either a series of 16-bit integers (UTF-16) (needing two 16-bit inte‐\ngers only when encoding certain rare characters) or a series of 8-bit bytes (UTF-8).\n\nLinux  represents  Unicode using the 8-bit Unicode Transformation Format (UTF-8).  UTF-8 is a\nvariable length encoding of Unicode.  It uses 1 byte to code 7 bits, 2 bytes for 11  bits,  3\nbytes for 16 bits, 4 bytes for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits.\n\nLet  0,1,x  stand  for a zero, one, or arbitrary bit.  A byte 0xxxxxxx stands for the Unicode\n00000000 0xxxxxxx which codes the same symbol as the ASCII 0xxxxxxx.  Thus,  ASCII  goes  un‐\nchanged  into  UTF-8,  and people using only ASCII do not notice any change: not in code, and\nnot in file size.\n\nA byte 110xxxxx is the start of a 2-byte  code,  and  110xxxxx  10yyyyyy  is  assembled  into\n00000xxx  xxyyyyyy.   A  byte  1110xxxx  is the start of a 3-byte code, and 1110xxxx 10yyyyyy\n10zzzzzz is assembled into xxxxyyyy yyzzzzzz.  (When UTF-8 is used to  code  the  31-bit  ISO\n10646 then this progression continues up to 6-byte codes.)\n\nFor  most  texts  in ISO 8859 character sets, this means that the characters outside of ASCII\nare now coded with two bytes.  This tends to expand ordinary text files by only  one  or  two\npercent.  For Russian or Greek texts, this expands ordinary text files by 100%, since text in\nthose languages is mostly outside of ASCII.  For Japanese users this means  that  the  16-bit\ncodes  now in common use will take three bytes.  
While there are algorithmic conversions from\nsome character sets (especially ISO 8859-1) to Unicode, general conversion requires  carrying\naround conversion tables, which can be quite large for 16-bit codes.\n\nNote  that  UTF-8  is self-synchronizing: 10xxxxxx is a tail, any other byte is the head of a\ncode.  Note that the only way ASCII bytes occur in a UTF-8 stream is as themselves.  In  par‐\nticular, there are no embedded NULs ('\\0') or '/'s that form part of some larger code.\n\nSince  ASCII, and, in particular, NUL and '/', are unchanged, the kernel does not notice that\nUTF-8 is being used.  It does not care at all what the bytes it is handling stand for.\n\nRendering of Unicode data streams is typically handled through \"subfont\" tables which  map  a\nsubset  of  Unicode  to  glyphs.   Internally the kernel uses Unicode to describe the subfont\nloaded in video RAM.  This means that in the Linux console in UTF-8 mode, one can use a char‐\nacter  set with 512 different symbols.  This is not enough for Japanese, Chinese, and Korean,\nbut it is enough for most other purposes.\n"
                }
            ]
        },
        "SEE ALSO": {
            "content": "iconv(1), ascii(7), iso8859-1(7), unicode(7), utf-8(7)\n",
            "subsections": []
        },
        "COLOPHON": {
            "content": "This page is part of release 5.10 of the Linux  man-pages  project.   A  description  of  the\nproject,  information about reporting bugs, and the latest version of this page, can be found\nat https://www.kernel.org/doc/man-pages/.\n\n\n\nLinux                                        2020-08-13                                  CHARSETS(7)",
            "subsections": []
        }
    },
    "summary": "charsets - character set standards and internationalization",
    "flags": [],
    "examples": [],
    "see_also": [
        {
            "name": "iconv",
            "section": "1",
            "url": "https://www.chedong.com/phpMan.php/man/iconv/1/json"
        },
        {
            "name": "ascii",
            "section": "7",
            "url": "https://www.chedong.com/phpMan.php/man/ascii/7/json"
        },
        {
            "name": "iso8859-1",
            "section": "7",
            "url": "https://www.chedong.com/phpMan.php/man/iso8859-1/7/json"
        },
        {
            "name": "unicode",
            "section": "7",
            "url": "https://www.chedong.com/phpMan.php/man/unicode/7/json"
        },
        {
            "name": "utf-8",
            "section": "7",
            "url": "https://www.chedong.com/phpMan.php/man/utf-8/7/json"
        }
    ]
}