{ "content": [ { "type": "text", "text": "# Text::Unidecode (perldoc)\n\n## NAME\n\nText::Unidecode -- plain ASCII transliterations of Unicode text\n\n## SYNOPSIS\n\nuse utf8;\nuse Text::Unidecode;\nprint unidecode(\n\"北亰\\n\"\n# Chinese characters for Beijing (U+5317 U+4EB0)\n);\n# That prints: Bei Jing\n\n## DESCRIPTION\n\nIt often happens that you have non-Roman text data in Unicode, but you can't display it--\nusually because you're trying to show it to a user via an application that doesn't support\nUnicode, or because the fonts you need aren't accessible. You could represent the Unicode\ncharacters as \"???????\" or \"\\15BA\\15A0\\1610...\", but that's nearly useless to the user who\nactually wants to read what the text says.\n\n## Sections\n\n- **NAME**\n- **SYNOPSIS**\n- **DESCRIPTION**\n- **DESIGN PHILOSOPHY**\n- **FUNCTIONS**\n- **DESIGN GOALS AND CONSTRAINTS**\n- **A POD ENCODING TEST**\n- **TODO**\n- **MOTTO**\n- **CAVEATS**\n- **THANKS**\n- **PORTS**\n- **SEE ALSO**\n- **LICENSE**\n- **DISCLAIMER**\n- **AUTHOR**\n\nUse structuredContent.sections for detailed options, examples, and full documentation.\n" } ], "structuredContent": { "command": "Text::Unidecode", "section": "", "mode": "perldoc", "summary": "Text::Unidecode -- plain ASCII transliterations of Unicode text", "synopsis": "use utf8;\nuse Text::Unidecode;\nprint unidecode(\n\"北亰\\n\"\n# Chinese characters for Beijing (U+5317 U+4EB0)\n);\n# That prints: Bei Jing", "tldr_summary": null, "tldr_examples": [], "tldr_source": null, "flags": [], "examples": [], "see_also": [], "section_outline": [ { "name": "NAME", "lines": 2, "subsections": [] }, { "name": "SYNOPSIS", "lines": 9, "subsections": [] }, { "name": "DESCRIPTION", "lines": 19, "subsections": [] }, { "name": "DESIGN PHILOSOPHY", "lines": 30, "subsections": [] }, { "name": "FUNCTIONS", "lines": 54, "subsections": [] }, { "name": "DESIGN GOALS AND CONSTRAINTS", "lines": 60, "subsections": [] }, { "name": "A POD ENCODING TEST", "lines": 26, 
"subsections": [] }, { "name": "TODO", "lines": 18, "subsections": [] }, { "name": "MOTTO", "lines": 108, "subsections": [] }, { "name": "CAVEATS", "lines": 11, "subsections": [] }, { "name": "THANKS", "lines": 11, "subsections": [] }, { "name": "PORTS", "lines": 15, "subsections": [] }, { "name": "SEE ALSO", "lines": 23, "subsections": [] }, { "name": "LICENSE", "lines": 10, "subsections": [] }, { "name": "DISCLAIMER", "lines": 14, "subsections": [] }, { "name": "AUTHOR", "lines": 6, "subsections": [] } ], "sections": { "NAME": { "content": "Text::Unidecode -- plain ASCII transliterations of Unicode text\n", "subsections": [] }, "SYNOPSIS": { "content": "use utf8;\nuse Text::Unidecode;\nprint unidecode(\n\"北亰\\n\"\n# Chinese characters for Beijing (U+5317 U+4EB0)\n);\n\n# That prints: Bei Jing\n", "subsections": [] }, "DESCRIPTION": { "content": "It often happens that you have non-Roman text data in Unicode, but you can't display it--\nusually because you're trying to show it to a user via an application that doesn't support\nUnicode, or because the fonts you need aren't accessible. You could represent the Unicode\ncharacters as \"???????\" or \"\\15BA\\15A0\\1610...\", but that's nearly useless to the user who\nactually wants to read what the text says.\n\nWhat Text::Unidecode provides is a function, \"unidecode(...)\" that takes Unicode data and tries\nto represent it in US-ASCII characters (i.e., the universally displayable characters between\n0x00 and 0x7F). The representation is almost always an attempt at *transliteration*-- i.e.,\nconveying, in Roman letters, the pronunciation expressed by the text in some other writing\nsystem. 
(See the example in the synopsis.)\n\nNOTE:\n\nTo make sure your perldoc/Pod viewing setup for viewing this page is working: The six-letter\nword \"résumé\" should look like \"resume\" with an \"/\" accent on each \"e\".\n\nFor further tests, and help if that doesn't work, see below, \"A POD ENCODING TEST\".\n", "subsections": [] }, "DESIGN PHILOSOPHY": { "content": "Unidecode's ability to transliterate from a given language is limited by two factors:\n\n* The amount and quality of data in the written form of the original language\n\nSo if you have Hebrew data that has no vowel points in it, then Unidecode cannot guess what\nvowels should appear in a pronunciation. S f y hv n vwls n th npt, y wn't gt ny vwls n th\ntpt. (This is a specific application of the general principle of \"Garbage In, Garbage Out\".)\n\n* Basic limitations in the Unidecode design\n\nWriting a real and clever transliteration algorithm for any single language usually requires\na lot of time, and at least a passable knowledge of the language involved. But Unicode text\ncan convey more languages than I could possibly learn (much less create a transliterator\nfor) in the entire rest of my lifetime. So I put a cap on how intelligent Unidecode could\nbe, by insisting that it support only context-*in*sensitive transliteration. That means\nmissing the finer details of any given writing system, while still hopefully being useful.\n\nUnidecode, in other words, is quick and dirty. Sometimes the output is not so dirty at all:\nRussian and Greek seem to work passably; and while Thaana (Divehi, AKA Maldivian) is a\ndefinitely non-Western writing system, setting up a mapping from it to Roman letters seems to\nwork pretty well. 
But sometimes the output is *very dirty:* Unidecode does quite badly on\nJapanese and Thai.\n\nIf you want a smarter transliteration for a particular language than Unidecode provides, then\nyou should look for (or write) a transliteration algorithm specific to that language, and apply\nit instead of (or at least before) applying Unidecode.\n\nIn other words, Unidecode's approach is broad (knowing about dozens of writing systems), but\nshallow (not being meticulous about any of them).\n", "subsections": [] }, "FUNCTIONS": { "content": "Text::Unidecode provides one function, \"unidecode(...)\", which is exported by default. It can be\nused in a variety of calling contexts:\n\n\"$out = unidecode( $in );\" # scalar context\nThis returns a copy of $in, transliterated.\n\n\"$out = unidecode( @in );\" # scalar context\nThis is the same as \"$out = unidecode(join \"\", @in);\"\n\n\"@out = unidecode( @in );\" # list context\nThis returns a list consisting of copies of @in, each transliterated. This is the same as\n\"@out = map scalar(unidecode($_)), @in;\"\n\n\"unidecode( @items );\" # void context\n\"unidecode( @bar, $foo, @baz );\" # void context\nEach item on input is replaced with its transliteration. This is the same as \"for(@bar,\n$foo, @baz) { $_ = unidecode($_) }\"\n\nYou should make a minimum of assumptions about the output of \"unidecode(...)\". For example, if\nyou assume an all-alphabetic (Unicode) string passed to \"unidecode(...)\" will return an\nall-alphabetic string, you're wrong-- some alphabetic Unicode characters are transliterated as\nstrings containing punctuation (e.g., the Armenian letter \"Թ\" (U+0539) currently transliterates\nas \"T`\": capital-T then a backtick).\n\nHowever, these are the assumptions you *can* make:\n\n* Each character 0x0000 - 0x007F transliterates as itself. 
That is, \"unidecode(...)\" is 7-bit\npure.\n\n* The output of \"unidecode(...)\" always consists entirely of US-ASCII characters-- i.e.,\ncharacters 0x0000 - 0x007F.\n\n* All Unicode characters translate to a sequence of (any number of) characters that are\nnewline (\"\\n\") or in the range 0x0020-0x007E. That is, no Unicode character translates to\n\"\\x01\", for example. (Although if you have a \"\\x01\" on input, you'll get a \"\\x01\" in\noutput.)\n\n* Yes, some transliterations produce a \"\\n\" but it's just a few, and only with good reason.\nNote that the value of newline (\"\\n\") varies from platform to platform-- see perlport.\n\n* Some Unicode characters may transliterate to nothing (i.e., the empty string).\n\n* Very many Unicode characters transliterate to multi-character sequences. E.g., Unihan\ncharacter U+5317, \"北\", transliterates as the four-character string \"Bei \".\n\n* Within these constraints, *I may change* the transliteration of characters in future\nversions. For example, if someone convinces me that the Armenian letter \"Թ\", currently\ntransliterated as \"T`\", would be better transliterated as \"D\", I *may* well make that\nchange.\n\n* Unfortunately, there are many characters that Unidecode doesn't know a transliteration for.\nThis is generally because the character has been added since I last revised the Unidecode\ndata tables. 
I'm *always* catching up!\n", "subsections": [] }, "DESIGN GOALS AND CONSTRAINTS": { "content": "Text::Unidecode is meant to be a transliterator of last resort, to be used once you've decided\nthat you can't just display the Unicode data as is, *and once you've decided you don't have a\nmore clever, language-specific transliterator available,* or once you've *already applied*\nsmarter algorithms or mappings that you prefer and you now just want Unidecode to do cleanup.\n\nUnidecode transliterates context-insensitively-- that is, a given character is replaced with the\nsame US-ASCII (7-bit ASCII) character or characters, no matter what the surrounding characters\nare.\n\nThe main reason I'm making Text::Unidecode work with only context-insensitive substitution is\nthat it's fast, dumb, and straightforward enough to be feasible. It doesn't tax my (quite\nlimited) knowledge of world languages. It doesn't require me writing a hundred lines of code to\nget the Thai syllabification right (and never knowing whether I've gotten it wrong, because I\ndon't know Thai), or spending a year trying to get Text::Unidecode to use the ChaSen algorithm\nfor Japanese, or trying to write heuristics for telling the difference between Japanese,\nChinese, or Korean, so it knows how to transliterate any given Uni-Han glyph. 
And moreover,\ncontext-insensitive substitution is still mostly useful, but still clearly couldn't be mistaken\nfor authoritative.\n\nText::Unidecode is an example of the 80/20 rule in action-- you get 80% of the usefulness using\njust 20% of a \"real\" solution.\n\nA \"real\" approach to transliteration for any given language can involve such increasingly tricky\ncontextual factors as these:\n\nThe previous / preceding character(s)\nWhat a given symbol \"X\" means, could depend on whether it's followed by a consonant, or by\nvowel, or by some diacritic character.\n\nSyllables\nA character \"X\" at end of a syllable could mean something different from when it's at the\nstart-- which is especially problematic when the language involved doesn't explicitly mark\nwhere one syllable stops and the next starts.\n\nParts of speech\nWhat \"X\" sounds like at the end of a word, depends on whether that word is a noun, or a\nverb, or what.\n\nMeaning\nBy semantic context, you can tell that this ideogram \"X\" means \"shoe\" (pronounced one way)\nand not \"time\" (pronounced another), and that's how you know to transliterate it one way\ninstead of the other.\n\nOrigin of the word\n\"X\" means one thing in loanwords and/or placenames (and derivatives thereof), and another in\nnative words.\n\n\"It's just that way\"\n\"X\" normally makes the /X/ sound, except for this list of seventy exceptions (and words\nbased on them, sometimes indirectly). 
Or: you never can tell which of the three ways to\npronounce \"X\" this word actually uses; you just have to know which it is, so keep a\ndictionary on hand!\n\nLanguage\nThe character \"X\" is actually used in several different languages, and you have to figure\nout which you're looking at before you can determine how to transliterate it.\n\nOut of a desire to avoid being mired in *any* of these kinds of contextual factors, I chose to\nexclude *all of them* and just stick with context-insensitive replacement.\n", "subsections": [] }, "A POD ENCODING TEST": { "content": "* \"Brontë\" is six characters that should look like \"Bronte\", but with double-dots on the \"e\"\ncharacter.\n\n* \"Résumé\" is six characters that should look like \"Resume\", but with /-shaped accents on the\n\"e\" characters.\n\n* \"læti\" should be *four* letters long-- the second letter should not be two letters \"ae\", but\nshould be a single letter that looks like an \"a\" entirely fused with an \"e\".\n\n* \"χρονος\" is six Greek characters that should look kind of like: xpovoc\n\n* \"КАК ВАС ЗОВУТ\" is three short Russian words that should look a lot like: KAK BAC 3OBYT\n\n* \"ടധ\" is two Malayalam characters that should look like: sw\n\n* \"丫二十一\" is four Chinese characters that should look like: \"Y=+-\"\n\n* \"Ｈｅｌｌｏ\" is five characters that should look like: Hello\n\nIf all of those come out right, your Pod viewing setup is working fine-- welcome to the 2010s!\nIf those are full of garbage characters, consider viewing this page as HTML at\n or \n\nIf things look mostly okay, but the Malayalam and/or the Chinese are just question-marks or\nempty boxes, it's probably just that your computer lacks the fonts for those.\n", "subsections": [] }, "TODO": { "content": "Lots:\n\n* Rebuild the Unihan database. (Talk about hitting a moving target!)\n\n* Add tone-numbers for Mandarin hanzi? 
Namely: In Unihan, when tone marks are present (like in\n\"kMandarin: dào\"), should I continue to transliterate as just \"Dao\", or should I put in the tone\nnumber: \"Dao4\"? It would be pretty jarring to have digits appear where previously there was just\nalphabetic stuff-- but tone numbers make Chinese more readable. (I have a clever idea about\ndoing this, for Unidecode v2 or v3.)\n\n* Start dealing with characters over U+FFFF. Cuneiform! Emojis! Whatever!\n\n* Fill in all the little characters that have crept into the Misc Symbols Etc blocks.\n\n* More things that need tending to are detailed in the TODO.txt file, included in this\ndistribution. Normal installs probably don't leave the TODO.txt lying around, but if nothing\nelse, you can see it at \n", "subsections": [] }, "MOTTO": { "content": "The Text::Unidecode motto is:\n\nIt's better than nothing!\n\n...in *both* meanings: 1) seeing the output of \"unidecode(...)\" is better than just having all\nfont-unavailable Unicode characters replaced with \"?\"'s, or rendered as gibberish; and 2) it's\nthe worst, i.e., there's nothing that Text::Unidecode's algorithm is better than. All sensible\ntransliteration algorithms (like for German, see below) are going to be smarter than\nUnidecode's.\n\nWHEN YOU DON'T LIKE WHAT UNIDECODE DOES\nI will repeat the above, because some people miss it:\n\nText::Unidecode is meant to be a transliterator of *last resort,* to be used once you've decided\nthat you can't just display the Unicode data as is, *and once you've decided you don't have a\nmore clever, language-specific transliterator available*-- or once you've *already applied* a\nsmarter algorithm and now just want Unidecode to do cleanup.\n\nIn other words, when you don't like what Unidecode does, *do it yourself.* Really, that's what\nthe above says. 
Here's how you would do this for German, for example:\n\nIn German, there's the typographical convention that an umlaut (the double-dots on: ä ö ü) can\nbe written as an \"-e\", like with \"Schön\" becoming \"Schoen\". But Unidecode doesn't do that-- I\nhave Unidecode simply drop the umlaut accent and give back \"Schon\".\n\n(I chose this not because I'm a big meanie, but because *generally* changing \"ü\" to \"ue\" is\ndisastrous for all text that's *not in German*. Finnish \"Hyvää päivää\" would turn into \"Hyvaeae\npaeivaeae\". And I discourage you from being *yet another* German who emails me, trying to impel\nme to consider a typographical nicety of German to be more important than *all other\nlanguages*.)\n\nIf you know that the text you're handling is probably in German, and you want to apply the\n\"umlaut becomes -e\" rule, here's how to do it for yourself (and then use Unidecode as *the\nfallback* afterwards):\n\nuse utf8; # <-- probably necessary.\n\nour( %GermanCharacters ) = qw(\nÄ AE ä ae\nÖ OE ö oe\nÜ UE ü ue\nß ss\n);\n\nuse Text::Unidecode qw(unidecode);\n\nsub germantoascii {\n    my($germantext) = @_;\n\n    $germantext =~\n        s/([ÄäÖöÜüß])/$GermanCharacters{$1}/g;\n\n    # And now, as a *fallthrough*:\n    $germantext = unidecode( $germantext );\n    return $germantext;\n}\n\nTo pick another example, here's something that's not about a specific language, but simply\nhaving a preference that may or may not agree with Unidecode's (i.e., mine). Consider the \"¥\"\nsymbol. Unidecode changes that to \"Y=\". 
If you want \"¥\" as \"YEN\", then...\n\nuse utf8; # needed because this source contains literal ¥ and € characters\nuse Text::Unidecode qw(unidecode);\n\nsub myfavoriteunidecode {\n    my($text) = @_;\n\n    $text =~ s/¥/YEN/g;\n\n    # ...and anything else you like, such as:\n    $text =~ s/€/Euro/g;\n\n    # And then, as a fallback,...\n    $text = unidecode($text);\n\n    return $text;\n}\n\nThen if you do:\n\nprint myfavoriteunidecode(\"You just won ¥250,000 and €40,000!!!\");\n\n...you'll get:\n\nYou just won YEN250,000 and Euro40,000!!!\n\n...just as you like it.\n\n(By the way, the reason *I* don't have Unidecode just turn \"¥\" into \"YEN\" is that the same\nsymbol also stands for yuan, the Chinese currency. A \"Y=\" is nicely, *safely* neutral as to\nwhether we're talking about yen or yuan-- Japan, or China.)\n\nAnother example: for hanzi/kanji/hanja, I have designed Unidecode to transliterate according to\nthe value that that character has in Mandarin (otherwise Cantonese,...). Some users have\ncomplained that applying Unidecode to Japanese produces gibberish.\n\nTo make a long story short: transliterating from Japanese is *difficult* and it requires a *lot*\nof context-sensitivity. If you have text that you're fairly sure is in Japanese, you're going to\nhave to use a Japanese-specific algorithm to transliterate Japanese into ASCII. (And then you\ncan call Unidecode on the output from that-- it is useful for, for example, turning ｆｕｌｌｗｉｄｔｈ\ncharacters into their normal (ASCII) forms.)\n\n(Note, as of August 2016: I have titanic but tentative plans for making the value of Unihan\ncharacters be something you could set parameters for at runtime, in changing the order of\n\"Mandarin else Cantonese else...\" in the value retrieval. Currently that preference list is\nhardwired on my end, at module-build time. 
Other options I'm considering allowing for: whether\nthe Mandarin and Cantonese values should have the tone numbers on them; whether every Unihan\nvalue should have a terminal space; and maybe other clever stuff I haven't thought of yet.)\n", "subsections": [] }, "CAVEATS": { "content": "If you get really implausible nonsense out of \"unidecode(...)\", make sure that the input data\nreally is a utf8 string. See perlunicode and perlunitut.\n\n*Unidecode will work disastrously badly on Japanese.* That's because Japanese is very, very hard.\nTo extend the Unidecode motto, Unidecode is better than nothing, and with Japanese, *just\nbarely!*\n\nOn pure Mandarin, Unidecode will frequently give odd values-- that's because a single hanzi can\nhave several readings, and Unidecode only knows what the Unihan database says is the most common\none.\n", "subsections": [] }, "THANKS": { "content": "Thanks to (in only the sloppiest of sorta-chronological order): Jordan Lachler, Harald Tveit\nAlvestrand, Melissa Axelrod, Abhijit Menon-Sen, Mark-Jason Dominus, Joe Johnston, Conrad Heiney,\nfileformat.info, Philip Newton, 唐鳳, Tomaž Šolc, Mike Doherty, JT Smith and the MadMongers, Arden\nOgg, Craig Copris, David Cusimano, Brendan Byrd, Hex Martin, and *many* other pals who have\nhelped with the ideas or values for Unidecode's transliterations, or whose help has been in the\nsecret F5 tornado that constitutes the internals of Unidecode's implementation.\n\nAnd thank you to the many people who have encouraged me to plug away at this project. A decade\nwent by before I had any idea that more than about 4 or 5 people were using or getting any value\nout of Unidecode. 
I am told that actually my figure was missing some zeroes on the end!\n", "subsections": [] }, "PORTS": { "content": "Some wonderful people have ported Unidecode to other languages!\n\n* Python: \n\n* PHP: \n\n* Ruby: \n\n* JavaScript: \n\n* Java: \n\nI can't vouch for the details of each port, but these are clever people, so I'm sure they did a\nfine job.\n", "subsections": [] }, "SEE ALSO": { "content": "An article I wrote for *The Perl Journal* about Unidecode: \n(READ IT!)\n\nJukka Korpela's which is brilliantly useful, and its\ncode is brilliant (so, view source!). I was *kinda* thinking about maybe doing something *sort\nof* like that for the v2.x versions of Unidecode-- but now he's got me convinced that I should go\nright ahead.\n\nTom Christiansen's *Perl Unicode Cookbook*,\n\n\nUnicode Consortium: \n\nSearchable Unihan database: \n\nGeoffrey Sampson. 1990. *Writing Systems: A Linguistic Introduction.* ISBN: 0804717567\n\nRandall K. Barry (editor). 1997. *ALA-LC Romanization Tables: Transliteration Schemes for\nNon-Roman Scripts.* ISBN: 0844409405 [ALA is the American Library Association; LC is the Library\nof Congress.]\n\nRupert Snell. 2000. *Beginner's Hindi Script (Teach Yourself Books).* ISBN: 0658009109\n", "subsections": [] }, "LICENSE": { "content": "Copyright (c) 2001, 2014, 2015, 2016 Sean M. Burke.\n\nUnidecode is distributed under the Perl Artistic License ( perlartistic ), namely:\n\nThis library is free software; you can redistribute it and/or modify it under the same terms as\nPerl itself.\n\nThis program is distributed in the hope that it will be useful, but without any warranty;\nwithout even the implied warranty of merchantability or fitness for a particular purpose.\n", "subsections": [] }, "DISCLAIMER": { "content": "Much of Text::Unidecode's internal data is based on data from The Unicode Consortium, with which\nI am unaffiliated. 
A good deal of the internal data comes from suggestions that have been\ncontributed by people other than myself.\n\nThe views and conclusions contained in my software and documentation are my own-- they should\nnot be interpreted as representing official policies, either expressed or implied, of The\nUnicode Consortium; nor should they be interpreted as necessarily the views or conclusions of\npeople who have contributed to this project.\n\nMoreover, I discourage you from inferring that choices that I've made in Unidecode reflect\npolitical or linguistic prejudices on my part. Just because Unidecode doesn't do great on your\nlanguage, or just because it might seem to do better on some other language, please don't\nthink I'm out to get you!\n", "subsections": [] }, "AUTHOR": { "content": "Your pal, Sean M. Burke \"sburke@cpan.org\"\n\nO HAI!\nIf you're using Unidecode for anything interesting, be cool and email me; I'm always curious\nwhat people use this for. (The answers so far have surprised me!)\n", "subsections": [] } } } }