Encode - character encodings in Perl
| Use Case | Command | Description |
|---|---|---|
| Encode string to octets | encode(ENCODING, STRING [, CHECK]) | Convert Perl string to a byte sequence in given encoding |
| Decode octets to string | decode(ENCODING, OCTETS [, CHECK]) | Convert byte sequence to Perl's internal string |
| Find encoding object | find_encoding(ENCODING) | Return encoding object for reuse (e.g., $obj->decode()) |
| Convert between encodings | from_to($octets, FROM_ENC, TO_ENC [, CHECK]) | In-place conversion of octet data |
| File I/O with encoding | open(FH, "< :encoding(ENCODING)", $file) | Automatically decode/encode via PerlIO layer |
| List loaded encodings | Encode->encodings() | Get list of canonical names of loaded encodings |
| List all encodings | Encode->encodings(":all") | Get all available encodings (including unloaded) |
use Encode qw(decode encode);
$characters = decode('UTF-8', $octets, Encode::FB_CROAK);
$octets = encode('UTF-8', $characters, Encode::FB_CROAK);
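The synopsis above can be exercised end to end. The following is a minimal sketch, assuming a five-octet UTF-8 input chosen for illustration; LEAVE_SRC is OR-ed into CHECK so that the source scalars survive the calls and can be compared afterwards.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative input: "caf" followed by U+00E9 encoded as UTF-8 (5 octets).
my $octets = "caf\xC3\xA9";

# Without LEAVE_SRC, a non-zero CHECK lets Encode overwrite the source scalar.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $characters = decode('UTF-8', $octets,     $check);
my $round_trip = encode('UTF-8', $characters, $check);

printf "octets: %d, characters: %d\n", length($octets), length($characters);
print "round trip ok\n" if $round_trip eq $octets;
```

The decoded string is one element shorter than the octet sequence because the two-octet sequence collapses into a single character.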
Encode consists of a collection of modules whose details are too extensive to fit in one document. This document covers the top-level APIs and general topics at a glance. For other topics and more details, see the documentation for the individual modules, such as Encode::Alias, Encode::Encoding, Encode::PerlIO, and Encode::Supported.
The Encode module provides the interface between Perl strings and the rest of the system. Perl strings are sequences of characters.
The repertoire of characters that Perl can represent is a superset of those defined by the Unicode Consortium. On most platforms the ordinal value of a character, as returned by ord(S), is the Unicode codepoint for that character. The exceptions are platforms where the legacy encoding is some variant of EBCDIC rather than a superset of ASCII; see perlebcdic.
In recent history, data has been moved around computers in 8-bit chunks, often called "bytes" but also known as "octets" in standards documents. Perl is widely used to manipulate data of many types: not only strings of characters representing human or computer languages, but also "binary" data, being the machine's representation of numbers, pixels in an image, or just about anything.
When Perl is processing "binary data", the programmer wants Perl to process "sequences of bytes". This is not a problem for Perl: because a byte has 256 possible values, it easily fits in Perl's much larger "logical character".
This document mostly explains the how. perlunitut and perlunifaq explain the why.
encode(ENCODING, STRING [, CHECK]) - Encodes scalar STRING from Perl's internal form into ENCODING and returns octets. ENCODING can be a canonical name or an alias. See "Handling Malformed Data" for CHECK. CAVEAT: input may be modified in-place unless LEAVE_SRC is set. If $string is undef, returns undef. Alias: str2bytes. Example:
$octets = encode("iso-8859-1", $string);
decode(ENCODING, OCTETS [, CHECK]) - Decodes octets in ENCODING into Perl's internal form. Same caveats as encode. Alias: bytes2str. Example:
$string = decode("iso-8859-1", $octets);
find_encoding(ENCODING) - Returns the encoding object for ENCODING, or undef if not found. The object can be reused for efficiency:
my $enc = find_encoding("iso-8859-1");
while(<>) { my $string = $enc->decode($_); ... }
Methods available include name() for the canonical name. See Encode::Encoding.
find_mime_encoding(MIME_ENCODING) - Like find_encoding but only matches valid MIME encoding names (case insensitive). Returns undef for invalid MIME names like "utf8".
from_to($octets, FROM_ENC, TO_ENC [, CHECK]) - In-place conversion between two encodings. $octets must be octets (not Perl characters). Returns the length of the converted string on success, undef on error. CAVEAT: does not respect $check during decoding. For full control, use decode then encode separately. Example:
from_to($octets, "iso-8859-1", "cp1250");
encode_utf8($string) - WARNING: may produce invalid UTF-8. Prefer encode("UTF-8", $string). Equivalent to encode("utf8", $string) (loose utf8). Cannot fail.
decode_utf8($octets [, CHECK]) - WARNING: accepts invalid UTF-8. Prefer decode("UTF-8", $octets [, CHECK]). Equivalent to decode("utf8", $octets [, CHECK]). May fail. CAVEAT: input may be modified in-place.
Encode->encodings() - Returns canonical names of loaded encodings.
Encode->encodings(":all") - Returns all available encodings (including unloaded).
Encode->encodings("Encode::JP") - Returns encodings from a specific module. If "::" is not in the name, "Encode::" is assumed.
See Encode::Supported for details.
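As an illustration of find_encoding and from_to, here is a small sketch. The Latin-1 sample data and the variable names are assumptions made up for the example; only the Encode calls come from the descriptions above.

```perl
use strict;
use warnings;
use Encode qw(find_encoding from_to);

# Look the encoding up once, then reuse the object for every line.
my $latin1 = find_encoding("iso-8859-1")
    or die "iso-8859-1 is not available";

# Hypothetical Latin-1 input lines (one octet per character).
my @raw_lines = ("na\xEFve\n", "r\xE9sum\xE9\n");
my @decoded   = map { $latin1->decode($_) } @raw_lines;

# from_to() rewrites its first argument in place and returns the
# length of the converted data in octets (undef on error).
my $octets = "r\xE9sum\xE9";                      # 6 octets of Latin-1
my $length = from_to($octets, "iso-8859-1", "UTF-8");

printf "converted length: %d\n", $length;         # each e-acute now takes 2 octets
```

Reusing the object avoids a name lookup on every call, which matters mostly in tight loops over many short strings.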
use Encode;
use Encode::Alias;
define_alias(NEWNAME => ENCODING);
After that, NEWNAME can be used as alias for ENCODING (name or encoding object). Check alias existence with resolve_alias:
Encode::resolve_alias("latin1") eq "iso-8859-1" # true
Encode::resolve_alias("iso-8859-12") # false; nonexistent
Encode::resolve_alias($name) eq $name # true if $name is canonical
resolve_alias can be imported via use Encode qw(resolve_alias). See Encode::Alias.
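A short sketch of defining and resolving an alias follows. The alias name "MyLatin" is invented for the example and is not a name Encode ships with.

```perl
use strict;
use warnings;
use Encode qw(encode resolve_alias);
use Encode::Alias qw(define_alias);

# "MyLatin" is a made-up alias used only for this illustration.
define_alias(MyLatin => "iso-8859-1");

my $canonical = resolve_alias("MyLatin");                # canonical name behind the alias
my $missing   = resolve_alias("no-such-encoding-xyz");   # false for unknown names

# Once defined, the alias works anywhere an encoding name is accepted.
my $octets = encode("MyLatin", "\x{E9}");

print "MyLatin resolves to $canonical\n";
print "unknown names resolve to nothing\n" unless $missing;
```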
Canonical names may not match IANA registry names (e.g., "utf-8-strict" vs "UTF-8"). Method mime_name() returns the proper IANA name:
use Encode;
my $enc = find_encoding("UTF-8");
warn $enc->name; # utf-8-strict
warn $enc->mime_name; # UTF-8
See Encode::Encoding.
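One place where the distinction matters is when an encoding name is written into a protocol header. The sketch below assumes a plain-text HTTP or mail style header; the header format is illustrative and not something Encode itself produces.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Accept whatever name the caller used, then emit the IANA name.
my $enc = find_encoding("latin1")
    or die "unknown encoding";

my $canonical = $enc->name;        # Encode's own canonical name
my $mime      = $enc->mime_name;   # name suitable for charset= parameters

my $header = sprintf "Content-Type: text/plain; charset=%s", $mime;
print "$header\n";
```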
Use :encoding(ENC) layer on filehandles for automatic encode/decode:
### Version 1 via PerlIO
open(INPUT, "< :encoding(shiftjis)", $infile)
|| die "Can't open < $infile for reading: $!";
open(OUTPUT, "> :encoding(euc-jp)", $outfile)
|| die "Can't open > $outfile for writing: $!";
while (<INPUT>) { # auto decodes $_
print OUTPUT; # auto encodes $_
}
### Version 2 via from_to()
open(INPUT, "< :raw", $infile) || die ...;
open(OUTPUT, "> :raw", $outfile) || die ...;
while (<INPUT>) {
from_to($_, "shiftjis", "euc-jp", 1);
print OUTPUT;
}
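The PerlIO approach can be tried without existing input files. This is a minimal sketch, assuming a temporary file created with File::Temp and two hiragana characters as sample text; it writes through an :encoding layer and reads the same file back.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary file used only for this demonstration.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

my $text = "\x{3042}\x{3044}\n";   # two hiragana characters plus a newline

# Writing: characters are encoded to EUC-JP octets by the layer.
open(my $out, ">:encoding(euc-jp)", $path)
    or die "Can't open > $path for writing: $!";
print {$out} $text;
close($out) or die "Can't close $path: $!";

# Reading: octets are decoded back to characters by the layer.
open(my $in, "<:encoding(euc-jp)", $path)
    or die "Can't open < $path for reading: $!";
my $line = <$in>;
close $in;

print "round trip through the file succeeded\n" if $line eq $text;
```

The program never calls encode or decode itself; the conversion happens at the filehandle boundary.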
Check if encoding supports PerlIO with perlio_ok:
Encode::perlio_ok("hz"); # false
find_encoding("euc-cn")->perlio_ok; # true (where available)
use Encode qw(perlio_ok); # imported upon request
perlio_ok("euc-jp")
All core encodings except "hz" and "ISO-2022-kr" are PerlIO-savvy. See Encode::Encoding and Encode::PerlIO.
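One way to use perlio_ok is to choose between the layer and a manual decode. The helper below, read_decoded_lines, is hypothetical and written only for this sketch; the fallback branch decodes the whole file at once on the assumption that a non-savvy encoding may be stateful.

```perl
use strict;
use warnings;
use Encode qw(decode perlio_ok);
use File::Temp qw(tempfile);

# Hypothetical helper: return the lines of $path decoded from $encoding.
sub read_decoded_lines {
    my ($path, $encoding) = @_;

    if (perlio_ok($encoding)) {
        # The encoding works as a PerlIO layer: let the layer decode.
        open(my $fh, "<:encoding($encoding)", $path)
            or die "Can't open < $path for reading: $!";
        my @lines = <$fh>;
        close $fh;
        return @lines;
    }

    # Otherwise read raw octets and decode them in one call.
    open(my $fh, "<:raw", $path)
        or die "Can't open < $path for reading: $!";
    my $raw = do { local $/; <$fh> };
    close $fh;
    return split /^/, decode($encoding, defined $raw ? $raw : '');
}

# Demonstration with a temporary two-line file.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "first\nsecond\n";
close $tmp_fh;

my @lines = read_decoded_lines($path, "UTF-8");
printf "read %d lines\n", scalar @lines;
```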
The optional CHECK argument controls behavior on malformed data. Default is Encode::FB_DEFAULT (== 0). As of version 2.12, coderef values are supported. Not all encodings support this; e.g., Encode::Unicode always croaks.
| Constant | Value | Behavior |
|---|---|---|
| FB_DEFAULT | 0 | Replace malformed character with substitution character (SUBCHAR on encode, U+FFFD on decode). Warns if UTF-8. |
| FB_CROAK | 1 | Die immediately with error message. |
| FB_QUIET | bitmask | Return processed portion on error; unprocessed data remains in argument. |
| FB_WARN | bitmask | Same as FB_QUIET but issues warning. Warnings are independent of pragma warnings; use Encode::ONLY_PRAGMA_WARNINGS to follow lexical warnings (since 2.99). |
| FB_PERLQQ | bitmask | Insert \xHH on decode, \x{HHHH} on encode. |
| FB_HTMLCREF | bitmask | Insert &#NNN; (decimal) on encode. |
| FB_XMLCREF | bitmask | Insert &#xHHHH; (hex) on encode. |
Bitmask breakdown:
| Flag | Hex | FB_DEFAULT | FB_CROAK | FB_QUIET | FB_WARN | FB_PERLQQ |
|---|---|---|---|---|---|---|
| DIE_ON_ERR | 0x0001 | | X | | | |
| WARN_ON_ERR | 0x0002 | | | | X | |
| RETURN_ON_ERR | 0x0004 | | | X | X | |
| LEAVE_SRC | 0x0008 | | | | | X |
| PERLQQ | 0x0100 | | | | | X |
| HTMLCREF | 0x0200 | | | | | |
| XMLCREF | 0x0400 | | | | | |
LEAVE_SRC: If this bit is not set and CHECK is non-zero, the source string passed to encode() or decode() is overwritten in place. Bitwise-OR LEAVE_SRC into CHECK to preserve the input.
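The effect of the different CHECK values is easiest to see on one malformed input. The following sketch assumes a UTF-8 decode of a string containing the octet 0xFF, which can never occur in UTF-8; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bad = "abc\xFFdef";    # 0xFF is never valid in UTF-8

# FB_DEFAULT (0): the bad octet becomes U+FFFD, the source is untouched.
my $lenient = decode("UTF-8", $bad);

# FB_CROAK: the call dies, so wrap it in eval.
my $copy    = $bad;
my $croaked = !eval { decode("UTF-8", $copy, Encode::FB_CROAK()); 1 };

# FB_QUIET: returns what was decoded before the error and leaves
# the unprocessed remainder in the source scalar.
my $rest = $bad;
my $head = decode("UTF-8", $rest, Encode::FB_QUIET());

# FB_QUIET | LEAVE_SRC: same return value, source preserved.
my $kept  = $bad;
my $head2 = decode("UTF-8", $kept, Encode::FB_QUIET() | Encode::LEAVE_SRC());

# FB_PERLQQ on encode: unmappable characters become \x{HHHH} escapes.
my $escaped = encode("ascii", "caf\x{E9}", Encode::FB_PERLQQ());

print "FB_CROAK died as expected\n" if $croaked;
printf "FB_QUIET decoded %d characters, %d octets left\n",
    length($head), length($rest);
```

Working on a copy, as done for FB_CROAK and FB_QUIET here, is an alternative to LEAVE_SRC when the caller wants to inspect the leftover data.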
As of version 2.12, CHECK can be a coderef. For encode: receives ordinal of unmapped character, returns octets for fallback.
$ascii = encode("ascii", $utf8, sub{ sprintf "<U+%04X>", shift });
For decode: receives list of ordinal values, returns decoded string.
$str = decode 'UTF-8', $octets, sub {
my $tmp = join '', map chr, @_;
return decode 'ISO-8859-15', $tmp;
};
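A complete run of the encode-side coderef might look like the sketch below. The sample text is an assumption; the callback is the same shape as the one shown above and is invoked once per character that ASCII cannot represent.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample text with two characters outside ASCII: U+00EF and U+263A.
my $text = "na\x{EF}ve \x{263A}";

# The callback receives the ordinal of each unmappable character
# and returns the octets to substitute for it.
my $ascii = encode("ascii", $text, sub { sprintf "<U+%04X>", shift });

print "$ascii\n";   # na<U+00EF>ve <U+263A>
```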
use Encode qw(define_encoding);
define_encoding($object, CANONICAL_NAME [, alias...]);
Associates $object with canonical name and optional aliases. See Encode::Encoding.
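To make the registration step concrete, here is a toy sketch. The class My::Rot13 and the name "x-rot13" are invented for the example, and the encode and decode methods ignore CHECK entirely; a real encoding class would follow the conventions described in Encode::Encoding.

```perl
use strict;
use warnings;
use Encode qw(define_encoding encode decode);

{
    package My::Rot13;                 # hypothetical toy encoding class
    use parent 'Encode::Encoding';     # supplies name() and other defaults

    # Rotate ASCII letters by 13 places; everything else passes through.
    sub _rot13 { (my $s = $_[0]) =~ tr/A-Za-z/N-ZA-Mn-za-m/; return $s }

    # This sketch ignores the CHECK argument.
    sub encode { my ($self, $string, $check) = @_; return _rot13($string) }
    sub decode { my ($self, $octets, $check) = @_; return _rot13($octets) }
}

# Register the object under a made-up canonical name.
define_encoding(bless({ Name => "x-rot13" }, "My::Rot13"), "x-rot13");

my $octets = encode("x-rot13", "Hello");
my $string = decode("x-rot13", $octets);

print "$octets -> $string\n";   # Uryyb -> Hello
```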
Before Perl 5.8, eq simply compared the strings held by two scalars. Since 5.8, eq compares them with the UTF8 flag taken into account; the rationale is discussed in Programming Perl, 3rd ed.
The UTF8 flag is not visible in scripts; you can peek with internal functions (see below).
is_utf8(STRING [, CHECK]) - [INTERNAL] Tests whether the UTF8 flag is on. If CHECK is true, also verifies that the string is well-formed UTF-8. Returns true or false. Do not use it to distinguish character data from binary data.
_utf8_on(STRING) - [INTERNAL] Turns the UTF8 flag on. Does NOT validate the content. Returns the previous state, or undef if STRING is not a string. Not for tainted values.
_utf8_off(STRING) - [INTERNAL] Turns the UTF8 flag off. Returns the previous state. Not for tainted values.
Historically, Perl used a loose interpretation of UTF-8 (allowing code points up to 32 bits and surrogates). Official UTF-8 is stricter (0..0x10_FFFF, no surrogates, no non-shortest encodings). As of Perl 5.8.7 and Encode 2.10, the name "UTF-8" selects the strict form (canonical name utf-8-strict), while "utf8" selects the loose form. Examples:
encode("utf8", "\x{FFFF_FFFF}", 1); # okay (loose)
encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks (strict)
find_encoding("UTF-8")->name # 'utf-8-strict'
find_encoding("utf-8")->name # ditto (case/underscore insensitive)
find_encoding("UTF8")->name # 'utf8'
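The difference between the two names can be checked with a code point just above the Unicode range. This is a sketch under the assumption that 0x110000 is used as the out-of-range value; eval captures the croak from the strict encoding.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

my $strict_name = find_encoding("UTF-8")->name;   # utf-8-strict
my $loose_name  = find_encoding("utf8")->name;    # utf8

# One code point past the top of the Unicode range (0x10_FFFF).
my $beyond = chr(0x110000);
my $check  = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Strict UTF-8 refuses the value; loose utf8 accepts it.
my $strict_ok = eval { encode("UTF-8", $beyond, $check); 1 } ? 1 : 0;
my $loose_ok  = eval { encode("utf8",  $beyond, $check); 1 } ? 1 : 0;

print "$strict_name accepted it: $strict_ok\n";
print "$loose_name accepted it: $loose_ok\n";
```

For data exchanged with other systems, the strict name is the safer default; the loose form is mainly useful for Perl-internal round trips.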
Encode::Encoding, Encode::Supported, Encode::PerlIO, encoding, perlebcdic, "open" in perlfunc, perlunicode, perluniintro, perlunifaq, perlunitut, utf8, the Perl Unicode Mailing List <http://lists.perl.org/list/perl-unicode.html>
This project was originated by the late Nick Ing-Simmons and later maintained by Dan Kogai <dankogai@cpan.org>. See AUTHORS for a full list of people involved. For any questions, send mail to <perl-unicode@perl.org> so that we can all share.
While Dan Kogai retains the copyright as a maintainer, credit should go to all those involved. See AUTHORS for a list of those who submitted code to the project.
Copyright 2002-2014 Dan Kogai <dankogai@cpan.org>.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.