🟢 NAME

Encode - character encodings in Perl

🚀 Quick Reference

| Use Case | Command | Description |
|---|---|---|
| 🔤 Encode string to octets | `encode(ENCODING, STRING [, CHECK])` | Convert Perl string to a byte sequence in given encoding |
| 🔓 Decode octets to string | `decode(ENCODING, OCTETS [, CHECK])` | Convert byte sequence to Perl's internal string |
| 🔍 Find encoding object | `find_encoding(ENCODING)` | Return encoding object for reuse (e.g., `$obj->decode()`) |
| 🔁 Convert between encodings | `from_to($octets, FROM_ENC, TO_ENC [, CHECK])` | In‑place conversion of octet data |
| 📂 File I/O with encoding | `open(FH, "< :encoding(ENCODING)", $file)` | Automatically decode/encode via PerlIO layer |
| 📋 List loaded encodings | `Encode->encodings()` | Get list of canonical names of loaded encodings |
| 🌐 List all encodings | `Encode->encodings(":all")` | Get all available encodings (including unloaded) |

📖 SYNOPSIS

use Encode qw(decode encode);
$characters = decode('UTF-8', $octets,     Encode::FB_CROAK);
$octets     = encode('UTF-8', $characters, Encode::FB_CROAK);

📖 Table of Contents

Encode consists of a collection of modules whose details are too extensive to fit in one document. This document covers the top-level APIs and general topics at a glance. For other topics and more details, see the documentation for these modules:

Encode::Alias - Alias definitions to encodings
Encode::Encoding - Encode Implementation Base Class
Encode::Supported - List of Supported Encodings
Encode::CN - Simplified Chinese Encodings
Encode::JP - Japanese Encodings
Encode::KR - Korean Encodings
Encode::TW - Traditional Chinese Encodings

📖 DESCRIPTION

The Encode module provides the interface between Perl strings and the rest of the system. Perl strings are sequences of characters.

The repertoire of characters that Perl can represent is a superset of those defined by the Unicode Consortium. On most platforms the ordinal value of a character as returned by ord(S) is the Unicode codepoint for that character. The exceptions are platforms where the legacy encoding is some variant of EBCDIC rather than a superset of ASCII; see perlebcdic.

In recent history, data has been moved around a computer in 8-bit chunks, often called "bytes" but also known as "octets" in standards documents. Perl is widely used to manipulate data of many types: not only strings of characters representing human or computer languages, but also "binary" data, being the machine's representation of numbers, pixels in an image, or just about anything.

When Perl is processing "binary data", the programmer wants Perl to process "sequences of bytes". This is not a problem for Perl: because a byte has 256 possible values, it easily fits in Perl's much larger "logical character".

This document mostly explains the how. perlunitut and perlunifaq explain the why.

📖 TERMINOLOGY

character: A character in the range 0 .. 2**32-1 (or more); what Perl's strings are made of.
byte: A character in the range 0..255; a special case of a Perl character.
octet: 8 bits of data, with ordinal values 0..255; term for bytes passed to or from a non-Perl context, such as a disk file, standard I/O stream, database, command-line argument, environment variable, socket etc.
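
The distinction between characters and octets can be seen by comparing lengths before and after encoding. The following is a minimal sketch; the sample string and variable names are chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string: the last character is U+00E9 (e with acute accent).
my $chars = "caf\x{E9}";

# Encoding yields octets; in UTF-8, U+00E9 occupies two octets.
my $octets = encode('UTF-8', $chars);

# Decoding the octets recovers the original characters.
my $roundtrip = decode('UTF-8', $octets);

printf "characters: %d, octets: %d\n", length($chars), length($octets);
# prints "characters: 4, octets: 5"
```

Because no CHECK argument is passed, neither call modifies its input.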

⚙️ THE PERL ENCODING API

🔧 Basic methods

encode(ENCODING, STRING[, CHECK]) - Encodes scalar STRING from Perl's internal form into ENCODING and returns octets. ENCODING can be canonical name or alias. See "Handling Malformed Data" for CHECK. CAVEAT: input may be modified in-place unless LEAVE_SRC is set. Example:
```
$octets = encode("iso-8859-1", $string);
```
If $string is undef, returns undef. Alias: str2bytes.
decode(ENCODING, OCTETS[, CHECK]) - Decodes octets in ENCODING into Perl's internal form. Same caveats as encode. Example:
```
$string = decode("iso-8859-1", $octets);
```
Alias: bytes2str.
find_encoding(ENCODING) - Returns the encoding object for ENCODING, or undef if not found. The object can be reused for efficiency:
```
my $enc = find_encoding("iso-8859-1");
while(<>) { my $string = $enc->decode($_); ... }
```
Methods available include name() for canonical name. See Encode::Encoding.
find_mime_encoding(MIME_ENCODING) - Like find_encoding but only matches valid MIME encoding names (case insensitive). Returns undef for invalid MIME names like "utf8".
from_to($octets, FROM_ENC, TO_ENC [, CHECK]) - In‑place conversion between two encodings. $octets must be octets (not Perl characters) held in a scalar variable, not a string constant. Returns the length of the converted string in octets on success, undef on error. Example:
```
from_to($octets, "iso-8859-1", "cp1250");
```
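CAVEAT: CHECK applies only to the encoding step, not to decoding; from_to($octets, $from, $to, $check) is equivalent to $octets = encode($to, decode($from, $octets), $check). For full control, use decode then encode separately.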
encode_utf8($string) - WARNING: may produce invalid UTF-8. Prefer encode("UTF-8", $string). Equivalent to encode("utf8", $string) (loose utf8). Cannot fail.
decode_utf8($octets [, CHECK]) - WARNING: accepts invalid UTF-8. Prefer decode("UTF-8", $octets [, CHECK]). Equivalent to decode("utf8", $octets [, CHECK]). May fail. CAVEAT: input may be modified in-place.
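
As a rough illustration of how these functions relate, the sketch below converts the same ISO-8859-1 octets to UTF-8 twice, once in place with from_to and once with an explicit decode followed by encode, and then checks a name with find_mime_encoding. The sample data and variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode from_to find_mime_encoding);

# "cafe" with an accented e, as ISO-8859-1 octets (one octet per character).
my $latin1 = "caf\xE9";

# In-place conversion; the return value is the new length in octets.
my $in_place = $latin1;
my $len = from_to($in_place, 'iso-8859-1', 'UTF-8');

# The two-step form: decode to characters, then encode to the target.
my $two_step = encode('UTF-8', decode('iso-8859-1', $latin1));

# find_mime_encoding() matches MIME names only, so "utf8" is rejected.
my $mime_ok  = find_mime_encoding('UTF-8');
my $mime_bad = find_mime_encoding('utf8');

print "same octets\n" if $in_place eq $two_step;
print "utf8 is not a MIME name\n" unless defined $mime_bad;
```

The two-step form is the one to reach for when decoding and encoding need different CHECK values.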

📋 Listing available encodings

Encode->encodings() - Returns canonical names of loaded encodings.
Encode->encodings(":all") - Returns all available encodings (including unloaded).
Encode->encodings("Encode::JP") - Returns encodings from a specific module. If "::" not in name, "Encode::" is assumed.

See Encode::Supported for details.
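
A short sketch of the three call forms follows. The exact lists depend on the installed Encode version, so the counts are printed rather than assumed; note that the module-specific form may also include encodings that are already loaded.

```perl
use strict;
use warnings;
use Encode;

my @loaded = Encode->encodings();              # encodings loaded so far
my @all    = Encode->encodings(':all');        # every available encoding
my @jp     = Encode->encodings('Encode::JP');  # adds those from Encode::JP

printf "%d loaded, %d available\n", scalar @loaded, scalar @all;
print "with Encode::JP: @jp\n";
```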

🔗 Defining Aliases

use Encode;
use Encode::Alias;
define_alias(NEWNAME => ENCODING);

After that, NEWNAME can be used as an alias for ENCODING, which may be either the name of an encoding or an encoding object. Check alias existence with resolve_alias:

Encode::resolve_alias("latin1") eq "iso-8859-1" # true
Encode::resolve_alias("iso-8859-12")   # false; nonexistent
Encode::resolve_alias($name) eq $name  # true if $name is canonical

resolve_alias can be imported via use Encode qw(resolve_alias). See Encode::Alias.
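
Putting the pieces together, a minimal sketch might look like this. The alias name "my-latin" is hypothetical and exists only for this example.

```perl
use strict;
use warnings;
use Encode qw(encode resolve_alias);
use Encode::Alias;    # exports define_alias

# "my-latin" is a made-up alias used only for illustration.
define_alias('my-latin' => 'iso-8859-1');

# The alias resolves to the canonical name of its target.
my $canonical = resolve_alias('my-latin');

# The alias can be used wherever an encoding name is expected.
my $octets = encode('my-latin', "caf\x{E9}");

print "my-latin resolves to $canonical\n";
```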

🌐 Finding IANA Character Set Registry names

Canonical names may not match IANA registry names (e.g., "utf-8-strict" vs "UTF-8"). Method mime_name() returns the proper IANA name:

use Encode;
my $enc = find_encoding("UTF-8");
warn $enc->name;      # utf-8-strict
warn $enc->mime_name; # UTF-8

See Encode::Encoding.

📂 Encoding via PerlIO

Use :encoding(ENC) layer on filehandles for automatic encode/decode:

### Version 1 via PerlIO
open(INPUT,  "< :encoding(shiftjis)", $infile)
    || die "Can't open < $infile for reading: $!";
open(OUTPUT, "> :encoding(euc-jp)",  $outfile)
    || die "Can't open > $outfile for writing: $!";
while (<INPUT>) {   # auto decodes $_
    print OUTPUT;   # auto encodes $_
}

### Version 2 via from_to()
open(INPUT,  "< :raw", $infile) || die ...;
open(OUTPUT, "> :raw",  $outfile) || die ...;
while (<INPUT>) {
    from_to($_, "shiftjis", "euc-jp", 1);
    print OUTPUT;
}

Check if encoding supports PerlIO with perlio_ok:

Encode::perlio_ok("hz");             # false
find_encoding("euc-cn")->perlio_ok;  # true (where available)
use Encode qw(perlio_ok);            # imported upon request
perlio_ok("euc-jp")

All core encodings except "hz" and "ISO-2022-kr" are PerlIO-savvy. See Encode::Encoding and Encode::PerlIO.
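
For a self-contained variant of the PerlIO approach, the sketch below writes a string through an :encoding layer to a temporary file and reads it back. It uses lexical filehandles and File::Temp, neither of which appears in the examples above; those choices are assumptions of this sketch.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $text = "caf\x{E9}\n";    # character string with one non-ASCII character

# Create a scratch file that is removed when the program exits.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Writing through the layer encodes characters to UTF-8 octets.
open(my $out, '>:encoding(UTF-8)', $path)
    || die "Can't open > $path for writing: $!";
print {$out} $text;
close($out) || die "Can't close $path: $!";

my $size = -s $path;         # size on disk, in octets

# Reading through the layer decodes the octets back to characters.
open(my $in, '<:encoding(UTF-8)', $path)
    || die "Can't open < $path for reading: $!";
my $line = <$in>;
close $in;

printf "%d characters in memory, %d octets on disk\n", length($text), $size;
```

The file is larger than the string because the accented character takes more than one octet in UTF-8.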

⚠️ Handling Malformed Data

The optional CHECK argument controls behavior on malformed data. The default is Encode::FB_DEFAULT (== 0). As of version 2.12, CHECK may also be a coderef. Not all encodings honor CHECK; for example, Encode::Unicode ignores it and always croaks on error.

📋 List of CHECK values

| Constant | Value | Behavior |
|---|---|---|
| `FB_DEFAULT` | 0 | Replace malformed character with substitution character (SUBCHAR on encode, U+FFFD on decode). If the data is supposed to be UTF-8, an optional lexical warning in the `utf8` category is given. |
| `FB_CROAK` | 1 | Die immediately with error message. |
| `FB_QUIET` | bitmask | Return processed portion on error; unprocessed data remains in argument. |
| `FB_WARN` | bitmask | Same as `FB_QUIET` but issues a warning. Encode warnings are reported regardless of the `warnings` pragma; add `Encode::ONLY_PRAGMA_WARNINGS` to CHECK to follow lexical warnings (since 2.99). |
| `FB_PERLQQ` | bitmask | Insert `\xHH` on decode, `\x{HHHH}` on encode. |
| `FB_HTMLCREF` | bitmask | Insert `&#NNN;` (decimal) on encode. |
| `FB_XMLCREF` | bitmask | Insert `&#xHHHH;` (hex) on encode. |

Bitmask breakdown:

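| Flag | Hex | FB_CROAK | FB_QUIET | FB_WARN | FB_PERLQQ |
|---|---|---|---|---|---|
| `DIE_ON_ERR` | 0x0001 | X | | | |
| `WARN_ON_ERR` | 0x0002 | | | X | |
| `RETURN_ON_ERR` | 0x0004 | | X | X | |
| `LEAVE_SRC` | 0x0008 | | | | X |
| `PERLQQ` | 0x0100 | | | | X |
| `HTMLCREF` | 0x0200 | | | | |
| `XMLCREF` | 0x0400 | | | | |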

LEAVE_SRC: If CHECK is set but the LEAVE_SRC bit is not, the source string passed to encode() or decode() is overwritten in place. To preserve the input, bitwise-OR Encode::LEAVE_SRC into CHECK.
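
The effect of several CHECK values can be sketched by encoding a string that contains a character ASCII cannot represent. The sample strings and variable names are illustrative only, and the constants are called with explicit parentheses purely as a parsing precaution.

```perl
use strict;
use warnings;
use Encode qw(encode);

# FB_QUIET stops at the first unmappable character and trims the source.
my $src   = "abc\x{E9}def";
my $quiet = encode('ascii', $src, Encode::FB_QUIET());
# $quiet holds "abc"; $src now holds the unprocessed "\x{E9}def".

# OR-ing in LEAVE_SRC returns the same result but keeps the source intact.
my $kept       = "abc\x{E9}def";
my $quiet_kept = encode('ascii', $kept,
                        Encode::FB_QUIET() | Encode::LEAVE_SRC());

# The escaping fallbacks substitute a readable escape instead of stopping.
my $word   = "caf\x{E9}";
my $html   = encode('ascii', $word, Encode::FB_HTMLCREF());   # "caf&#233;"
my $xml    = encode('ascii', $word, Encode::FB_XMLCREF());
my $perlqq = encode('ascii', $word, Encode::FB_PERLQQ());

# FB_CROAK dies on the first error; eval captures the message.
my $victim = "caf\x{E9}";
my $ok     = eval { encode('ascii', $victim, Encode::FB_CROAK()); 1 };
my $error  = $@;

print "$quiet | $html | $xml | $perlqq\n";
print "croaked: $error" unless $ok;
```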

🔧 Coderef for CHECK

As of version 2.12, CHECK can be a coderef. For encode: receives ordinal of unmapped character, returns octets for fallback.

$ascii = encode("ascii", $utf8, sub{ sprintf "<U+%04X>", shift });

For decode: receives the ordinal values of the malformed octets as a list, returns the decoded string to use in their place.

$str = decode 'UTF-8', $octets, sub {
    my $tmp = join '', map chr, @_;
    return decode 'ISO-8859-15', $tmp;
};

🛠️ Defining Encodings

use Encode qw(define_encoding);
define_encoding($object, CANONICAL_NAME [, alias...]);

Associates $object with canonical name and optional aliases. See Encode::Encoding.
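
As a rough sketch of what such an object can look like, the example below registers a toy ROT13 encoding. The package name My::ROT13 and the encoding name "x-rot13" are hypothetical, and storing the canonical name under a Name key is an assumption based on how the Encode::Encoding base class reports name().

```perl
use strict;
use warnings;
use Encode qw(encode decode define_encoding find_encoding);

{
    # Hypothetical encoding class, for illustration only.
    package My::ROT13;
    use parent 'Encode::Encoding';

    # Rotate ASCII letters by 13 places; CHECK is ignored in this toy.
    sub encode {
        my ($self, $str, $check) = @_;
        $str =~ tr/A-Za-z/N-ZA-Mn-za-m/;
        return $str;
    }

    # ROT13 is its own inverse, so decoding reuses encode.
    sub decode {
        my $self = shift;
        return $self->encode(@_);
    }
}

# Register the object under a made-up canonical name.
my $object = bless { Name => 'x-rot13' }, 'My::ROT13';
define_encoding($object, 'x-rot13');

my $scrambled = encode('x-rot13', 'Hello');     # "Uryyb"
my $restored  = decode('x-rot13', $scrambled);  # "Hello"

print find_encoding('x-rot13')->name, ": $scrambled -> $restored\n";
```

A production encoding would also honor CHECK and the in-place source semantics described under "Handling Malformed Data"; this sketch leaves both out.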

🚩 The UTF8 flag

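Before Perl 5.8, the eq operator simply compared the strings held by two scalars. Since Perl 5.8, eq compares them with simultaneous consideration of the UTF8 flag. The design goals behind this are quoted from Programming Perl, 3rd ed.: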

Goal #1: Old byte-oriented programs should not spontaneously break on old byte-oriented data.
Goal #2: Old byte-oriented programs should magically start working on new character-oriented data when appropriate.
Goal #3: Programs should run just as fast in character-oriented mode as in byte-oriented mode.
Goal #4: Perl should remain one language, not fork into byte- and character-oriented versions.

The UTF8 flag is not visible in scripts; you can peek with internal functions (see below).

🔧 Messing with Perl's Internals

is_utf8(STRING [, CHECK]) - [INTERNAL] Tests whether UTF8 flag is on. If CHECK true, also verifies well-formed UTF-8. Returns true/false. Do not use to distinguish character/binary data.
_utf8_on(STRING) - [INTERNAL] Turns UTF8 flag on. Does NOT validate content. Returns previous state or undef if not a string. Does not work on tainted values.
_utf8_off(STRING) - [INTERNAL] Turns UTF8 flag off. Returns previous state. Does not work on tainted values.
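
For debugging purposes only, the flag can be inspected as in the sketch below. Non-ASCII sample data is used deliberately; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode is_utf8);

my $octets = "caf\xC3\xA9";              # UTF-8 octets for "cafe" with an accent
my $chars  = decode('UTF-8', $octets);   # character string
my $again  = encode('UTF-8', $chars);    # octets once more

# Normalize the results to 1 or 0 for printing.
my $flag_chars = is_utf8($chars) ? 1 : 0;
my $flag_again = is_utf8($again) ? 1 : 0;

print "decoded: $flag_chars, encoded: $flag_again\n";
# prints "decoded: 1, encoded: 0"
```

The flag describes Perl's internal storage, not whether a scalar holds text, which is why it should not be used to tell character data from binary data.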

🧩 UTF-8 vs. utf8 vs. UTF8

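Historically, Perl used a loose interpretation of UTF-8 that allowed code points up to 32 bits wide as well as surrogates. Official UTF-8 is stricter: code points are limited to 0..0x10_FFFF, and surrogates and non-shortest encodings are rejected. As of Perl 5.8.7 and Encode 2.10, the names are distinguished as follows: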

"UTF-8" (with hyphen) means strict UTF-8 (canonical name utf-8-strict).
"utf8" (no hyphen) means Perl's traditional loose UTF-8.
"UTF8" (no hyphen, no underscore) is Perl's internal flag name.

Examples:

encode("utf8",  "\x{FFFF_FFFF}", 1); # okay (loose)
encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks (strict)
find_encoding("UTF-8")->name # 'utf-8-strict'
find_encoding("utf-8")->name # ditto (case/underscore insensitive)
find_encoding("UTF8")->name  # 'utf8'
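
The strict and loose behaviors can be compared without terminating the program by wrapping each call in eval. This sketch uses U+110000, one past the Unicode maximum, instead of the 32-bit value above; that choice and the variable names are assumptions of the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One past the Unicode maximum of U+10FFFF.
my $beyond = chr(0x110000);

# Work on copies: with FB_CROAK the source argument may be modified in place.
my $copy_loose  = $beyond;
my $copy_strict = $beyond;

# Loose "utf8" accepts the code point.
my $loose     = eval { encode('utf8', $copy_loose, Encode::FB_CROAK()) };
my $loose_err = $@;

# Strict "UTF-8" rejects it and croaks.
my $strict     = eval { encode('UTF-8', $copy_strict, Encode::FB_CROAK()) };
my $strict_err = $@;

print "loose:  ", (defined $loose  ? "ok" : "croaked"), "\n";
print "strict: ", (defined $strict ? "ok" : "croaked"), "\n";
```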

📚 SEE ALSO

Encode::Encoding, Encode::Supported, Encode::PerlIO, encoding, perlebcdic, "open" in perlfunc, perlunicode, perluniintro, perlunifaq, perlunitut, utf8, the Perl Unicode Mailing List <http://lists.perl.org/list/perl-unicode.html>

👤 MAINTAINER

This project was originated by the late Nick Ing-Simmons and later maintained by Dan Kogai <dankogai@cpan.org>. See AUTHORS for a full list of people involved. For any questions, send mail to <perl-unicode@perl.org> so that we can all share.

While Dan Kogai retains the copyright as a maintainer, credit should go to all those involved. See AUTHORS for a list of those who submitted code to the project.

©️ COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
