This article was originally written for Pythons 3.0 and 2.6, but applies to all 3.X and 2.X. It became a new chapter in the book Learning Python, and was revised and expanded in the 5th Edition, but evolved independently here. See also the related reading links ahead for more resources.
Jun-2009 (last polished Apr-2024)
One of the most noticeable changes in Python 3.0 is the mutation of string object types. In a nutshell, 2.X's str and unicode types have morphed into 3.X's bytes and str types, and a new mutable bytearray type has been added. Especially if you process data that is either Unicode or binary in nature, this can have substantial impacts on your code. As a general rule of thumb, how much you need to care about this topic depends in large part upon which of the following categories you fall into:
str.
struct module, you will need to understand 3.0's new bytes object, and its different and sharper distinction between text and binary data and files.
str string type, text files, and all the familiar string operations. Your strings will be encoded and decoded using your platform's default encoding (e.g., ASCII, UTF-8, or Latin-1; the locale module's getpreferredencoding() gives your open() default if you must know), but you probably won't notice.
For example, if text is still always ASCII in your corner of the software world, you might be able to get by with normal string objects and text files, and can avoid most of the following story. As we'll see in a moment, ASCII is a simple kind of Unicode and a subset of other encodings, so string operations and files "just work" if your programs process ASCII text.
Even if you fall into the last category above, though, a basic understanding of 3.0's string model can help, both to demystify some of the underlying details now, and to help you master Unicode or binary data issues if they impact you in the future. Given the prominence of the web in most software careers today, that impact may be more a matter of "when" than "if."
Before looking at code, let's begin with a general overview of the 3.0 string model. To understand why 3.0 went the way it did, we have to start with a brief look at how characters are actually represented in computers.
Most programmers think of strings as a series of characters (really, their integer codes) used to represent textual data. That's still true in the brave new world of Unicode, but the way characters are stored in a computer's memory and files can vary, depending on both what sort of characters are recorded, and how programmers choose to record them.
For many programmers in the US, the ASCII standard defines their notion of text strings. ASCII is a standard created in the US, which defines character codes 0..127, and thus allows each character to be stored in one 8-bit byte. For example, the ASCII standard maps character 'a' to the integer value 97 (61 in hex), which can be stored in a single byte both in memory and on files. If you wish to check, Python's ord() shows the integer code of a given character; chr() reveals the character of a given integer code; and hex() gives the code's byte value as two hex digits, each of which fits a 4-bit nibble; the first of these is the value of a character's code, and byte, in ASCII:
>>> ord('a')        # character => code
97
>>> chr(97)         # code => character
'a'
>>> hex(97)         # byte value: fits 8 bits
'0x61'
ASCII makes text processing simple, because characters directly correlate to bytes. Sometimes, though, this isn't enough. Accented characters and special symbols, for example, do not fit into the range of character codes defined by ASCII. To allow for some such extra characters, other standards allow all possible values in an 8-bit byte, 0..255, to be used as codes, and assign values 128..255 to additional characters. One such standard is known as Latin-1, and is widely used in Western Europe. In Latin-1, character codes above 127 are assigned to accented and otherwise-special characters. For instance, the character which Latin-1 assigns to code 196 (a.k.a. byte value 0xc4) is a specially marked and non-ASCII character. Per Python 3.X:
>>> chr(196)             # too big for ASCII
'Ä'
>>> ord('Ä')             # okay for Latin-1
196
>>> hex(ord('Ä'))        # byte value in Latin-1
'0xc4'
Still, some alphabets define so many characters that it is impossible to represent them as one byte-sized code per character. The integer codes of the symbols and characters in the following, for example, require more space than a byte—as do those of all the silly emojis that may not work in some viewers and editors, but manage to crop up in your emails anyhow:
>>> ord('☞')
9758
>>> hex(ord('☞'))                        # too big for one byte
'0x261e'
>>> [hex(ord(c)) for c in '真Л⇨']        # ditto: Unicode required
['0x771f', '0x41b', '0x21e8']
>>> [hex(ord(c)) for c in '🙂🙊👍']     # emojis > two bytes (16 bits)
['0x1f642', '0x1f64a', '0x1f44d']
Unicode provides the generality we need to deal with text containing non-ASCII characters and symbols like these. In fact, it defines and assigns enough character codes to represent almost every natural language in use, plus a large set of symbols. Unicode is sometimes referred to as "wide-character" strings, because its range of characters is so broad that multiple bytes may be needed to represent individual character codes. To allow for this, it also defines standard ways to map character codes to bytes for storage and transmission that are both platform and language neutral—the encodings we'll explore in the next section.
The takeaway here is that Unicode's combination of all-encompassing character codes and their predefined encodings make it a highly flexible model, and the standard way that programs deal with non-English and other text that may have more characters than 8-bit bytes can handle. As an added bonus, earlier schemes like ASCII also fall under the Unicode umbrella unchanged, but we have to move on to the next section to see how.
The key to understanding how Unicode works lies in the way its character codes (a.k.a. "code points") in memory are mapped to their encoded forms as needed for efficient storage or transfer. We say that characters are translated to and from raw bytes using an encoding: the rules for translating a Unicode string into a sequence of bytes, and extracting the string from a sequence of bytes. More procedurally, this translation back and forth between bytes and strings is defined by two terms: encoding, the process of translating a string of characters into its raw bytes form per a given encoding name, and decoding, the process of translating a raw sequence of bytes back into its character string form per its encoding name.
As noted, Unicode defines both character codes and a set of standard encodings. For some of the encodings it defines, the translation process is trivial—ASCII and Latin-1, for instance, map each character to a single byte, so little or no work is required to encode and decode. For other encodings, the mapping can be more complex, and yield multiple bytes per character.
The widely used UTF-8 encoding, for example, allows more characters to be represented by employing a variable-number-of-bytes scheme that's both general and economical. Character codes less than 128 are represented as a single byte; codes between 128 and 0x7ff (2047) are turned into two bytes, where each byte has a value between 128 and 255; and codes above 0x7ff are turned into three- or four-byte sequences having values between 128 and 255. This keeps simple ASCII strings compact, sidesteps byte ordering issues, and avoids null (zero) bytes that can cause problems for C libraries and networking.
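As a quick check of the size ranges just described, the following sketch encodes one sample character from each range and records the number of bytes UTF-8 produces. The sample code points are arbitrary picks for illustration, not special values.

```python
# One sample code point per UTF-8 size range (arbitrary illustrative picks)
codes = [0x61, 0xC4, 0x261E, 0x1F642]     # 'a', A-umlaut, a symbol, an emoji

# Encoded length grows with the size of the code point
utf8_lengths = [len(chr(code).encode('utf-8')) for code in codes]
print(utf8_lengths)

# Every byte in a multibyte sequence is 128 or higher,
# so no ASCII or null bytes appear inside one
multibyte = chr(0x261E).encode('utf-8')
high_only = all(byte >= 128 for byte in multibyte)
print(multibyte, high_only)
```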
Despite such details, it's important to note that ASCII is a subset of both Latin-1 and UTF-8. This is true because these encodings both assign ASCII characters to the same codes, and encode those characters to bytes the same way. This makes Unicode compatible with existing ASCII data: every character string encoded per ASCII is also valid according to the Latin-1 and UTF-8 encodings, and every ASCII file is a valid Latin-1 and UTF-8 file. Technically, the ASCII encoding is a 7-bit subset of the other two: it's binary compatible for all character codes less than 128. Latin-1 and UTF-8 simply allow for additional characters: Latin-1 for characters mapped to values 128..255 within a byte, and UTF-8 for characters that may be represented with multiple bytes.
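The subset relationship is easy to verify directly. This sketch decodes and encodes the same ASCII-only content under all three schemes, and then shows where the compatibility ends; the sample text is arbitrary.

```python
raw = b'spam and eggs'                    # every byte value is below 128

# ASCII bytes decode to the same text under all three encodings
decoded = [raw.decode(name) for name in ('ascii', 'latin-1', 'utf-8')]
same_text = decoded[0] == decoded[1] == decoded[2]

# ASCII text encodes to the same bytes under all three encodings
text = 'spam and eggs'
same_bytes = (text.encode('ascii') ==
              text.encode('latin-1') ==
              text.encode('utf-8'))

# Above code 127, Latin-1 and UTF-8 part ways: 1 byte versus 2
differs = '\xc4'.encode('latin-1') != '\xc4'.encode('utf-8')
print(same_text, same_bytes, differs)
```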
Other encodings support richer character sets in other ways. For instance, UTF-16 uses 2 bytes for most characters, with a special "surrogate pair" protocol that uses 4 bytes for codes too large for 2, and UTF-32 uses a fixed 4 bytes per character. We'll skip further details here, but keep in mind that all of these (ASCII, Latin-1, UTF-8, and others) are simply alternative Unicode encodings that yield the same Unicode code-point text when decoded. This net effect ensures that text is portable across all the tools that use it, in exchange for minor translation costs.
To Python programmers, an encoding is specified as a string containing the encoding's name. Python comes with roughly 100 different encodings; see the Python Library Reference for a list. Importing module encodings and asking for help(encodings) shows you many as well; some are implemented in Python, and some in C. Some encodings have multiple names too; for example, "latin-1", "iso_8859_1" and "8859" are all synonyms for the same encoding, Latin-1. We'll revisit encodings later in this article, when we study Unicode coding techniques.
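One way to see that several names refer to the same encoding is to look each one up with the standard codecs module and compare the canonical names returned. This is a small sketch only; the bad name at the end is made up to show the failure case.

```python
import codecs

aliases = ['latin-1', 'iso_8859_1', '8859']

# codecs.lookup() resolves any alias to a CodecInfo with a canonical name
canonical = {codecs.lookup(alias).name for alias in aliases}
print(canonical)                          # one name if all are synonyms

# Unknown names raise LookupError (this name is deliberately bogus)
try:
    codecs.lookup('no-such-encoding')
    found_bogus = True
except LookupError:
    found_bogus = False
```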
For another take on the Unicode backstory, see the Python standard manual set. It includes a "Unicode HOWTO" in its "Python HOWTOs" section which provides additional details which we will skip here in the interest of space.
Update: some encodings also require or allow markers at the start of encoded text, known as Unicode BOMs. These markers can designate byte order and encoding type, and may be present whether the encoded text is stored in files or memory. Though not covered in this early-draft article, there is a brief look at this topic on this site here, and more complete surveys in later books. For more on why using the correct encoding matters in general, read the Latin-1/CP-1252 saga here.
Update: this doc also does not discuss Unicode normalization, an advanced but essential topic. In short, the Unicode standard oddly allows some non-ASCII characters (e.g., ñ and Ä) to be represented with multiple and differing code-point sequences when decoded. This in turn forces many text- and filename-processing tools to make disparate forms equivalent before running comparisons. For more details on this border case, see this site's off-page coverage here and here.
At a more concrete level, the Python language provides multiple string data types to represent content in your script: both textual data (integer code-point values of decoded Unicode characters in memory) and binary data (raw byte values, including text that is in encoded form). These types differ between the two Python lines.
For example, Python 2.X has a general string type for representing both simple 8-bit character text like ASCII and binary data, along with a specific type for representing richer Unicode text that may occupy multiple bytes when encoded or decoded:
str: for representing both 8-bit text and binary data
unicode: for representing Unicode text (decoded code points)
unicode allows for the extra range of Unicode characters, and has extra support for encoding and decoding), but their operation sets largely overlap. Because the str string type in 2.X represents both text that can be represented with 8-bit bytes as well as binary data, it can be used for both textual and non-textual content.
By contrast, Python 3.0 comes with three string object types:
str: for representing Unicode text (decoded code points)
bytes: for representing binary data (including encoded text)
bytearray: a mutable flavor of the bytes type
str and unicode) into a single string type that supports both byte-oriented and richer Unicode text. Developers wanted to remove the 2.X string dichotomy, and make Unicode processing more uniform and natural.
To achieve this, the 3.0 str type is defined as an immutable sequence of characters (really, code points that are not necessarily bytes). Its content may contain both simple text such as ASCII whose encoded and decoded forms yield one byte per character, as well as richer Unicode text whose encoded and decoded forms may both require multiple bytes per character. In memory, a str is just a sequence of Unicode code points. When transferred to and from files, a str is automatically encoded and decoded using either the platform default, or a provided encoding name to translate with an explicit scheme.
While 3.0's new str type does achieve 2.X str/unicode merging for text, many programs still need to process raw binary content that is not encoded per any Unicode format, as well as the bytes used to store text when it is encoded. Image files, and packed data you might process with Python's struct module, fall into this category. To support this, a new type, bytes, was also introduced for processing truly binary data. bytes is just bytes, not Unicode characters, though its content may include still-encoded text. In 2.X, the general str type filled this binary data role, because strings were just sequences of bytes (the separate unicode type handled richer text).
In 3.0, the bytes type is defined as an immutable sequence of 8-bit integers representing byte values, and supports almost all the same operations that the str type does; this includes string methods, sequence operations, and even re module pattern matching, but not formatting (till later in 3.X's evolution: see the update ahead). A bytes object really is a sequence of small integers, each of which is in the range 0..255; indexing a bytes returns an int, slicing one returns another bytes, and running list() on one returns a list of integers, not characters. However, when processed with operations that assume characters (e.g., the isalpha() method), the contents of bytes objects are assumed to be ASCII-encoded bytes. Further, bytes items whose values fall in the range of ASCII character codes are printed as ASCII characters instead of integers; this is done purportedly for convenience, though it may also confuse the distinction between text and binary data.
While it was at it, Python also sprouted bytearray in 3.0, a variant of bytes, which is mutable, and so supports in-place changes. The bytearray type supports the usual string operations that str and bytes do, but also has many of the same in-place change operations as lists (e.g., append() and extend(), and assignment to indexes). Assuming your strings can be treated as raw bytes, bytearray finally adds direct in-place mutability for string data, something not possible in 2.X, or with 3.0's str or bytes.
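To make the in-place operations concrete, here is a brief sketch that changes a bytearray the way you would change a list. The content is arbitrary; note that items are integers, so single-item changes take int values rather than one-character strings.

```python
buf = bytearray(b'spam')          # mutable sequence of small ints

buf[0] = ord('S')                 # index assignment takes an int, not a str
buf.append(ord('!'))              # list-like growth, one int at a time
buf.extend(b' eggs')              # extend from any iterable of ints

frozen = bytes(buf)               # immutable snapshot of the current content
text = buf.decode('ascii')        # decode only if the bytes really are text
print(buf, frozen, text)
```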
File I/O has also been revamped in 3.0 to reflect the str/bytes distinction. Really, text is just decoded integer character codes when it is in memory; it's when text is transferred to and from external interfaces like files that Unicode encodings come into play. By contrast, truly binary data may have nothing at all to do with encodings (or text at all). Because of this, Python now makes a sharp platform-independent distinction between text files and binary files:
str; writing takes a str, and automatically encodes it before transferring to the file. Text mode files also support universal end-of-line translation, and encoding specification arguments.
open() call, reading its data does not decode it in any way, and simply returns its content raw and unchanged, as a bytes object; writing takes a bytes object and transfers it to the file unchanged. Binary-mode files also accept a bytearray object for the content to be written to the file.
Because str and bytes are sharply differentiated by the language, the net effect is that you must decide whether your data is text or binary in nature, and use str or bytes objects to represent its content in your script, respectively. Ultimately, the mode in which you open a file will dictate which type of object your script will use to represent its content.
bytes and binary-mode files. You might also opt for bytearray to update the data without making copies of it in memory.
str and text-mode files.
Notice that the mode string argument to open() (its second argument) becomes fairly crucial in Python 3.0: its content not only specifies a file processing mode, but also implies a Python object type. By adding a "b" (lower-case only) to the mode string, you specify a binary mode file, and will receive, or must provide, a bytes object to represent the file's content when reading or writing. Without the "b", your file is processed in text mode, and you'll use str objects to represent its content. For example, modes "rb", "wb", and "rb+" imply bytes; "r", "w+", and "rt" (the default) imply str.
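The following sketch shows how the mode string selects the object type. It writes a small Unicode string in text mode with an explicit encoding, then reads it back in both binary and text modes. The file name and temporary folder are assumptions made for the demo only.

```python
import os
import tempfile

text = 'A\xc4B\xe8C'                      # 5 characters, 2 of them non-ASCII
path = os.path.join(tempfile.mkdtemp(), 'unidata.txt')   # hypothetical file

with open(path, 'w', encoding='utf-8') as f:    # text mode: str in, encoded on write
    f.write(text)

with open(path, 'rb') as f:                     # binary mode: raw bytes, no decoding
    raw = f.read()

with open(path, 'r', encoding='utf-8') as f:    # text mode: bytes decoded to str
    back = f.read()

# Binary-mode files reject str: the types do not mix
mixed_error = None
try:
    with open(path, 'wb') as f:
        f.write(text)
except TypeError as exc:
    mixed_error = exc

print(type(raw), len(raw), type(back), len(back))
```

Passing the encoding explicitly, as done here, sidesteps the platform-dependent default that open() would otherwise apply.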
If you're anxious to see files in action, watch for the examples ahead, especially those of Unicode-text files. To understand file usage in full, though, we first need to explore string basics.
Let's step through a few examples that demonstrate how the 3.0 string types are used. Note up front that the code in this section was run with and applies to 3.0 only, unless noted otherwise. That said, although there is no bytes type in Python 2.6 (it has just the general str), some cross-version compatibility is still possible: in 2.6, the call bytes(X) is present as a synonym for str(X), and the new literal form b'...' is taken to be the same as the normal string literal '...' (in this article, ... means a string's characters). You may still run into version skew in some cases, though; the 2.6 bytes() call, for instance, does not allow the second argument (encoding name) required by 3.0's bytes().
Python 3.0 string objects originate when you call a function such as str() or bytes(); process a file created by calling open() (described later in this article); or code literal syntax in your script. For the latter, a new literal form, b'...' (and equivalently, B'...') is used to create bytes objects in 3.0, and bytearray objects may be created by calling the bytearray() function, with a variety of possible arguments.
More formally, in 3.0 all the current string literal forms ('...', "...", and triple-quoted blocks) generate a str; adding a "b" or "B" just before them creates a bytes instead. This new b'...' bytes literal is similar in spirit to the r'...' raw string, which suppresses backslash escapes. Consider the following:
C:\misc>c:\python30\python
>>> B = b'spam'              # make a bytes object (8-bit bytes)
>>> S = 'eggs'               # make a str object (Unicode characters, 8-bit or wider)
>>> type(B), type(S)
(<class 'bytes'>, <class 'str'>)
>>> B                        # prints as a character string, really a sequence of ints
b'spam'
>>> S
'eggs'
>>> B[0], S[0]               # indexing returns an int for bytes, str for str
(115, 'e')
>>> B[1:], S[1:]             # slicing makes another bytes or str
(b'pam', 'ggs')
>>> list(B), list(S)         # bytes is really ints
([115, 112, 97, 109], ['e', 'g', 'g', 's'])
>>> B[0] = 'x'               # both are immutable
TypeError: 'bytes' object does not support item assignment
>>> S[0] = 'x'
TypeError: 'str' object does not support item assignment
>>> # bytes prefix works on single, double, triple quotes
>>> B = B"""
... xxxx
... yyyy
... """
>>> B
b'\nxxxx\nyyyy\n'
As mentioned, for forward compatibility, in Python 2.6 the 3.0 b'...' literal is present but is the same as '...' and makes a 2.X str, and bytes() is just a synonym for str(); in 3.0, both of these address the distinct bytes type, as shown above for the literal. Also note that the u'...' and U'...' unicode string literal forms in 2.6 discussed ahead are gone in 3.0; use '...' in 3.0 instead, since all text strings are Unicode in the 3.X line, even if they contain only ASCII characters.
Update: Python 3.X later reinstated 2.X's unicode string literals to ease migration of 2.X code: a 2.X u'...' unicode literal in Python 3.3 and later is now just a synonym for a 3.X '...' str literal. This makes sense given 3.X's all-Unicode str type, and is the backward-compatible equivalent of 2.X's forward-compatible b'...' support. It's tempting to read into this that 2.X's str and unicode simply become 3.X's bytes and str, but the division of these types' roles is much sharper in 3.X, as the next section explains.
Syntax aside, the first thing you might notice about Python 3.0 strings is what they cannot do. Although Python 2.X allows its str and unicode objects to be freely mixed (if the str contains only 7-bit ASCII text, at least), 3.X draws a much sharper distinction: str and bytes never mix automatically in expressions, and as a rule are not converted to one another automatically when passed to functions. That is, a function that expects an argument to be a str object won't generally accept a bytes (and vice versa), and operators are just as rigid in 3.X:
>>> 'eggs' + b'spam'
TypeError: can only concatenate str (not "bytes") to str
This is easier to understand if you remember that a text string may be radically different in its encoded and decoded forms, and Python has no idea what the content of a bytes is: if the bytes is encoded text its encoding is unknown, but it may also be binary data that has nothing to do with text at all. Because of this ambiguity, Python 3.0 basically requires that you either commit to one type or the other, or perform manual, explicit conversions with the following tools:
S.encode() and bytes(S, encoding) encode a str S to a new bytes
B.decode() and str(B, encoding) decode a bytes B to a new str
Both the S.encode() and B.decode() methods above and the file open() call we'll explore ahead use either an explicitly passed-in encoding name or a default. In Python 3.X, the methods' default is always UTF-8, but open() uses a value in the locale module that may vary per platform (and environment settings). In 2.X both defaults are usually ASCII, as exposed in the sys module (which allows changes at start-up). For example, in 3.X:
>>> S = 'eggs'
>>> S.encode()                       # str to bytes: encode text into raw bytes
b'eggs'
>>> bytes(S, encoding='ascii')       # str to bytes, alternative
b'eggs'
>>> B = b'spam'
>>> B.decode()                       # bytes to str: decode raw bytes into text
'spam'
>>> str(B, encoding='ascii')         # bytes to str, alternative
'spam'
Putting this together solves our original type error, and allows us to mix strings and bytes in 3.X as either encoded or decoded text:
>>> S, B
('eggs', b'spam')
>>> S.encode('ascii') + B            # bytes + bytes (encoded)
b'eggsspam'
>>> S + B.decode('ascii')            # str + str (code points)
'eggsspam'
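In larger programs, explicit conversions like these are often collected into small helpers that normalize values at a program's boundaries. The sketch below is one possible shape for this; the function names to_str and to_bytes and their UTF-8 defaults are hypothetical choices for illustration, not part of Python's library.

```python
def to_str(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes or bytearray."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value

def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding it first if it is str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return bytes(value)

S, B = 'eggs', b'spam'
as_text = S + to_str(B)           # str + str (code points)
as_raw = to_bytes(S) + B          # bytes + bytes (encoded)
print(as_text, as_raw)
```

Helpers of this sort work only when the encoding of incoming bytes is actually known; they do not remove the need to decide whether data is text or binary.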
Two cautions here. First of all, your platform's various default encodings are available in the sys and locale modules, but the encoding argument to bytes() is not optional, even though it is in S.encode() (and B.decode()). Second, although str() does not require the encoding argument like bytes() does, leaving it off in str() calls does not mean it defaults; instead, a str() without an encoding returns the bytes object's print string, not its decoded and converted str form (this is usually not what you'll want!). Assuming B and S are still as in the prior listing:
>>> import sys, locale
>>> sys.platform                          # underlying platform
'win32'
>>> locale.getpreferredencoding(False)    # Windows open() default: a Latin-1 superset
'cp1252'
>>> sys.getdefaultencoding()              # but str() does not use defaults
'utf-8'
>>> bytes(S)
TypeError: string argument without an encoding
>>> str(B)                                # str() without encoding: print string, not conversion!
"b'spam'"
>>> len(str(B))
7
>>> len(str(B, encoding='ascii'))         # use encoding to convert to str
4
Update: as of 2024, Python's docs state that the default encoding for file content is now locale.getencoding(), not locale.getpreferredencoding(False), but this is not true: the former is ignorant of a new UTF-8 mode option that can be enabled by environment variable or command-line argument, though the difference won't matter after Python 3.15 enables UTF-8 mode everywhere (per current plans). You also shouldn't generally care: pass explicit encodings in open() calls to avoid interoperability hurdles today.
Having said all that, it's important to also note that encoding and decoding are substantially more than simple programming-language type conversions; really, they produce very different kinds of data. Encoding returns the bytes that result from transforming a text string per a Unicode scheme, and decoding returns the text string that is produced by undoing that transformation. While this is a conversion of sorts, and the mapping may seem trivial for simple text like ASCII, Unicode tends to make much more sense if you avoid blurring the distinction—especially for richer types of text like that in the next section.
Encoding and decoding get more meaningful when you start dealing with actual non-ASCII Unicode text. To code Unicode characters that may be difficult to type on your keyboard, Python string literals support both:
\xNN hex escapes, where 2 hex digits (NN) specify a character code as a 1-byte (8-bit) numeric value
\uNNNN and \UNNNNNNNN Unicode escapes, where the first form gives 4 hex digits to denote a 2-byte (16-bit) character code, and the second gives 8 hex digits for a 4-byte (32-bit) code.
Importantly, in str objects all three of the escapes listed above are used to give a Unicode character's code point value, not its encoded bytes; use bytes objects if you need to represent a character's encoded bytes instead.
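A short sketch may help separate the two meanings of an escape. In a str, all three escape forms name the same code point; in a bytes, a hex escape names a raw byte value. The character used (code point 0xE8) is an arbitrary pick.

```python
# In str literals, all three escape forms give a code point
s1 = '\xe8'
s2 = '\u00e8'
s3 = '\U000000e8'
same = s1 == s2 == s3             # one character, code point 232

# In bytes literals, a hex escape gives a raw byte value instead
b = b'\xe8'

# The code point's encoded bytes depend on the encoding chosen
as_latin1 = s1.encode('latin-1')  # 1 byte that happens to match the code point
as_utf8 = s1.encode('utf-8')      # 2 bytes that do not
print(same, len(s1), as_latin1, as_utf8)
```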
Let's see how this all translates to code. Simple 7-bit ASCII text is formatted with one character per byte under most of the encoding schemes described near the start of this article (again, this is why ASCII passes as a binary-compatible subset of many other schemes):
>>> ord('X')               # 'X' has binary value 88 in the default encoding
88
>>> chr(88)                # 88 stands for character 'X'
'X'
>>> S = 'XYZ'              # str (code points displayed as their character glyphs)
>>> S
'XYZ'
>>> len(S)                 # 3 characters long
3
>>> S.encode('ascii')      # values 0..127 in 1 byte each (ASCII bytes shown as chars)
b'XYZ'
>>> S.encode('latin-1')    # values 0..255 in 1 byte each
b'XYZ'
>>> S.encode('utf-8')      # values 0..127 in 1 byte, 128..2047 in 2, others in 3 or 4
b'XYZ'
By contrast, the less common UTF-16 and UTF-32 use at least 2 bytes and exactly 4 bytes per character, respectively, even for simple text like ASCII. This can make these encodings' data fast to process, but it may consume extra space and bandwidth, which renders them subpar in many applications. In the following, ASCII bytes print as characters, non-ASCIIs print as \xNN escapes, and each result has a 2- or 4-byte BOM header at the front whose details we're largely skipping here (see the earlier update):
>>> S
'XYZ'
>>> S.encode('utf-16')     # always 2 or 4 bytes per character, plus a BOM header
b'\xff\xfeX\x00Y\x00Z\x00'
>>> S.encode('utf-32')
b'\xff\xfe\x00\x00X\x00\x00\x00Y\x00\x00\x00Z\x00\x00\x00'
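To isolate the per-character sizes from the BOM header, the sketch below uses the byte-order-specific encoding names, which emit no BOM. It also encodes one emoji to show the 4-byte surrogate-pair case in UTF-16; the emoji chosen is arbitrary.

```python
text = 'XYZ'

# Naming the byte order explicitly ("-le" = little-endian) omits the BOM
utf16 = text.encode('utf-16-le')          # 2 bytes per character here
utf32 = text.encode('utf-32-le')          # 4 bytes per character, always

# A code point above 0xFFFF needs a surrogate pair in UTF-16: 4 bytes
emoji = '\U0001f642'
emoji16 = emoji.encode('utf-16-le')
emoji32 = emoji.encode('utf-32-le')

print(len(utf16), len(utf32), len(emoji16), len(emoji32))
```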
To code non-ASCII characters, you can use hex and Unicode escapes in your strings. The numeric values coded as hexadecimal literals 0xC4 and 0xE8, for instance, are the Unicode code points used to represent two special characters outside the 7-bit range of ASCII; we can embed them in str objects, because str supports Unicode in 3.X today:
>>> chr(0xc4)              # 0xC4 and 0xE8 are accented characters outside ASCII's range
'Ä'
>>> chr(0xe8)
'è'
>>> S = '\u00c4\u00e8'     # 16-bit Unicode escapes
>>> S
'Äè'
>>> len(S)                 # 2 characters long (not number of bytes!)
2
Now, if we try to encode a non-ASCII string to raw bytes as ASCII, we'll get an error. Encoding as Latin-1 works, though, and allocates one byte per character; encoding as UTF-8 allocates 2 bytes per character instead. If you write this string to a file, the raw bytes shown are what is actually stored in the file for the encoding types given:
>>> S = '\u00c4\u00e8'
>>> S.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>> S.encode('latin-1')              # one byte per character
b'\xc4\xe8'
>>> S.encode('utf-8')                # two bytes per character
b'\xc3\x84\xc3\xa8'
>>> len(S.encode('latin-1'))         # 2 bytes in latin-1, 4 in utf-8
2
>>> len(S.encode('utf-8'))
4
Note that you can also go the other way, from raw bytes back to a Unicode string. You could read raw bytes from a file and decode manually this way, but the encoding name you give to the open() call causes this decoding to be done for you automatically (and avoids issues that may arise from reading partial character sequences when reading by blocks of bytes):
>>> B = b'\xc4\xe8'
>>> B
b'\xc4\xe8'
>>> len(B)                           # 2 raw bytes, 2 characters
2
>>> B.decode('latin-1')              # decode per latin-1 to text
'Äè'
>>> B = b'\xc3\x84\xc3\xa8'
>>> len(B)                           # 4 raw bytes
4
>>> B.decode('utf-8')
'Äè'
>>> len(B.decode('utf-8'))           # 2 Unicode characters
2
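The partial-character issue mentioned above can be reproduced by hand. In this sketch, UTF-8 bytes are split into blocks at points that fall inside a multibyte character. Decoding a block on its own fails, while an incremental decoder from the standard codecs module buffers the incomplete bytes until the rest arrives. The block boundaries here are chosen on purpose to land mid-character.

```python
import codecs

data = 'A\xc4B\xe8C'.encode('utf-8')      # 7 bytes: 2 characters take 2 bytes each
chunks = [data[:2], data[2:5], data[5:]]  # boundaries fall inside characters

# Decoding a block in isolation fails if it ends partway through a character
try:
    chunks[0].decode('utf-8')
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# An incremental decoder holds incomplete bytes over to the next call
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b'', final=True))   # flush; errors if bytes remain
text = ''.join(pieces)
print(naive_ok, pieces, text)
```

Text-mode files opened with an encoding name handle this buffering for you, which is one reason to prefer them over manual decoding of byte blocks.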
When needed, you can also specify both 16- and 32-bit Unicode code-point values for characters in your str strings: use \u... with 4 hex digits for the former, and \U... with 8 hex digits for the latter. As the last example in the following shows, you can also build such strings up piecemeal using chr(), but it might become tedious for large strings:
>>> S = 'A\u00c4B\U000000e8C'
>>> S                                # A, B, C, and 2 non-ASCII characters
'AÄBèC'
>>> len(S)                           # 5 characters long
5
>>> S.encode('latin-1')
b'A\xc4B\xe8C'
>>> len(S.encode('latin-1'))         # 5 bytes in latin-1
5
>>> S.encode('utf-8')
b'A\xc3\x84B\xc3\xa8C'
>>> len(S.encode('utf-8'))           # 7 bytes in utf-8
7
>>> S.encode('cp500')                # two other western european encodings
b'\xc1c\xc2T\xc3'
>>> S.encode('cp850')                # 5 bytes each
b'A\x8eB\x8aC'
>>> S = 'spam'                       # ascii text is the same in most
>>> S.encode('latin-1')
b'spam'
>>> S.encode('utf-8')
b'spam'
>>> S.encode('cp500')                # cp500 is ibm ebcdic
b'\xa2\x97\x81\x94'
>>> S.encode('cp850')
b'spam'
>>> S = 'A' + chr(0xC4) + 'B' + chr(0xE8) + 'C'
>>> S
'AÄBèC'
Notice that Python 3.0 allows special characters' code points to be coded with both hex and Unicode escapes in str string literals, but allows only hex escapes in bytes literals; in fact, Unicode escape sequences are taken verbatim in bytes, and not as escapes. This makes sense if you remember that bytes objects hold characters' encoded bytes, not their decoded code points. This is true even though code-point and encoded-byte values happen to be the same for some characters in some encodings (confusingly!). Because bytes are not code points, they also must be decoded to str to print their non-ASCII characters properly:
>>> S = 'A\xC4B\xE8C'                # str recognizes hex and Unicode escapes
>>> S
'AÄBèC'
>>> S = 'A\u00C4B\U000000E8C'        # 4- and 8-digit Unicode escapes (str only)
>>> S
'AÄBèC'
>>> B = b'A\xC4B\xE8C'               # bytes recognizes hex but not Unicode
>>> B
b'A\xc4B\xe8C'
>>> B = b'A\u00C4B\U000000E8C'       # Unicode escape sequences taken literally
>>> B                                # bytes are encoded bytes, not code points
b'A\\u00C4B\\U000000E8C'
>>> B = b'A\xC4B\xE8C'               # use hex escapes for latin-1 bytes
>>> B                                # prints non-ASCII as hex
b'A\xc4B\xe8C'
>>> print(B)
b'A\xc4B\xe8C'
>>> B.decode('latin-1')              # decode to str to interpret as text
'AÄBèC'
Finally, notice that bytes literals assume that embedded characters are ASCII, and require escapes for byte values > 127; str literals allow embedding any character supported by the file's source-code encoding (which defaults to UTF-8 in 3.X, unless encoding declarations are given, as discussed ahead):
>>> S = 'AÄBèC'                      # chars from UTF-8 if no encoding declaration
>>> S
'AÄBèC'
>>> B = b'AÄBèC'
SyntaxError: bytes can only contain ASCII literal characters.
>>> B = b'A\xC4B\xE8C'               # chars must be ASCII, or escapes
>>> B                                # non-ASCIIs are latin-1 encoded bytes
b'A\xc4B\xe8C'
>>> B.decode('latin-1')
'AÄBèC'
>>> S.encode()                       # encoded per UTF-8 by default, unless an encoding is passed
b'A\xc3\x84B\xc3\xa8C'
>>> S.encode('utf-8')
b'A\xc3\x84B\xc3\xa8C'
>>> B.decode()                       # raw bytes do not correspond to utf-8
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: ...

>>> S = 'AÄBèC'
>>> S
'AÄBèC'
>>> S.encode()                       # default utf-8 encoding
b'A\xc3\x84B\xc3\xa8C'
>>>
>>> T = S.encode('cp500')            # convert to EBCDIC
>>> T
b'\xc1c\xc2T\xc3'
>>>
>>> U = T.decode('cp500')            # convert back to Unicode
>>> U
'AÄBèC'
>>>
>>> U.encode()                       # back to UTF-8 bytes, by default
b'A\xc3\x84B\xc3\xa8C'
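The decode-then-encode round trip at the end of the prior listing generalizes to converting between any two encodings. Here is a minimal sketch of that idea; the function name transcode is a hypothetical choice, and the code assumes the source encoding of the bytes is known in advance.

```python
def transcode(data, source, target):
    """Decode bytes per the source encoding, then re-encode per the target."""
    return data.decode(source).encode(target)

ebcdic = 'A\xc4B\xe8C'.encode('cp500')             # EBCDIC-encoded bytes
utf8 = transcode(ebcdic, 'cp500', 'utf-8')         # same text, UTF-8 bytes
latin1 = transcode(utf8, 'utf-8', 'latin-1')       # same text, Latin-1 bytes
print(ebcdic, utf8, latin1)
```

The intermediate str is the pivot: bytes in one encoding are never mapped to another directly, only through decoded code points.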
Now that you've seen the basics of Unicode strings in 3.0, it's
also important to know that you can do much the same in 2.6, though the
tools differ. Unicode is already available in Python 2.6, but it is
a distinct data type from str
, and 2.6 allows free mixing of
normal and unicode
strings when compatible. In fact, you can
essentially pretend 2.6's str
is 3.0's bytes
when it comes
to decoding into a unicode
string, as long as it's in proper form.
Here's 2.6 string support in action (all other sections in this topic but this one are run under 3.0):
>>> import sys
>>> sys.version
'2.6 (r26:66721, Oct 2 2008, 11:35:03) [MSC v.1500 32 bit (Intel)]'
>>> S = 'A\xC4B\xE8C'  # string of 8-bit bytes
>>> print S  # some are non-ascii
AÄBèC
>>> S.decode('latin-1')  # decode byte to latin-1 unicode
u'A\xc4B\xe8C'
>>> S.decode('utf-8')  # not formatted as utf-8
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid data
>>> S.decode('ascii')  # outside ascii range
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal not in range(128)
To store arbitrarily encoded Unicode text, make a Unicode
object with the u'...'
literal form; this is no longer
available in 3.0, since all strings support Unicode in that version
(update: as noted
earlier, Python 3.X
later reinstated 2.X's u'...'
unicode
string
literals to ease migration of 2.X code):
>>> U = u'A\xC4B\xE8C'  # make unicode string, hex escapes
>>> U
u'A\xc4B\xe8C'
>>> print U
AÄBèC
Once created, you convert Unicode text to different encodings;
this is similar to encoding str
objects into bytes
objects in 3.0:
>>> U.encode('latin-1')  # encode per latin-1: 8-bit bytes
'A\xc4B\xe8C'
>>> U.encode('utf-8')  # encode per utf-8: multi-byte
'A\xc3\x84B\xc3\xa8C'
Non-ASCII characters can be coded with hex or Unicode escapes
in string literals just as in 3.0, but just as for bytes
in 3.0,
the \u...
and \U...
escapes are recognized only for
unicode
strings in 2.6, not 8-bit str
strings:
>>> U = u'A\xC4B\xE8C'  # hex escapes for non-ascii
>>> U
u'A\xc4B\xe8C'
>>> print U
AÄBèC
>>> U = u'A\u00C4B\U000000E8C'  # unicode escapes for non-ASCII
>>> U  # \u = 16 bits, \U = 32 bits
u'A\xc4B\xe8C'
>>> print U
AÄBèC
>>> S = 'A\xC4B\xE8C'  # hex escapes work
>>> S
'A\xc4B\xe8C'
>>> print S  # but some print oddly, unless decoded
A-BFC
>>> print S.decode('latin-1')
AÄBèC
>>> S = 'A\u00C4B\U000000E8C'  # not unicode escapes: taken literally!
>>> S
'A\\u00C4B\\U000000E8C'
>>> print S
A\u00C4B\U000000E8C
>>> len(S)
19
Like 3.0's str
and bytes
,
2.6's unicode
and str
share nearly identical
operation sets, so you can often treat unicode
as though it were str
unless you need to convert to other encodings. One of the primary
differences between 2.6 and 3.0 is that in 2.6, unicode and non-unicode str objects
can be freely mixed in expressions, as long as the non-unicode object
contains only 7-bit ASCII characters; the non-unicode str is automatically
converted up to unicode in the process (in 3.0, str and bytes never
mix automatically, and require manual conversions):
>>> u'ab' + 'cd'  # can mix if compatible (if str is all ASCII)
u'abcd'
>>> S = 'A\xC4B\xE8C'  # can't mix if incompatible
>>> U = u'A\xC4B\xE8C'
>>> S + U
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal not in range(128)
>>> S.decode('latin-1') + U  # manual conversion still required
u'A\xc4B\xe8CA\xc4B\xe8C'
>>> print S.decode('latin-1') + U
AÄBèCAÄBèC
Finally, note that 2.6's open()
call supports only files of
8-bit bytes, and returns their content as str
strings; it's up
to you to interpret that content as text or binary data. To
read and write Unicode files and encode or decode their content
in the process, see 2.6's library manual for information on the
codecs.open()
call. This call provides much the same functionality as 3.0's
open()
, and uses 2.6 unicode
objects to represent file
content—reading a file translates encoded bytes into decoded
Unicode characters, and writing translates Unicode strings to the
desired encoding specified when opened.
We'll see more on files in both Pythons
ahead.
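For code that must run on both lines, one option is the io.open() call, which is present in 2.6 and later and is the same function as the built-in open() in 3.X. The following is a minimal sketch rather than a recommendation; the scratch file path is hypothetical, and the u'...' literal assumes 2.X or Python 3.3 and later:

```python
import io
import os
import tempfile

# Hypothetical scratch file, used only for this demo
path = os.path.join(tempfile.mkdtemp(), 'unidata.txt')

with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'A\xc4B\xe8C')               # unicode in 2.X, str in 3.X

with io.open(path, 'r', encoding='utf-8') as f:
    text = f.read()                       # decoded Unicode text

with io.open(path, 'rb') as f:
    raw = f.read()                        # encoded bytes, as stored
```

The text comes back as five decoded characters, while the raw read shows the seven UTF-8 bytes actually stored in the file.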
One last note on coding non-ASCII text: Unicode escapes suffice for the occasional Unicode character in string literals, but can become tedious if you need to code non-ASCII text in your strings frequently. For string literals and other text that you embed in your script files, Python uses the UTF-8 encoding in 3.X (and ASCII in 2.X) by default to read your code's text, but allows you to change this per file to use an arbitrary encoding—and hence directly embed any unescaped characters that the chosen encoding supports.
To make this work, simply include a comment which names the encoding used to save your source file. This special encoding-declaration comment must appear as either the first or second line in your script, and is usually of the following form (see Python's manuals for other formats it accepts):
# -*- coding: latin-1 -*-
When present, Python will recognize strings represented natively in the given encoding. That way, you can edit your script file in a text editor that accepts, displays, and saves accented and other non-ASCII characters, and Python will correctly decode them when reading your string literals and other program-file text.
For example, notice how the comment at the top of the following file, "text.py," allows Python to recognize Latin-1 characters embedded in strings when the file is saved with this encoding:
# -*- coding: latin-1 -*-

# any of the following string literal forms work in latin-1;
# changing the encoding above to either ascii or utf-8 fails,
# because the 0xc4 and 0xe8 in myStr1 are not valid in either

myStr1 = 'aÄBèC'
myStr2 = 'A\u00c4B\U000000e8C'
myStr3 = 'A' + chr(0xC4) + 'B' + chr(0xE8) + 'C'

import sys, locale
print('Sys default encoding: ', sys.getdefaultencoding())
print('Open default encoding:', locale.getencoding())  # later Python 3.X

for aStr in myStr1, myStr2, myStr3:
    print('{0}, strlen={1}, '.format(aStr, len(aStr)), end='')
    bytes1 = aStr.encode()  # per default utf-8: 2 bytes for non-ASCII
    bytes2 = aStr.encode('latin-1')  # one byte per char
    #bytes3 = aStr.encode('ascii')  # ascii fails: outside 0..127 range
    print('byteslen1={0}, byteslen2={1}'.format(len(bytes1), len(bytes2)))

C:\misc>c:\python30\python text.py
Sys default encoding: utf-8
Open default encoding: cp1252
aÄBèC, strlen=5, byteslen1=7, byteslen2=5
AÄBèC, strlen=5, byteslen1=7, byteslen2=5
AÄBèC, strlen=5, byteslen1=7, byteslen2=5
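To see the declaration's effect without saving a file in a particular encoding, the following sketch passes raw source bytes to the built-in compile(), which decodes bytes source per its encoding declaration (or UTF-8 if there is none). The byte strings stand in for file content saved as Latin-1, and the names with_decl, without_decl, and namespace are illustrative only:

```python
# Source "files" as raw bytes, as they would sit on disk if saved as Latin-1
with_decl = b"# -*- coding: latin-1 -*-\nmyStr = 'A\xc4B\xe8C'\n"
without_decl = b"myStr = 'A\xc4B\xe8C'\n"

# With the declaration, bytes 0xC4 and 0xE8 decode as Latin-1 characters
namespace = {}
exec(compile(with_decl, '<latin1-demo>', 'exec'), namespace)
decoded = namespace['myStr']

# Without it, the same bytes are not valid under the UTF-8 default
try:
    compile(without_decl, '<utf8-demo>', 'exec')
    default_failed = False
except (SyntaxError, UnicodeDecodeError):
    default_failed = True
```

The exception raised in the second case may vary by Python release, which is why the sketch catches both likely types.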
Since most programmers are likely to fall back on the default source encodings (especially the general UTF-8 in Python 3.X), we'll defer to Python's standard manual set for more details on this option, as well as more advanced Unicode support such as properties and character-name escapes in strings that we'll skip here.
We'll see the string types we've met in action again when we study files
ahead.
First, though, let's take a brief detour
to dig a bit deeper into the operation sets provided by the new bytes
type in 3.0.
As mentioned earlier,
the 3.0 bytes
type supports sequence operations and most of the same
methods available on str
(and present in 2.X's str
type).
However, bytes
does not
support the format()
method or the %
formatting expression
(until 3.5, per the update ahead).
Moreover, you cannot mix and match bytes
and str
without explicit
conversions—you generally will use all str
type objects and text files for
text data, and all bytes
type objects and binary files for binary data.
If you really want to see what attributes str
has that bytes
doesn't, you can
always check their dir()
results; this can also tell you something about the expression
operators they support (e.g., __mod__
and __rmod__
implement the %
operator):
C:\misc>c:\python30\python
Python 3.0 (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.

# Attributes unique to str
>>> set(dir('abc')) - set(dir(b'abc'))
{'isprintable', 'format', '__mod__', 'encode', 'isidentifier',
'_formatter_field_name_split', 'isnumeric', '__rmod__', 'isdecimal',
'_formatter_parser', 'maketrans'}

# Attributes unique to bytes
>>> set(dir(b'abc')) - set(dir('abc'))
{'decode', 'fromhex'}
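The exact sets shown above are specific to 3.0 and shift from release to release (for instance, bytes gained the % operator in 3.5), so it can be worth rerunning the comparison on your own Python. A minimal version-neutral sketch, using the type objects instead of literals:

```python
# Attribute names present on one string type but not the other;
# the precise contents depend on the Python version running this.
only_str = set(dir(str)) - set(dir(bytes))
only_bytes = set(dir(bytes)) - set(dir(str))

print('str only:  ', sorted(only_str))
print('bytes only:', sorted(only_bytes))
```

Whatever else changes, encode() and format() remain on the str side, and decode() and fromhex() on the bytes side.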
As you can see, str
and bytes
have almost identical functionality; their unique
attributes are generally methods that don't apply to the other. For instance,
decode()
translates a raw bytes
into its str
representation, and
encode()
translates a str
into its raw bytes
representation.
Most methods are shared between str
and bytes
, though. Moreover,
bytes
are immutable just like str
in both 2.6 and 3.0
(error messages here have been shortened for brevity):
>>> B = b'spam'  # b'...' bytes literal
>>> B.find(b'pa')
1
>>> B.replace(b'pa', b'XY')
b'sXYm'
>>> B
b'spam'
>>> B[0] = 'x'
TypeError: 'bytes' object does not support item assignment
One notable exception to this rule: string formatting works only on str
in 3.0,
not on bytes
. As told here,
3.0 also convolutes the string formatting story in general by adding redundant
functionality, but that story is beyond the scope of this page:
>>> b'%s' % 99
TypeError: unsupported operand type(s) for %: 'bytes' and 'int'
>>> '%s' % 99
'99'
>>> b'{0}'.format(99)
AttributeError: 'bytes' object has no attribute 'format'
>>> '{0}'.format(99)
'99'
Update:
Python 3.5 eventually extended %
formatting
(only) to bytes
objects per
this page—for better
or worse. The extension has a heavily ASCII bias which clashes badly with the
generalized Unicode text model of 3.X, but may be useful in limited contexts.
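For reference, here is a brief sketch of how bytes formatting behaves on Python 3.5 and later (it will not run on earlier 3.X). Numeric codes work, %s accepts only bytes-like objects, and the format() method is still absent; the variable names are illustrative only:

```python
# Assumes Python 3.5+, where % formatting was extended to bytes
num = b'%d items' % 99                    # numeric codes produce ASCII digits
joined = b'%s=%s' % (b'key', b'val')      # %s takes bytes-like objects

# A str operand is rejected: text must be encoded explicitly first
try:
    b'%s' % 'text'
    str_arg_failed = False
except TypeError:
    str_arg_failed = True

has_format = hasattr(bytes, 'format')     # the format() method was not added

# An alternative that works in every 3.X: format as text, then encode
via_text = '{0}={1}'.format('key', 99).encode('ascii')
```

Formatting as str and encoding the result keeps the text/binary boundary explicit, at the cost of one extra call.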
Besides method calls, all the usual generic sequence operations you know (and
possibly love) from Python 2.X strings and lists work as expected on both str
and bytes
in 3.0; this includes indexing, slicing, concatenation, and so on.
Notice in the following that indexing bytes
returns an integer giving the
byte's binary value; bytes
really is a sequence of 8-bit integers, but it
prints as a string of ASCII-coded characters (plus non-ASCII escapes) when
displayed as a whole. To check a given byte's text interpretation, use
chr()
to convert it back to its character:
>>> B = b'spam'
>>> B
b'spam'
>>> B[0]
115
>>> B[-1]
109
>>> chr(B[0])
's'
>>> B[1:], B[:-1]
(b'pam', b'spa')
>>> len(B)
4
>>> B + b'lmn'
b'spamlmn'
>>> B * 4
b'spamspamspamspam'
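One consequence of this that often surprises newcomers is that indexing and slicing return different types for bytes: an index yields an int, but a one-item slice yields another bytes. A short sketch of the distinction, with illustrative names:

```python
B = b'spam'

first_item = B[0]                     # indexing: an int byte value
first_slice = B[0:1]                  # slicing: a one-byte bytes object

codes = [byte for byte in B]          # iteration also yields ints
chars = [chr(byte) for byte in B]     # text view, sensible for ASCII bytes
rebuilt = bytes(codes)                # ints in 0..255 convert back to bytes
```

If you need a one-byte bytes rather than an integer, slice instead of indexing.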
So far, we've been making bytes
objects with the b'...'
literal syntax;
they can also be created by calling the bytes()
constructor with a str
and
an encoding name, calling bytes
with an iterable of integers representing
byte values, or encoding a str
object per the default (or passed-in) encoding.
Encoding takes a str
and returns the raw binary bytes value of the string
according to its encoding specification; decoding takes a raw bytes
sequence
and decodes it to its string representation: a series of Unicode
characters:
>>> B = b'abc'
>>> B
b'abc'
>>> B = bytes('abc', 'ascii')
>>> B
b'abc'
>>> ord('a')
97
>>> B = bytes([97, 98, 99])
>>> B
b'abc'
>>> B = 'spam'.encode()  # or bytes()
>>> B
b'spam'
>>>
>>> S = B.decode()  # or str()
>>> S
'spam'
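The creation options just described can be lined up side by side. This sketch also shows two forms not used above, bytes.fromhex() and bytes() with an integer size, plus a common trap: calling str() on a bytes without an encoding gives its print form, not its decoded text. All names are illustrative:

```python
from_literal = b'abc'
from_text = bytes('abc', 'ascii')         # str plus an encoding name
from_ints = bytes([97, 98, 99])           # iterable of ints in 0..255
from_encode = 'abc'.encode('ascii')       # str method, same result
from_hex = bytes.fromhex('616263')        # pairs of hex digits
zeros = bytes(3)                          # an int makes that many zero bytes

back = str(from_ints, 'ascii')            # decodes: same as .decode('ascii')
repr_trap = str(from_ints)                # no encoding: the print string
```

The last line is legal, but it produces a six-character str that includes the b prefix and quotes, which is rarely what you want.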
From a larger perspective, the last two of these operations can also be seen as
tools for converting between str
and bytes
, introduced
earlier
and expanded upon in the next section.
Notice in the replace()
call of the earlier method-calls
section how we have to pass in two
bytes objects—str
types won't work there. Although Python 2.X automatically converts
str to and from unicode when possible (that is, when the str is only 7-bit ASCII
text), Python 3.X requires specific string types in some contexts, and expects manual conversions if needed:
# Must pass expected types to function and method calls
>>> B = b'spam'
>>> B.replace('pa', 'XY')
TypeError: expected an object with the buffer interface
>>> B.replace(b'pa', b'XY')
b'sXYm'
>>> B = B'spam'
>>> B.replace(bytes('pa'), bytes('xy'))
TypeError: string argument without an encoding
>>> B.replace(bytes('pa', 'ascii'), bytes('xy', 'utf-8'))
b'sxym'

# Must convert manually in mixed-type expressions
>>> b'ab' + 'cd'
TypeError: can't concat bytes to str
>>> b'ab'.decode() + 'cd'  # bytes to str
'abcd'
>>> b'ab' + 'cd'.encode()  # str to bytes
b'abcd'
>>> b'ab' + bytes('cd', 'ascii')  # str to bytes
b'abcd'
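When a program receives either type and needs just one, it is common to funnel values through small conversion helpers at the boundary. The following is one possible sketch, not a standard-library tool; to_bytes and to_str are hypothetical names, and UTF-8 is an assumed default that your data may not match:

```python
def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding it first if it is a str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return bytes(value)                   # copies bytes or bytearray content

def to_str(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes-like."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value

replaced = b'spam'.replace(to_bytes('pa'), to_bytes(b'XY'))
mixed_text = to_str(b'ab') + 'cd'
mixed_raw = b'ab' + to_bytes('cd')
```

Helpers like these are best kept at the edges of a program; inside it, sticking to one type per kind of data avoids the need for them.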
Two footnotes here. First, remember that encoding and decoding are more
than a simple type conversion; as we learned in the fuller coverage
earlier, they create different
types of data altogether. Second,
although you can create bytes
objects yourself to represent packed
binary data, they can also be made automatically by reading files opened in binary
mode, as we'll see
later in this article.
First, though, let's briefly meet bytes
' changeable cousin.
So far, we've focused on str
and bytes
,
since they subsume 2.6's unicode
and str
.
Python 3.0 has a third string type, though—bytearray
is essentially a mutable
variant of bytes
, and thus a mutable sequence of integers in the range 0..255. As such,
it supports the same string methods and sequence operations as bytes
, as well as
the mutable in-place-change operations found on lists:
# Creation: a mutable sequence of small (0..255) ints
>>> B = b'spam'  # str 'spam' works in 2.X only
>>> C = bytearray(B)
>>> C
bytearray(b'spam')
>>> C[0], chr(C[0])  # ASCII integer code for 's'
(115, 's')

# Mutable, but must assign ints, not strings
>>> C[0] = 'x'
TypeError: an integer is required
>>> C[0] = b'x'
TypeError: an integer is required
>>> C[0] = ord('x')
>>> C
bytearray(b'xpam')
>>> C[1] = b'Y'[0]
>>> C
bytearray(b'xYam')

# Methods overlap with both str and bytes, but also has list's mutable methods
>>> set(dir(b'abc')) - set(dir(bytearray(b'abc')))
{'__getnewargs__'}
>>> set(dir(bytearray(b'abc'))) - set(dir(b'abc'))
{'insert', '__alloc__', 'reverse', 'extend', '__delitem__', 'pop', '__setitem__',
'__iadd__', 'remove', 'append', '__imul__'}

# Mutable method calls
>>> C
bytearray(b'xYam')
>>> C.append(b'LMN')
TypeError: an integer is required
>>> C.append(ord('L'))
>>> C
bytearray(b'xYamL')
>>> C.extend(b'MNO')
>>> C
bytearray(b'xYamLMNO')

# Sequence operations and string methods
>>> C + b'!#'
bytearray(b'xYamLMNO!#')
>>> C[0]
120
>>> C[1:]
bytearray(b'YamLMNO')
>>> len(C)
8
>>> C
bytearray(b'xYamLMNO')
>>> C.replace('xY', 'sp')
TypeError: Type str doesn't support the buffer API
>>> C.replace(b'xY', b'sp')
bytearray(b'spamLMNO')
>>> C
bytearray(b'xYamLMNO')
>>> C * 4
bytearray(b'xYamLMNOxYamLMNOxYamLMNOxYamLMNO')
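The session above assigns single items; slices and in-place operators work too, and they accept bytes-like objects rather than ints. A small sketch of the in-place changes available, ending with a conversion back to an immutable bytes (names are illustrative):

```python
buf = bytearray(b'spam')

buf[0] = ord('S')                     # item assignment takes an int
buf[1:3] = b'PA'                      # slice assignment takes a bytes-like
buf.extend(b'!!')                     # grows in place, like a list
buf += b'?'                           # in-place concatenation
del buf[-1]                           # and deletion

frozen = bytes(buf)                   # immutable snapshot of the content
```

Converting to bytes at the end gives you a value that can be used as a dictionary key or shared without risk of later changes.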
Finally, by way of summary, the following examples demonstrate how bytes
and bytearray
are sequences of ints
, and str
is a sequence of characters
(i.e., decoded Unicode code points); although all three can contain character values and support
many of the same operations, you should use str
for textual data, bytes
for binary data,
and bytearray
for binary data you wish to change in place:
>>> B
b'spam'
>>> list(B)
[115, 112, 97, 109]
>>> C
bytearray(b'xYamLMNO')
>>> list(C)
[120, 89, 97, 109, 76, 77, 78, 79]
>>> S = 'spam'
>>> list(S)
['s', 'p', 'a', 'm']
Now that we've learned all about Python's string types, let's turn to their roles in files—the main context in which most programmers will likely encounter Unicode and bytes, and the last major topic of this tutorial.
As also mentioned
above, the mode in which you open a file is crucial: it
determines which object type you will use to represent the file's content
in your script. Text mode implies str
objects, and binary mode implies bytes
:
By passing an encoding name to open(), you can force conversions for various types of Unicode
files. Text mode files may also perform universal line-end translations for you or
not; by default, all line-end forms map to the \n
character in your script,
regardless of which platform you are on.
In terms of code, the second positional argument to open()
determines whether you want
text or binary processing and types, just as it does in 2.X Python—adding a "b" to
the string implies binary mode. The default mode is "rt" which is the same as "r",
which means text input, just as in 2.X. In 3.0, though, this mode argument to open()
also implies an object type for file content representation regardless of the
underlying platform—text files return a str
for reads and expect one for writes,
but binary files return a bytes
for reads and expect bytes
(or bytearray
)
for writes.
To demonstrate, let's begin with basic file I/O. As long as you're processing
basic text files (e.g., ASCII) and don't care about circumventing the platform-default
encoding of strings, files look and feel much as they do in 2.X (for that matter, so
do strings in general). The following, for instance, writes one line of text to a
file and reads it back in 3.0, exactly as it would in 2.6 (note that file
is no
longer a built-in name in 3.0, and it's perfectly okay to use it as a variable here either way):
C:\misc>c:\python30\python
# Basic text files (and strings) work the same as in 2.X
>>> file = open('temp', 'w')
>>> size = file.write('abc\n')  # returns number of characters written
>>> file.close()  # manual close to flush output buffer
>>> file = open('temp')  # default mode is "r" (== "rt"), which means text input
>>> text = file.read()
>>> text
'abc\n'
Next, we'll write a text file and read it back in both modes in 3.0. Notice
that we are required to provide a str
for writing, but reading gives us a
str
or bytes
depending on the open mode (I've strung operations
together here into one-liners just for brevity):
# Write and read a text file
>>> open('temp', 'w').write('abc\n')  # text mode output, provide a str
4
>>> open('temp', 'r').read()  # text mode input, returns a str
'abc\n'
>>> open('temp', 'rb').read()  # binary mode input, returns a bytes
b'abc\r\n'
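The \r\n in the binary read above reflects Windows line-end translation, so the same code shows a plain \n on Unix. To reproduce the effect on any platform, this sketch forces Windows-style line ends with the newline argument of open(); the scratch file path is hypothetical, and the ASCII encoding is passed only to keep the demo independent of platform defaults:

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demo
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

# Text-mode output: each '\n' written is stored as '\r\n'
with open(path, 'w', encoding='ascii', newline='\r\n') as f:
    f.write('abc\n')

# Text-mode input: line ends are mapped back to '\n' by default
with open(path, 'r', encoding='ascii') as f:
    as_text = f.read()

# Binary-mode input: bytes are returned exactly as stored
with open(path, 'rb') as f:
    as_bytes = f.read()
```

Text mode hides the stored line-end form from your script, while binary mode exposes it, which is one reason binary data should never be read in text mode.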
Now, let's do the same, but with a binary file; we must provide a bytes
to
write, and still get back a str
or bytes
depending on the input mode:
# Write and read a binary file
>>> open('temp', 'wb').write(b'abc\n')  # binary mode output, provide a bytes
4
>>> open('temp', 'r').read()  # text mode input, returns a str
'abc\n'
>>> open('temp', 'rb').read()  # binary mode input, returns a bytes
b'abc\n'
Notice that the same holds even if the data we're writing to the binary file is
truly binary in nature; in the following, the \x00
is a binary zero byte, and
not a printable character (though it passes as a text code point in the default encoding):
# Write and read binary data
>>> open('temp', 'wb').write(b'a\x00c')
3
>>> open('temp', 'r').read()
'a\x00c'
>>> open('temp', 'rb').read()
b'a\x00c'
Binary mode files always return contents as a bytes
object, but accept either
a bytes
or bytearray
object for writing. This naturally follows, given that
bytearray
is mostly just a mutable variant of bytes
. In fact, most APIs in
Python 3.0 that accept a bytes
also allow a bytearray
:
# Bytearrays work too
>>> BA = bytearray(b'\x01\x02\x03')
>>>
>>> open('temp', 'wb').write(BA)
3
>>> open('temp', 'r').read()
'\x01\x02\x03'
>>> open('temp', 'rb').read()
b'\x01\x02\x03'
Finally, notice that you can't get away with violating Python's str
/bytes
distinction when it comes to files; in the following we get errors (shortened
here) if we try to write a bytes
to a text file, or a str
to a binary file.
Although it is often possible to convert between these two types (as described
earlier
in this article), you will usually want to stick to str
for text data and bytes
for binary data:
# Types are not flexible for file content
>>> open('temp', 'w').write('abc\n')  # auto encodes str to bytes
4
>>> open('temp', 'w').write(b'abc\n')  # but bytes is not decoded text
TypeError: can't write bytes to text stream
>>> open('temp', 'wb').write(b'abc\n')  # writes raw bytes
4
>>> open('temp', 'wb').write('abc\n')  # but str is not raw bytes
TypeError: can't write str to binary stream
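Error message wording has changed across 3.X releases, but the exception type has not, so the rules can be checked in a version-neutral way. The following sketch tries each mode and data-type pairing and records the outcome; try_write is a hypothetical helper and the scratch file path is an assumption of the demo:

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demo
path = os.path.join(tempfile.mkdtemp(), 'temp')

def try_write(mode, data):
    """Attempt one write; return 'ok' or the name of the exception raised."""
    try:
        with open(path, mode) as f:
            f.write(data)
        return 'ok'
    except TypeError as exc:
        return type(exc).__name__

results = {
    'text  <- str':       try_write('w',  'abc\n'),
    'text  <- bytes':     try_write('w',  b'abc\n'),
    'binary <- bytes':    try_write('wb', b'abc\n'),
    'binary <- bytearray': try_write('wb', bytearray(b'abc\n')),
    'binary <- str':      try_write('wb', 'abc\n'),
}

for label, outcome in results.items():
    print('{0:20} {1}'.format(label, outcome))
```

Only the matching pairings succeed: str for text files, and bytes or bytearray for binary files.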
This may seem strict, but Python cannot guess how you may wish to interpret the
contents of a bytes
or str
when used in the opposite
context, and wisely refuses to decode or encode content implicitly.
Moreover, because str
and bytes
operation sets largely
intersect, the choice of types won't be much of a dilemma for most programs.
See earlier in this article for more
on bytes
operations and mixed-type
constraints,
and the struct
module coverage ahead for another binary-file
example.
open()
encodings for text files. In brief,
text files allow a specific Unicode encoding-scheme name to be passed in with an
encoding
argument, and use it to automatically decode and encode
text on input and output, respectively:
open(filepathname, 'r', encoding='utf8') decodes on reads
open(filepathname, 'w', encoding='latin1') encodes on writes
codecs.open() is equivalent in 2.X
The first of the above, for instance, assumes the file's content is
encoded per UTF-8, and automatically decodes its data to str
code points when read by the program. Similarly, the second bullet above
encodes str
code points to their Latin-1 format as they are
output to the file. File transfers raise exceptions whenever a requested
format doesn't work.
For example, in Python 3.X:
>>> file = open('uni.txt', 'w', encoding='utf8')  # auto encodes to bytes
>>> file.write('spÄm')
4
>>> file.close()
>>> text = open('uni.txt', 'r', encoding='utf8').read()  # auto decodes to str
>>> text
'spÄm'
>>> raw = open('uni.txt', 'rb').read()  # no decoding applied
>>> raw
b'sp\xc3\x84m'
>>> text = open('uni.txt', 'r', encoding='ascii').read()  # Ä's utf8 bytes aren't ascii
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
>>> import codecs
>>> codecs.open('uni.txt', 'r', encoding='utf8').read()  # 2.X's flavor in 3.X
'spÄm'
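To make the role of the encoding argument more concrete, this sketch writes the same four-character string under three encodings, measures each file's size in bytes, and then reads the Latin-1 file as though it were UTF-8. File names and variable names are illustrative only:

```python
import os
import tempfile

folder = tempfile.mkdtemp()               # hypothetical scratch folder
text = 'sp\xc4m'                          # same as 'spÄm'

sizes = {}
round_trip_ok = True
for enc in ('latin-1', 'utf-8', 'utf-16'):
    path = os.path.join(folder, 'uni-' + enc + '.txt')
    with open(path, 'w', encoding=enc) as f:
        f.write(text)                     # encoded per enc on output
    with open(path, 'rb') as f:
        sizes[enc] = len(f.read())        # stored size in bytes
    with open(path, 'r', encoding=enc) as f:
        round_trip_ok = round_trip_ok and f.read() == text

# Reading with the wrong encoding name fails for this data
latin_path = os.path.join(folder, 'uni-latin-1.txt')
try:
    with open(latin_path, 'r', encoding='utf-8') as f:
        f.read()
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True
```

The UTF-16 file is the largest because it stores two bytes per character here, plus a two-byte byte-order marker at the front.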
In the absence of encoding
, text files still encode and decode per a
platform- and version-specific default noted earlier.
Although you may not notice these translations if your default and files agree,
you generally should not rely on the default; it makes your programs dependent
on the context in which their files were created, and can lead to portability issues.
A program run on a UTF-8 default platform, for instance, may have trouble using
a file made under a Latin-1 default (an interoperability pitfall also noted
here).
For more coverage and examples of these topics, try this site's posts here and here, and see this article's later version in this book.
In closing, it's worth noting that many
of the popular string-processing tools in Python's
standard library have also been revamped for the new str
/bytes
dichotomy. We won't cover any of these application-focused
tools in much detail in this core-language article, but as a sample,
here's a quick look at two of the major tools impacted.
Python's re
pattern-matching module has been generalized to
work on any objects of any string type in 3.0—str
, bytes
, and
bytearray
. Note that you can't mix str
and bytes
types in its
calls' arguments, though:
>>> import re
>>> S = 'Bugger all down here on earth!'
>>> B = b'Bugger all down here on earth!'
>>>
>>> re.match('(.*) down (.*) on (.*)', S).groups()
('Bugger all', 'here', 'earth!')
>>>
>>> re.match(b'(.*) down (.*) on (.*)', B).groups()
(b'Bugger all', b'here', b'earth!')
>>> re.match('(.*) down (.*) on (.*)', B).groups()
...
TypeError: can't use a string pattern on a bytes-like object
>>>
>>> re.match(b'(.*) down (.*) on (.*)', S).groups()
...
TypeError: can't use a bytes pattern on a string-like object
>>> re.match(b'(.*) down (.*) on (.*)', bytearray(B)).groups()
(bytearray(b'Bugger all'), bytearray(b'here'), bytearray(b'earth!'))
>>>
>>> re.match('(.*) down (.*) on (.*)', bytearray(B)).groups()
...
TypeError: can't use a string pattern on a bytes-like object
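In practice, a pattern kept as text can serve both cases if you encode it when the subject is bytes. The sketch below assumes an ASCII-only pattern and subject; the variable names are illustrative, and it deliberately avoids bytearray subjects, whose result types may differ across Python releases:

```python
import re

pattern_text = '(.*) down (.*) on (.*)'
line = 'Bugger all down here on earth!'

# str pattern on str subject: groups are str
groups_text = re.match(pattern_text, line).groups()

# bytes pattern on bytes subject: groups are bytes
groups_raw = re.match(pattern_text.encode('ascii'),
                      line.encode('ascii')).groups()

# Mixing a str pattern with a bytes subject is an error
try:
    re.match(pattern_text, line.encode('ascii'))
    mixed_failed = False
except TypeError:
    mixed_failed = True

# Decoding the bytes groups recovers the text results
decoded_groups = tuple(g.decode('ascii') for g in groups_raw)
```

As with files, the simplest policy is to decode bytes to str once at the boundary and match on text, unless the data is truly binary.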
Along similar lines, the Python struct
module, used to create and extract
packed binary data from strings, works in 3.0 as it does in 2.X, but in 3.X
operates on bytes
and bytearray
only, not str
(which makes sense, given that it's intended for processing binary data, not decoded
text):
>>> import struct
>>> B = struct.pack('>i4sh', 7, b'spam', 8)  # 's' requires bytes as of 3.2
>>> B  # (it encodes str as utf8 in 3.0/3.1)
b'\x00\x00\x00\x07spam\x00\x08'
>>>
>>> vals = struct.unpack('>i4sh', B)  # packed data is bytes, not str
>>> vals
(7, b'spam', 8)
>>>
>>> vals = struct.unpack('>i4sh', B.decode())
TypeError: 'str' does not have the buffer interface
Apart from the new syntax for bytes
, creating and reading binary files
works almost the same in 3.0 as it does in 2.X (and as described briefly
earlier in this article,
and in more detail in this book):
C:\misc>c:\python30\python.exe
>>> F = open('data.bin', 'wb')  # open binary output file
>>> import struct
>>> data = struct.pack('>i4sh', 7, b'spam', 8)  # create packed binary data
>>> data  # bytes in 3.0, not str
b'\x00\x00\x00\x07spam\x00\x08'
>>> F.write(data)  # write to the file
10
>>> F.close()
>>> F = open('data.bin', 'rb')  # open binary input file
>>> data = F.read()  # read bytes
>>> data
b'\x00\x00\x00\x07spam\x00\x08'
>>> values = struct.unpack('>i4sh', data)  # extract packed binary data
>>> values  # back to Python objects
(7, b'spam', 8)
Update: Python 3.2 changed the struct.pack()
call to require a bytes
(or bytearray
) object for its "s"
conversion code; using a str
is an error. In 3.0 and 3.1 a str
is allowed and automatically encoded to bytes as UTF-8 text—arguably
too large an assumption, but 3.2 changes working and documented behavior. To
placate 3.2 and later, the examples above simply use a b'...'
literal
instead of '...'
; in your code, encode as needed (e.g.,
mystr.encode('utf8')
).
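Here is a minimal sketch of the encode-as-needed approach for Python 3.2 and later, along with one caveat: the "s" code counts bytes, not characters, so non-ASCII text can be cut short once encoded. The names used are illustrative only:

```python
import struct

name = 'spam'

# Encode text explicitly before packing, and decode after unpacking
packed = struct.pack('>i4sh', 7, name.encode('utf8'), 8)
count, raw_name, code = struct.unpack('>i4sh', packed)
decoded_name = raw_name.decode('utf8')

size = struct.calcsize('>i4sh')           # 4 + 4 + 2 bytes

# Caveat: 'spÄm' is 4 characters but 5 UTF-8 bytes, so '4s' drops the last byte
truncated = struct.pack('4s', 'sp\xc4m'.encode('utf8'))
```

Because truncation is silent, it is worth sizing string fields by encoded length rather than by character count when text may be non-ASCII.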
For more on re
, struct
,
and other string-related modules impacted by 3.0's new Unicode support,
consult the Python library
manual;
this article's published version,
which also covers pickle
object serialization and XML parsing;
or application-focused follow-up books such as
Programming Python.
An edited and enhanced version of this page's material appeared in this book, and was later expanded in this edition; see the latter for additional coverage of strings in both Python 3.X and 2.X.
For related resources online at this site, you may also be interested in a 2016 review of Python 3.5 bytes-string formatting; 2018 program-usage notes regarding Unicode BOMs, defaults, and encodings; and the 2022 coverage of normalization.