[LP5E] This article was originally written for Pythons 3.0 and 2.6, but applies to all 3.X and 2.X. It became a new chapter in the book Learning Python, and was revised and expanded in the 5th Edition. You may also be interested in a 2016 review of Python 3.5 bytes-string formatting, and 2018 program-usage notes regarding BOMs, defaults, and encodings.


June 2009 (polished Mar-Sep 2018)

Strings in 3.X: Unicode and Binary Data

One of the most noticeable changes in Python 3.0 is the mutation of string object types. In a nutshell, 2.X's str and unicode types morph into 3.X's bytes and str types, along with a new bytearray type. Especially if you process data that is either Unicode or binary in nature, this can have substantial impacts on your code. In fact, as a general rule of thumb, how much you need to care about this topic depends in large part upon which of the following categories you fall into:

  1. If you deal with non-ASCII Unicode data—for instance, in the context of internationalized applications, Internet content, or XML parsers—you will find support for text encodings to be different in 3.0, but also probably more direct, accessible, and seamless than in 2.6, thanks to 3.0's all-Unicode str.
  2. If you deal with binary data—for example, in the form of image or audio files, network transfers, or packed data processed with the struct module—you will need to understand 3.0's new bytes object, and its different and sharper distinction between text and binary data and files.
  3. If you fall into neither of the prior two categories, you can generally use strings in 3.0 much as you would in 2.6: with the general str string type, text files, and all the familiar string operations. Your strings will be encoded and decoded using your platform's default encoding (e.g., ASCII, UTF-8, or Latin-1; locale.getpreferredencoding() gives your open() default if you must know), but you probably won't notice.

In other words, if your text is always ASCII, you might be able to get by with normal string objects and text files, and can avoid most of the following story. As we'll see in a moment, ASCII is a simple kind of Unicode and a subset of other encodings, so string operations and files "just work" if your programs process ASCII text.

Even if you fall into the last category above, though, a basic understanding of 3.0's string model can help, both to demystify some of the underlying details now, and to help you master Unicode or binary data issues if they impact you in the future. Given the prominence of the web in most software careers today, that impact may be more a matter of "when" than "if."

The Basics

Before looking at code, let's begin with a general overview of the 3.0 string model. To understand why 3.0 went the way it did, we have to start with a brief look at how characters are actually represented in computers.

Character Representations

Most programmers think of strings as a series of characters (really, their integer codes) used to represent textual data. That's still true in the brave new world of Unicode, but the way characters are stored in a computer's memory and files can vary, depending on both what sort of characters are recorded, and how programmers choose to record them.

For many programmers in the US, the ASCII standard defines their notion of text strings. ASCII is a standard created in the US, which defines character codes 0..127, and thus allows each character to be stored in one 8-bit byte. For example, the ASCII standard maps character 'a' to the integer value 97 (61 in hex), which can be stored in a single byte both in memory and in files. If you wish to check, Python's ord() gives the integer code value for a character, and chr() returns the character for a given integer code value:

>>> ord('a')
97
>>> hex(97)
'0x61'
>>> chr(97)
'a'

Sometimes this isn't enough, though. Various symbols and accented characters do not fit into the range of possible characters in ASCII. To allow for some special characters, some standards allow all possible values in an 8-bit byte, 0..255, to represent characters, and assign values 128..255 to special characters. One such standard is known as "Latin-1," and is widely used in Western Europe. In Latin-1, character codes above 127 are assigned to accented and otherwise special characters. The character assigned to byte value 196, for example, is a specially marked and non-ASCII character:

>>> 0xC4
196
>>> chr(196)
'Ä'

Still, some alphabets define so many characters that it is impossible to represent them as one byte-sized code per character. Unicode text allows much more flexibility; in fact, it defines enough character codes to represent almost every natural language in use, plus a large set of symbols. It is sometimes referred to as "wide-character" strings, because its range of characters is so broad that multiple bytes may be needed to represent individual character codes. Because of its flexibility, though, Unicode is the standard way that programs deal with non-English and other text that may have more characters than 8-bit bytes can handle.

Character Encoding Schemes

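The key to understanding how Unicode works lies in the way its character codes (a.k.a. "code points") in memory are mapped to their encoded forms as needed for efficient storage or transfer. We say that characters are translated to and from raw bytes using an encoding: the rules for translating a Unicode string into a sequence of bytes, and extracting the string from a sequence of bytes. More procedurally, this translation back and forth between bytes and strings is defined by two terms: encoding is the process of translating a string of characters into its raw bytes form, according to a desired encoding name; decoding is the process of translating a raw string of bytes into its character string form, according to its encoding name.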

Unicode defines both character codes and a set of standard encodings. For some of the encodings it defines, the translation process is trivial—ASCII and Latin-1, for instance, map each character to a single byte, so little or no work is required to encode and decode. For other encodings, the mapping can be more complex, and yield multiple bytes per character.

The widely used "UTF-8" encoding, for example, allows more characters to be represented by employing a variable-number-of-bytes scheme that's both general and economical. Character codes less than 128 are represented as a single byte; codes between 128 and 0x7ff (2047) are turned into two bytes, where each byte has a value between 128 and 255; and codes above 0x7ff are turned into three- or four-byte sequences having values between 128 and 255. This keeps simple ASCII strings compact, sidesteps byte ordering issues, and avoids null (zero) bytes that can cause problems for C libraries and networking.
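The byte counts just described are easy to check. The following is a minimal sketch for Python 3.X; the four sample characters are arbitrary picks, one from each size class, and are not part of the original examples.

```python
# One sample character per UTF-8 size class (arbitrary, illustrative picks).
samples = {
    'A': 'U+0041, below 128',
    '\u00c4': 'U+00C4, in 128..0x7FF',
    '\u20ac': 'U+20AC, in 0x800..0xFFFF',
    '\U0001F600': 'U+1F600, above 0xFFFF',
}

utf8_lengths = {}
for char, note in samples.items():
    encoded = char.encode('utf-8')
    utf8_lengths[char] = len(encoded)
    print(note, '->', len(encoded), 'byte(s):', encoded)

# Every byte in a multi-byte sequence has a value of 128 or more,
# so no zero bytes and no stray ASCII bytes appear inside it.
multibyte_ok = all(byte >= 128 for byte in '\u20ac'.encode('utf-8'))
```

The lengths come out as 1, 2, 3, and 4 bytes respectively, which is the variable-size scheme in miniature.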

Because they assign characters to the same codes for compatibility, ASCII is a subset of both Latin-1 and UTF-8. That is, every character string encoded per ASCII is also valid according to the Latin-1 and UTF-8 encodings. This is true whether the encoded text resides in memory or on files: every ASCII file is a valid UTF-8 file, because ASCII is a 7-bit subset of UTF-8. Technically, the UTF-8 encoding is binary compatible with ASCII for all character codes less than 128. Latin-1 and UTF-8 simply allow for additional characters: Latin-1 for characters mapped to values 128..255 within a byte, and UTF-8 for characters that may be represented with multiple bytes.

Other encodings support richer character sets in other ways. For instance, UTF-32 uses a fixed 4 bytes per character, while UTF-16 uses 2 bytes for most characters and a 4-byte pair for code points above 0xFFFF. We'll skip further details here, but keep in mind that all of these (ASCII, Latin-1, UTF-8, and others) are simply alternative Unicode encodings that yield the same Unicode code-point text when decoded. Once decoded, those code points may or may not occupy multiple bytes in memory, depending on content and programming-language implementation; when encoded, though, their format is determined by the Unicode encoding applied.
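As a rough illustration of how these wider encodings size their output, here is a sketch for Python 3.X. It uses the "-le" (little-endian) codec names only to keep byte-order markers out of the counts; the two-character sample string is an arbitrary choice.

```python
text = 'A\U0001F600'      # one ASCII character, one code point above 0xFFFF

sizes = {}
for name in ('utf-8', 'utf-16-le', 'utf-32-le'):
    # Encode each character separately and record its encoded length.
    sizes[name] = [len(char.encode(name)) for char in text]
    print(name, sizes[name])

# Whatever the encoded size, decoding yields the same code points.
round_trips = all(text.encode(name).decode(name) == text for name in sizes)
```

UTF-32 spends 4 bytes on both characters, UTF-16 spends 2 on the first and 4 on the second, and UTF-8 spends 1 and 4.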

To Python programmers, an encoding is specified as a string containing the encoding's name. Python comes with roughly 100 different encodings; see the Python Library Reference for a list. Importing module encodings and asking for help(encodings) shows you many as well; some are implemented in Python, and some in C. Some encodings have multiple names too; for example, "latin-1", "iso_8859_1" and "8859" are all synonyms for the same encoding, Latin-1. We'll revisit encodings later in this section, when we study Unicode coding techniques.
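If you want to confirm that several names refer to one encoding, the standard library's codecs.lookup() call can help. This is only a sketch for Python 3.X; the canonical name it reports is an implementation detail that may differ between Python releases, so the code compares names rather than assuming a particular one.

```python
import codecs

aliases = ['latin-1', 'iso_8859_1', '8859']

# codecs.lookup() returns a CodecInfo object; its name attribute is the
# canonical name Python uses internally for that encoding.
canonical = {alias: codecs.lookup(alias).name for alias in aliases}
same_codec = len(set(canonical.values())) == 1

# Each alias encodes the same character to the same single byte.
encoded = {alias: '\u00c4'.encode(alias) for alias in aliases}
```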

For much more on the Unicode story, see the Python standard manual set. It includes a "Unicode HOWTO" in its "Python HOWTOs" section which provides additional background which we will skip here in the interest of space.

(Update: some encodings also require or allow markers at the start of encoded text, known as Unicode BOMs. These markers can designate byte order and encoding type, and may be present whether the encoded text is stored in files or memory. Though not covered in this early-draft article, there is a brief look at this topic on this site here, and more complete surveys in later books. For more on why using the correct encoding matters in general, read the Latin-1/CP-1252 saga here.)

Python's String Types

At a more concrete level, the Python language provides multiple string data types to represent data in your script: both textual data (the integer "code point" values of decoded Unicode characters in memory), as well as binary data (raw byte values, including text that is in encoded form). These types differ in the two Python lines.

For example, Python 2.X has a general string type for representing both simple 8-bit character text like ASCII and binary data, along with a specific type for representing richer Unicode text that may occupy multiple bytes when encoded: str handles 8-bit text and binary data, and unicode handles wide-character Unicode text.

Python 2.X's two string types are different (unicode allows for the extra range of Unicode characters, and has extra support for encoding and decoding), but their operation sets largely overlap. The str string type in 2.X represents both text that can be represented with 8-bit bytes, as well as binary data.

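By contrast, Python 3.0 comes with 3 string object types: str for representing Unicode text (both 8-bit and wider), bytes for representing binary data, and bytearray, a mutable flavor of the bytes type.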

All 3 types support similar operation sets, but have different roles. The main goal behind this change was to merge the normal and Unicode string types of 2.X into a single string type that supports both normal and Unicode text. Developers wanted to remove the 2.X string dichotomy, and make Unicode processing more natural.

To achieve this, the 3.0 str type is defined as an immutable sequence of characters (not necessarily bytes), which may contain both simple text such as ASCII whose encoding yields one byte per character, and richer Unicode text whose encoding may yield multiple bytes per character. Strings are encoded per the platform default, but explicit encoding names may be provided to translate str objects to/from different schemes, both in memory, and when transferring to and from files.

While 3.0's new str type does achieve 2.X str/unicode merging, many programs still need to process raw binary data that is not encoded per any text format—as well as the bytes used to store text when it is encoded. Image files, and packed data you might process with Python's struct module fall into this category. To support this, a new type, bytes, also was introduced to support processing of truly binary data.

In 2.X, the general str type filled this binary data role, because strings were just sequences of bytes (the separate unicode type handled richer text). In 3.0, the bytes type is defined as an immutable sequence of 8-bit integers representing byte values, and supports almost all the same operations that the str type does; this includes string methods, sequence operations, and even re module pattern matching, but not formatting (till later in 3.X's evolution: see ahead).

A bytes object really is a sequence of small integers, each of which is in the range 0..255; indexing a bytes returns an int, slicing one returns another bytes, and running list() on one returns a list of integers, not characters. However, when processed with operations that assume characters (e.g., the isalpha() method), the contents of bytes objects are assumed to be ASCII-encoded bytes. Further, bytes are printed as ASCII character strings instead of integers whenever possible, for convenience.

While it was at it, Python also sprouted bytearray in 3.0, a variant of bytes, which is mutable, and so supports in-place changes. The bytearray type supports the usual string operations that str and bytes do, but also has many of the same in-place change operations as lists (e.g., append() and extend(), and assignment to indexes). Assuming your strings can be treated as raw bytes, bytearray finally adds direct in-place mutability for string data—something not possible in 2.X, or with 3.0's str or bytes.
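To make the bytes/bytearray contrast concrete, here is a brief sketch for Python 3.X. The variable names are illustrative only.

```python
data = bytearray(b'spam')

data[0] = ord('x')        # index assignment takes a small int, not a str
data.append(ord('!'))     # list-like in-place growth, one int at a time
data.extend(b'??')        # or many bytes at once
snapshot = bytes(data)    # freeze a copy as an immutable bytes

shouted = data.upper()    # the usual string methods still apply

# The same index assignment fails on an immutable bytes object.
frozen = b'spam'
try:
    frozen[0] = ord('x')
    bytes_are_mutable = True
except TypeError:
    bytes_are_mutable = False
```

Note that assigning a one-character str such as 'x' to an index would fail here: a bytearray holds integers, so ord() is used to get the byte value.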

Text and Binary Files

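File I/O has also been revamped in 3.0 to reflect the str/bytes distinction. Really, text is just decoded integer character codes when it is in memory; it is when it is transferred to and from external interfaces like files that Unicode encodings come into play. By contrast, truly binary data may have nothing at all to do with encodings (or text at all). Because of this, Python now makes a sharp platform-independent distinction between text files and binary files. When a file is opened in text mode, reading its data automatically decodes its content (per a platform default or a provided encoding name) and returns it as a str; writing takes a str and automatically encodes it before transferring it to the file. When a file is opened in binary mode, reading its data does not decode it in any way, but simply returns its content raw and unchanged as a bytes object; writing similarly takes a bytes object and transfers it to the file unchanged.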

Because str and bytes are sharply differentiated by the language, the net effect is that you must decide whether your data is text or binary in nature, and use str or bytes objects to represent its content in your script, respectively. Ultimately, the mode in which you open a file will dictate which type of object your script will use to represent its content.

Notice that the mode string argument to open() (its second argument) becomes fairly crucial in Python 3.0: its content not only specifies a file processing mode, but also implies a Python object type. By adding a "b" (lower-case only) to the mode string, you specify binary mode files, and will receive, or must provide, a bytes object to represent its content when reading or writing. Without the "b", your file is processed in text mode, and you'll use str objects to represent its content. For example, modes "rb", "wb", and "rb+" imply bytes; "r", "w+", and "rt" (the default) imply str.
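The effect of the mode string can be seen by reading one file both ways. The following is a sketch for Python 3.X only; the scratch file name demo.txt is hypothetical, and an explicit encoding is passed so that the result does not depend on your platform's default.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo.txt')      # hypothetical scratch file

    # Text mode: pass a str, which is encoded on the way out.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('A\u00c4B')

    # Binary mode: get back the raw encoded bytes, undecoded.
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode: the same bytes are decoded back to a str.
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()

    # Binary-mode files reject str objects outright.
    try:
        with open(path, 'wb') as f:
            f.write('not bytes')
        binary_accepts_str = True
    except TypeError:
        binary_accepts_str = False
```

Here raw is a 4-byte bytes object, while text is a 3-character str: the same file content, seen through two different modes.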

Python 3.0 Strings in Action

Let's step through a few examples that demonstrate how the 3.0 string types are used. Note that the following was run with and applies to 3.0 only. Although there is no bytes type in Python 2.6 (it has just the general str), some cross-version compatibility is still possible: in 2.6, the call bytes(X) is present as a synonym for str(X), and the new literal form b'...' is taken to be the same as the normal string literal '...'. You may still run into version skew in some cases, though; the 2.6 bytes() call, for instance, does not allow the second argument (encoding name) required by 3.0's bytes().

Literals and Basic Properties

Python 3.0 string objects originate when you call a function such as str() or bytes(); process a file created by calling open() (described in the next section); or code literal syntax in your script. For the latter, a new literal form, b'xxx' (and equivalently, B'xxx') is used to create bytes objects in 3.0, and bytearray objects may be created by calling the bytearray() function, with a variety of possible arguments.

More formally, in 3.0 all the current string literal forms—'xxx', "xxx", and triple-quoted blocks—generate a str; adding a "b" or "B" just before them creates a bytes instead. This new b'...' bytes literal is similar in spirit to the r'...' raw string, which suppresses backslash escapes. Consider the following (all examples in this section are run in 3.0, unless otherwise stated):

C:\misc>c:\python30\python

>>> B = b'spam'                # make a bytes object (8-bit bytes)
>>> S = 'eggs'                 # make a str object (Unicode characters, 8-bit or wider)

>>> type(B), type(S)
(<class 'bytes'>, <class 'str'>)

>>> B                          # prints as a character string, really a sequence of ints
b'spam'
>>> S
'eggs'

>>> B[0], S[0]                 # indexing returns an int for bytes, str for str
(115, 'e')

>>> B[1:], S[1:]               # slicing makes another bytes or str
(b'pam', 'ggs')

>>> list(B), list(S)
([115, 112, 97, 109], ['e', 'g', 'g', 's'])     # bytes is really ints

>>> B[0] = 'x'                                  # both are immutable
TypeError: 'bytes' object does not support item assignment

>>> S[0] = 'x'
TypeError: 'str' object does not support item assignment

>>> B = B"""                   # bytes prefix works on single, double, triple quotes
... xxxx
... yyyy
... """
>>> B
b'\nxxxx\nyyyy\n'

For forward compatibility, in Python 2.6 the 3.0 b'xxx' literal is present but is the same as 'xxx' and makes a 2.X str, and bytes() is just a synonym for str(); as shown above, in 3.0, both these address the distinct bytes type. Also note that the u'xxx' and U'xxx' unicode string literal forms in 2.6 discussed ahead are gone in 3.0; use 'xxx' in 3.0 instead, since all strings are Unicode, even if they contain only ASCII characters.

(Update: Python 3.X later reinstated 2.X's unicode string literals to ease migration of 2.X code: a 2.X u'xxx' unicode literal in Python 3.3 and later is now just a synonym for a 3.X 'xxx' str literal. This makes sense given 3.X's all-Unicode str type, and is the backward-compatible equivalent of 2.X's forward-compatible b'xxx' support. It's tempting to read into this that 2.X's str and unicode simply become 3.X's bytes and str, but the division of these types' roles is much sharper in 3.X.)
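For readers on Python 3.3 or later, the reinstated literal form is simple to verify. This tiny sketch assumes such a version; it fails with a syntax error in 3.0 through 3.2.

```python
legacy = u'spam'      # 2.X-style unicode literal, accepted again in 3.3+
modern = 'spam'       # normal 3.X str literal

# Both literal forms make the very same type of object.
same_type = type(legacy) is str and type(modern) is str
same_value = legacy == modern
```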

Conversions

Although Python 2.X allows its str and unicode objects to be freely mixed (if the str contains only 7-bit ASCII text, at least), 3.0 draws a much sharper distinction—str and bytes never mix automatically in expressions, and as a rule are not converted to one another automatically when passed to functions. A function that expects an argument to be a str object won't generally accept a bytes, and vice versa.

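Because of this, Python 3.0 basically requires that you commit to one type or the other, or perform manual, explicit conversions: str.encode() and bytes(S, encoding) translate a string to its raw bytes form and create a bytes from a str in the process, while bytes.decode() and str(B, encoding) translate raw bytes into string form and create a str from a bytes in the process.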

Both these encode() and decode() methods and the file open() calls we'll explore ahead use either an explicitly passed-in encoding name or a default. In Python 3.X, the methods' default is always UTF-8, but open() uses a value in the locale module that may vary per platform. In 2.X both defaults are usually ASCII, as exposed in the sys module (which allows changes at start-up). For example, in 3.0:

>>> S = 'eggs'
>>> S.encode()                     # str to bytes: encode text into raw bytes
b'eggs'

>>> bytes(S, encoding='ascii')     # str to bytes, alternative
b'eggs'

>>> B = b'spam'
>>> B.decode()                     # bytes to str: decode raw bytes into text
'spam'

>>> str(B, encoding='ascii')       # bytes to str, alternative
'spam'

Two cautions here. First of all, your platform's various default encodings are available in the sys and locale modules, but the encoding argument to bytes() is not optional, even though it is in str.encode() (and bytes.decode()). Second, although str() does not require the encoding argument like bytes() does, leaving it off in str() calls does not mean it defaults—instead, a str() without an encoding returns the bytes object's print string, not its str converted form (this is usually not what you'll want!). Assuming B and S are still as in the prior listing:

>>> import sys, locale
>>> sys.platform                         # underlying platform
'win32'
>>> locale.getpreferredencoding(False)   # Windows open() default: a Latin-1 superset
'cp1252'
>>> sys.getdefaultencoding()             # but str() does not use defaults
'utf-8'

>>> bytes(S)
TypeError: string argument without an encoding

>>> str(B)                               # str without encoding
"b'spam'"                                # print string, not conversion!
>>> len(str(B))
7

>>> len(str(B, encoding='ascii'))        # use encoding to convert to str
4
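The rules of this section can be condensed into one short script. This is a sketch for Python 3.X, with illustrative variable names; it shows that mixing fails, that explicit conversion in either direction works, and that str() without an encoding is not a conversion.

```python
text, raw = 'spam', b'eggs'

# str and bytes never mix automatically in expressions.
try:
    text + raw
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Commit to one type by converting explicitly.
as_text = text + raw.decode('ascii')      # all str
as_bytes = text.encode('ascii') + raw     # all bytes

# str() without an encoding gives the print string, not a conversion.
print_string = str(raw)
converted = str(raw, encoding='ascii')
```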

Coding Unicode Strings in Python 3.0

Encoding and decoding get more meaningful when you start dealing with actual non-ASCII Unicode text. To code Unicode characters in your strings that cannot be typed on your keyboard, Python string literals support \xNN hex byte value escapes, as well as \uNNNN and \UNNNNNNNN Unicode escapes; the first Unicode form gives 4 hex digits to denote a 2-byte (16-bit) character code, and the second gives 8 hex digits for a 4-byte (32-bit) code. Importantly, in str objects all three escapes are used to give a Unicode character's code point value, not its encoded bytes; use bytes objects if you need to represent a character's encoded bytes instead.

Let's see how this all translates to code. Normal 7-bit ASCII text is represented one character per byte under each of the encoding schemes described near the start of this article:

>>> ord('X')                # 'X' has code point value 88
88
>>> chr(88)                 # 88 stands for character 'X'
'X'

>>> S = 'XYZ'
>>> S
'XYZ'
>>> len(S)                  # 3 characters long
3

>>> S.encode('ascii')       # values 0..127 in 1 byte each
b'XYZ'
>>> S.encode('latin-1')     # values 0..255 in 1 byte each
b'XYZ'
>>> S.encode('utf-8')       # values 0..127 in 1 byte, 128..2047 in 2, others in 3 or 4
b'XYZ'

To code non-ASCII characters, use Unicode escapes in your strings; hexadecimal values 0xC4 and 0xE8, for instance, are codes (code points) for two special characters outside the 7-bit range of ASCII, but we can embed them in str objects, because str supports Unicode in 3.X today:

>>> chr(0xc4)               # 0xC4 and 0xE8 are accented characters outside ASCII's range
'Ä'
>>> chr(0xe8)
'è'

>>> S = '\u00c4\u00e8'      # 16-bit Unicode escapes
>>> S
'Äè'
>>> len(S)                  # 2 characters long (not number of bytes!)
2

Now, if we try to encode a non-ASCII string to raw bytes as ASCII, we'll get an error. Encoding as Latin-1 works, though, and allocates one byte per character; encoding as UTF-8 allocates 2 bytes per character instead. If you write this string to a file, the raw bytes shown are what is actually stored in the file for the encoding types given:

>>> S = '\u00c4\u00e8' 
>>> S.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

>>> S.encode('latin-1')              # one byte per character
b'\xc4\xe8'

>>> S.encode('utf-8')                # two bytes per character
b'\xc3\x84\xc3\xa8'

>>> len(S.encode('latin-1'))         # 2 bytes in latin-1, 4 in utf-8
2
>>> len(S.encode('utf-8'))
4

Note that you can also go the other way, from raw bytes back to a Unicode string. You could read raw bytes from a file and decode manually this way, but the encoding name you pass to the open() call causes this decoding to be done for you automatically (and avoids issues that may arise from reading partial character sequences when reading by blocks of bytes):

>>> B = b'\xc4\xe8'
>>> B
b'\xc4\xe8'
>>> len(B)                             # 2 raw bytes, 2 characters
2
>>> B.decode('latin-1')                # decode to latin-1 text
'Äè'

>>> B = b'\xc3\x84\xc3\xa8'
>>> len(B)                             # 4 raw bytes
4
>>> B.decode('utf-8')
'Äè'
>>> len(B.decode('utf-8'))             # 2 Unicode characters
2
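The partial-sequence issue mentioned above is worth seeing once. In this sketch for Python 3.X, the encoded text is cut in the middle of a two-byte character, as might happen when reading a file in fixed-size blocks. The incremental decoder from the standard codecs module is one way to handle this manually; it is shown only as an illustration of what text-mode files take care of for you, not as a description of how open() is implemented.

```python
import codecs

encoded = 'A\u00c4B\u00e8C'.encode('utf-8')     # b'A\xc3\x84B\xc3\xa8C'
blocks = [encoded[:2], encoded[2:]]             # cut inside the 2-byte character

# Decoding a block on its own fails: its last character is incomplete.
try:
    blocks[0].decode('utf-8')
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# An incremental decoder holds back the dangling byte until the rest arrives.
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(block) for block in blocks]
pieces.append(decoder.decode(b'', final=True))  # flush; raises if bytes remain
text = ''.join(pieces)
```

After the run, text holds the original five characters, even though neither block could be decoded reliably by itself.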

When needed, you can specify both 16- and 32-bit Unicode code-point values for characters in your str strings; \u... with 4 hex digits for the former, and \U.... with 8 hex digits for the latter. As the last example in the following shows, you can also build such strings up piecemeal using chr(), but it might become tedious for large strings:

>>> S = 'A\u00c4B\U000000e8C'
>>> S                                  # A, B, C, and 2 non-ASCII characters
'AÄBèC'
>>> len(S)                             # 5 characters long
5

>>> S.encode('latin-1')
b'A\xc4B\xe8C'
>>> len(S.encode('latin-1'))           # 5 bytes in latin-1
5

>>> S.encode('utf-8')
b'A\xc3\x84B\xc3\xa8C'
>>> len(S.encode('utf-8'))             # 7 bytes in utf-8
7

>>> S.encode('cp500')                  # two other western european encodings
b'\xc1c\xc2T\xc3'
>>> S.encode('cp850')                  # 5 bytes each
b'A\x8eB\x8aC'

>>> S = 'spam'                         # ascii text is the same in most
>>> S.encode('latin-1')
b'spam'
>>> S.encode('utf-8')
b'spam'
>>> S.encode('cp500')                  # cp500 is ibm ebcdic
b'\xa2\x97\x81\x94'
>>> S.encode('cp850')
b'spam'

>>> S = 'A' + chr(0xC4) + 'B' + chr(0xE8) + 'C'
>>> S
'AÄBèC'

Note that Python 3.0 allows special characters' code points to be coded with both hex and Unicode escapes in str string literals, but allows only hex escapes in bytes literals; in fact, Unicode escape sequences are taken verbatim in bytes, and not as escapes. This makes sense if you remember that bytes objects hold characters' encoded bytes—not their decoded code points. This is true even though code-point and encoded-byte values happen to be the same for some characters in some encodings (confusingly!). Because bytes are not code points, they also must be decoded to str to print their non-ASCII characters properly:

>>> S = 'A\xC4B\xE8C'            # str recognizes hex and Unicode escapes
>>> S
'AÄBèC'

>>> S = 'A\u00C4B\U000000E8C'    # 4- and 8-digit Unicode escapes (str only)
>>> S
'AÄBèC'

>>> B = b'A\xC4B\xE8C'           # bytes recognizes hex but not Unicode
>>> B
b'A\xc4B\xe8C'

>>> B = b'A\u00C4B\U000000E8C'   # Unicode escape sequences taken literally
>>> B                            # bytes are encoded bytes, not code points
b'A\\u00C4B\\U000000E8C'

>>> B = b'A\xC4B\xE8C'           # use hex escapes for bytes
>>> B                            # prints non-ASCII as hex 
b'A\xc4B\xe8C'
>>> print(B)
b'A\xc4B\xe8C'
>>> B.decode('latin-1')          # decode as latin-1 to interpret as text 
'AÄBèC'

Finally, notice that bytes literals assume that embedded characters are ASCII, and require escapes for byte values > 127; str literals allow embedding any character in the source-code character set (which defaults to UTF-8 in 3.X, unless encoding declarations are given—discussed ahead):

>>> S = 'AÄBèC'                  # chars from UTF-8 if no encoding declaration 
>>> S
'AÄBèC'

>>> B = b'AÄBèC'
SyntaxError: bytes can only contain ASCII literal characters.

>>> B = b'A\xC4B\xE8C'           # chars must be ASCII, or escapes
>>> B
b'A\xc4B\xe8C'
>>> B.decode('latin-1')
'AÄBèC'

>>> S.encode()                   # source code encoded per UTF-8 by default 
b'A\xc3\x84B\xc3\xa8C'           # uses system default to encode, unless passed
>>> S.encode('utf-8')
b'A\xc3\x84B\xc3\xa8C'
>>> B.decode()                   # raw bytes do not correspond to utf-8
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: ...

>>> S = 'AÄBèC'
>>> S
'AÄBèC'
>>> S.encode()                   # default utf-8 encoding
b'A\xc3\x84B\xc3\xa8C'
>>>
>>> T = S.encode('cp500')        # convert to EBCDIC
>>> T
b'\xc1c\xc2T\xc3'
>>>
>>> U = T.decode('cp500')        # convert back to Unicode
>>> U
'AÄBèC'
>>>
>>> U.encode()                   # back to UTF-8 bytes, by default
b'A\xc3\x84B\xc3\xa8C'

Coding Unicode Strings in Python 2.6

Now that you've seen the basics of Unicode strings in 3.0, it's also important to know that you can do much the same in 2.6, though the tools differ. Unicode is already available in Python 2.6, but it is a distinct data type from str, and 2.6 allows free mixing of normal and unicode strings when compatible. In fact, you can essentially pretend 2.6's str is 3.0's bytes when it comes to decoding into a unicode string, as long as it's in proper form.

Here's 2.6 string support in action (all other sections in this topic but this one are run under 3.0):

>>> import sys
>>> sys.version
'2.6 (r26:66721, Oct  2 2008, 11:35:03) [MSC v.1500 32 bit (Intel)]'

>>> S = 'A\xC4B\xE8C'          # string of 8-bit bytes
>>> print S                    # some are non-ascii
AÄBèC

>>> S.decode('latin-1')        # decode bytes per latin-1 to unicode
u'A\xc4B\xe8C'

>>> S.decode('utf-8')          # not formatted as utf-8
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid data

>>> S.decode('ascii')          # outside ascii range
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal not in range(128)

To store arbitrarily encoded Unicode text, make a Unicode object with the u'xxx' literal form; this is no longer available in 3.0, since all strings support Unicode there (update: as noted earlier, Python 3.X later reinstated 2.X's u'xxx' unicode string literals to ease migration of 2.X code):

>>> U = u'A\xC4B\xE8C'         # make unicode string, hex escapes
>>> U
u'A\xc4B\xe8C'
>>> print U
AÄBèC

Once created, you can convert Unicode text to different encodings; this is similar to encoding str objects into bytes objects in 3.0:

>>> U.encode('latin-1')        # encode per latin-1: 8-bit bytes
'A\xc4B\xe8C'
>>> U.encode('utf-8')          # encode per utf-8: multi-byte
'A\xc3\x84B\xc3\xa8C'

Non-ASCII characters can be coded with hex or Unicode escapes in string literals just as in 3.0, but just as for bytes in 3.0, the \u... and \U... escapes are recognized only for unicode strings in 2.6, not 8-bit str strings:

>>> U = u'A\xC4B\xE8C'           # hex escapes for non-ascii
>>> U
u'A\xc4B\xe8C'
>>> print U
AÄBèC

>>> U = u'A\u00C4B\U000000E8C'   # unicode escapes for non-ASCII
>>> U                            # \u = 16 bits, \U = 32 bits
u'A\xc4B\xe8C'
>>> print U
AÄBèC

>>> S = 'A\xC4B\xE8C'            # hex escapes work
>>> S
'A\xc4B\xe8C'
>>> print S                      # but some print oddly, unless decoded
A-BFC
>>> print S.decode('latin-1')
AÄBèC

>>> S = 'A\u00C4B\U000000E8C'    # not unicode escapes: taken literally!
>>> S
'A\\u00C4B\\U000000E8C'
>>> print S
A\u00C4B\U000000E8C
>>> len(S)
19

Like 3.0's str and bytes, 2.6's unicode and str share nearly identical operation sets, so you can often treat unicode as though it were str unless you need to convert to other encodings. One of the primary differences between 2.6 and 3.0 is that in 2.6, unicode and non-unicode str objects can be freely mixed in expressions, as long as the non-unicode object contains only 7-bit ASCII characters; the non-unicode str is automatically converted up to unicode in the process (in 3.0, str and bytes never mix automatically, and require manual conversions):

>>> u'ab' + 'cd'                # can mix if compatible (if str is all ASCII)
u'abcd'

>>> S = 'A\xC4B\xE8C'           # can't mix if incompatible
>>> U = u'A\xC4B\xE8C'
>>> S + U
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal not in range(128)

>>> S.decode('latin-1') + U     # manual conversion still required 
u'A\xc4B\xe8CA\xc4B\xe8C'

>>> print S.decode('latin-1') + U
AÄBèCAÄBèC

Finally, note that 2.6's open() call supports only files of 8-bit bytes, and returns their content as str strings; it's up to you to interpret that content as text or binary data. To read and write Unicode files and encode or decode their content in the process, see 2.6's library manual for information on the codecs.open() call. This call provides much the same functionality as 3.0's open(), and uses 2.6 unicode objects to represent file content—reading a file translates encoded bytes into decoded Unicode characters, and writing translates Unicode strings to the desired encoding specified when opened.
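As a rough sketch of encoding-aware file I/O of this kind, the following uses io.open() rather than codecs.open(). That is a substitution on my part: io.open() is also present in 2.6 and is the same function as 3.X's built-in open(), so the calls read the same in both lines. The surrounding scaffolding (the temporary folder and the file name unidata.txt) is hypothetical and written for 3.X.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'unidata.txt')   # hypothetical file name

    # Writing translates Unicode text to the encoding named at open time.
    with io.open(path, 'w', encoding='latin-1') as f:
        f.write(u'A\xc4B\xe8C')

    # Binary mode shows what was really stored: one byte per character.
    with io.open(path, 'rb') as f:
        raw = f.read()

    # Reading with the same encoding name decodes back to Unicode text.
    with io.open(path, 'r', encoding='latin-1') as f:
        text = f.read()
```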

Source-File Character Set Encoding Declarations

Unicode escape codes are fine for the occasional Unicode character in string literals, but can become tedious if you need to embed non-ASCII text in your strings frequently. For string literals and other text that you code within your script files, Python uses the UTF-8 encoding in 3.X (and ASCII in 2.X) by default to read your code, but allows you to change this to use arbitrary encodings and the character sets they support.

To make this work, simply include a comment that names your desired encoding. This special encoding-declaration comment must appear as either the first or second line in your script, and is usually of the following form (see Python's manuals for other formats it accepts):

# -*- coding: latin-1 -*-

When present, Python will recognize strings represented natively in the given encoding. That way, you can edit your script file in a text editor that accepts and displays accented and other non-ASCII characters correctly, and Python will decode them in your string literals correctly. For example, notice how the comment at the top of the following file, "text.py," allows Latin-1 characters to be embedded in strings when the file is saved with this encoding:

# -*- coding: latin-1 -*-

# any of the following string literal forms work in latin-1;
# changing the encoding above to either ascii or utf-8 fails,
# because the 0xc4 and 0xe8 in myStr1 are not valid in either

myStr1 = 'aÄBèC'

myStr2 = 'A\u00c4B\U000000e8C'

myStr3 = 'A' + chr(0xC4) + 'B' + chr(0xE8) + 'C'

import sys
print('Default encoding:', sys.getdefaultencoding())

for aStr in myStr1, myStr2, myStr3:
    print('{0}, strlen={1}, '.format(aStr, len(aStr)), end='')

    bytes1 = aStr.encode()                     # per default utf-8: 2 bytes for non-ASCII
    bytes2 = aStr.encode('latin-1')            # one byte per char 
   #bytes3 = aStr.encode('ascii')              # ascii fails: outside 0..127 range

    print('byteslen1={0}, byteslen2={1}'.format(len(bytes1), len(bytes2)))


C:\misc>c:\python30\python text.py
Default encoding: utf-8
aÄBèC, strlen=5, byteslen1=7, byteslen2=5
AÄBèC, strlen=5, byteslen1=7, byteslen2=5
AÄBèC, strlen=5, byteslen1=7, byteslen2=5

Since most programmers are likely to fall back on the default source encodings (especially the general UTF-8 in Python 3.X), I'll defer to Python's standard manual set for more details on this option, as well as more advanced Unicode support such as properties and character-name escapes in strings that we'll skip here.
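If you'd like to watch a declaration at work without saving a file in a special editor, the following is a minimal sketch that builds a tiny module's Latin-1 bytes in memory and hands them to the built-in compile(), which honors the declaration comment when given bytes. The names source_text, namespace, and undeclared are illustrative only, not part of the example file above:

```python
# Build a tiny module as raw Latin-1 bytes, then compile and run it in memory.
# All names in this sketch are illustrative only.
source_text = "# -*- coding: latin-1 -*-\nvalue = 'A\xc4B\xe8C'\n"
source_bytes = source_text.encode('latin-1')      # non-ASCII chars: one byte each

namespace = {}
exec(compile(source_bytes, '<latin-1 demo>', 'exec'), namespace)
decoded = namespace['value']
print(decoded, len(decoded))                      # a 5-character str

# Drop the declaration line: the default UTF-8 cannot decode the same bytes
undeclared = source_bytes.split(b'\n', 1)[1]
try:
    compile(undeclared, '<no declaration>', 'exec')
    failed = False
except (SyntaxError, UnicodeDecodeError):
    failed = True
print('fails without declaration:', failed)
```

The second compile() fails because a 0xc4 byte followed by an ASCII letter is not a valid UTF-8 sequence, which is the same reason the comments in text.py warn against changing its declaration.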

Processing 3.0 Bytes Objects

Let's dig a bit deeper into the operation sets provided by the new bytes type in 3.0. As mentioned, the 3.0 bytes type supports sequence operations and most of the same methods available on str (and present in 2.X's str type). However, bytes does not support the format() method or the % formatting expression (until 3.5, per the update ahead). Moreover, you cannot mix and match bytes and str without explicit conversions—you generally will use all str type objects and text files for text data, and all bytes type objects and binary files for binary data.

Method Calls

If you really want to see what attributes str has that bytes doesn't, you can always check their dir() results; this can also tell you something about the expression operators they support (e.g., __mod__ and __rmod__ implement the % operator):

C:\misc>c:\python30\python
Python 3.0 (r30:67507, Dec  3 2008, 20:14:27) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.


# Attributes unique to str

>>> set(dir('abc')) - set(dir(b'abc'))
{'isprintable', 'format', '__mod__', 'encode', 'isidentifier', '_formatter_field_name_split', 
'isnumeric', '__rmod__', 'isdecimal', '_formatter_parser', 'maketrans'}


# Attributes unique to bytes

>>> set(dir(b'abc')) - set(dir('abc'))
{'decode', 'fromhex'}

As you can see, str and bytes have almost identical functionality; their unique attributes are generally methods that don't apply to the other. For instance, decode() translates a raw bytes into its str representation, and encode() translates a str into its raw bytes representation. Most methods are shared between str and bytes, though. Moreover, bytes are immutable just like str in both 2.6 and 3.0 (error messages here have been shortened for brevity):

>>> B = b'spam'                    # b'...' bytes literal
>>> B.find(b'pa')
1

>>> B.replace(b'pa', b'XY')
b'sXYm'

>>> B
b'spam'

>>> B[0] = 'x'
TypeError: 'bytes' object does not support item assignment

One notable exception to this rule: string formatting works only on str in 3.0, not on bytes. As told here, 3.0 also convolutes the string formatting story in general by adding redundant functionality, but that story is beyond the scope of this page:

>>> b'%s' % 99
TypeError: unsupported operand type(s) for %: 'bytes' and 'int'

>>> '%s' % 99
'99'

>>> b'{0}'.format(99)
AttributeError: 'bytes' object has no attribute 'format'

>>> '{0}'.format(99)
'99'

(Update: Python 3.5 eventually extended % formatting (only) to bytes objects per this page, for better or worse. The extension has a heavy ASCII bias, which clashes badly with the generalized Unicode text model of 3.X, but may be useful in limited contexts.)
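As a rough sketch of what the 3.5 extension allows, and assuming you are running Python 3.5 or later, the following formats bytes and numbers into a bytes result, and shows that a str still must be encoded manually first. The header name and its content are made up for illustration:

```python
# Requires Python 3.5 or later; names and values here are illustrative only
header = b'%s=%d' % (b'size', 1024)       # bytes and numbers are accepted
print(header)

try:
    b'%s' % 'spam'                        # a str is still rejected
    str_rejected = False
except TypeError as exc:
    str_rejected = True
    print('TypeError:', exc)

encoded = b'%s' % 'spam'.encode('ascii')  # encode manually, then format
print(encoded)
```

The format() method remains unavailable on bytes, so the % expression is the only formatting tool that applies here.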

Sequence Operations

Besides method calls, all the usual generic sequence operations you know (and possibly love) from Python 2.X strings and lists work as expected on both str and bytes in 3.0; this includes indexing, slicing, concatenation, and so on. Notice in the following that indexing bytes returns an integer giving the byte's binary value; bytes really is a sequence of 8-bit integers, but it prints as a string of ASCII-coded characters (plus non-ASCII escapes) when displayed as a whole. To check a given byte's text interpretation, use chr() to convert it back to its character:

>>> B = b'spam'
>>> B
b'spam'

>>> B[0]
115
>>> B[-1]
109

>>> chr(B[0])
's'

>>> B[1:], B[:-1]
(b'pam', b'spa')
 
>>> len(B)
4

>>> B + b'lmn'
b'spamlmn'
>>> B * 4
b'spamspamspamspam'
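One consequence of the integer results above is easy to trip over, so here is a short sketch of it: indexing a bytes gives an int, but slicing gives a bytes, even for a one-item slice. The variable names are arbitrary:

```python
B = b'spam'

first_int = B[0]               # indexing: an int (115)
first_bytes = B[0:1]           # slicing: a one-byte bytes object
rebuilt = bytes([B[0]])        # wrap the int in a list to get a bytes back

print(first_int, first_bytes, rebuilt)

for byte in B:                 # iteration yields ints too
    print(byte, chr(byte))
```

Use a one-item slice when you need a bytes to pass on to methods such as replace() or find(), which won't accept a str.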

Other Ways to Make Bytes

So far, we've been making bytes objects with the b'...' literal syntax; they can also be created by calling the bytes() constructor with a str and an encoding name, calling bytes with an iterable of integers representing byte values, or encoding a str object per the default (or passed-in) encoding. Encoding takes a str and returns the raw binary bytes value of the string according to its encoding specification; decoding takes a raw bytes sequence and decodes it to its string representation, a series of Unicode characters:

>>> B = b'abc'
>>> B
b'abc'

>>> B = bytes('abc', 'ascii')
>>> B
b'abc'

>>> ord('a')
97
>>> B = bytes([97, 98, 99])
>>> B
b'abc'

>>> B = 'spam'.encode()       # or bytes()
>>> B
b'spam'
>>>
>>> S = B.decode()            # or str()
>>> S
'spam'

From a larger perspective, the last two of these operations are really tools for converting between str and bytes, introduced earlier and expanded upon in the next section.
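A few related constructor behaviors are worth a quick sketch, since they are easy to confuse with the forms above. In particular, passing an integer to bytes() makes that many zero bytes rather than a single byte, and calling str() on a bytes without an encoding returns its print string instead of decoding it. The variable names here are arbitrary:

```python
zeros = bytes(4)                          # an int N makes N zero bytes
from_hex = bytes.fromhex('73 70 61 6d')   # pairs of hex digits; spaces allowed
text = str(b'spam', 'ascii')              # with an encoding: same as decode()
shown = str(b'spam')                      # without one: the print string

print(zeros, from_hex, text, shown)
```

Because of the last case, it's usually safer to call decode() or pass an explicit encoding to str() when you really mean to convert bytes to text.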

Mixing String Types

Notice in the replace() call of the earlier method-calls section how we have to pass in two bytes objects; str types won't work there. Although Python 2.X automatically converts str to and from unicode when possible (that is, when the str is only 7-bit ASCII text), Python 3.0 requires specific string types in some contexts, and expects manual conversions if needed:

# Must pass expected types to function and method calls

>>> B = b'spam'

>>> B.replace('pa', 'XY')
TypeError: expected an object with the buffer interface

>>> B.replace(b'pa', b'XY')
b'sXYm'

>>> B = B'spam'
>>> B.replace(bytes('pa'), bytes('xy'))
TypeError: string argument without an encoding

>>> B.replace(bytes('pa', 'ascii'), bytes('xy', 'utf-8'))
b'sxym'


# Must convert manually in mixed-type expressions

>>> b'ab' + 'cd'
TypeError: can't concat bytes to str

>>> b'ab'.decode() + 'cd'                   # bytes to str
'abcd'

>>> b'ab' + 'cd'.encode()                   # str to bytes
b'abcd'

>>> b'ab' + bytes('cd', 'ascii')            # str to bytes
b'abcd'

Although you can create bytes objects yourself to represent packed binary data, they can also be made automatically by reading files opened in binary mode, as we'll see later in this article.
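If your code receives either type and must settle on one, the manual conversions above can be wrapped in small helpers. The following is only a sketch: to_bytes and to_text are hypothetical names, not standard library tools, and the UTF-8 default is an assumption you should replace with whatever encoding your data actually uses:

```python
def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding it first if it is a str (hypothetical helper)."""
    if isinstance(value, str):
        return value.encode(encoding)
    return bytes(value)                       # bytes or bytearray

def to_text(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes or bytearray (hypothetical helper)."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value

joined_bytes = b'ab' + to_bytes('cd')         # str to bytes, then concatenate
joined_text = to_text(b'ab') + 'cd'           # bytes to str, then concatenate
print(joined_bytes, joined_text)
```

Helpers like these are best used at a program's boundaries, so that its core logic deals with just one of the two types.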

Using 3.0 bytearray Objects

So far, we've focused on str and bytes, since they subsume 2.6's unicode and str. Python 3.0 has a third string type, though—bytearray is essentially a mutable variant of bytes, and thus a mutable sequence of integers in the range 0..255. As such, it supports the same string methods and sequence operations as bytes, as well as the mutable in-place-change operations found on lists:

# Creation: a mutable sequence of small (0..255) ints

>>> B = b'spam'               # str 'spam' works in 2.X only
>>> C = bytearray(B)
>>> C
bytearray(b'spam')
>>> C[0], chr(C[0])           # ASCII integer code for 's'
(115, 's')


# Mutable, but must assign ints, not strings

>>> C[0] = 'x'
TypeError: an integer is required

>>> C[0] = b'x'
TypeError: an integer is required

>>> C[0] = ord('x')
>>> C
bytearray(b'xpam')

>>> C[1] = b'Y'[0]
>>> C
bytearray(b'xYam')


# Methods overlap with both str and bytes, plus list's mutable methods

>>> set(dir(b'abc')) - set(dir(bytearray(b'abc')))
{'__getnewargs__'}

>>> set(dir(bytearray(b'abc'))) - set(dir(b'abc'))
{'insert', '__alloc__', 'reverse', 'extend', '__delitem__', 'pop', '__setitem__'
, '__iadd__', 'remove', 'append', '__imul__'}


# Mutable method calls

>>> C
bytearray(b'xYam')

>>> C.append(b'LMN')
TypeError: an integer is required

>>> C.append(ord('L'))
>>> C
bytearray(b'xYamL')

>>> C.extend(b'MNO')
>>> C
bytearray(b'xYamLMNO')


# Sequence operations and string methods

>>> C + b'!#'
bytearray(b'xYamLMNO!#')

>>> C[0]
120

>>> C[1:]
bytearray(b'YamLMNO')

>>> len(C)
8

>>> C
bytearray(b'xYamLMNO')

>>> C.replace('xY', 'sp')
TypeError: Type str doesn't support the buffer API

>>> C.replace(b'xY', b'sp')
bytearray(b'spamLMNO')

>>> C
bytearray(b'xYamLMNO')

>>> C * 4
bytearray(b'xYamLMNOxYamLMNOxYamLMNOxYamLMNO')

Finally, by way of summary, the following examples demonstrate how bytes and bytearray are sequences of ints, and str is a sequence of characters (i.e., decoded Unicode code points); although all three can contain character values and support many of the same operations, you should use str for textual data, bytes for binary data, and bytearray for binary data you wish to change in place:

>>> B
b'spam'
>>> list(B)
[115, 112, 97, 109]

>>> C
bytearray(b'xYamLMNO')
>>> list(C)
[120, 89, 97, 109, 76, 77, 78, 79]

>>> S = 'spam'
>>> list(S)
['s', 'p', 'a', 'm']
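To round out the in-place story, here is a brief sketch of a few more bytearray changes: slice assignment accepts a bytes (unlike index assignment, which requires an int), and the result can be frozen as a bytes or decoded to a str when the edits are done. The names buf, frozen, and text are arbitrary:

```python
buf = bytearray(b'xYamLMNO')

buf[0:2] = b'sp'              # slice assignment takes bytes, not ints
del buf[4:]                   # in-place deletion
buf += b'!'                   # in-place concatenation

frozen = bytes(buf)           # immutable copy of the current content
text = buf.decode('ascii')    # str, if the content is really text

print(buf, frozen, text)
```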

3.0 File Modes and String Types in Action

As also mentioned above, the mode in which you open a file is crucial: it determines which object type you will use to represent the file's content in your script. Text mode implies str objects, and binary mode implies bytes.

In terms of code, the second positional argument to open() determines whether you want text or binary processing and types, just as it does in 2.X Python—adding a "b" to the string implies binary mode. The default mode is "rt" which is the same as "r", which means text input, just as in 2.X. In 3.0, though, this mode argument to open() also implies an object type for file content representation regardless of the underlying platform—text files return a str for reads and expect one for writes, but binary files return a bytes for reads and expect bytes (or bytearray) for writes.

Text File Basics

To demonstrate, let's begin with basic file I/O. As long as you're processing basic text files (e.g., ASCII) and don't care about circumventing the platform-default encoding of strings, files look and feel much as they do in 2.X (for that matter, so do strings in general). The following, for instance, writes one line of text to a file and reads it back in 3.0, exactly as it would in 2.6 (note that file is no longer a built-in name in 3.0, and it's perfectly okay to use it as a variable here either way):

C:\misc>c:\python30\python

# Basic text files (and strings) work the same as in 2.X

>>> file = open('temp', 'w')
>>> size = file.write('abc\n')       # returns number of characters written
>>> file.close()                     # manual close to flush output buffer

>>> file = open('temp')              # default mode is "r" (== "rt"), which means text input
>>> text = file.read()
>>> text
'abc\n'
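The same write and read can also be coded with the with statement, which closes (and hence flushes) each file automatically on block exit. This is just a sketch of that alternative, run in a scratch directory created by the tempfile module so that it doesn't disturb your own files:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:     # scratch directory, removed on exit
    path = os.path.join(tmpdir, 'temp')

    with open(path, 'w') as out:                  # closed automatically
        count = out.write('abc\n')                # number of characters written

    with open(path) as inp:                       # default mode is text input
        text = inp.read()

print(count, repr(text))
```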

Using Text and Binary Modes

Next we'll write a text file and read it back in both modes in 3.0. Notice that we are required to provide a str for writing, but reading gives us a str or bytes depending on the open mode (I've strung operations together here into one-liners just for brevity):

# Write and read a text file

>>> open('temp', 'w').write('abc\n')       # text mode output, provide a str
4

>>> open('temp', 'r').read()               # text mode input, returns a str
'abc\n'

>>> open('temp', 'rb').read()              # binary mode input, returns a bytes
b'abc\r\n'
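The \r in the binary-mode result above reflects the end-of-line translation that text mode performs on Windows, where this session was run. As a hedged sketch, the following reproduces the effect on any platform by requesting Windows-style line ends explicitly with open()'s newline argument; the scratch directory and variable names are for illustration only:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'temp')

    with open(path, 'w', newline='\r\n') as f:    # write \n as \r\n
        f.write('abc\n')

    with open(path, 'rb') as f:
        raw = f.read()                            # bytes as stored in the file

    with open(path, 'r') as f:
        text = f.read()                           # \r\n translated back to \n

    with open(path, 'r', newline='') as f:
        untranslated = f.read()                   # str, with line ends left alone

print(raw, repr(text), repr(untranslated))
```

Binary mode never performs this translation, which is one reason it's the right choice for data that is not text.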

Now, let's do the same, but with a binary file; we must provide a bytes to write, and still get back a str or bytes depending on the input mode:

# Write and read a binary file

>>> open('temp', 'wb').write(b'abc\n')     # binary mode output, provide a bytes
4

>>> open('temp', 'r').read()               # text mode input, returns a str
'abc\n'

>>> open('temp', 'rb').read()              # binary mode input, returns a bytes
b'abc\n'

Notice that the same holds even if the data we're writing to the binary file is truly binary in nature; in the following, the \x00 is a binary zero byte, and not a printable character (though it passes as a text code point in the default encoding):

# Write and read binary data

>>> open('temp', 'wb').write(b'a\x00c')
3

>>> open('temp', 'r').read()
'a\x00c'

>>> open('temp', 'rb').read()
b'a\x00c'

Binary mode files always return contents as a bytes object, but accept either a bytes or bytearray object for writing. This naturally follows, given that bytearray is mostly just a mutable variant of bytes. In fact, most APIs in Python 3.0 that accept a bytes also allow a bytearray:

# Bytearrays work too

>>> BA = bytearray(b'\x01\x02\x03')
>>>
>>> open('temp', 'wb').write(BA)
3

>>> open('temp', 'r').read()
'\x01\x02\x03'

>>> open('temp', 'rb').read()
b'\x01\x02\x03'

Finally, notice that you can't get away with violating Python's str/bytes distinction when it comes to files; in the following we get errors (shortened here) if we try to write a bytes to a text file, or a str to a binary file. Although it is often possible to convert between the types (as described earlier in this article), you will usually want to stick to str for text data and bytes for binary data. Because str and bytes operation sets largely intersect, the choice won't be much of a dilemma for most programs (e.g., see the binary file example using the struct module in the next section):

# Types are not flexible for file content

>>> open('temp', 'w').write('abc\n')
4
>>> open('temp', 'w').write(b'abc\n')
TypeError: can't write bytes to text stream

>>> open('temp', 'wb').write(b'abc\n')
4
>>> open('temp', 'wb').write('abc\n')
TypeError: can't write str to binary stream

Using Unicode Text Files

(Update: this draft article originally stopped short here before demonstrating open() encodings for text files. In brief, text files allow a specific Unicode encoding-scheme name to be passed in with an encoding argument, and use it to automatically decode and encode text on input and output, respectively.

A file opened for input with encoding='utf-8', for instance, is assumed to hold content encoded per UTF-8, and its data is automatically decoded to str code points when read by the program. Similarly, a file opened for output with encoding='latin-1' encodes str code points to their Latin-1 format as they are written to the file. File transfers raise exceptions whenever a requested format doesn't work.

In the absence of encoding, text files still encode and decode per a platform- and version-specific default noted earlier. Although you may not notice these translations if your default and files agree, you generally should not rely on the default; it makes your programs dependent on the context in which their files were created, and can lead to portability issues. A program run on a UTF-8 default platform, for instance, may have trouble using a file made under a Latin-1 default. Read more on this here.

For more coverage and examples of these topics, try this site's posts here and here, and see this article's later version in this book.)
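The encoding argument described in the update above can be sketched as follows. This is a minimal illustration rather than a full treatment: the file name unitext.txt and the variable names are made up, and the file lives in a scratch directory that is deleted at the end:

```python
import os
import tempfile

text = 'A\xc4B\xe8C'                               # 5 characters, 2 non-ASCII

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'unitext.txt')     # hypothetical file name

    with open(path, 'w', encoding='latin-1') as f: # encode on output
        f.write(text)

    with open(path, 'rb') as f:
        raw = f.read()                             # encoded bytes in the file

    with open(path, 'r', encoding='latin-1') as f: # decode on input
        same = f.read()

    try:
        with open(path, 'r', encoding='utf-8') as f:
            f.read()                               # wrong encoding for this file
        mismatch = False
    except UnicodeDecodeError:
        mismatch = True

print(raw, same == text, mismatch)
```

The final read fails because the Latin-1 bytes in the file are not valid UTF-8, which is the sort of exception the update mentions for formats that don't work.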

Other String Tool Changes in 3.0

Finally, some of Python's other popular string-processing tools in its standard library have been revamped for the new str/bytes dichotomy too. We won't cover any of these application-focused tools in much detail in this core language book, but here's a quick look at two of the major tools impacted.

The re Pattern-Matching Module

Python's re pattern-matching module has been generalized to work on objects of any string type in 3.0: str, bytes, and bytearray. Note that you can't mix str and bytes types in its calls' arguments, though:

>>> import re
>>> S = 'Bugger all down here on earth!'
>>> B = b'Bugger all down here on earth!'
>>>
>>> re.match('(.*) down (.*) on (.*)', S).groups()
('Bugger all', 'here', 'earth!')
>>>
>>> re.match(b'(.*) down (.*) on (.*)', B).groups()
(b'Bugger all', b'here', b'earth!')


>>> re.match('(.*) down (.*) on (.*)', B).groups()
...
TypeError: can't use a string pattern on a bytes-like object
>>>
>>> re.match(b'(.*) down (.*) on (.*)', S).groups()
...
TypeError: can't use a bytes pattern on a string-like object


>>> re.match(b'(.*) down (.*) on (.*)', bytearray(B)).groups()
(bytearray(b'Bugger all'), bytearray(b'here'), bytearray(b'earth!'))
>>>
>>> re.match('(.*) down (.*) on (.*)', bytearray(B)).groups()
...
TypeError: can't use a string pattern on a bytes-like object
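When the pattern and subject types disagree, the fix is the same manual conversion used elsewhere in this article: decode the subject, or encode the pattern. The following sketch assumes the data is ASCII text; the variable names are arbitrary:

```python
import re

pattern = '(.*) down (.*) on (.*)'
B = b'Bugger all down here on earth!'

# Decode the subject so it matches the str pattern's type
as_text = re.match(pattern, B.decode('ascii')).groups()

# Or encode the pattern so it matches the bytes subject's type
as_bytes = re.match(pattern.encode('ascii'), B).groups()

print(as_text)
print(as_bytes)
```

Which conversion to use depends on whether the data is really text; decode it if so, and keep both sides as bytes otherwise.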

The struct Binary-Data Module

Along similar lines, the Python struct module, used to create and extract packed binary data from strings, works in 3.0 as it does in 2.X, but operates on bytes and bytearray only, not str (which makes sense, given that it's intended for processing binary data, not text):

>>> import struct
>>> B = struct.pack('>i4sh', 7, b'spam', 8)
>>> B
b'\x00\x00\x00\x07spam\x00\x08'
>>>
>>> vals = struct.unpack('>i4sh', B)
>>> vals
(7, b'spam', 8)
>>>
>>> vals = struct.unpack('>i4sh', B.decode())
TypeError: 'str' does not have the buffer interface

Apart from the new syntax for bytes, creating and reading binary files works almost the same in 3.0 as it does in 2.X (as described briefly earlier in this book):

C:\misc>c:\python30\python.exe
>>> F = open('data.bin', 'wb')                  # open binary output file
>>> import struct
>>> data = struct.pack('>i4sh', 7, b'spam', 8)  # create packed binary data
>>> data                                        # bytes in 3.0, not str
b'\x00\x00\x00\x07spam\x00\x08'
>>> F.write(data)                               # write to the file
10
>>> F.close()

>>> F = open('data.bin', 'rb')                  # open binary input file
>>> data = F.read()                             # read bytes
>>> data
b'\x00\x00\x00\x07spam\x00\x08'
>>> values = struct.unpack('>i4sh', data)       # extract packed binary data
>>> values                                      # back to Python objects
(7, b'spam', 8)
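Because struct also accepts bytearray, packed data can be built in a preallocated mutable buffer instead of a new bytes each time. The following sketch uses the module's calcsize(), pack_into(), and unpack_from() calls with the same format string as above; the buffer and variable names are illustrative only:

```python
import struct

fmt = '>i4sh'                                   # big-endian: int, 4-byte string, short
size = struct.calcsize(fmt)                     # 4 + 4 + 2 bytes

buf = bytearray(size)                           # zero-filled, mutable buffer
struct.pack_into(fmt, buf, 0, 7, b'spam', 8)    # pack in place at offset 0

values = struct.unpack_from(fmt, buf, 0)        # extract from the same buffer
name = values[1].decode('ascii')                # bytes field to str, if it's text

print(size, buf, values, name)
```

Note that the string field comes back as a bytes; decoding it is up to you, and is appropriate only if the field really holds encoded text.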

For more on re, struct, and other string-related modules, consult the Python Library Manual, the published version of this article, or application-focused follow-up books such as Programming Python.


An edited and enhanced version of this page's material appeared in this book, and was later expanded in this edition. See the latter for additional coverage of strings in both Python 3.X and 2.X.



©M.Lutz