[LP5E] This article was originally written for Pythons 3.0 and 2.6, but applies to all 3.X and 2.X. It became a new chapter in the book Learning Python, and was revised and expanded in the 5th Edition, but evolved independently here. See also the related reading links ahead for more resources.

Jun-2009 (last polished Apr-2024)


Strings in 3.X: Unicode and Binary Data

One of the most noticeable changes in Python 3.0 is the mutation of string object types. In a nutshell, 2.X's str and unicode types have morphed into 3.X's bytes and str types, and a new mutable bytearray type has been added. Especially if you process data that is either Unicode or binary in nature, this can have substantial impacts on your code. As a general rule of thumb, how much you need to care about this topic depends in large part upon which of the following categories you fall into:

  1. If you deal with non-ASCII Unicode text—for instance, in the context of internationalized applications, Internet content, or XML parsers—you will find support for text encodings to be different in 3.0, but also probably more direct, accessible, and seamless than in 2.6, thanks to 3.0's all-Unicode str.
  2. If you deal with binary data—for example, in the form of image or audio files, network transfers, or packed data processed with the struct module—you will need to understand 3.0's new bytes object, and its different and sharper distinction between text and binary data and files.
  3. If you fall into neither of the prior two categories, you can generally use strings in 3.0 much as you would in 2.6: with the general str string type, text files, and all the familiar string operations. Your strings will be encoded and decoded using your platform's default encoding (e.g., ASCII, UTF-8, or Latin-1; the locale module's getpreferredencoding() gives your open() default if you must know), but you probably won't notice.

For example, if text is still always ASCII in your corner of the software world, you might be able to get by with normal string objects and text files, and can avoid most of the following story. As we'll see in a moment, ASCII is a simple kind of Unicode and a subset of other encodings, so string operations and files "just work" if your programs process ASCII text.

Even if you fall into the last category above, though, a basic understanding of 3.0's string model can help, both to demystify some of the underlying details now, and to help you master Unicode or binary data issues if they impact you in the future. Given the prominence of the web in most software careers today, that impact may be more a matter of "when" than "if."

The Basics

Before looking at code, let's begin with a general overview of the 3.0 string model. To understand why 3.0 went the way it did, we have to start with a brief look at how characters are actually represented in computers.

Character Representations

Most programmers think of strings as a series of characters (really, their integer codes) used to represent textual data. That's still true in the brave new world of Unicode, but the way characters are stored in a computer's memory and files can vary, depending on both what sort of characters are recorded, and how programmers choose to record them.

For many programmers in the US, the ASCII standard defines their notion of text strings. ASCII is a standard created in the US, which defines character codes 0..127, and thus allows each character to be stored in one 8-bit byte. For example, the ASCII standard maps character 'a' to the integer value 97 (61 in hex), which can be stored in a single byte both in memory and in files. If you wish to check, Python's ord() shows the integer code of a given character, chr() reveals the character of a given integer code, and hex() displays a code in hexadecimal, where two hex digits (one per 4-bit nibble) fill one 8-bit byte. The integer returned by ord() is both the character's code and its byte value in ASCII:

>>> ord('a')      # character => code
97
>>> chr(97)       # code => character
'a'
>>> hex(97)       # byte value: fits 8 bits
'0x61'

ASCII makes text processing simple, because characters directly correlate to bytes. Sometimes, though, this isn't enough. Accented characters and special symbols, for example, do not fit into the range of character codes defined by ASCII. To allow for some such extra characters, other standards allow all possible values in an 8-bit byte, 0..255, to be used as codes, and assign values 128..255 to additional characters. One such standard is known as Latin-1, and is widely used in Western Europe. In Latin-1, character codes above 127 are assigned to accented and otherwise-special characters. For instance, the character which Latin-1 assigns to code 196 (a.k.a. byte value 0xc4) is a specially marked and non-ASCII character. Per Python 3.X:

>>> chr(196)         # too big for ASCII
'Ä'
>>> ord('Ä')         # okay for Latin-1
196
>>> hex(ord('Ä'))    # byte value in Latin-1
'0xc4'

Still, some alphabets define so many characters that it is impossible to represent them as one byte-sized code per character. The integer codes of the symbols and characters in the following, for example, require more space than a byte—as do those of all the silly emojis that may not work in some viewers and editors, but manage to crop up in your emails anyhow:

>>> ord('☞')
9758
>>> hex(ord('☞'))                    # too big for one byte
'0x261e'

>>> [hex(ord(c)) for c in '真Л⇨']    # ditto: Unicode required
['0x771f', '0x41b', '0x21e8']

>>> [hex(ord(c)) for c in '🙂🙊👍']   # emojis > two bytes (16 bits)
['0x1f642', '0x1f64a', '0x1f44d']

Unicode provides the generality we need to deal with text containing non-ASCII characters and symbols like these. In fact, it defines and assigns enough character codes to represent almost every natural language in use, plus a large set of symbols. Unicode is sometimes referred to as "wide-character" strings, because its range of characters is so broad that multiple bytes may be needed to represent individual character codes. To allow for this, it also defines standard ways to map character codes to bytes for storage and transmission that are both platform and language neutral—the encodings we'll explore in the next section.

The takeaway here is that Unicode's combination of all-encompassing character codes and their predefined encodings makes it a highly flexible model, and the standard way that programs deal with non-English and other text that may have more characters than 8-bit bytes can handle. As an added bonus, earlier schemes like ASCII also fall under the Unicode umbrella unchanged, but we have to move on to the next section to see how.

Character Encoding Schemes

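The key to understanding how Unicode works lies in the way its character codes (a.k.a. "code points") in memory are mapped to their encoded forms as needed for efficient storage or transfer. We say that characters are translated to and from raw bytes using an encoding: the rules for translating a Unicode string into a sequence of bytes, and extracting the string from a sequence of bytes. More procedurally, this translation back and forth between bytes and strings is defined by two terms:

  1. Encoding is the process of translating a string of characters into its raw bytes form, according to a desired encoding name.
  2. Decoding is the process of translating a raw string of bytes into its character string form, according to its encoding name.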

As noted, Unicode defines both character codes and a set of standard encodings. For some of the encodings it defines, the translation process is trivial—ASCII and Latin-1, for instance, map each character to a single byte, so little or no work is required to encode and decode. For other encodings, the mapping can be more complex, and yield multiple bytes per character.

The widely used UTF-8 encoding, for example, allows more characters to be represented by employing a variable-number-of-bytes scheme that's both general and economical. Character codes less than 128 are represented as a single byte; codes between 128 and 0x7ff (2047) are turned into two bytes, where each byte has a value between 128 and 255; and codes above 0x7ff are turned into three- or four-byte sequences having values between 128 and 255. This keeps simple ASCII strings compact, sidesteps byte ordering issues, and avoids null (zero) bytes that can cause problems for C libraries and networking.
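As a rough illustration of the byte counts just described, the following sketch encodes one sample character from each code range. The sample code points are arbitrary choices made for this example, and the names used are hypothetical:

```python
# Sketch: how many bytes UTF-8 uses for code points in each range.
# The four sample code points are illustrative choices only.
sample_codes = [
    0x61,       # 'a': below 128, plain ASCII
    0xC4,       # A with diaeresis: between 128 and 0x7ff
    0x261E,     # pointing-hand symbol: above 0x7ff, within 16 bits
    0x1F642,    # an emoji: beyond 16 bits
]

utf8_lengths = []
for code in sample_codes:
    encoded = chr(code).encode('utf-8')
    utf8_lengths.append(len(encoded))
    print(hex(code), len(encoded), encoded)

# Every byte in a multi-byte sequence is 128 or higher, so no zero bytes appear
multi = ''.join(chr(code) for code in sample_codes[1:]).encode('utf-8')
all_high = all(byte >= 128 for byte in multi)
print(all_high)
```

The lengths come out as 1, 2, 3, and 4 bytes respectively, which is why ASCII text stays compact under UTF-8.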

Despite such details, it's important to note that ASCII is a subset of both Latin-1 and UTF-8. This is true because these encodings both assign ASCII characters to the same codes, and encode those characters to bytes the same way. This makes Unicode compatible with existing ASCII data: every character string encoded per ASCII is also valid according to the Latin-1 and UTF-8 encodings, and every ASCII file is a valid Latin-1 and UTF-8 file. Technically, the ASCII encoding is a 7-bit subset of the other two: it's binary compatible for all character codes less than 128. Latin-1 and UTF-8 simply allow for additional characters: Latin-1 for characters mapped to values 128..255 within a byte, and UTF-8 for characters that may be represented with multiple bytes.
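The subset relationship can be checked directly. This is a minimal sketch using an arbitrary ASCII string and one arbitrary non-ASCII character; the variable names are illustrative only:

```python
# Sketch: ASCII text yields identical bytes under ASCII, Latin-1, and UTF-8.
text = 'spam'
ascii_bytes = text.encode('ascii')
same_bytes = ascii_bytes == text.encode('latin-1') == text.encode('utf-8')

# The same ASCII bytes also decode to the same text under all three schemes
decoded = [ascii_bytes.decode(name) for name in ('ascii', 'latin-1', 'utf-8')]

# Outside the 7-bit range, Latin-1 and UTF-8 go their separate ways
accented = chr(0xC4)
latin1_form = accented.encode('latin-1')    # one byte
utf8_form = accented.encode('utf-8')        # two bytes

print(same_bytes, decoded, latin1_form, utf8_form)
```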

Other encodings support richer character sets in other ways. For instance, UTF-16 and UTF-32 use larger 2- and 4-byte units per character, respectively, the former with a special "surrogate pair" protocol that uses two such units for codes too large for 2 bytes. We'll skip further details here, but keep in mind that all of these (ASCII, Latin-1, UTF-8, and others) are simply alternative Unicode encodings that yield the same Unicode code-point text when decoded. This net effect ensures that text is portable across all the tools that use it, in exchange for minor translation costs.
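To make the "same code points when decoded" idea concrete, here is a small sketch that encodes one string three ways and decodes each result back. The sample text is an arbitrary choice, and the byte-order-specific "-le" codec names are used here only to keep BOM headers out of the byte counts:

```python
# Sketch: one string, three encodings, identical text after decoding.
text = 'A' + chr(0xC4) + chr(0x1F642)   # ASCII, Latin-1 range, beyond 16 bits

utf8 = text.encode('utf-8')             # 1 + 2 + 4 bytes
utf16 = text.encode('utf-16-le')        # 2 + 2 + 4 bytes (last is a surrogate pair)
utf32 = text.encode('utf-32-le')        # 4 + 4 + 4 bytes

sizes = (len(utf8), len(utf16), len(utf32))

roundtrips = [
    utf8.decode('utf-8'),
    utf16.decode('utf-16-le'),
    utf32.decode('utf-32-le'),
]
all_equal = all(result == text for result in roundtrips)
print(sizes, all_equal)
```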

To Python programmers, an encoding is specified as a string containing the encoding's name. Python comes with roughly 100 different encodings; see the Python Library Reference for a list. Importing module encodings and asking for help(encodings) shows you many as well; some are implemented in Python, and some in C. Some encodings have multiple names too; for example, "latin-1", "iso_8859_1" and "8859" are all synonyms for the same encoding, Latin-1. We'll revisit encodings later in this article, when we study Unicode coding techniques.
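One way to see that several names refer to a single encoding is to look each up with the standard codecs module and compare the canonical names returned. This is only a sketch; the alias list is the one given above, and the bogus name is made up for the example:

```python
import codecs

# Sketch: aliases resolve to one underlying codec.
aliases = ['latin-1', 'iso_8859_1', '8859']
canonical = {codecs.lookup(name).name for name in aliases}
print(canonical)                        # a single canonical name

# Each alias therefore produces the same bytes
results = {chr(0xC4).encode(name) for name in aliases}

# Unknown names are rejected ('no-such-encoding' is a made-up name)
unknown_rejected = False
try:
    codecs.lookup('no-such-encoding')
except LookupError:
    unknown_rejected = True
```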

For another take on the Unicode backstory, see the Python standard manual set. It includes a "Unicode HOWTO" in its "Python HOWTOs" section which provides additional details which we will skip here in the interest of space.

Update: some encodings also require or allow markers at the start of encoded text, known as Unicode BOMs. These markers can designate byte order and encoding type, and may be present whether the encoded text is stored in files or memory. Though not covered in this early-draft article, there is a brief look at this topic on this site here, and more complete surveys in later books. For more on why using the correct encoding matters in general, read the Latin-1/CP-1252 saga here.
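For a quick feel for BOMs, the sketch below uses the standard library's codecs constants and the "utf-8-sig" codec name. It is only an illustration under the assumption that in-memory bytes are enough to make the point; the sample text is arbitrary:

```python
import codecs

text = 'spam'

# Plain UTF-8 adds no marker; the 'utf-8-sig' codec writes one at the front
plain = text.encode('utf-8')
signed = text.encode('utf-8-sig')
has_bom = signed == codecs.BOM_UTF8 + plain

# Decoding with 'utf-8-sig' drops the marker; plain 'utf-8' keeps it as U+FEFF
stripped = signed.decode('utf-8-sig')
kept = signed.decode('utf-8')

# The generic 'utf-16' codec adds a BOM in the platform's native byte order,
# while the byte-order-specific 'utf-16-le' codec adds none
utf16 = text.encode('utf-16')
utf16_has_bom = utf16.startswith(codecs.BOM_UTF16)
le_has_bom = text.encode('utf-16-le').startswith(codecs.BOM_UTF16_LE)

print(has_bom, repr(stripped), repr(kept), utf16_has_bom, le_has_bom)
```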

Update: this doc also does not discuss Unicode normalization, an advanced but essential topic. In short, the Unicode standard oddly allows some non-ASCII characters to be represented with multiple and differing code-point sequences when decoded. This in turn forces many text- and filename-processing tools to make disparate forms equivalent before running comparisons. For more details on this border case, see this site's off-page coverage here and here.
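As a brief taste of the normalization issue, the following sketch uses the standard unicodedata module. The accented "e" is an arbitrary example character chosen for this illustration:

```python
import unicodedata

composed = '\u00e9'         # one code point: e with acute accent
decomposed = 'e\u0301'      # two code points: 'e' plus a combining acute accent

# Both display the same way, but compare unequal as raw code-point sequences
raw_equal = composed == decomposed
lengths = (len(composed), len(decomposed))

# Normalizing both sides to one form makes them comparable
nfc_equal = unicodedata.normalize('NFC', decomposed) == composed
nfd_equal = unicodedata.normalize('NFD', composed) == decomposed

print(raw_equal, lengths, nfc_equal, nfd_equal)
```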

Python's String Types

At a more concrete level, the Python language provides multiple string data types to represent content in your script: textual data (the integer code-point values of decoded Unicode characters in memory) and binary data (raw byte values, including text that is in encoded form). These types differ in the two Python lines.

For example, Python 2.X has a general string type for representing both simple 8-bit character text like ASCII and binary data, along with a specific type for representing richer Unicode text that may occupy multiple bytes when encoded or decoded:

  1. str for representing 8-bit text and binary data
  2. unicode for representing wide-character Unicode text

Python 2.X's two string types are different (unicode allows for the extra range of Unicode characters, and has extra support for encoding and decoding), but their operation sets largely overlap. Because the str string type in 2.X represents both text that can be represented with 8-bit bytes as well as binary data, it can be used for both textual and non-textual content.

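By contrast, Python 3.0 comes with 3 string object types:

  1. str for representing Unicode text (both 8-bit and wider)
  2. bytes for representing binary data
  3. bytearray, a mutable flavor of the bytes type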

All 3 types support similar operation sets, but have different roles. The main goal behind this change was to merge the normal and Unicode string types of 2.X (its str and unicode) into a single string type that supports both byte-oriented and richer Unicode text. Developers wanted to remove the 2.X string dichotomy, and make Unicode processing more uniform and natural.

To achieve this, the 3.0 str type is defined as an immutable sequence of characters (really, code points that are not necessarily bytes). Its content may contain both simple text such as ASCII whose encoded and decoded forms yield one byte per character, as well as richer Unicode text whose encoded and decoded forms may both require multiple bytes per character. In memory, a str is just a sequence of Unicode code points. When transferred to and from files, a str is automatically encoded and decoded using either the platform default, or a provided encoding name to translate with an explicit scheme.

While 3.0's new str type does achieve 2.X str/unicode merging for text, many programs still need to process raw binary content that is not encoded per any Unicode format, as well as the bytes used to store text when it is encoded. Image files and packed data you might process with Python's struct module fall into this category. To support this, a new type, bytes, was also introduced for processing truly binary data. bytes is just bytes, not Unicode characters, though its content may include still-encoded text.

In 2.X, the general str type filled this binary data role, because strings were just sequences of bytes (the separate unicode type handled richer text). In 3.0, the bytes type is defined as an immutable sequence of 8-bit integers representing byte values, and supports almost all the same operations that the str type does; this includes string methods, sequence operations, and even re module pattern matching, but not formatting (till later in 3.X's evolution: see the update ahead).

A bytes object really is a sequence of small integers, each of which is in the range 0..255; indexing a bytes returns an int, slicing one returns another bytes, and running list() on one returns a list of integers, not characters. However, when processed with operations that assume characters (e.g., the isalpha() method), the contents of bytes objects are assumed to be ASCII-encoded bytes. Further, bytes items whose values fall in the range of ASCII character codes are printed as ASCII characters instead of integers; this is done purportedly for convenience, though it may also confuse the distinction between text and binary data.
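The ASCII assumption in bytes methods is easy to observe. This is a small sketch with arbitrary sample values, not an exhaustive survey of bytes behavior:

```python
# Sketch: bytes is a sequence of small ints that displays ASCII values as characters.
data = bytes([115, 112, 97, 109])
print(data)                             # ASCII-range values print as characters

first_item = data[0]                    # an int, not a 1-character string
first_slice = data[:2]                  # another bytes

# Character-oriented methods treat the content as ASCII-encoded text
ascii_alpha = data.isalpha()
upper = data.upper()

# A byte outside ASCII is not "alphabetic" to bytes, even though the
# Latin-1 character it encodes is alphabetic once decoded to str
high = bytes([0xC4])
high_alpha = high.isalpha()
decoded_alpha = high.decode('latin-1').isalpha()

# Non-ASCII and unprintable byte values display as hex escapes
display = repr(bytes([65, 0, 255]))
print(display)
```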

While it was at it, Python also sprouted bytearray in 3.0, a variant of bytes, which is mutable, and so supports in-place changes. The bytearray type supports the usual string operations that str and bytes do, but also has many of the same in-place change operations as lists (e.g., append() and extend(), and assignment to indexes). Assuming your strings can be treated as raw bytes, bytearray finally adds direct in-place mutability for string data—something not possible in 2.X, or with 3.0's str or bytes.
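A short sketch of bytearray's in-place changes follows. The content is an arbitrary ASCII sample, and item values must be given as integers, not 1-character strings:

```python
# Sketch: bytearray supports list-like in-place changes on byte data.
buf = bytearray(b'spam')

buf[0] = ord('S')           # index assignment takes an int
buf.append(ord('!'))        # append one byte value
buf.extend(b'??')           # extend with any bytes-like object
buf[1:3] = b'PA'            # slice assignment works too
print(buf)

# Assigning a str item fails: items are ints, not characters
str_rejected = False
try:
    buf[0] = 'x'
except TypeError:
    str_rejected = True

frozen = bytes(buf)         # immutable copy
text = buf.decode('ascii')  # decode to str when the content is text
```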

Text and Binary Files

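File I/O has also been revamped in 3.0 to reflect the str/bytes distinction. Really, text is just decoded integer character codes when it is in memory; it's when text is transferred to and from external interfaces like files that Unicode encodings come into play. By contrast, truly binary data may have nothing at all to do with encodings (or text at all). Because of this, Python now makes a sharp platform-independent distinction between text files and binary files:

  1. Text files: when a file is opened in text mode, reading its data automatically decodes its content (per a platform default or a provided encoding name) and returns it as a str; writing takes a str and automatically encodes it before transferring it to the file.
  2. Binary files: when a file is opened in binary mode by adding a "b" to the mode string, reading its data does not decode it in any way, but returns its content raw and unchanged as a bytes object; writing similarly takes a bytes object and transfers it to the file unchanged.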

Because str and bytes are sharply differentiated by the language, the net effect is that you must decide whether your data is text or binary in nature, and use str or bytes objects to represent its content in your script, respectively. Ultimately, the mode in which you open a file will dictate which type of object your script will use to represent its content.

Notice that the mode string argument to open() (its second argument) becomes fairly crucial in Python 3.0: its content not only specifies a file processing mode, but also implies a Python object type. By adding a "b" (lowercase only) to the mode string, you specify a binary mode file, and will receive, or must provide, a bytes object to represent the file's content when reading or writing. Without the "b", your file is processed in text mode, and you'll use str objects to represent its content. For example, modes "rb", "wb", and "rb+" imply bytes; "r", "w+", and "rt" (the default) imply str.
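The effect of the mode string on object types can be sketched with a scratch file. The file name and temporary folder here are hypothetical stand-ins, and UTF-8 is passed explicitly so the result does not depend on platform defaults:

```python
import os
import tempfile

text = 'A\u00c4B'                       # one non-ASCII character in the middle

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo.txt')    # hypothetical scratch file

    # Text mode: write a str, which is encoded on the way out
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Text mode read decodes to str; binary mode read returns raw bytes
    with open(path, 'r', encoding='utf-8') as f:
        as_text = f.read()
    with open(path, 'rb') as f:
        as_bytes = f.read()

    # Binary mode files reject str objects on writes
    str_rejected = False
    try:
        with open(path, 'wb') as f:
            f.write('text')
    except TypeError:
        str_rejected = True

print(type(as_text), type(as_bytes), as_bytes, str_rejected)
```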

If you're anxious to see files in action, watch for the examples ahead, especially those of Unicode-text files. To understand file usage in full, though, we first need to explore string basics.

Python 3.0 Strings in Action

Let's step through a few examples that demonstrate how the 3.0 string types are used. Note up front that the code in this section was run with and applies to 3.0 only, unless noted otherwise. That said, although there is no bytes type in Python 2.6 (it has just the general str), some cross-version compatibility is still possible: in 2.6, the call bytes(X) is present as a synonym for str(X), and the new literal form b'...' is taken to be the same as the normal string literal '...' (in this article, ... means a string's characters). You may still run into version skew in some cases, though; the 2.6 bytes() call, for instance, does not allow the second argument (encoding name) required by 3.0's bytes().

Literals and Basic Properties

Python 3.0 string objects originate when you call a function such as str() or bytes(); process a file created by calling open() (described later in this article); or code literal syntax in your script. For the latter, a new literal form, b'...' (and equivalently, B'...') is used to create bytes objects in 3.0, and bytearray objects may be created by calling the bytearray() function, with a variety of possible arguments.

More formally, in 3.0 all the current string literal forms—'...', "...", and triple-quoted blocks—generate a str; adding a "b" or "B" just before them creates a bytes instead. This new b'...' bytes literal is similar in spirit to the r'...' raw string, which suppresses backslash escapes. Consider the following:

C:\misc>c:\python30\python

>>> B = b'spam'                # make a bytes object (8-bit bytes)
>>> S = 'eggs'                 # make a str object (Unicode characters, 8-bit or wider)

>>> type(B), type(S)
(<class 'bytes'>, <class 'str'>)

>>> B                          # prints as a character string, really a sequence of ints
b'spam'
>>> S
'eggs'

>>> B[0], S[0]                 # indexing returns an int for bytes, str for str
(115, 'e')

>>> B[1:], S[1:]               # slicing makes another bytes or str
(b'pam', 'ggs')

>>> list(B), list(S)           # bytes is really ints
([115, 112, 97, 109], ['e', 'g', 'g', 's'])

>>> B[0] = 'x'                                  # both are immutable
TypeError: 'bytes' object does not support item assignment

>>> S[0] = 'x'
TypeError: 'str' object does not support item assignment

>>> B = B"""                   # bytes prefix works on single, double, triple quotes
... xxxx
... yyyy
... """
>>> B
b'\nxxxx\nyyyy\n'

As mentioned, for forward compatibility, in Python 2.6 the 3.0 b'...' literal is present but is the same as '...' and makes a 2.X str, and bytes() is just a synonym for str(); in 3.0, both these address the distinct bytes type, as shown above for the literal. Also note that the u'...' and U'...' unicode string literal forms in 2.6 discussed ahead are gone in 3.0; use '...' in 3.0 instead, since all text strings are Unicode in the 3.X line, even if they contain only ASCII characters.

Update: Python 3.X later reinstated 2.X's unicode string literals to ease migration of 2.X code: a 2.X u'...' unicode literal in Python 3.3 and later is now just a synonym for a 3.X '...' str literal. This makes sense given 3.X's all-Unicode str type, and is the backward-compatible equivalent of 2.X's forward-compatible b'...' support. It's tempting to read into this that 2.X's str and unicode simply become 3.X's bytes and str, but the division of these types' roles is much sharper in 3.X, as the next section explains.

String Type Conversions

Syntax aside, the first thing you might notice about Python 3.0 strings is what they cannot do. Although Python 2.X allows its str and unicode objects to be freely mixed (if the str contains only 7-bit ASCII text, at least), 3.X draws a much sharper distinction—str and bytes never mix automatically in expressions, and as a rule are not converted to one another automatically when passed to functions. That is, a function that expects an argument to be a str object won't generally accept a bytes (and vice versa), and operators are just as rigid in 3.X:

>>> 'eggs' + b'spam'
TypeError: can only concatenate str (not "bytes") to str

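This is easier to understand if you remember that a text string may be radically different in its encoded and decoded forms, and Python has no idea what the content of a bytes is: if the bytes is encoded text its encoding is unknown, but it may also be binary data that has nothing to do with text at all. Because of this ambiguity, Python 3.0 basically requires that you either commit to one type or the other, or perform manual, explicit conversions with the following tools:

  1. str.encode() and bytes(S, encoding) translate a string to its raw bytes form, and create a bytes from a str in the process.
  2. bytes.decode() and str(B, encoding) translate raw bytes into its string form, and create a str from a bytes in the process.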

Both the S.encode() and B.decode() methods above and the file open() call we'll explore ahead use either an explicitly passed-in encoding name or a default. In Python 3.X, the methods' default is always UTF-8, but open() uses a value in the locale module that may vary per platform (and environment settings). In 2.X both defaults are usually ASCII, as exposed in the sys module (which allows changes at start-up). For example, in 3.X:

>>> S = 'eggs'
>>> S.encode()                     # str to bytes: encode text into raw bytes
b'eggs'

>>> bytes(S, encoding='ascii')     # str to bytes, alternative
b'eggs'

>>> B = b'spam'
>>> B.decode()                     # bytes to str: decode raw bytes into text
'spam'

>>> str(B, encoding='ascii')       # bytes to str, alternative
'spam'

Putting this together solves our original type error, and allows us to mix strings and bytes in 3.X as either encoded or decoded text:

>>> S, B
('eggs', b'spam')

>>> S.encode('ascii') + B          # bytes + bytes (encoded)
b'eggsspam'

>>> S + B.decode('ascii')          # str + str (code points)
'eggsspam'

Two cautions here. First of all, your platform's various default encodings are available in the sys and locale modules, but the encoding argument to bytes() is not optional, even though it is in S.encode() (and B.decode()). Second, although str() does not require the encoding argument like bytes() does, leaving it off in str() calls does not mean it defaults—instead, a str() without an encoding returns the bytes object's print string, not its decoded and converted str form (this is usually not what you'll want!). Assuming B and S are still as in the prior listing:

>>> import sys, locale
>>> sys.platform                         # underlying platform
'win32'
>>> locale.getpreferredencoding(False)   # Windows open() default: a Latin-1 superset
'cp1252'
>>> sys.getdefaultencoding()             # but str() does not use defaults
'utf-8'

>>> bytes(S)
TypeError: string argument without an encoding

>>> str(B)                               # str() without encoding: print string, not conversion!
"b'spam'"
>>> len(str(B))
7

>>> len(str(B, encoding='ascii'))        # use encoding to convert to str
4

Update: as of 2024, Python's docs state that the default encoding for file content is now locale.getencoding(), not locale.getpreferredencoding(False), but this is not true: the former is ignorant of a new UTF-8 mode option that can be enabled by environment variable or command-line argument, though the difference won't matter after Python 3.15 enables UTF-8 mode everywhere (per current plans). You also shouldn't generally care: use explicit encodings in open() calls to avoid interoperability hurdles today.
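To show why an explicit encoding matters, this sketch writes text as UTF-8 and then reads it back two ways. The scratch file name is hypothetical, and Latin-1 stands in for whatever mismatched default a platform might apply:

```python
import os
import tempfile

text = 'A\u00c4B\u00e8C'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'explicit.txt')    # hypothetical scratch file

    # Name the encoding on output instead of relying on a platform default
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Reading with the same name restores the original text
    with open(path, encoding='utf-8') as f:
        same = f.read()

    # Reading with a mismatched name silently yields different characters
    with open(path, encoding='latin-1') as f:
        garbled = f.read()

print(same == text, len(garbled), ascii(garbled))
```

The mismatched read raises no error here, because every byte is valid Latin-1; the result is simply the wrong text, which is the kind of quiet failure explicit names help avoid.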

Having said all that, it's important to also note that encoding and decoding are substantially more than simple programming-language type conversions; really, they produce very different kinds of data. Encoding returns the bytes that result from transforming a text string per a Unicode scheme, and decoding returns the text string that is produced by undoing that transformation. While this is a conversion of sorts, and the mapping may seem trivial for simple text like ASCII, Unicode tends to make much more sense if you avoid blurring the distinction—especially for richer types of text like that in the next section.

Coding Unicode Strings in Python 3.0

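Encoding and decoding get more meaningful when you start dealing with actual non-ASCII Unicode text. To code Unicode characters that may be difficult to type on your keyboard, Python string literals support both:

  1. "\xNN" hex byte value escapes
  2. "\uNNNN" and "\UNNNNNNNN" Unicode escapes, where the first form gives four hex digits for a 16-bit character code point, and the second gives eight hex digits for a 32-bit code point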

Importantly, in str objects all three of the escapes listed above are used to give a Unicode character's code point value, not its encoded bytes; use bytes objects if you need to represent a character's encoded bytes instead.

Let's see how this all translates to code. Simple 7-bit ASCII text is formatted with one character per byte under most of the encoding schemes described near the start of this article (again, this is why ASCII passes as a binary-compatible subset of many other schemes):

>>> ord('X')                # 'X' has code point 88 (also its byte value in ASCII)
88
>>> chr(88)                 # 88 stands for character 'X'
'X'

>>> S = 'XYZ'               # str (code points displayed as their character glyphs)
>>> S
'XYZ'
>>> len(S)                  # 3 characters long
3

>>> S.encode('ascii')       # values 0..127 in 1 byte each (ASCII bytes shown as chars)
b'XYZ'
>>> S.encode('latin-1')     # values 0..255 in 1 byte each
b'XYZ'
>>> S.encode('utf-8')       # values 0..127 in 1 byte, 128..2047 in 2, others in 3 or 4
b'XYZ'

By contrast, the less common UTF-16 and UTF-32 use at least 2 bytes and exactly 4 bytes per character, respectively, even for simple text like ASCII (UTF-16 needs 4 bytes for codes too large for 16 bits). This makes these encodings' data fast to process but may consume extra space and bandwidth, which renders them subpar in many applications. In the following, ASCII bytes print as characters, other bytes print as \xNN escapes, and each result has a 2- or 4-byte BOM header at the front whose details we're largely skipping here (see the earlier update):

>>> S
'XYZ'

>>> S.encode('utf-16')      # always 2 or 4 bytes per character, plus a BOM header
b'\xff\xfeX\x00Y\x00Z\x00'

>>> S.encode('utf-32')
b'\xff\xfe\x00\x00X\x00\x00\x00Y\x00\x00\x00Z\x00\x00\x00'

To code non-ASCII characters, you can use hex and Unicode escapes in your strings. The numeric values coded as hexadecimal literals 0xC4 and 0xE8, for instance, are the Unicode code points used to represent two special characters outside the 7-bit range of ASCII; we can embed them in str objects, because str supports Unicode in 3.X today:

>>> chr(0xc4)               # 0xC4 and 0xE8 are accented characters outside ASCII's range
'Ä'
>>> chr(0xe8)
'è'

>>> S = '\u00c4\u00e8'      # 16-bit Unicode escapes
>>> S
'Äè'
>>> len(S)                  # 2 characters long (not number of bytes!)
2

Now, if we try to encode a non-ASCII string to raw bytes as ASCII, we'll get an error. Encoding as Latin-1 works, though, and allocates one byte per character; encoding as UTF-8 allocates 2 bytes per character instead. If you write this string to a file, the raw bytes shown are what is actually stored in the file for the encoding types given:

>>> S = '\u00c4\u00e8' 
>>> S.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

>>> S.encode('latin-1')              # one byte per character
b'\xc4\xe8'

>>> S.encode('utf-8')                # two bytes per character
b'\xc3\x84\xc3\xa8'

>>> len(S.encode('latin-1'))         # 2 bytes in latin-1, 4 in utf-8
2
>>> len(S.encode('utf-8'))
4

Note that you can also go the other way, from raw bytes back to a Unicode string. You could read raw bytes from a file and decode manually this way, but the encoding name you pass to the open() call causes this decoding to be done for you automatically (and avoids issues that may arise from reading partial character sequences when reading by blocks of bytes):

>>> B = b'\xc4\xe8'
>>> B
b'\xc4\xe8'
>>> len(B)                             # 2 raw bytes, 2 characters
2
>>> B.decode('latin-1')                # decode latin-1 bytes to text
'Äè'

>>> B = b'\xc3\x84\xc3\xa8'
>>> len(B)                             # 4 raw bytes
4
>>> B.decode('utf-8')
'Äè'
>>> len(B.decode('utf-8'))             # 2 Unicode characters
2
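The partial-character issue mentioned above can be sketched with the standard codecs module's incremental decoders. The split point below is chosen deliberately to land inside a 2-byte UTF-8 sequence, and the two-block "read" is simulated in memory rather than taken from a real file:

```python
import codecs

data = 'A\u00c4B\u00e8C'.encode('utf-8')    # 7 bytes
first, second = data[:2], data[2:]          # split inside the first 2-byte sequence

# Decoding a block that ends mid-character fails outright
partial_fails = False
try:
    first.decode('utf-8')
except UnicodeDecodeError:
    partial_fails = True

# An incremental decoder buffers the dangling byte until the rest arrives
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(first), decoder.decode(second, final=True)]
joined = ''.join(pieces)

print(partial_fails, pieces, joined)
```

Text-mode files opened with an encoding name handle this bookkeeping for you, which is one reason to prefer them over manual decoding of byte blocks.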

When needed, you can also specify both 16- and 32-bit Unicode code-point values for characters in your str strings: use \u... with 4 hex digits for the former, and \U... with 8 hex digits for the latter. As the last example in the following shows, you can also build such strings up piecemeal using chr(), but it might become tedious for large strings:

>>> S = 'A\u00c4B\U000000e8C'
>>> S                                  # A, B, C, and 2 non-ASCII characters
'AÄBèC'
>>> len(S)                             # 5 characters long
5

>>> S.encode('latin-1')
b'A\xc4B\xe8C'
>>> len(S.encode('latin-1'))           # 5 bytes in latin-1
5

>>> S.encode('utf-8')
b'A\xc3\x84B\xc3\xa8C'
>>> len(S.encode('utf-8'))             # 7 bytes in utf-8
7

>>> S.encode('cp500')                  # two other western european encodings
b'\xc1c\xc2T\xc3'
>>> S.encode('cp850')                  # 5 bytes each
b'A\x8eB\x8aC'

>>> S = 'spam'                         # ascii text is the same in most
>>> S.encode('latin-1')
b'spam'
>>> S.encode('utf-8')
b'spam'
>>> S.encode('cp500')                  # cp500 is ibm ebcdic
b'\xa2\x97\x81\x94'
>>> S.encode('cp850')
b'spam'

>>> S = 'A' + chr(0xC4) + 'B' + chr(0xE8) + 'C'
>>> S
'AÄBèC'

Notice that Python 3.0 allows special characters' code points to be coded with both hex and Unicode escapes in str string literals, but allows only hex escapes in bytes literals; in fact, Unicode escape sequences are taken verbatim in bytes, and not as escapes. This makes sense if you remember that bytes objects hold characters' encoded bytes—not their decoded code points. This is true even though code-point and encoded-byte values happen to be the same for some characters in some encodings (confusingly!). Because bytes are not code points, they also must be decoded to str to print their non-ASCII characters properly:

>>> S = 'A\xC4B\xE8C'            # str recognizes hex and Unicode escapes
>>> S
'AÄBèC'

>>> S = 'A\u00C4B\U000000E8C'    # 4- and 8-digit Unicode escapes (str only)
>>> S
'AÄBèC'

>>> B = b'A\xC4B\xE8C'           # bytes recognizes hex but not Unicode
>>> B
b'A\xc4B\xe8C'

>>> B = b'A\u00C4B\U000000E8C'   # Unicode escape sequences taken literally
>>> B                            # bytes are encoded bytes, not code points
b'A\\u00C4B\\U000000E8C'

>>> B = b'A\xC4B\xE8C'           # use hex escapes for latin-1 bytes
>>> B                            # prints non-ASCII as hex 
b'A\xc4B\xe8C'
>>> print(B)
b'A\xc4B\xe8C'
>>> B.decode('latin-1')          # decode to str to interpret as text 
'AÄBèC'

Finally, notice that bytes literals assume that embedded characters are ASCII, and require escapes for byte values > 127; str literals allow embedding any character supported by the file's source-code encoding (which defaults to UTF-8 in 3.X, unless encoding declarations are given—discussed ahead):

>>> S = 'AÄBèC'                  # chars from UTF-8 if no encoding declaration 
>>> S
'AÄBèC'

>>> B = b'AÄBèC'
SyntaxError: bytes can only contain ASCII literal characters.

>>> B = b'A\xC4B\xE8C'           # chars must be ASCII, or escapes
>>> B                            # non-ASCIIs are latin-1 encoded bytes
b'A\xc4B\xe8C'
>>> B.decode('latin-1')
'AÄBèC'

>>> S.encode()                   # source code encoded per UTF-8 by default 
b'A\xc3\x84B\xc3\xa8C'           # uses system default to encode, unless passed
>>> S.encode('utf-8')
b'A\xc3\x84B\xc3\xa8C'
>>> B.decode()                   # raw bytes do not correspond to utf-8
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: ...

>>> S = 'AÄBèC'
>>> S
'AÄBèC'
>>> S.encode()                   # default utf-8 encoding
b'A\xc3\x84B\xc3\xa8C'
>>>
>>> T = S.encode('cp500')        # convert to EBCDIC
>>> T
b'\xc1c\xc2T\xc3'
>>>
>>> U = T.decode('cp500')        # convert back to Unicode
>>> U
'AÄBèC'
>>>
>>> U.encode()                   # back to UTF-8 bytes, by default
b'A\xc3\x84B\xc3\xa8C'
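
The decode-then-encode round trip above generalizes to any pair of encodings. As a minimal sketch (the transcode name is hypothetical, not a built-in tool), the following converts encoded bytes from one scheme to another by passing through a decoded str:

```python
def transcode(data, source, target):
    """Re-encode bytes: decode per the source encoding, then encode per the target."""
    text = data.decode(source)        # encoded bytes -> str code points
    return text.encode(target)        # str code points -> bytes per target

utf8 = 'A\xc4B\xe8C'.encode('utf-8')            # b'A\xc3\x84B\xc3\xa8C'
ebcdic = transcode(utf8, 'utf-8', 'cp500')      # same text, EBCDIC bytes
latin1 = transcode(utf8, 'utf-8', 'latin-1')    # same text, one byte per char
```

The intermediate str is the point of the design: bytes in one encoding are never mapped directly to bytes in another, only by way of their decoded code points.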

Coding Unicode Strings in Python 2.6

Now that you've seen the basics of Unicode strings in 3.0, it's also important to know that you can do much the same in 2.6, though the tools differ. Unicode is already available in Python 2.6, but it is a distinct data type from str, and 2.6 allows free mixing of normal and unicode strings when compatible. In fact, you can essentially pretend 2.6's str is 3.0's bytes when it comes to decoding into a unicode string, as long as it's in proper form.

Here's 2.6 string support in action (all other sections in this topic but this one are run under 3.0):

>>> import sys
>>> sys.version
'2.6 (r26:66721, Oct  2 2008, 11:35:03) [MSC v.1500 32 bit (Intel)]'

>>> S = 'A\xC4B\xE8C'          # string of 8-bit bytes
>>> print S                    # some are non-ascii
AÄBèC

>>> S.decode('latin-1')        # decode bytes per latin-1 to unicode
u'A\xc4B\xe8C'

>>> S.decode('utf-8')          # not formatted as utf-8
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid data

>>> S.decode('ascii')          # outside ascii range
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal not in range(128)

To store arbitrarily encoded Unicode text, make a Unicode object with the u'...' literal form; this is no longer available in 3.0, since all strings support Unicode in that version (update: as noted earlier, Python 3.X later reinstated 2.X's u'...' unicode string literals to ease migration of 2.X code):

>>> U = u'A\xC4B\xE8C'         # make unicode string, hex escapes
>>> U
u'A\xc4B\xe8C'
>>> print U
AÄBèC

Once created, you can convert Unicode text to different encodings; this is similar to encoding str objects into bytes objects in 3.0:

>>> U.encode('latin-1')        # encode per latin-1: 8-bit bytes
'A\xc4B\xe8C'
>>> U.encode('utf-8')          # encode per utf-8: multi-byte
'A\xc3\x84B\xc3\xa8C'

Non-ASCII characters can be coded with hex or Unicode escapes in string literals just as in 3.0, but just as for bytes in 3.0, the \u... and \U... escapes are recognized only for unicode strings in 2.6, not 8-bit str strings:

>>> U = u'A\xC4B\xE8C'           # hex escapes for non-ascii
>>> U
u'A\xc4B\xe8C'
>>> print U
AÄBèC

>>> U = u'A\u00C4B\U000000E8C'   # unicode escapes for non-ASCII
>>> U                            # \u = 16 bits, \U = 32 bits
u'A\xc4B\xe8C'
>>> print U
AÄBèC

>>> S = 'A\xC4B\xE8C'            # hex escapes work
>>> S
'A\xc4B\xe8C'
>>> print S                      # but some print oddly, unless decoded
A-BFC
>>> print S.decode('latin-1')
AÄBèC

>>> S = 'A\u00C4B\U000000E8C'    # not unicode escapes: taken literally!
>>> S
'A\\u00C4B\\U000000E8C'
>>> print S
A\u00C4B\U000000E8C
>>> len(S)
19

Like 3.0's str and bytes, 2.6's unicode and str share nearly identical operation sets, so you can often treat unicode as though it were str unless you need to convert to other encodings. One of the primary differences between 2.6 and 3.0 is that unicode and non-unicode str objects can be freely mixed in expressions, as long as the non-unicode object contains only 7-bit ASCII characters; the non-unicode str is automatically converted up to unicode in the process (in 3.0, str and bytes never mix automatically, and require manual conversions):

>>> u'ab' + 'cd'                # can mix if compatible (if str is all ASCII)
u'abcd'

>>> S = 'A\xC4B\xE8C'           # can't mix if incompatible
>>> U = u'A\xC4B\xE8C'
>>> S + U
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal not in range(128)

>>> S.decode('latin-1') + U     # manual conversion still required 
u'A\xc4B\xe8CA\xc4B\xe8C'

>>> print S.decode('latin-1') + U
AÄBèCAÄBèC

Finally, note that 2.6's open() call supports only files of 8-bit bytes, and returns their content as str strings; it's up to you to interpret that content as text or binary data. To read and write Unicode files and encode or decode their content in the process, see 2.6's library manual for information on the codecs.open() call. This call provides much the same functionality as 3.0's open(), and uses 2.6 unicode objects to represent file content—reading a file translates encoded bytes into decoded Unicode characters, and writing translates Unicode strings to the desired encoding specified when opened. We'll see more on files in both Pythons ahead.

Source-File Encoding Declarations

One last note on coding non-ASCII text: Unicode escapes suffice for the occasional Unicode character in string literals, but can become tedious if you need to code non-ASCII text in your strings frequently. For string literals and other text that you embed in your script files, Python uses the UTF-8 encoding in 3.X (and ASCII in 2.X) by default to read your code's text, but allows you to change this per file to use an arbitrary encoding—and hence directly embed any unescaped characters that the chosen encoding supports.

To make this work, simply include a comment which names the encoding used to save your source file. This special encoding-declaration comment must appear as either the first or second line in your script, and is usually of the following form (see Python's manuals for other formats it accepts):

# -*- coding: latin-1 -*-

When present, Python will recognize strings represented natively in the given encoding. That way, you can edit your script file in a text editor that accepts, displays, and saves accented and other non-ASCII characters, and Python will correctly decode them when reading your string literals and other program-file text.

For example, notice how the comment at the top of the following file, "text.py," allows Python to recognize Latin-1 characters embedded in strings when the file is saved with this encoding:

# -*- coding: latin-1 -*-

# any of the following string literal forms work in latin-1;
# changing the encoding above to either ascii or utf-8 fails,
# because the 0xc4 and 0xe8 in myStr1 are not valid in either

myStr1 = 'aÄBèC'

myStr2 = 'A\u00c4B\U000000e8C'

myStr3 = 'A' + chr(0xC4) + 'B' + chr(0xE8) + 'C'

import sys, locale
print('Sys default encoding: ', sys.getdefaultencoding())
print('Open default encoding:', locale.getencoding())        # later Python 3.X

for aStr in myStr1, myStr2, myStr3:
    print('{0}, strlen={1}, '.format(aStr, len(aStr)), end='')

    bytes1 = aStr.encode()                     # per default utf-8: 2 bytes for non-ASCII
    bytes2 = aStr.encode('latin-1')            # one byte per char 
   #bytes3 = aStr.encode('ascii')              # ascii fails: outside 0..127 range

    print('byteslen1={0}, byteslen2={1}'.format(len(bytes1), len(bytes2)))


C:\misc>c:\python30\python text.py
Sys default encoding:  utf-8
Open default encoding: cp1252
aÄBèC, strlen=5, byteslen1=7, byteslen2=5
AÄBèC, strlen=5, byteslen1=7, byteslen2=5
AÄBèC, strlen=5, byteslen1=7, byteslen2=5

Since most programmers are likely to fall back on the default source encodings (especially the general UTF-8 in Python 3.X), we'll defer to Python's standard manual set for more details on this option, as well as more advanced Unicode support such as properties and character-name escapes in strings that we'll skip here.
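
If you want to check which encoding Python would use for a given source file, the standard library's tokenize module can report it. The following is a small sketch, not part of the original example; the two in-memory byte strings stand in for saved script files:

```python
import io
import tokenize

declared = b"# -*- coding: latin-1 -*-\nmyStr1 = 'abc'\n"
undeclared = b"myStr1 = 'abc'\n"

# detect_encoding() takes a readline callable that returns lines of bytes
enc1, _ = tokenize.detect_encoding(io.BytesIO(declared).readline)
enc2, _ = tokenize.detect_encoding(io.BytesIO(undeclared).readline)

print(enc1)    # the declared encoding, reported under a normalized name
print(enc2)    # the 3.X default when no declaration is present
```

Note that the declared name comes back normalized (latin-1 is reported as iso-8859-1), so compare names loosely if you use this in your own tools.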

Processing 3.0 Bytes Objects

We'll see the string types we've met in action again when we study files ahead. First, though, let's take a brief detour to dig a bit deeper into the operation sets provided by the new bytes type in 3.0.

As mentioned earlier, the 3.0 bytes type supports sequence operations and most of the same methods available on str (and present in 2.X's str type). However, bytes does not support the format() method or the % formatting expression (until 3.5, per the update ahead). Moreover, you cannot mix and match bytes and str without explicit conversions—you generally will use all str type objects and text files for text data, and all bytes type objects and binary files for binary data.

Method Calls

If you really want to see what attributes str has that bytes doesn't, you can always check their dir() results; this can also tell you something about the expression operators they support (e.g., __mod__ and __rmod__ implement the % operator):

C:\misc>c:\python30\python
Python 3.0 (r30:67507, Dec  3 2008, 20:14:27) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.


# Attributes unique to str

>>> set(dir('abc')) - set(dir(b'abc'))
{'isprintable', 'format', '__mod__', 'encode', 'isidentifier', '_formatter_field_name_split', 
'isnumeric', '__rmod__', 'isdecimal', '_formatter_parser', 'maketrans'}


# Attributes unique to bytes

>>> set(dir(b'abc')) - set(dir('abc'))
{'decode', 'fromhex'}

As you can see, str and bytes have almost identical functionality; their unique attributes are generally methods that don't apply to the other. For instance, decode() translates raw bytes into its str representation, and encode() translates a str into its raw bytes representation. Most methods are shared between str and bytes, though. Moreover, bytes are immutable just like str in both 2.6 and 3.0 (error messages here have been shortened for brevity):

>>> B = b'spam'                    # b'...' bytes literal
>>> B.find(b'pa')
1

>>> B.replace(b'pa', b'XY')
b'sXYm'

>>> B
b'spam'

>>> B[0] = 'x'
TypeError: 'bytes' object does not support item assignment

One notable exception to this rule: string formatting works only on str in 3.0, not on bytes. As told here, 3.0 also convolutes the string formatting story in general by adding redundant functionality, but that story is beyond the scope of this page:

>>> b'%s' % 99
TypeError: unsupported operand type(s) for %: 'bytes' and 'int'

>>> '%s' % 99
'99'

>>> b'{0}'.format(99)
AttributeError: 'bytes' object has no attribute 'format'

>>> '{0}'.format(99)
'99'

Update: Python 3.5 eventually extended % formatting (only) to bytes objects per this page, for better or worse. The extension has a heavy ASCII bias that clashes badly with the generalized Unicode text model of 3.X, but may be useful in limited contexts.
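
As a brief sketch of the 3.5 extension (this requires Python 3.5 or later, and the header text is just an illustrative value), bytes formatting accepts bytes and numeric operands, but still refuses a str for %s:

```python
# Requires Python 3.5 or later; earlier 3.X raises TypeError for bytes %
header = b'%s: %d' % (b'Content-Length', 42)     # bytes and int operands
hexed = b'%04x' % 255                            # numeric codes format as ASCII

try:
    b'%s' % 'spam'                               # str is not accepted for %s
except TypeError as exc:
    error = type(exc).__name__
```

The str/bytes separation still holds here: to insert text, encode it to bytes first.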

Sequence Operations

Besides method calls, all the usual generic sequence operations you know (and possibly love) from Python 2.X strings and lists work as expected on both str and bytes in 3.0; this includes indexing, slicing, concatenation, and so on. Notice in the following that indexing bytes returns an integer giving the byte's binary value; bytes really is a sequence of 8-bit integers, but it prints as a string of ASCII-coded characters (plus non-ASCII escapes) when displayed as a whole. To check a given byte's text interpretation, use chr() to convert it back to its character:

>>> B = b'spam'
>>> B
b'spam'

>>> B[0]
115
>>> B[-1]
109

>>> chr(B[0])
's'

>>> B[1:], B[:-1]
(b'pam', b'spa')
 
>>> len(B)
4

>>> B + b'lmn'
b'spamlmn'
>>> B * 4
b'spamspamspamspam'
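
Because indexing and slicing differ in what they return, it can help to see them side by side. The following sketch uses only built-in operations; the variable names are arbitrary:

```python
B = b'spam'

first_int = B[0]             # indexing yields an int
first_byte = B[0:1]          # slicing yields a bytes of length 1
rebuilt = bytes([B[0]])      # wrap an int in a list to get a 1-byte bytes
codes = [b for b in B]       # iteration yields ints too
```

If you need a one-byte bytes rather than an integer, slice instead of indexing.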

Other Ways to Make Bytes

So far, we've been making bytes objects with the b'...' literal syntax; they can also be created by calling the bytes() constructor with a str and an encoding name, calling bytes() with an iterable of integers representing byte values, or encoding a str object per the default (or passed-in) encoding. Encoding takes a str and returns the raw binary bytes value of the string according to its encoding specification; decoding takes a raw bytes sequence and decodes it to its string representation, a series of Unicode characters:

>>> B = b'abc'
>>> B
b'abc'

>>> B = bytes('abc', 'ascii')
>>> B
b'abc'

>>> ord('a')
97
>>> B = bytes([97, 98, 99])
>>> B
b'abc'

>>> B = 'spam'.encode()       # or bytes()
>>> B
b'spam'
>>>
>>> S = B.decode()            # or str()
>>> S
'spam'

From a larger perspective, the last two of these operations can also be seen as tools for converting between str and bytes, introduced earlier and expanded upon in the next section.
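
The "or bytes()" and "or str()" alternatives mentioned in the comments above can be sketched as follows. One trap is worth noting: calling str() on a bytes without an encoding does not decode it, but returns its print string instead:

```python
B = bytes('spam', 'utf-8')           # same as 'spam'.encode('utf-8')
S = str(B, 'utf-8')                  # same as B.decode('utf-8')

zeros = bytes(3)                     # an int makes that many zero bytes
from_hex = bytes.fromhex('41c442')   # hex digit pairs -> byte values

printable = str(B)                   # no encoding: the print string, not a decode
```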

Mixing String Types

Notice in the replace() call of the earlier method-calls section how we have to pass in two bytes objects; str types won't work there. Although Python 2.X automatically converts str to and from unicode when possible (that is, when the str is only 7-bit ASCII text), Python 3.0 requires specific string types in some contexts, and expects manual conversions if needed:

# Must pass expected types to function and method calls

>>> B = b'spam'

>>> B.replace('pa', 'XY')
TypeError: expected an object with the buffer interface

>>> B.replace(b'pa', b'XY')
b'sXYm'

>>> B = B'spam'
>>> B.replace(bytes('pa'), bytes('xy'))
TypeError: string argument without an encoding

>>> B.replace(bytes('pa', 'ascii'), bytes('xy', 'utf-8'))
b'sxym'


# Must convert manually in mixed-type expressions

>>> b'ab' + 'cd'
TypeError: can't concat bytes to str

>>> b'ab'.decode() + 'cd'                   # bytes to str
'abcd'

>>> b'ab' + 'cd'.encode()                   # str to bytes
b'abcd'

>>> b'ab' + bytes('cd', 'ascii')            # str to bytes
b'abcd'

Two footnotes here. First, remember that encoding and decoding are more than a simple type conversion; as we learned in the fuller coverage earlier, they create different types of data altogether. Second, although you can create bytes objects yourself to represent packed binary data, they can also be made automatically by reading files opened in binary mode, as we'll see later in this article. First, though, let's briefly meet bytes' changeable cousin.
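
When code must accept either type, one common approach is to normalize at the boundary. The helpers below are a hypothetical sketch (to_bytes and to_str are not built-ins), and the UTF-8 default is an assumption you should replace with whatever encoding your data really uses:

```python
def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding it first if it is a str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return bytes(value)                 # bytes or bytearray pass through

def to_str(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes or bytearray."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value

joined_bytes = b'ab' + to_bytes('cd')
joined_text = to_str(b'ab') + 'cd'
```

The conversions remain explicit; the helpers just collect them in one place so the rest of a program can work with a single type.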

Using 3.0 bytearray Objects

So far, we've focused on str and bytes, since they subsume 2.6's unicode and str. Python 3.0 has a third string type, though—bytearray is essentially a mutable variant of bytes, and thus a mutable sequence of integers in the range 0..255. As such, it supports the same string methods and sequence operations as bytes, as well as the mutable in-place-change operations found on lists:

# Creation: a mutable sequence of small (0..255) ints

>>> B = b'spam'               # str 'spam' works in 2.X only
>>> C = bytearray(B)
>>> C
bytearray(b'spam')
>>> C[0], chr(C[0])           # ASCII integer code for 's'
(115, 's')


# Mutable, but must assign ints, not strings

>>> C[0] = 'x'
TypeError: an integer is required

>>> C[0] = b'x'
TypeError: an integer is required

>>> C[0] = ord('x')
>>> C
bytearray(b'xpam')

>>> C[1] = b'Y'[0]
>>> C
bytearray(b'xYam')


# Methods overlap with both str and bytes, but bytearray also has list's mutable methods

>>> set(dir(b'abc')) - set(dir(bytearray(b'abc')))
{'__getnewargs__'}

>>> set(dir(bytearray(b'abc'))) - set(dir(b'abc'))
{'insert', '__alloc__', 'reverse', 'extend', '__delitem__', 'pop', '__setitem__',
 '__iadd__', 'remove', 'append', '__imul__'}


# Mutable method calls

>>> C
bytearray(b'xYam')

>>> C.append(b'LMN')
TypeError: an integer is required

>>> C.append(ord('L'))
>>> C
bytearray(b'xYamL')

>>> C.extend(b'MNO')
>>> C
bytearray(b'xYamLMNO')


# Sequence operations and string methods

>>> C + b'!#'
bytearray(b'xYamLMNO!#')

>>> C[0]
120

>>> C[1:]
bytearray(b'YamLMNO')

>>> len(C)
8

>>> C
bytearray(b'xYamLMNO')

>>> C.replace('xY', 'sp')
TypeError: Type str doesn't support the buffer API

>>> C.replace(b'xY', b'sp')
bytearray(b'spamLMNO')

>>> C
bytearray(b'xYamLMNO')

>>> C * 4
bytearray(b'xYamLMNOxYamLMNOxYamLMNOxYamLMNO')

Finally, by way of summary, the following examples demonstrate how bytes and bytearray are sequences of ints, and str is a sequence of characters (i.e., decoded Unicode code points); although all three can contain character values and support many of the same operations, you should use str for textual data, bytes for binary data, and bytearray for binary data you wish to change in place:

>>> B
b'spam'
>>> list(B)
[115, 112, 97, 109]

>>> C
bytearray(b'xYamLMNO')
>>> list(C)
[120, 89, 97, 109, 76, 77, 78, 79]

>>> S = 'spam'
>>> list(S)
['s', 'p', 'a', 'm']

Python 3.0 File Modes in Action

Now that we've learned all about Python's string types, let's turn to their roles in files—the main context in which most programmers will likely encounter Unicode and bytes, and the last major topic of this tutorial.

As also mentioned above, the mode in which you open a file is crucial: it determines which object type you will use to represent the file's content in your script. Text mode implies str objects, and binary mode implies bytes.

In terms of code, the second positional argument to open() determines whether you want text or binary processing and types, just as it does in 2.X Python—adding a "b" to the string implies binary mode. The default mode is "rt" which is the same as "r", which means text input, just as in 2.X. In 3.0, though, this mode argument to open() also implies an object type for file content representation regardless of the underlying platform—text files return a str for reads and expect one for writes, but binary files return a bytes for reads and expect bytes (or bytearray) for writes.
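
To make the mode-to-type mapping concrete, here is a minimal sketch that writes a file in binary mode and reads it back both ways. The temporary directory and the with statements are choices made for this sketch so that files are closed promptly; they are not requirements:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'temp')   # throwaway file location

with open(path, 'wb') as f:          # binary output: expects bytes
    f.write(b'abc\n')

with open(path, 'r') as f:           # text input: returns str
    text = f.read()

with open(path, 'rb') as f:          # binary input: returns bytes
    raw = f.read()

print(type(text).__name__, type(raw).__name__)
```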

Text File Basics

To demonstrate, let's begin with basic file I/O. As long as you're processing basic text files (e.g., ASCII) and don't care about circumventing the platform-default encoding of strings, files look and feel much as they do in 2.X (for that matter, so do strings in general). The following, for instance, writes one line of text to a file and reads it back in 3.0, exactly as it would in 2.6 (note that file is no longer a built-in name in 3.0, and it's perfectly okay to use it as a variable here either way):

C:\misc>c:\python30\python

# Basic text files (and strings) work the same as in 2.X

>>> file = open('temp', 'w')
>>> size = file.write('abc\n')       # returns number of characters written
>>> file.close()                     # manual close to flush output buffer

>>> file = open('temp')              # default mode is "r" (== "rt"), which means text input
>>> text = file.read()
>>> text
'abc\n'

Using Text and Binary Modes

Next, we'll write a text file and read it back in both modes in 3.0. Notice that we are required to provide a str for writing, but reading gives us a str or bytes depending on the open mode (I've strung operations together here into one-liners just for brevity):

# Write and read a text file

>>> open('temp', 'w').write('abc\n')       # text mode output, provide a str
4

>>> open('temp', 'r').read()               # text mode input, returns a str
'abc\n'

>>> open('temp', 'rb').read()              # binary mode input, returns a bytes
b'abc\r\n'

Now, let's do the same, but with a binary file; we must provide a bytes to write, and still get back a str or bytes depending on the input mode:

# Write and read a binary file

>>> open('temp', 'wb').write(b'abc\n')     # binary mode output, provide a bytes
4

>>> open('temp', 'r').read()               # text mode input, returns a str
'abc\n'

>>> open('temp', 'rb').read()              # binary mode input, returns a bytes
b'abc\n'

Notice that the same holds even if the data we're writing to the binary file is truly binary in nature; in the following, the \x00 is a binary zero byte, and not a printable character (though it passes as a text code point in the default encoding):

# Write and read binary data

>>> open('temp', 'wb').write(b'a\x00c')
3

>>> open('temp', 'r').read()
'a\x00c'

>>> open('temp', 'rb').read()
b'a\x00c'

Binary mode files always return contents as a bytes object, but accept either a bytes or bytearray object for writing. This naturally follows, given that bytearray is mostly just a mutable variant of bytes. In fact, most APIs in Python 3.0 that accept a bytes also allow a bytearray:

# Bytearrays work too

>>> BA = bytearray(b'\x01\x02\x03')
>>>
>>> open('temp', 'wb').write(BA)
3

>>> open('temp', 'r').read()
'\x01\x02\x03'

>>> open('temp', 'rb').read()
b'\x01\x02\x03'
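
A bytearray can also serve as a reusable input buffer. As a sketch, binary file objects provide a readinto() method that fills a preallocated bytearray in place instead of creating a new bytes; the file location here is a throwaway chosen for the example:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'temp')

with open(path, 'wb') as f:
    f.write(bytearray(b'\x01\x02\x03'))    # bytearray accepted for output

buf = bytearray(3)                         # preallocated, zero-filled buffer
with open(path, 'rb') as f:
    count = f.readinto(buf)                # fills buf in place, returns count
```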

Finally, notice that you can't get away with violating Python's str/bytes distinction when it comes to files; in the following we get errors (shortened here) if we try to write a bytes to a text file, or a str to a binary file. Although it is often possible to convert between these two types (as described earlier in this article), you will usually want to stick to str for text data and bytes for binary data:

# Types are not flexible for file content

>>> open('temp', 'w').write('abc\n')             # auto encodes str to bytes
4
>>> open('temp', 'w').write(b'abc\n')            # but bytes is not decoded text
TypeError: can't write bytes to text stream

>>> open('temp', 'wb').write(b'abc\n')           # writes raw bytes
4
>>> open('temp', 'wb').write('abc\n')            # but str is not raw bytes
TypeError: can't write str to binary stream

This may seem strict, but Python cannot guess how you may wish to interpret the contents of a bytes or str when used in the opposite context, and wisely refuses to decode or encode content implicitly. Moreover, because str and bytes operation sets largely intersect, the choice of types won't be much of a dilemma for most programs. See earlier in this article for more on bytes operations and mixed-type constraints, and the struct module coverage ahead for another binary-file example.

Using Unicode Text Files

Update: this draft article originally stopped short here before demonstrating open() encodings for text files. In brief, text files allow a specific Unicode encoding-scheme name to be passed in with an encoding argument, and use it to automatically decode and encode text on input and output, respectively.

Opening a file for input with encoding='utf8', for instance, assumes the file's content is encoded per UTF-8, and automatically decodes its data to str code points when read by the program. Similarly, opening a file for output with encoding='latin-1' encodes str code points to their Latin-1 format as they are written to the file. File transfers raise exceptions whenever a requested format doesn't work.

For example, in Python 3.X:

>>> file = open('uni.txt', 'w', encoding='utf8')           # auto encodes to bytes
>>> file.write('spÄm')
4
>>> file.close()

>>> text = open('uni.txt', 'r', encoding='utf8').read()    # auto decodes to str
>>> text
'spÄm'

>>> raw = open('uni.txt', 'rb').read()                     # no decoding applied
>>> raw
b'sp\xc3\x84m'

>>> text = open('uni.txt', 'r', encoding='ascii').read()   # Ä's utf8 bytes aren't ascii
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

>>> import codecs
>>> codecs.open('uni.txt', 'r', encoding='utf8').read()    # 2.X's flavor in 3.X
'spÄm'

In the absence of encoding, text files still encode and decode per a platform- and version-specific default noted earlier. Although you may not notice these translations if your default and files agree, you generally should not rely on the default; it makes your programs dependent on the context in which their files were created, and can lead to portability issues. A program run on a UTF-8 default platform, for instance, may have trouble using a file made under a Latin-1 default (an interoperability pitfall also noted here).
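
The mismatch described above is easy to reproduce. In this sketch, a file is written per Latin-1 and then read back under three settings: a UTF-8 assumption, UTF-8 with the errors='replace' policy of open(), and the matching encoding. The file location is a throwaway chosen for the example:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'uni.txt')

with open(path, 'w', encoding='latin-1') as f:    # file made under Latin-1
    f.write('sp\xc4m')                            # 'sp', A-umlaut, 'm'

try:
    with open(path, 'r', encoding='utf-8') as f:  # reader assumes UTF-8
        f.read()
    outcome = 'decoded'
except UnicodeDecodeError:
    outcome = 'failed'

with open(path, 'r', encoding='utf-8', errors='replace') as f:
    lossy = f.read()                              # bad byte -> U+FFFD

with open(path, 'r', encoding='latin-1') as f:    # matching encoding
    exact = f.read()
```

The replace policy avoids the exception but loses data, so passing the correct encoding explicitly is generally the better fix.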

For more coverage and examples of these topics, try this site's posts here and here, and see this article's later version in this book.

Other String Tool Changes in 3.0

In closing, it's worth noting that many of the popular string-processing tools in Python's standard library have also been revamped for the new str/bytes dichotomy. We won't cover any of these application-focused tools in much detail in this core-language book, but as a sample, here's a quick look at two of the major tools impacted.

The re Pattern-Matching Module

Python's re pattern-matching module has been generalized to work on objects of any string type in 3.0: str, bytes, and bytearray. Note that you can't mix str and bytes types in its calls' arguments, though:

>>> import re
>>> S = 'Bugger all down here on earth!'
>>> B = b'Bugger all down here on earth!'
>>>
>>> re.match('(.*) down (.*) on (.*)', S).groups()
('Bugger all', 'here', 'earth!')
>>>
>>> re.match(b'(.*) down (.*) on (.*)', B).groups()
(b'Bugger all', b'here', b'earth!')


>>> re.match('(.*) down (.*) on (.*)', B).groups()
...
TypeError: can't use a string pattern on a bytes-like object
>>>
>>> re.match(b'(.*) down (.*) on (.*)', S).groups()
...
TypeError: can't use a bytes pattern on a string-like object


>>> re.match(b'(.*) down (.*) on (.*)', bytearray(B)).groups()
(bytearray(b'Bugger all'), bytearray(b'here'), bytearray(b'earth!'))
>>>
>>> re.match('(.*) down (.*) on (.*)', bytearray(B)).groups()
...
TypeError: can't use a string pattern on a bytes-like object
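
When the pattern and subject types don't match, convert one side manually. The following sketch shows both directions; the ASCII encoding is an assumption that fits this all-ASCII sample, not a general rule:

```python
import re

pattern = '(.*) down (.*) on (.*)'
B = b'Bugger all down here on earth!'

# Option 1: decode the data and keep the str pattern
as_text = re.match(pattern, B.decode('ascii')).groups()

# Option 2: encode the pattern and keep the bytes data
as_bytes = re.match(pattern.encode('ascii'), B).groups()
```

Which option fits depends on whether the data is really text: decode it if so, and otherwise keep both pattern and data as bytes.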

The struct Binary-Data Module

Along similar lines, the Python struct module, used to create and extract packed binary data from strings, works in 3.0 as it does in 2.X, but in 3.X operates on bytes and bytearray only, not str (which makes sense, given that it's intended for processing binary data, not decoded text):

>>> import struct
>>> B = struct.pack('>i4sh', 7, b'spam', 8)     # 's' requires bytes as of 3.2
>>> B                                           # (it encodes str as utf8 in 3.0/3.1)
b'\x00\x00\x00\x07spam\x00\x08'
>>>
>>> vals = struct.unpack('>i4sh', B)            # packed data is bytes, not str 
>>> vals
(7, b'spam', 8)
>>>
>>> vals = struct.unpack('>i4sh', B.decode())
TypeError: 'str' does not have the buffer interface

Apart from the new syntax for bytes, creating and reading binary files works almost the same in 3.0 as it does in 2.X (and as described briefly earlier in this article, and in more detail in this book):

C:\misc>c:\python30\python.exe
>>> F = open('data.bin', 'wb')                  # open binary output file
>>> import struct
>>> data = struct.pack('>i4sh', 7, b'spam', 8)  # create packed binary data
>>> data                                        # bytes in 3.0, not str
b'\x00\x00\x00\x07spam\x00\x08'
>>> F.write(data)                               # write to the file
10
>>> F.close()

>>> F = open('data.bin', 'rb')                  # open binary input file
>>> data = F.read()                             # read bytes
>>> data
b'\x00\x00\x00\x07spam\x00\x08'
>>> values = struct.unpack('>i4sh', data)       # extract packed binary data
>>> values                                      # back to Python objects
(7, b'spam', 8)

Update: Python 3.2 changed the struct.pack() call to require a bytes (or bytearray) object for its "s" conversion code; using a str is an error. In 3.0 and 3.1 a str is allowed and automatically encoded to bytes as UTF-8 text (arguably too large an assumption, but 3.2 changed working and documented behavior). To placate 3.2 and later, the examples above simply use a b'...' literal instead of '...'; in your code, encode as needed (e.g., mystr.encode('utf8')).
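
As a sketch of the manual conversion this requires when your data starts out as text, the following encodes a str before packing and decodes the extracted field afterward. The variable names are arbitrary, and UTF-8 is an assumed encoding choice:

```python
import struct

fmt = '>i4sh'                                   # big-endian: int, 4 bytes, short
name = 'spam'

data = struct.pack(fmt, 7, name.encode('utf8'), 8)   # encode str for 's'
num, raw_name, count = struct.unpack(fmt, data)
text_name = raw_name.decode('utf8')                  # decode field back to str

size = struct.calcsize(fmt)                     # bytes in one packed record
```

Keep in mind that the 4 in "4s" counts encoded bytes, not characters, so non-ASCII text may need a wider field than its character count suggests.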

For more on re, struct, and other string-related modules impacted by 3.0's new Unicode support, consult the Python library manual; this article's published version, which also covers pickle object serialization and XML parsing; or application-focused follow-up books such as Programming Python.

For More Reading

An edited and enhanced version of this page's material appeared in this book, and was later expanded in this edition; see the latter for additional coverage of strings in both Python 3.X and 2.X.

For related resources online at this site, you may also be interested in a 2016 review of Python 3.5 bytes-string formatting; 2018 program-usage notes regarding Unicode BOMs, defaults, and encodings; and the 2022 coverage of normalization.

For additional reading, try the other articles popular at learning-python.com; these and more are available on the blog page.



©M.Lutz