13. Text Processing








     String objects

     Splitting strings

     Regular expressions

     Parsing languages

     XML parsing



String objects: review



     Handle basic text processing tasks

     Operations: slicing, concatenation, indexing, formatting, etc.

     String methods: searching, replacement, splitting, etc.

     Built-in string functions: ord(C) → ASCII code in 2.X, Unicode code point in 3.X

     Running code strings: ‘eval’, ‘exec’ (2.X+3.X), ‘execfile’ (2.X)

     Unicode (possibly-wide) strings supported in Python 2.0+

          u'xxx' in 2.X, 'xxx' in 3.X; encoding/decoding in memory or on I/O (see final unit)




>>> text = "Hello world"

>>> text = 'M' + text[1:6] + 'World'

>>> text

'Mello World'

>>> exec 'print "J" + text[1:]'

Jello World
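The session above uses the 2.X exec statement. A minimal 3.X sketch of the same steps follows, assuming only built-ins; the names value, code, char, encoded, and decoded are illustrative and not part of the original session.

```python
# 3.X form of the session above: exec is a built-in function, not a statement
text = "Hello world"
text = 'M' + text[1:6] + 'World'        # slicing plus concatenation
print(text)                             # Mello World

exec('print("J" + text[1:])')           # runs a code string: Jello World
value = eval('len(text) * 2')           # eval returns an expression's result

code = ord('M')                         # Unicode code point in 3.X
char = chr(code)                        # back to a one-character string
encoded = 'sp\xe4m'.encode('utf-8')     # str -> bytes
decoded = encoded.decode('utf-8')       # bytes -> str
```

Code strings passed to eval and exec run with full privileges, so they are best reserved for trusted text.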



Splitting and joining strings



     str.split returns a list of columns, splitting around whitespace by default

     str.split allows arbitrary delimiters to be used

     str.join puts string lists back together

     eval converts column strings to Python objects
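The four points above can be sketched in a few lines. The sample strings and variable names are made up for illustration.

```python
line = 'aaa bbb  ccc\n'

cols = line.split()                   # default: split around runs of whitespace
fields = 'a,b,,c'.split(',')          # explicit delimiter keeps empty fields
joined = '-'.join(cols)               # glue a list of strings back together
numbers = [eval(s) for s in '1 2.5 3+4'.split()]   # strings -> Python objects

print(cols, fields, joined, numbers)
```

Because eval runs any expression it is given, int() or float() is a safer converter when the columns are known to hold plain numbers.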





Example: summing columns in a file



# see also: newer column summer code at end of Basic Statements unit



file: summer.py

import sys


def summer(numCols, fileName):

    sums = [0] * numCols

    for line in open(fileName, 'r'):

        cols = line.split()

        for i in range(numCols):

            sums[i] += eval(cols[i])      # any expression will work!

    return sums


if __name__ == '__main__':

    print(summer(eval(sys.argv[1]), sys.argv[2]))    # print(...) runs in 2.X and 3.X




Example: column sum alternatives
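
Two possible alternatives to the eval-based summer, sketched under the assumption that every column holds a plain number. The function names summer_float and summer_zip and the in-memory sample lines are hypothetical, chosen only for this illustration.

```python
def summer_float(num_cols, lines):
    """Sum the first num_cols whitespace-separated columns of each line."""
    sums = [0.0] * num_cols
    for line in lines:
        cols = line.split()
        for i in range(num_cols):
            sums[i] += float(cols[i])        # float() instead of eval()
    return sums

def summer_zip(lines):
    """Sum every column by transposing the rows with zip."""
    rows = [[float(col) for col in line.split()]
            for line in lines if line.strip()]
    return [sum(column) for column in zip(*rows)]

sample = ['1 2 3\n', '4 5 6\n', '7 8 9\n']   # stands in for an open file

print(summer_float(2, sample))
print(summer_zip(sample))
```

Both functions accept any iterable of lines, so an open file object can be passed in place of the sample list.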






Example: replacing substrings


file: replace.py

# manual global substitution: same as str.replace(old, new)


def replace(text, old, new):          # names avoid shadowing built-in str, list

    parts = text.split(old)       # XoldY  -> [X, Y]

    return new.join(parts)        # [X, Y] -> XnewY
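
A quick check that the split/join version agrees with the built-in method. The function is restated so the block runs on its own, and the sample string is made up.

```python
def replace(text, old, new):
    parts = text.split(old)           # XoldY  -> [X, Y]
    return new.join(parts)            # [X, Y] -> XnewY

original = 'spam and eggs and ham'
manual = replace(original, ' and ', ', ')
builtin = original.replace(' and ', ', ')

print(manual)
print(manual == builtin)
```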





Example: analyzing data files



      Collect all entries for keys on right

      Data file contains “histogram” data




% cat histo1.txt

1       one

2       one

3       two

7       three

8       two

10      one

14      three

19      three

20      three

30      three





% cat histo.py

#!/usr/bin/env python

import sys


entries = {}

for line in open(sys.argv[1]):

    [left, right] = line.split()

    try:

        entries[right].append(left)        # or use has_key, or get

    except KeyError:                       # e[r] = e.get(r, []) + [l]

        entries[right] = [left]


for (right, lefts) in entries.items():

    print("%04d '%s'\titems => %s" % (len(lefts), right, lefts))





% histo.py histo1.txt

0003 'one'      items => ['1', '2', '10']

0005 'three'    items => ['7', '14', '19', '20', '30']

0002 'two'      items => ['3', '8']
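
The comments in histo.py mention alternatives to the try/except form. Two of them are sketched here, using dict.setdefault and collections.defaultdict on a shortened in-memory copy of the data; the variable names data and grouped are illustrative.

```python
from collections import defaultdict

data = """\
1       one
2       one
3       two
7       three
"""

entries = {}
for line in data.splitlines():
    left, right = line.split()
    entries.setdefault(right, []).append(left)    # create the list on first use

grouped = defaultdict(list)                       # missing keys start as []
for line in data.splitlines():
    left, right = line.split()
    grouped[right].append(left)

for right in sorted(entries):                     # sorted for a stable order
    print("%04d '%s'\titems => %s" % (len(entries[right]), right, entries[right]))
```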





Regular expressions



     For matching patterns in strings

     Matched substrings may be extracted after a match as “groups”

     Compiled regular expressions are first-class objects: optimization

     Now supported by the ‘re’ standard module: Perl5-style patterns

     Supports non-greedy operators, character classes, etc.

     Older options: the “regex”, ‘regsub’ modules: emacs/awk/grep patterns
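
A small sketch of greedy versus non-greedy operators and a character class, using the re module on made-up sample text.

```python
import re

line = '<b>bold</b> and <i>italic</i>'

greedy = re.findall('<(.*)>', line)            # .* grabs as much as possible
nongreedy = re.findall('<(.*?)>', line)        # .*? grabs as little as possible
digits = re.findall('[0-9]+', 'a1 b22 c333')   # character class, repeated

print(greedy)
print(nongreedy)
print(digits)
```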




Basic interface



>>> import re


>>> mobj = re.match('Hello(.*)world', 'Hello---spam---world')

>>> mobj.group(1)

'---spam---'


>>> pobj = re.compile('Hello[ \t]*(.*)')

>>> mobj = pobj.match('Hello    SPAM!')

>>> mobj.group(1)

'SPAM!'





Example: searching C files


Finds #include and #define lines in a C file




     X+        repeat X one or more times

     X*        repeat X zero or more times

     [abc]     any of a or b or c

     (X)       keep substring that matches X (“group”)

     ^X        match X at start of line




     re.compile          precompiles expression into pattern object

     patternobj.match    returns match object, or None if match fails

     matchobj.group      returns matched substring[i] (pattern part in parens)

     matchobj.span       returns start/stop indexes of matched substring[i]

     also has methods for replacement and findall, nongreedy match operators,…
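
The calls listed above, plus search, findall, and sub, can be tried on a small sample. The pattern and the 'name=number' text are invented for this sketch.

```python
import re

patt = re.compile(r'(\w+)=(\d+)')        # two groups: name and number
line = 'width=80 height=24'

first = patt.match(line)                 # match() anchors at the string's start
name, value = first.group(1), first.group(2)
start, stop = first.span(2)              # indexes of the second group

pairs = patt.findall(line)               # one tuple of groups per match
swapped = patt.sub(r'\2=\1', line)       # replacement with group references
missing = patt.match('  width=80')       # None: match() does not skip ahead
found = patt.search('  width=80')        # search() scans forward for a match

print(name, value, line[start:stop])
print(pairs, swapped, missing, found.group(0))
```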






file: cheader.py


import sys, re


pattDefine = re.compile(                               # precompile to pattobj

    r'^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')  # "# define xxx yyy..."


pattInclude = re.compile(

    r'^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/\.]+)')    # "# include <xxx>..."


def scan(file):

    count = 0

    for line in file:                                # scan line-by-line

        count += 1

        matchobj = pattDefine.match(line)            # None if match fails

        if matchobj:

            name = matchobj.group(1)                 # substrings for (...) parts

            body = matchobj.group(2)

            print(count, 'defined', name, '=', body.strip())

        else:

            matchobj = pattInclude.match(line)

            if matchobj:

                start, stop = matchobj.span(1)       # start/stop indexes of (...)

                filename = line[start:stop]          # slice out of line

                print(count, 'include', filename)    # same as matchobj.group(1)


if len(sys.argv) == 1:

    scan(sys.stdin)                    # no args: read stdin

else:

    scan(open(sys.argv[1], 'r'))       # arg: input file name
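
To try the two patterns without a real C file, the scan logic can be run over an in-memory file object. This is a sketch only: the sample C lines, the io.StringIO stand-in, and the results list are assumptions made for illustration, and the patterns are written as raw strings.

```python
import io
import re

patt_define = re.compile(r'^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')
patt_include = re.compile(r'^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/\.]+)')

source = io.StringIO('#include <stdio.h>\n'
                     '# define MAXLEN 100\n'
                     'int main() { return 0; }\n'
                     '#include "local/defs.h"\n')

results = []
for count, line in enumerate(source, start=1):      # count lines from 1
    matchobj = patt_define.match(line)
    if matchobj:
        results.append((count, 'defined',
                        matchobj.group(1), matchobj.group(2).strip()))
    else:
        matchobj = patt_include.match(line)
        if matchobj:
            results.append((count, 'include', matchobj.group(1)))

for item in results:
    print(item)
```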






Parsing languages



    For more demanding languages: regular expressions have no “memory”, so they cannot match arbitrarily nested constructs

    Recursive descent parsers: see YAPPS parser generator

    Parser generators: ‘bison’ wrapper, PyParsing, SPARK, kwParsing, etc. (see the web)

    NLTK: Natural Language Toolkit for Python, AI and statistical tools
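
To make the recursive descent idea concrete, here is a minimal hand-coded sketch for integer arithmetic with nested parentheses. It is an illustration only and is not produced by YAPPS or any of the tools named above; the grammar and the names tokenize, Parser, and evaluate are assumptions of this example.

```python
import re

TOKEN = re.compile(r'\s*(\d+|[-+*/()])')    # a number or a single operator

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError('bad character at %d' % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):                     # expr := term (('+'|'-') term)*
        value = self.term()
        while self.peek() in ('+', '-'):
            if self.advance() == '+':
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):                     # term := factor (('*'|'/') factor)*
        value = self.factor()
        while self.peek() in ('*', '/'):
            if self.advance() == '*':
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):                   # factor := NUMBER | '(' expr ')'
        tok = self.advance()
        if tok == '(':
            value = self.expr()         # recursion handles the nesting
            if self.advance() != ')':
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok.isdigit():
            raise SyntaxError('expected number, got %r' % tok)
        return int(tok)

def evaluate(text):
    parser = Parser(text)
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError('unexpected %r' % parser.peek())
    return value

print(evaluate('(2 + 3) * 4'))
```

Each grammar rule becomes one method, and the call stack supplies the “memory” that a single regular expression lacks.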






XML Parsing






    Python Standard Library Support:

      SAX parsers: state-machines with class method callbacks

      DOM parsers: document object tree, with standard API for traversal

      ElementTree: Python-specific XML parser and generator
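
ElementTree's generator side is easy to overlook. A minimal 3.X sketch follows, building a small document and parsing it back; the root element name 'books' and the variable names are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

root = ET.Element('books')                        # hypothetical root element
for name in ['Learning Python', 'Python Pocket Reference']:
    ET.SubElement(root, 'title').text = name      # nested <title> elements

xmltext = ET.tostring(root, encoding='unicode')   # serialize to a str (3.X)
reparsed = ET.fromstring(xmltext)                 # parse the text back
titles = [elem.text for elem in reparsed.findall('title')]

print(xmltext)
print(titles)
```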



Example: Parsing XML 4 ways with patterns and basic XML tools





Given the following (narcissistic!) XML text file, mybooks.xml:




    <title>Learning Python</title>

    <title>Pogramming Python</title>

    <title>Python Pocket Reference</title>

    <publisher>O'Reilly Media</publisher>






Run a script to extract and display content of all the nested “title” tags, as follows:


Learning Python

Pogramming Python

Python Pocket Reference





PATTERNS: Basic pattern matching on the file’s text, though this can be inaccurate if the text is unpredictable. findall locates all places where a pattern matches in the string and returns a list of the matched substrings corresponding to parenthesized pattern groups, or a list of tuples of such substrings when the pattern has multiple groups


# File patternparse.py


import re

text  = open('mybooks.xml').read()

found = re.findall('<title>(.*)</title>', text)

for title in found: print(title)
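
The inaccuracy mentioned above shows up as soon as the text layout changes. A short sketch on made-up strings: two titles on one line defeat the greedy pattern, and a title spanning two lines is missed unless re.DOTALL is given.

```python
import re

oneline = '<title>Learning Python</title><title>Python Pocket Reference</title>'
twolines = '<title>Two\nlines</title>'

greedy = re.findall('<title>(.*)</title>', oneline)       # runs to the last </title>
nongreedy = re.findall('<title>(.*?)</title>', oneline)   # stops at the first
missed = re.findall('<title>(.*?)</title>', twolines)     # '.' skips newlines
dotall = re.findall('<title>(.*?)</title>', twolines, re.DOTALL)

print(greedy)
print(nongreedy)
print(missed, dotall)
```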





DOM PARSING: Perform complete XML parsing with the standard library’s DOM parsing support. DOM parses XML text into a tree of objects, and provides an interface for navigating the tree to extract tag attributes and values. The DOM itself is a formal specification, independent of Python


# File domparse.py


from xml.dom.minidom import parse, Node

xmltree = parse('mybooks.xml')

for node1 in xmltree.getElementsByTagName('title'):

    for node2 in node1.childNodes:

        if node2.nodeType == Node.TEXT_NODE:

            print(node2.data)





SAX PARSING: Alternatively, Python’s standard library also supports SAX parsing for XML. Under the SAX model, a class’s methods receive callbacks as a parse progresses, and use state information to keep track of where they are in the document and collect its data


# File saxparse.py


import xml.sax.handler

class BookHandler(xml.sax.handler.ContentHandler):

    def __init__(self):

        self.inTitle = False

    def startElement(self, name, attributes):

        if name == 'title':

            self.inTitle = True

    def characters(self, data):

        if self.inTitle:

            print(data)

    def endElement(self, name):

        if name == "title":

            self.inTitle = False


import xml.sax

parser = xml.sax.make_parser()

handler = BookHandler()

parser.setContentHandler(handler)

parser.parse('mybooks.xml')






ELEMENTTREE PARSING: Finally, the etree package of the standard library can often achieve the same effects as XML DOM parsers, but with less code. It’s a Python-specific way to both parse and generate XML text; after a parse, its API gives access to components of the document


# File etreeparse.py


from xml.etree.ElementTree import parse

tree = parse('mybooks.xml')

for E in tree.findall('title'):

    print(E.text)





The output of all four alternatives is the same under 2.6, 2.7, and 3.X


C:\misc> c:\python26\python domparse.py

Learning Python

Pogramming Python

Python Pocket Reference


C:\misc> c:\python30\python domparse.py

Learning Python

Pogramming Python

Python Pocket Reference
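
The claim that all four techniques agree can also be checked on an in-memory document, with no file needed. This is a 3.X sketch under stated assumptions: the books root element, the shortened title list, and the function names by_pattern, by_dom, by_sax, and by_etree are all made up for the comparison.

```python
import re
import xml.sax
import xml.sax.handler
import xml.etree.ElementTree as ET
from xml.dom.minidom import parseString, Node

# The <books> root element is an assumption made for this sketch
XMLTEXT = """<books>
    <title>Learning Python</title>
    <title>Python Pocket Reference</title>
    <publisher>O'Reilly Media</publisher>
</books>"""

def by_pattern(text):
    return re.findall('<title>(.*)</title>', text)

def by_dom(text):
    found = []
    for node1 in parseString(text).getElementsByTagName('title'):
        for node2 in node1.childNodes:
            if node2.nodeType == Node.TEXT_NODE:
                found.append(node2.data)
    return found

class TitleHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.inTitle = False
        self.found = []
    def startElement(self, name, attributes):
        if name == 'title':
            self.inTitle = True
            self.found.append('')
    def characters(self, data):
        if self.inTitle:
            self.found[-1] += data      # text may arrive in several chunks
    def endElement(self, name):
        if name == 'title':
            self.inTitle = False

def by_sax(text):
    handler = TitleHandler()
    xml.sax.parseString(text.encode('utf-8'), handler)
    return handler.found

def by_etree(text):
    return [elem.text for elem in ET.fromstring(text).findall('title')]

results = [func(XMLTEXT) for func in (by_pattern, by_dom, by_sax, by_etree)]
for titles in results:
    print(titles)
```

Collecting the titles in lists instead of printing them directly makes the four results easy to compare.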



    Other XML-related Support

      See Extras\Code\XML on class CD for additional DOM/SAX examples

      XPath,… 3rd-party extensions (see xml-sig page)

      O’Reilly book: Python & XML

      XML-RPC: xmlrpclib in Python std lib (xmlrpc.client in 3.X) – XML coded data over HTTP

      SOAP Protocol: PySoap, Soapy 3rd-party extensions
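
The “XML coded data” part of XML-RPC can be seen without a server by using the marshalling functions in 3.X's xmlrpc.client (xmlrpclib in 2.X). A minimal sketch; the method name 'repeat' and its arguments are hypothetical.

```python
import xmlrpc.client                 # named xmlrpclib in 2.X

# Encode a call to a hypothetical remote method, then decode it again
request = xmlrpc.client.dumps((2, 'spam'), methodname='repeat')
params, method = xmlrpc.client.loads(request)

print(request)                       # the XML text that would travel over HTTP
print(method, params)
```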





    See also: JSON example in Database unit, the successor to XML?






Lab Session 10

