13. Text Processing








     String objects

     Splitting strings

     Regular expressions

     Parsing languages

     XML parsing



String objects: review



     Handle basic text processing tasks

     Operations: slicing, concatenation, indexing, formatting, etc.

     String methods: searching, replacement, splitting, etc.

     Built-in string functions: ord(C) → ASCII code in 2.X, Unicode code point in 3.X

     Running code strings: ‘eval’, ‘exec’ (2.X+3.X), ‘execfile’ (2.X)

     Unicode (possibly-wide) strings supported in Python 2.0+

u'xxx' in 2.X, 'xxx' in 3.X; encoding/decoding occurs in memory or on I/O (see final unit)




>>> text = "Hello world"

>>> text = 'M' + text[1:6] + 'World'

>>> text

'Mello World'

>>> exec 'print "J" + text[1:]'

Jello World
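
The session above uses the 2.X exec statement. As a rough 3.X sketch of the same operations (the namespace dict and the sample strings here are illustrative assumptions, not part of the original session):

```python
text = "Hello world"
text = "M" + text[1:6] + "World"          # slicing and concatenation
print(text)                               # Mello World

# In 3.X, exec is a function rather than a statement; running the code
# string in an explicit namespace dict keeps its names contained.
namespace = {"text": text}
exec('result = "J" + text[1:]', namespace)
print(namespace["result"])                # Jello World

# eval runs an expression string and returns its value.
length = eval("len(text)", {"text": text})
print(length)                             # 11

# Formatting, searching, and replacement.
line = "%s has %d characters" % (text, length)
print(line.find("has"), line.replace("Mello", "Jello"))

# In 3.X every str is Unicode: encoding yields bytes, decoding gets text back.
encoded = "sp\xe4m".encode("utf-8")
decoded = encoded.decode("utf-8")
print(ord("A"), len(encoded), decoded == "sp\xe4m")
```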



Splitting and joining strings



     str.split returns a list of columns, split around whitespace by default

     str.split also allows arbitrary delimiters to be used

     str.join puts string lists back together

     eval converts column strings to Python objects
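
A short sketch of the calls listed above, using made-up sample strings. The last step uses float in place of eval, which is one option when columns are known to be numbers:

```python
# Split a line into columns around whitespace (the default).
line = "1   2.5   spam"
cols = line.split()
print(cols)                      # ['1', '2.5', 'spam']

# Split around an explicit delimiter; empty fields are kept.
fields = "a,b,,c".split(",")
print(fields)                    # ['a', 'b', '', 'c']

# join puts a list of strings back together around a separator.
rejoined = "-".join(cols)
print(rejoined)                  # 1-2.5-spam

# Convert the numeric column strings to Python objects.
numbers = [float(c) for c in cols[:2]]
print(numbers)                   # [1.0, 2.5]
```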





Example: summing columns in a file



# see also: newer column summer code at end of Basic Statements unit



file: summer.py

import sys


def summer(numCols, fileName):

    sums = [0] * numCols

    for line in open(fileName, 'r'):

        cols = line.split()

        for i in range(numCols):

            sums[i] += eval(cols[i])      # any expression will work!

    return sums


if __name__ == '__main__':

    print summer(eval(sys.argv[1]), sys.argv[2])




Example: column sum alternatives
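
One possible pair of alternatives, sketched under the assumption that every column holds a plain number: float replaces eval, and zip transposes rows into columns. The names summer_zip and sample are hypothetical, and the functions take any iterable of lines (an open file works too).

```python
def summer(num_cols, lines):
    """Sum the first num_cols whitespace-separated columns of each line."""
    sums = [0.0] * num_cols
    for line in lines:
        cols = line.split()
        for i in range(num_cols):
            sums[i] += float(cols[i])    # float() accepts numbers only
    return sums

def summer_zip(lines):
    """Sum every column by transposing the parsed rows with zip."""
    rows = [[float(c) for c in line.split()] for line in lines if line.strip()]
    return [sum(col) for col in zip(*rows)]

sample = ["1 2 3", "4 5 6", "7 8 9"]
print(summer(2, sample))         # [12.0, 15.0]
print(summer_zip(sample))        # [12.0, 15.0, 18.0]
```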






Example: replacing substrings


file: replace.py

# manual global substitution: same as str.replace(old, new)


def replace(str, old, new):

    list = str.split(old)         # XoldY  -> [X, Y]

    return new.join(list)         # [X, Y] -> XnewY
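
A quick check of the split/join idea against the built-in method. This sketch renames the arguments to avoid shadowing the built-in str and list names; the sample text is made up.

```python
def replace(text, old, new):
    parts = text.split(old)      # XoldY  -> [X, Y]
    return new.join(parts)       # [X, Y] -> XnewY

sample = "spam and spam and eggs"
result = replace(sample, "spam", "ham")
print(result)                                    # ham and ham and eggs
print(result == sample.replace("spam", "ham"))   # True
```

One difference worth knowing: an empty old string makes split raise ValueError, while str.replace accepts it.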





Example: analyzing data files



      Collect all entries for keys on right

      Data file contains “histogram” data




% cat histo1.txt

1       one

2       one

3       two

7       three

8       two

10      one

14      three

19      three

20      three

30      three





% cat histo.py

#!/usr/bin/env python

import sys


entries = {}

for line in open(sys.argv[1]):

    [left, right] = line.split()

    try:

        entries[right].append(left)        # or use has_key, or get

    except KeyError:                       # e[r] = e.get(r, []) + [l]

        entries[right] = [left]


for (right, lefts) in entries.items():

  print "%04d '%s'\titems => %s" % (len(lefts), right,lefts)





% histo.py histo1.txt

0003 'one'      items => ['1', '2', '10']

0005 'three'    items => ['7', '14', '19', '20', '30']

0002 'two'      items => ['3', '8']
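
The try/except in histo.py can also be avoided altogether. A minimal sketch using collections.defaultdict, with a few of the data lines inlined as a list instead of read from a file (an assumption made to keep the example self-contained):

```python
from collections import defaultdict

lines = ["1 one", "2 one", "3 two", "7 three", "8 two", "10 one"]

entries = defaultdict(list)          # missing keys start out as empty lists
for line in lines:
    left, right = line.split()
    entries[right].append(left)

for right, lefts in sorted(entries.items()):
    print("%04d '%s'\titems => %s" % (len(lefts), right, lefts))
```

With a plain dict, entries.setdefault(right, []).append(left) has the same effect.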





Regular expressions



     For matching patterns in strings

     Matched substrings may be extracted after a match as “groups”

     Compiled regular expressions are first-class objects: optimization

     Now supported by the ‘re’ standard module: Perl5-style patterns

     Supports non-greedy operators, character classes, etc.

     Older options: the ‘regex’ and ‘regsub’ modules: emacs/awk/grep patterns




Basic interface



>>> import re


>>> mobj = re.match('Hello(.*)world', 'Hello---spam---world')

>>> mobj.group(1)

'---spam---'



>>> pobj = re.compile('Hello[ \t]*(.*)')

>>> mobj = pobj.match('Hello    SPAM!')

>>> mobj.group(1)

'SPAM!'
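
A few more calls in the same interface, as a sketch with made-up strings: match succeeds only at the start of the string, search scans ahead, and a failed match returns None rather than raising.

```python
import re

# match anchors at the start of the string; search looks anywhere in it.
at_start = re.match('world', 'Hello world')
anywhere = re.search('world', 'Hello world')
print(at_start)                                  # None
print(anywhere.group(0), anywhere.start())       # world 6

# Several groups can be fetched at once with groups().
mobj = re.match(r'(\w+)[ \t]*=[ \t]*(\d+)', 'count = 42')
print(mobj.groups())                             # ('count', '42')
print(mobj.group(0))                             # count = 42

# Test the result before calling group: a failed match is None.
failed = re.match('Hello(.*)world', 'Goodbye world')
print(failed)                                    # None
```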






Example: searching C files


Finds #include and #define lines in a C file




     X+       repeat X one or more times

     X*       repeat X zero or more times

     [abc]    any of a or b or c

     (X)      keep substring that matches X (“group”)

     ^X       match X at start of line




     re.compile          precompiles expression into pattern object

     patternobj.match    returns match object, or None if match fails

     matchobj.group      returns matched substring[i] (pattern part in parens)

     matchobj.span       returns start/stop indexes of matched substring[i]

     also has methods for replacement and findall, nongreedy match operators,…
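
The calls above, tried on a single made-up line. The pattern is a shortened form of the define pattern and uses \w in place of the explicit character class:

```python
import re

patt = re.compile(r'^#[ \t]*define[ \t]+(\w+)[ \t]*(.*)')   # pattern object
line = '#define SPAM 42'

mobj = patt.match(line)                  # match object, or None
name = mobj.group(1)                     # text matched by first (...)
body = mobj.group(2)                     # text matched by second (...)
start, stop = mobj.span(1)               # indexes of first (...) in line
print(name, body, (start, stop))         # SPAM 42 (8, 12)
print(line[start:stop] == name)          # True

# findall, greedy versus non-greedy repetition, and replacement.
greedy = re.findall(r'<(.+)>', '<a> <b>')
lazy = re.findall(r'<(.+?)>', '<a> <b>')
squeezed = re.sub(r'[ \t]+', ' ', 'a \t b')
print(greedy, lazy, repr(squeezed))      # ['a> <b'] ['a', 'b'] 'a b'
```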






file: cheader.py


import sys, re


pattDefine = re.compile(                               # precompile to pattobj

    r'^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')  # "# define xxx yyy..."


pattInclude = re.compile(

    r'^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/\.]+)')    # "# include <xxx>..."


def scan(file):

    count = 0

    for line in file:                                # scan line-by-line

        count += 1

        matchobj = pattDefine.match(line)            # None if match fails

        if matchobj:

            name = matchobj.group(1)                 # substrings for (...) parts

            body = matchobj.group(2)

            print(count, 'defined', name, '=', body.strip())

        else:

            matchobj = pattInclude.match(line)

            if matchobj:

                start, stop = matchobj.span(1)       # start/stop indexes of (...)

                filename = line[start:stop]          # slice out of line

                print(count, 'include', filename)    # same as matchobj.group(1)


if len(sys.argv) == 1:

    scan(sys.stdin)                    # no args: read stdin

else:

    scan(open(sys.argv[1], 'r'))       # arg: input file name






Parsing languages



    For more demanding languages: regular expressions have no “memory”

    Recursive descent parsers: see YAPPS parser generator

    Parser generators: ‘bison’ wrapper, PyParsing, SPARK, kwParsing, etc. (see the web)

    NLTK: Natural Language Toolkit for Python, AI and statistical tools
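
To illustrate why nested constructs call for more than patterns, here is a minimal hand-coded recursive descent sketch for arithmetic expressions. It is not generated by YAPPS or any of the tools above; the grammar, class, and function names are assumptions made for this example. Patterns handle the tokens, and the recursion handles nesting.

```python
import re

def tokenize(text):
    # Numbers, operators, and parentheses; any other character is skipped.
    return re.findall(r'\d+\.?\d*|[-+*/()]', text)

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):                      # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ('+', '-'):
            if self.next() == '+':
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):                      # term := factor (('*' | '/') factor)*
        value = self.factor()
        while self.peek() in ('*', '/'):
            if self.next() == '*':
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):                    # factor := NUMBER | '(' expr ')'
        tok = self.next()
        if tok == '(':
            value = self.expr()          # recursion tracks nesting depth
            if self.next() != ')':
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok[0].isdigit():
            raise SyntaxError('unexpected token: %r' % tok)
        return float(tok)

def evaluate(text):
    parser = Parser(text)
    result = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError('trailing input')
    return result

print(evaluate('2 * (3 + 4) - 10 / 4'))   # 11.5
```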






XML Parsing






    Python Standard Library Support:

   SAX parsers: state-machines with class method callbacks

   DOM parsers: document object tree, with standard API for traversal

   ElementTree: Python-specific XML parser and generator



Example: Parsing XML 4 ways with patterns and basic XML tools





Given the following (narcissistic!) XML text file, mybooks.xml:




    <title>Learning Python</title>

    <title>Pogramming Python</title>

    <title>Python Pocket Reference</title>

    <publisher>O'Reilly Media</publisher>






Run a script to extract and display content of all the nested “title” tags, as follows:


Learning Python

Pogramming Python

Python Pocket Reference





PATTERNS: Basic pattern matching on the file’s text works for simple cases, though it can be inaccurate if the text is unpredictable. re.findall locates every place where the pattern matches in the string and returns a list of the substrings matched by the parenthesized group, or a list of tuples when the pattern has multiple groups


# File patternparse.py


import re

text  = open('mybooks.xml').read()

found = re.findall('<title>(.*)</title>', text)

for title in found: print(title)
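
A sketch of how the pattern approach can go wrong, using made-up text in which the tags share a line or the content spans lines (the books root tag is an assumption):

```python
import re

# Two titles on one line: greedy .* runs to the last closing tag.
text = ('<books><title>Learning Python</title>'
        '<title>Python Pocket Reference</title></books>')
greedy = re.findall('<title>(.*)</title>', text)
lazy = re.findall('<title>(.*?)</title>', text)
print(greedy)    # ['Learning Python</title><title>Python Pocket Reference']
print(lazy)      # ['Learning Python', 'Python Pocket Reference']

# A title split across lines: '.' does not match newlines by default.
multi = '<title>Learning\nPython</title>'
missed = re.findall('<title>(.*?)</title>', multi)
found = re.findall('<title>(.*?)</title>', multi, re.DOTALL)
print(missed, found)    # [] ['Learning\nPython']
```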





DOM PARSING: Perform complete XML parsing with the standard library’s DOM parsing support. DOM parses XML text into a tree of objects and provides an interface for navigating the tree to extract tag attributes and values. The DOM itself is a formal specification, independent of Python


# File domparse.py


from xml.dom.minidom import parse, Node

xmltree = parse('mybooks.xml')

for node1 in xmltree.getElementsByTagName('title'):

    for node2 in node1.childNodes:

        if node2.nodeType == Node.TEXT_NODE:

            print(node2.data)
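
A self-contained variant of the same DOM traversal, parsing from a string so that no file is needed. The XML text here is a stand-in: its books root tag and the choice of titles are assumptions for this sketch.

```python
from xml.dom.minidom import parseString, Node

xmltext = """<books>
    <title>Learning Python</title>
    <title>Python Pocket Reference</title>
    <publisher>O'Reilly Media</publisher>
</books>"""

doc = parseString(xmltext)               # build the document object tree
titles = []
for node1 in doc.getElementsByTagName('title'):
    for node2 in node1.childNodes:       # text lives in child text nodes
        if node2.nodeType == Node.TEXT_NODE:
            titles.append(node2.data)

print(titles)
print(doc.documentElement.tagName)       # name of the root element
```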






SAX PARSING: Alternatively, Python’s standard library also supports SAX parsing for XML. Under the SAX model, a class’s methods receive callbacks as a parse progresses, and use state information to keep track of where they are in the document and collect its data


# File saxparse.py


import xml.sax.handler

class BookHandler(xml.sax.handler.ContentHandler):

    def __init__(self):

        self.inTitle = False

    def startElement(self, name, attributes):

        if name == 'title':

            self.inTitle = True

    def characters(self, data):

        if self.inTitle:

            print(data)


    def endElement(self, name):

        if name == "title":

            self.inTitle = False


import xml.sax

parser = xml.sax.make_parser()

handler = BookHandler()

parser.setContentHandler(handler)

parser.parse('mybooks.xml')
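
A self-contained SAX sketch that collects titles instead of printing them. The TitleCollector class and the inline XML (with an assumed books root tag) are illustrative. Because a parser may deliver one element's text in several characters calls, the pieces are buffered and joined when the element ends.

```python
import xml.sax
import xml.sax.handler

xmltext = """<books>
    <title>Learning Python</title>
    <title>Python Pocket Reference</title>
    <publisher>O'Reilly Media</publisher>
</books>"""

class TitleCollector(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.inTitle = False             # state: inside a title tag?
        self.chunks = []                 # text pieces of the current title
        self.titles = []                 # completed titles

    def startElement(self, name, attributes):
        if name == 'title':
            self.inTitle = True
            self.chunks = []

    def characters(self, data):
        if self.inTitle:
            self.chunks.append(data)

    def endElement(self, name):
        if name == 'title':
            self.inTitle = False
            self.titles.append(''.join(self.chunks))

handler = TitleCollector()
xml.sax.parseString(xmltext.encode('utf-8'), handler)   # parse from bytes
print(handler.titles)
```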







ELEMENTTREE PARSING: Finally, the etree package of the standard library can often achieve the same effects as XML DOM parsers, but with less code. It’s a Python-specific way to both parse and generate XML text; after a parse, its API gives access to components of the document


# File etreeparse.py


from xml.etree.ElementTree import parse

tree = parse('mybooks.xml')

for E in tree.findall('title'):

    print(E.text)
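
ElementTree works in both directions. A sketch that parses titles from a string and then generates new XML text from them; the inline XML and its books root tag are assumptions for this example.

```python
from xml.etree.ElementTree import fromstring, Element, SubElement, tostring

xmltext = """<books>
    <title>Learning Python</title>
    <title>Python Pocket Reference</title>
    <publisher>O'Reilly Media</publisher>
</books>"""

# Parse: findall('title') returns the title children of the root element.
root = fromstring(xmltext)
titles = [E.text for E in root.findall('title')]
print(titles)

# Generate: build a new tree and serialize it back to text.
new_root = Element('books')
for title in titles:
    SubElement(new_root, 'title').text = title
generated = tostring(new_root, encoding='unicode')
print(generated)
```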






The output of all four alternatives is the same under 2.6, 2.7, and 3.X


C:\misc> c:\python26\python domparse.py

Learning Python

Pogramming Python

Python Pocket Reference


C:\misc> c:\python30\python domparse.py

Learning Python

Pogramming Python

Python Pocket Reference



    Other XML-related Support

      See Extras\Code\XML on class CD for additional DOM/SAX examples

      XPath,… 3rd-party extensions (see xml-sig page)

      O’Reilly book: Python & XML

      XML-RPC: xmlrpclib in Python std lib (xmlrpc.client in 3.X) – XML coded data over HTTP

      SOAP Protocol: PySoap, Soapy 3rd-party extensions
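
To see the XML coding that XML-RPC sends over HTTP, the 3.X xmlrpc.client module can marshal a call to text and back without any network connection. The method name lookup and its arguments are hypothetical.

```python
import xmlrpc.client

# Marshal a call's arguments and method name into XML text.
request = xmlrpc.client.dumps(('Learning Python', 3), methodname='lookup')
print(request)

# Unmarshal the XML text back into Python objects.
params, method = xmlrpc.client.loads(request)
print(params, method)
```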





    See also: JSON example in Database unit, the successor to XML?






