13. Text Processing

Topics

♦ String objects

♦ Splitting strings

♦ Regular expressions

♦ Parsing languages

♦ XML parsing

String objects: review

♦ Handle basic text processing tasks

♦ Operations: slicing, concatenation, indexing, formatting, etc.

♦ String methods: searching, replacement, splitting, etc.

♦ Built-in string functions: ord(C) → ASCII code in 2.X, Unicode code point in 3.X

♦ Running code strings: ‘eval’, ‘exec’ (2.X+3.X), ‘execfile’ (2.X)

♦ Unicode (possibly-wide) strings supported in Python 2.0+

► u'xxx' in 2.X, 'xxx' in 3.X; encoding/decoding in memory or on I/O (see final unit)

>>> text = "Hello world"

>>> text = 'M' + text[1:6] + 'World'

>>> text

'Mello World'

>>> exec 'print "J" + text[1:]'

Jello World
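The built-in functions and code-string tools listed above can be sketched for 3.X as follows. This is a minimal illustration, not part of the original session: the sample strings and the `scope` dictionary are assumptions chosen for the example, and `exec` is called as a function because the statement form shown above is 2.X-only.

```python
# Python 3.X sketch: code points, in-memory encoding, and code strings
ch = 'A'
code = ord(ch)                  # character -> Unicode code point
back = chr(code)                # code point -> character

text = 'spam'
raw = text.encode('utf-8')      # str -> bytes (encode in memory)
again = raw.decode('utf-8')     # bytes -> str (decode in memory)

# exec and eval run code strings; here they use an explicit namespace dict
scope = {'text': 'Hello World'}
exec("result = 'J' + text[1:]", scope)   # statement string
value = eval("len(text)", scope)         # expression string

print(code, back, raw, again, scope['result'], value)
```

Passing an explicit dictionary keeps names created by the code string out of the caller's own namespace.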

Splitting and joining strings

♦ str.split returns a list of columns, split around whitespace by default

♦ str.split allows arbitrary delimiters to be used

♦ str.join puts string lists back together

♦ eval converts column strings to Python objects
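The four points above can be tried in a short sketch. The sample strings are made up for illustration, and `eval` should only be applied to text you trust, since it runs any expression it is given.

```python
line = 'aaa bbb  ccc\n'
cols = line.split()                 # whitespace split, runs of blanks collapse

fields = 'a,b,c'.split(',')         # explicit delimiter
joined = '--'.join(fields)          # put the list back together

# eval turns each column string into a Python object
numbers = [eval(s) for s in '1 2.5 3+4'.split()]

print(cols, fields, joined, numbers)
```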

Example: summing columns in a file

# see also: newer column summer code at end of Basic Statements unit

file: summer.py

import sys

def summer(numCols, fileName):
    sums = [0] * numCols
    for line in open(fileName, 'r'):
        cols = line.split()
        for i in range(numCols):
            sums[i] += eval(cols[i])       # any expression will work!
    return sums

if __name__ == '__main__':
    print(summer(eval(sys.argv[1]), sys.argv[2]))

Example: column sum alternatives
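One possible pair of alternatives, sketched under the assumption that every column holds a plain number: convert with `float` instead of `eval`, or transpose the rows with `zip` so the column count need not be passed in. The function names and sample data are hypothetical.

```python
def sum_columns(lines, num_cols):
    """Sum the first num_cols columns, converting with float rather than eval."""
    sums = [0.0] * num_cols
    for line in lines:
        cols = line.split()
        for i in range(num_cols):
            sums[i] += float(cols[i])
    return sums

def sum_columns_zip(lines):
    """Sum every column by transposing rows into columns with zip."""
    rows = [[float(c) for c in line.split()] for line in lines if line.strip()]
    return [sum(col) for col in zip(*rows)]

data = ['1 2 3\n', '4 5 6\n', '7 8 9\n']   # stands in for an open file
print(sum_columns(data, 2))
print(sum_columns_zip(data))
```

Both functions accept any iterable of lines, so an open file object can be passed in place of the list.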

Example: replacing substrings

file: replace.py

# manual global substitution: same as str.replace(old, new)

def replace(text, old, new):           # avoid shadowing built-ins str and list
    parts = text.split(old)            # XoldY -> [X, Y]
    return new.join(parts)             # [X, Y] -> XnewY
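A quick check that the split/join approach agrees with the built-in method, using a made-up sample string:

```python
def replace(text, old, new):
    # split on the old substring, then rejoin with the new one
    return new.join(text.split(old))

sample = 'spam and spam and ham'
manual = replace(sample, 'spam', 'eggs')
builtin = sample.replace('spam', 'eggs')
print(manual, manual == builtin)
```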

Example: analyzing data files

● Collect all entries for keys on right

● Data file contains “histogram” data

% cat histo1.txt

1 one

2 one

3 two

7 three

8 two

10 one

14 three

19 three

20 three

30 three

% cat histo.py

#!/usr/bin/env python
import sys

entries = {}
for line in open(sys.argv[1]):
    [left, right] = line.split()
    try:
        entries[right].append(left)        # or test with 'in' (has_key in 2.X), or get
    except KeyError:                       # e[r] = e.get(r, []) + [l]
        entries[right] = [left]

for (right, lefts) in entries.items():
    print("%04d '%s'\titems => %s" % (len(lefts), right, lefts))

% histo.py histo1.txt

0003 'one' items => ['1', '2', '10']

0005 'three' items => ['7', '14', '19', '20', '30']

0002 'two' items => ['3', '8']
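The comments in histo.py mention alternatives to try/except. Two of them can be sketched as follows, using a few in-memory lines in place of the data file; `collections.defaultdict` is an addition not named in the original.

```python
from collections import defaultdict

lines = ['1 one', '2 one', '3 two', '7 three', '8 two']

# Alternative 1: dict.get supplies an empty list for unseen keys
by_get = {}
for line in lines:
    left, right = line.split()
    by_get[right] = by_get.get(right, []) + [left]

# Alternative 2: defaultdict creates the empty list on first access
by_default = defaultdict(list)
for line in lines:
    left, right = line.split()
    by_default[right].append(left)

for right, lefts in sorted(by_default.items()):
    print("%04d '%s'\titems => %s" % (len(lefts), right, lefts))
```

Sorting the items gives a stable display order, which plain dictionary iteration does not guarantee in older Pythons.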

Regular expressions

♦ For matching patterns in strings

♦ Matched substrings may be extracted after a match as “groups”

♦ Compiled regular expressions are first-class objects: optimization

♦ Now supported by the ‘re’ standard module: Perl5-style patterns

♦ Supports non-greedy operators, character classes, etc.

♦ Older options: the ‘regex’ and ‘regsub’ modules: emacs/awk/grep patterns

Basic interface

>>> import re

>>> mobj = re.match('Hello(.*)world', 'Hello---spam---world')

>>> mobj.group(1)

'---spam---'

>>> pobj = re.compile('Hello[ \t]*(.*)')

>>> mobj = pobj.match('Hello SPAM!')

>>> mobj.group(1)

'SPAM!'
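A failed match returns None rather than raising an exception, so results should be tested before calling `group`. The sketch below also contrasts `re.match`, which only succeeds at the start of the string, with `re.search`, which scans forward; the test strings are illustrative.

```python
import re

# match only succeeds at the start of the string; search scans forward
at_start = re.match('world', 'Hello world')      # None: no match at index 0
anywhere = re.search('world', 'Hello world')     # match object

# Test for None before calling group() on the result
word = anywhere.group(0) if anywhere else None

pobj = re.compile('Hello[ \t]*(.*)')
missing = pobj.match('Goodbye SPAM!')            # None: pattern requires 'Hello'

print(at_start, word, anywhere.span(), missing)
```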

Example: searching C files

Finds #include and #define lines in a C file

Operators

● X+ repeat X one or more times

● X* repeat X zero or more times

● [abc] any of a or b or c

● (X) keep substring that matches X (“group”)

● ^X match X at start of line

Methods

● re.compile precompiles expression into pattern object

● patternobj.match returns match object, or None if match fails

● matchobj.group returns matched substring[i] (pattern part in parens)

● matchobj.span returns start/stop indexes of match substring[i]

● also has methods for replacement and findall, nongreedy match operators,…
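The methods listed above, plus `findall` and `sub`, can be seen together in a small sketch. The pattern and input line are invented for the example.

```python
import re

patt = re.compile(r'(\w+)=(\d+)')      # two groups: name and number
line = 'x=1, y=22'

m = patt.search(line)                  # first match anywhere in the line
name, number = m.group(1), m.group(2)
where = m.span(2)                      # start/stop indexes of group 2

pairs = patt.findall(line)             # list of tuples, one per match
swapped = patt.sub(r'\2=\1', line)     # replacement can refer to groups

print(name, number, where, pairs, swapped)
```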

file: cheader.py

#!/usr/local/bin/python
import sys, re

# raw strings keep backslashes intact for the regular expression engine
pattDefine = re.compile(                               # precompile to pattobj
    r'^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')  # "# define xxx yyy..."

pattInclude = re.compile(
    r'^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/.]+)')     # "# include <xxx>..."

def scan(file):
    count = 0
    for line in file:                                  # scan line-by-line
        count += 1
        matchobj = pattDefine.match(line)              # None if match fails
        if matchobj:
            name = matchobj.group(1)                   # substrings for (...) parts
            body = matchobj.group(2)
            print(count, 'defined', name, '=', body.strip())
        else:
            matchobj = pattInclude.match(line)
            if matchobj:
                start, stop = matchobj.span(1)         # start/stop indexes of (...)
                filename = line[start:stop]            # slice out of line
                print(count, 'include', filename)      # same as matchobj.group(1)

if len(sys.argv) == 1:
    scan(sys.stdin)                                    # no args: read stdin
else:
    scan(open(sys.argv[1], 'r'))                       # arg: input file name
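To try the two patterns without a C file on disk, a variant can collect results in a list and read from an in-memory stream. This is a sketch for 3.X: the function name `scan_lines` and the sample source text are hypothetical.

```python
import io
import re

patt_define = re.compile(r'^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')
patt_include = re.compile(r'^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/.]+)')

def scan_lines(file):
    """Return (line number, kind, details...) tuples instead of printing."""
    found = []
    for count, line in enumerate(file, 1):
        m = patt_define.match(line)
        if m:
            found.append((count, 'defined', m.group(1), m.group(2).strip()))
            continue
        m = patt_include.match(line)
        if m:
            found.append((count, 'include', m.group(1)))
    return found

source = io.StringIO(
    '#include <stdio.h>\n'
    '# define MAX 100\n'
    'int x;\n'
    '#include "my/defs.h"\n')
results = scan_lines(source)
for item in results:
    print(item)
```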

Parsing languages

● For more demanding languages: regular expressions have no “memory”

● Recursive descent parsers: see YAPPS parser generator

● Parser generators: ‘bison’ wrapper, PyParsing, SPARK, kwParsing, etc. (see the web)

● NLTK: Natural Language Toolkit for Python, AI and statistical tools
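Nested constructs such as parenthesized expressions are where patterns alone run out, because matching a closing parenthesis requires remembering how many were opened. A hand-coded recursive descent parser handles this with one method per grammar rule. The sketch below is an illustrative toy for arithmetic expressions, not taken from any of the tools named above; all names in it are hypothetical.

```python
import re

TOKEN = re.compile(r'\s*(\d+\.?\d*|[-+*/()])')   # numbers and operators

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError('bad character at %d' % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):                    # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ('+', '-'):
            if self.next() == '+':
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):                    # term := factor (('*' | '/') factor)*
        value = self.factor()
        while self.peek() in ('*', '/'):
            if self.next() == '*':
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):                  # factor := NUMBER | '(' expr ')'
        tok = self.next()
        if tok == '(':
            value = self.expr()        # recursion handles arbitrary nesting
            if self.next() != ')':
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok[0].isdigit():
            raise SyntaxError('expected number, got %r' % tok)
        return float(tok)

def parse(text):
    parser = Parser(text)
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError('unexpected token %r' % parser.peek())
    return value

print(parse('2 + 3 * (4 - 1)'))
```

The regular expression is still useful here, but only for splitting the input into tokens; the nesting is tracked by the call stack.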

XML Parsing

● Python Standard Library Support:

► SAX parsers: state-machines with class method callbacks

► DOM parsers: document object tree, with standard API for traversal

► ElementTree: Python-specific XML parser and generator

Example: Parsing XML 4 ways with patterns and basic XML tools

Given the following (narcissistic!) XML text file, mybooks.xml:

<books>

<title>Learning Python</title>

<title>Programming Python</title>

<title>Python Pocket Reference</title>

<publisher>O'Reilly Media</publisher>

</books>

Run a script to extract and display content of all the nested “title” tags, as follows:

Learning Python

Programming Python

Python Pocket Reference

PATTERNS: Basic pattern matching on the file’s text, though this can be inaccurate if the text is unpredictable. findall locates all places where a pattern matches in the string, and returns a list of matched substrings corresponding to parenthesized pattern groups, or tuples of such for multiple groups. The pattern below relies on each title sitting on its own line, because “.” does not match a newline by default

# File patternparse.py

import re

text = open('mybooks.xml').read()

found = re.findall('<title>(.*)</title>', text)

for title in found: print(title)
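The inaccuracy mentioned above shows up as soon as two titles share a line: the greedy `.*` runs to the last closing tag. A non-greedy `.*?` stops at the first one. The one-line XML string here is an assumption made for the demonstration.

```python
import re

text = '<books><title>A</title><title>B</title></books>'

greedy = re.findall('<title>(.*)</title>', text)    # runs to the last </title>
lazy = re.findall('<title>(.*?)</title>', text)     # stops at the first </title>

print(greedy)
print(lazy)
```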

DOM PARSING: Perform complete XML parsing with the standard library’s DOM parsing support. DOM parses XML text into a tree of objects, and provides an interface for navigating the tree to extract tag attributes and values. The DOM itself is a formal specification, independent of Python

# File domparse.py

from xml.dom.minidom import parse, Node

xmltree = parse('mybooks.xml')
for node1 in xmltree.getElementsByTagName('title'):
    for node2 in node1.childNodes:
        if node2.nodeType == Node.TEXT_NODE:
            print(node2.data)
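The same DOM interface also works on XML held in a string, and exposes tag attributes as well as text. In this sketch the `lang` attribute and the XML text are invented for illustration; they are not part of mybooks.xml.

```python
from xml.dom.minidom import parseString, Node

doc = parseString('<books><title lang="en">Learning Python</title></books>')

node = doc.getElementsByTagName('title')[0]
lang = node.getAttribute('lang')                 # tag attribute value
text = ''.join(child.data for child in node.childNodes
               if child.nodeType == Node.TEXT_NODE)

print(lang, text)
```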

SAX PARSING: Alternatively, Python’s standard library also supports SAX parsing for XML. Under the SAX model, a class’s methods receive callbacks as a parse progresses, and use state information to keep track of where they are in the document and collect its data

# File saxparse.py

import xml.sax.handler

class BookHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.inTitle = False

    def startElement(self, name, attributes):
        if name == 'title':
            self.inTitle = True

    def characters(self, data):
        if self.inTitle:
            print(data)

    def endElement(self, name):
        if name == "title":
            self.inTitle = False

import xml.sax
parser = xml.sax.make_parser()
handler = BookHandler()
parser.setContentHandler(handler)
parser.parse('mybooks.xml')
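A SAX parser is allowed to deliver the text of one element in several `characters` calls, so a more defensive handler accumulates chunks and joins them when the tag closes. The sketch below does that and collects titles in a list instead of printing them; the class name and the in-memory XML are hypothetical.

```python
import xml.sax
import xml.sax.handler

class TitleCollector(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.in_title = False
        self.chunks = []          # pieces of the current title's text
        self.titles = []          # completed titles

    def startElement(self, name, attrs):
        if name == 'title':
            self.in_title = True
            self.chunks = []

    def characters(self, data):
        if self.in_title:
            self.chunks.append(data)

    def endElement(self, name):
        if name == 'title':
            self.titles.append(''.join(self.chunks))
            self.in_title = False

XML = (b"<books><title>Learning Python</title>"
       b"<title>Python Pocket Reference</title></books>")

handler = TitleCollector()
xml.sax.parseString(XML, handler)     # parse from bytes rather than a file
print(handler.titles)
```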

ELEMENTTREE PARSING: Finally, the etree package of the standard library can often achieve the same effects as XML DOM parsers, but with less code. It’s a Python-specific way to both parse and generate XML text; after a parse, its API gives access to components of the document

# File etreeparse.py

from xml.etree.ElementTree import parse

tree = parse('mybooks.xml')
for E in tree.findall('title'):
    print(E.text)
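Since ElementTree generates XML as well as parsing it, the document can also be built in code and read back. This is a 3.X sketch: `encoding='unicode'` asks `tostring` for a str rather than bytes, and the title list is an assumption for the example.

```python
import xml.etree.ElementTree as ET

# Build the tree in memory
root = ET.Element('books')
for name in ['Learning Python', 'Python Pocket Reference']:
    ET.SubElement(root, 'title').text = name
ET.SubElement(root, 'publisher').text = "O'Reilly Media"

xmltext = ET.tostring(root, encoding='unicode')   # serialize to a str

# Parse the generated text back and extract the titles
titles = [e.text for e in ET.fromstring(xmltext).findall('title')]
print(xmltext)
print(titles)
```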

The output of all four alternatives is the same under 2.6, 2.7, and 3.X

C:\misc> c:\python26\python domparse.py

Learning Python

Programming Python

Python Pocket Reference

C:\misc> c:\python30\python domparse.py

Learning Python