Topics
¨ String objects
¨ Splitting strings
¨ Regular expressions
¨ Parsing languages
¨ XML parsing (see Internet section)
String objects
¨ Handle basic text processing tasks
¨ Operations: slicing, concatenation, indexing, etc.
¨ String methods: searching, replacement, splitting, etc.
¨ Built-in string functions: ord()
¨ Running code strings: ‘eval’, ‘exec’, ‘execfile’
¨ Unicode (16-bit char) wide strings supported in Python 2.0
>>> text = "Hello world"
>>> text = 'M' + text[1:6] + 'World'
>>> text
'Mello World'
>>> exec 'print "J" + text[1:]'
Jello World
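The operations listed above can be tried together in one short script. This is only a sketch in Python 3 syntax (print is a function and exec takes a string argument), not the Python 2 session shown above; the variable names are illustrative.

```python
# String basics: indexing, slicing, concatenation, methods, ord/chr.
text = "Hello world"

first = text[0]                   # indexing: 'H'
middle = text[1:6]                # slicing: 'ello ' (index 6 is excluded)
mello = 'M' + middle + 'World'    # concatenation builds a new string

# Strings are immutable, so methods return new objects.
position = mello.find('World')            # index of substring, or -1
swapped = mello.replace('World', 'Moon')  # 'Mello Moon'
words = mello.split()                     # ['Mello', 'World']

code = ord('M')                   # character -> integer code point
char = chr(code)                  # and back again

# eval runs an expression string and returns its value;
# exec runs statement strings and stores results in a namespace.
total = eval('len(mello) + 1')
namespace = {'text': mello}
exec("jello = 'J' + text[1:]", namespace)

print(mello, position, swapped, words, code, char, total, namespace['jello'])
```

Passing an explicit dictionary to exec keeps the names it creates out of the enclosing scope.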
Splitting strings
¨ str.split returns a list of columns, split around whitespace by default
¨ str.split allows arbitrary delimiters to be used
¨ str.join puts string lists back together
¨ eval converts column strings to Python objects
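A minimal sketch of the four points above, in Python 3 syntax. The sample strings are made up for illustration.

```python
# Splitting on whitespace, splitting on a delimiter, joining, and eval.
line = "1 2.5 3+4\n"

cols = line.split()                     # ['1', '2.5', '3+4']
values = [eval(col) for col in cols]    # strings -> Python objects

record = "spam:eggs:ham"
fields = record.split(':')              # arbitrary delimiter
rejoined = '-'.join(fields)             # put the list back together

print(cols, values, fields, rejoined)
```

Because eval runs any expression it is given, it should only see trusted input; int() or float() are narrower conversions when the columns are plain numbers.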
Example: column sum alternatives
Example: summing columns in a file
# see also: newer column summer code at end of Basic Statements chapter
file: summer.py
import sys

def summer(numCols, fileName):
    sums = [0] * numCols
    for line in open(fileName, 'r'):
        cols = line.split()
        for i in range(numCols):
            sums[i] += eval(cols[i])    # any expression will work!
    return sums

if __name__ == '__main__':
    print summer(eval(sys.argv[1]), sys.argv[2])
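To see the column summer run without a command line, the sketch below ports it to Python 3 and feeds it a temporary file. The sample data and the snake_case names are assumptions for the demo, not part of summer.py.

```python
import os
import tempfile

def summer(num_cols, file_name):
    """Sum the first num_cols whitespace-separated columns of a file."""
    sums = [0] * num_cols
    with open(file_name) as infile:
        for line in infile:
            cols = line.split()
            for i in range(num_cols):
                sums[i] += eval(cols[i])   # as in summer.py: any expression works
    return sums

# Hypothetical sample data, written to a temporary file for the demo.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("1 2 3\n4 5 6\n2*2 1.5 0\n")
    path = tmp.name

result = summer(3, path)
os.remove(path)
print(result)    # [9, 8.5, 9]
```

The third row shows why eval is used here: the column "2*2" is an expression, not a literal number.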
Example: replacing substrings
file: replace.py
# manual global substitution
def replace(str, old, new):
    list = str.split(old)       # XoldY -> [X, Y]
    return new.join(list)       # [X, Y] -> XnewY
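A quick check that the split/join approach gives the same result as the built-in string method, in Python 3 syntax. The parameter names are changed here so they do not shadow the built-in str and list; that renaming is an editorial choice, not part of replace.py.

```python
def replace(text, old, new):
    parts = text.split(old)     # 'XoldY' -> ['X', 'Y']
    return new.join(parts)      # ['X', 'Y'] -> 'XnewY'

sample = "spam and spam and eggs"
manual = replace(sample, "spam", "ham")
builtin = sample.replace("spam", "ham")

print(manual)               # ham and ham and eggs
print(manual == builtin)    # True
```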
Example: analyzing data files
· Collect all entries for keys on right
· Data file contains “histogram” data
% cat histo1.txt
1 one
2 one
3 two
7 three
8 two
10 one
14 three
19 three
20 three
30 three
% cat histo.py
#!/usr/bin/env python
import sys

entries = {}
for line in open(sys.argv[1]):
    [left, right] = line.split()
    try:
        entries[right].append(left)     # or use has_key, or get
    except KeyError:                    # e[r] = e.get(r, []) + [l]
        entries[right] = [left]

for (right, lefts) in entries.items():
    print "%04d '%s'\titems => %s" % (len(lefts), right, lefts)
% histo.py histo1.txt
0003 'one' items => ['1', '2', '10']
0005 'three' items => ['7', '14', '19', '20', '30']
0002 'two' items => ['3', '8']
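The comments in histo.py mention alternatives to the try/except form. The sketch below shows two of them in Python 3 syntax, using dict.get and dict.setdefault; has_key exists only in Python 2. The in-memory sample lines are a shortened stand-in for histo1.txt.

```python
lines = """1 one
2 one
3 two
7 three
8 two
10 one
""".splitlines()

# Alternative 1: setdefault returns the existing list or installs a new one.
entries = {}
for line in lines:
    left, right = line.split()
    entries.setdefault(right, []).append(left)

# Alternative 2: get with a default, as in the comment in histo.py.
via_get = {}
for line in lines:
    left, right = line.split()
    via_get[right] = via_get.get(right, []) + [left]

for right in sorted(entries):
    lefts = entries[right]
    print("%04d '%s'\titems => %s" % (len(lefts), right, lefts))
```

The get form builds a new list on every line, so setdefault is the cheaper of the two for long files.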
Regular expressions
¨ For matching patterns in strings
¨ Matched substrings may be extracted after a match as “groups”
¨ Compiled regular expressions are first-class objects: optimization
¨ Now supported by the ‘re’ standard module: Perl5-style patterns
¨ Supports non-greedy operators, character classes, etc.
¨ Older options: the ‘regex’ and ‘regsub’ modules: emacs/awk/grep patterns
Basic interface
>>> import re
>>> mobj = re.match('Hello(.*)world', 'Hello---spam---world')
>>> mobj.group(1)
'---spam---'
>>> pobj = re.compile('Hello[ \t]*(.*)')
>>> mobj = pobj.match('Hello SPAM!')
>>> mobj.group(1)
'SPAM!'
Example: searching C files
¨ Finds #include and #define lines in a C file
Operators
¨ X+ repeat X one or more times
¨ X* repeat X zero or more times
¨ [abc] any of a or b or c
¨ (X) keep substring that matches X (“group”)
¨ ^X match X at start of line
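Each operator in the list above can be tried on its own. The following is a small sketch in Python 3 syntax; the patterns and test strings are invented for illustration.

```python
import re

plus = re.match('ab+', 'abbbc').group()                  # 'abbb': one or more b
star = re.match('ab*', 'ac').group()                     # 'a': zero b is allowed
charset = re.match('[abc]+', 'cabbage').group()          # 'cabba': stops at g
group = re.match('Hello (.*)', 'Hello world').group(1)   # 'world': text in parens
at_start = re.search('^spam', 'spam and eggs')           # match object
not_at_start = re.search('^spam', 'eggs and spam')       # None

print(plus, star, charset, group, bool(at_start), not_at_start)
```

By default ^ anchors at the start of the whole string; passing the re.MULTILINE flag makes it match at the start of every line.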
Methods
¨ re.compile precompiles expression into pattern object
¨ patternobj.match returns match object, or None if match fails
¨ matchobj.group returns matched substring[i] (pattern part in parens)
¨ matchobj.span returns start/stop indexes of match substring[i]
¨ also has methods for replacement, nongreedy match operators,…
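The methods listed above fit together as in the sketch below, written in Python 3 syntax. The name=value pattern and the sample strings are assumptions chosen for the demo.

```python
import re

pattern = re.compile(r'(\w+)=(\d+)')     # compile once, reuse many times

match = pattern.match('width=80; height=24')
name = match.group(1)                    # 'width'
value = match.group(2)                   # '80'
whole = match.group(0)                   # 'width=80': the entire match
start, stop = match.span(2)              # (6, 8): indexes of group 2

failed = pattern.match('; height=24')    # None: match only tries at the start
found = pattern.search('; height=24')    # search scans ahead for a match

# Replacement: \1 and \2 refer back to the parenthesized groups.
replaced = pattern.sub(r'\1: \2', 'width=80; height=24')

# Greedy .* takes as much as possible; non-greedy .*? as little as possible.
greedy = re.match('<(.*)>', '<a><b>').group(1)       # 'a><b'
nongreedy = re.match('<(.*?)>', '<a><b>').group(1)   # 'a'

print(name, value, whole, (start, stop), failed, found.group(1), replaced)
```

Testing the result against None before calling group avoids an AttributeError when a pattern does not match.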
file: cheader.py
#! /usr/local/bin/python
import sys, re

pattDefine = re.compile(                                  # compile to pattobj
    '^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')      # "# define xxx yyy..."

pattInclude = re.compile(
    '^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/\.]+)')        # "# include <xxx>..."

def scan(file):
    count = 0
    while 1:                                      # scan line-by-line
        line = file.readline()
        if not line: break
        count = count + 1
        matchobj = pattDefine.match(line)         # None if match fails
        if matchobj:
            name = matchobj.group(1)              # substrings for (...) parts
            body = matchobj.group(2)
            print count, 'defined', name, '=', body.strip()
            continue
        matchobj = pattInclude.match(line)
        if matchobj:
            start, stop = matchobj.span(1)        # start/stop indexes of (...)
            filename = line[start:stop]           # slice out of line
            print count, 'include', filename      # same as matchobj.group(1)

if len(sys.argv) == 1:
    scan(sys.stdin)                               # no args: read stdin
else:
    scan(open(sys.argv[1], 'r'))                  # arg: input file name
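The scanner can also be exercised without a real C file. The sketch below is a Python 3 variant that collects results in a list instead of printing them and reads from an in-memory file; the sample C text, the tuple layout, and the snake_case names are assumptions for the demo.

```python
import io
import re

patt_define = re.compile(r'^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')
patt_include = re.compile(r'^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/.]+)')

def scan(file):
    """Return (line number, kind, details...) tuples for #define/#include lines."""
    found = []
    for count, line in enumerate(file, start=1):    # line numbers start at 1
        match = patt_define.match(line)
        if match:
            found.append((count, 'defined', match.group(1), match.group(2).strip()))
            continue
        match = patt_include.match(line)
        if match:
            found.append((count, 'include', match.group(1)))
    return found

# Hypothetical C source held in memory rather than on disk.
source = io.StringIO(
    '#include <stdio.h>\n'
    '# define MAXLINE 100\n'
    'int main() { return 0; }\n'
    '#include "util/tools.h"\n'
)
results = scan(source)
for item in results:
    print(item)
```

Any object that yields lines works as the argument, so the same function accepts sys.stdin or an opened file.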
Parsing languages
¨ For more demanding languages: regular expressions have no “memory”
¨ Recursive descent parsers: see YAPPS parser generator
¨ Parser generators: ‘bison’ wrapper, PyParsing, SPARK, kwParsing, etc. (see the web)
¨ NLTK: Natural Language Toolkit for Python, AI and statistical tools
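Nested parentheses show what “memory” means here: the parser must remember how many levels are still open, which recursion provides naturally. Below is a minimal hand-written recursive descent sketch in Python 3 for integer arithmetic. The grammar and all function names are illustrative assumptions, not the output or API of YAPPS or any other tool named above.

```python
import re

TOKEN = re.compile(r'\s*(\d+|[-+*/()])')    # a number or a one-character operator

def tokenize(text):
    """Split the input into number and operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError('bad character at position %d' % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse(text):
    """Evaluate an arithmetic expression string."""
    value, rest = expr(tokenize(text))
    if rest:
        raise SyntaxError('unexpected token %r' % rest[0])
    return value

def expr(tokens):
    # expr := term (('+' | '-') term)*
    value, tokens = term(tokens)
    while tokens and tokens[0] in ('+', '-'):
        op, tokens = tokens[0], tokens[1:]
        right, tokens = term(tokens)
        value = value + right if op == '+' else value - right
    return value, tokens

def term(tokens):
    # term := factor (('*' | '/') factor)*
    value, tokens = factor(tokens)
    while tokens and tokens[0] in ('*', '/'):
        op, tokens = tokens[0], tokens[1:]
        right, tokens = factor(tokens)
        value = value * right if op == '*' else value / right
    return value, tokens

def factor(tokens):
    # factor := NUMBER | '(' expr ')'
    if not tokens:
        raise SyntaxError('unexpected end of input')
    head, tokens = tokens[0], tokens[1:]
    if head == '(':
        value, tokens = expr(tokens)        # recursion handles the nesting
        if not tokens or tokens[0] != ')':
            raise SyntaxError("expected ')'")
        return value, tokens[1:]
    if head.isdigit():
        return int(head), tokens
    raise SyntaxError('unexpected token %r' % head)

print(parse('2 * (3 + 4) - 1'))         # 13
print(parse('((1 + 2) * (3 + 4))'))     # 21
```

Each grammar rule becomes one function, and the call stack plays the role of the memory that a single regular expression lacks.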