Topics
♦ String objects
♦ Splitting strings
♦ Regular expressions
♦ Parsing languages
♦ XML parsing
♦ Handle basic text processing tasks
♦ Operations: slicing, concatenation, indexing, formatting, etc.
♦ String methods: searching, replacement, splitting, etc.
♦ Built-in string functions: ord(C) → ASCII code in 2.X, Unicode code point in 3.X
♦ Running code strings: ‘eval’, ‘exec’ (2.X+3.X), ‘execfile’ (2.X)
♦ Unicode (possibly-wide) strings supported in Python 2.0+
► u'xxx' in 2.X, 'xxx' in 3.X; encoding/decoding in memory or on I/O (see final unit)
>>> text = "Hello world"
>>> text = 'M' + text[1:6] + 'World'
>>> text
'Mello World'
>>> exec 'print "J" + text[1:]'
Jello World
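The session above uses the 2.X exec statement. As a rough 3.X counterpart (a sketch, not taken from the class files; the namespace dictionaries and sample strings are assumptions made for illustration), exec becomes a function call, and ord, chr, and encode/decode behave as the bullets describe:

```python
# 3.X sketch of the string tools listed above; sample values are illustrative
text = 'Hello world'
text = 'M' + text[1:6] + 'World'        # slicing plus concatenation
print(text)                             # Mello World

# exec is a function in 3.X; passing a namespace makes its inputs explicit
exec('print("J" + text[1:])', {'text': text})

code = ord('A')                         # character -> Unicode code point
char = chr(code)                        # code point -> character

encoded = 'sp\xe4m'.encode('utf-8')     # str -> bytes, in memory
decoded = encoded.decode('utf-8')       # bytes -> str

result = eval('len(text) * 2', {'text': text})   # run an expression string
```

Passing an explicit dictionary to exec and eval is a design choice for this sketch: it keeps code strings from reading or changing unrelated names.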
♦ str.split returns a list of columns, splitting around whitespace by default
♦ str.split allows arbitrary delimiters to be used
♦ str.join puts string lists back together
♦ eval converts column strings to Python objects
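The four bullets above can be tried on in-memory strings before moving to files. This is a minimal sketch with made-up sample data, not code from the class:

```python
# Splitting, joining, and converting column strings (sample data is invented)
line = 'aaa bbb  ccc\n'
cols = line.split()                 # default: split around any whitespace run

fields = 'a,b,c'.split(',')         # explicit delimiter
rejoined = '-'.join(fields)         # put the pieces back together

# eval turns each column string into a Python object (only for trusted input)
nums = [eval(s) for s in '1 2.5 3*2'.split()]

print(cols, fields, rejoined, nums)
```

Because eval runs any expression it is given, it is best reserved for data you trust; int or float are safer when the columns are plain numbers.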
Example: summing columns in a file
# see also: newer column summer code at end of Basic Statements unit
file: summer.py
import sys

def summer(numCols, fileName):
    sums = [0] * numCols
    for line in open(fileName, 'r'):
        cols = line.split()
        for i in range(numCols):
            sums[i] += eval(cols[i])        # any expression will work!
    return sums

if __name__ == '__main__':
    print(summer(eval(sys.argv[1]), sys.argv[2]))
Example: column sum alternatives
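The alternatives themselves are not listed here, so the following is only one possible variant and not necessarily the one the class presents. It assumes plain numeric columns, replaces eval with float, and accepts any iterable of lines; the names summer_float and summer_any are hypothetical:

```python
def summer_float(lines, num_cols):
    """Sum the first num_cols columns, converting with float instead of eval."""
    sums = [0.0] * num_cols
    for line in lines:
        cols = line.split()
        for i in range(num_cols):
            sums[i] += float(cols[i])
    return sums

def summer_any(lines):
    """Sum however many columns each line has, without a fixed column count."""
    sums = {}
    for line in lines:
        for i, col in enumerate(line.split()):
            sums[i] = sums.get(i, 0.0) + float(col)
    return [sums[i] for i in sorted(sums)]

sample = ['1 2 3', '4 5 6', '7 8']      # invented data; the last row is short
print(summer_float(sample[:2], 3))
print(summer_any(sample))
```

Taking an iterable of lines rather than a file name means the same function works on an open file, sys.stdin, or a list in a test.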
Example: replacing substrings
file: replace.py
# manual global substitution: same as str.replace(old, new)
def replace(text, old, new):
    parts = text.split(old)           # XoldY -> [X, Y]
    return new.join(parts)            # [X, Y] -> XnewY
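A quick check of the split/join technique against the built-in method, using an invented sample string. The function is repeated so the sketch runs on its own:

```python
def replace(text, old, new):
    parts = text.split(old)           # XoldY -> [X, Y]
    return new.join(parts)            # [X, Y] -> XnewY

sample = 'spam and spam'
manual = replace(sample, 'spam', 'eggs')
builtin = sample.replace('spam', 'eggs')
print(manual, manual == builtin)

# One difference: an empty 'old' string is rejected by split
empty_old_error = None
try:
    replace(sample, '', '-')
except ValueError as exc:
    empty_old_error = str(exc)
```

The two agree for ordinary substrings; they differ only for an empty old string, which str.replace accepts and str.split refuses.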
Example: analyzing data files
● Collect all entries for keys on right
● Data file contains “histogram” data
% cat histo1.txt
1 one
2 one
3 two
7 three
8 two
10 one
14 three
19 three
20 three
30 three
% cat histo.py
#!/usr/bin/env python
import sys

entries = {}
for line in open(sys.argv[1]):
    [left, right] = line.split()
    try:
        entries[right].append(left)          # or use has_key, or get
    except KeyError:                         # e[r] = e.get(r, []) + [l]
        entries[right] = [left]

for (right, lefts) in entries.items():
    print("%04d '%s'\titems => %s" % (len(lefts), right, lefts))
% histo.py histo1.txt
0003 'one' items => ['1', '2', '10']
0005 'three' items => ['7', '14', '19', '20', '30']
0002 'two' items => ['3', '8']
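The comments in histo.py mention has_key and get as alternatives to try/except. Another option, sketched here with collections.defaultdict on a shortened, in-memory copy of the data (an assumption made so the example needs no file), creates the empty list automatically on first use:

```python
from collections import defaultdict

# Shortened sample of the histogram data, held in memory for illustration
lines = ['1 one', '2 one', '3 two', '7 three', '8 two']

entries = defaultdict(list)             # missing keys start as an empty list
for line in lines:
    left, right = line.split()
    entries[right].append(left)

for right in sorted(entries):           # sorted keys give a stable display order
    lefts = entries[right]
    print("%04d '%s'\titems => %s" % (len(lefts), right, lefts))
```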
♦ For matching patterns in strings
♦ Matched substrings may be extracted after a match as “groups”
♦ Compiled regular expressions are first-class objects: optimization
♦ Now supported by the ‘re’ standard module: Perl5-style patterns
♦ Supports non-greedy operators, character classes, etc.
♦ Older options: the ‘regex’ and ‘regsub’ modules: emacs/awk/grep patterns
Basic interface
>>> import re
>>> mobj = re.match('Hello(.*)world', 'Hello---spam---world')
>>> mobj.group(1)
'---spam---'
>>> pobj = re.compile('Hello[ \t]*(.*)')
>>> mobj = pobj.match('Hello SPAM!')
>>> mobj.group(1)
'SPAM!'
Example: searching C files
Finds #include and #define lines in a C file
Operators
● X+ repeat X one or more times
● X* repeat X zero or more times
● [abc] any of a or b or c
● (X) keep substring that matches X (“group”)
● ^X match X at start of line
Methods
● re.compile precompiles expression into pattern object
● patternobj.match returns match object, or None if match fails
● matchobj.group returns matched substring[i] (pattern part in parens)
● matchobj.span returns start/stop indexes of match substring[i]
● also has methods for replacement and findall, nongreedy match operators,…
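The operators and methods listed above can be exercised on short strings. This is a sketch with invented inputs, using only calls from the standard re module:

```python
import re

# Compile once, match many times; group and span refer to the (...) part
pobj = re.compile(r'^#[ \t]*define[ \t]+(\w+)')
mobj = pobj.match('#  define SPAM 42')
name = mobj.group(1)                    # text matched by the first group
start, stop = mobj.span(1)              # offsets of that group in the string

failed = pobj.match('int x;')           # no match: result is None

# Greedy .* runs to the last '>', non-greedy .*? stops at the first
greedy = re.findall('<(.*)>', '<a><b>')
lazy = re.findall('<(.*?)>', '<a><b>')

# Replacement: collapse each run of blanks and tabs to one space
squeezed = re.sub('[ \t]+', ' ', 'a  \t b')

print(name, (start, stop), failed, greedy, lazy, squeezed)
```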
file: cheader.py
#!/usr/local/bin/python
import sys, re

pattDefine = re.compile(                               # precompile to pattobj
    r'^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')  # "# define xxx yyy..."

pattInclude = re.compile(
    r'^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/\.]+)')    # "# include <xxx>..."

def scan(file):
    count = 0
    for line in file:                                  # scan line-by-line
        count += 1
        matchobj = pattDefine.match(line)              # None if match fails
        if matchobj:
            name = matchobj.group(1)                   # substrings for (...) parts
            body = matchobj.group(2)
            print(count, 'defined', name, '=', body.strip())
        else:
            matchobj = pattInclude.match(line)
            if matchobj:
                start, stop = matchobj.span(1)         # start/stop indexes of (...)
                filename = line[start:stop]            # slice out of line
                print(count, 'include', filename)      # same as matchobj.group(1)

if len(sys.argv) == 1:
    scan(sys.stdin)                                    # no args: read stdin
else:
    scan(open(sys.argv[1], 'r'))                       # arg: input file name
● For more demanding languages: regular expressions have no “memory”
● Recursive descent parsers: see YAPPS parser generator
● Parser generators: ‘bison’ wrapper, PyParsing, SPARK, kwParsing, etc. (see the web)
● NLTK: Natural Language Toolkit for Python, AI and statistical tools
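To illustrate the first two bullets: arbitrarily nested parentheses are the kind of structure that patterns alone cannot track, while a recursive descent parser handles them with one function per grammar rule. The tiny arithmetic grammar and all names below are assumptions made for this sketch; it is hand-written and unrelated to YAPPS output:

```python
import re

TOKEN = re.compile(r'\s*(\d+|[-+*/()])')    # numbers, operators, parentheses

def tokenize(text):
    """Split the input into tokens; regular expressions suit this flat step."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError('bad character at offset %d' % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):                 # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ('+', '-'):
            if self.advance() == '+':
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):                 # term := factor (('*' | '/') factor)*
        value = self.factor()
        while self.peek() in ('*', '/'):
            if self.advance() == '*':
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):               # factor := NUMBER | '(' expr ')'
        token = self.advance()
        if token == '(':
            value = self.expr()     # recursion supplies the nesting "memory"
            if self.advance() != ')':
                raise SyntaxError("expected ')'")
            return value
        if token is None or not token.isdigit():
            raise SyntaxError('unexpected token %r' % token)
        return int(token)

def evaluate(text):
    parser = Parser(text)
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError('unexpected token %r' % parser.peek())
    return value

print(evaluate('2 * (3 + 4) - 5'))
```

The call stack remembers how many parentheses are still open, which is the state a plain pattern match lacks.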
● Python Standard Library Support:
► SAX parsers: state-machines with class method callbacks
► DOM parsers: document object tree, with standard API for traversal
► ElementTree: Python-specific XML parser and generator
Example: Parsing XML 4 ways with patterns and basic XML tools
Given the following (narcissistic!) XML text file, mybooks.xml:
<books>
<date>2009</date>
<title>Learning Python</title>
<title>Pogramming Python</title>
<title>Python Pocket Reference</title>
<publisher>O'Reilly Media</publisher>
</books>
Run a script to extract and display content of all the nested “title” tags, as follows:
Learning Python
Pogramming Python
Python Pocket Reference
PATTERNS: Basic pattern matching on the file’s text, though this can be inaccurate if the text is unpredictable. findall locates all places where a pattern matches in the string and returns a list of matched substrings corresponding to parenthesized pattern groups, or a list of tuples of such substrings when the pattern has multiple groups.
# File patternparse.py
import re
text = open('mybooks.xml').read()
found = re.findall('<title>(.*)</title>', text)
for title in found: print(title)
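One way the pattern approach can go wrong on unpredictable text: the pattern above relies on each title sitting on its own line. In this sketch the XML is an invented one-line variant of the file, and the greedy group swallows everything between the first opening tag and the last closing tag:

```python
import re

# Invented single-line variant of the data, to show the failure mode
text = ('<books><title>Learning Python</title>'
        '<title>Python Pocket Reference</title></books>')

greedy = re.findall('<title>(.*)</title>', text)     # one oversized match
lazy = re.findall('<title>(.*?)</title>', text)      # one match per title

print(greedy)
print(lazy)
```

The non-greedy form fixes this particular case, but titles that span lines or carry attributes would still need further pattern changes, which is where real XML parsers earn their keep.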
DOM PARSING: Perform complete XML parsing with the standard library’s DOM parsing support. DOM parses XML text into a tree of objects and provides an interface for navigating the tree to extract tag attributes and values. The DOM itself is a formal specification, independent of Python.
# File domparse.py
from xml.dom.minidom import parse, Node
xmltree = parse('mybooks.xml')
for node1 in xmltree.getElementsByTagName('title'):
    for node2 in node1.childNodes:
        if node2.nodeType == Node.TEXT_NODE:
            print(node2.data)
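The DOM interface also reaches tag attributes, which the script above does not need. A small sketch on an in-memory string follows; the lang attribute is hypothetical and does not appear in mybooks.xml:

```python
from xml.dom.minidom import parseString

# Invented XML text; the 'lang' attribute is for illustration only
xmltext = ('<books><title lang="en">Learning Python</title>'
           '<title lang="en">Python Pocket Reference</title></books>')

doc = parseString(xmltext)              # parse from a string instead of a file
pairs = []
for node in doc.getElementsByTagName('title'):
    # getAttribute returns '' if the attribute is missing
    pairs.append((node.getAttribute('lang'), node.firstChild.data))

print(pairs)
```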
SAX PARSING: Alternatively, Python’s standard
library also supports SAX parsing for XML. Under the SAX model, a class’s
methods receive callbacks as a parse progresses, and use state information to
keep track of where they are in the document and collect its data
# File saxparse.py
import xml.sax.handler
class BookHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.inTitle = False
    def startElement(self, name, attributes):
        if name == 'title':
            self.inTitle = True
    def characters(self, data):
        if self.inTitle:
            print(data)
    def endElement(self, name):
        if name == "title":
            self.inTitle = False

import xml.sax
parser = xml.sax.make_parser()
handler = BookHandler()
parser.setContentHandler(handler)
parser.parse('mybooks.xml')
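A variation on the handler above, sketched under two assumptions: the XML comes from an in-memory bytes string rather than mybooks.xml, and the class name TitleCollector is made up. It collects titles in a list instead of printing, and appends character data piece by piece because a SAX parser may deliver one element's text in several characters calls:

```python
import xml.sax
import xml.sax.handler

class TitleCollector(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.in_title = False           # state: are we inside a <title>?
        self.titles = []

    def startElement(self, name, attributes):
        if name == 'title':
            self.in_title = True
            self.titles.append('')      # start a new, empty title

    def characters(self, data):
        if self.in_title:
            self.titles[-1] += data     # text may arrive in chunks

    def endElement(self, name):
        if name == 'title':
            self.in_title = False

xmltext = (b'<books><title>Learning Python</title>'
           b'<title>Python Pocket Reference</title></books>')
handler = TitleCollector()
xml.sax.parseString(xmltext, handler)
print(handler.titles)
```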
ELEMENTTREE PARSING: Finally, the etree
package of the standard library can often achieve the same effects as XML DOM
parsers, but with less code. It’s a Python-specific way to both parse and
generate XML text; after a parse, its API gives access to components of the
document
# File etreeparse.py
from xml.etree.ElementTree import parse
tree = parse('mybooks.xml')
for E in tree.findall('title'):
    print(E.text)
The output of all four alternatives is the same under 2.6, 2.7, and 3.X:
C:\misc> c:\python26\python domparse.py
Learning Python
Pogramming Python
Python Pocket Reference
C:\misc> c:\python30\python domparse.py
Learning Python
Pogramming Python
Python Pocket Reference
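ElementTree is described above as a way to both parse and generate XML, but only parsing is shown. The following 3.X sketch builds a small document in memory and parses it back; the titles are sample data, and the encoding='unicode' argument asks tostring for a str rather than bytes:

```python
import xml.etree.ElementTree as ET

# Generate: build the tree from Python objects
root = ET.Element('books')
for name in ['Learning Python', 'Python Pocket Reference']:
    ET.SubElement(root, 'title').text = name

xmltext = ET.tostring(root, encoding='unicode')     # serialize to a str
print(xmltext)

# Parse: read the generated text back and extract the titles
titles = [elem.text for elem in ET.fromstring(xmltext).findall('title')]
print(titles)
```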
● Other XML-related Support
■ See Extras\Code\XML on class CD for additional DOM/SAX examples
■ XPath,… 3rd-party extensions (see xml-sig page)
■ O’Reilly book: Python & XML
■ XML-RPC: xmlrpclib in Python std lib (xmlrpc.client in 3.X) – XML coded data over HTTP
■ SOAP Protocol: PySoap, Soapy 3rd-party extensions
● See also: JSON example in Database unit, the successor to XML?