13. Text Processing

 

 

 

 

Topics

 

 

     String objects

     Splitting strings

     Regular expressions

     Parsing languages

     XML parsing

 


 

String objects: review

 

 

     Handle basic text processing tasks

     Operations: slicing, concatenation, indexing, formatting, etc.

     String methods: searching, replacement, splitting, etc.

     Built-in string functions: ord(C) → ASCII code in 2.X, Unicode code point in 3.X

     Running code strings: ‘eval’, ‘exec’ (2.X+3.X), ‘execfile’ (2.X)

     Unicode (possibly-wide) strings supported in Python 2.0+

     u’xxx’ in 2.X, ‘xxx’ in 3.X; encoding/decoding in memory or on I/O (see final unit)

 

 

 

>>> text = "Hello world"

>>> text = 'M' + text[1:6] + 'World'

>>> text

'Mello World'

>>> exec('print("J" + text[1:])')

Jello World
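The operations reviewed above can be tried together in a short script. This is a minimal sketch for Python 3.X; the variable names are illustrative only and are not part of the original session.

```python
# String review sketch (Python 3.X); all names here are illustrative.
text = "Hello world"

# Slicing and concatenation build new strings; strings are immutable.
text = 'M' + text[1:6] + 'World'

# Methods: searching, replacement, splitting.
where = text.find('World')               # offset of substring, or -1 if absent
swapped = text.replace('Mello', 'Jello')
words = text.split()

# Formatting expression and formatting method give the same result here.
line1 = '%s has %d chars' % (text, len(text))
line2 = '{0} has {1} chars'.format(text, len(text))

# ord and chr convert between a character and its code point.
code = ord('M')
char = chr(code)

# eval runs an expression string and returns its value.
value = eval('len(text) * 2')

# Encoding and decoding between str and bytes in memory.
raw = text.encode('utf-8')
back = raw.decode('utf-8')

print(text, where, swapped, words, code, char, value, back)
```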

 


 

Splitting and joining strings

 

 

     str.split returns a list of columns, splitting around whitespace by default

     str.split also allows arbitrary delimiters to be used

     str.join puts string lists back together

     eval converts column strings to Python objects
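The four points above can be sketched as follows. The sample strings are made up for illustration; note that eval runs any expression it is given, so float or int is the narrower choice when the column text is not trusted.

```python
# Splitting and joining sketch; the sample strings are made up.
line = 'aaa  bbb\tccc\n'
cols = line.split()                      # no argument: split on whitespace runs

record = 'bob,dev,40.5'
fields = record.split(',')               # explicit delimiter

joined = '|'.join(fields)                # put the list back together

# eval turns a column string into an object, but runs any expression,
# so use it only on trusted text; float() accepts numbers only.
pay = eval(fields[2])
safe_pay = float(fields[2])

print(cols, fields, joined, pay, safe_pay)
```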

 

 

 

 

Example: summing columns in a file

 

 

# see also: newer column summer code at end of Basic Statements unit

 

 

file: summer.py

import sys

 

def summer(numCols, fileName):

    sums = [0] * numCols

    for line in open(fileName, 'r'):

        cols = line.split()

        for i in range(numCols):

            sums[i] += eval(cols[i])      # any expression will work!

    return sums

 

if __name__ == '__main__':

    print(summer(eval(sys.argv[1]), sys.argv[2]))

 

 

 

Example: column sum alternatives
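One possible alternative, sketched here under assumptions: convert columns with float instead of eval, and size the result from the data rather than from a column count argument. The function name summer_alt and the sample data are hypothetical, and an in-memory file stands in for open(fileName).

```python
import io

def summer_alt(lines):
    """Sum whitespace-separated numeric columns from an iterable of lines."""
    sums = []
    for line in lines:
        cols = [float(col) for col in line.split()]
        if not cols:
            continue                     # skip blank lines
        if len(cols) > len(sums):        # grow to the widest row seen so far
            sums.extend([0.0] * (len(cols) - len(sums)))
        for i, num in enumerate(cols):
            sums[i] += num
    return sums

# An in-memory file stands in for a real data file (hypothetical data).
sample = io.StringIO('1 2 3\n4 5 6\n\n7 8\n')
totals = summer_alt(sample)
print(totals)
```

Unlike the eval version, this rejects anything that is not a number with a ValueError, and it tolerates rows of different widths.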

 

 

 

 

 

Example: replacing substrings

 

file: replace.py

# manual global substitution: same as str.replace(old, new)

 

def replace(text, old, new):

    parts = text.split(old)       # XoldY  -> [X, Y]

    return new.join(parts)        # [X, Y] -> XnewY
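A quick check that the split-and-join approach agrees with the built-in str.replace method. The function is restated here, with parameter names that do not shadow built-ins, so that the snippet runs on its own; the sample string is made up.

```python
def replace(text, old, new):
    parts = text.split(old)              # XoldY  -> [X, Y]
    return new.join(parts)               # [X, Y] -> XnewY

sample = 'spam and eggs and ham'
result = replace(sample, ' and ', ', ')
same = (result == sample.replace(' and ', ', '))
print(result)
print(same)
```

One difference to keep in mind: split raises ValueError for an empty separator, while str.replace accepts an empty old string.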

 

 

 

 

Example: analyzing data files

 

 

      Collect all entries for keys on right

      Data file contains “histogram” data

 

 

 

% cat histo1.txt

1       one

2       one

3       two

7       three

8       two

10      one

14      three

19      three

20      three

30      three

 

 

 

 

% cat histo.py

#!/usr/bin/env python

import sys

 

entries = {}

for line in open(sys.argv[1]):

    [left, right] = line.split()

    try:

        entries[right].append(left)        # or use 'in' test, or get

    except KeyError:                       # e[r] = e.get(r, []) + [l]

        entries[right] = [left]

 

for (right, lefts) in entries.items():

    print("%04d '%s'\titems => %s" % (len(lefts), right, lefts))

 

 

 

 

% histo.py histo1.txt

0003 'one'      items => ['1', '2', '10']

0005 'three'    items => ['7', '14', '19', '20', '30']

0002 'two'      items => ['3', '8']
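The comments in histo.py mention alternatives to the try/except coding. Two more are sketched here, dict.setdefault and collections.defaultdict, run over an in-memory copy of part of the data. The function names are hypothetical.

```python
from collections import defaultdict

# Part of the histogram data, held in memory for the example.
DATA = """\
1       one
2       one
3       two
7       three
8       two
10      one
"""

def collect_setdefault(lines):
    entries = {}
    for line in lines:
        left, right = line.split()
        entries.setdefault(right, []).append(left)   # create list on first use
    return entries

def collect_defaultdict(lines):
    entries = defaultdict(list)                      # missing keys start as []
    for line in lines:
        left, right = line.split()
        entries[right].append(left)
    return dict(entries)

result = collect_setdefault(DATA.splitlines())
for right, lefts in sorted(result.items()):
    print("%04d '%s'\titems => %s" % (len(lefts), right, lefts))
```

Sorting the items makes the display order independent of the dictionary's internal ordering, which differs between Python versions.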

 


 

 

 

Regular expressions

 

 

     For matching patterns in strings

     Matched substrings may be extracted after a match as “groups”

     Compiled regular expressions are first-class objects: optimization

     Now supported by the ‘re’ standard module: Perl5-style patterns

     Supports non-greedy operators, character classes, etc.

     Older options: the ‘regex’ and ‘regsub’ modules (emacs/awk/grep patterns), since removed from the standard library
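A short sketch of two of these points, groups and non-greedy operators, plus a compiled pattern object being reused. The sample text and variable names are made up for illustration.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs as far as it can, up to the last '>' in the string.
greedy = re.findall(r'<(.*)>', text)

# Non-greedy: .*? stops at the first '>' that allows a match.
lazy = re.findall(r'<(.*?)>', text)

# A compiled pattern is an object that can be stored, passed, and reused.
word = re.compile(r'[a-z]+')
first = word.search(text).group()

print(greedy)
print(lazy)
print(first)
```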

 

 

 

Basic interface

 

 

>>> import re

 

>>> mobj = re.match('Hello(.*)world', 'Hello---spam---world')

>>> mobj.group(1)

'---spam---'

 

>>> pobj = re.compile('Hello[ \t]*(.*)')

>>> mobj = pobj.match('Hello    SPAM!')

>>> mobj.group(1)

'SPAM!'

 

 

 

 

Example: searching C files

 

Finds #include and #define lines in a C file

 

 

Operators

     X+        repeat X one or more times

     X*        repeat X zero or more times

     [abc]     any of a or b or c

     (X)       keep substring that matches X (“group”)

     ^X        match X at start of string (or at start of each line with re.MULTILINE)

 

 

Methods

     re.compile          precompiles expression into pattern object

     patternobj.match    returns match object, or None if match fails

     matchobj.group      returns matched substring[i] (pattern part in parens)

     matchobj.span       returns start/stop indexes of matched substring[i]

     Also available: methods for replacement (sub) and findall, non-greedy match operators, …
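The methods listed above, along with search, findall, and sub, can be sketched on a single line of text. The pattern and the sample addresses are hypothetical.

```python
import re

patt = re.compile(r'(\w+)@(\w+)\.com')
line = 'mail bob@example.com or sue@sample.com'

nomatch = patt.match(line)               # None: match only tries at the start
mobj = patt.search(line)                 # search scans forward for a match

whole = mobj.group(0)                    # entire matched text
user, host = mobj.group(1), mobj.group(2)    # parenthesized parts
where = mobj.span(1)                     # start/stop offsets of group 1

pairs = patt.findall(line)               # list of group tuples, one per match
masked = patt.sub(r'\1 at \2', line)     # replacement with group references

print(nomatch, whole, user, host, where)
print(pairs)
print(masked)
```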

 

 

 

 

 

file: cheader.py

#!/usr/local/bin/python

import sys, re

 

pattDefine = re.compile(                               # precompile to pattobj

    r'^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')  # "# define xxx yyy..."

 

pattInclude = re.compile(

    r'^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/\.]+)')    # "# include <xxx>..."

 

def scan(file):

    count = 0

    for line in file:                                # scan line-by-line

        count += 1

        matchobj = pattDefine.match(line)            # None if match fails

        if matchobj:

            name = matchobj.group(1)                 # substrings for (...) parts

            body = matchobj.group(2)

            print(count, 'defined', name, '=', body.strip())

        else:

            matchobj = pattInclude.match(line)

            if matchobj:

                start, stop = matchobj.span(1)       # start/stop indexes of (...)

                filename = line[start:stop]          # slice out of line

                print(count, 'include', filename)    # same as matchobj.group(1)

 

if len(sys.argv) == 1:

    scan(sys.stdin)                    # no args: read stdin

else:

    scan(open(sys.argv[1], 'r'))       # arg: input file name
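To see what the two patterns pick up without a C file at hand, the same matching logic can be run over an in-memory string. This is a sketch only: the sample C text is made up, and results are collected in a list (a hypothetical variation) rather than printed directly.

```python
import re

patt_define = re.compile(r'^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')
patt_include = re.compile(r'^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/.]+)')

# Made-up C source text for the example.
SOURCE = '''\
#include <stdio.h>
# include "my/defs.h"
#define MAXLEN 80
#  define DEBUG
int main() { return 0; }
'''

found = []
for count, line in enumerate(SOURCE.splitlines(), start=1):
    mobj = patt_define.match(line)
    if mobj:
        found.append((count, 'defined', mobj.group(1), mobj.group(2).strip()))
        continue
    mobj = patt_include.match(line)
    if mobj:
        found.append((count, 'include', mobj.group(1)))

for item in found:
    print(item)
```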

 

 

 

 


 

Parsing languages

 

 

    For more demanding languages, regular expressions fall short: they have no “memory” for tracking nested constructs

    Recursive descent parsers: see YAPPS parser generator

    Parser generators: ‘bison’ wrapper, PyParsing, SPARK, kwParsing, etc. (see the web)

    NLTK: Natural Language Toolkit for Python, AI and statistical tools
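Nested constructs such as parenthesized expressions are the usual case where a single pattern falls short, because matching a closing parenthesis depends on how many were opened. A hand-coded recursive descent parser handles this with one method per grammar rule. The following is a minimal sketch for a made-up arithmetic grammar; it is not tied to any of the tools named above, and all names in it are illustrative.

```python
import re

TOKEN = re.compile(r'\s*(\d+|[-+*/()])')   # a number or a one-character operator

def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        mobj = TOKEN.match(text, pos)
        if not mobj:
            raise SyntaxError('bad character at offset %d' % pos)
        tokens.append(mobj.group(1))
        pos = mobj.end()
    return tokens

class Parser:
    # Grammar (hypothetical, for illustration):
    #   expr   -> term (('+' | '-') term)*
    #   term   -> factor (('*' | '/') factor)*
    #   factor -> NUMBER | '(' expr ')'
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        value = self.expr()
        if self.peek() is not None:
            raise SyntaxError('unexpected token %r' % self.peek())
        return value

    def expr(self):
        value = self.term()
        while self.peek() in ('+', '-'):
            if self.advance() == '+':
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ('*', '/'):
            if self.advance() == '*':
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        token = self.advance()
        if token == '(':
            value = self.expr()            # recursion handles any nesting depth
            if self.advance() != ')':
                raise SyntaxError("expected ')'")
            return value
        if token is not None and token.isdigit():
            return int(token)
        raise SyntaxError('unexpected token %r' % token)

print(Parser('2 * (3 + 4) - 10 / (1 + 1)').parse())
```

The call stack plays the role of the “memory” that a pattern lacks: each open parenthesis starts a new expr call, and the matching close parenthesis returns from it.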

 

 

 

 

 

XML Parsing

 

 

 

 

 

    Python Standard Library Support:

   SAX parsers: state-machines with class method callbacks

   DOM parsers: document object tree, with standard API for traversal

   ElementTree: Python-specific XML parser and generator

 

 

Example: Parsing XML 4 ways with patterns and basic XML tools

 

 

 

 

Given the following (narcissistic!) XML text file, mybooks.xml:

 

<books>

    <date>2009</date>

    <title>Learning Python</title>

    <title>Programming Python</title>

    <title>Python Pocket Reference</title>

    <publisher>O'Reilly Media</publisher>

</books>

 

Run a script to extract and display content of all the nested “title” tags, as follows:

 

Learning Python

Programming Python

Python Pocket Reference

 

 

 

 

PATTERNS: Basic pattern matching on the file’s text, though this can be inaccurate if the text is unpredictable. re.findall locates all places where a pattern matches in the string, and returns a list of the matched substrings corresponding to the parenthesized pattern group, or a list of tuples of such substrings if the pattern has multiple groups

 

# File patternparse.py

 

import re

text  = open('mybooks.xml').read()

found = re.findall('<title>(.*)</title>', text)

for title in found: print(title)
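The accuracy caveat shows up as soon as the input departs from one title per line. In this sketch the XML text is made up: two titles share a line and a third spans two lines. A non-greedy group with the re.DOTALL flag copes with both cases, but text inside comments or CDATA sections would still be matched wrongly, which is the argument for a real parser.

```python
import re

# Made-up XML text that the simple pattern did not anticipate.
text = """<books>
    <title>Learning Python</title><title>Programming Python</title>
    <title>Python Pocket
Reference</title>
</books>"""

# Greedy group, and '.' does not cross line breaks by default.
greedy = re.findall('<title>(.*)</title>', text)

# Non-greedy group, and DOTALL lets '.' match newlines too.
lazy = re.findall('<title>(.*?)</title>', text, re.DOTALL)

print(greedy)
print(lazy)
```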

 

 

 

 

DOM PARSING: Perform complete XML parsing with the standard library’s DOM parsing support. DOM parses XML text into a tree of objects, and provides an interface for navigating the tree to extract tag attributes and values. The interface is a formal specification, independent of Python

 

# File domparse.py

 

from xml.dom.minidom import parse, Node

xmltree = parse('mybooks.xml')

for node1 in xmltree.getElementsByTagName('title'):

    for node2 in node1.childNodes:

        if node2.nodeType == Node.TEXT_NODE:

            print(node2.data)
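The description above also mentions tag attributes. A small sketch of reading them with minidom follows; it parses an in-memory string with parseString, and the lang attribute is a made-up addition that does not appear in mybooks.xml.

```python
from xml.dom.minidom import parseString

# Made-up XML text; the 'lang' attribute is hypothetical.
xmltext = ('<books><title lang="en">Learning Python</title>'
           '<title>Programming Python</title></books>')

doc = parseString(xmltext)
titles = []
for node in doc.getElementsByTagName('title'):
    # Join the text children of this element into one string.
    text = ''.join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)
    # getAttribute returns an empty string when the attribute is absent.
    titles.append((text, node.getAttribute('lang')))

print(titles)
```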

 

 

 

 

SAX PARSING: Python’s standard library also supports SAX parsing for XML. Under the SAX model, a class’s methods receive callbacks as a parse progresses, and use state information to keep track of where they are in the document and to collect its data

 

# File saxparse.py

 

import xml.sax.handler

class BookHandler(xml.sax.handler.ContentHandler):

    def __init__(self):

        self.inTitle = False

    def startElement(self, name, attributes):

        if name == 'title':

            self.inTitle = True

    def characters(self, data):

        if self.inTitle:

            print(data)

    def endElement(self, name):

        if name == "title":

            self.inTitle = False

 

import xml.sax

parser = xml.sax.make_parser()

handler = BookHandler()

parser.setContentHandler(handler)

parser.parse('mybooks.xml')
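A variation on the same handler, sketched under assumptions: it collects titles in a list instead of printing them, and joins the text pieces for each title, since a SAX parser may deliver one element's text in several characters callbacks. The class name TitleCollector and the sample XML are hypothetical; the text is parsed from memory with xml.sax.parseString.

```python
import xml.sax
import xml.sax.handler

class TitleCollector(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.inTitle = False
        self.parts = []                  # text chunks of the current title
        self.titles = []                 # completed titles

    def startElement(self, name, attributes):
        if name == 'title':
            self.inTitle = True
            self.parts = []

    def characters(self, data):
        if self.inTitle:
            self.parts.append(data)      # may arrive in several pieces

    def endElement(self, name):
        if name == 'title':
            self.inTitle = False
            self.titles.append(''.join(self.parts))

# Made-up XML text, passed as bytes.
xmltext = (b"<books><title>Learning Python</title>"
           b"<title>Python &amp; XML</title></books>")
handler = TitleCollector()
xml.sax.parseString(xmltext, handler)
print(handler.titles)
```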

 

 

 

 

ELEMENTTREE PARSING: Finally, the xml.etree package of the standard library can often achieve the same effects as XML DOM parsers, but with less code. It is a Python-specific way to both parse and generate XML text; after a parse, its API gives access to the components of the document

 

# File etreeparse.py

 

from xml.etree.ElementTree import parse

tree = parse('mybooks.xml')

for E in tree.findall('title'):

    print(E.text)
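Since ElementTree generates XML as well as parsing it, the reverse direction can be sketched too. This example builds a small document in memory, serializes it, and parses it back; the choice of elements is illustrative and the code assumes Python 3.X.

```python
import xml.etree.ElementTree as ET

# Build a small tree in memory (contents are illustrative).
books = ET.Element('books')
ET.SubElement(books, 'date').text = '2009'
for name in ['Learning Python', 'Python Pocket Reference']:
    ET.SubElement(books, 'title').text = name
ET.SubElement(books, 'publisher').text = "O'Reilly Media"

# Serialize to a str; encoding='unicode' is the Python 3.X way to get text.
xmltext = ET.tostring(books, encoding='unicode')
print(xmltext)

# Round trip: parse the generated text and pull the titles back out.
root = ET.fromstring(xmltext)
titles = [elem.text for elem in root.findall('title')]
print(titles)
```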

 

 

 

 

The output of all four alternatives is the same under 2.6, 2.7, and 3.X

 

C:\misc> c:\python26\python domparse.py

Learning Python

Programming Python

Python Pocket Reference

 

C:\misc> c:\python30\python domparse.py

Learning Python

Programming Python

Python Pocket Reference

 

 

    Other XML-related Support

      See Extras\Code\XML on class CD for additional DOM/SAX examples

      XPath,… 3rd-party extensions (see xml-sig page)

      O’Reilly book: Python & XML

      XML-RPC: xmlrpclib in Python std lib (xmlrpc.client in 3.X) – XML coded data over HTTP

      SOAP Protocol: PySoap, Soapy 3rd-party extensions
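To make “XML coded data” concrete, the standard library's XML-RPC module can encode and decode a call without any network traffic. This is a sketch for Python 3.X; the remote method name add is hypothetical, and no server is contacted.

```python
import xmlrpc.client                     # named xmlrpclib in Python 2.X

# Encode a call to a hypothetical remote method 'add' with two arguments.
request = xmlrpc.client.dumps((2, 3), methodname='add')
print(request)

# Decode it again, as a server would on receipt.
params, method = xmlrpc.client.loads(request)
print(method, params)
```

In real use, a ServerProxy object would send this XML text in the body of an HTTP POST request and decode the XML reply.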

 

 

 

 

    See also: JSON example in Database unit, the successor to XML?

 

 

 

 

 

Lab Session 10

 

     Lab exercises

     Exercise solutions

     Solution source files

     Lecture example files