13. Text processing

 

 

 

 

Topics

 

 

¨  String objects

¨  Splitting strings

¨  Regular expressions

¨  Parsing languages

¨  XML parsing (see Internet section)

 


 

String objects: review

 

 

¨   Handle basic text processing tasks

¨   Operations: slicing, concatenation, indexing, etc.

¨   String methods: searching, replacement, splitting, etc.

¨   Built-in string functions: ord()

¨   Running code strings: ‘eval’, ‘exec’, ‘execfile’

¨   Unicode (16-bit character) wide strings supported as of Python 2.0

 

 

 

>>> text = "Hello world"

>>> text = 'M' + text[1:6] + 'World'

>>> text

'Mello World'

>>> exec 'print "J" + text[1:]'

Jello World
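The string tools listed above can be sketched in one short script. This is an illustration only: it is written for Python 3 (an assumption; the session above is Python 2), and the sample values and variable names are made up for the example.

```python
# Python 3 sketch of the string tools listed above; all values are illustrative.
text = "Hello world"

pos = text.find("world")                  # searching: index of substring, or -1
swapped = text.replace("world", "SPAM")   # replacement: returns a new string
words = text.split()                      # splitting around whitespace
code = ord(text[0])                       # built-in: character -> integer code
back = chr(code)                          # inverse of ord()

# Running a code string: eval evaluates an expression and returns its value.
# The names the expression may use are passed in explicitly here.
value = eval("len(text) * 2", {"text": text})

print(pos, swapped, words, code, back, value)
```

Strings are immutable, so find, replace and split all leave text unchanged and return new objects.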

 


 

Splitting and joining strings

 

 

¨   str.split returns a list of columns, split around whitespace by default

¨   str.split allows arbitrary delimiters to be used

¨   str.join puts string lists back together

¨   eval converts column strings to Python objects
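A minimal sketch of the four points above, written for Python 3 (an assumption) with made-up sample strings. Note that eval runs whatever expression it is given, so it is only suitable for trusted input.

```python
# Python 3 sketch: split, join, and eval on column strings (sample data invented).
line = "1  2.5   3+4"

cols = line.split()                  # no argument: split around runs of whitespace
values = [eval(c) for c in cols]     # each column string becomes a Python object

fields = "a,b,,c".split(",")         # explicit delimiter: empty fields are kept
joined = "-".join(fields)            # join puts the list back together

print(cols, values, fields, joined)
```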

 

 

 

 

Example: column sum alternatives

 

 

 

 

 

Example: summing columns in a file

 

 

# see also: newer column summer code at end of Basic Statements chapter

 

 

file: summer.py

import sys

 

def summer(numCols, fileName):

    sums = [0] * numCols

    for line in open(fileName, 'r'):

        cols = line.split()

        for i in range(numCols):

            sums[i] += eval(cols[i])      # any expression will work!

    return sums

 

if __name__ == '__main__':

    print summer(eval(sys.argv[1]), sys.argv[2])
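One possible variation on summer.py, sketched for Python 3 (an assumption). It differs from the listing above in two ways that are choices made for this example, not part of the original: it accepts any iterable of lines (an open file works too), and it converts columns with float() instead of eval, which gives up the "any expression will work" feature in exchange for not executing the file's contents.

```python
# Python 3 sketch: column summer that uses float() instead of eval.
def summer(num_cols, lines):
    """Sum the first num_cols whitespace-separated columns over all lines."""
    sums = [0.0] * num_cols
    for line in lines:
        cols = line.split()
        if not cols:                     # skip blank lines
            continue
        for i in range(num_cols):
            sums[i] += float(cols[i])    # numbers only, no arbitrary expressions
    return sums

data = ["1 2 3", "4 5 6", "", "7 8 9"]   # stands in for the lines of a file
print(summer(3, data))
```

To run it against a real file, pass the open file object as the second argument, for example summer(3, open(name)).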

 

 

 

 

Example: replacing substrings

 

file: replace.py

# manual global substitution

 

def replace(text, old, new):

    parts = text.split(old)       # XoldY  -> [X, Y]

    return new.join(parts)        # [X, Y] -> XnewY
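A short usage sketch for the split/join substitution, written for Python 3 (an assumption). The function is restated so the snippet is self-contained; the parameter names and sample text are illustrative. The result is compared with the built-in str.replace method.

```python
# Python 3 sketch: split/join substitution compared with str.replace.
def replace(text, old, new):
    parts = text.split(old)      # XoldY  -> [X, Y]
    return new.join(parts)       # [X, Y] -> XnewY

sample = "spam and spam and eggs"
result = replace(sample, "spam", "ham")

print(result)
print(result == sample.replace("spam", "ham"))
```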

 

 

 

 

Example: analyzing data files

 

 

·        Collect all entries for keys on right

·        Data file contains “histogram” data

 

 

 

% cat histo1.txt

1       one

2       one

3       two

7       three

8       two

10      one

14      three

19      three

20      three

30      three

 

 

 

 

% cat histo.py

#!/usr/bin/env python

import sys

 

entries = {}

for line in open(sys.argv[1]):

    [left, right] = line.split()

    try:                               

        entries[right].append(left)        # or use has_key, or get

    except KeyError:                       # e[r] = e.get(r, []) + [l]

        entries[right] = [left]

 

for (right, lefts) in entries.items():

    print "%04d '%s'\titems => %s" % (len(lefts), right, lefts)

 

 

 

 

% histo.py histo1.txt

0003 'one'      items => ['1', '2', '10']

0005 'three'    items => ['7', '14', '19', '20', '30']

0002 'two'      items => ['3', '8']
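The comments in histo.py mention has_key and get as alternatives to the try/except. A sketch of both follows, written for Python 3 (an assumption), where the "in" operator takes the place of has_key. The input lines are a shortened copy of histo1.txt.

```python
# Python 3 sketch: two alternatives to try/except for collecting entries.
lines = ["1 one", "2 one", "3 two", "7 three", "8 two"]

# Membership test first ("in" replaces has_key in Python 3)
by_in = {}
for line in lines:
    left, right = line.split()
    if right in by_in:
        by_in[right].append(left)
    else:
        by_in[right] = [left]

# dict.get with a default: builds a new list on every line
by_get = {}
for line in lines:
    left, right = line.split()
    by_get[right] = by_get.get(right, []) + [left]

print(by_in == by_get, by_get)
```

The get form is the shortest, but it copies the list each time; the other forms append in place.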

 


 

 

 

Regular expressions

 

 

¨   For matching patterns in strings

¨   Matched substrings may be extracted after a match as “groups”

¨   Compiled regular expressions are first-class objects: optimization

¨   Now supported by the ‘re’ standard module: Perl5-style patterns

¨   Supports non-greedy operators, character classes, etc.

¨   Older options: the ‘regex’ and ‘regsub’ modules, with emacs/awk/grep-style patterns
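To make the non-greedy and character-class points concrete, here is a small sketch written for Python 3 (an assumption), using only the standard re module. The sample strings are invented for the example.

```python
# Python 3 sketch: greedy vs non-greedy matching, and a character class.
import re

text = "<b>bold</b> and <i>italic</i>"

greedy = re.match(r"<(.*)>", text).group(1)    # .* runs to the LAST '>'
lazy = re.match(r"<(.*?)>", text).group(1)     # .*? stops at the FIRST '>'

digits = re.findall(r"[0-9]+", "a1b22c333")    # character class, one or more

print(greedy)
print(lazy)
print(digits)
```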

 

 

 

Basic interface

 

 

>>> import re

 

>>> mobj = re.match('Hello(.*)world', 'Hello---spam---world')

>>> mobj.group(1)

'---spam---'

 

>>> pobj = re.compile('Hello[ \t]*(.*)')

>>> mobj = pobj.match('Hello    SPAM!')

>>> mobj.group(1)

'SPAM!'

 

 

 

 

Example: searching C files

 

¨       Finds #include and #define lines in a C file

 

 

Operators

¨     X+        repeat X one or more times

¨     X*        repeat X zero or more times

¨     [abc]    any of a or b or c

¨     (X)       keep substring that matches X (“group”)

¨     ^X        match X at start of line

 

 

Methods

¨     re.compile            precompiles expression into pattern object

¨     patternobj.match  returns match object, or None if match fails

¨     matchobj.group     returns matched substring[i] (pattern part in parens)

¨     matchobj.span      returns start/stop indexes of match substring[i]

¨     also has methods for replacement, nongreedy match operators,…
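The methods listed above, plus one of the replacement methods (sub), can be sketched as follows. This is written for Python 3 (an assumption); the pattern and the sample line are invented for the illustration.

```python
# Python 3 sketch: compile, match, group, span, and sub on one sample line.
import re

patt = re.compile(r"^([a-z]+)[ \t]*=[ \t]*([0-9]+)")   # "name = number"
line = "width =  42  # pixels"

m = patt.match(line)               # match object, or None if the match fails
name = m.group(1)                  # substring matched by the first (...)
value = m.group(2)                 # substring matched by the second (...)
start, stop = m.span(2)            # start/stop indexes of the second group

miss = patt.match("# comment")     # no match: returns None

reset = patt.sub(r"\1=0", line)    # replacement: \1 refers back to group 1

print(name, value, (start, stop), miss, reset)
```

Slicing line[start:stop] gives the same text as group(2), which is the equivalence the cheader.py listing relies on.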

 

 

 

 

 

file: cheader.py

#! /usr/local/bin/python

import sys, re

 

pattDefine = re.compile(                               # compile to pattobj

    '^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')   # "# define xxx yyy..."

 

pattInclude = re.compile(

    '^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/\.]+)')     # "# include <xxx>..."

 

def scan(file):

    count = 0

    while 1:                                     # scan line-by-line

        line = file.readline()

        if not line: break

        count = count + 1

        matchobj = pattDefine.match(line)        # None if match fails

        if matchobj:

            name = matchobj.group(1)             # substrings for (...) parts

            body = matchobj.group(2)

            print count, 'defined', name, '=', body.strip()

            continue

        matchobj = pattInclude.match(line)

        if matchobj:

            start, stop = matchobj.span(1)       # start/stop indexes of (...)

            filename = line[start:stop]          # slice out of line

            print count, 'include', filename     # same as matchobj.group(1)

 

if len(sys.argv) == 1:

    scan(sys.stdin)                    # no args: read stdin

else:

    scan(open(sys.argv[1], 'r'))       # arg: input file name
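For readers on Python 3, the same scan can be sketched as below. This is a restatement under assumptions, not the original script: it takes any iterable of lines, returns tuples instead of printing, uses raw strings for the patterns, and the names patt_define, patt_include and the sample source lines are made up for the example.

```python
# Python 3 sketch of the cheader.py scan; returns results instead of printing.
import re

patt_define = re.compile(r'^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')
patt_include = re.compile(r'^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/.]+)')

def scan(lines):
    found = []
    for count, line in enumerate(lines, start=1):    # count is the line number
        m = patt_define.match(line)                  # None if match fails
        if m:
            found.append((count, 'defined', m.group(1), m.group(2).strip()))
            continue
        m = patt_include.match(line)
        if m:
            found.append((count, 'include', m.group(1)))
    return found

source = ['#include <stdio.h>\n',
          '# define MAX 100\n',
          'int x;\n',
          '#include "my/util.h"\n']

for item in scan(source):
    print(item)
```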

 

 

 

 


 

Parsing languages

 

 

¨   For more demanding languages: regular expressions have no “memory” for nested constructs

¨   Recursive descent parsers: see YAPPS parser generator

¨   Parser generators: ‘bison’ wrapper, PyParsing, SPARK, kwParsing, etc. (see the web)

¨   NLTK: Natural Language Toolkit for Python, AI and statistical tools
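To show what a recursive descent parser looks like, here is a hand-written sketch for arithmetic expressions with nested parentheses, the kind of nesting a single regular expression cannot track. It is written for Python 3 (an assumption), does not use YAPPS or any of the generators named above, and its grammar and function names are invented for this illustration. A regular expression is still used for the simple job of splitting the input into tokens.

```python
# Python 3 sketch: recursive descent parser/evaluator for + - * / and parentheses.
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")       # an integer or a single operator

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError("bad character at index %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def expr():                      # expr := term (('+' | '-') term)*
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():                      # term := factor (('*' | '/') factor)*
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():                    # factor := NUMBER | '(' expr ')'
        if peek() is None:
            raise SyntaxError("unexpected end of input")
        tok = advance()
        if tok == "(":
            value = expr()           # recursion handles arbitrary nesting
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return value
        if tok.isdigit():
            return int(tok)
        raise SyntaxError("unexpected token %r" % tok)

    result = expr()
    if peek() is not None:
        raise SyntaxError("trailing input")
    return result

print(parse("2 * (3 + 4) - 1"))
```

Each grammar rule becomes one function, and the call stack supplies the "memory" needed to match each closing parenthesis with its opener.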

 

 

 

 

 

Lab Session 10

 

Click here to go to lab exercises

Click here to go to exercise solutions

Click here to go to solution source files

 

Click here to go to lecture example files