File: cgi/

==============================================================================
On a URL query from a client, display any text file 
in an HTML page, auto-scrolled horizontally, with a raw-text link.

Author/Copyright: 2018, M. Lutz
License: provided freely, but with no warranties of any kind.

Version: Jun 26, 2018 - allow non-ASCII filenames in Python 2.X, "UTF-8".
History: Jun 18, 2018 - note on avoiding explicit URLs for offline use.
         Apr 23, 2018 - robots.txt handling notes.
         Apr 15, 2018 - readme.txt note, special-case bad filenames.
         Feb 23, 2018 - initial release, for mobile site redesign.

This is a Python CGI script: it runs on a web server, reads URL "?"
query parameters, and prints HTTP headers and HTML or plain text to the 
client.  It runs on both Python 2.X and 3.X (2.X on its current host).


When invoked by explicit URL or Apache rewrite, this script dynamically 
builds a reply page containing the subject file's text - as either plain 
text or formatted HTML with uniform styling and bi-directional scrolling.

While broadly useful, this is done primarily for ease of viewing on small
screens (e.g., mobile devices).  Else, text may be too small to read, 
without tedious zooms and scrolls; worse, it may be line-wrapped, which 
is awful for program code.

For HTML replies, links to view and save the file's raw (plain) text are
also generated as options for browsers that handle them well (e.g., opening
text in a local editor).  As installed, this script is automatically run 
for _every_ ".py" and ".txt" file on this site accessed directly, per the
invocation schemes up next.


This script is run by both explicit HTML links, and automatic Apache
rewrite rules.  In general, it is invoked with a URL of this form, 
where the subject file's name appears as a query-string parameter:

The site name can be relative in links as usual, and the subject file 
is assumed to live in ".." (the site root, above the cgi/ folder of 
this script), so links in HTML files are coded this way when explicit:

  <A HREF="cgi/"></A>

The current site uses a few of the explicit links above, but also uses
Apache URL rewrite rules in .htaccess files to automatically route all
other requests for both "*.py" and "*.txt" files to this script.  These 
rules use PCRE matching patterns to map basic URLs to the form above 
automatically, thereby avoiding many manual link edits. 

For example, the following rewrite rule maps all URLs not starting in 
'cgi/' and ending in '.py', '.txt', or '.pyw' to the script, thus 
handling all direct Python and text file links, while skipping script 
invocations (other extensions, including '.css', require explicit URLs):

  rewriterule ^(?!cgi\/)(.*).(py|txt|pyw)$ 

This works, but makes raw-text support complex.  Because the Apache
rule maps *all* Python and text file links to the script's URL (and 
it's weirdly difficult to prevent a rewrite of a rewrite in Apache), 
this script also supports a "rawmode" parameter, primarily for use 
in the template file's URLs meant to fetch a raw-text copy:

  <A HREF="cgi/"></A>

  <A HREF="cgi/"></A>

A "rawmode=view" triggers inline plain-text output in this script 
instead of HTML; its effect is the same as a direct file link sans
rewrites.  A "rawmode=save" sends plain text as attachment, which asks 
browsers to save immediately; where supported, this is arguably easier
and more reliable than cmd/ctl-A+C to select text, or link rightclicks.


The only files that _require_ an explicit cgi/ URL for display
are those in this script's folder (cgi/), or otherwise not matched by the
Apache rewrite rule.  In the companion HTML template file, for example, 
the self-display links must be explicit URLs.  For scripts, their names'
appearance in showcode URL query parameters also avoids invocation.  

Similarly, CSS files are deliberately not matched by the rewrite rule 
to avoid mutating their code when requested by the browser, and hence 
require explicit showcode URLs for formatted display.  All other files 
can be displayed by either explicit URL _or_ the Apache rewrite rule.

Although explicit cgi/ URLs always work when a server is 
present, this site is careful to use them _only_ when required, per 
the rules above.  This better supports offline viewing in the absence 
of Apache URL rewrites (else explicit URLs display script, not target).

Though convenient, Apache rewrite rules also may complicate the handling
of auto-index README files and crawler-directive "robots.txt" files, 
but these are both subtle enough to warrant their own sections.


Besides making raw-text support complex, the Apache rewrite rule also 
breaks "README.txt" files in auto-generated index pages; their text no
longer appears on the index page, and their names are not shown in 
index lists (the leading theory is that their names are rewritten,
and mod_autoindex doesn't like the HTML reply it gets back).

This can be addressed by coding manual "index.html" pages.  But it's 
simpler to rename or copy to "README.html" with a <PRE> or <P> around
the file's text and a "ReadmeName README.html" in the .htaccess file. 
For less-important cases, rename to "_README.txt" and let the user 
click if they really wish to view; a script can easily automate this:
see for an example.
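
A minimal sketch of such a rename pass, assuming the whole site lives under 
one local folder. This is not the site's own script: the function name and 
its defaults are made up here for illustration.

```python
import os

def hide_readmes(root, old='README.txt', new='_README.txt'):
    """Rename every README.txt under root to _README.txt; return new paths."""
    renamed = []
    for dirpath, dirnames, filenames in os.walk(root):
        if old in filenames:
            target = os.path.join(dirpath, new)
            if not os.path.exists(target):      # never clobber an existing file
                os.rename(os.path.join(dirpath, old), target)
                renamed.append(target)
    return renamed
```

Run it as hide_readmes('.') from the site root before uploading; a second 
run is harmless, because already-renamed files no longer match.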

UPDATE, Apr-2018: oddly, this story differs at a new host to which this
site was recently moved.  On the new server (only), auto-index pages 
list README.txt files, but do not display their content inline, even if 
named in ReadmeName directives.  Hence, is not required,
and if used must be accommodated by IndexIgnore to hide any _README*s.
Alas, Apache's wildly-implicit design yields radically-variable hosts!

In retrospect, READMEs might also have been ruled out on the former 
host by enhanced patterns, per the robot files tip of the next section.

UPDATE, Jul-2018: README.txt files have once again vanished from 
autoindex pages on the site hosting this script due to an unknown 
Godaddy apache-server change, and eliminating README.txt files in 
the rewrite pattern had no effect.  Hence, _README.txt files and 
their script have been reinstated (see .htaccess).
Lesson: Apache servers are brittle and hosting providers are worse.


If your site uses a "robots.txt" file to give guidance to crawlers, you
MAY want to configure your Apache rewrite rules to avoid routing them 
to this script for formatting.  Otherwise, crawlers may invoke a URL
like this and receive an HTML page in response:

To avoid this, either expand the match pattern to disqualify this filename,
or add a rule to match the name and prune further rewrite processing if 
possible (e.g., with an L or END action code?).

For example, the following rule in .htaccess successfully excludes robot
files by using a lookahead negative assertion with a nested non-capturing
alternation, plus two capturing groups (yes, yuck):

  rewriterule ^(?!(?:cgi\/|.*robots.txt))(.*).(py|txt|pyw)$ 
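
The two match patterns quoted in this file can be compared offline with 
Python's re module, whose lookahead syntax is close enough to PCRE for this
purpose. This is only a sketch: the sample paths are hypothetical, and 
Apache's own processing (rule order, flags, per-directory path stripping) 
is not modeled.

```python
import re

# Patterns copied from the rewrite rules quoted in this file.
BASIC  = re.compile(r'^(?!cgi\/)(.*).(py|txt|pyw)$')
ROBOTS = re.compile(r'^(?!(?:cgi\/|.*robots.txt))(.*).(py|txt|pyw)$')

def routed(pattern, path):
    """Return True if the pattern would send this path to the script."""
    return pattern.match(path) is not None

# Hypothetical site-relative paths, for illustration only.
samples = ['notes.txt', 'tools/script.py', 'cgi/script.py',
           'style.css', 'robots.txt', 'sub/robots.txt']

for path in samples:
    print('%-16s basic=%-5s robots-aware=%s' %
          (path, routed(BASIC, path), routed(ROBOTS, path)))
```

Only the two robots.txt rows should differ between the patterns.  Note that
the "." before the extension group is unescaped in both, so it matches any 
character: a name like "happy" also qualifies; "\." would be stricter.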

But the following alternative placed before the showcode rewrite rule 
does not work on my server, for reasons TBD (this is subtle business):

  rewriterule ^(.*)robots.txt$ "http\:\/\/learning-python\.com\/$1robots.txt" [L]

If your site does NOT use a robots.txt (and this script's site does not),
you probably don't need to care: the error-reply HTML page this script 
issues when a missing robots.txt is requested should be harmless to your
search visibility and crawling results.  Per:

unrecognized content in HTML replies is simply ignored; which makes the
reply equivalent to an empty file; which is the same as no robots.txt 
at all; which means "crawl everything here."  Redirects may also be 
followed, if your showcode rewrite rule uses one (this site's doesn't).

Note that it's possible that crawlers may still recognize the directives 
text in a robots file even if it HAS been formatted as HTML for display 
by this script.  This would make the above tricks unnecessary, but was
not tested because this site doesn't use these files; your site may vary. 

Disclaimer: this is based on Google behavior which other crawlers may or
may not mimic, and your robots.txt resolution may have to be applied to
any other admin files on your site that match the showcode rewrite rule
(e.g., sitemaps?).  While this script could support a list of such files 
forcibly returned as plain-text or 404 error codes (see,
it's easier to delegate to servers by coding rules to exclude such files.


When loading code files, this script tries a set of Unicode encodings 
in turn, until one works or all fail.  Most Python and text files on 
this site are UTF8 (or its ASCII subset), but a few Latin-1 files crop
up as examples.  The UNICODE_IN encodings list reflects this, but may 
be changed for use elsewhere.  Once loaded, code text is just decoded 
codepoints in memory, and is always output as UTF8-encoded bytes in 
reply pages.  Doing so portably for both Python 3.X and 2.X is possible
but subtle; see the Jun-26-18 notes ahead.
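
The try-in-turn idea can be sketched on in-memory bytes, apart from any file
handling. This is an illustration, not the script's loader: decode_first is 
a made-up helper, and its default list merely mirrors UNICODE_IN.

```python
def decode_first(data, encodings=('UTF8', 'latin1')):
    """Decode bytes with the first encoding that works.

    Returns (text, encoding), or (None, None) if every encoding fails.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeError:
            continue            # try the next encoding on the list
    return None, None

utf8_bytes   = u'caf\u00e9'.encode('utf-8')     # valid UTF-8
latin1_bytes = u'caf\u00e9'.encode('latin-1')   # b'caf\xe9': invalid as UTF-8

print(decode_first(utf8_bytes))
print(decode_first(latin1_bytes))
```

Order matters: Latin-1 maps every byte to some character and so never 
fails, which makes it a catch-all that belongs last on the list.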


This script can also be invoked by URL in the "action" attribute of a form 
in an HTML page; could be submitted by a script (see Python's urllib);
and might work as an Apache handler (to be explored).
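
For the script-submitted case, a request URL can be assembled with urllib's 
urlencode, which applies the same quote_plus escaping this script uses for 
its own links. A sketch only: SCRIPT_URL is a hypothetical placeholder, not 
this site's address, and request_url is a made-up helper; "name" and 
"rawmode" are the parameters this script reads.

```python
try:
    from urllib.parse import urlencode      # Python 3.X
except ImportError:
    from urllib import urlencode            # Python 2.X

# Hypothetical address: substitute the real site and script path.
SCRIPT_URL = 'http://example.com/cgi/script.py'

def request_url(name, rawmode=None):
    """Build a query URL for a subject file, optionally in raw-text mode."""
    params = [('name', name)]
    if rawmode:
        params.append(('rawmode', rawmode))     # 'view' or 'save'
    return SCRIPT_URL + '?' + urlencode(params)

print(request_url('dir/file name.py'))
print(request_url('dir/file name.py', rawmode='view'))
```

The resulting string can be passed to urlopen (urllib.request in 3.X, 
urllib2 in 2.X) to fetch the reply.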


As is, this script reflects a number of tradeoffs:

-Its code must run on the Python 2.X version which is default on the host.
-Its footer code must avoid copies of text normally generated by genhtml.
-Its error checking is minimal, as it is used only in well-known contexts.
-Its ".." assumption for subject files' paths is not very general.
-Its Apache rewrite rule breaks "README.txt" in index pages (see above).
-Its Apache rewrite rule may complicate robots.txt handling (see above).
-Its always-UTF8 output policy means others are converted to this on saves.

OTOH, it works as intended, and demos CGI; expand and improve as desired.

import cgitb; cgitb.enable()      # route python exceptions to browser/client
import cgi, os, sys, codecs

UsingPython3 = sys.version[0] == '3'
UsingPython2 = sys.version[0] == '2'

if UsingPython3:                                      # py 3.X/2.X compatible
    from html import escape as html_escape            # run on 2.X only to date
    from urllib.parse import quote_plus
elif UsingPython2:
    from cgi import escape as html_escape             # for text added to HTML 
    from urllib import quote_plus                     # for text added to URL
else:
    assert False, 'The future remains to be written'

# Switches and constants

# Jun-26-18 note: browsers allow "UTF8" but "UTF-8" is technically the HTML 
# encoding name; python allows synonyms, including both "UTF-8" and "utf8";
# for background, see:;

MOCK_VALUES = False                 # 1=simulate parsed inputs
MOCK_SERVER = False                 # 1=simulate client request

UNICODE_IN  = ['UTF8', 'latin1']    # try in turn for code file content
UNICODE_OUT = 'UTF-8'               # for text in generated reply page

TEMPLATE    = 'showcode-template.txt'    # the reply-page format
FOOTER      = '../dummy-footer.html'     # site-wide footer code

trace = lambda *args: None  # or print, to display on stdout 

# Get input filename (and raw-text mode?) sent from the client

# Parse and/or forge request

if MOCK_VALUES:
    # simulate parsed request for testing
    class Mock: 
        def __init__(self, value):
            self.value = value
    form = dict(name=Mock(''))    # + rawmode=Mock('view')?
else:
    if MOCK_SERVER:
        # jun18: simulate post-server, pre-pylibs state
        os.environ['CONTENT_TYPE'] = 'Content-type: application/x-www-form-urlencoded'
        os.environ['QUERY_STRING'] = (

    form = cgi.FieldStorage()         # parse form/url input data

# Extract request inputs

if 'name' not in form:
    name = 'cgi/'          # show myself: more useful 
    # disabled alternative: custom error reply = hdr + blankline=\n + msg
    # print('Content-type: text/plain\n')
    # print('Please provide a value for "name" in the request.')
else:
    name = form['name'].value         # real or mocked, pathname relative to '..'

if 'rawmode' not in form:
    rawmode = False
else:
    rawmode = form['rawmode'].value   # 'view' or 'save' or absent=formatted

# Load the code from a file in "..", in 1 of N Unicode encodings

# "name" may be a basename or a pathname relative to ".." (site root);
# both open()/read() flavors retain \r on Windows, decode to codepoints,
# and return a Unicode object: a py2 u'xx' unicode, or a py3 'xx' str; 
# tries N Unicode types for input, but always outputs as UTF8 bytes;
# os.path.isfile() implies os.path.exists(): don't need both;

path = '..' + os.sep + name

if not os.path.isfile(path):
    code = (u'Error: file does not exist or is not a file.\n\n'
             'Please verify the filename in your request.\n')     # apr18

else:
    for tryenc in UNICODE_IN:
        try:
            if UsingPython3:
                code = open(path, mode='r', encoding=tryenc, newline='').read()
            else:
                code = codecs.open(path, mode='r', encoding=tryenc).read()
        except Exception:
            pass     # try next encoding on list
        else:
            break    # load successful: skip else
    else:
        # loop ended without a break: every encoding failed
        code = (u'Error: could not open file.\n\n'
                 "Please adjust the script's UNICODE_IN list.\n")   # jun18

# Load and expand the HTML template

# It's okay to load the template as str, even though "code" is unicode:
# in py3 they're the same - both are str, which is always Unicode text;
# in py2 they differ, but str is coerced up - '%s' % u'spam' => u'spam',
# even for dicts: see;

# Jun-26-18 - but we must decode strs to unicode for py2 if they may be
# non-ASCII, because py2 expects strs to be all-ASCII whenever mixed with 
# unicodes; this applies only to "namehtml", a str in 2.X encoded as UTF-8
# per web conventions: "code" is already unicode, "nameurl" URL-escapes any
# non-ASCIIs, and "footer" and "template" are assumed to be ASCII files;  
# even if name is decoded, error-message "code" must be all-ASCII str or 
# unicode too - use u'xxx' above in both 2.X and 3.X (3.X requires 3.3+);

if rawmode:
    reply = code    # send text as is

else:
    template = open(TEMPLATE).read()        # template file in '.', ASCII only

    codehtml = html_escape(code)            # HTML-escape any characters special in HTML
    namehtml = html_escape(name)            # template also hardcodes some URL escapes
    # no longer need to strip 'cgi/' here: used in URL query, not raw link
    nameurl = quote_plus(name)              # URL-escape this: added to query in template 

    footer = open(FOOTER).read()            # load dummy ASCII generated footer html in ..
    for link in ('HREF', 'href', 'SRC'):    # munge it to add ".." to all nested item refs
        old = '%s="'    % link              # this avoids copying code (see template text)
        new = '%s="../' % link
        footer = footer.replace(old, new)

    for undo in ('mailto', '#'):            # undo up-rerouting for two special-cases 
        new = 'HREF="../%s' % undo          # still beats maintaining copied code...
        old = 'HREF="%s'    % undo
        footer = footer.replace(new, old)

    trace(type(namehtml), namehtml)

    # jun-26-18: allow non-ASCII filenames in Python 2.X (see above)
    if UsingPython2:
        namehtml = namehtml.decode(UNICODE_OUT)

    trace(type(namehtml), namehtml)

    reply = template % dict(                # unicode reply: replace template targets
                __NAME__     = namehtml,    # the sent and escaped filename
                __NAMEURL__  = nameurl,     # the filename for raw-text link
                __CODE__     = codehtml,    # the loaded and escaped Unicode text 
                __FOOTER__   = footer)      # the munged dummy generated toolbar html

trace(type(reply), reply[:40])

# Print the reply stream back to the client

# Write UTF8-encoded bytes, use "charset" to force Unicode type to match;
# "inline" is always view, but may require cmd/ctl-A+C to save contents;
# "attachment" is usually save, but opens may fail on some platforms, and
# this is just view on others (notably, iOS: there's no user file access);

if not rawmode:
    contenthdr  = 'Content-type: text/html; charset=%s' % UNICODE_OUT
else:
    dispostype  = 'inline' if rawmode == 'view' else 'attachment'
    basename    = os.path.basename(name)
    contenthdr  = 'Content-type: text/plain; charset=%s\n' % UNICODE_OUT
    contenthdr += 'Content-Disposition: %s; filename="%s"' % (dispostype, basename)

replybytes = reply.encode(UNICODE_OUT)      # send encoded bytes: print is iffy

print(contenthdr)                           # reply = hdrs + blankline + html
print('')                                   # need '' for 2.X, else a tuple!
if UsingPython2:
    sys.stdout.write(replybytes)            # py2 accepts a str for the bytes
else:
    sys.stdout.buffer.write(replybytes)     # py3 stdout is str: use io layer  
