File: cgi/showcode.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
u"""
==============================================================================
showcode.py - on a URL query from a client, display any text file 
in an HTML page, auto-scrolled horizontally, with a raw-text link.

Author/Copyright: 2018, M. Lutz (learning-python.com).
License: provided freely, but with no warranties of any kind.

Version: Sep 01, 2018 - note about mixed Unicode files, latin1=>cp1252.
History: Jun 26, 2018 - allow non-ASCII filenames in Py 2.X, use "UTF-8".
         Jun 18, 2018 - note on avoiding explicit URLs for offline use.
         Apr 23, 2018 - robots.txt handling notes.
         Apr 15, 2018 - readme.txt note, special-case bad filenames.
         Feb 23, 2018 - initial release, for mobile site redesign.

This is a Python CGI script: it runs on a web server, reads URL "?"
query parameters, and prints HTTP headers and HTML or plain text to the 
client.  It runs on both Python 2.X and 3.X (2.X on its current host).


WHAT THIS SCRIPT DOES:

When invoked by explicit URL or Apache rewrite, this script dynamically 
builds a reply page containing the subject file's text - as either plain 
text or formatted HTML with uniform styling and bi-directional scrolling.

While broadly useful, this is done primarily for ease of viewing on small
screens (e.g., mobile devices).  Else, text may be too small to read, 
without tedious zooms and scrolls; worse, it may be line-wrapped, which 
is awful for program code.

For HTML replies, links to view and save the file's raw (plain) text are
also generated as options for browsers that handle them well (e.g., opening
text in a local editor).  As installed, this script is automatically run 
for _every_ ".py" and ".txt" file on this site accessed directly, per the
invocation schemes up next.


HOW THIS SCRIPT IS INVOKED:

This script is run by both explicit HTML links, and automatic Apache
rewrite rules.  In general, it is invoked with a URL of this form, 
where the subject file's name appears as a query-string parameter:

  http://learning-python.com/cgi/showcode.py?name=filename.py

The site name can be relative in links as usual, and the subject file 
is assumed to live in ".." (the site root, above the cgi/ folder of 
this script), so links in HTML files are coded this way when explicit:

  <A HREF="cgi/showcode.py?name=filename.py">filename.py</A>

The current site uses a few of the explicit links above, but also uses
Apache URL rewrite rules in .htaccess files to automatically route all
other requests for both "*.py" and "*.txt" files to this script.  These 
rules use PCRE matching patterns to map basic URLs to the form above 
automatically, thereby avoiding many manual link edits. 

For example, the following rewrite rule maps all URLs not starting in 
'cgi/' and ending in '.py', '.txt', or '.pyw' to the script, thus 
handling all direct Python and text file links, while skipping script 
invocations (other extensions, including '.css', require explicit URLs):

  rewriterule ^(?!cgi\/)(.*).(py|txt|pyw)$ 
    "http\:\/\/learning-python\.com\/cgi\/showcode\.py\?name\=$1.$2"

This works, but makes raw-text support complex.  Because the Apache
rule maps *all* Python and text file links to the script's URL (and 
it's weirdly difficult to prevent a rewrite of a rewrite in Apache), 
this script also supports a "rawmode" parameter, primarily for use 
in the template file's URLs meant to fetch a raw-text copy:

  <A HREF="cgi/showcode.py?name=filename.py&rawmode=view">filename.py</A>

  <A HREF="cgi/showcode.py?name=filename.py&rawmode=save">filename.py</A>

A "rawmode=view" triggers inline plain-text output in this script 
instead of HTML; its effect is the same as a direct file link sans
rewrites.  A "rawmode=save" sends plain text as attachment, which asks 
browsers to save immediately; where supported, this is arguably easier
and more reliable than cmd/ctl-A+C to select and copy text, or link right-clicks.
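
The rewrite rule shown earlier in this section can be tried offline with 
Python's re module, whose syntax agrees with PCRE for this pattern.  The 
following is only a sketch: rewrite_target() is a hypothetical helper, not 
part of this script; it assumes the pattern is applied to a path relative 
to the .htaccess folder; and it returns a site-relative form of the rule's 
substitution URL.  Slashes are left unescaped, which is equivalent in Python:

```python
import re

# the pattern from the rewrite rule, minus its slash escapes
REWRITE = re.compile('^(?!cgi/)(.*).(py|txt|pyw)$')

def rewrite_target(urlpath):
    # hypothetical helper: return the showcode URL for a matching path, else None
    match = REWRITE.match(urlpath)
    if match is None:
        return None
    return 'cgi/showcode.py?name=%s.%s' % match.groups()

for urlpath in ('filename.py', 'docs/notes.txt', 'cgi/showcode.py', 'styles.css'):
    print('%s => %s' % (urlpath, rewrite_target(urlpath)))
```

Only the first two paths are routed to the script.  Because the dot before 
the extension group is unescaped, it matches any character, so in this 
sketch a path such as 'happy' also matches (as name=ha.py); escaping that 
dot in the pattern would avoid this.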


MORE ON URLS:

The only files that _require_ an explicit cgi/showcode.py URL for display
are those in this script's folder (cgi/), or otherwise not matched by the
Apache rewrite rule.  In the companion HTML template file, for example, 
the self-display links must be explicit URLs.  For scripts, naming them
in a showcode URL query parameter also avoids running them on the server.

Similarly, CSS files are deliberately not matched by the rewrite rule 
to avoid mutating their code when requested by the browser, and hence 
require explicit showcode URLs for formatted display.  All other files 
can be displayed by either explicit URL _or_ the Apache rewrite rule.

Although explicit cgi/showcode.py URLs always work when a server is 
present, this site is careful to use them _only_ when required, per 
the rules above.  This better supports offline viewing in the absence 
of Apache URL rewrites (offline, explicit URLs display this script's code, 
not the target file).

Though convenient, Apache rewrite rules also may complicate the handling
of auto-index README files and crawler-directive "robots.txt" files, 
but these are both subtle enough to warrant their own sections.


APACHE AUTO-INDEX READMES: 

Besides making raw-text support complex, the Apache rewrite rule also 
breaks "README.txt" files in auto-generated index pages; their text no
longer appears on the index page, and their names are not shown in 
index lists (the leading theory is that their names are rewritten,
and mod_autoindex doesn't like the HTML reply it gets back).

This can be addressed by coding manual "index.html" pages.  But it's 
simpler to rename or copy to "README.html" with a <PRE> or <P> around
the file's text and a "ReadmeName README.html" in the .htaccess file. 
For less-important cases, rename to "_README.txt" and let the user 
click if they really wish to view; a script can easily automate this:
see learning-python.com/fix-readmes.py for an example.

UPDATE, Apr-2018: oddly, this story differs at a new host to which this
site was recently moved.  On the new server (only), auto-index pages 
list README.txt files, but do not display their content inline, even if 
named in ReadmeName directives.  Hence, fix-readmes.py is not required,
and if used must be accommodated by IndexIgnore to hide any _README*s.
Alas, Apache's wildly implicit design yields radically variable hosts!

In retrospect, READMEs might also have been ruled out on the former 
host by enhanced patterns, per the robot files tip of the next section.

UPDATE, Jul-2018: README.txt files have once again vanished from 
autoindex pages on the site hosting this script due to an unknown 
GoDaddy Apache-server change, and eliminating README.txt files in 
the rewrite pattern had no effect.  Hence, _README.txt files and 
their fix-readmes.py script have been reinstated (see .htaccess).
Lesson: Apache servers are brittle and hosting providers are worse.


CRAWLER ROBOTS.TXT FILES:

If your site uses a "robots.txt" file to give guidance to crawlers, you
MAY want to configure your Apache rewrite rules to avoid routing them 
to this script for formatting.  Otherwise, crawlers may invoke a URL
like this and receive an HTML page in response:

  http://learning-python.com/cgi/showcode.py?name=robots.txt

To avoid this, either expand the match pattern to disqualify this filename,
or add a rule to match the name and prune further rewrite processing if 
possible (e.g., with an L or END flag?).

For example, the following rule in .htaccess successfully excludes robot
files by using a lookahead negative assertion with a nested non-capturing
alternation, plus two capturing groups (yes, yuck):

  rewriterule ^(?!(?:cgi\/|.*robots.txt))(.*).(py|txt|pyw)$ 
    "http\:\/\/learning-python\.com\/cgi\/showcode\.py\?name\=$1.$2"

But the following alternative placed before the showcode rewrite rule 
does not work on my server, for reasons TBD (this is subtle business):

  rewriterule ^(.*)robots.txt$ "http\:\/\/learning-python\.com\/$1robots.txt" [L]
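
The negative-lookahead rule given earlier in this section can be checked 
offline in Python's re module, as a rough sketch.  Here, is_rewritten() is 
a hypothetical helper, and the tested paths are assumed to be relative to 
the .htaccess folder:

```python
import re

# the robots-excluding pattern from the lookahead rule, minus its slash escapes
NO_ROBOTS = re.compile('^(?!(?:cgi/|.*robots.txt))(.*).(py|txt|pyw)$')

def is_rewritten(urlpath):
    # hypothetical helper: True if the rule would route this path to showcode
    return NO_ROBOTS.match(urlpath) is not None

for urlpath in ('robots.txt', 'docs/robots.txt', 'notes.txt', 'cgi/showcode.py'):
    print('%s => %s' % (urlpath, is_rewritten(urlpath)))
```

Of these four paths, only 'notes.txt' is routed to the script; robots files 
at any folder depth fail the lookahead and are served as is.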

If your site does NOT use a robots.txt (and this script's site does not),
you probably don't need to care: the error-reply HTML page this script 
issues when a missing robots.txt is requested should be harmless to your
search visibility and crawling results.  Per:

  https://developers.google.com/search/reference/robots_txt#file-format

unrecognized content in HTML replies is simply ignored; which makes the
reply equivalent to an empty file; which is the same as no robots.txt 
at all; which means "crawl everything here."  Redirects may also be 
followed, if your showcode rewrite rule uses one (this site's doesn't).

Note that it's possible that crawlers may still recognize the directives 
text in a robots file even if it HAS been formatted as HTML for display 
by this script.  This would make the above tricks unnecessary, but was
not tested because this site doesn't use these files; your site may vary. 

Disclaimer: this is based on Google behavior which other crawlers may or
may not mimic, and your robots.txt resolution may have to be applied to
any other admin files on your site that match the showcode rewrite rule
(e.g., sitemaps?).  While this script could support a list of such files 
forcibly returned as plain-text or 404 error codes (see sitesearch.py),
it's easier to delegate to servers by coding rules to exclude such files.


UNICODE POLICIES HERE:

When loading code files, this script tries a set of Unicode encodings 
in turn, until one works or all fail.  Most Python and text files on 
this site are UTF8 (or its ASCII subset), but a few cp1252 and Latin-1
files crop up as examples.  The UNICODE_IN encodings list reflects this, 
but may be changed for use elsewhere (see also the next section).  Once 
loaded, code text is just decoded code points in memory, and is always
output as UTF8-encoded bytes in reply pages.  Doing so portably for both 
Python 3.X and 2.X is possible but subtle; see the Jun-26-18 notes ahead.
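
The try-in-turn policy can be sketched on raw bytes as follows.  This is an 
illustration only: decode_first() is a hypothetical helper rather than this 
script's code, which decodes while reading files instead.  It is written 
for Python 3.X:

```python
def decode_first(data, encodings=('UTF8', 'cp1252')):
    # hypothetical helper: return (text, encoding) for the first encoding
    # that decodes data without error, or (None, None) if all of them fail
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None

utf8bytes  = 'spam “eggs”'.encode('UTF8')      # decodes on the first try
otherbytes = 'spam “eggs”'.encode('cp1252')    # invalid as UTF8: falls through
print(decode_first(utf8bytes))
print(decode_first(otherbytes))
```

Order matters in the list: Python's Latin-1 codec accepts any bytes, so an 
encoding like it never fails and should come last if it is used at all.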


AVOID MIXED UNICODE ENCODINGS [Sep-2018]:

[There is a polished and expanded version of the following note online at
https://learning-python.com/post-release-updates.html#showcodeunicode.]

When using this script, a site's displayable text files should generally 
all use a common Unicode encoding type (e.g., UTF-8) for reliable display.
Else, it's possible that some files may be loaded per an incorrect encoding
if their data passes under other schemes.  This is especially possible if 
files use several incompatible 8-bit encoding schemes: the first on the 
encodings list that successfully loads the data will win, and may munge
some characters in the process.

This issue cropped up in an older file created with the CP-1252 (a.k.a. 
Windows-1252) encoding on Windows, whose tools have a nasty habit of 
silently using its native encodings.  This file's slanted quotes failed
to display correctly in showcode because Python happily loads the file as 
Latin-1 (a.k.a. ISO-8859-1), despite its non-Latin-1 quotes.  The loaded 
text encodes as UTF-8 for transmission, but decodes with junk bytes.
 
Here's the story in code.  Python does not allow the character '“' to be
_encoded_ as Latin-1, in either manual method calls or implicit file-object 
writes.  This quote's 0x201c Unicode code point maps to and from byte value
0x93 in Windows' CP-1252, but is not defined in Latin-1's character set:

  >>> c = '“'               # run in Python 3.X 
  >>> hex(ord(c))           # same in 2.X (using u'“', codecs.open(), print)
  '0x201c'

  >>> c.encode('cp1252')    # valid in CP-1252, but not Latin-1
  b'\x93'
  >>> c.encode('latin1')
  UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c'...

Conversely, _decoding_ this character's CP-1252 byte as Latin-1 works 
both in manual method calls and file-object reads.  Python's Latin-1 decoder
maps every byte value to the code point of the same number, so any 8-bit 
data passes; byte 0x93 becomes the obscure and unprintable "STS" C1 control 
character.  It's not a CP-1252 quote in any event:

  >>> b = b'\x93'
  >>> b.decode('cp1252')    # the proper translation
  '“'
  >>> b.decode('latin1')    # but it's not a quote in latin1
  '\x93'

  >>> n = open('temp', 'wb').write(b)
  >>> open('temp', encoding='cp1252').read()
  '“'
  >>> open('temp', encoding='latin1').read()    # <= what showcode did
  '\x93'

This is problematic in showcode, because this script relies on decoding 
failures to find an encoding that matches the data and translates its 
content to code points correctly.  Because a CP-1252 file loads without 
error as Latin-1, its UTF-8 encoding for reply transmission is erroneous; 
the quote's code point never makes the cut:

  >>> b.decode('cp1252').encode('utf8').decode('utf8')   # load, reply, browser
  '“'
  >>> b.decode('latin1').encode('utf8').decode('utf8')   # the Latin-1 munge...
  '\x93'

The net effect turns the quote into a garbage control character that browsers 
simply ignore (it's a box in Firefox's view-source, but is otherwise hidden). 

If your non-UTF-8 files are _only_ CP-1252, replacing Latin-1 with CP-1252
in the encodings list fixes the issue.  However, if your site's files use
multiple encodings whose byte ranges overlap but map to different characters,
using CP-1252 may fix some files but break others.  Latin-1 files using the 
0x93 control code, for example, would sprout quotes when displayed (unlikely,
but true).  The real issue here is that content of mixed encodings is 
inherently ambiguous in the Unicode model.

The _better solution_ is to make sure your site's displayable text files 
don't use incompatible encoding schemes.  At showcode's site, the simplest 
fix was to adopt UTF-8 as the site-wide encoding, by opening its handful 
of CP-1252 files as CP-1252, and saving as UTF-8.  The set of suspect files
can be easily isolated by trying UTF-8 opens (e.g., in an os.walk() loop).
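
One way to isolate and convert such files is sketched below for Python 3.X.
Both functions and the extensions tuple are assumptions for illustration, 
not code used at this site; run the conversion only on files known to be 
CP-1252, and on copies first:

```python
import os

def find_non_utf8(root, extensions=('.py', '.txt')):
    # hypothetical helper: list paths under root whose bytes fail a UTF-8 decode
    suspects = []
    for folder, subfolders, files in os.walk(root):
        for filename in files:
            if filename.endswith(extensions):
                path = os.path.join(folder, filename)
                with open(path, 'rb') as file:
                    data = file.read()
                try:
                    data.decode('utf8')
                except UnicodeDecodeError:
                    suspects.append(path)
    return sorted(suspects)

def convert_to_utf8(path, source='cp1252'):
    # hypothetical helper: rewrite one file in place as UTF-8; newline=''
    # turns off line-end translation so the file's own line ends are kept
    with open(path, encoding=source, newline='') as file:
        text = file.read()
    with open(path, 'w', encoding='utf8', newline='') as file:
        file.write(text)
```

For example, find_non_utf8('..') run from the cgi/ folder would list the 
candidates under the site root, each of which can then be inspected and 
passed to convert_to_utf8() if it is indeed CP-1252.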

Converting to UTF-8 universally will not only help avoid corrupted text 
in showcode, it might also avoid issues in text editors that are given or 
guess encoding types.  If you give the wrong encoding to an editor, saves 
may corrupt your data.  If you expect a tool to deal with mixed encoding 
types, guessing may be its only recourse.  But guessing is overkill; is 
impossible to do accurately anyhow; and is not science.  Skip the drama 
and convert your files.

UPDATE: in light of the above, 'latin1' was eventually replaced by 'cp1252' 
in showcode's preset input-encodings list, to accommodate a few files at 
this site that are intentionally not UTF-8 (this is similar in spirit to 
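the policies for parsing web pages in HTML5).  CP-1252 matches Latin-1 for 
all printable characters and adds more in the 0x80-0x9F range, so it should 
work more broadly (though Python's codec rejects five byte values that 
CP-1252 leaves undefined); change as needed for your site's 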
files.  This is still only a partial solution for mixed-content ambiguity; 
use a common Unicode type to avoid encoding mismatches altogether.

FOOTNOTE: subtly, some scripts, including this site's genhtml page builder
(learning-python.com/genhtml.html), can often get away with treating CP-1252 
and other 8-bit encodings as Latin-1, because bytes whose interpretations 
differ between the two are passed through unchanged from load to save. 
What Latin-1 reads and writes as 0x93 is still '“' to CP-1252, though 
the equivalence falls apart when comparing non-Latin-1 text:

  >>> '“'.encode('cp1252').decode('latin1').encode('latin1').decode('cp1252')
  '“'
  >>> '“'.encode('cp1252').decode('latin1') == '“'    # cp1252's meaning lost
  False

This doesn't help in showcode, because data loaded as Latin-1 is not written
again as Latin-1; encoding as UTF-8 in the reply makes the munging permanent.


OTHER USAGE IDEAS:

This script can also be invoked by URL in the "action" tag of a form 
in an HTML page; could be submitted by a script (see Python's urllib);
and might work as an Apache handler (to be explored).
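
As a rough sketch of the script-submission idea in Python 3.X, the request 
URL can be built with urllib and fetched with urlopen().  The showcode_url() 
and fetch_raw() helpers here are hypothetical, and the reply is assumed to 
be UTF-8 text per this script's output policy:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = 'http://learning-python.com/cgi/showcode.py'

def showcode_url(name, rawmode=None, base=BASE):
    # hypothetical helper: build a request URL; urlencode escapes the values
    params = [('name', name)]
    if rawmode:
        params.append(('rawmode', rawmode))    # 'view' or 'save'
    return base + '?' + urlencode(params)

def fetch_raw(name):
    # hypothetical helper: fetch a file's plain text (requires network access)
    with urlopen(showcode_url(name, rawmode='view')) as reply:
        return reply.read().decode('utf8')

print(showcode_url('filename.py', rawmode='view'))
```

Only the URL builder runs here; fetch_raw() is defined but not called, 
because it needs a live server.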

 
CAVEATS:

As is, this script reflects a number of tradeoffs:

-Its code must run on the Python 2.X version which is default on the host.
-Its footer code must avoid copies of text normally generated by genhtml.
-Its error checking is minimal, as it is used only in well-known contexts.
-Its ".." assumption for subject files' paths is not very general.
-Its Apache rewrite rule breaks "README.txt" in index pages (see above).
-Its Apache rewrite rule may complicate robots.txt handling (see above).
-Its always-UTF8 output policy means others are converted to this on saves.
-Its Unicode encodings list may fail in mixed-type contexts (see above).

OTOH, it works as intended, and demos CGI; expand and improve as desired.
==============================================================================
"""

import cgitb; cgitb.enable()      # route python exceptions to browser/client
import cgi, os, sys, codecs

UsingPython3 = sys.version_info[0] == 3
UsingPython2 = sys.version_info[0] == 2

if UsingPython3:                                      # py 3.X/2.X compatible
    from html import escape as html_escape            # run on 2.X only to date
    from urllib.parse import quote_plus
elif UsingPython2:
    from cgi import escape as html_escape             # for text added to HTML 
    from urllib import quote_plus                     # for text added to URL
else:
    assert False, 'The future remains to be written'


# Switches and constants

# Jun-26-18 note: browsers allow "UTF8" but "UTF-8" is technically the HTML 
# encoding name; python allows synonyms, including both "UTF-8" and "utf8";
# for background, see: https://encoding.spec.whatwg.org/#names-and-labels;

# Sep-01-18 note: see AVOID MIXED UNICODE ENCODINGS above - 'latin1' was 
# replaced with 'cp1252' in UNICODE_IN for the host site as a half-measure;

MOCK_VALUES = False                 # True=simulate parsed inputs
MOCK_SERVER = False                 # True=simulate client request

UNICODE_IN  = ['UTF8', 'cp1252']    # try in turn for code file content
UNICODE_OUT = 'UTF-8'               # for text in generated reply page

TEMPLATE    = 'showcode-template.txt'    # the reply-page format
FOOTER      = '../dummy-footer.html'     # site-wide footer code

trace = lambda *args: None  # or print, to display on stdout 


#------------------------------------------------------------------------------
# Get input filename (and raw-text mode?) sent from the client
#------------------------------------------------------------------------------

# Parse and/or forge request

if MOCK_VALUES:
    # simulate parsed request for testing
    class Mock: 
        def __init__(self, value):
            self.value = value
    form = dict(name=Mock('timeformat.py'))    # + rawmode=Mock('view')?
else:
    if MOCK_SERVER:
        # jun18: simulate post-server, pre-pylibs state
        os.environ['CONTENT_TYPE'] = 'application/x-www-form-urlencoded'
        os.environ['QUERY_STRING'] = (
            'name=pyedit-products/unzipped/docetc/examples'
                 '/Non-BMP-Emojis/Non-BMP-Emoji-both-%f0%9f%98%8a.txt')

    form = cgi.FieldStorage()         # parse form/url input data

# Extract request inputs

if 'name' not in form:
    name = 'cgi/showcode.py'          # show myself: more useful 
    """
    # error check: custom reply = hdr + blankline=\n + msg
    print('Content-type: text/plain\n')
    print('Please provide a value for "name" in the request.')
    sys.exit(1)
    """
else:
    name = form['name'].value         # real or mocked, pathname relative to '..'

if 'rawmode' not in form:
    rawmode = False
else:
    rawmode = form['rawmode'].value   # 'view' or 'save' or absent=formatted


#------------------------------------------------------------------------------
# Load the code from a file in "..", in 1 of N Unicode encodings
#------------------------------------------------------------------------------

# "name" may be a basename or a pathname relative to ".." (site root);
# both open()/read() flavors retain \r on Windows, decode to code points,
# and return a Unicode object: a py2 u'xx' unicode, or a py3 'xx' str; 
# tries N Unicode types for input, but always outputs as UTF8 bytes;
# os.path.isfile() implies os.path.exists(): don't need both;

path = '..' + os.sep + name

if not os.path.isfile(path):
    code = (u'Error: file does not exist or is not a file.\n\n'
             'Please verify the filename in your request.\n')     # apr18

else:
    for tryenc in UNICODE_IN:
        try:
            if UsingPython3:
                code = open(path, mode='r', encoding=tryenc, newline='').read()
            else:
                code = codecs.open(path, mode='r', encoding=tryenc).read()
        except Exception:
            pass     # try next encoding on list (don't trap system exits)
        else:
            break    # load successful: skip else
    else:
        code = (u'Error: could not open file.\n\n'
                 "Please adjust the script's UNICODE_IN list.\n")   # jun18


#------------------------------------------------------------------------------
# Load and expand the HTML template
#------------------------------------------------------------------------------

# It's okay to load the template as str, even though "code" is unicode:
# in py3 they're the same - both are str, which is always Unicode text;
# in py2 they differ, but str is coerced up - '%s' % u'spam' => u'spam',
# even for dicts: see learning-python.com/showcode-unicode-demo.txt;

# Jun-26-18 - but we must decode strs to unicode for py2 if they may be
# non-ASCII, because py2 expects strs to be all-ASCII whenever mixed with 
# unicodes; this applies only to "namehtml", a str in 2.X encoded as UTF-8
# per web conventions: "code" is already unicode, "nameurl" URL-escapes any
# non-ASCIIs, and "footer" and "template" are assumed to be ASCII files;  
# even if name is decoded, error-message "code" must be all-ASCII str or 
# unicode too - use u'xxx' above in both 2.X and 3.X (3.X requires 3.3+);

if rawmode:
    reply = code    # send text as is

else:
    template = open(TEMPLATE).read()        # template file in '.', ASCII only

    codehtml = html_escape(code)            # HTML-escape any characters special in HTML
    namehtml = html_escape(name)            # template also hardcodes some URL escapes
   
    # no longer need to strip 'cgi/' here: used in URL query, not raw link
    nameurl = quote_plus(name)              # URL-escape this: added to query in template 

    footer = open(FOOTER).read()            # load dummy ASCII generated footer html in ..
    for link in ('HREF', 'href', 'SRC'):    # munge it to add ".." to all nested item refs
        old = '%s="'    % link              # this avoids copying code (see template text)
        new = '%s="../' % link
        footer = footer.replace(old, new)

    for undo in ('mailto', '#'):            # undo up-rerouting for two special-cases 
        new = 'HREF="../%s' % undo          # still beats maintaining copied code...
        old = 'HREF="%s'    % undo
        footer = footer.replace(new, old)

    trace(type(namehtml), namehtml)
    trace(type(codehtml))

    # jun-26-18: allow non-ASCII filenames in Python 2.X (see above)
    if UsingPython2:
        namehtml = namehtml.decode(UNICODE_OUT)

    trace(type(namehtml), namehtml)

    reply = template % dict(                # unicode reply: replace template targets
                __NAME__     = namehtml,    # the sent and escaped filename
                __NAMEURL__  = nameurl,     # the filename for raw-text link
                __CODE__     = codehtml,    # the loaded and escaped Unicode text 
                __FOOTER__   = footer)      # the munged dummy generated toolbar html

trace(type(reply), reply[:40])


#------------------------------------------------------------------------------
# Print the reply stream back to the client
#------------------------------------------------------------------------------

# Write UTF8-encoded bytes, use "charset" to force Unicode type to match;
# "inline" is always view, but may require cmd/ctl-A+C to save contents;
# "attachment" is usually save, but opens may fail on some platforms, and
# this is just view on others (notably, iOS: there's no user file access);

if not rawmode:
    contenthdr  = 'Content-type: text/html; charset=%s' % UNICODE_OUT
else:
    dispostype  = 'inline' if rawmode == 'view' else 'attachment'
    basename    = os.path.basename(name)
    contenthdr  = 'Content-type: text/plain; charset=%s\n' % UNICODE_OUT
    contenthdr += 'Content-Disposition: %s; filename="%s"' % (dispostype, basename)

replybytes = reply.encode(UNICODE_OUT)      # send encoded bytes: print is iffy

print(contenthdr)                           # reply = hdrs + blankline + html
print('')                                   # need '' for 2.X, else a tuple!
if UsingPython2:
    sys.stdout.write(replybytes)            # py2 accepts a str for the bytes
else:
    sys.stdout.flush()                      # py3: send buffered headers first
    sys.stdout.buffer.write(replybytes)     # py3 stdout is str: use io layer  


