File: cgi/sitesearch.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-
u"""
===============================================================================
sitesearch.py - implement local-pages search for a website

Synopsis: given search parameters in an HTML form or URL, build a site-specific 
search query string and send it to a search provider with an HTTP redirect.

Author/Copyright: 2016-2024, M. Lutz (learning-python.com)
License: provided freely, but with no warranties of any kind.

CGI MODULE NOTE: this script relies on the longstanding Python cgi module,
which was mind-numbingly deprecated in Python 3.11 (per PEP 594), and is 
removed in 3.13.  This script has no intention of coding the many mods required 
by this pointless and opinionated removal.  Instead, if your server uses
Python 3.13 or later, you can thankfully work around this by installing 
the original cgi module, available on PyPI as package legacy-cgi (see
https://pypi.org/project/legacy-cgi/).  Simply run the following single 
command to restore the still-useful cgi and cgitb Python modules:

    pip3 install legacy-cgi

VERSIONS (search on mmm-yyyy for changes):
    Sep 08, 2024 - Add note above about CGI module deprecation/workaround.

    Nov 18, 2020 - Error message if parameters absent|invalid, not status 500.
                   Assorted documentation edits/improvements, in this docstr.
 
    Sep 26, 2020 - Fix Py3.X+Linux redirect-page print excs on non-ASCII terms.
                   Fix log-file write excs in Py2.X for non-ASCII term encodes.
                   Try but skip alt format for non-ASCII Location search URLs.
                   Add a few privacy-focused newcomers to host selection lists.

    Jun 28, 2019 - Document and improve the privacy and permissions of the 
                   search-terms file saved on the server.

    Jun 28, 2018 - Improve and describe handling of Unicode in search terms,
                   remove Ixquick (it's now StartPage), add Baidu and Yandex,
                   unescape URL displayed to user in UTF-8 fallback page.

    Jan 25, 2016 - Log search terms to a server file for metrics, with date
                   and time.  For privacy, set this file's Unix permissions 
                   to 0200 (or 0600) per "ABOUT PERMISSIONS..." ahead.
    
    Jan 15, 2016 - Initial release.


-------------------------------------------------------------------------------
OVERVIEW
-------------------------------------------------------------------------------

This is a Python CGI script: it runs on a web server, reads form (or URL query) 
inputs, and prints HTTP headers + HTML text back to the client.  To search, 
this builds a query URL and delegates the search to a search provider via 
an HTTP redirect header.  Anonymous search terms are also saved to a file for 
metrics, and the reply includes a fallback HTML page with redirection options.

This CGI script is normally invoked by the "action" attribute of the form in
the companion HTML page at:

    http://learning-python.com/sitesearch.html

Use your browser's "View Source" to see the form in this HTML page's code;
the linkage in the HTML file looks like this:

    <FORM method=POST accept-charset="UTF-8" action="cgi/sitesearch.py">

As usual, this script can also be invoked from a browser or script using a
GET-style URL with query parameters at the end like this (sans line breaks):

    http://learning-python.com/cgi/sitesearch.py?
        searchhost=Google&searchsite=Entire+site&searchterm=fortran
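
A GET-style URL like the one above can also be built in code.  The following
is a minimal Python 3 sketch, not part of this script; the names params and
url are illustrative only:

```python
from urllib.parse import urlencode

# The three query parameters this script reads, in form order
params = [('searchhost', 'Google'),
          ('searchsite', 'Entire site'),
          ('searchterm', 'fortran')]

# urlencode() escapes each value, turning spaces into '+'
url = 'http://learning-python.com/cgi/sitesearch.py?' + urlencode(params)
print(url)
```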

To use for other sites: 

- Edit the HTML form page's content (or copy its <form>) 
- Edit the HTML form's site-select list (name=searchsite)
- Edit the searchsites dictionary below accordingly

Tradeoffs: this script's scheme exits the hosting site for search results, 
in exchange for a relatively simple way to add multiple-provider search that
requires neither JavaScript nor proprietary widgets.

Examples of mappings from inputs to query URL:

1) Search entire site:

    user inputs = [site=Entire site, term=spam, host=Google]
    search string = "spam site:learning-python.com"
    URL = https://www.google.com/search?q=spam+site%3Alearning-python.com

2) Search individual parts (if any, currently unused here):

    user inputs = [site=Books only, term=decorator, host=DuckDuckGo]
    search string = "decorator site:learning-python.com/books"
    URL = https://duckduckgo.com?q=decorator+site%3Alearning-python.com%2Fbooks

3) Search for multi-word terms (users may add "..." to focus results)

    user inputs = [site=Entire site, term=Pandora's box, host=Bing]
    search string = "Pandora's box site:learning-python.com"
    URL = https://www.bing.com?q=Pandora%27s+box+site%3Alearning-python.com

    With user-provided quotes:

    user inputs = [site=Entire site, term="Pandora's box", host=Bing]
    search string = '"Pandora\'s box" site:learning-python.com'
    URL = https://www.bing.com?q=%22Pandora%27s+box%22+site%3Alearning-python.com
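
The mappings above can be reproduced with a short Python 3 sketch.  It is an
illustration only, not this script's code: build_query_url and its parameter
names are hypothetical, and the default template is the generic one used for
hosts such as DuckDuckGo and Bing:

```python
from urllib.parse import quote_plus

GENERIC = 'https://%(host)s?q=%(term)s+%(pref)s%(site)s'

def build_query_url(host, term, site, template=GENERIC):
    # Escape each part separately; the template itself supplies the
    # literal '+' that joins the term to the site: prefix.
    parts = {
        'host': host,
        'term': quote_plus(term),
        'pref': quote_plus('site:'),
        'site': quote_plus(site),
    }
    return template % parts

print(build_query_url('www.bing.com', "Pandora's box", 'learning-python.com'))
```

Passing a different template, such as one with a /search path, would cover
hosts whose query URLs do not fit the generic pattern.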


-------------------------------------------------------------------------------
GENERAL NOTES
-------------------------------------------------------------------------------

Usage note: although search providers can be selected for comparison,
DuckDuckGo or StartPage are strongly recommended.  Other providers may 
insert ads and unrelated photos in results and track searchers, and Google
occasionally disables the Back button.  See the HTML page for more details.

Coding note: uses \n instead of \r\n for line breaks, because all known
clients accept it; print adds a \n by default; and Windows may expand \n
to \r\n automatically, which could change an explicit \r\n to \r\r\n.
See https://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.3.

Coding note: this script is portable to both Python 2.X and 3.X, but its code
is constrained by the need to run on Python 2.4 -- the most recent Python at
the ISP hosting this site (GoDaddy).  OTOH, 2.4 works fine, which begs the
question: were all the Python changes since 2004 really that important?...

UPDATE, Jun-2017: learning-python.com is now a single site/part; the former
"Books only" in both code here and older examples is defunct but harmless 
(it's disabled altogether in the input form's HTML - no changes made here).

UPDATES, Jun-2018: Ixquick has been merged into StartPage (and is disabled
in the form's HTML); this site's host now runs Python 2.6 by default (though
this has no impact here); Baidu and Yandex have been added to the list of
supported providers; and Unicode handling is modestly improved (per below).

UPDATE, Sep-2020: this script now runs on an AWS Lightsail VPS host which uses
Python 3.5; multiple Unicode issues had to be fixed as a result (see ahead).
The search-host selection list also now includes newer privacy-focused sites.


-------------------------------------------------------------------------------
ABOUT PERMISSIONS AND PRIVACY OF THE SEARCH-TERMS FILE
-------------------------------------------------------------------------------

This script logs search terms to a server file for metrics, adding date
and time but no user information.  For best privacy, set this file's Unix 
permissions to 0200 (and possibly omit it from its directory's index).  

Permission 0200 allows this script to update the file, but prevents the 
file from being accessed on the web at large -- by either direct URL, 
or general file-viewer scripts (e.g., this site's cgi/showcode.py).

Permission 0600 suffices for sites with no file-viewer script (it prevents 
direct-URL access only), and may be required to download the terms file; 
change as needed.  Use permission 0755 (executable) for this script itself.
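
A minimal sketch of applying the 0200 setting from Python follows.  The helper
restrict_terms_file is hypothetical and not part of this script, and the 0o200
literal assumes Python 2.6 or later; running "chmod 0200" on the file in a
Unix shell has the same effect:

```python
import os, stat

def restrict_terms_file(path, mode=0o200):
    # Create the file if absent, then make it write-only for its owner
    if not os.path.exists(path):
        open(path, 'w').close()
    os.chmod(path, mode)
    # Report the permission bits now in effect
    return stat.S_IMODE(os.stat(path).st_mode)
```

For example, restrict_terms_file('sitesearch-savedterms.txt') would return
the integer 0o200 (128) on a Unix host if the change succeeded.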


-------------------------------------------------------------------------------
ABOUT SEARCH TERMS WITH ARBITRARY UNICODE CHARACTERS
-------------------------------------------------------------------------------

TBD: this script may or may not support non-ASCII Unicode in search terms. 
While it could encode the redirect URL to something like UTF-8 bytes, it's 
not clear that servers would use this in "Location:" lines, even given a 
content line of "Content-type: text/html; charset=UTF8".  Resolve me. 

UPDATE, Jun-2018: for non-ASCII characters (e.g., BMP symbols and emojis), 
the HTML form that triggers this script now uses an [accept-charset="UTF-8"] 
attribute to force inputs to be encoded to UTF-8 bytes and URL-escaped ("%XX")
for transmission (a charset meta tag for the entire page is now set as well).
This is an improvement, not a complete fix, though the reasons are subtle.

Before this, browsers would send characters outside the default charset 
oddly: as URL-escaped HTML entities.  For example, a user's searchterm 
input "aaa☞bbb" would be sent from the browser as "aaa%26%239758%3Bbbb", 
and become "aaa&#9758;bbb" after URL unescaping.  URL-escaping this again
here for the redirect would recreate the browser's HTML-entity string,
which only Google seemed to recognize well (StartPage returned a few hits, 
but all other search sites listed here returned all or no pages in response).

With the new accept-charset attribute, a searchterm input "aaa☞bbb" is now
instead sent from browsers as "aaa%E2%98%9Ebbb" (the URL-escaped version 
of Python bytes-string b'aaa\xe2\x98\x9ebbb'), which URL-unescaping turns
into the original input string, and URL-reescaping here restores.

This avoids having to use an HTML parser to munge entities, if required.
Unfortunately, though, its only real benefit today seems to be making the 
term clearer in the generated URL.  Google and StartPage are still the only 
search sites that support such search terms reliably (HTML-entity or not), 
and Google seems to be the better of the two; Baidu and Yandex support 
varies per character, and others ignore the search term altogether.

Hence, barring a future revelation (and time to stumble onto one), search 
users should try Google first for most searches using non-ASCII characters.  
Alas, Unicode on the web is still riddled with intellectual black holes.
This very note required the charset declaration in line 2 above for the
host's Python 2.X (though 3.X's UTF-8 default happily groks any and all ☞). 
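
The two browser encodings described earlier in this section can be compared
with a few lines of Python 3.  This is a sketch for illustration only; the
variable names are not part of this script:

```python
from urllib.parse import quote_plus, unquote_plus

entity_form = 'aaa%26%239758%3Bbbb'    # old: URL-escaped HTML entity
utf8_form = 'aaa%E2%98%9Ebbb'          # new: URL-escaped UTF-8 bytes

entity_term = unquote_plus(entity_form)
utf8_term = unquote_plus(utf8_form)

print(entity_term)                     # aaa&#9758;bbb
print(len(utf8_term))                  # 7: the ☞ is one code point

# Re-escaping for the redirect restores what the browser sent
print(quote_plus(entity_term) == entity_form)    # True
print(quote_plus(utf8_term) == utf8_form)        # True
```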

For background details on such dark matters, try:
    https://www.w3schools.com/tags/att_form_accept_charset.asp
    https://tools.ietf.org/html/rfc2616#section-14.2
    https://www.w3schools.com/html/html_forms.asp

UPDATE: the next section's item #3 explores a later failed try on this front.


-------------------------------------------------------------------------------
SEP-2020: THREE NON-ASCII IMPLEMENTATION UPDATES
-------------------------------------------------------------------------------

Moving to a Python 3.X host uncovered additional issues with non-ASCII terms.
Search for "SEP-20" here for all of the changes applied for issues fixed:


1) Fix silently failing "print(manualredirect)" in 3.X for non-ASCII terms

When using Python 3.X on Linux, this fallback-page print near the end of 
the script raised an exception for non-ASCII search terms.  This was not
noticed in normal usage because this text followed the redirection headers 
(it was meant only for redirect failures), and clients had already redirected 
to search results before the exception text was seen.  The later save of the
search term to the local log file, however, had not been reached or run for 
non-ASCII terms ever since the host website moved to a server with Python 3.X
(and earlier non-ASCII results are unknown). 

TO FIX, the fallback-page text is now encoded to UTF-8 bytes in 3.X 
before printing, and printed as bytes; this prevents automatic ASCII 
encoding for str in print().  A "charset" header parameter was also added 
for the UTF-8 encoding.  With the fix, the log file now records non-ASCII 
terms (even for emojis), and the redirect page both properly displays 
decoded non-ASCII text and redirects to the search engine (if ever needed).

THE CAUSE: the print's failure stems from the fact that locale environment
settings used at the terminal are not available in the CGI context, which
causes stdout encoding to default to ASCII in CGI scripts *only*.  For 
more background on this, search for "NON-ASCII FILENAMES, CAUSE" in this
site's similarly impacted https://learning-python.com/cgi/showcode.py.
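
The fix can be sketched in isolation as follows, for Python 3 only.  The
function write_page and its stream parameter are hypothetical names used for
illustration; they do not appear in this script:

```python
import sys

def write_page(html, stream=None):
    # Encode explicitly, then write bytes to the binary layer beneath
    # sys.stdout, bypassing the text layer's (possibly ASCII) encoding
    if stream is None:
        sys.stdout.flush()             # send already-printed headers first
        stream = sys.stdout.buffer
    data = html.encode('utf8')
    stream.write(data)
    stream.flush()
    return len(data)
```

Passing an in-memory binary stream such as io.BytesIO() as the second
argument allows the output to be checked without a web server.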


2) Fix exceptions in 2.X (only) on log-file-line encode of non-ASCII terms

A second but unrelated Unicode issue: the log-file line write() at the end 
of the script could fail for non-ASCII terms in 2.X (only), because the line 
is an already-encoded-bytes str in 2.X (only), which bombs in the manual 
encode.  It's a decoded-codepoints str in 3.X, which does require the encode
prior to the low-level file write.  TO FIX, the encoding is now skipped for
2.X (only).  Alas, 2.X/3.X Unicode portability is so subtle that it may not
be worth the effort today.
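
The version test behind this fix can be sketched as a small helper.  The name
to_log_bytes is hypothetical and used only for illustration; the script
itself performs the equivalent check inline:

```python
import sys

def to_log_bytes(line):
    # Python 3.X: str holds code points, so encode before os.write()
    # Python 2.X: str already holds encoded bytes, so pass it through
    if sys.version_info[0] >= 3 and isinstance(line, str):
        return line.encode('utf8')
    return line
```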


3) Try alternative non-ASCII term formatting added to showcode - AND PUNT

This script experimented with using the HTTP header parameter formatting
applied to fix non-ASCII filenames in the raw-text replies of showcode.py. 
For the full story on this scheme, search for "Sep-2020" in:

    https://learning-python.com/cgi/showcode.py

In showcode.py, this yields properly encoded-and-then-escaped filenames,
which prevent print() exceptions in Python 3.X on Linux, and is better 
for clients in general:

    Content-Disposition: inline; filename*=UTF-8''%c2%a3%20spam%20%e2%82%ac%20eggs

Here, applying this same alternative formatting yields Location headers like 
this (the first of these searches for a smiley emoji, the second for '真Л⇨'):

    Location: UTF-8''https%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3D%F0%9F%98%8A%2Bsite%3Alearning-python.com
    Location: UTF-8''https%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3D%E7%9C%9F%D0%9B%E2%87%A8%2Bsite%3Alearning-python.com 

This differs from the original's encoded-and-then-escaped parts:

    Location: https://www.google.com/search?q=%F0%9F%98%8A+site%3Alearning-python.com
    Location: https://www.google.com/search?q=%E7%9C%9F%D0%9B%E2%87%A8+site%3Alearning-python.com

The new alternative formatting DID NOT WORK here for the value of the Location 
header (as coded and tested, at least).  In all cases, the alternative 
formatting produced "Not Found" reply pages in the client.  This is perhaps 
because the standard seems to be meant for HTTP header _parameter_ values, not
the entire header value; the net effect appears to invalidate the redirect 
altogether. 

Hence, this script is sticking with its original formatting scheme of 
URL-encoding individual parts of the redirect URL.  The good news is that 
this seems to work on most search engines listed here for non-emoji non-ASCII
terms such as '真Л⇨' (though emoji support outside Google is still spotty).

Also in the browser-dependence column: Yandex now calls out 'automated' 
searches like those here, and Bing may now require multiple Backs to exit 
its search-results page (though only in desktop Firefox?); sigh...
===============================================================================
"""



####################
# CODE STARTS HERE #
####################



from __future__ import print_function     # enable py3.X print() in py2.X
import cgitb; cgitb.enable()              # route python exceptions to browser/client (?)

import cgi, sys, os, time

UsingPython3 = sys.version[0] == '3'
UsingPython2 = sys.version[0] == '2'

# fetch url reply, url-encode/decode text
if UsingPython3:                                      # py 3.X/2.X compatible
    from urllib.request import urlopen                # *urlopen not yet used*
    from urllib.parse import quote_plus, unquote_plus
else:                                     
    from urllib import urlopen, quote_plus, unquote_plus

# Jan-2016: save search terms to this server file for metrics
SAVETERMSFILE = 'sitesearch-savedterms.txt' 

# testing
MOCK  = False    # True=simulate form/url-query inputs (testing)
DEBUG = False    # True=display url without submitting it (testing)



#==============================================================================
# Fetch and map inputs to url parts
#==============================================================================


searchsites = {
    'Entire site':   'learning-python.com',            # Jun-2017: now Entire only
    'Books only':    'learning-python.com/books',      # [url-encodes '/' ahead]
   #'Training only': 'learning-python.com/training'    # [not yet (ever) supported]
    }

searchhosts = {
    'DuckDuckGo': 'duckduckgo.com',          # no tracking, ads, or images
    'Google':     'www.google.com',          # all of the above + Back breaks...
    'Bing':       'www.bing.com',            # ads + images + less fruitful?
    'StartPage':  'startpage.com',           # google results, no tracking
    'Yahoo':      'search.yahoo.com',        # you be the judge
    'Baidu':      'www.baidu.com',           # Jun-2018: popular in China
    'Yandex':     'www.yandex.com',          # Jun-2018: popular in Russia 
    'Privado':    'www.privado.com',         # Sep-2020: privacy-focused newcomer
    'OneSearch':  'www.onesearch.com',       # Sep-2020: privacy-focused newcomer
   #'Ixquick':    'ixquick.com',             # [Jun-2018: now redirects to StartPage]
    }


#------------------------------------------------------------------------
# Fetch inputs from html form fields, url query parameters, or mock-up.
#------------------------------------------------------------------------

if not MOCK:
    form = cgi.FieldStorage()                # parse live form/url input data
else:
    class Mock:                              # simulate form input to test
        def __init__(self, value):
            self.value = value

    form = {'searchsite': Mock('Entire site'),
            'searchterm': Mock('"class decorator" ☞'),    # or 真Л⇨'),
            'searchhost': Mock('Google')}

    # => Location: https://www.google.com/
    #      search?q=%22class+decorator%22+%E2%98%9E+site%3Alearning-python.com


#------------------------------------------------------------------------
# Map inputs to query URL values.
#
# Missing and invalid keys in url queries trigger cgitb KeyError displays.
# This script started on py 2.4, which has no 'a if b else c' expr; that 
# required some arguably verbose coding (which has largely faded today).
#
# Nov-2020: really, missing site/host trigger 500 status replies today 
# (Internal Server Error); to stop google-search nastygrams about it, 
# check explicitly, and send an explicit error-message reply if empty. 
# searchterm was formerly handled this way, but the other two were not; 
# site and host cannot be empty in the HTML form, but can in URL queries.
# Also now catches invalid key values for passed keys, using dict.get()
# (e.g., previously was: site = searchsites[form['searchsite'].value] ).
#------------------------------------------------------------------------

site = host = term = ''    # missing term, and others: error reply ahead

if 'searchsite' in form:
    site = searchsites.get(form['searchsite'].value, '')

if 'searchhost' in form:
    host = searchhosts.get(form['searchhost'].value, '')

if 'searchterm' in form:
    term = form['searchterm'].value


#-----------------------------------------------------------------------
# Map search provider to query URL pattern.
#
# Jun-2018: Drop 'atag' prefix in favor of per-provider patterns,
# because Baidu and Yandex URLs are too custom for prior scheme.
# The former "/" before an empty atag is optional (ddg adds one).
# urltemplate = 'https://%(host)s/%(atag)s?q=%(term)s+%(pref)s%(site)s'
#
# Sep-2020: Yandex now may detect and call out automated search URLs
# (requiring a user verify), and Bing may require multiple Backs to
# exit.  Newly added: Privado and OneSearch are both privacy-focused
# newcomers (there may be more like them).  Regrettably, most here
# may show sponsored ads first - even those pushing privacy.  Bing
# also now uses same format as Google, but auto maps urlgeneric to it.
#-----------------------------------------------------------------------

urlgeneric = 'https://%(host)s?q=%(term)s+%(pref)s%(site)s'

# keys should match searchhosts above

urltemplates = {
    'DuckDuckGo':   urlgeneric, 

    'Google':      'https://%(host)s/search?q=%(term)s+%(pref)s%(site)s',  

    'Bing':         urlgeneric,  # sep-2020: now == google, but generic ok 

    'StartPage':   'https://%(host)s/do/search?q=%(term)s+%(pref)s%(site)s',

    'Yahoo':       'https://%(host)s/search?q=%(term)s+%(pref)s%(site)s',

    'Baidu':       'https://%(host)s/s?ie=utf-8&wd=%(term)s+%(pref)s%(site)s',

    'Yandex':      'https://%(host)s/search/?text=%(term)s+%(pref)s%(site)s',

    'Privado':      urlgeneric,   # Bing-powered

    'OneSearch':    urlgeneric    # caveat: requires an extra click
} 

# Nov-2020: allow missing here too
if 'searchhost' in form:
    urltemplate = urltemplates.get(form['searchhost'].value, '')
else:
    urltemplate = ''


#-----------------------------------------------------------------------
# Encode and escape parts, expand URL template.
#
# URL-escapes the search-term text input, site '/', and label ':'.
# This escapes individual parts of the redirect URL, not all of it.
# Also explicitly encodes term's str codepoints to utf8 bytes in py3.X;
# quote_plus() does this auto in 3.X only, but that's far too implicit.
# quote_plus() also always returns an all-ASCII _str_, for any input. 
#-----------------------------------------------------------------------

# Sep-2020: formerly relied on quote_plus() utf8 default in py3.X (only)
if UsingPython3:
    term = term.encode('utf8')          # it's already encoded bytes in 2.X

term = quote_plus(term)                 # pass encoded bytes in both 3.X and 2.X
site = quote_plus(site)                 # '"A/B C:D E"' -> '%22A%2FB+C%3AD+E%22'
pref = quote_plus('site:')              # unquote_plus() reverses the %-encoding

# expand template
searchurl = urltemplate % vars()        # Jun-2018: now table-driven



#==============================================================================
# Pass formatted 'term site:xxx' query to providers
#==============================================================================


#------------------------------------------------------
# Fallback-option page (or via HTTP Refresh header).
#
# Caution: the body's HTML may contain non-ASCII text.
# Jun-2018: unescape link displayed to user (only).
# Nov-2020: drop blank line #2 at top, add one at end.
#------------------------------------------------------

# HTML-escape the unquoted URL shown as link text: it holds raw user input
try:
    from html import escape               # py3.2+
except ImportError:
    from cgi import escape                # py2.X

manualredirect = """<HTML><HEAD>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<META http-equiv="Refresh" content="5; URL=%s">
</HEAD>
<BODY><FONT face=Arial>
<P>Redirecting to search host...
<P>If this fails, please click this link instead:
<PRE>   
    <B><A HREF="%s">%s</A></B>
</PRE>
</P></BODY></HTML>\n""" % (searchurl, searchurl, escape(unquote_plus(searchurl)))


#------------------------------------------------------------------------
# Error checks, custom replies
#------------------------------------------------------------------------

if not site and not host and not term:              # Nov-2020: check more inputs
    print('Content-type: text/plain\n')             # start reply = hdr + blankline=\n
    print('Search parameters missing or invalid.')  # text to show in browser, if any

elif not site:
    print('Content-type: text/plain\n')       
    print('Search site missing or invalid.')

elif not host:
    print('Content-type: text/plain\n')       
    print('Search host missing or invalid.')

elif not term:
    print('Content-type: text/plain\n')
    print('Please provide a search term.')          # traps just missing for term 
    
elif DEBUG:
    # display built link only
    print('Content-type: text/plain\n')             # start reply = hdr + blankline=\n
    print(searchurl)                                # with encoded-and-escaped term 

else:

    #--------------------------------------------------------------------
    # Valid and live: redirect client to a search-provider's site/page.
    # 
    # Send the client an HTTP reply = [headers + blank line + HTML page]
    # having a 'Location:' header line with a formatted search-query URL.
    # The hosting site auto generates a 3xx status header if 'Location:'.
    # Print HTML reply too, in case client doesn't redirect as planned.
    #--------------------------------------------------------------------
    
   #print('HTTP/1.1 302 Found')                         # <= added by host if 'Location:'
    print('Location: %s' % searchurl)                   # cgi http redirect header
   #print('Connection: close')                          # <= this seems optional or auto
    print('Content-type: text/html; charset=UTF-8')     # Sep-2020: add charset for body
    print('')                                           # '' else 2.X tuple sans __future__

    #--------------------------------------------------------------------
    # HTML page shown if 'Location:' header redirect fails (unlikely).
    #
    # Sep-2020: The original print() below produced a to-ASCII encoding
    # exception in Python 3.X on Linux for all non-ASCII search terms;
    # the redirect silenced the exception, but later log writes didn't 
    # happen.  To fix, 3.X encodes manually and writes as bytes instead.
    # Why: CGI scripts lack locale settings, so stdout defaults to ASCII.  
    #--------------------------------------------------------------------

    if UsingPython2:                                    # py2.X: already encoded bytes
        print(manualredirect)                           # py2 stdout allows str of bytes

    else:                                               # py3.X: manually encode str
        manualredirect = manualredirect.encode('utf8')  # match charset header encoding
        try: 
            sys.stdout.flush()                          # mar20: else headers last on 3.X
            sys.stdout.buffer.write(manualredirect)     # py3 stdout is str: use io layer  
        except: 
            print('Page error', repr(sys.exc_info()[1]), file=sys.stderr)

    #--------------------------------------------------------------------
    # Save search term (pre-encode) to flat file on server (Jan-2016).
    #
    # Open with exclusive lock for possibly concurrent web access.
    # Encode str to bytes for Py 3.X, no-op on Py 2.X unless non-ASCII.
    # Use permission 0200 for updates but inaccessibility on the web.
    # Sep-2020: don't encode line in 2.X - encoding fails for non-ASCII. 
    #
    # NOTE: the O_EXCL flag does _not_ suffice to lock the terms file
    # for exclusive access on the web server.  For tools that may, see 
    # Python's os.lockf(), fcntl.flock(), and fcntl.lockf().  This script 
    # assumes that writes are atomic (or won't overlap in practice).
    #--------------------------------------------------------------------
    
    try:
        filename = SAVETERMSFILE
        message = '[%s] [%s]' % (time.asctime(), unquote_plus(term))   # jan25
        if not os.path.exists(filename):
            open(filename, 'w').close()    # make file first time
            
        fd = os.open(filename, os.O_EXCL | os.O_APPEND | os.O_WRONLY)  # binary on Unix
        line = message + os.linesep
        if UsingPython3:
            line = line.encode('utf8')   # Sep-2020: it's already encoded bytes in 2.X
        os.write(fd, line)
        os.close(fd)
    except:
        print('Log error', repr(sys.exc_info()[1]), file=sys.stderr)   # to error_log
        pass   # neither server nor client care if this fails


# And the client reroutes to the search provider (which hopefully plays nice)