File: cgi/sitesearch.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-
"""
===============================================================================
sitesearch.py - redirect a site-specific search query to a search provider.

Author/Copyright: 2016-2018, M. Lutz (learning-python.com)
License: provided freely, but with no warranties of any kind.
Versions: 
    Jun 28, 2018 - improve and describe handling of Unicode in search terms,
                   remove Ixquick (it's now StartPage), add Baidu and Yandex,
                   unescape URL displayed to user in UTF-8 fallback page.
    Jan 25, 2016 - log search terms to a server file for metrics.

A Python CGI script: runs on the server, reads form (or URL query) input,
prints HTTP headers + HTML text to the client.  This script builds a query
URL and delegates it to a search provider via an HTTP redirect.

Example - search entire site:

    user inputs = [site=Entire site, term=spam, host=Google]
    search string = "spam site:learning-python.com"
    URL = https://www.google.com/search?q=spam+site%3Alearning-python.com

Example - search individual parts (if any):

    user inputs = [site=Books only, term=decorator, host=DuckDuckGo]
    search string = "decorator site:learning-python.com/books"
    URL = https://duckduckgo.com/?q=decorator+site%3Alearning-python.com%2Fbooks
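
As a rough sketch only (not this script's actual code, which uses the
per-provider templates ahead), the two examples above can be reproduced with
quote_plus.  The helper name build_search_url and its path argument are
hypothetical, made up for this illustration:

```python
try:
    from urllib.parse import quote_plus      # Python 3.X
except ImportError:
    from urllib import quote_plus            # Python 2.X

def build_search_url(host, path, term, site):
    # hypothetical helper: join the term and a "site:" restriction,
    # then URL-escape the whole search string in one step
    query = quote_plus('%s site:%s' % (term, site))
    return 'https://%s%s?q=%s' % (host, path, query)

print(build_search_url('www.google.com', '/search',
                       'spam', 'learning-python.com'))
print(build_search_url('duckduckgo.com', '/',
                       'decorator', 'learning-python.com/books'))
```

Escaping the joined string as one unit is what turns the ":" and "/" into
"%3A" and "%2F", and the space into "+".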

Normally invoked by the "action" attribute of the form in the companion HTML
page at:

    http://learning-python.com/sitesearch.html

Use your browser's "View Source" to see the form in this HTML page's code.
To use for other sites, edit the HTML select list and the "sites" dict below.

As usual, this script can also be invoked from a browser or script using a
GET-style URL with query parameters at the end like this (all on one line):

    http://learning-python.com/cgi/sitesearch.py?
        searchhost=DuckDuckGo&searchsite=Entire+site&searchterm=fortran
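
A script can assemble such a URL with the standard library's urlencode, as
in this minimal sketch.  The variable names are illustrative only; the three
parameter names are the ones this script reads:

```python
try:
    from urllib.parse import urlencode       # Python 3.X
except ImportError:
    from urllib import urlencode             # Python 2.X

script = 'http://learning-python.com/cgi/sitesearch.py'
params = [('searchhost', 'DuckDuckGo'),      # list of pairs: fixed order
          ('searchsite', 'Entire site'),     # the space is escaped as '+'
          ('searchterm', 'fortran')]

invokeurl = script + '?' + urlencode(params)
print(invokeurl)
```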

This script's scheme exits the hosting site for search results, in exchange 
for a relatively simple way to add multiple-provider search that requires 
neither JavaScript nor proprietary widgets.

Usage note: although search providers can be selected for comparison,
DuckDuckGo or StartPage are strongly recommended.  Other providers may 
insert ads and unrelated photos in results and track searchers, and Google
occasionally disables the Back button.  See the HTML page for more details.

Coding note: uses \n instead of \r\n for line breaks, because all known
clients accept it; print adds a \n by default; and Windows may expand \n
to \r\n automatically, which could change an explicit \r\n to \r\r\n.
See https://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.3.

Coding note: this script is portable to both Python 2.X and 3.X, but its code
is constrained by the need to run on Python 2.4 -- the most recent Python at
the ISP hosting this site (GoDaddy).  OTOH, 2.4 works fine, which raises the
question: were all the Python changes since 2004 really that important?...

Update, June 2017: learning-python.com is now a single site/part; the former
"Books only" in both code here and older examples is defunct but harmless 
(it's disabled altogether in the input form's HTML - no changes made here).

Updates, June 2018: Ixquick has been merged into StartPage (and is disabled
in the form's HTML); this site's host now runs Python 2.6 by default (though
this has no impact here); Baidu and Yandex have been added to the list of
supported providers; and Unicode handling is modestly improved (per below).

-------------------------------------------------------------------------------
ABOUT SEARCH TERMS WITH ARBITRARY UNICODE CHARACTERS

TBD: this script may or may not support non-ASCII Unicode in search terms. 
While it could encode the redirect URL to something like UTF-8 bytes, it's 
not clear that servers would use this in "Location:" lines, even given a 
content line of "Content-type: text/html; charset=UTF-8".  Resolve me. 

UPDATE, Jun 2018: for non-ASCII characters (e.g., BMP symbols and emojis), 
the HTML form that triggers this script now uses an [accept-charset="UTF-8"] 
attribute to force inputs to be encoded to UTF-8 bytes and URL-escaped ("%XX")
for transmission (a charset meta tag for the entire page is now set as well).
This is an improvement, not a complete fix, though the reasons are subtle.

Before this, browsers would send characters outside the default charset 
oddly: as URL-escaped HTML entities.  For example, a user's searchterm 
input "aaa☞bbb" would be sent from the browser as "aaa%26%239758%3Bbbb", 
and become "aaa&#9758;bbb" after URL unescaping.  URL-escaping this again
here for the redirect would recreate the browser's HTML-entity string,
which only Google seemed to recognize well (StartPage returned a few hits, 
but all other search sites listed here returned all or no pages in response).

With the new accept-charset attribute, a searchterm input "aaa☞bbb" is now
instead sent from browsers as "aaa%E2%98%9Ebbb" (the URL-escaped version 
of Python bytes-string b'aaa\xe2\x98\x9ebbb'), which URL-unescaping turns
into the original input string, and URL-reescaping here restores.
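
The two encodings described above can be compared with a short sketch; the
variable names are illustrative, and the strings are the ones quoted in this
note.  It relies on the library's default UTF-8 handling in 3.X:

```python
try:
    from urllib.parse import quote_plus, unquote_plus    # Python 3.X
except ImportError:
    from urllib import quote_plus, unquote_plus          # Python 2.X

entityform = 'aaa&#9758;bbb'                 # old: HTML entity for U+261E
entitysent = quote_plus(entityform)          # as sent before accept-charset
utf8sent   = 'aaa%E2%98%9Ebbb'               # new: URL-escaped UTF-8 bytes
roundtrip  = quote_plus(unquote_plus(utf8sent))

print(entitysent)                            # aaa%26%239758%3Bbbb
print(roundtrip == utf8sent)                 # True
```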

This avoids having to use an HTML parser to munge entities, if required.
Unfortunately, though, its only real benefit today seems to be making the 
term clearer in the generated URL.  Google and StartPage are still the only 
search sites that support such search terms reliably (HTML-entity or not), 
and Google seems to be the better of the two; Baidu and Yandex support 
varies per character, and others ignore the search term altogether.

Hence, barring a future revelation (and time to stumble onto one), search 
users should select Google for all searches using non-ASCII characters.  
Alas, Unicode on the web is still riddled with intellectual black holes.
This very note required the charset declaration in line 2 above for the
host's Python 2.X (though 3.X's UTF-8 default happily groks any and all ☞). 

For background details on such dark matters, try:
    https://www.w3schools.com/tags/att_form_accept_charset.asp
    https://tools.ietf.org/html/rfc2616#section-14.2
    https://www.w3schools.com/html/html_forms.asp
===============================================================================
"""

import cgitb; cgitb.enable()   # route python exceptions to browser/client

import cgi, sys, os, time

# fetch url reply, url-encode/decode text
if sys.version_info[0] >= 3:                          # py 3.X/2.X compatible
    from urllib.request import urlopen                # *urlopen not yet used*
    from urllib.parse import quote_plus, unquote_plus
else:                                     
    from urllib import urlopen, quote_plus, unquote_plus

# Jan 2016: save search terms to this server file for metrics
SAVETERMSFILE = 'sitesearch-savedterms.txt' 

# testing
MOCK  = False    # True=simulate form/url-query inputs
DEBUG = False    # True=display url without submitting it


#------------------------------------------------------------------------------
# Fetch and map inputs to url parts
#------------------------------------------------------------------------------


searchsites = {
    'Entire site':   'learning-python.com',            # Jun 2017: now Entire only
    'Books only':    'learning-python.com/books',      # [url-encodes '/' ahead]
   #'Training only': 'learning-python.com/training'    # [not yet (ever) supported]
    }

searchhosts = {
    'DuckDuckGo': 'duckduckgo.com',          # no tracking, ads, or images
    'Google':     'www.google.com',          # all of the above + Back breaks...
    'Bing':       'www.bing.com',            # ads + images + less fruitful?
    'StartPage':  'startpage.com',           # google results, no tracking
    'Yahoo':      'search.yahoo.com',        # you be the judge
    'Baidu':      'www.baidu.com',           # Jun 2018: popular in China
    'Yandex':     'www.yandex.com',          # Jun 2018: popular in Russia 
   #'Ixquick':    'ixquick.com',             # [Jun 2018: now redirects to StartPage]
    }


# Fetch inputs 
# from html form fields, url query parameters, or mock-up

if not MOCK:
    form = cgi.FieldStorage()                # parse live form/url input data
else:
    class Mock:                              # simulate form input to test
        def __init__(self, value):
            self.value = value

    form = {'searchsite': Mock('Entire site'),
            'searchterm': Mock('"class decorator" ☞'),
            'searchhost': Mock('Google')}

    # => Location: https://www.google.com/
    #      search?q=%22class+decorator%22+%E2%98%9E+site%3Alearning-python.com


# Map inputs
# missing and invalid keys in url queries trigger cgitb KeyError displays

site = searchsites[form['searchsite'].value]
host = searchhosts[form['searchhost'].value]

if 'searchterm' in form:                # but handle missing term: starts empty
    term = form['searchterm'].value     # py 2.4 has no 'a if b else c' expr
else:
    term = ''


# Jun 2018: drop 'atag' prefix in favor of per-provider patterns,
# because Baidu and Yandex URLs are too custom for prior scheme;
# the former "/" before an empty atag is optional (ddg adds one);
#
# urltemplate = 'https://%(host)s/%(atag)s?q=%(term)s+%(pref)s%(site)s'
#

urlgeneric = 'https://%(host)s?q=%(term)s+%(pref)s%(site)s'

urltemplates = {
    'DuckDuckGo':   urlgeneric, 

    'Google':      'https://%(host)s/search?q=%(term)s+%(pref)s%(site)s',  

    'Bing':         urlgeneric,

    'StartPage':   'https://%(host)s/do/search?q=%(term)s+%(pref)s%(site)s',

    'Yahoo':       'https://%(host)s/search?q=%(term)s+%(pref)s%(site)s',

    'Baidu':       'https://%(host)s/s?ie=utf-8&wd=%(term)s+%(pref)s%(site)s',

    'Yandex':      'https://%(host)s/search/?text=%(term)s+%(pref)s%(site)s'
} 

urltemplate = urltemplates[form['searchhost'].value]


# url-escape text input, site '/', label ':'
term = quote_plus(term) 
site = quote_plus(site)                 # '"A/B C:D E"' -> '%22A%2FB+C%3AD+E%22'
pref = quote_plus('site:')              # unquote_plus() reverses this

# expand template
searchurl = urltemplate % vars()        # Jun 2018: now table-driven


#------------------------------------------------------------------------------
# Pass 'term site:xxx' query to providers
#------------------------------------------------------------------------------


# fallback option (or via HTTP Refresh header)
# Jun 2018: unescape link displayed to user (only)

manualredirect = """
<HTML><HEAD>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<META http-equiv="Refresh" content="5; URL=%s">
</HEAD>
<BODY><FONT face=Arial>
<P>Redirecting to search host...
<P>If this fails, please click this link instead:
<PRE>   
    <B><A HREF="%s">%s</A></B>
</PRE>
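</P></BODY></HTML>""" % (
    searchurl, searchurl,
    # HTML-escape the unescaped link text: it contains raw user input
    unquote_plus(searchurl)
        .replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;'))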


if not term:
    # error check, custom reply
    print('Content-type: text/plain\n')        # start reply: hdr + blankline=\n
    print('Please provide a search term in field "Search for this:".')
    
elif DEBUG:
    # display built link only
    print('Content-type: text/plain\n')        # start reply: hdr + blankline=\n
    print(searchurl)

else:

    #--------------------------------------------------------------------
    # valid and live: redirect client to search provider site/page;
    # hosting site auto generates a 3xx status header if "Location:";
    # print HTML reply too, in case client doesn't redirect (unlikely);
    #--------------------------------------------------------------------
    
   #print('HTTP/1.1 302 Found')                # added by host if 'Location:'
    print('Location: %s' % searchurl)          # cgi http redirect header
   #print('Connection: close')                 # this seems optional or auto
    print('Content-type: text/html')           # reply = hdrs + blankline + html
    print('')                                  # need '' for 2.X, else a tuple!
    print(manualredirect)                      # HTML page if redirect fails
    
    #--------------------------------------------------------------------
    # Jan 2016: save search term (pre-encode) to flat file on server;
    # open in append mode for possibly concurrent web access: O_APPEND
    # positions each write at end-of-file (no O_EXCL: it is not a lock,
    # and its effect is undefined without O_CREAT);
    # encode str to bytes for Py 3.X, no-op on Py 2.X unless non-ascii;
    #--------------------------------------------------------------------
    
    try:
        filename = SAVETERMSFILE
        message = '[%s] [%s]' % (time.asctime(), unquote_plus(term))   # jan25
        if not os.path.exists(filename):
            open(filename, 'w').close()    # make file first time
            
        fd = os.open(filename, os.O_APPEND | os.O_WRONLY)
        line = (message + os.linesep).encode('utf8')
        os.write(fd, line)
        os.close(fd)
    except:
        pass   # neither server nor client care if this fails


