#!/usr/bin/python # -*- coding: UTF-8 -*- u""" =============================================================================== sitesearch.py - implement local-pages search for a website Synopsis: given search parameters in an HTML form or URL, build a site-specific search query string and send it to a search provider with an HTTP redirect. Author/Copyright: 2016-2020, M. Lutz (learning-python.com) License: provided freely, but with no warranties of any kind. Versions (search on mmm-yyyy for changes): Nov 18, 2020 - Error message if parameters absent|invalid, not status 500. Assorted documentation edits/improvements, in this docstr. Sep 26, 2020 - Fix Py3.X+Linux redirect-page print excs on non-ASCII terms. Fix log-file write excs in Py2.X for non-ASCII term encodes. Try but skip alt format for non-ASCII Location search URLs. Add a few privacy-focused newcomers to host selection lists. Jun 28, 2019 - Document and improve the privacy and permissions of the search-terms file saved on the server. Jun 28, 2018 - Improve and describe handling of Unicode in search terms, remove Ixquick (it's now StartPage), add Baidu and Yandex, unescape URL displayed to user in UTF-8 fallback page. Jan 25, 2016 - Log search terms to a server file for metrics, with date and time. For privacy, set this file's Unix permissions to 0200 (or 0600) per "ABOUT PERMISSIONS..." ahead. Jan 15, 2016 - Initial release. ------------------------------------------------------------------------------- OVERVIEW ------------------------------------------------------------------------------- This is a Python CGI script: it runs on a web server, reads form (or URL query) inputs, and prints HTTP headers + HTML text back to the client. To search, this builds a query URL and submits and delegates it to a search provider via an HTTP redirect header. Anonymous search terms are also saved to a file for metrics, and the reply includes a fallback HTML page with redirection options. This CGI script is normally invoked by the "action" tag of the form in the companion HTML page at: http://learning-python.com/sitesearch.html Use your browser's "View Source" to see the form in this HTML page's code; the linkage in the HTML file looks like this:
As usual, this script can also be invoked from a browser or script using a GET-style URL with query parameters at the end like this (sans line breaks): http://learning-python.com/cgi/sitesearch.py? searchhost=Google&searchsite=Entire+site&searchterm=fortran To use for other sites: - Edit the HTML form page's content (or copy its ) - Edit the HTML form's site-select list (name=searchsite) - Edit the searchsites dictionary below accordingly Tradeoffs: this script's scheme exits the hosting site for search results, in exchange for a relatively simple way to add multiple-provider search that requires neither JavaScript nor proprietary widgets. Examples of mappings from inputs to query URL: 1) Search entire site: user inputs = [site=Entire site, term=spam, host=Google] search string = "spam site:learning-python.com" URL = https://www.google.com/search?q=spam+site%3Alearning-python.com 2) Search individual parts (if any, currently unused here): user inputs = [site=Books only, term=decorator, host=DuckDuckGo] search string = "decorator site:learning-python.com/books" URL = https://duckduckgo.com?q=decorator+site%3Alearning-python.com%2Fbooks 3) Search for multi-word terms (users may add "..." to focus results) user inputs = [site=Entire site, term=Pandora's box, host=Bing] search string = "Pandora's box site:learning-python.com" URL = https://www.bing.com?q=Pandora%27s+box+site%3Alearning-python.com With user-provided quotes: user inputs = [site=Entire site, term="Pandora's box", host=Bing] search string = '"Pandora\'s box" site:learning-python.com' URL = https://www.bing.com?q=%22Pandora%27s+box%22+site%3Alearning-python.com ------------------------------------------------------------------------------- GENERAL NOTES ------------------------------------------------------------------------------- Usage note: although search providers can be selected for comparison, DuckDuckGo or StartPage are strongly recommended. Other providers may insert ads and unrelated photos in results and track searchers, and Google occasionally disables the Back button. See the HTML page for more details. Coding note: uses \n instead of \r\n for line breaks, because all known clients accept it; print adds a \n by default; and Windows may expand \n to \r\n automatically, which could change an explicit \r\n to \r\r\n. See https://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.3. Coding note: this script is portable to both Python 2.X and 3.X, but its code is constrained by the need to run on Python 2.4 -- the most recent Python at the ISP hosting this site (GoDaddy). OTOH, 2.4 works fine, which begs the question: were all the Python changes since 2004 really that important?... UPDATE, Jun-2017: learning-python.com is now a single site/part; the former "Books only" in both code here and older examples is defunct but harmless (it's disabled altogether in the input form's HTML - no changes made here). UPDATES, Jun-2018: Ixquick has been merged into StartPage (and is disabled in the form's HTML); this site's host now runs Python 2.6 by default (though this has no impact here); Baidu and Yandex have been added to the list of supported providers; and Unicode handling is modestly improved (per below). UPDATE, Sep-2020: this script now runs on an AWS Lightsail VPS host which uses Python 3.5; multiple Unicode issues had to be fixed as a result (see ahead). The search-host selection list also now includes newer privacy-focused sites. ------------------------------------------------------------------------------- ABOUT PERMISSIONS AND PRIVACY OF THE SEARCH-TERMS FILE ------------------------------------------------------------------------------- This script logs search terms to a server file for metrics, adding date and time but no user information. For best privacy, set this file's Unix permissions to 0200 (and possibly omit it from its directory's index). Permission 0200 allows this script to update the file, but prevents the file from being accessed on the web at large -- by either direct URL, or general file-viewer scripts (e.g., this site's cgi/showcode.py). Permission 0600 suffices for sites with no file-viewer script (it prevents direct-URL access only), and may be required to download the terms file; change as needed. Use permission 0755 (executable) for this script itself. ------------------------------------------------------------------------------- ABOUT SEARCH TERMS WITH ARBITRARY UNICODE CHARACTERS ------------------------------------------------------------------------------- TBD: this script may or may not support non-ASCII Unicode in search terms. While it could encode the redirect URL to something like UTF-8 bytes, it's not clear that servers would use this in "Location:" lines, even given a content line of "Content-type: text/html; charset=UTF8". Resolve me. UPDATE, Jun-2018: for non-ASCII characters (e.g., BMP symbols and emojis), the HTML form that triggers this script now uses an [accept-charset="UTF-8"] attribute to force inputs to be encoded to UTF-8 bytes and URL-escaped ("%XX") for transmission (a charset meta tag for the entire page is now set as well). This is an improvement, not a complete fix, though the reasons are subtle. Before this, browsers would send characters outside the default charset oddly: as URL-escaped HTML entities. For example, a user's searchterm input "aaa☞bbb" would be sent from the browser as "aaa%26%239758%3Bbbb", and become "aaa☞bbb" after URL unescaping. URL-escaping this again here for the redirect would recreate the browser's HTML-entity string, which only Google seemed to recognize well (StartPage returned a few hits, but all other search sites listed here returned all or no pages in response). With the new accept-charset attribute, a searchterm input "aaa☞bbb" is now instead sent from browsers as "aaa%E2%98%9Ebbb" (the URL-escaped version of Python bytes-string b'aaa\xe2\x98\x9ebbb'), which URL-unescaping turns into the original input string, and URL-reescaping here restores. This avoids having to use an HTML parser to munge entities, if required. Unfortunately, though, its only real benefit today seems to be making the term clearer in the generated URL. Google and StartPage are still the only search sites that support such search terms reliably (HTML-entity or not), and Google seems to be the better of the two; Baidu and Yandex support varies per character, and others ignore the search term altogether. Hence, barring a future revelation (and time to stumble onto one), search users should try Google first for most searches using non-ASCII characters. Alas, Unicode on the web is still riddled with intellectual black holes. This very note required the charset declaration in line 2 above for the host's Python 2.X (though 3.X's UTF-8 default happily groks any and all ☞). For background details on such dark matters, try: https://www.w3schools.com/tags/att_form_accept_charset.asp https://tools.ietf.org/html/rfc2616#section-14.2 https://www.w3schools.com/html/html_forms.asp UPDATE: the next section's item #3 explores a later failed try on this front. ------------------------------------------------------------------------------- SEP-2020: THREE NON-ASCII IMPLEMENTATION UPDATES ------------------------------------------------------------------------------- Moving to a Python 3.X host uncovered additional issues with non-ASCII terms. Search for "SEP-20" here for all of the changes applied for issues fixed: 1) Fix silently failing "print(manualredirect)" in 3.X for non-ASCII terms When using Python 3.X on Linux, this fallback-page print near the end of the script raised an exception for non-ASCII search terms. This was not noticed in normal usage because this text followed the redirection headers (it was meant only for redirect failures), and clients had already redirected to search results before the exception text was seen. The later save of the search term to the local log file, however, had not been reached or run for non-ASCII terms ever since the host website moved to a server with Python 3.X (and earlier non-ASCII results are unknown). TO FIX, the fallback-page text is now encoded to UTF-8 bytes in 3.X before printing, and printed as bytes; this prevents automatic ASCII encoding for str in print(). A "charset" header parameter was also added for the UTF-8 encoding. With the fix, the log file now records non-ASCII terms (even for emojis), and the redirect page both properly displays decoded non-ASCII text and redirects to the search engine (if ever needed). THE CAUSE: the print's failure stems from the fact that locale environment settings used at the terminal are not available in the CGI context, which causes stdout encoding to default to ASCII in CGI scripts *only*. For more background on this, search for "NON-ASCII FILENAMES, CAUSE" in this site's similarly impacted https://learning-python.com/cgi/showcode.py. 2) Fix exceptions in 2.X (only) on log-file-line encode of non-ASCII terms A second but unrelated Unicode issue: the log-file line write() at the end of the script could fail for non-ASCII terms in 2.X (only), because the line is an already-encoded-bytes str in 2.X (only), which bombs in the manual encode. It's a decoded-codepoints str in 3.X, which does require the encode prior to the low-level file write. TO FIX, the encoding is now skipped for 2.X (only). Alas, 2.X/3.X Unicode portability is so subtle that it may not be worth the effort today. 3) Try alternative non-ASCII term formatting added to showcode - AND PUNT This script experimented with using the HTTP header parameter formatting applied to fix non-ASCII filenames in the raw-text replies of showcode.py. For the full story on this scheme, search for "Sep-2020" in: https://learning-python.com/cgi/showcode.py In showcode.py, this yields properly encoded-and-then-escaped filenames, which prevent print() exceptions in Python 3.X on Linux, and is better for clients in general: Content-Disposition: inline; filename*=UTF-8''%c2%a3%20spam%20%e2%82%ac%20eggs Here, applying this same alternative formatting yields Location headers like this (the first of these searches for a smiley emoji, the second for '真Л⇨'): Location: UTF-8''https%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3D%F0%9F%98%8A%2Bsite%3Alearning-python.com Location: UTF-8''https%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3D%E7%9C%9F%D0%9B%E2%87%A8%2Bsite%3Alearning-python.com This differs from the original's encoded-and-then-escaped parts: Location: https://www.google.com/search?q=%F0%9F%98%8A+site%3Alearning-python.com Location: https://www.google.com/search?q=%E7%9C%9F%D0%9B%E2%87%A8+site%3Alearning-python.com The new alternative formatting DID NOT WORK here for the value of the Location header (as coded and tested, at least). In all cases, the alternative formatting produced "Not Found" reply pages in the client. This is perhaps because the standard seems for HTTP header _parameter_ values, not the entire header value; the net effect appears to invalidate the redirect altogether. Hence, this script is sticking with its original formatting scheme of URL-encoding individual parts of the redirect URL. The good news is that this seems to work on most search engines listed here for non-emoji non-ASCII terms such as '真Л⇨' (though emoji support outside Google is still spotty). Also in the browser-dependence column: Yandex now calls out 'automated' searches like those here, and Bing may now require multiple Backs to exit its search-results page (though only in desktop Firefox?); sigh... =============================================================================== """ #################### # CODE STARTS HERE # #################### from __future__ import print_function # enable py3.X print() in py2.X import cgitb; cgitb.enable() # route python exceptions to browser/client (?) import cgi, sys, os, time UsingPython3 = sys.version[0] == '3' UsingPython2 = sys.version[0] == '2' # fetch url reply, url-encode/decode text if UsingPython3: # py 3.X/2.X compatible from urllib.request import urlopen # *urlopen not yet used* from urllib.parse import quote_plus, unquote_plus else: from urllib import urlopen, quote_plus, unquote_plus # Jan-2016: save search terms to this server file for metrics SAVETERMSFILE = 'sitesearch-savedterms.txt' # testing MOCK = False # True=simulate form/url-query inputs (testing) DEBUG = False # True=display url without submitting it (testing) #============================================================================== # Fetch and map inputs to url parts #============================================================================== searchsites = { 'Entire site': 'learning-python.com', # Jun-2017: now Entire only 'Books only': 'learning-python.com/books', # [url-encodes '/' ahead] #'Training only': 'learning-python.com/training' # [not yet (ever) supported] } searchhosts = { 'DuckDuckGo': 'duckduckgo.com', # no tracking, ads, or images 'Google': 'www.google.com', # all of the above + Back breaks... 'Bing': 'www.bing.com', # ads + images + less fruitful? 'StartPage': 'startpage.com', # google results, no tracking 'Yahoo': 'search.yahoo.com', # you be the judge 'Baidu': 'www.baidu.com', # Jun-2018: popular in China 'Yandex': 'www.yandex.com', # Jun-2018: popular in Russia 'Privado': 'www.privado.com', # Sep-2020: privacy-focused newcomer 'OneSearch': 'www.onesearch.com', # Sep-2020: privacy-focused newcomer #'Ixquick': 'ixquick.com', # [Jun-2018: now redirects to StartPage] } #------------------------------------------------------------------------ # Fetch inputs from html form fields, url query parameters, or mock-up. #------------------------------------------------------------------------ if not MOCK: form = cgi.FieldStorage() # parse live form/url input data else: class Mock: # simulate form input to test def __init__(self, value): self.value = value form = {'searchsite': Mock('Entire site'), 'searchterm': Mock('"class decorator" ☞'), # or 真Л⇨'), 'searchhost': Mock('Google')} # => Location: https://www.google.com/ # search?q=%22class+decorator%22+%E2%98%9E+site%3Alearning-python.com #------------------------------------------------------------------------ # Map inputs to query URL values. # # Missing and invalid keys in url queries trigger cgitb KeyError displays. # This script started on py 2.4, which has no 'a if b else c' expr; that # required some arguably verbose coding (which has largely faded today). # # Nov-2020: really, missing site/host trigger 500 status replies today # (Internal Server Error); to stop google-search nastygrams about it, # check explicitly, and send an explicit error-message reply if empty. # searchterm was formerly handled this way, but the other two were not; # site and host cannot be empty in the HTML form, but can in URL queries. # Also now catches invalid key values for passed keys, using dict.get() # (e.g., previously was: site = searchsites[form['searchsite'].value] ). #------------------------------------------------------------------------ site = host = term = '' # missing term, and others: error reply ahead if 'searchsite' in form: site = searchsites.get(form['searchsite'].value, '') if 'searchhost' in form: host = searchhosts.get(form['searchhost'].value, '') if 'searchterm' in form: term = form['searchterm'].value #----------------------------------------------------------------------- # Map search provider to query URL pattern. # # Jun-2018: Drop 'atag' prefix in favor of per-provider patterns, # because Baidu and Yandex URLs are too custom for prior scheme. # The former "/" before an empty atag is optional (ddg adds one). # urltemplate = 'https://%(host)s/%(atag)s?q=%(term)s+%(pref)s%(site)s' # # Sep-2020: Yandex now may detect and call out automated search URLs # (requiring a user verify), and Bing may require multiple Backs to # exit. Newly added: Privado and OneSearch are both privacy-focused # newcomers (there may be more like them). Regrettably, most here # may show sponsored ads first - even those pushing privacy. Bing # also now uses same format as Google, but auto maps urlgeneric to it. #----------------------------------------------------------------------- urlgeneric = 'https://%(host)s?q=%(term)s+%(pref)s%(site)s' # keys should match searchhosts above urltemplates = { 'DuckDuckGo': urlgeneric, 'Google': 'https://%(host)s/search?q=%(term)s+%(pref)s%(site)s', 'Bing': urlgeneric, # sep-2020: now == google, but generic ok 'StartPage': 'https://%(host)s/do/search?q=%(term)s+%(pref)s%(site)s', 'Yahoo': 'https://%(host)s/search?q=%(term)s+%(pref)s%(site)s', 'Baidu': 'https://%(host)s/s?ie=utf-8&wd=%(term)s+%(pref)s%(site)s', 'Yandex': 'https://%(host)s/search/?text=%(term)s+%(pref)s%(site)s', 'Privado': urlgeneric, # Bing-powered 'OneSearch': urlgeneric # caveat: requires an extra click } # Nov-2020: allow missing here too if 'searchhost' in form: urltemplate = urltemplates.get(form['searchhost'].value, '') else: urltemplate = '' #----------------------------------------------------------------------- # Encode and escape parts, expand URL template. # # URL-escapes the search-term text input, site '/', and label ':'. # This escapes individual parts of the redirect URL, not all of it. # Also explicitly encodes term's str codepoints to utf8 bytes in py3.X; # quote_plus() does this auto in 3.X only, but that's far too implicit. # quote_plus() also always returns an all-ASCII _str_, for any input. #----------------------------------------------------------------------- # Sep-2020: formerly relied on quote_plus() utf8 default in py3.X (only) if UsingPython3: term = term.encode('utf8') # it's already encoded bytes in 2.X term = quote_plus(term) # pass encoded bytes in both 3.X and 2.X site = quote_plus(site) # '"A/B C:D E"' -> '%22A%2FB+C%3AD+E%22' pref = quote_plus('site:') # unquote_plus() reverses the %-encoding # expand template searchurl = urltemplate % vars() # Jun-2018: now table-driven #============================================================================== # Pass formatted 'term site:xxx' query to providers #============================================================================== #------------------------------------------------------ # Fallback-option page (or via HTTP Refresh header). # # Caution: the body's HTML may contain non-ASCII text. # Jun-2018: unescape link displayed to user (only). # Noc-2020: drop blank line #2 at top, add one at end. #------------------------------------------------------ manualredirect = """

Redirecting to search host...

If this fails, please click this link instead:

   
    %s

\n""" % (searchurl, searchurl, unquote_plus(searchurl)) #------------------------------------------------------------------------ # Error checks, custom replies #------------------------------------------------------------------------ if not site and not host and not term: # Nov-2020: check more inputs print('Content-type: text/plain\n') # start reply = hdr + blankline=\n print('Search parameters missing or invalid.') # text to show in browser, if any elif not site: print('Content-type: text/plain\n') print('Search site missing or invalid.') elif not host: print('Content-type: text/plain\n') print('Search host missing or invalid.') elif not term: print('Content-type: text/plain\n') print('Please provide a search term.') # traps just missing for term elif DEBUG: # disply built link only print('Content-type: text/plain\n') # start reply = hdr + blankline=\n print(searchurl) # with encoded-and-escaped term else: #-------------------------------------------------------------------- # Valid and live: redirect client to a search-provider's site/page. # # Send the client an HTTP reply = [headers + blank line + HTML page] # having a 'Location:' header line with a formatted search-query URL. # The hosting site auto generates a 3xx status header if 'Location:'. # Print HTML reply too, in case client doesn't redirect as planned. #-------------------------------------------------------------------- #print('HTTP/1.1 302 Found') # <= added by host if 'Location:' print('Location: %s' % searchurl) # cgi http redirect header #print('Connection: close') # <= this seems optional or auto print('Content-type: text/html; charset=UTF-8') # Sep-2020: add charset for body print('') # '' else 2.X tuple sans __future__ #-------------------------------------------------------------------- # HTML page shown if 'Location:' header redirect fails (unlikely). # # Sep-2020: The original print() below produced a to-ASCII encoding # exception in Python 3.X on Linux for all non-ASCII search terms; # the redirect silenced the exception, but later log writes didn't # happen. To fix, 3.X encodes manually and writes as bytes instead. # Why: CGI scripts lack locale settings, so stdout defaults to ASCII. #-------------------------------------------------------------------- if UsingPython2: # py2.X: already encoded bytes print(manualredirect) # py2 stdout allows str of bytes else: # py3.X: manually encode str manualredirect = manualredirect.encode('utf8') # match charset header encoding try: sys.stdout.flush() # mar20: else headers last on 3.X sys.stdout.buffer.write(manualredirect) # py3 stdout is str: use io layer except: print('Page error', repr(sys.exc_info()[1]), file=sys.stderr) #-------------------------------------------------------------------- # Save search term (pre-encode) to flat file on server (Jan-2016). # # Open with exclusive lock for possibly concurrent web access. # Encode str to bytes for Py 3.X, no-op on Py 2.X unless non-ASCII. # Use permission 0200 for updates but inaccessibility on the web. # Sep-2020: don't encode line in 2.X - encoding fails for non-ASCII. # # NOTE: the O_EXCL flag does _not_ suffice to lock the terms file # for exclusive access on the web server. For tools that may, see # Python's os.lockf(), fctl.flock(), and fctl.lockf(). This script # assumes that writes are atomic (or won't overlap in practice). #-------------------------------------------------------------------- try: filename = SAVETERMSFILE message = '[%s] [%s]' % (time.asctime(), unquote_plus(term)) # jan25 if not os.path.exists(filename): open(filename, 'w').close() # make file first time fd = os.open(filename, os.O_EXCL | os.O_APPEND | os.O_WRONLY) # binary on Unix line = message + os.linesep if UsingPython3: line = line.encode('utf8') # Sep-2020: it's already encoded bytes in 2.X os.write(fd, line) os.close(fd) except: print('Log error', repr(sys.exc_info()[1]), file=sys.stderr) # to error_log pass # neither server nor client care if this fails # And the client reroutes to the search provider (which hopefully plays nice)