File: cksums.py

#!/usr/bin/env python3
"""
=======================================================================
Demo Python's hashlib, by comparing the relative speeds of modtime
and MD5-hash (a.k.a. checksum) alternatives for file-change detection
in content-sync programs like Mergeall (learning-python.com/mergeall).

To compare, this walks a folder tree and emulates both schemes' code
for each file along the way.  Pass one command-line argument - the 
path of the root of the folder tree to walk.

Attrib:  © M. Lutz (learning-python.com), June 2022
License: Provided freely, but with no warranties of any kind.  

--------
FINDINGS
--------

Both modtimes and hashes support change detection without full 
content compares.  However, while modtimes have interoperability
issues, hashes are wholly unusable for Mergeall, for two reasons:

Accuracy
   Hashes are probabilistic.  Though very rare, they may yield 
   duplicates (collisions) that cause changes to go unnoticed.

Speed
   Hashes are prohibitively expensive.  They require full reads that
   make them far too slow to be used for larger content collections.

The second of these is a full showstopper for Mergeall.  Per the 
200G content test's output at the end of this file, hashes can be 
~100x slower than modtimes--requiring some 10 minutes (600 seconds) 
for a tree scan, versus modtimes' 6 seconds.  This is impractical 
for on-demand syncs of non-trivial content, of the sort Mergeall 
is designed to handle.

In their defense, hashes may suffice for smaller content and single
files transferred over a network, and might be acceptable for syncs
run overnight.  Moreover, modtimes have some well-known portability 
caveats, including FAT32 skew, and subpar support in some contexts.
Even so, modtimes' 6-second showing still easily beats waiting a 
full 10 minutes for a hashes-based differences scan, especially on 
phones where battery power is a crucial resource.

Takeaways: Mergeall is fast because it uses modtimes instead of
checksums.  Platforms must support modtimes to enable fast syncs.

---------
MORE INFO
---------

For more on Mergeall's use case, as well as typical modtime issues 
in FAT32, some drives, older Androids, and Linux, try these:

https://learning-python.com/
    mergeall-products/unzipped/UserGuide.html#what
    mergeall-products/unzipped/UserGuide.html#dst
    mergeall-android-scripts/_README.html#toc22
    mergeall-android-scripts/_README.html#toc21
    post-release-updates.html#mergealllinuxmodtimes

For more on hashlib and checksums in general:
    https://docs.python.org/3/library/hashlib.html
    https://en.wikipedia.org/wiki/Cryptographic_hash_function
    (and search the web for "md5", "checksum," and the like)
=======================================================================
"""

import time, hashlib, sys, os



def walker(op, root):
    """
    -----------------------------------------------
    Walk folder tree at root, running function op
    on each file in the tree along the way.  op 
    takes one arg - a file's pathname.  Returns
    total walk time, last file's op result, #files.
    This skips symlinks; real code would read them.
    -----------------------------------------------
    """
    visit = 0
    start = time.perf_counter()
    for (dir0, subs0, files0) in os.walk(root):
        for file in files0:
            path = os.path.join(dir0, file)
            if not os.path.islink(path):
                visit += 1
                last = op(path)
    elap = time.perf_counter() - start
    return elap, last, visit



def mtime(path):
    """
    -----------------------------------------------
    Emulate modtime-based diff check - fetch and
    compare FROM and TO files' modtime floats. 
    This requires just two quick os.stat calls.
    Much simpler and ~100x faster than cksum(),
    though OS caching might help modestly here.

    This is roughly what Mergeall does (e.g., it 
    uses os.lstat() to avoid os.path.*() calls), 
    but it also uses a 2-second granularity in 
    modtime checks to support FAT32 dives; applies 
    Unicode normalization to filenames; and checks
    file sizes from the already-fetched stat result
    only when modtimes are the same.
    -----------------------------------------------
    """
    # check modtimes of both sides
    s1 = os.stat(path)
    s2 = os.stat(path)
    test = (s1.st_mtime == s2.st_mtime)    # modtimes
    test = (s1.st_size  == s2.st_size)     # filesizes
    return s2.st_mtime



def cksum(path, hasher=hashlib.md5, csize=4096, cache=True):
    """
    -----------------------------------------------
    Emulate cksum-based diff check - test a FROM
    file's saved prior hash against its new hash
    (alternative: compare new hash to TO's prior).
    This requires reading one of the two trees in 
    full, and is essentially half a diffall scan.

    If cache, emulates prior-hash fetch by reading 
    a file; real code may index a shelve/pickle.
    Set hasher to try alts; e.g., hashlib.sha256
    may be less prone to hash collisions, but was
    also ~25% slower in testing.  Use csize/cache 
    to vary IO ops; the speed diff seems trivial.
    -----------------------------------------------
    """
    hash = hasher()
    temp = __file__ + '.hash'

    # get prior hash of this file
    if cache and os.path.exists(temp):
        prior = open(temp, 'rb').read()
    else:
        prior = b'0' * hash.digest_size

    # calc this file's new hash
    file = open(path, 'rb')
    while True:
        chunk = file.read(csize)
        if not chunk: break
        hash.update(chunk)
    code = hash.digest()
    test = (code == prior)

    # save new hash of this file
    if cache:
        save = open(temp, 'wb')
        save.write(code)
        save.close()
    return code
    


#----------------------------------------------
# Main - apply two alts to same folder tree.
# Print bytes hashes nicely, by 4-byte groups.
#----------------------------------------------

if __name__ == '__main__':
    assert len(sys.argv) == 2 and os.path.isdir(sys.argv[1])
    stuff = sys.argv[1]
    print('walking:', stuff)

    elaps = []
    for test in (mtime, cksum):
        elap, last, num = walker(test, stuff)
        print('\n', test.__name__, '=>')
        print('\telapsed time =', '%.4f' % elap)
        print('\tnumber files =', num)
        print('\tlast result  =', 
                 last.hex('_', 4) if isinstance(last, bytes) else last)
        elaps.append(elap)

    print('\ncksum:modtime ratio = %.2f' % (elaps[1] / elaps[0]))



#----------------------------------------------
# Output - macos, 200G 174k-file ~/MY-STUFF.
# Note: cache=False has no noticeable effect.
#----------------------------------------------

"""
$ cd ~/MY-STUFF/Code/etc
$ python3 cksums.py ~/MY-STUFF
walking: /Users/me/MY-STUFF

 mtime =>
	elapsed time = 5.4100
	number files = 174639
	last result  = 1641914113.1329055

 cksum =>
	elapsed time = 586.3275
	number files = 174639
	last result  = 07120ede_83dbad58_46b0f6cf_1ca45325

cksum:modtime ratio = 108.38

$ python3 cksums.py ~/MY-STUFF
walking: /Users/me/MY-STUFF

 mtime =>
	elapsed time = 7.2121
	number files = 174639
	last result  = 1641914113.1329055

 cksum =>
	elapsed time = 587.9185
	number files = 174639
	last result  = 07120ede_83dbad58_46b0f6cf_1ca45325

cksum:modtime ratio = 81.52
"""