File: mergeall-products/unzipped/scandir_defunct.py
"""
================================================================================
scandir_defunct.py:
   Python 3.5+ os.scandir() comparison-phase optimization, no longer used [3.0].

The comparison-phase variant code here, added in 2.2, formerly was some 2X
faster (or better) on Windows and Linux, but also 2X slower on Mac OS X.  This
meant that both variants had to be maintained and tested redundantly -- a major
task.

Near the end of the 3.0 project, the non-scandir() variant was optimized to use
os.lstat() stat objects instead of os.path.*() calls, initially for new symlink
support.  This reduced the non-scandir() speed hit on Windows to 50%: runtimes
for a 60k-file archive were 10 seconds with scandir(), and 15 seconds without.
However, a final step of this recoding to avoid all extra stat calls made the
non-scandir() variant as fast as -- or slightly faster than -- this scandir()
variant on Windows too (at ~10 seconds).  This made the code here obsolete.
See ./docetc/miscnotes/3.0-mergeall-comparison-times-Windows.txt for timings.

On Mac OS X, this recoding to eliminate extra stat calls only improved the
non-scandir() performance.  For the same 60k-file archive, runtime fell from
9 seconds for the code here, to 4.5 seconds for the initial non-scandir()
variant, to just 3.8 seconds for the final non-scandir() variant.
See ./docetc/miscnotes/3.0-mergeall-comparison-times.txt for timing results.

These times are slightly better when mergeall is run directly, outside the
GUI launcher (for reasons TBD: the launcher's UTF8 stdout encoding or -u?):

    # scandir() version on Mac
    $ py3 ../mergeall.py /MY-STUFF /MY-STUFF -skipcruft -report | grep 'Phase'
    Phase runtime: 8.64699007209856

    # non-scandir() version on Mac
    $ py3 ../mergeall.py /MY-STUFF /MY-STUFF -skipcruft -report | grep 'Phase'
    Phase runtime: 3.5263971380190924

Hence, this file's code will no longer be used, maintained, or tested, but
is retained here indefinitely for reference and historical context only,
and to recreate the relative speed results that led to its demise.

The final lesson here seems to be that os.scandir() can speed up programs on
Windows and Linux (though not Mac OS X) which make many os.path.*() calls.
But it is no faster, and may even be slower, than code which uses explicit
os.stat()/os.lstat() calls and objects.  Given that the scandir() and stat
schemes require similarly special coding, os.scandir() seems no better in
general, and worse on Macs.  Using stat objects yields code that is arguably
more cluttered visually than os.path.*() calls, but also a huge performance
win.
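
For illustration only, a minimal sketch of the two styles weighed here (the
path, mtime, and size names are hypothetical):

    # os.path.*() style: each call may issue its own stat() system call
    if os.path.getmtime(path) == mtime and os.path.getsize(path) == size: ...

    # stat-object style: one os.lstat() call, reused for every attribute
    info = os.lstat(path)
    if int(info.st_mtime) == mtime and info.st_size == size: ...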
================================================================================
"""
#
# this code is exec()d instead of imported to avoid recursion
#
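# for example, the hosting module might run this file along these lines (a
# sketch only; the actual hook in mergeall.py may differ):
#
#     exec(open('scandir_defunct.py').read())   # runs in the caller's namespace
#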
# both should be False, except for testing
USETHISCODE = False # <= change to 1/True to enable comparison variant here
FORCEONMAC = False # <= change to 1/True to use on Mac, despite slowness
################################################################################
# COMPARISON PHASE: analyze trees
# Python 3.5 and later version => use faster os.scandir() objects [2.2]
################################################################################
#-------------------------------------------------------------------------------
# Use optimized variant if scandir() is available _and_ faster for comparisons:
#
# Windows:
#   In mergeall, scandir is faster - 5X to 10X before the 3.0 optimizations,
#   and 2X faster after: 7 secs vs 14 secs for a large 87G 58k-file archive.
#
# Linux:
#   The scandir() variant is 2X faster on Linux post 3.0 optimizations:
#   30 secs vs 60 secs for the same large archive (on a slow test machine).
#
# Mac OS X:
#   [3.0] BUT the scandir() variant is 3X SLOWER on Mac OS X! - it takes
#   6 secs vs 2 secs for the same 87G 58K-file archive (on a fast machine),
#   with the 3.0 coding optimizations described above.  DON'T USE ON MACS.
#
#   UPDATE: these results were seen before the os.lstat() recoding above,
#   after which the os.scandir() version is still 2X *slower* on Macs -- 9~10
#   secs vs 4~5 for an archive that grew to 65K files since the prior timings.
#   Using python.org's Python 3.5 on Mac OS X 10.11, only the run-time differs:
#
#       /Admin-Mergeall/kingston-savagex256g/feb-2-17$ diff \
#           noopt1--mergeall-date170202-time091326.txt \
#           opt2--mergeall-date170202-time092217.txt
#       0a1
#       > Using Python 3.5+ os.scandir() optimized variant.
#       4053c4054
#       < Phase runtime: 5.286043012980372
#       ---
#       > Phase runtime: 10.12333482701797
#
#   With 3.0 code, the scandir() variant is still 50% to 2X *faster* on Windows.
#
# [3.0] FINAL UPDATE: this scandir() variant is no longer faster on Windows,
# making this code moot; see this file's main docstring for the full verdict.
#
# Don't use float(sys.version[:3]) >= 3.5; scandir() can be a separate install.
#-------------------------------------------------------------------------------
if not hasattr(os, 'scandir'):            # older Pythons (2.7+): no os.scandir
    try:
        import scandir                    # check for PyPI package install
        os.scandir = scandir.scandir      # store in os, and assume same API
    except ImportError:
        pass                              # punt: use Py 3.4- (now all) versions
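
# For reference, the os.DirEntry interface assumed by the code below, common
# to both os.scandir() and the PyPI scandir fallback (an illustrative sketch):
#
#     for entry in os.scandir(somedir):      # generates os.DirEntry objects
#         entry.name, entry.path             # basename and joined pathname
#         entry.is_file(), entry.is_dir()    # often answered from cached data
#         entry.stat()                       # os.stat_result: st_mtime, st_size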
# [3.0] the non-scandir() version using os.lstat() is now faster everywhere

if (USETHISCODE
    and hasattr(os, 'scandir')                    # skip if not available
    and (not sys.platform.startswith('darwin')    # [3.0] not on Mac: slower!
         or FORCEONMAC)
    ):
"""
------------------------------------------------------------------------
[2.2] Custom comparison code using os.scandir() instead of os.listdir().
SEE NON-SCANDIR() VERSIONS in mergeall.py for docs not repeated here.
os.scandir() provides an object API (versus name lists), that can often
eliminate system calls for attributes such as type, modtime, and size.
This can speed mergeall's comparison phase by a factor of 5 to 10 on
Windows. This speedup is entirely due to os.scandir(), not 3.5 in
general, and can shave dozens of seconds off total runtimes for larger
archives (and more on slower devices).
POSIX systems also run quicker, according to this Python 3.5 change's PEP,
benchmark, and documentation. [UPDATE: Not so! => os.scandir() is faster
on Linux too, but SLOWER on Macs - 2X~3X slower in mergeall; see above.]
This speeds up the comparison phase only, but this phase always runs and
can dominate runtimes in typical runs on large archives with light changes.
The resolution phase was not changed, because it normally visits just a
handful of file paths (changed items only), and is bound by basic file
write/delete speed; system calls are minor component of its time.
Note: diffall.py might benefit from os.scandir() too for path names,
but it's less likely, as that script doesn't fetch modtimes or sizes,
and is bound by the time required to fully read all common files.
Note: scandir() is also available in a scandir module for older Pythons,
including 2.7, via a PyPI package. If installed, the code here imports
and uses it, and so should work with older Pythons too -- not just 3.5+.
CODING TBD: factor-out differing parts to avoid some redundant code?
As is, there are two versions to update and test for comparison changes.
In fact there were -- for 3.0 optimizations, cruft-file skipping, and
symlinks support.
But:
They seem too different in structure to factor well, and factoring
may slow either or both versions, negating this extension's goal.
And:
We need both, because the optimized variant is FASTER on Windows and
Linux, but SLOWER on Mac. This speed matters: a single version would
penalize one or two platforms for the comparison phase, which always
runs regardless of the number of changes in the trees. Tentative
lesson: redundancy is sometimes warranted in the name of optimization.
[3.0] FINAL STORY: in the end, the non-scandir() grew slightly quicker
on Windows too, by using stat objects to avoid all os.path.*() calls.
This made the scandir() 3.5+ optimized variant fully obsolete, and
removed the coding and testing redundancy. Consequently, this scandir()
variant has been moved out of file and kept here for reference only,
and will no longer be updated or tested in the future. Good riddance!
------------------------------------------------------------------------
"""

    if __name__ == '__main__':    # else prints twice
        trace(0, 'Using Python 3.5+ os.scandir() optimized variant.')

    # this decrufter flavor is used only here now
    from skipcruft import filterCruftDirentrys

    def comparedirs(direntsfrom, direntsto, dirfrom, dirto, uniques):
        """
        [2.2] Python 3.5+ custom version using faster os.scandir().
        Doesn't use set difference here, to maintain listing order.
        Note: dirfrom and dirto pathnames are still used by the resolver phase.
        See mergeall.py's non-scandir() version for docs on this function.
        """
        trace(2, 'dirs:', dirfrom, dirto)
        countcompare.folders += 1

        uniquefrom = [df.name for df in direntsfrom
                      if df.name not in (dt.name for dt in direntsto)]
        uniqueto   = [dt.name for dt in direntsto
                      if dt.name not in (df.name for df in direntsfrom)]

        if uniquefrom:
            uniques['from'].append((uniquefrom, dirfrom, dirto))
        if uniqueto:
            uniques['to'].append((uniqueto, dirfrom, dirto))

    def comparelinks(direntfrom, direntto, dirfrom, dirto, diffs):
        """
        [3.0] Python 3.5+ custom version using faster(?) os.scandir().
        See mergeall.py's non-scandir() version for docs on this function.
        """
        name = direntfrom.name

        # compare link path strs
        linkpathfrom = os.readlink(FWP(direntfrom.path))
        linkpathto   = os.readlink(FWP(direntto.path))
        if linkpathfrom != linkpathto:
            diffs.append((name, dirfrom, dirto, 'linkpaths'))

    def comparefiles(direntfrom, direntto, dirfrom, dirto, diffs,
                     dopeek=False, peekmax=10):
        """
        [2.2] Python 3.5+ custom version using faster(?) os.scandir().
        Uses stat() objects for sizes+times, perhaps avoiding system calls.
        Note: dirfrom and dirto pathnames are still used by the resolver phase.
        See mergeall.py's non-scandir() version for docs on this function.
        """
        trace(2, 'files:', direntfrom.path, direntto.path)

        def modtimematch(statfrom, statto, allowance=2):    # [1.3] 2 seconds for FAT32
            time1 = int(statfrom.st_mtime)                  # [2.2] os.stat_result object
            time2 = int(statto.st_mtime)
            return (time1 - allowance) <= time2 <= (time1 + allowance)
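
        # FAT32 records modtimes with just 2-second resolution, so the same
        # file copied between file systems can report timestamps up to 2
        # seconds apart; hence the allowance above, instead of exact equality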

        countcompare.files += 1
        startdiffs = len(diffs)
        name     = direntfrom.name      # same name in from and to
        statfrom = direntfrom.stat()    # call stat() just once
        statto   = direntto.stat()

        if not modtimematch(statfrom, statto):                 # try modtime 1st
            diffs.append((name, dirfrom, dirto, 'modtime'))    # the easiest diff
        else:
            sizefrom = statfrom.st_size                        # try size next
            sizeto   = statto.st_size                          # unlikely case
            if sizefrom != sizeto:
                diffs.append((name, dirfrom, dirto, 'filesize'))

            elif dopeek:                                       # try start+stop bytes
                peeksize = min(peekmax, sizefrom // 2)         # scale peek to size/2
                filefrom = open(FWP(direntfrom.path), 'rb')    # sizefrom == sizeto
                fileto   = open(FWP(direntto.path), 'rb')

                if filefrom.read(peeksize) != fileto.read(peeksize):
                    diffs.append((name, dirfrom, dirto, 'startbytes'))
                else:
                    filefrom.seek(sizefrom - peeksize)
                    fileto.seek(sizeto - peeksize)
                    if filefrom.read(peeksize) != fileto.read(peeksize):
                        diffs.append((name, dirfrom, dirto, 'stopbytes'))

                filefrom.close()
                fileto.close()

        return len(diffs) == startdiffs    # true if did not differ, else extends diffs

    def comparetrees(dirfrom, dirto, diffs, uniques, mixes,
                     dopeek, skipcruft, skip=None):
        """
        [2.2] Python 3.5+ custom version using faster(?) os.scandir().
        Doesn't use set intersection here, to maintain listing order.
        See mergeall.py's non-scandir() version for docs on this function.
        """
        trace(2, '-' * 20)
        trace(1, 'comparing [%s] [%s]' % (dirfrom, dirto))

        def excludeskips(direntsfrom, direntsto, skip):    # [3.0] pre-skipcruft filter
            if not skip:
                return
            for dirents in (direntsfrom, direntsto):
                matches = [dirent for dirent in dirents if dirent.name == skip]
                if matches:
                    assert len(matches) == 1
                    trace(1, 'excluding', matches[0].path)
                    dirents.remove(matches[0])

        # get dir content lists here                    # [2.2] os.DirEntry objects iterator
        direntsfrom = list(os.scandir(FWP(dirfrom)))    # [2.2] need list() for multiple scans!
        direntsto   = list(os.scandir(FWP(dirto)))      # or pass bytes? -- would impact much
        excludeskips(direntsfrom, direntsto, skip)

        # [3.0] filter out system metadata files and folders
        if skipcruft:
            direntsfrom = filterCruftDirentrys(direntsfrom)
            direntsto   = filterCruftDirentrys(direntsto)

        # compare dir file name lists to get uniques
        comparedirs(direntsfrom, direntsto, dirfrom, dirto, uniques)

        # analyze names in common (same name and case)
        trace(2, 'comparing common names')
        common = [(df, dt) for df in direntsfrom
                           for dt in direntsto if df.name == dt.name]

        # scan common list just once [3.0]
        for (direntfrom, direntto) in common:
            nonlink = dict(follow_symlinks=False)    # [3.0] narrow is_*() results

            # 0) compare linkpaths of links in common [3.0]
            if direntfrom.is_symlink() and direntto.is_symlink():
                comparelinks(direntfrom, direntto, dirfrom, dirto, diffs)

            # 1) compare times/sizes/contents of (non-link) files in common
            elif direntfrom.is_file(**nonlink) and direntto.is_file(**nonlink):
                comparefiles(direntfrom, direntto, dirfrom, dirto, diffs, dopeek)

            # 2) compare (non-link) subdirectories in common via recursion
            elif direntfrom.is_dir(**nonlink) and direntto.is_dir(**nonlink):
                comparetrees(direntfrom.path, direntto.path,
                             diffs, uniques, mixes, dopeek, skipcruft)

            # 3) same name but not both links, files, or dirs (mixed, fifos)
            else:
                mixes.append((direntfrom.name, dirfrom, dirto))
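
# For context, an illustrative top-level call (the paths are hypothetical;
# the real driver lives in mergeall.py and also reports and resolves the
# collected differences):
#
#     diffs, mixes = [], []
#     uniques = {'from': [], 'to': []}
#     comparetrees('/MY-STUFF', '/BACKUP/MY-STUFF',
#                  diffs, uniques, mixes, dopeek=False, skipcruft=True)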