File: mergeall-products/unzipped/scandir_defunct.py
""" ================================================================================ scandir_defunct.py: Python 3.5+ os.scandir() comparison-phase optimization, no longer used [3.0]. The comparison-phase variant code here, added in 2.2, formerly was some 2X faster (or better) on Windows and Linux, but also 2X slower on Mac OS X. This meant that both variants were maintained and tested redundantly -- a major task. Near the end of the 3.0 project, the non-scandir() variant was optimized to use os.lstat() stat objects instead of os.path.*() calls, initially for new symlink support. This reduced the non-scandir() speed hit on Windows to 50%: runtimes for a 60k-file archive were 10 seconds with scandir(), and 15 seconds without. However, a final step of this recoding to avoid all extra stat calls made the non-scandir() variant as fast as -- or slightly faster than -- this scandir() variant on Windows too (at ~10 seconds). This made the code here obsolete. See ./docetc/miscnotes/3.0-mergeall-comparison-times-Windows.txt for timings. On Mac OS X, this recoding to eliminate extra stats calls only improved the non-scandir() performance. For the same 60k-file archive, runtime fell from 9 seconds for the code here; to 4.5 seconds for the initial non-scandir() variant; to just 3.8 seconds for the final non-scandir() variant. See ./docetc/miscnotes/3.0-mergeall-comparison-times.txt for timing results. These times are slightly better when mergeall is run directly outside the GUI launcher (for reasons TBD: the launcher's UTF8 stdout encoding or -u?): # scandir() version on Mac $ py3 ../mergeall.py /MY-STUFF /MY-STUFF -skipcruft -report | grep 'Phase' Phase runtime: 8.64699007209856 # non-scandir() version on Mac $ py3 ../mergeall.py /MY-STUFF /MY-STUFF -skipcruft -report | grep 'Phase' Phase runtime: 3.5263971380190924 Hence, this file's code will no longer be used, maintained, or tested, but is retained here indefinitely for reference and historical context only, and to recreate the relative speed results that led to its demise. The final lesson here seems to be that os.scandir() can speed programs on Windows and Linux (though not Mac OS X) which use many os.path.*() calls. But it is no faster, and may even be slower, than code which uses explicit os.stat()/lstat() calls and objects. Given that both the scandir() and stat schemes require similar changes or special coding, os.scandir() seems no better, and worse on Mac. Using stat objects yields code that is arguably more cluttered visually than os.path.*(), but also a huge performance win. ================================================================================ """ # # this code is exec()d instead of imported to avoid recursion # # both should be False, except for testing USETHISCODE = False # <= change to 1/True to enable comparison variant here FORCEONMAC = False # <= change to 1/True to use on Mac, despite slowness ################################################################################ # COMPARISON PHASE: analyze trees # Python 3.5 and later version => use faster os.scandir() objects [2.2] ################################################################################ #------------------------------------------------------------------------------- # Use optimized variant if scandir() available _and_ faster for comparisons: # Windows: # In mergeall, scandir is faster - 5X to 10X before 3.0 optimizations, # and 2X faster after: 7 sec vs 14 secs for a large 87G 58k-file archive. # Linux: # The scandir() variant is 2X faster on Linux post 3.0 optimizations: # 30 secs vs 60 secs for the same large archive (on a slow test machine). # Mac OS X: # [3.0] BUT the scandir() variant is 3X SLOWER on Mac OS X! - it takes # 6 secs vs 2 secs for the same 87G 58K-file archive (on a fast machine), # with the 3.0 coding optimizations described above. DON'T USE ON MACS. # # UPDATE: these results were seen before the os.lstat() recoding above, # after which the os.scandir() verison is still 2X *slower* on Macs -- 9~10 # secs vs 4~5 for an archive that grew to 65K files since the prior timings. # Using python.org's Python 3.5 pm Mac OS X 10.11, only run-time differs: # # /Admin-Mergeall/kingston-savagex256g/feb-2-17$ diff \ # noopt1--mergeall-date170202-time091326.txt \ # opt2--mergeall-date170202-time092217.txt # 0a1 # > Using Python 3.5+ os.scandir() optimized variant. # 4053c4054 # < Phase runtime: 5.286043012980372 # --- # > Phase runtime: 10.12333482701797 # # With 3.0 code, the scandir() varient is still 50% to 2X *faster* on Windows. # # [3.0] FINAL UPDATE: this scandir() variant is no longer faster on Windows, # making this code moot; see this file's main dosctring for the full verdict. # # Don't use float(sys.version[:3]) >= 3.5; scandir() can be a separate install. #------------------------------------------------------------------------------- if not hasattr(os, 'scandir'): # older Pythons (2.7+): no os.scandir try: import scandir # check for PyPI package install os.scandir = scandir.scandir # store in os, and assume same API except: pass # punt: use Py 3.4- (now all) verions # [3.0] the non-scandir() version using os.lstat() is now faster everywhere if (USETHISCODE and hasattr(os, 'scandir') # skip if not available and (not sys.platform.startswith('darwin') # [3.0] not on Mac: slower! or FORCEONMAC) ): """ ------------------------------------------------------------------------ [2.2] Custom comparison code using os.scandir() instead of os.listdir(). SEE NON-SCANDIR() VERSIONS in mergeall.py for docs not repeated here. os.scandir() provides an object API (versus name lists), that can often eliminate system calls for attributes such as type, modtime, and size. This can speed mergeall's comparison phase by a factor of 5 to 10 on Windows. This speedup is entirely due to os.scandir(), not 3.5 in general, and can shave dozens of seconds off total runtimes for larger archives (and more on slower devices). POSIX systems also run quicker, according to this Python 3.5 change's PEP, benchmark, and documentation. [UPDATE: Not so! => os.scandir() is faster on Linux too, but SLOWER on Macs - 2X~3X slower in mergeall; see above.] This speeds up the comparison phase only, but this phase always runs and can dominate runtimes in typical runs on large archives with light changes. The resolution phase was not changed, because it normally visits just a handful of file paths (changed items only), and is bound by basic file write/delete speed; system calls are minor component of its time. Note: diffall.py might benefit from os.scandir() too for path names, but it's less likely, as that script doesn't fetch modtimes or sizes, and is bound by the time required to fully read all common files. Note: scandir() is also available in a scandir module for older Pythons, including 2.7, via a PyPI package. If installed, the code here imports and uses it, and so should work with older Pythons too -- not just 3.5+. CODING TBD: factor-out differing parts to avoid some redundant code? As is, there are two versions to update and test for comparison changes. In fact there were -- for 3.0 optimizations, cruft-file skipping, and symlinks support. But: They seem too different in structure to factor well, and factoring may slow either or both versions, negating this extension's goal. And: We need both, because the optimized variant is FASTER on Windows and Linux, but SLOWER on Mac. This speed matters: a single version would penalize one or two platforms for the comparison phase, which always runs regardless of the number of changes in the trees. Tentative lesson: redundancy is sometimes warranted in the name of optimization. [3.0] FINAL STORY: in the end, the non-scandir() grew slightly quicker on Windows too, by using stat objects to avoid all os.path.*() calls. This made the scandir() 3.5+ optimized variant fully obsolete, and removed the coding and testing redundancy. Consequently, this scandir() variant has been moved out of file and kept here for reference only, and will no longer be updated or tested in the future. Good riddance! ------------------------------------------------------------------------ """ if __name__ == '__main__': # else prints twice trace(0, 'Using Python 3.5+ os.scandir() optimized variant.') # this decrufter flavor is used only here now from skipcruft import filterCruftDirentrys def comparedirs(direntsfrom, direntsto, dirfrom, dirto, uniques): """ [2.2] python 3.5+ custom version using faster os.scandir(). Doesn't use set difference here, to maintain listing order. Note: dirfrom and dirto pathnames still used by resolver phase. See mergeall.py's non-scandir() version for docs on this function. """ trace(2, 'dirs:', dirfrom, dirto) countcompare.folders += 1 uniquefrom = [df.name for df in direntsfrom if df.name not in (dt.name for dt in direntsto)] uniqueto = [dt.name for dt in direntsto if dt.name not in (df.name for df in direntsfrom)] if uniquefrom: uniques['from'].append((uniquefrom, dirfrom, dirto)) if uniqueto: uniques['to'].append((uniqueto, dirfrom, dirto)) def comparelinks(direntfrom, direntto, dirfrom, dirto, diffs): """ [3.0] python 3.5+ custom version using faster(?) os.scandir(). See mergeall.py's non-scandir() version for docs on this function. """ name = direntfrom.name # compare link path strs linkpathfrom = os.readlink(FWP(direntfrom.path)) linkpathto = os.readlink(FWP(direntto.path)) if linkpathfrom != linkpathto: diffs.append((name, dirfrom, dirto, 'linkpaths')) def comparefiles(direntfrom, direntto, dirfrom, dirto, diffs, dopeek=False, peekmax=10): """ [2.2] python 3.5+ custom version using faster(?) os.scandir(). Uses stat() objects for sizes+times, perhaps avoiding system calls. Note: dirfrom and dirto pathnames still used by resolver phase. See mergeall.py's non-scandir() version for docs on this function. """ trace(2, 'files:', direntfrom.path, direntto.path) def modtimematch(statfrom, statto, allowance=2): # [1.3] 2 seconds for FAT32 time1 = int(statfrom.st_mtime) # [2.2] os.stat_result object time2 = int(statto.st_mtime) return time2 >= (time1 - allowance) and time2 <= (time1 + allowance) countcompare.files += 1 startdiffs = len(diffs) name = direntfrom.name # same name in from and to statfrom = direntfrom.stat() # call stat() just once statto = direntto.stat() if not modtimematch(statfrom, statto): # try modtime 1st diffs.append((name, dirfrom, dirto, 'modtime')) # the easiest diff else: sizefrom = statfrom.st_size # try size next sizeto = statto.st_size # unlikely case if sizefrom != sizeto: diffs.append((name, dirfrom, dirto, 'filesize')) elif dopeek: # try start+stop bytes peeksize = min(peekmax, sizefrom // 2) # scale peek to size/2 filefrom = open(FWP(direntfrom.path), 'rb') # sizefrom == sizeto fileto = open(FWP(direntto.path), 'rb') if filefrom.read(peeksize) != fileto.read(peeksize): diffs.append((name, dirfrom, dirto, 'startbytes')) else: filefrom.seek(sizefrom - peeksize) fileto.seek(sizeto - peeksize) if filefrom.read(peeksize) != fileto.read(peeksize): diffs.append((name, dirfrom, dirto, 'stopbytes')) filefrom.close() fileto.close() return len(diffs) == startdiffs # true if did not differ, else extends 'diffs' def comparetrees(dirfrom, dirto, diffs, uniques, mixes, dopeek, skipcruft, skip=None): """ [2.2] python 3.5+ custom version using faster(?) os.scandir(). Doesn't use set intersection here, to maintain listing order. See mergeall.py's non-scandir() version for docs on this function. """ trace(2, '-' * 20) trace(1, 'comparing [%s] [%s]' % (dirfrom, dirto)) def excludeskips(direntsfrom, direntsto, skip): # [3.0] pre-skipcruft filter if not skip: return for dirents in (direntsfrom, direntsto): matches = [dirent for dirent in dirents if dirent.name == skip] if matches: assert len(matches) == 1 trace(1, 'excluding', matches[0].path) dirents.remove(matches[0]) # get dir content lists here # [2.2] os.DirEntry objects iterator direntsfrom = list(os.scandir(FWP(dirfrom))) # [2.2] need list() for multiple scans! direntsto = list(os.scandir(FWP(dirto))) # or pass bytes?--would impact much excludeskips(direntsfrom, direntsto, skip) # [3.0] filter out system metadata files and folders if skipcruft: direntsfrom = filterCruftDirentrys(direntsfrom) direntsto = filterCruftDirentrys(direntsto) # compare dir file name lists to get uniques comparedirs(direntsfrom, direntsto, dirfrom, dirto, uniques) # analyse names in common (same name and case) trace(2, 'comparing common names') common = [(df, dt) for df in direntsfrom for dt in direntsto if df.name == dt.name] # scan common list just once [3.0] for (direntfrom, direntto) in common: nonlink = dict(follow_symlinks=False) # [3.0] narrow is() results # 0) compare linkpaths of links in common [3.0] if direntfrom.is_symlink() and direntto.is_symlink(): comparelinks(direntfrom, direntto, dirfrom, dirto, diffs) # 1) compare times/sizes/contents of (non-link) files in common elif direntfrom.is_file(**nonlink) and direntto.is_file(**nonlink): comparefiles(direntfrom, direntto, dirfrom, dirto, diffs, dopeek) # 2) compare (non-link) subdirectories in common via recursion elif direntfrom.is_dir(**nonlink) and direntto.is_dir(**nonlink): comparetrees(direntfrom.path, direntto.path, diffs, uniques, mixes, dopeek, skipcruft) # 3) same name but not both links, files, or dirs (mixed, fifos) else: mixes.append((direntfrom.name, dirfrom, dirto))