#!/usr/bin/env python3 # -*- coding: utf-8 -*- # Python 3.X and 2.X are both supported # Python 3.X is strongly recommended for trees with non-ASCII Unicode filenames u""" ================================================================================ Usage: [py[thon]] diffall.py dir1 dir2 [-recent [days=90]] [-skipcruft] [-quiet] [-u] See UserGuide.html for version, license, platforms, and attribution. Recursive bytewise directory-tree comparison: report unique files that exist in only dir1 or dir2, report files (and symlinks) of the same name in dir1 and dir2 with differing contents, report instances of same name but different type in dir1 and dir2, and do the same for all subdirectories of the same names in and below dir1 and dir2. The net effect performs both a structural comparison of the dir1 and dir2 trees, and a byte-for-byte comparison of all their common files. "-recent" limits comparisons to files changed in the last N days; "-skipcruft" ignores system hidden and metadata files; "-quiet" omits Unicode-normalization messages; and "-u" unbuffers output for frozen apps/exes. See the change log below for more on command-line arguments supported. A summary of diffs appears at end of this script's output. Redirect its output to a file using ">" and search for "*DIFFER", "*UNIQUE", and "*MISSED" strings for further difference details following a run (per Sep-2016's update below). For instance, in a Unix console: $ M=Mergeall-source-package-install-path $ python3 $M/diffall.py folder1 folder2 -skipcruft > ~/diffall.txt & $ tail -f ~/diffall.txt $ grep --context=8 '*UNIQUE' ~/diffall.txt In sum, diffall compares full, byte-by-byte file content to verify that all files on two folder trees are truly the same. It does not compare file modification times, as these are not relevant to content equivalence. See mergeall.py for a quicker but shallower alternative that checks modification times and structure, but not full content, to detect file changes that merit synchronization. Tip: see docetc/Tools/post-diffall-auto-compare.py for a simple tool that helps you inspect the differences in files that diffall.py flags as DIFFERS. -------------------------------------------------------------------------------- CHANGE LOG New PP3E: limit reads to 1M for large files, catch same name=file/dir mixed type cases. New PP4E: avoid extra os.listdir() calls in dirdiff.comparedirs() by passing results here along. New March-2015, for mergeall 2.0: add "-recent [days]" limited comparisons option to compare just files changed in last days (else compares all files), plus simple stats at end of report. Also for 2.0, added explicit file.close calls, for use outside CPython; we don't care about catching exceptions here, as any kill the script, and we're just reading in any event. New Jan-2016: change incorrect "dirdiff" in usage message to "diffall". Also print total diffall runtime for speed analysis and drive comparisons. Caveat: may run quicker with os.scandir() instead of os.listdir() in Python 3.5+ (only!), but runtime is likely dominated by the exhaustive file reads here, not listings; see mergeall.py for os.scandir() alternative in action. ---- New Sep-2016..Mar-2017 [3.0]: 0) Compare, but do not follow symbolic links (symlinks). Otherwise, may compare arbitarily-large items referenced by intra-archive links > once. Coded as a pretest to avoid changing existing code, treats links to both files and dirs the same, and compares their reference-path strings. 1) Changed difference labels slightly, so users can search the report for uppercase '*UNIQUE' and '*DIFFER' (and the rare '*MISSED') to jump to differences quickly. 2) Use mergeall's extended OPEN() to support long file pathnames on Windows. UPDATE: the fixlongpaths module is now used in multiple places to support long windows paths here, not just for open(); see FWP(). 3) The new '-skipcruft' mode, also added to mergeall, skips system cruft files in both folders so they do not register as differences (and clutter the report to the point of near unusability on some platforms that rhyme with "Mac"!). See mergeall_configs.py for more on cruft metadata; the implementation of cruft skipping is shared with mergeall and cpall. 4) Recoded the diffall algorithm so that rare mixed-type differences are detected _before_ recurring into any subdirs. This way, any "*MISSED" log messages appear in the subject folder's section - not arbitrarily far ahead after all subfolders' sections. As it was, these showed up in the last subfolder's section of the report, and listed their dirs only in the summary. Note that the diffall algorithm must still use multiple loops over items, in order to report file comparisons (and now mixed-type cases) _before_ starting a new report section for subdirectories' content. This structure differs from mergeall's single-loop data builder scheme, but is deliberate. 5) The algorithm was also optimized slightly, to avoid running os.path.join() on an item more than once (though the gain is likely negligible versus file IO, and the speed tradeoff for the added list operations was not determined). 6) Further optimized later to replace os.path.join(x, y) with x + os.sep + y; join() seems complex and slow overkill in simple and known path+file cases, especially on Windows (see Python's Lib\ntpath.py). OPTIMIZATION RESULTS: The prior 2 point's optimizations had NO significant effect on diffall runtimes. See file: test/expected-output-3.0/optimizations-3.0/diffall-results.txt for typical speed test results. In sum, a diffall for an 87G SSD folder with 59K files and 3.5K folders runs in roughly 4 minutes 20 seconds - in BOTH the prior and new diffall code. The optimized version here may shave very low single-digit seconds in some runs, but this is trivial in a 4 minute task. Caveat: timing tests were run on Windows; other platforms may or may not agree. Further optimizations based on different codings or the os.scandir() alternative to os.listdir() in Python 3.5+ used by mergeall are also likely to be pointless (and os.scandir() may run _slower_ on Mac OS X). As expected, the vast majority of this script's time is spent reading files in full, not in analysis of structure. As another metric, the mergeall comparison phase for this same test folder runs in just 7.2 seconds - versus 4 minutes for a byte-for-byte diffall. The latter is clearly too IO-bound to speed further in code, which is why mergeall was developed in the first place! Given these results, the cpall script was not optimized; its runtimes are even _more_ IO-bound by the need to write files (and probably reach hours on slow drives). Faster devices seem a better bet for speeding such programs. 7) Support symlinks by comparing their linkpaths, not the items they refer to. We care only if the links differ here, not that their referents are valid. ---- New Dec-2017 [3.1]: If a "-u" command-line arg is passed to this script or its frozen app/executables (not to Python), flush output lines as they are written. This makes prints unbuffered, useful when monitoring output with a Unix "tail". ---- New in [3.2]: compared symlinks are tallied separately (not lumped in with files), and tallies are reported consistently with other tools. This also displays its version in an opening message, like other Mergeall main scripts. New in [3.2]: this now skips symlink read errors; they print error messages in the log and show up in the summary report. This is required to work around read errors for symlinks burned to BDR discs by macOS; see the issue demo at test/macos-bad-bdr-symlinks-3.2.txt. mergeall.py does the same for syncs. ---- New in [3.3]: normalize Unicode filenames for comparison, but use original filesystem names when reading content. This maps together names that are equivalent but not equal in terms of decoded code-point representation. For example, the filename 'Liñux.png' has two different Unicode formats. See fixunicodedups.py's top-of-file docstring for more background. To omit messages printed for filenames not in NFC (composed) format, use "-quiet". New in [3.3]: docetc/Tools/post-diffall-auto-compare.py diffs DIFFERS files. ================================================================================ """ # CODE from __future__ import print_function # added: Py 2.X compatibility # [3.0] for frozen app/exes, fix module+resource visibility (sys.path) import fixfrozenpaths import os, time, sys, dirdiff # dirdiff now has intersect [3.3] from sys import argv # [3.0] filter out metadata files from skipcruft import filterCruftNames # [3.0] fix too-long paths on Windows from fixlongpaths import FWP # [3.1] autofush print lines if "-u" from autoflush import setFlushedOuput # [3.2] import and display version number from __version__ import VERSION # [3.3] normalize Unicode for filename matching from fixunicodedups import normalizeUnicodeFilenames # [3.3] normalize Unicode for link-path matching from fixunicodedups import normalizeUnicode blocksize = 1024 * 1024 # up to 1M per read numdir = numfile = numlink = numskip = 0 # [2.0] a few stats, [3.2] links # [jan16] python/platform-specific current time (secs) gettime = (time.perf_counter if hasattr(time, 'perf_counter') else (time.clock if sys.platform.startswith('win') else time.time)) def recentlychanged(path1, path2, numdays=90): """ --------------------------------------------------------------------------- [mergeall 2.0] Return True if either path1 or path2 was modified in last "days" days (default 90, if not passed, or not listed in the command-line). This is really days-worth-of-seconds, but close enough. In large achives, most files will not have been changed recently, so this test can speed limited comparisons. Library calls used here: --------------------------------------------------------------------------- >>> t1 = os.path.getmtime('python') >>> t2 = time.time() >>> t1, t2 (1390862766.9136598, 1426117651.752781) >>> time.ctime(t1), time.ctime(t2) ('Mon Jan 27 14:46:06 2014', 'Wed Mar 11 15:47:31 2015') --------------------------------------------------------------------------- """ modtime1 = os.path.getmtime(FWP(path1)) # in seconds since epoch modtime2 = os.path.getmtime(FWP(path2)) # float in 3.X, int in 2.X? nowtime = time.time() secsback = numdays * (24 * 60 * 60) return (modtime1 > nowtime - secsback) or (modtime2 > nowtime - secsback) def comparetrees(dir1, dir2, diffs, recent=False, numdays=0, skipcruft=False, quiet=False, verbose=False): """ --------------------------------------------------------------------------- Compare all files, symlinks, and subdirectories in two directory trees. Uses binary files to prevent Unicode decoding and endline transforms, as trees might contain arbitrary binary files as well as arbitrary text. May need bytes listdir arg for undecodable filenames on some platforms. A procedure: prints as it goes and returns diffs summary-message list. [2.0] Compare only files changed in last "numdays" days if "recent" [3.0] Use OPEN to support long file pathnames on Windows (later: FWP) [3.0] Ignore system metadata files in dir1 and dir2 if skipcruft is True [3.0] Detect and report mixed-type diffs before processing any subdirs [3.0] Optimized to avoid calling os.path.join() more than once per item [3.0] Optimized to use +os.sep+ instead of likely slower os.path.join() [3.0] Handle symlinks explicitly by comparing their link paths directly [3.0] Fix long paths on Windows (only) with FWP(), but don't mod msgs [3.2] Skip BDR symlink read errors caused by burns on macOS (see above) [3.3] Normalize filenames for comparison, but use originals to access [3.3] Add -quiet to omit normalization messages (there may be many) --------------------------------------------------------------------------- """ global numdir, numfile, numlink, numskip # [2.0] [3.2] # compare file name lists (new report section) numdir += 1 print('-' * 20) names1 = os.listdir(FWP(dir1)) names2 = os.listdir(FWP(dir2)) if skipcruft: # [3.0] ignore metadata files names1 = filterCruftNames(names1) names2 = filterCruftNames(names2) # [3.3] map Unicode variants to common form tracer = print if not quiet else lambda *args: None names1, origs1 = normalizeUnicodeFilenames(names1, dir1, tracer) names2, origs2 = normalizeUnicodeFilenames(names2, dir2, tracer) # detect and report unique items if not dirdiff.comparedirs(dir1, dir2, names1, names2): diffs.append('items UNIQUE at [%s] - [%s]' % (dir1, dir2)) # get names common to both dirs print('Comparing contents') common = dirdiff.intersect(names1, names2) #---------------------------------------------------------------------- # Compare contents of files (and links) in common. # Report before any subdirs, and try this most-common case first. #---------------------------------------------------------------------- notfiles = [] for name in common: orig1 = origs1.get(name, name) # [3.3] use unnormalized names orig2 = origs2.get(name, name) # for paths and file access path1 = dir1 + os.sep + orig1 # [3.0] avoid os.path.join path2 = dir2 + os.sep + orig2 if os.path.islink(FWP(path1)) or os.path.islink(FWP(path2)): # [3.0] handle symlinks to files and dirs specially here if os.path.islink(FWP(path1)) and os.path.islink(FWP(path2)): # both are links: read numlink += 1 # [3.2] links sept try: # [3.2] skip bad links link1 = os.readlink(FWP(path1)) # str path name link2 = os.readlink(FWP(path2)) except: diffs.append('links UNREADABLE at [%s] - [%s]' % (path1, path2)) print('*UNREADABLE:', name) else: link1 = normalizeUnicode(link1) # [3.3] may differ link2 = normalizeUnicode(link2) if link1 == link2: if verbose: print(name, 'matches') else: diffs.append('links DIFFER at [%s] - [%s]' % (path1, path2)) print('*DIFFER:', name) else: # only one link: mixed diffs.append('items MISSED at [%s] - [%s]: [%s]' % (dir1, dir2, name)) print('*MISSED:', name) elif os.path.isfile(FWP(path1)) and os.path.isfile(FWP(path2)): # file+file: skip full reads if not recently changed if recent and (not recentlychanged(path1, path2, numdays)): # [2.0] numskip += 1 # [2.0] if verbose: print(name, 'skipped') else: numfile += 1 file1 = open(FWP(path1), 'rb') # [3.0]: long paths file2 = open(FWP(path2), 'rb') while True: bytes1 = file1.read(blocksize) bytes2 = file2.read(blocksize) if (not bytes1) and (not bytes2): if verbose: print(name, 'matches') break if bytes1 != bytes2: diffs.append('files DIFFER at [%s] - [%s]' % (path1, path2)) print('*DIFFER:', name) break file1.close() file2.close() # [2.0] else: # pass others to next phase (non-link dirs, mixes, fifos) notfiles.append((name, path1, path2)) #---------------------------------------------------------------------- # Detect same name but not both files or dirs (rare). # [3.0] Report before subdirs, and use cached paths for speed #---------------------------------------------------------------------- notmixed = [] for (name, path1, path2) in notfiles: if not (os.path.isdir(FWP(path1)) and os.path.isdir(FWP(path2))): diffs.append('items MISSED at [%s] - [%s]: [%s]' % (dir1, dir2, name)) print('*MISSED:', name) else: notmixed.append((path1, path2)) #---------------------------------------------------------------------- # Recur to compare non-link directories in common (the rest). # Each subdir starts a new report section for its own content. #---------------------------------------------------------------------- for (path1, path2) in notmixed: comparetrees(path1, path2, diffs, recent, numdays, skipcruft, quiet, verbose) def getargs(): """ --------------------------------------------------------------------------- [2.0] Args for command-line mode --------------------------------------------------------------------------- """ try: extramsg = None recent, numdays = False, 90 # defaults skipcruft = unbuffered = quiet = False dir1, dir2 = sys.argv[1:3] # first 2 command-line args if not os.path.isdir(dir1): # exists and is a dir [2.0] [3.0] extramsg = 'dir1 is invalid' assert False if not os.path.isdir(dir2): # exists and is a dir [2.0] [3.0] extramsg = 'dir2 is invalid' # was: assert os.path.isdir(dir2) assert False if '-skipcruft' in sys.argv: skipcruft = True # [3.0] skip metadata files (any order) sys.argv.remove('-skipcruft') if '-u' in sys.argv: unbuffered = True # [3.1] make output unbuffered in apps sys.argv.remove('-u') if '-quiet' in sys.argv: quiet = True # [3.3] omit Unicode normalization msgs sys.argv.remove('-quiet') if len(argv) > 3: assert argv[3] == '-recent' # [2.0] last N days only recent = True if len(argv) > 4: numdays = int(argv[4]) # listed else 90 except: print('Usage: ' '[py[thon]] diffall.py dir1 dir2 ' '[-recent [days=90]] [-skipcruft] [-quiet] [-u]') if extramsg: print('Additional details:', extramsg) sys.exit(1) else: return (dir1, dir2, recent, numdays, skipcruft, unbuffered, quiet) if __name__ == '__main__': """ --------------------------------------------------------------------------- Stand-alone/command-line mode. Diffall isn't very useful otherwise, as it prints instead of returning, but its output might be parsed. See also mergeall's variation on the comparisons run here, whic builds explicit results data-structures. --------------------------------------------------------------------------- """ print('diffall %.1f starting' % VERSION) # [3.2] dir1, dir2, recent, numdays, skipcruft, unbuffered, quiet = getargs() # [3.1] force unbuffered output (for apps/exes)? if unbuffered: setFlushedOuput() # walk, compare, change diffs in-place diffs = [] starttime = gettime() comparetrees(dir1, dir2, diffs, recent, numdays, skipcruft, quiet, verbose=True) tottime = gettime() - starttime # report time [jan6], stats [2.0], links [3.2] hours = tottime // (60*60); tottime -= hours * (60*60) minutes = tottime // 60; tottime -= minutes * 60 print('=' * 80) print('Runtime hrs:mins:secs = %.0f:%.0f:%.2f' % (hours, minutes, tottime)) print('Files checked: %d, Folders checked: %d, Symlinks checked: %d, Files skipped: %d' % (numfile, numdir, numlink, numskip)) if skipcruft: print('System metadata (cruft) files were skipped') # report collected diffs list if not diffs: print('No diffs found.') else: print('Diffs found:', len(diffs)) for diff in diffs: print('-', diff) print('End of report.')