File: mergeall-products/unzipped/mergeall.py-devdocs.txt

Additional development docs for mergeall.py

[3.3] This content, some of which is largely historical today, originally 
appeared at the end of mergeall.py's top-of-file docstring.  It was moved
here to make it easier to scroll to the start of that file's code.
See also docetc/MoreDocs/Revisions.html for parallel dev docs, and
UserGuide.html for higher-level usage documentation.

--------------------------------------------------------------------------------

DEVELOPMENT

NEW IN [3.3]: filenames are now run through Unicode normalization prior to
name comparisons, to avoid rare but possible skew when an equivalent name 
differs in FROM and TO in its decoded Unicode code-point representation
(e.g., 'Liñux.png').  Unnormalized/original names are used instead for all 
filesystem access.  See the new fixunicodedups.py, as well as "[3.3]" 
labels and mergeall.py's normalizeUnicodeFilenames() usage for full details.
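
The idea can be illustrated with a minimal sketch.  This is not mergeall's
actual normalizeUnicodeFilenames() code: the helper name normalized_name_map()
and the choice of the NFC form are assumptions made only for this example.
Normalized names serve as comparison keys, while the original names are kept
for filesystem access.

```python
import unicodedata

def normalized_name_map(names, form='NFC'):
    """Map each normalized name to its original (unnormalized) name.

    Hypothetical helper: the name and the NFC default are assumptions.
    """
    return {unicodedata.normalize(form, name): name for name in names}

# The same visible name, in composed and decomposed representations
from_names = ['Li\u00f1ux.png']      # n-tilde as a single code point
to_names   = ['Lin\u0303ux.png']     # 'n' followed by a combining tilde

print(from_names[0] == to_names[0])  # the raw strings are not equal

from_map = normalized_name_map(from_names)
to_map   = normalized_name_map(to_names)
common   = from_map.keys() & to_map.keys()   # compare normalized keys only

for key in common:
    # the original names are what would be passed to filesystem calls
    print(ascii(from_map[key]), 'matches', ascii(to_map[key]))
```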

NEW IN [3.2]: deltas.py provides an alternative run mode, which saves 
FROM changes without applying them to TO; see that script for details.
See docetc/MoreDocs/Revisions.html#version32 for full 3.2 notes.

NEW IN [3.1]: modtimes are now propagated for folders too, where supported;
see cpall.copytree() for the implementation and docs of this change.  3.1
also flushes stdout in Linux executables, as formerly done on Windows.
Also see docetc/MoreDocs/Revisions.html#version31 for more 3.1 details.

NEW IN [3.0]: cruft-file handling, symlink support, and Windows long paths.
If "-skipcruft" is passed, mergeall will skip platform-specific cruft (metadata)
files and dirs defined by patterns in mergeall_configs.py, in both the FROM and
TO trees.  Hence, they will not be reported, and in update modes will never be
copied to, deleted from, or replaced in the TO tree.  Cruft is also skipped by
cpall's copytree(), used here for bulk copies of FROM folders to TO (but not for
backup copies in backup.py), and diffall's content-based reporting.

When mergeall's "-skipcruft" is used, FROM and TO will be the same post merge,
except for their unique cruft files.  Platform-specific cruft is retained on the
creating platform, but not propagated to other copies and computers.  This is
one way to deal with hidden files generated by some operating systems (notably,
Macs).  The related script "nuke-cruft-files.py" here provides an alternative
brute-force and more manual solution.  See that script, mergeall_configs.py,
and UserGuide.html's usage pointers for more details.
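
As a rough sketch of pattern-based cruft skipping, a directory listing can be
filtered before it is compared.  The patterns and function names below are
hypothetical, and the use of fnmatch is an assumption; the real patterns and
matching rules are defined in mergeall_configs.py.

```python
import fnmatch

# Hypothetical patterns for illustration only; see mergeall_configs.py
CRUFT_PATTERNS = ['.DS_Store', '._*', 'Thumbs.db', 'desktop.ini']

def is_cruft(name, patterns=CRUFT_PATTERNS):
    """True if a file or dir name matches any cruft pattern."""
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)

def filter_cruft(names, skipcruft=True):
    """Drop cruft names from a directory listing when skipping is enabled."""
    if not skipcruft:
        return list(names)
    return [name for name in names if not is_cruft(name)]

listing = ['notes.txt', '.DS_Store', '._notes.txt', 'Thumbs.db', 'photos']
print(filter_cruft(listing))
```

Filtering both the FROM and TO listings this way means cruft is never seen as
a difference, so it is neither reported nor copied, deleted, or replaced.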

Version 3.0 also has explicit support for synchronizing symlinks on both Unix
and Windows, and always skips exotic items like FIFOs.  In addition, 3.0 
supports long pathnames on Windows (via FWP() calls).  See UserGuide.html's
3.0 coverage for more on its new support, and version and platform requirements.

The rest of this section lists original caveats and to-be-decided issues that
have all been addressed over time; it's mostly historical today, but covers 
some subtle issues underlying content syncs.

----

CAVEAT 1: file timestamp dependence

As is, this script relies on the integrity of file modification times (a.k.a.
"modtimes").  It's not impossible that these may be skewed by some devices to
which a backup is written.  If this occurs, the worst this can do is cause a
file to be spuriously classified as a difference, and harmlessly written over
its identical copy in dirto.

If this is problematic, though, edit the comparefiles() function's modtimes
logic.  In the worst cases, this function could be changed to abandon file
modtimes, and use some sort of checksums, or read/compare full file contents
instead (see the file matching logic in diffall.comparetrees() for pointers).
Even full reads would likely still be quicker than a full tree copy, though,
as most devices today read much faster than they write. 
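
A checksum-based comparison of the kind suggested above might look like the
following sketch.  The function names and the choice of SHA-256 are
assumptions for illustration; this is not code from comparefiles() or
diffall.comparetrees().

```python
import hashlib
import os

def file_digest(path, blocksize=1024 * 1024):
    """Hash a file's content in blocks, to avoid loading it all at once."""
    hasher = hashlib.sha256()
    with open(path, 'rb') as file:
        for block in iter(lambda: file.read(blocksize), b''):
            hasher.update(block)
    return hasher.hexdigest()

def same_content(path_from, path_to):
    """Compare two files by size first, then by checksum; modtimes are unused."""
    if os.path.getsize(path_from) != os.path.getsize(path_to):
        return False
    return file_digest(path_from) == file_digest(path_to)
```

This reads both files in full, so it is slower than a modtime test, but it
does not depend on timestamp integrity.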

  UPDATE: FAT 2-second issue

  Version 1.3 was patched to allow for the FAT32 file system's 2-second
  file modification time granularity, else files stored on the more accurate
  NTFS file system always mismatch by modtimes and are classified as diffs.
  
  Prior to the fix, the same file on NTFS and FAT32 could register a bogus
  mismatch: NTFS on hard drives gives fractional second accuracy on Windows,
  but FAT32 on USB flashdrives always truncates file modtime fractional parts,
  and usually rounds up to the next multiple of 1 or 2 -- even immediately
  after a mergeall or drag-and-drop copy from NTFS (c:) to FAT32 (e:):

  >>> os.path.getmtime(r'c:\MY-STUFF\__more__\Memos\tablet-issues.txt')
  1393450444.1208856
  >>> os.path.getmtime(r'e:\MY-STUFF\__more__\Memos\tablet-issues.txt')
  1393450446.0
  
  >>> os.path.getmtime(r'c:\MY-STUFF\__more__\calendar\trips.ics')
  1393428663.0284016
  >>> os.path.getmtime(r'e:\MY-STUFF\__more__\calendar\trips.ics')
  1393428664.0

  It is possible to work around this by redundantly copying from FAT32 back
  to NTFS (after copying from NTFS to FAT32), rerunning with swapped to/from
  roles just to make the timestamps the same, but that was inconvenient.
  The fix allows for a match if times are within +/- 2 seconds.  This may miss
  some very unusual diffs, but the alternatives (comparing without modtimes,
  by limited reads or checksums) would be slower.
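
  A minimal sketch of such a tolerant comparison follows, using the modtimes
  from the session above.  The function name, the truncation to whole seconds
  before comparing, and the default allowance are assumptions for this
  example, not the exact comparefiles() logic.

```python
def modtimes_match(time_from, time_to, allowance=2):
    """True if two modtimes are within `allowance` whole seconds of each other."""
    return abs(int(time_from) - int(time_to)) <= allowance

# NTFS versus FAT32 modtimes of the same files, from the session above
print(modtimes_match(1393450444.1208856, 1393450446.0))
print(modtimes_match(1393428663.0284016, 1393428664.0))
```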

  UPDATE: FAT DST rollover issue
  
  Version 1.4's Revisions.html notes describe an issue regarding FAT32 filesystem
  modtimes being off by 1 hour (versus NTFS and exFAT) after daylight saving
  time (DST) has been automatically adjusted.  This is a well-known Windows
  issue with no easy fix; the best solution seems to be to disable your DST
  auto-adjust on Windows and adjust your time/clock manually when needed;
  the next best solutions may be to either allow timestamp-based backup tools
  like mergeall to rewrite your full archive twice a year (not ideal, but rare),
  or keep two FAT archive copies--one used when DST is in effect, and one used
  when it is not (which also automatically promotes long-term backups).  See
  UserGuide.html for other workaround ideas.  NEW: see also the workaround script
  fix-fat-dst-modtimes.py, added in version 2.0.

      POSTSCRIPT: Per the new User Guide in 3.0, the best solution to the DST
      rollover issue now seems to be formatting all external drives using exFAT,
      which is portable to Windows and Mac OS (and Linux with an install), and
      uses more modern UTC times just like NTFS, HFS+, and others.

  UPDATE: Excel (and others?) may change content but not modtime

  It's been observed that Excel (and possibly others?) can sometimes change
  file content bytes without updating file size or modification time.  This
  happens on Windows, and occurs even if a file is simply opened and closed.
  The result is that diffall.py's full content bytes comparison detects and
  reports the difference, but mergeall.py's time/size metadata (and optional
  limited 'peek' bytes) comparison does not.  This seems to happen only for
  metadata that's almost certainly harmless and unimportant, but there is no
  known fix for mergeall, short of manually replacing files that report diffs
  in a byte-wise diffall.py run.  For an illustration of this in Python, see:
      examples\issues\excel-covert-changes-issue.txt

  [3.1] Also note that other programs which change file modtimes may also 
  subvert file timestamp-based programs like mergeall.  In particular, any
  program that copies over prior modtimes after changing content will make 
  changed files register as unchanged -- and prevent mergeall propagation.  
  This was the case with an initial design in PyPhoto's viewer_thumbs.py, 
  but was fixed by a later design that stored original modtimes separately 
  from thumb files.  Other cases are unavoidably outside mergeall's scope.
  For the PyPhoto use case, see:
      http://learning-python.com/pygadgets.html
  
  UPDATE: Linux/Windows NTFS cross-platform merge DST issue
  
  On Linux, when comparing trees on mounted Windows NTFS volumes to trees
  on Linux volumes, there may be an issue related or similar to the FAT
  DST rollover issue described above, which skews some NTFS mod times by an
  hour (and hence generates spurious mergeall differences).  The best solution
  so far is to simply synch once to remove the differences.  See release 1.5's
  second "Linux Usage Note" in Revisions.html for more details (alas, a former 
  demo of this issue from 2014 was culled along the way, with its examples/).
      
----

CAVEAT 2: one-way versus two-way synchronization

This script is a one-way merge and prune.  It assumes there is just 1 "golden"
base version of a tree that all other copies are made to mirror, either by merging
changes to and from a common base copy, or by merging to other copies directly.
Changes in a local copy may be uploaded to and from the base copy(s) quickly,
but all bets are off if the same file is changed in multiple trees before
synchronizing them with the base.

If this won't suffice, run with just -report to see differences and resolve
manually, or run without -auto to resolve items on a case-by-case basis by
interactive input.  A more peer-level and bidirectional automatic union merge
mode would fail to allow for renames and deletes, and multiple edited copies
probably encroach on the domain of full source control systems in general.

  UPDATE: It is possible to use this one-way merge to perform peer-level and
  two-way synchronizations after all, by simply running _twice_ in interactive
  and selective mode, with swapped to/from roles -- choose one tree's diffs
  on the first run, and the other's on the second.  For more details on this
  process, see file Whitepaper.html in this system's docetc/MoreDocs folder.

----

CAVEAT 3: 2.X compatibility and file modtimes

This was coded to also work on Python 2.X, but requires os.stat_float_times(False)
to work on 2.X.  This call forces file modtimes to be truncated integers instead
of floats (losing second fractions, irrelevant here).  This works around a bug
in Python 2.7's shutil.copystat(), which copies file modtimes with a different
precision than that in the original file (a low exponent digit differs):

  >>> import os
  >>> os.path.getmtime(r'test\test1\f1.txt')  # original file
  1391819917.6508296
  >>> os.path.getmtime(r'test\test2\f1.txt')  # copy, modtime differs if made in 2.X
  1391819917.650829                           # (but same as original if made in 3.X)

This in turn makes all future comparisons register a difference here.  By
truncating modtime return values to ints on both 3.X and 2.X, the code here
is portable and works on both 3.X and 2.X as is.

The os.stat_float_times() call is today marked as "deprecated in Py 3.3"; if it's
ever removed from 3.X, the call here will automatically be replaced with a manual
truncation of os.path.getmtime() results.  This may be a simpler solution in
any event, and avoids storing truncated modtimes in file copies made in 2.X (only).

  UPDATE: note that truncating fractional parts of mod times is not enough to
  address the 2-second granularity of the FAT32 file system, described earlier,
  even if the truncated times are pushed out to disk (they seem to be in 2.X).

  UPDATE: the [3.0] os.stat rewrite made this partly moot - getmtime is now 
  unused, but os.stat result modtimes are similarly truncated for 2.X's issue.
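
  The truncation itself is simple, as this sketch shows.  The helper name
  int_modtime() is hypothetical; the two float values are the ones from the
  session above.

```python
import os

def int_modtime(path):
    """Return a path's modtime from os.stat, truncated to whole seconds."""
    return int(os.stat(path).st_mtime)

# The two modtimes from the session above differ as floats...
original, copy_made_in_2x = 1391819917.6508296, 1391819917.650829
print(original == copy_made_in_2x)

# ...but compare equal once truncated to integers
print(int(original) == int(copy_made_in_2x))
```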

----

CAVEAT 4: directory removals may fail on Windows due to pending deletes [2.0]

On Windows, deletes may sometimes not be finalized immediately -- they are left
still pending after the delete call returns (perhaps due to other activities).
This is lethal to shutil.rmtree, because directories cannot be removed until
after all their contents are removed.  Version 2.0 adds a workaround that waits
temporarily, retrying shutil.rmtree's os.rmdir directory removal calls that fail.
The operation can still fail, however, leaving log messages, and a difference to
be resolved on the next run: harmless, but less than ideal.  This is very rare
(and may warrant additional research); see Revisions.html for more details.
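
The retry idea can be sketched as follows.  This is not the 2.0 workaround's
actual code: the function name, the retry count, and the wait time are all
assumptions for illustration.

```python
import os
import time

def rmdir_with_retries(path, retries=5, wait=0.1):
    """Try os.rmdir, pausing and retrying if it fails.

    Returns True on success, or False if every attempt failed, in which
    case the item remains as a difference for the next run.
    """
    for attempt in range(retries):
        try:
            os.rmdir(path)
            return True
        except OSError:
            time.sleep(wait)    # give pending deletes time to be finalized
    return False
```

Returning False instead of raising lets the caller log the failure and
continue with the rest of the tree.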

  UPDATE: [3.0] experimented with - but did not use - code that extends the
  shutil.rmtree error handler to first try to correct read-only items and
  rerun the failed operation, before trying the preceding workaround.
  Read-only failures seem an oddly common issue on Windows, but permissions
  should be changed by users only.  See backup.py for this disabled code.

----

TBD 1: symlinks? [RESOLVED]

This code may need some honing on platforms with symlinks and other esoteric
filesystem entries.  As is, these may be skipped in both trees: uniques and mixes
both process only files and dirs (per Python's libs), and report other types skipped.
Skipped and unreadable items don't terminate the script, but could return as
unresolved differences in future runs.  However, this depends on the semantics
of Python's os.path.isfile()/isdir() results, which may follow symlinks on some
platforms (update: both return _True_ for Unix symlinks).  See Python's manuals and
test on your machine; this script has been used on only Windows to date.

  UPDATE: per Revisions.html's release 1.5 notes, the GUI/console launchers and
  main script are now known to run well on Linux for basic file/directory trees,
  though further testing of more exotic file types is still pending.

  UPDATE: version [3.0] finally resolved this point as part of its Mac OS X port.
  For Unix symbolic links to both files and dirs, mergeall now always copies the
  link itself, instead of following it (i.e., it copies the link's path, not the
  item it refers to).  Otherwise, archives with intra-archive links will wind up
  with multiple copies of the linked data for both normal copies and backups.
  This policy assumes symlinks are both relative and intra-archive, else they may
  not work on a different machine.

  The symlinks extension was coded as pretests to minimize impacts to existing code,
  and relies heavily and implicitly on the fact that cpall.{copyfile(), copytree()}
  were also augmented to check for and copy links first, before copying actual
  items instead.  copyfile(), for example, handles both links and real files. 
  
  As part of this extension, os.path.is*() tests in the 3.4- comparison phase
  version were replaced with (sadly cryptic) os.lstat() and stat module calls
  that do not trigger multiple stat system calls, and do not return True when
  testing if a link is a file or dir (os.path does both).  The os.scandir()
  results in the 3.5+ version work like os.lstat() if follow_symlinks=False,
  though they were eventually dropped as not faster: see scandir_defunct.py.
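
  A sketch of this style of test, and of copying a link itself, follows.
  The function names are hypothetical and this is not the code in
  mergeall.py or cpall.py; it only illustrates the os.lstat() and stat
  module calls described above.

```python
import os
import stat

def classify(path):
    """Classify a path with one lstat call, which does not follow links."""
    mode = os.lstat(path).st_mode
    if stat.S_ISLNK(mode):
        return 'link'
    if stat.S_ISREG(mode):
        return 'file'
    if stat.S_ISDIR(mode):
        return 'dir'
    return 'other'              # FIFOs and other exotic items

def copy_link(path_from, path_to):
    """Recreate the link itself at path_to, instead of copying its referent.

    On Windows, os.symlink may also need target_is_directory=True for links
    to dirs, plus suitable permissions; that is omitted in this sketch.
    """
    target = os.readlink(path_from)     # the link's stored path
    os.symlink(target, path_to)
```

  Unlike os.path.isfile(), classify() reports a link to a file as 'link',
  so links can be handled before regular files and dirs.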
  
  Windows symlinks work with this code too, but require administrator permission
  and the portability of symlink paths between Windows and Unix is poor at best.
  Also note that FIFO files are False for _both_ isfile() and isdir() (and similar
  os.lstat/scandir tools), so they won't be copied here unintentionally.  For more
  background details, see these session logs:
  
      docetc/miscnotes/demo-3.0-symlinks-unix.txt
      docetc/miscnotes/demo-3.0-symlinks-windows.txt

  Symlinks generate log messages that start with "propagating" when being both
  copied and backed up to TO, because they are a rare special case that merits
  highlighting in logs, and may require special permission/handling on Windows.
  
  NOTE: mergeall always propagates invalid links (to nonexistent or non-file/dir
  items) because such links may have legitimate use cases or be valid elsewhere.
  This policy is mirrored in cpall and ziptools, and is irrelevant in diffall.
  Your links are your business and asset: mergeall won't silently discard them.
  mergeall discards only items that are impossible to propagate (e.g., FIFOs).

----

TBD 2: remaining Unicode issues? [ADDRESSED]

This script may need to address Unicode filenames on some platforms, perhaps by using
already-encoded bytes filenames in os.listdir().

  UPDATE: in version 1.2, encoding of streams in the processes spawned by the
  GUI and console launchers are forced to agree with subprocess.Popen decoding,
  by setting the inherited PYTHONIOENCODING shell variable; this handles filenames
  in those streams, but does not address all filename contexts.

  UPDATE: in version 1.4, this was patched again to force the mergeall subproc
  to print in UTF8, use binary-mode stream reads for Popen, and manually decode
  per UTF8 in launchers after the read; this allows both mergeall and Popen to
  handle Unicode filenames in messages.
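
  The scheme described in this update can be sketched as below.  The
  function name run_and_decode() is hypothetical, and a trivial child
  process stands in for mergeall.py; the launchers' real code differs.

```python
import os
import subprocess
import sys

def run_and_decode(args):
    """Spawn a Python child that prints UTF-8, and decode its output manually."""
    env = dict(os.environ, PYTHONIOENCODING='utf8')    # force child's encoding
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, env=env)
    raw, _ = proc.communicate()                        # bytes: binary-mode read
    return raw.decode('utf8')                          # decode after the read

# A stand-in child that prints a non-ASCII filename
child = [sys.executable, '-c', "print('Li\\u00f1ux.png')"]
print(ascii(run_and_decode(child)))
```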

  UPDATE: in version 1.6, the GUI launcher was patched for rare 2.X decoding errors
  for non-ASCII characters in filenames.  See the GUI launcher's code file for details.
  Note that this patch applies only to the GUI launcher's display: PYTHONIOENCODING 
  must still be set manually in your system shell when running script mergeall.py 
  directly from a command line, if it may ever process and thus print non-ASCII 
  filenames, especially in 3.X.

  UPDATE: for version 1.7, Revisions.html includes a Usage Note about different folder
  names being treated the same by Windows if they are the same after Unicode
  accents are dropped.  See Revisions.html's version history for details/workaround.
  =>
    REUPDATE: release 1.7.1 updated this note to clarify that this problem occurs
    only on FAT32 filesystems (of the sort used by USB flash drives), and only if
    a non-accented name is copied before an accented and otherwise equivalent name.
    No automatic workaround is yet known, but this is a very rare and unusual event;
    manually merge folders or manually copy in the desired order if this occurs.

  UPDATE: in version 3.0, stdout text is forced to ASCII when mergeall is running
  as a PyInstaller executable on Windows ONLY.  This is a workaround to a likely
  PyInstaller bug, and impacts message display only (adding quotes and escapes).
  
----

TBD 3: auto backups? [RESOLVED]

An auto-backup copy feature is half-coded here, but was not implemented (it's not
clear if this is desirable, and not clear how/when to dispose of the backups).
Backup your trees manually first if you don't trust or want this script's results.

  UPDATE: version 2.0 adds automatic backup of changed items, via the mergeall
  "-backup" argument, and corresponding widgets and prompts in the GUI and console
  launchers, respectively.  Changed items include files and folders replaced or
  deleted in-place.  New items copied to TO are not backed up, as this is not a
  destructive change.  Backups are kept in per-run folders in a top-level __bkp__
  folder, and pruned automatically.  This makes mergeall generally safer: if needed,
  files may be restored from any of the latest mergeall run backups in the __bkp__
  of any archive copy.  See UserGuide.html, Revisions.html, backup.py, and
  Whitepaper.html for more details.
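
  As a rough illustration of the per-run backup layout, an item about to be
  replaced or deleted in TO could first be copied under __bkp__.  The
  function name, its arguments, and the run-folder label are assumptions;
  see backup.py for the actual implementation, including pruning.

```python
import os
import shutil

def backup_item(root_to, relpath, run_label):
    """Copy root_to/relpath to root_to/__bkp__/run_label/relpath.

    run_label names this run's backup folder (a hypothetical scheme).
    """
    source = os.path.join(root_to, relpath)
    backup = os.path.join(root_to, '__bkp__', run_label, relpath)
    os.makedirs(os.path.dirname(backup), exist_ok=True)
    if os.path.isdir(source) and not os.path.islink(source):
        shutil.copytree(source, backup, symlinks=True)      # whole folder
    else:
        shutil.copy2(source, backup, follow_symlinks=False) # file, with modtime
    return backup
```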

  UPDATE: version 2.1 further extended this model to support automatic rollback of
  all changes made by a prior run with backups enabled, including new items added.
  It restores replacements and removals, and removes additions.  A new file
  __bkp__/__added__.txt logs additions.  Rollbacks can be invoked with either
  the new "-restore" command-line argument, or the new "rollback.py" script
  (which disables backups during restores).  See the same sources for more details.

----

TBD 4: drop the copystat() hack? [RESOLVED]

The cpall.copyfile()/copytree() examples from PP4E were extended here to call
shutil.copystat() to copy modtimes too, but this was done in an unusual way
(a.k.a. "monkeypatching").  This works, and reuses book examples intact, but
copyfile() could be changed in-place to do this as an option.

  UPDATE: Done -- version 2.0 changed cpall.copyfile in-place to call copystat
  by default; the original code is retained in quotes as an example (and lesson).

----

TBD 5: counters? [RESOLVED]

Some counts/statistics may be useful additions to the report.

  UPDATE: Done -- version 2.0 adds counts for both comparison and
  resolution phases, and displays at the run's end.  [Update: 2.2
  now also displays runtimes for each mergeall phase.]

----

TBD 6: support long pathnames on Windows? [RESOLVED]

Now uses fixlongpaths.py's FWP() in all Python file-tool calls, to
prefix long paths on Windows with '\\?\'.  See that module's docs.
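
A simplified sketch of this kind of prefixing appears below.  It is not
FWP()'s actual code: the function name, the on_windows parameter, and the
handling of relative and network paths are assumptions.  It uses ntpath so
that the Windows path logic can be exercised on any platform.

```python
import ntpath
import sys

def long_path_prefix(path, on_windows=sys.platform.startswith('win')):
    """Prefix an absolute Windows path with '\\\\?\\' to lift length limits."""
    if not on_windows:
        return path                          # no change on other platforms
    if path.startswith('\\\\?\\'):
        return path                          # already prefixed
    if not ntpath.isabs(path):
        return path                          # assumption: leave relative paths
    path = ntpath.normpath(path)             # also converts '/' to '\\'
    if path.startswith('\\\\'):
        return '\\\\?\\UNC' + path[1:]       # network path: \\?\UNC\server\...
    return '\\\\?\\' + path

print(long_path_prefix('C:\\MY-STUFF\\Memos\\notes.txt', on_windows=True))
```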

--------------------------------------------------------------------------------

PSEUDOCODE (original design):

For differences (by modtime, size, or limited content tests),
reports only if "-report"; else at each common directory in
the two directory trees:

  For differing same-named files:
      if -auto, copies dirfrom file to dirto
      else asks if should use dirto or dirfrom version, or ignore
          if use dirto,   takes no action
          if use dirfrom, copies dirfrom file to dirto 

  For unique files in dirfrom:
      if -auto, copies dirfrom file to dirto
      else asks if should do auto action, else ignore
  For unique files in dirto:
      if -auto, deletes dirto file
      else asks if should do auto action, else ignore
      
  For unique dirs in dirfrom:
      if -auto, copies dirfrom tree to dirto
      else asks if should do auto action, else ignore
  For unique dirs in dirto:
      if -auto, deletes dirto tree
      else asks if should do auto action, else ignore

  For same-named items that are both file and dir (rare):
      if -auto or (ask user if should use dirfrom version)
          if dirfrom is a dir
              deletes dirto file, copies dirfrom tree to dirto
          if dirfrom is a file
              deletes dirto tree, copies dirfrom file to dirto
          else
              ignore: the names may be something else (fifos?, not symlinks) 
      else takes no action

There naturally are alternative algorithms (e.g., resolution might just
delete item in TO (file or dir) and then copy item in FROM (dir or file)),
but they may lead to redundant steps and less-intuitive action reporting.
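
The "-auto" branch of the pseudocode above can be sketched for one directory
level as follows.  This is an illustration only, not mergeall.py's code: the
function name, the dict-based inputs, and the action labels are hypothetical,
and recursion into common subdirectories is omitted.

```python
def plan_auto_actions(items_from, items_to, differs):
    """List the -auto actions for one directory level.

    items_from/items_to map names to 'file', 'dir', or another kind;
    differs(name) says if two same-named files mismatch.
    """
    actions = []
    names_from, names_to = set(items_from), set(items_to)
    common = sorted(names_from & names_to)
    known = ('file', 'dir')

    for name in common:                          # differing same-named files
        if items_from[name] == items_to[name] == 'file' and differs(name):
            actions.append(('copy file to dirto', name))

    for name in sorted(names_from - names_to):   # uniques in dirfrom
        if items_from[name] in known:
            actions.append(('copy %s to dirto' % items_from[name], name))

    for name in sorted(names_to - names_from):   # uniques in dirto
        if items_to[name] in known:
            actions.append(('delete %s from dirto' % items_to[name], name))

    for name in common:                          # file/dir mixes (rare)
        kinds = (items_from[name], items_to[name])
        if kinds in (('dir', 'file'), ('file', 'dir')):
            actions.append(('delete %s from dirto' % kinds[1], name))
            actions.append(('copy %s to dirto' % kinds[0], name))
    return actions

items_from = {'a.txt': 'file', 'b.txt': 'file', 'new': 'dir', 'mix': 'dir'}
items_to   = {'a.txt': 'file', 'old.txt': 'file', 'mix': 'file'}
for action in plan_auto_actions(items_from, items_to, lambda name: True):
    print(action)
```

Building a list of actions before applying any of them keeps the reporting
step separate from the filesystem updates, which is one way to support a
"-report" mode with the same comparison logic.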