File: mergeall-products/unzipped/backup.py

"""
==========================================================================================
backup.py:
  automatic backup/restore subsystem (part of the mergeall system [2.0])

Summary:

Make backup copies of all files and directories in the TO directory that will
be destructively replaced or deleted in-place during a mergeall run.  These 
items' prior versions in the TO tree are saved in the automatically-created 
__bkp__ folder at the top of the TO archive, with their full directory paths.
Backups are not synchronized across trees, but are automatically pruned by age 
when their number exceeds a limit.  This option makes mergeall generally safer,
as unwanted or failed changes can be later undone by restoring backup copies.

2.1 extends this model to support complete rollback of a prior run, by merging
a backup subfolder to an archive root.  It augments backups to also record files
added in a __bkp__\subfolder\__added__.txt file, and runs a special merge mode
which does not delete unique items in TO, but does delete all items in TO that
are listed in the backup's __added__.txt file (which is itself removed from TO
after the merge).  The net effect restores the TO tree's prior state, as long
as the restore run is made before any additional changes in TO.

Details:

Because a given item name may appear at multiple places in the archive tree,
replaced and deleted items are backed up in folders of this form (the mergeall
log gives a more linear list of changed items):

    TO-archive-root\__bkp__\date-time\full-archive-path-less-root\filename

Items in the __bkp__ folder are local to the TO archive copy, and save items changed in
that copy only.  Backup folders are not propagated to other archive copies by mergeall:
unlike normal archive items, they are not synchronized across trees.  Hence any changes
in __bkp__ folders are not themselves backed up.  Backup subfolders in __bkp__ are,
however, automatically pruned by age when their number exceeds a limit (changeable
in the code below).  This allows changes from any of multiple mergeall runs to be
selected for restores, if needed.

Any errors while writing a backup copy of an item causes the replacement or deletion
of that item to be skipped, as the operation is unsafe.  The item will register as
a difference on the next mergeall run, if not resolved manually.

Note that users can make arbitrary changes to their __bkp__ folders (including deleting
them altogether), as they are not synchronized and are created automatically whenever
needed.  For instance, if the default pruning limit results in too much space being
taken by __bkp__, you may delete portions of it freely.

Also note that __bkp__ folders may register differences in diffall.py runs.  As each
run has its own uniquely-named subfolders, these will usually be few, and can generally
be ignored.  Per-run folders' content differences may be useful, however, to compare and
analyze backup copies from different runs; run diffall on the subfolders themselves.

Purpose:

Though this option makes mergeall runs safer in general, it was primarily intended as an
automatic guard against propagating corrupted files across multiple archive copies, and
an better alternative to manual backup copies or multiple archive stores cycled by date
manually.  Merges generally works without issue if all but one archive copy are treated
as read-only; however, unwanted or incorrect changes may spread though archive copies
if changes are made in an archive copy and then synched back to the main copy.  In such
cases, the original's versions of files may be quickly lost, if not backed up.

By backing up items to be changed in each TO copy automatically and retaining multiple
runs' backups, bad copies can generally be undone easily by taking the most recent
backup's versions, without falling back to potentially much older full archive copies.
Backup folders also serve as a record of the most recent mergeall runs against a tree.

------------------------------------------------------------------------------------------

More about __bkp__ folders:

Stores __bkp__ folder at top of TO archive so it doesn't become disjoint from it, but
does not include __bkp__ in the mergeall synchronization -- its contents are local to
the TO archive, and not copied to/from other copies automatically.  This is so, because:

1) Propagating backups between copies in round-robin trips between three (or more)
devices seems potentially very confusing, and device source would be unclear.

2) Synchronizing would delete copes on TO not also on FROM, thereby limiting TO
from having more than one backup, in one-way propagation use cases.

3) Automatic backup pruning wouldn't be possible, as it violates the principle that
changes in the archive must all be disjoint during a single mergeall run.  If __bkp__
was included in the synchronization process, it's not impossible that one of its folders
on the TO archive would be both pruned by the backup system here and deleted due to an
absence in the FROM archive -- regardless of the timing, the second deletion attempt
would fail with an exception.

More about disjoint changes:

More generally, mergeall works at all only because its updates are mutually exclusive,
and disjoint (a given name can appear in just one change category).  Its changes are
limited to these items, in this order:

  a) Same-named files (replaced in TO)
  b) Unique files and directories in TO (deleted from TO)
  c) Unique files and directories in FROM (copied to TO)
  d) Same-named file/directory type mismatches between TO and FROM (replaced in TO)

Once any item is classified and changed in one of these categories, it is not
further inspected or altered.  Hence, each change in the TO tree is limited to
its TO tree location, and cannot be impacted by, or appear as part of, any other
changes category, regardless of the timing of changes made.  Order does matter for
renames on case-insensitive machines like Windows, but are sound as long as deletes
occur before adds (see mergeall.py's mergetrees() docstring for more details).

Backups rely on the fact that the change sets are disjoint: a file can be saved
for only one category (a, b, or d), and new copies (c) are never saved and thus
cannot overwrite saves for deletions, even for mixed-case differences on Windows.

Synchronizing the __bkp__ tree between archives would violate this disjointedness rule,
however, because automatic backup subfolder prunes may delete an item also scheduled for
deletion in category (b).  This is true even though all differences are detected before
the changes phase begins, and only removing __bkp__ items from the set of backed-up
items does not circumvent this; synchronization changes may still intersect with prunes.

Therefore, top-level __bkp__ folders are removed from the set of synchronized items
in the archives, and an archive copy retains only backups from mergeall runs where
it served the role of TO archive.  Backups can be manually copied into normal archive
folders to propagate them to other archive copies if desired, but their management
then becomes a user task; only items in the special-cased __bkp__ folders are
automatically pruned when their items count limit is exceeded.

------------------------------------------------------------------------------------------

General implementation notes:

Caveat: backup paths may be too long in some cases on some platforms.  The original file's
full archive path is recreated and required, because the same name may appear in more
than one place in the tree (and filenames created by concatenating path parts seem more
likely to exceed platform length limits).  

Does not store a single '.bkp' alongside original: would need to special-case to avoid
.bkp.bkp... accumulation, and a '.bkp' may be a valid to-be-archived user file name.

Uses copy instead of move (rename): need to retain original if code here fails, and also
want to retain the original's mod times in the copy.  os.rename also proved unreliable
on Windows, especially across devices.

All errors return with an exception: caller handles. In mergeall, the backup exception
here causes the removal or deletion to be skipped, as the operation is then unsafe.

Backs up files replaced or deleted, but does not backup new files added, as these
would be just redundant copies of new data.

Backup folder are excluded from the synchronization process by removing only top-level
__bkp__ items from os.listdir results in the comparisons procedure.  This proved more
reliable than os.rename moves to/from a temporary folder (see unused code ahead).

TBD: should backups be stored outside the archive itself?  Originally placed in __bkp__
at root of archive, and synchronized so that prior versions are not stored on just one
device only (which may fail).  The downside of this is that, when this option is used,
backups can accumulate quickly, and deletes are removed from the archive's actual folders
but linger on the device taking up space.  Could backup replaced files only, but that's
less secure.

     RESOLVED: later opted to store __bkp__ in the TO archive for association, but
     not synchronize it across other archive copies.  This was required, because
     prunes violated the disjoint updates rule which is the logical basis of mergeall.

TBD: should backups be pruned?  Currently, this is automatic, lest old backups
accumulate.  The downside of this is that the user may not want old backups removed.
As an alternative, the end user could be expected to manage backups, and could also be
expected to select the backups folder.  This would be flexible, but complicates the
command line and GUI with another directory choice, and deleting backup folders manually
seems a substantial extra admin task.

    RESOLVED: backups are stored in the automatically-created __bkp__ at the top of the
    TO archives, and are automatically pruned by age after N (default 10) copies have
    accumulated.  Given that __bkp__ is now local to each archive copy, this policy
    allows ample backup retention -- any of the last N runs can be unwound if needed.
    Users may also manually delete large backup folders before they are pruned for space.

TBD: __bkp__ folders will register differences in diffall.py runs for their top-level
per-run folders.  These can be simply ignored in reports.  diffall could skip __bkp__,
like mergeall, but this seems too tight a dependency between the two programs to enforce.
==========================================================================================
"""




###################
# CODE STARTS HERE
###################



from __future__ import print_function   # Py 2.X
import os, time, glob, shutil, sys, time, errno, stat, unicodedata

# this module is _largely_ platform-neutral
RunningOnMac     = sys.platform.startswith('darwin')
RunningOnWindows = sys.platform.startswith('win')

import  mergeall    # for copyfile, copytree (from won't work here: recursive)
indent1 = '....'    # to distinguish messages here from main mergeall logic
indent2 = '____'    # rmdir retries: not just for pruning (also for uniqueto)


# [2.1] from configs file, unless absent or errors
try:
    from mergeall_configs import MAXBACKUPS 
except:
    MAXBACKUPS = 10   # keep up to this many backup folders in each archive copy's __bkp__


#-------------------------------------------------------------
# [2.3] Use UTF8 Unicode encoding for __added__.txt restore 
# files in __bkp__, not platform default.  These files contain
# filenames which may have arbitrary characters.  Python 2.X's 
# codecs.open files are always binary mode: must specialize 
# line-ends here too.  [3.2] ditto for deltas-set __added__.
#-------------------------------------------------------------

ADDENC = 'utf-8'

if sys.version[0] == '3':
    unicode_open = open                         # what 3.X probably should have done?
    unicode_linesep = '\n'                      # 3.X text mode files expand \n as needed
else:
    import codecs
    unicode_open = codecs.open                  # 2.X compatibility (or use from...as...)
    unicode_linesep = os.linesep                # 2.X codecs always binary: no \n expansion


#-------------------------------------------------------------
# Make just 1 subfolder per run, on first file backed up.
# Else may make 1 new subfolder for each second in run!
# Now also used by noteaddition, to get run's subfolder:
# either backupitem or noteaddition may be called first.
#-------------------------------------------------------------

datetimestamp = None

# try to prune just once per run, on first file backed up
pruned = False

# if -quiet, print backups message on first backup only [2.4]
firstbkpmsg = True

# [3.0] fix too-long paths on Windows 
from fixlongpaths import FWP

# [3.3] fix Unicode variants in __added__.txt paths
from fixunicodedups import matchUnicodePathnames




def makedirs_ifneeded(dirpath):
    """
    --------------------------------------------------------------------------------------
    [2.3] Run an os.makedirs() call portably on Python 3.X or 2.X to create any and all
    parts of a directory path as needed.  Only 3.X has the exists_ok flag to avoid an
    exception if a part already exists, and 2.X doesn't have 3.X's detailed exceptions.
    Alternative: could also call os.path.exists(dirpath) in an "if", but slightly slower.

    Note: Python's os.makedirs() in Lib\os.py is recursive, but probably doesn't need
    to be.  Because it scans a linear directory path, a simple loop should suffice, and
    yield simpler code (alas, "batteries included" means you get what's shipped in the
    box).  Recursion is really required only for arbitrary shapes, such as folder trees,
    and even then can be replaced with explicit stacks (see Learning Python 5E, p555-561).
    --------------------------------------------------------------------------------------
    """
    if sys.version[0] == '3':
        # python 3.X
        os.makedirs(FWP(dirpath), exist_ok=True)          # 'recursive' mkdir, as needed
    else:
        # python 2.X
        try:
            os.makedirs(FWP(dirpath))                     # 2.X has no exists_ok
        except OSError as E:                              # 2.X requires errno test
            import errno                                  # only need here on exc
            if E.errno != errno.EEXIST: raise             # reraise all others




def backupitem(pathto, toroot, dobackup, quiet):
    """
    ----------------------------------------------------------------------------------------
    If enabled by command-line (or by proxy via GUI toggle or console reply), make a backup
    copy of items (files and directories) in the TO tree before they are destructively
    replaced or deleted in-place.  This includes a:

    1) File or link in the TO tree
           that is about to be replaced by a newer same-named file in the FROM tree
    2) File or directory in the TO tree
           that is about to be deleted due to absence in the FROM tree
    3) File, directory, or link in TO
           that is about to be replaced due to a dir/file/link type mismatch in the FROM tree
    4) File, directory, or link in TO
           that is about to be deleted by a -restore run because it's listed in __added__.txt

    Unique files in FROM copied to TO are not backed up, as this is not a destructive action,
    though they are listed by noteaddition in the backup's __added__.txt file for the run.
    
    "pathto" is where the item to be destroyed resides; "toroot" is the top of the archive in
    the TO tree, where the backup will be stored in a __bkp__ subfolder under pathto's tail.
    Does not backup if not "dobackup": backups not enabled per user or rollback.py script.
    Prints only one general backup log message if "quiet", to mimimize logfile clutter.
    
    Either this or noteaddition may be called first: prune + timestamp if not yet done.
    No longer needed, because __bkp__ not synched: "if pathto.startswith(bkproot): return".

    [3.0] Use newly-added "strict" arg to copytree() to force it to propagate its first
    file exception to here, instead of printing an error message and continuing.  We need
    to pass the eception to this function's caller, so the item's update is cancelled if
    its backup fails.  Else, we might delete a TO folder without backing up parts of it!
    
    [3.0] Don't use copytree()'s new "skipcruft" here: we're backing up data already
    present in the TO tree.

    [3.0] Handle symlinks (to both files or dirs) by passing them to copyfile() just
    like normal/simple files; copyfile() tests for links up-front and copies specially.
    Symlinks are copied, not followed: backup the link itself, not what it refers to!

    [3.2] Now also called to backup __added__.txt items removed from TO by -restore 
    runs, to allow delta-set changes to be rolled back (and rollbacks to be unrolled...).
    ----------------------------------------------------------------------------------------
    """
    global firstbkpmsg

    # do nothing if "-backup" not passed by command, launcher, or rollback.py
    if not dobackup:
        return  # avoid nesting

    assert pathto.startswith(toroot)       # sanity check: changed file must be in TO tree
    todir, tofile = '?', '?'               # initialize for early exceptions
    
    try:    
        # prune old backups the first time through here (or noteaddition)
        try:
            prunebkpdirs(toroot)
        except:
            print(indent1 + 'prune failed, but backups and mergeall continued')
            print(indent1 + '%s %s' % (sys.exc_info()[0], sys.exc_info()[1]))

        # make run's subfolder timestamp first time here (or noteaddition)
        datetimestamp = makeruntimestamp()
        
        # verify or create the backup-to path in the TO copy
        bkproot = os.path.join(toroot, '__bkp__')                   # toroot is cmdargs.dirto
        todir, tofile = os.path.split(pathto)
        archtail = todir[(len(toroot) + len(os.sep)):]              # remove prefix=cmdargs.dirto
        bkppath = os.path.join(bkproot, datetimestamp, archtail)
        makedirs_ifneeded(bkppath)                                  # 'recursive' mkdir, as needed

        # copy file or dir over to backup copy 
        copytopath = os.path.join(bkppath, tofile)
        if not quiet:
            print(indent1 + 'backing up %s to %s' % (tofile, copytopath))
        elif firstbkpmsg:
            # [2.4] suppress per-file messages (superfluous?), but indicate backups mode once
            allbkpsroot = os.path.join(bkproot, datetimestamp)
            print(indent1 + 'backing up all items to %s' % allbkpsroot)    # [3.0] not bkppath!
            firstbkpmsg = False
        
        if (os.path.isfile(FWP(pathto)) or                        # [3.0] treat links like files
            os.path.islink(FWP(pathto))):                         # [3.0] fix long windows paths 
            mergeall.copyfile(pathto, copytopath)                 # this never catches excs
            
        elif os.path.isdir(FWP(pathto)):
            os.mkdir(FWP(copytopath))                             # [3.0] fix long Windows paths
            mergeall.copytree(pathto, copytopath, strict=True)    # [3.0] fail on any except

        else:
            assert False, ('unknown file type: ' + pathto)        # e.g. fifos: punt            

    except:
        print(indent1 + '**Error backing up %s in %s' % (tofile, todir))
        raise   # reraise: handle in caller - cancel update, as it would be destructive




def makeruntimestamp():
    """
    ---------------------------------------------------------------------------------------
    Make the run's unique timestamp, to be used for its subfolder name in the TO tree's
    __bkp__ backups folder.  Factored to here, as this may be triggered by either backitem
    or noteaddition, either of which may be called first during a mergeall run.
    ---------------------------------------------------------------------------------------
    """
    global datetimestamp
    
    if not datetimestamp:
        datetimestamp = time.strftime('date%y%m%d-time%H%M%S')  # backup's unique top dir name
    return datetimestamp




def prunebkpdirs(toroot, maxbackups=MAXBACKUPS):
    """
    ---------------------------------------------------------------------------------------
    On first backup in session, auto-delete the oldest backup dir(s) in the TO archive
    if needed, keeping just the most recent N.  Most of this was adapted from the frigcal
    GUI's backup system.  TBD: this could be left to user, but seems likely to accumulate.
    Caller handles any exceptions here: this pre-merge step shouldn't be fatal - proceed.

    [3.0] Recoded to skip a failed directory and continue the prune to process other
    folders, instead of ending the prune at the first failure.  Else, this may miss
    folders in the unlikely event that a failure of a more recent backup folder (e.g.,
    permissions) prevents the prune from reaching earlier backups later on the list...
    which can only ever happen if the #backups has been reduced in the configs file.
    Users must address the failure to allow the failing folder to be pruned eventually.
    Callers are still notified with an exception, but details will be displayed here.
    ---------------------------------------------------------------------------------------
    """
    global pruned
    
    if pruned:
        return
    else:
        pruned = True
        bkproot = os.path.join(toroot, '__bkp__')
        if os.path.exists(bkproot):
            
            backuppatt  = 'date*-time*'
            currbackups = glob.glob(os.path.join(bkproot, backuppatt))
            currbackups.sort(reverse=True)
            prunes = currbackups[(maxbackups - 1):]              # earliest last, via names sort

            anyfailed = False
            for prunee in prunes:                                # globs have full paths
                print(indent1 + 'pruning', prunee)               # normally 0 or 1, unless failed
                try:
                    shutil.rmtree(FWP(prunee, force=True), onerror=rmtreeworkaround)
                except Exception:
                    anyfailed = True
                    print(indent1 + 'this prunee failed, but pruning continued')
                    print(indent1 + '%s %s' % (sys.exc_info()[0], sys.exc_info()[1]))                    

            assert not anyfailed, 'Some prunes had errors'       # [3.0] notify caller of excs




def noteaddition(pathto, toroot, dobackup):
    """
    --------------------------------------------------------------------------------------
    [2.1] Log unique FROM items (files and dirs) added to the TO tree in a text file,
    with 1 file path (relative to the TO tree's root) per line.  The logged items are
    stored in file: toroot\__bkp__\subfolder\__added__.txt.  This allows additions to be
    automatically removed by "-restore" as part of a later run to perform a complete
    rollback.  The adds file also serves as run documentation, in addition to logfiles.
    Either this or backupitem may be called first: prune + timestamp if not yet done.

    [3.2] The archtail length calc here can be thrown off by a trailing / or \ in 
    toroot (a.k.a. cmdargs_dirto); it's now removed in getargs() to avoid the drama.
    This happened only when toroot cmdline arg had a trailing slash, and only for
    top-level, unnested adds, but truncated the leftmost character in this case. 
    For items nested in folders, a double '//' left by the comparison-phase walker 
    spared the calcs here, and backupitem() was made immune by its spilt() call.
    This also impacted the new deltas.py, where the issue was first uncovered.
    This does not impact diffall.py or cpall.py (they simply treat two // as one).

    [3.3] Note only: this doesn't print per-item messages, because the mergeall.py 
    caller does (for the new copied item recorded here).  By contrast, deltas.py now
    does print a message in similar code that adds to an __added__.txt file in delta
    sets (for future TO removals), because the mergeall.py print code is not run.
    --------------------------------------------------------------------------------------
    """
    if not dobackup:
        return  # avoid nesting

    try:
        # prune old backups the first time through here (or backupitem)
        try:
            prunebkpdirs(toroot)
        except:
            print(indent1 + 'prune failed, but note-add and mergeall continued')
            print(indent1 + '%s %s' % (sys.exc_info()[0], sys.exc_info()[1]))

        # make run's subfolder timestamp first time here (or backupitem)
        datetimestamp = makeruntimestamp()

        # build and make the adds file's path 
        addsname = '__added__.txt'
        bkproot = os.path.join(toroot, '__bkp__')                    # toroot is cmdargs.dirto
        runroot = os.path.join(bkproot, datetimestamp)               # bkp subfolder for this run
        addspath = os.path.join(runroot, addsname)                   # adds file in run's subfolder
        makedirs_ifneeded(runroot)                                   # 'recursive' mkdir, as needed

        # write copied item's relative path to adds file
        # [2.3] use utf8 for filenames, not platform default
        archtail = pathto[(len(toroot) + len(os.sep)):]              # remove prefix=cmdargs.dirto
        addsfile = unicode_open(addspath, encoding=ADDENC, mode='a') # add to end of file
        try:                                                         # OLD: use default encoding in 3.X
            addsfile.write(archtail + unicode_linesep)               # 3.X expands \n, 2.X does not 
        finally:
            addsfile.close()

    except:
        # do NOT reraise: no need to cancel the update, as adds are non-destructive
        print('**Error noting add of %s' % pathto)
        
    else:
        # don't issue a trace mesage here, as it seems gratuitous in the logs
        # print(indent1 + 'noted addition in TO of', archtail)
        pass




def trymangle(delpath):
    """
    ----------------------------------------------------------------------
    [3.2] For removeprioradds(): on Windows, mangle nonportable filename 
    characters to "_" to match other tools' mods.  Do so always, instead
    of just on errors: there's no way the unmangled form could be stored, 
    and os.path type tests fail with unmangled names.

    Mangling can happen when content is copied from Unix to Windows, by 
    ziptools default unzips or similar.  Mergeall in general assumes that
    FROM names are mangled as needed by unzips or other copies before 
    syncs are run on Windows, but __added__.txt lists evade such transforms.
    Hence, this code must mangle items listed in __added__.txt to match the 
    names under which they are stored on TO on Windows.

    Caveats: names may be mangled differently by some tools (e.g., macOS 
    copies to FAT32/exFAT use Unicode privates), but the unmangled form 
    clearly won't work.  This also risks removing files whose names match 
    the mangled name only coincidentally, but that seems astronomically 
    unlikely, and skipping removals seems worse.  See ziptools.ziptools.py
    and ziptools' _README.html#nomangle for more details; per the latter, 
    run the included fix-nonportable-filenames.py before transferring 
    content to skip filename interoperability issues like this altogether,
    and always for transfers to Android shared storage not mangled here.

    Theory: this use case cannot arise when propagating content with 
    Mergeall alone (it never mangles), or using intermediate drives with
    filesystems or platforms that disallow nonportable names (e.g., Linux
    refuses to write such names to FAT32 and exFAT).  However, some other
    tools mangle names on Windows (e.g., ziptools), and some drive/platform
    combos allow but munge names (e.g., macOS munges to and from Unicode 
    privates on exFAT).  In particular, changes propagated indirectly by 
    Mergeall's deltas.py and ziptools may wind up mangled on only Windows:
    comparisons to a proxy drive with unmangled or munged characters may 
    record unique TO names in __added__.txt in unmangled form, but these 
    names are mangled on Windows if content was unzipped there by ziptools.

    TBD: should mangling also be done for FAT/exFAT write errors on Linux?
    ziptools does not mangle on Linux (it skips with error messages), but
    other tools may.  Really, the filename fixer script should be run in
    this use case - unlike macOS (which munges nonportable chars to/from 
    Unicode privates covertly), Linux mangles will be perpetual diffs.
    ----------------------------------------------------------------------
    """
    startpath = delpath

    # drop drive first, so ':' not mangled
    drive, delpath = os.path.splitdrive(delpath)     # 'c:', '\folder\file'

    # split path to parts
    delparts = delpath.split(os.path.sep)            # only on windows \ here, [0]=''

    # illegal chars
    nonportables = ' \x00 / \\ | < > ? * : " '       # for filesystems, not platforms
    nonportables = nonportables.replace(' ', '')     # drop space used for readability

    if not any(c in part for part in delparts for c in nonportables):
        # none found: mangling won't help
        return startpath
    else:
        # mangle entire path, and report
        replacements = {ord(c): '_' for c in nonportables}
        mangledparts = [part.translate(replacements) for part in delparts]
        mangledpath  = os.path.sep.join(mangledparts)
        mangledpath  = drive + mangledpath
        print(indent1 + '--Name mangled: "%s" => "%s"' % (startpath, mangledpath))
        return mangledpath




def removeprioradds(fromroot, toroot, dobackup, quiet):
    """
    ------------------------------------------------------------------------------------
    [2.1] Remove items listed in fromroot's __added__.txt file from the toroot 
    tree, if __added__.txt and the listed files are present, as part of a complete 
    rollback from backups (or deltas-set apply) in a mergall.py "-restore" run.  
    This is a pre-merge step: order matters for renames on Windows (the merge 
    must delete and then add, else delete may remove differently cased names).

    Assumes fromroot is a __bkp__ subfolder (or at least has an __added__.txt), but 
    does not fail if not -- in all cases, ignore exceptions here.  The user may have
    deleted __added__.txt to back out removals+replacements only (not adds), and may 
    have created a custom __added__.txt elsewhere in another tree to be merged.

    We need to care about closing the file on exceptions; this is now a pre-merge step.
    The __bkp__ folder's __added__.txt will be copied over to the TO root by the normal
    merge; it's deleted manually later, rather than skipping the name during the merge.

    CAVEAT: because noteaddition() records added items using the path syntax of the
    platform on which the prior mergeall ran, it's not possible to remove prior
    additions on a different platform having incompatible path syntax without editing
    either the additions file or the code here.  A file added on Windows, for example,
    will be noted with "\" in its path, which likely won't work in a restore on Linux.
    This could be addressed by always using "/" in additions file paths and running
    os.path.normpath() to convert to "\" on Windows only, but this seems a rare use case. 
    As is, restores with additions should be run on the same platform as the prior merge.

      [3.0] UPDATE: this former caveat has been lifted, by converting paths in the
      backup's __added__.txt file for the hosting platform's separators.  Thus,
      backups made on Windows can now be restored on Unix, and vice versa.
      [3.2] CAVEAT: this may turn '\' in Unix filenames into path separators, but there 
      seems no reliable way to know if __added__.txt was created on Unix or Windows. 

    [3.0] Handle symlinks (to both files or dirs) by passing them to os.remove() just
    like normal/simple files; shutil.rmtree() has issues with symlinks at the top;
    symlinks are removed, not followed: delete the link itself, not what it refers to!

    [3.0] Changed arguably-confusing message format here from "restore removed file:"
    to the more consistent "removed added file:" -- it was removed, not restored.

    [3.2] Backup deletions here: to support rollbacks of delta sets made by the new 
    deltas.py, this now always backs up the items it removes from TO, instead of simply
    deleting them.  For deltas applied with mergeall.py's "-restore -backup", this means
    that items which were unique in TO will be put back by a later run with "-restore", 
    along with other undos.  This also makes it possible to roll back true rollbacks 
    made with "-restore -backup", restoring an archive to its former post-sync state;
    this use case is unlikely, but is supported at a small cost in extra backups size. 

    [3.2] Mangle __added__.txt names on Windows, so they match mangled names that may
    have been saved by ziptools extracts for indirect deltas.py syncs; see trymangle().
    
    [3.3] Morph __added__.txt pathnames to the new TO device for delta applies, in 
    case any component has an equivalent but different decoded Unicode representation. 
    Also change "removed added..." to generic "removed listed..." for delta-sets usage,
    and overloaded "-quiet" to suppress messages for normalized names (here, in paths).
    ------------------------------------------------------------------------------------
    """
    addsname = '__added__.txt'
    addspath = os.path.join(fromroot, addsname)
    numfilesdel = numdirsdel = 0

    if os.path.exists(addspath) and os.path.isfile(addspath):
        # [2.3] use utf8, not platform default
        addsfile = unicode_open(addspath, encoding=ADDENC)  # propagate open() exceptions: cancel merge
        try:                                                # OLD: adds file uses default encoding in 3.X
            while True:                                     # decodes can fail - catch via while, not for
                try:
                    line = addsfile.readline()
                except:
                    print('**Error: restore cannot read added file name: file retained')
                    print(sys.exc_info()[0], sys.exc_info()[1])
                    continue
                else:
                    if not line: break  # eof

                # [3.0] make restore paths portable
                delpath = line.rstrip()
                delpath = delpath.replace('/', os.sep).replace('\\', os.sep)                    
                delpath = os.path.join(toroot, delpath)
                
                # [3.2] replace nonportable characters on windows
                if RunningOnWindows:
                    delpath = trymangle(delpath)

                # [3.3] morph Unicode in path to the TO device tree
                tracer = print if not quiet else lambda *args: None
                delpath = matchUnicodePathnames(delpath, tracer)

                if os.path.isfile(FWP(delpath)) or os.path.islink(FWP(delpath)):   # [3.0] +symlinks
                    try:                                                           # [3.0] longpaths
                        backupitem(delpath, toroot, dobackup, quiet)               # [3.2] backup
                        os.remove(FWP(delpath))
                    except:
                        print('**Error: restore cannot delete file, retained:', delpath)
                        print(sys.exc_info()[0], sys.exc_info()[1])
                    else:
                        numfilesdel += 1
                        print(indent1 + 'removed listed file:', delpath)

                elif os.path.isdir(FWP(delpath)):
                    try:
                        backupitem(delpath, toroot, dobackup, quiet)               # [3.2] backup
                        shutil.rmtree(FWP(delpath, force=True), onerror=rmtreeworkaround)
                    except:
                        print('**Error: restore cannot delete dir, retained:', delpath)
                        print(sys.exc_info()[0], sys.exc_info()[1])
                    else:
                        numdirsdel += 1
                        print(indent1 + 'removed listed dir:', delpath)

                else:
                    print('**Error: restore skipped missing or unknown type file:', delpath)

        except:
            print('**Error during prior adds removal')    # others? don't reraise: do merge
            print(sys.exc_info()[0], sys.exc_info()[1])
        finally:
            addsfile.close()                              # close, except or not (non-CPython)

    return (numfilesdel, numdirsdel)                      # sums: add to merge's delete counts




def dropaddsfile(toroot):
    """
    -------------------------------------------------------------------------------
    In "-restore" mode, as a post-merge step get rid of the __added__.txt that
    the normal merge may have copied over to TO's root.  This is a special
    case, but it's quicker to drop it forcibly from the root here than to check
    for it as a skipped filename at each tree level during the merge (though 
    the merge's code supports skipping __bkp__ at top, __added__.txt otherwise).

    [3.2] This is no longer used: __added__.txt is now skipped during the 
    comparisons phase, but only at the trees' top levels; see mergeall.py.
    -------------------------------------------------------------------------------
    """
    addsname = '__added__.txt'
    mergedaddspath = os.path.join(toroot, addsname)
    if os.path.exists(mergedaddspath):
        os.remove(mergedaddspath)
        return True
    else:
        return False   # don't adjust merge's counters or print message




def rmtreeworkaround(function, fullpath, exc_info):
    """
    ---------------------------------------------------------------------------------------
    Catch and try to recover from failures in Python's shutil.rmtree() folder
    removal tool, which calls this function automatically on all system errors.

    ---------------------------------------------------------------------------------------
    [2.0] PENDING DELETION FAILURES:
    
    On Windows, deletes may be marked as pending, and not finalized atomically,
    leaving an item in place after the delete call returns.  This can cause
    rmtree (shutil or custom) operations to fail with a directory-not-empty
    error in rare cases, subject to devices and other activity on the machine.

    This seems a shortcoming (bug?) in shutil.rmtree for Windows, and may be
    improved in the future.  In fact, Python's own test system uses a custom
    rmtree with wait loops to avoid the issue.  Here, update failures are mostly
    harmless (leaving a difference to be resolved on the next run), and rare
    (seen on only 1 machine in 1 year's usage), but errors are better avoided.

    Short of low-level C API possibilities, the two solutions seem to be to move
    (os.rename) to a temp file and delete from there, or fall into a brief wait loop
    to watch for the file removal to be finalized.  The former is subject to some
    os.rename oddness (see ahead), and the latter is used in Python's own test system
    for rmtree calls.  Adopt the latter here -- this function is a callback on errors
    in shutil.rmtree(), and retries the rmdir in a wait-but-bounded loop.

    Note that this applies to, and is used by, _both_ backup folder removals here
    and general content-tree removals in mergeall.py for unique dirs in the TO
    tree; it's here because it was first observed during backup folder removals.
    The fix here mimics Python's test system's wait-timing technique of exponentially
    increasing delay times up to half a second.  Usually 0 or 1, but at most 10,
    retries are run, with delays from .001 to .512 seconds (to see how this is
    computed, run code [x = 0.001, while x < 1.0: print(x); x *= 2]).
    
    Caveats: Deletes that are only pending seem a curious property for a filesystem,
    and this fix feels hackish.  But this is harmless (it kicks in only on os.rmdir
    failures, adding a minor delay), and there's no budget for further research...
    Could watch for not empty (ENOTEMPTY) only, but other errors are not inconceivable.

    Related threads (though something more authoritative from Microsoft would be nice):
    http://stackoverflow.com/questions/3764072/c-win32-how-to-wait-for-a-pending-delete-to-complete
    http://bugs.python.org/issue19811

    Note: shutil.rmtree could also be replaced with the following (sans some Unix cruft):
        for (root, dirs, files) in os.walk(top, topdown=False):
            for name in files:
                os.remove(os.path.join(root, name))
            for name in dirs:
                os.rmdir(os.path.join(root, name))
    but this doesn't fix the Windows pending-deletes issue, and may be less robust
    and portable than shutil's time-honed alternative.

    UPDATE: the retry loop has now been seen to fire in more use cases: both when a
    file is truly in use (in which case, the loop and removal ultimately fail), and
    when it is not (in which case, the loop generally runs one or two times, and the
    removal succeeds).  Strange, but true...

    ---------------------------------------------------------------------------------------   
    [3.0] READ-ONLY PERMISSION FAILURES (pass):
    
    As a different issue, rmtree operations can also fail due to read-only files in
    the tree.  To work around this, an onerror handler like the following can be used,
    which is portable to both Unix-en and Windows, and works like a Unix "rm -rf":

    import stat
    def onerror(func, path, exc_info):           # this is portable code 
        if not os.access(path, os.W_OK):         # read-only permission?
            os.chmod(path, stat.S_IWRITE)        # change to allow writes
            func(path)                           # and retry operation 
        else:
            raise

    This workaround was also incorporated into the general onerror handler below as
    a first step, before attempting the retry loop described above.  However, this
    code is currently DISABLED (via the False), because it makes no sense to override
    read-only permissions in this single context only (what about simple file deletes?),
    and the user may have marked an item read-only on purpose to protect it.  Instead,
    users are expected to fix their read-only permissions and run mergeall again.

    It can be argued that permission changes may be okay in this context, because
    users understand that mergeall intends to remove items, and read-only items will
    leave trash behind if not fixed.  Moreover, files on Windows are sometimes marked
    read-only without any user action (e.g., camera card copies), causing rmtree to
    fail unexpectedly.  This argument was rejected in the end, in favor of mergeall's
    overarching policy that your data is your property; read-only files should never
    be deleted without user intervention, even if this incurs extra manual steps.

    ---------------------------------------------------------------------------------------   
    [3.0]..[3.3] FILE-NOT-FOUND ERRORS:

    On macOS, a hidden AppleDouble "._xxx" resource-fork file is automatically
    deleted with its "xxx" data-fork (real) file.  This occasionally causes
    exceptions in Python's shutil.rmtree, if the AppleDouble is deleted
    automatically _before_ rmtree gets around to deleting it manually.  Though
    rare, this has been seen to happen both on fast internal SSD and slow USB
    flash drives.  Skip the exception (and file) here for Mac "._" names only.

    A similar coupling can occur on Windows (e.g., for web-site folders and
    their HTML files, though this may be an Explorer kludge).  As of [3.2], this 
    case is now handled explicitly for foldrs before the pending-deletion loop.  
    Should this simply always ignore all file-not-found errors?  A file is gone 
    if it's gone, but file-not-found might be triggered in other contexts that 
    shouldn't be muted (e.g., long paths).

    UPDATE: Windows long paths seem moot here.  They can trigger file-not-found 
    too, but this should never arise in Mergeall.  It prefixes all paths by FWP() 
    to lift the default path-length limit, before they are passed to system calls.
    Hence, this context seems impossible, though file-not-found may arise otherwise.

    UPDATE: per testing, Windows auto-deletion of folders with their same-named
    files appears to happen in Explorer only, and not in the shell or Python.  
    Conversely, macOS auto-deletion of "._" AppleDouble files with their non-"._"
    files IS automatic in all contexts tested, including the shell and Python
    (create and view a file on an exFAT or FAT drive to see for yourself).  This
    seems to be embedded deeply in macOS's libraries, and likely reflects the 
    odd and proprietary resource/data-fork files split on that platform--a topic
    generally best ignored... except in POSIX tree-removal code like this.

    UPDATE [3.3]: per later research in the Android Deltas Sync system, Windows 
    auto-deletes are part of a Windows concept known as Connected Files, which
    can be tweaked in the registry; Explorer just happens to recognize this
    concept.  Still, auto-deletes do NOT happen in either the Windows shell or 
    Python file ops, so the test here is moot, and has been disabled as of [3.3]. 
    For a fuller exploration, visit the parallel common.rmtreeworkaround() tools 
    in A-D-S, at https://learning-python.com/android-deltas-sync/common.py.
    ---------------------------------------------------------------------------------------
    """

    """
    # NO: why chmod in this context only?
    # try to fix read-only errors first? [3.0]
    if (not os.access(fullpath, os.W_OK)
        and function in [os.rmdir, os.remove, os.unlink]):
        try:
            print(indent2 + 'fixing read-only on', fullpath)
            os.chmod(fullpath, stat.S_IWRITE)
            function(fullpath)
        except:
            pass     # fail: try other workaround below (or not? tbd)
        else:
            return   # okay: this fix worked, proceed with shutil.rmtree
    """

    # Windows only, directory deletes only [2.0]
    if RunningOnWindows and function == os.rmdir:

        """
        # NO: moot per above - disable till use case arises [3.3]
        # assume auto-removed with associated file by Windows, or other [3.2]
        if exc_info[0] == FileNotFoundError:
            msg = '**Note: ignored FileNotFoundError for Windows dir'
            print(msg, fullpath)
            return                                   # folder deleted with file: proceed                               
        """

        # wait for pending deletes of contents
        timeout = 0.001                              # nit: need to try iff ENOTEMPTY
        while timeout < 1.0:                         # 10 tries only, increasing delays
            print(indent2 + 'retrying rmdir')        # set off, but not just for pruning
            try:
                os.rmdir(fullpath)                   # rerun the failed delete (post FWP!)
            except os.error as exc:
                if exc.errno == errno.ENOENT:        # no such file (not-empty=ENOTEMPTY) 
                    return                           # it's now gone: proceed with rmtree
                else:
                    time.sleep(timeout)              # wait for a fraction of second (.001=1 msec)
                    timeout *= 2                     # and try again, with longer delay
            else:
                return                               # it's now gone: proceed with rmtree

    # macOS only, ignore file-not-found for AppleDouble files [3.0]
    if exc_info[0] == FileNotFoundError:

        if RunningOnMac:
            itemname = os.path.basename(fullpath)
            if itemname.startswith('._'):
                # assume auto-removed with associated file by macOS, or other
                print('**Note: ignored FileNotFoundError for AppleDouble', fullpath)
                return
            
        elif RunningOnWindows:
            # NO: this seems moot - isdir() should be False if FileNotFound [3.3];
            # if this once had a purpose in a prior coding, it's been lost to time;
            # it may have been for files auto-deleted with folders - that's moot too;
            pass
            """
            if os.path.isdir(fullpath):
                # assume removed by Windows, or other
                print('**Note: ignored FileNotFoundError for Windows dir', fullpath)
                return   # or just ignore all not-found excs on Windows?
            """

    raise  # all other cases, or wait loop end: reraise exception to kill rmtree caller







'''DEFUNCT

==============================================================================================
THE FOLLOWING FUNCTIONS ARE NO LONGER USED (but retained as examples and lessons)

Instead of the following, restructured mergeall's recursive comparison algorithm to
skip '__bkp__' items in the top-level os.listdir result only.

Testing in Python 3.X showed os.rename to be unreliable.  On Windows, it fails when
the directories are on different devices (e.g., a USB stick and C:, possibly due to
differing file systems).  It also generated unexplainable permission errors on one
Windows test machine, even when the source and destination were on the same file
system.  As recursive copies and deletes are slow, recoded comparisons to skip
the folders in pure Python code instead.  The shutil.move call tries os.rename and
falls back on copy+delete too, but it's prone to the same issues seen here.
==============================================================================================

import tempfile, stat


def excludebkpdirs(toroot, fromroot):
    """
    ---------------------------------------------------------------------------------------
    Remove both archive's __bkp__ dirs from consideration, before diffs detection begins.
    To avoid complicating and slowing (or rewriting) change detection, simply move
    (rename) these out to a temp dir, and restore them after change detection finishes.
    They will not register changes, and so won't be propagated to any other archives.
    
    This and restorebkpdirs are coded fairly defensively, as this requires system calls;
    os.rename has been seen to fail for a true temp dir on Windows due to permissions
    (for no readily apparent reason...), so resort to program's cwd as a fallback option.
    
    On Windows, the destination of os.rename cannot exist, even for dirs; use new subdirs.
    mkdtemp adds random 6-character sequences to dir name till unique; add pid to be sure.
    Caveat: this may run up against directory path-length limits on some platforms?
    ---------------------------------------------------------------------------------------
    """
    global tempdir, temptobkp, tempfrombkp
    tempdir = temptobkp = tempfrombkp = None
    exists, join = os.path.exists, os.path.join
    try:
        tobkp   = join(toroot, '__bkp__')
        frombkp = join(fromroot, '__bkp__')
        if exists(tobkp) or exists(frombkp):          
            try:
                tempdir = tempfile.mkdtemp(prefix='mergeall-', suffix=str(os.getpid()))
                if RunningOnWindows:
                    os.chmod(tempdir, stat.S_IWRITE)   # may require force writeable?
                open('temp.txt', 'w').write('try rename\n')
                try:
                    os.rename('temp.txt', join(tempdir, 'temp.txt'))
                except:
                    os.remove('temp.txt')
                    raise  # reraise
                else:
                    os.remove(join(tempdir, 'temp.txt'))
            except:
                print('using cwd as temp dir fallback')         # show sys.exc_info()[0,1]?
                print(sys.exc_info()[0], sys.exc_info()[1])
                tempdir = os.getcwd()                           # temp unusable; or os.curdir

            if exists(tobkp):
                print('excluding', tobkp)
                temptobkp = join(tempdir, 'to.__bkp__')
                os.rename(tobkp, temptobkp)                     # quick move, not copy

            if exists(frombkp):
                print('excluding', frombkp)
                tempfrombkp = join(tempdir, 'from.__bkp__')
                os.rename(frombkp, tempfrombkp)                 # either can exist or not
    except:
        print('Cannot move __bkp__ to temp: rerun after manually moving out of archive')
        assert False, 'mergeall changes cancelled'



def restorebkpdirs(toroot, fromroot):
    """
    ---------------------------------------------------------------------------------------
    Restore __bkp__ folders from temp dir, after diffs detection, and before changes begin.
    See excludebkpdirs above for more details.
    ---------------------------------------------------------------------------------------
    """
    global tempdir, temptobkp, tempfrombkp
    join = os.path.join
    try:
        if temptobkp:
            tobkp = join(toroot, '__bkp__')
            print('restoring', tobkp)
            os.rename(temptobkp, tobkp)
            
        if tempfrombkp:
            frombkp = join(fromroot, '__bkp__')
            print('restoring', frombkp)
            os.rename(tempfrombkp, frombkp)
    except:
        print('Cannot restore __bkp__ from temp, changes cancelled: restore from %s' % tempdir)
        assert False, 'mergeall changes cancelled'
    else:
        if tempdir != None and tempdir != os.getcwd():    # not if fallback to cwd!
            os.rmdir(tempdir)
    
    
    
def isbkpdir(path, archroot):
    """
    ---------------------------------------------------------------------------------------
    Original idea: Call this from mergeall to skip a TO or FROM __bkp__ path during diffs
    detection phase.  Because these are skipped, they won't trigger any changes in the
    changes phase.  This is required to avoid including __bkp__ in synchronization (see top
    docstring).  Later replaced with os.rename moves, which was later replaced with
    comparison recoding.
    ---------------------------------------------------------------------------------------
    """
    bkproot = os.path.join(archroot, '__bkp__')
    return os.path.normpath(path).startswith(os.path.normpath(bkproot))  # equate / and \; case?

    # that is...
    #return path[(len(archroot) + len(os.sep)):] == '__bkp__'

DEFUNCT'''

    



[Home page] Books Code Blog Python Author Train Find ©M.Lutz