File: mergeall-products/unzipped/backup.py
""" ========================================================================================== backup.py: automatic backup/restore subsystem (part of the mergeall system [2.0]) Summary: Make backup copies of all files and directories in the TO directory that will be destructively replaced or deleted in-place during a mergeall run. These items' prior versions in the TO tree are saved in the automatically-created __bkp__ folder at the top of the TO archive, with their full directory paths. Backups are not synchronized across trees, but are automatically pruned by age when their number exceeds a limit. This option makes mergeall generally safer, as unwanted or failed changes can be later undone by restoring backup copies. 2.1 extends this model to support complete rollback of a prior run, by merging a backup subfolder to an archive root. It augments backups to also record files added in a __bkp__\subfolder\__added__.txt file, and runs a special merge mode which does not delete unique items in TO, but does delete all items in TO that are listed in the backup's __added__.txt file (which is itself removed from TO after the merge). The net effect restores the TO tree's prior state, as long as the restore run is made before any additional changes in TO. Details: Because a given item name may appear at multiple places in the archive tree, replaced and deleted items are backed up in folders of this form (the mergeall log gives a more linear list of changed items): TO-archive-root\__bkp__\date-time\full-archive-path-less-root\filename Items in the __bkp__ folder are local to the TO archive copy, and save items changed in that copy only. Backup folders are not propagated to other archive copies by mergeall: unlike normal archive items, they are not synchronized across trees. Hence any changes in __bkp__ folders are not themselves backed up. Backup subfolders in __bkp__ are, however, automatically pruned by age when their number exceeds a limit (changeable in the code below). 
This allows changes from any of multiple mergeall runs to be selected for restores,
if needed.

Any error while writing a backup copy of an item causes the replacement or deletion
of that item to be skipped, as the operation is unsafe. The item will register as a
difference on the next mergeall run, if not resolved manually.

Note that users can make arbitrary changes to their __bkp__ folders (including
deleting them altogether), as they are not synchronized and are created automatically
whenever needed. For instance, if the default pruning limit results in too much space
being taken by __bkp__, you may delete portions of it freely.

Also note that __bkp__ folders may register differences in diffall.py runs. As each
run has its own uniquely-named subfolders, these will usually be few, and can
generally be ignored. Per-run folders' content differences may be useful, however,
to compare and analyze backup copies from different runs; run diffall on the
subfolders themselves.

Purpose:

Though this option makes mergeall runs safer in general, it was primarily intended as
an automatic guard against propagating corrupted files across multiple archive copies,
and a better alternative to manual backup copies or multiple archive stores cycled by
date manually.

Merges generally work without issue if all but one archive copy are treated as
read-only; however, unwanted or incorrect changes may spread through archive copies
if changes are made in an archive copy and then synched back to the main copy. In
such cases, the original's versions of files may be quickly lost, if not backed up.

By backing up items to be changed in each TO copy automatically and retaining multiple
runs' backups, bad copies can generally be undone easily by taking the most recent
backup's versions, without falling back to potentially much older full archive copies.
Backup folders also serve as a record of the most recent mergeall runs against a tree.
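
Illustration (a minimal sketch, not part of this module's code): the backup layout
described under Details above can be reproduced with a small helper. The name
backup_path and the sample paths are hypothetical; the slicing mirrors the approach
backupitem() takes below, and assumes toroot has no trailing separator.

```python
import os

def backup_path(pathto, toroot, stamp):
    # Hypothetical helper: map an item's TO path to its backup location,
    # TO-root/__bkp__/<stamp>/<path below the TO root>/<filename>.
    todir, tofile = os.path.split(pathto)
    archtail = todir[len(toroot) + len(os.sep):]    # path below the TO root
    return os.path.join(toroot, '__bkp__', stamp, archtail, tofile)

root = os.path.join('archive', 'to')
item = os.path.join(root, 'docs', 'notes', 'a.txt')
print(backup_path(item, root, 'date240101-time120000'))
```

Items at the top of the TO tree land directly in the run's subfolder, because the
path below the root is then empty.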
------------------------------------------------------------------------------------------

More about __bkp__ folders:

Stores __bkp__ folder at top of TO archive so it doesn't become disjoint from it,
but does not include __bkp__ in the mergeall synchronization -- its contents are
local to the TO archive, and not copied to/from other copies automatically.
This is so, because:

1) Propagating backups between copies in round-robin trips between three (or more)
   devices seems potentially very confusing, and device source would be unclear.

2) Synchronizing would delete copies on TO not also on FROM, thereby limiting TO
   from having more than one backup, in one-way propagation use cases.

3) Automatic backup pruning wouldn't be possible, as it violates the principle
   that changes in the archive must all be disjoint during a single mergeall run.
   If __bkp__ was included in the synchronization process, it's not impossible
   that one of its folders on the TO archive would be both pruned by the backup
   system here and deleted due to an absence in the FROM archive -- regardless
   of the timing, the second deletion attempt would fail with an exception.

More about disjoint changes:

More generally, mergeall works at all only because its updates are mutually
exclusive, and disjoint (a given name can appear in just one change category).
Its changes are limited to these items, in this order:

a) Same-named files (replaced in TO)
b) Unique files and directories in TO (deleted from TO)
c) Unique files and directories in FROM (copied to TO)
d) Same-named file/directory type mismatches between TO and FROM (replaced in TO)

Once any item is classified and changed in one of these categories, it is not
further inspected or altered. Hence, each change in the TO tree is limited to
its TO tree location, and cannot be impacted by, or appear as part of, any other
change category, regardless of the timing of changes made.
Order does matter for renames on case-insensitive machines like Windows, but renames
are sound as long as deletes occur before adds (see mergeall.py's mergetrees()
docstring for more details).

Backups rely on the fact that the change sets are disjoint: a file can be saved for
only one category (a, b, or d), and new copies (c) are never saved and thus cannot
overwrite saves for deletions, even for mixed-case differences on Windows.

Synchronizing the __bkp__ tree between archives would violate this disjointedness
rule, however, because automatic backup subfolder prunes may delete an item also
scheduled for deletion in category (b). This is true even though all differences
are detected before the changes phase begins, and only removing __bkp__ items from
the set of backed-up items does not circumvent this; synchronization changes may
still intersect with prunes.

Therefore, top-level __bkp__ folders are removed from the set of synchronized items
in the archives, and an archive copy retains only backups from mergeall runs where
it served the role of TO archive. Backups can be manually copied into normal archive
folders to propagate them to other archive copies if desired, but their management
then becomes a user task; only items in the special-cased __bkp__ folders are
automatically pruned when their items count limit is exceeded.

------------------------------------------------------------------------------------------

General implementation notes:

Caveat: backup paths may be too long in some cases on some platforms. The original
file's full archive path is recreated and required, because the same name may appear
in more than one place in the tree (and filenames created by concatenating path parts
seem more likely to exceed platform length limits).

Does not store a single '.bkp' alongside original: would need to special-case to
avoid .bkp.bkp... accumulation, and a '.bkp' may be a valid to-be-archived user
file name.
Uses copy instead of move (rename): need to retain original if code here fails,
and also want to retain the original's mod times in the copy. os.rename also
proved unreliable on Windows, especially across devices.

All errors return with an exception: caller handles. In mergeall, the backup
exception here causes the replacement or deletion to be skipped, as the operation
is then unsafe.

Backs up files replaced or deleted, but does not back up new files added, as these
would be just redundant copies of new data.

Backup folders are excluded from the synchronization process by removing only
top-level __bkp__ items from os.listdir results in the comparisons procedure.
This proved more reliable than os.rename moves to/from a temporary folder (see
unused code ahead).

TBD: should backups be stored outside the archive itself? Originally placed in
__bkp__ at root of archive, and synchronized so that prior versions are not stored
on just one device only (which may fail). The downside of this is that, when this
option is used, backups can accumulate quickly, and deletes are removed from the
archive's actual folders but linger on the device taking up space. Could backup
replaced files only, but that's less secure.

RESOLVED: later opted to store __bkp__ in the TO archive for association, but not
synchronize it across other archive copies. This was required, because prunes
violated the disjoint updates rule which is the logical basis of mergeall.

TBD: should backups be pruned? Currently, this is automatic, lest old backups
accumulate. The downside of this is that the user may not want old backups removed.
As an alternative, the end user could be expected to manage backups, and could also
be expected to select the backups folder. This would be flexible, but complicates
the command line and GUI with another directory choice, and deleting backup folders
manually seems a substantial extra admin task.
RESOLVED: backups are stored in the automatically-created __bkp__ at the top of the
TO archives, and are automatically pruned by age after N (default 10) copies have
accumulated. Given that __bkp__ is now local to each archive copy, this policy allows
ample backup retention -- any of the last N runs can be unwound if needed. Users may
also manually delete large backup folders before they are pruned for space.

TBD: __bkp__ folders will register differences in diffall.py runs for their top-level
per-run folders. These can be simply ignored in reports. diffall could skip __bkp__,
like mergeall, but this seems too tight a dependency between the two programs to
enforce.
==========================================================================================
"""


###################
# CODE STARTS HERE
###################


from __future__ import print_function         # Py 2.X
import os, time, glob, shutil, sys, errno, stat, unicodedata

# this module is _largely_ platform-neutral
RunningOnMac     = sys.platform.startswith('darwin')
RunningOnWindows = sys.platform.startswith('win')

import mergeall    # for copyfile, copytree (from won't work here: recursive)

indent1 = '....'   # to distinguish messages here from main mergeall logic
indent2 = '____'   # rmdir retries: not just for pruning (also for uniqueto)

# [2.1] from configs file, unless absent or errors
try:
    from mergeall_configs import MAXBACKUPS
except:
    MAXBACKUPS = 10   # keep up to this many backup folders in each archive copy's __bkp__

#-------------------------------------------------------------
# [2.3] Use UTF8 Unicode encoding for __added__.txt restore
# files in __bkp__, not platform default. These files contain
# filenames which may have arbitrary characters. Python 2.X's
# codecs.open files are always binary mode: must specialize
# line-ends here too. [3.2] ditto for deltas-set __added__.
#-------------------------------------------------------------

ADDENC = 'utf-8'

if sys.version[0] == '3':
    unicode_open = open               # what 3.X probably should have done?
    unicode_linesep = '\n'            # 3.X text mode files expand \n as needed
else:
    import codecs
    unicode_open = codecs.open        # 2.X compatibility (or use from...as...)
    unicode_linesep = os.linesep      # 2.X codecs always binary: no \n expansion

#-------------------------------------------------------------
# Make just 1 subfolder per run, on first file backed up.
# Else may make 1 new subfolder for each second in run!
# Now also used by noteaddition, to get run's subfolder:
# either backupitem or noteaddition may be called first.
#-------------------------------------------------------------

datetimestamp = None

# try to prune just once per run, on first file backed up
pruned = False

# if -quiet, print backups message on first backup only [2.4]
firstbkpmsg = True

# [3.0] fix too-long paths on Windows
from fixlongpaths import FWP

# [3.3] fix Unicode variants in __added__.txt paths
from fixunicodedups import matchUnicodePathnames



def makedirs_ifneeded(dirpath):
    """
    --------------------------------------------------------------------------------------
    [2.3] Run an os.makedirs() call portably on Python 3.X or 2.X to create any and
    all parts of a directory path as needed. Only 3.X has the exist_ok flag to avoid
    an exception if a part already exists, and 2.X doesn't have 3.X's detailed
    exceptions. Alternative: could also call os.path.exists(dirpath) in an "if",
    but slightly slower.

    Note: Python's os.makedirs() in Lib\os.py is recursive, but probably doesn't need
    to be. Because it scans a linear directory path, a simple loop should suffice,
    and yield simpler code (alas, "batteries included" means you get what's shipped
    in the box). Recursion is really required only for arbitrary shapes, such as
    folder trees, and even then can be replaced with explicit stacks (see Learning
    Python 5E, p555-561).
    --------------------------------------------------------------------------------------
    """
    if sys.version[0] == '3':                            # python 3.X
        os.makedirs(FWP(dirpath), exist_ok=True)         # 'recursive' mkdir, as needed
    else:                                                # python 2.X
        try:
            os.makedirs(FWP(dirpath))                    # 2.X has no exist_ok
        except OSError as E:                             # 2.X requires errno test
            import errno                                 # only need here on exc
            if E.errno != errno.EEXIST: raise            # reraise all others



def backupitem(pathto, toroot, dobackup, quiet):
    """
    ----------------------------------------------------------------------------------------
    If enabled by command-line (or by proxy via GUI toggle or console reply), make a
    backup copy of items (files and directories) in the TO tree before they are
    destructively replaced or deleted in-place. This includes a:

    1) File or link in the TO tree that is about to be replaced by a newer same-named
       file in the FROM tree
    2) File or directory in the TO tree that is about to be deleted due to absence
       in the FROM tree
    3) File, directory, or link in TO that is about to be replaced due to a
       dir/file/link type mismatch in the FROM tree
    4) File, directory, or link in TO that is about to be deleted by a -restore run
       because it's listed in __added__.txt

    Unique files in FROM copied to TO are not backed up, as this is not a destructive
    action, though they are listed by noteaddition in the backup's __added__.txt file
    for the run.

    "pathto" is where the item to be destroyed resides; "toroot" is the top of the
    archive in the TO tree, where the backup will be stored in a __bkp__ subfolder
    under pathto's tail. Does not back up if not "dobackup": backups not enabled per
    user or rollback.py script. Prints only one general backup log message if "quiet",
    to minimize logfile clutter.

    Either this or noteaddition may be called first: prune + timestamp if not yet done.
    No longer needed, because __bkp__ not synched: "if pathto.startswith(bkproot): return".

    [3.0] Use newly-added "strict" arg to copytree() to force it to propagate its first
    file exception to here, instead of printing an error message and continuing. We
    need to pass the exception to this function's caller, so the item's update is
    cancelled if its backup fails. Else, we might delete a TO folder without backing
    up parts of it!

    [3.0] Don't use copytree()'s new "skipcruft" here: we're backing up data already
    present in the TO tree.

    [3.0] Handle symlinks (to both files or dirs) by passing them to copyfile() just
    like normal/simple files; copyfile() tests for links up-front and copies specially.
    Symlinks are copied, not followed: backup the link itself, not what it refers to!

    [3.2] Now also called to backup __added__.txt items removed from TO by -restore
    runs, to allow delta-set changes to be rolled back (and rollbacks to be unrolled...).
    ----------------------------------------------------------------------------------------
    """
    global firstbkpmsg

    # do nothing if "-backup" not passed by command, launcher, or rollback.py
    if not dobackup: return    # avoid nesting

    assert pathto.startswith(toroot)    # sanity check: changed file must be in TO tree
    todir, tofile = '?', '?'            # initialize for early exceptions
    try:
        # prune old backups the first time through here (or noteaddition)
        try:
            prunebkpdirs(toroot)
        except:
            print(indent1 + 'prune failed, but backups and mergeall continued')
            print(indent1 + '%s %s' % (sys.exc_info()[0], sys.exc_info()[1]))

        # make run's subfolder timestamp first time here (or noteaddition)
        datetimestamp = makeruntimestamp()

        # verify or create the backup-to path in the TO copy
        bkproot = os.path.join(toroot, '__bkp__')          # toroot is cmdargs.dirto
        todir, tofile = os.path.split(pathto)
        archtail = todir[(len(toroot) + len(os.sep)):]     # remove prefix=cmdargs.dirto
        bkppath  = os.path.join(bkproot, datetimestamp, archtail)
        makedirs_ifneeded(bkppath)                         # 'recursive' mkdir, as needed

        # copy file or dir over to backup copy
        copytopath = os.path.join(bkppath, tofile)
        if not quiet:
            print(indent1 + 'backing up %s to %s' % (tofile, copytopath))
        elif firstbkpmsg:
            # [2.4] suppress per-file messages (superfluous?), but indicate backups mode once
            allbkpsroot = os.path.join(bkproot, datetimestamp)
            print(indent1 + 'backing up all items to %s' % allbkpsroot)   # [3.0] not bkppath!
            firstbkpmsg = False

        if (os.path.isfile(FWP(pathto)) or            # [3.0] treat links like files
            os.path.islink(FWP(pathto))):             # [3.0] fix long windows paths
            mergeall.copyfile(pathto, copytopath)     # this never catches excs

        elif os.path.isdir(FWP(pathto)):
            os.mkdir(FWP(copytopath))                             # [3.0] fix long Windows paths
            mergeall.copytree(pathto, copytopath, strict=True)    # [3.0] fail on any except

        else:
            assert False, ('unknown file type: ' + pathto)        # e.g. fifos: punt

    except:
        print(indent1 + '**Error backing up %s in %s' % (tofile, todir))
        raise   # reraise: handle in caller - cancel update, as it would be destructive



def makeruntimestamp():
    """
    ---------------------------------------------------------------------------------------
    Make the run's unique timestamp, to be used for its subfolder name in the TO tree's
    __bkp__ backups folder.
    Factored to here, as this may be triggered by either backupitem or noteaddition,
    either of which may be called first during a mergeall run.
    ---------------------------------------------------------------------------------------
    """
    global datetimestamp
    if not datetimestamp:
        datetimestamp = time.strftime('date%y%m%d-time%H%M%S')   # backup's unique top dir name
    return datetimestamp



def prunebkpdirs(toroot, maxbackups=MAXBACKUPS):
    """
    ---------------------------------------------------------------------------------------
    On first backup in session, auto-delete the oldest backup dir(s) in the TO archive
    if needed, keeping just the most recent N. Most of this was adapted from the
    frigcal GUI's backup system. TBD: this could be left to user, but seems likely
    to accumulate. Caller handles any exceptions here: this pre-merge step shouldn't
    be fatal - proceed.

    [3.0] Recoded to skip a failed directory and continue the prune to process other
    folders, instead of ending the prune at the first failure. Else, this may miss
    folders in the unlikely event that a failure of a more recent backup folder (e.g.,
    permissions) prevents the prune from reaching earlier backups later on the list...
    which can only ever happen if the #backups has been reduced in the configs file.
    Users must address the failure to allow the failing folder to be pruned eventually.
    Callers are still notified with an exception, but details will be displayed here.
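
Illustration (a minimal sketch, assuming folder names follow the
dateYYMMDD-timeHHMMSS pattern made by makeruntimestamp): the selection step of the
prune can be isolated as a pure function. select_prunes is a hypothetical name, and
reading the "- 1" as leaving room for the run about to be saved is an assumption,
not something the code states.

```python
def select_prunes(names, maxbackups=10):
    # Sort newest first: the timestamped names order chronologically as text.
    ordered = sorted(names, reverse=True)
    # Keep the newest maxbackups - 1 folders; everything after that is pruned.
    return ordered[maxbackups - 1:]

names = ['date240101-time090000',
         'date240103-time090000',
         'date240102-time090000']
print(select_prunes(names, maxbackups=3))
```

With three existing folders and a limit of three, only the oldest folder is
selected for removal.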
    ---------------------------------------------------------------------------------------
    """
    global pruned
    if pruned:
        return
    else:
        pruned = True

    bkproot = os.path.join(toroot, '__bkp__')
    if os.path.exists(bkproot):
        backuppatt  = 'date*-time*'
        currbackups = glob.glob(os.path.join(bkproot, backuppatt))
        currbackups.sort(reverse=True)
        prunes = currbackups[(maxbackups - 1):]            # earliest last, via names sort

        anyfailed = False
        for prunee in prunes:                              # globs have full paths
            print(indent1 + 'pruning', prunee)             # normally 0 or 1, unless failed
            try:
                shutil.rmtree(FWP(prunee, force=True), onerror=rmtreeworkaround)
            except Exception:
                anyfailed = True
                print(indent1 + 'this prunee failed, but pruning continued')
                print(indent1 + '%s %s' % (sys.exc_info()[0], sys.exc_info()[1]))
        assert not anyfailed, 'Some prunes had errors'     # [3.0] notify caller of excs



def noteaddition(pathto, toroot, dobackup):
    """
    --------------------------------------------------------------------------------------
    [2.1] Log unique FROM items (files and dirs) added to the TO tree in a text file,
    with 1 file path (relative to the TO tree's root) per line. The logged items
    are stored in file: toroot\__bkp__\subfolder\__added__.txt.

    This allows additions to be automatically removed by "-restore" as part of a later
    run to perform a complete rollback. The adds file also serves as run documentation,
    in addition to logfiles.

    Either this or backupitem may be called first: prune + timestamp if not yet done.

    [3.2] The archtail length calc here can be thrown off by a trailing / or \ in
    toroot (a.k.a. cmdargs_dirto); it's now removed in getargs() to avoid the drama.
    This happened only when toroot cmdline arg had a trailing slash, and only for
    top-level, unnested adds, but truncated the leftmost character in this case.
    For items nested in folders, a double '//' left by the comparison-phase walker
    spared the calcs here, and backupitem() was made immune by its split() call.
    This also impacted the new deltas.py, where the issue was first uncovered.
    This does not impact diffall.py or cpall.py (they simply treat two // as one).

    [3.3] Note only: this doesn't print per-item messages, because the mergeall.py
    caller does (for the new copied item recorded here). By contrast, deltas.py
    now does print a message in similar code that adds to an __added__.txt file in
    delta sets (for future TO removals), because the mergeall.py print code is not run.
    --------------------------------------------------------------------------------------
    """
    if not dobackup: return    # avoid nesting

    try:
        # prune old backups the first time through here (or backupitem)
        try:
            prunebkpdirs(toroot)
        except:
            print(indent1 + 'prune failed, but note-add and mergeall continued')
            print(indent1 + '%s %s' % (sys.exc_info()[0], sys.exc_info()[1]))

        # make run's subfolder timestamp first time here (or backupitem)
        datetimestamp = makeruntimestamp()

        # build and make the adds file's path
        addsname = '__added__.txt'
        bkproot  = os.path.join(toroot, '__bkp__')           # toroot is cmdargs.dirto
        runroot  = os.path.join(bkproot, datetimestamp)      # bkp subfolder for this run
        addspath = os.path.join(runroot, addsname)           # adds file in run's subfolder
        makedirs_ifneeded(runroot)                           # 'recursive' mkdir, as needed

        # write copied item's relative path to adds file
        # [2.3] use utf8 for filenames, not platform default
        archtail = pathto[(len(toroot) + len(os.sep)):]      # remove prefix=cmdargs.dirto
        addsfile = unicode_open(addspath, encoding=ADDENC, mode='a')   # add to end of file
        try:                                                 # OLD: use default encoding in 3.X
            addsfile.write(archtail + unicode_linesep)       # 3.X expands \n, 2.X does not
        finally:
            addsfile.close()

    except:
        # do NOT reraise: no need to cancel the update, as adds are non-destructive
        print('**Error noting add of %s' % pathto)
    else:
        # don't issue a trace message here, as it seems gratuitous in the logs
        # print(indent1 + 'noted addition in TO of', archtail)
        pass



def trymangle(delpath):
    """
    ----------------------------------------------------------------------
    [3.2] For removeprioradds(): on Windows, mangle nonportable filename
    characters to "_" to match other tools' mods. Do so always, instead
    of just on errors: there's no way the unmangled form could be stored,
    and os.path type tests fail with unmangled names.

    Mangling can happen when content is copied from Unix to Windows, by
    ziptools default unzips or similar. Mergeall in general assumes that
    FROM names are mangled as needed by unzips or other copies before
    syncs are run on Windows, but __added__.txt lists evade such transforms.
    Hence, this code must mangle items listed in __added__.txt to match
    the names under which they are stored on TO on Windows.

    Caveats: names may be mangled differently by some tools (e.g., macOS
    copies to FAT32/exFAT use Unicode privates), but the unmangled form
    clearly won't work. This also risks removing files whose names match
    the mangled name only coincidentally, but that seems astronomically
    unlikely, and skipping removals seems worse.

    See ziptools.ziptools.py and ziptools' _README.html#nomangle for more
    details; per the latter, run the included fix-nonportable-filenames.py
    before transferring content to skip filename interoperability issues
    like this altogether, and always for transfers to Android shared
    storage not mangled here.

    Theory: this use case cannot arise when propagating content with
    Mergeall alone (it never mangles), or using intermediate drives with
    filesystems or platforms that disallow nonportable names (e.g., Linux
    refuses to write such names to FAT32 and exFAT). However, some other
    tools mangle names on Windows (e.g., ziptools), and some drive/platform
    combos allow but munge names (e.g., macOS munges to and from Unicode
    privates on exFAT).
    In particular, changes propagated indirectly by Mergeall's deltas.py
    and ziptools may wind up mangled on only Windows: comparisons to a
    proxy drive with unmangled or munged characters may record unique TO
    names in __added__.txt in unmangled form, but these names are mangled
    on Windows if content was unzipped there by ziptools.

    TBD: should mangling also be done for FAT/exFAT write errors on Linux?
    ziptools does not mangle on Linux (it skips with error messages), but
    other tools may. Really, the filename fixer script should be run in
    this use case - unlike macOS (which munges nonportable chars to/from
    Unicode privates covertly), Linux mangles will be perpetual diffs.
    ----------------------------------------------------------------------
    """
    startpath = delpath

    # drop drive first, so ':' not mangled
    drive, delpath = os.path.splitdrive(delpath)       # 'c:', '\folder\file'

    # split path to parts
    delparts = delpath.split(os.path.sep)              # only on windows \ here, [0]=''

    # illegal chars
    nonportables = ' \x00 / \\ | < > ? * : " '         # for filesystems, not platforms
    nonportables = nonportables.replace(' ', '')       # drop space used for readability

    if not any(c in part for part in delparts for c in nonportables):
        # none found: mangling won't help
        return startpath
    else:
        # mangle entire path, and report
        replacements = {ord(c): '_' for c in nonportables}
        mangledparts = [part.translate(replacements) for part in delparts]
        mangledpath  = os.path.sep.join(mangledparts)
        mangledpath  = drive + mangledpath
        print(indent1 + '--Name mangled: "%s" => "%s"' % (startpath, mangledpath))
        return mangledpath



def removeprioradds(fromroot, toroot, dobackup, quiet):
    """
    ------------------------------------------------------------------------------------
    [2.1] Remove items listed in fromroot's __added__.txt file from the toroot tree,
    if __added__.txt and the listed files are present, as part of a complete rollback
    from backups (or deltas-set apply) in a mergeall.py "-restore" run.
    This is a pre-merge step: order matters for renames on Windows (the merge must
    delete and then add, else delete may remove differently cased names).

    Assumes fromroot is a __bkp__ subfolder (or at least has an __added__.txt), but
    does not fail if not -- in all cases, ignore exceptions here. The user may have
    deleted __added__.txt to back out removals+replacements only (not adds), and may
    have created a custom __added__.txt elsewhere in another tree to be merged.

    We need to care about closing the file on exceptions; this is now a pre-merge step.
    The __bkp__ folder's __added__.txt will be copied over to the TO root by the normal
    merge; it's deleted manually later, rather than skipping the name during the merge.

    CAVEAT: because noteaddition() records added items using the path syntax of the
    platform on which the prior mergeall ran, it's not possible to remove prior
    additions on a different platform having incompatible path syntax without editing
    either the additions file or the code here. A file added on Windows, for example,
    will be noted with "\" in its path, which likely won't work in a restore on Linux.
    This could be addressed by always using "/" in additions file paths and running
    os.path.normpath() to convert to "\" on Windows only, but this seems a rare use
    case. As is, restores with additions should be run on the same platform as the
    prior merge.

    [3.0] UPDATE: this former caveat has been lifted, by converting paths in the
    backup's __added__.txt file for the hosting platform's separators. Thus, backups
    made on Windows can now be restored on Unix, and vice versa.

    [3.2] CAVEAT: this may turn '\' in Unix filenames into path separators, but there
    seems no reliable way to know if __added__.txt was created on Unix or Windows.

    [3.0] Handle symlinks (to both files or dirs) by passing them to os.remove() just
    like normal/simple files; shutil.rmtree() has issues with symlinks at the top;
    symlinks are removed, not followed: delete the link itself, not what it refers to!

    [3.0] Changed arguably-confusing message format here from "restore removed file:"
    to the more consistent "removed added file:" -- it was removed, not restored.

    [3.2] Backup deletions here: to support rollbacks of delta sets made by the new
    deltas.py, this now always backs up the items it removes from TO, instead of
    simply deleting them. For deltas applied with mergeall.py's "-restore -backup",
    this means that items which were unique in TO will be put back by a later run
    with "-restore", along with other undos. This also makes it possible to roll
    back true rollbacks made with "-restore -backup", restoring an archive to its
    former post-sync state; this use case is unlikely, but is supported at a small
    cost in extra backups size.

    [3.2] Mangle __added__.txt names on Windows, so they match mangled names that may
    have been saved by ziptools extracts for indirect deltas.py syncs; see trymangle().

    [3.3] Morph __added__.txt pathnames to the new TO device for delta applies, in
    case any component has an equivalent but different decoded Unicode representation.
    Also change "removed added..." to generic "removed listed..." for delta-sets usage,
    and overloaded "-quiet" to suppress messages for normalized names (here, in paths).
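
Illustration (a minimal sketch, not this function's code): the [3.0] separator
conversion noted above, reduced to a helper that can be tried in isolation.
portable_added_path and its sep parameter are hypothetical, added only so that both
platforms' behavior can be checked on one machine; the real code joins with
os.path.join and always uses os.sep.

```python
import os

BACKSLASH = chr(92)    # the Windows separator, spelled without an escape sequence

def portable_added_path(line, toroot, sep=os.sep):
    # Strip the line end, then map both separator styles to the target one.
    rel = line.rstrip()
    rel = rel.replace('/', sep).replace(BACKSLASH, sep)
    return toroot + sep + rel

# a path recorded on Windows, restored on a platform that uses '/'
print(portable_added_path('docs' + BACKSLASH + 'a.txt', 'root', sep='/'))
```

As the [3.2] caveat above warns, this also converts a literal backslash inside a
Unix filename.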
    ------------------------------------------------------------------------------------
    """
    addsname = '__added__.txt'
    addspath = os.path.join(fromroot, addsname)
    numfilesdel = numdirsdel = 0

    if os.path.exists(addspath) and os.path.isfile(addspath):

        # [2.3] use utf8, not platform default
        addsfile = unicode_open(addspath, encoding=ADDENC)   # propagate open() exceptions: cancel merge
        try:                                                 # OLD: adds file uses default encoding in 3.X
            while True:                                      # decodes can fail - catch via while, not for
                try:
                    line = addsfile.readline()
                except:
                    print('**Error: restore cannot read added file name: file retained')
                    print(sys.exc_info()[0], sys.exc_info()[1])
                    continue
                else:
                    if not line: break    # eof

                # [3.0] make restore paths portable
                delpath = line.rstrip()
                delpath = delpath.replace('/', os.sep).replace('\\', os.sep)
                delpath = os.path.join(toroot, delpath)

                # [3.2] replace nonportable characters on windows
                if RunningOnWindows:
                    delpath = trymangle(delpath)

                # [3.3] morph Unicode in path to the TO device tree
                tracer = print if not quiet else lambda *args: None
                delpath = matchUnicodePathnames(delpath, tracer)

                if os.path.isfile(FWP(delpath)) or os.path.islink(FWP(delpath)):   # [3.0] +symlinks
                    try:                                                          # [3.0] longpaths
                        backupitem(delpath, toroot, dobackup, quiet)              # [3.2] backup
                        os.remove(FWP(delpath))
                    except:
                        print('**Error: restore cannot delete file, retained:', delpath)
                        print(sys.exc_info()[0], sys.exc_info()[1])
                    else:
                        numfilesdel += 1
                        print(indent1 + 'removed listed file:', delpath)

                elif os.path.isdir(FWP(delpath)):
                    try:
                        backupitem(delpath, toroot, dobackup, quiet)              # [3.2] backup
                        shutil.rmtree(FWP(delpath, force=True), onerror=rmtreeworkaround)
                    except:
                        print('**Error: restore cannot delete dir, retained:', delpath)
                        print(sys.exc_info()[0], sys.exc_info()[1])
                    else:
                        numdirsdel += 1
                        print(indent1 + 'removed listed dir:', delpath)

                else:
                    print('**Error: restore skipped missing or unknown type file:', delpath)

        except:
            print('**Error during prior adds removal')    # others? don't reraise: do merge
            print(sys.exc_info()[0], sys.exc_info()[1])
        finally:
            addsfile.close()    # close, except or not (non-CPython)

    return (numfilesdel, numdirsdel)    # sums: add to merge's delete counts



def dropaddsfile(toroot):
    """
    -------------------------------------------------------------------------------
    In "-restore" mode, as a post-merge step get rid of the __added__.txt that the
    normal merge may have copied over to TO's root. This is a special case, but
    it's quicker to drop it forcibly from the root here than to check for it as a
    skipped filename at each tree level during the merge (though the merge's code
    supports skipping __bkp__ at top, __added__.txt otherwise).

    [3.2] This is no longer used: __added__.txt is now skipped during the
    comparisons phase, but only at the trees' top levels; see mergeall.py.
    -------------------------------------------------------------------------------
    """
    addsname = '__added__.txt'
    mergedaddspath = os.path.join(toroot, addsname)
    if os.path.exists(mergedaddspath):
        os.remove(mergedaddspath)
        return True
    else:
        return False    # don't adjust merge's counters or print message



def rmtreeworkaround(function, fullpath, exc_info):
    """
    ---------------------------------------------------------------------------------------
    Catch and try to recover from failures in Python's shutil.rmtree() folder removal
    tool, which calls this function automatically on all system errors.
    ---------------------------------------------------------------------------------------

    [2.0] PENDING DELETION FAILURES:

    On Windows, deletes may be marked as pending, and not finalized atomically, leaving
    an item in place after the delete call returns. This can cause rmtree (shutil or
    custom) operations to fail with a directory-not-empty error in rare cases, subject
    to devices and other activity on the machine.

    This seems a shortcoming (bug?) in shutil.rmtree for Windows, and may be improved
    in the future.
In fact, Python's own test system uses a custom rmtree with wait loops to avoid the issue. Here, update failures are mostly harmless (leaving a difference to be resolved on the next run), and rare (seen on only 1 machine in 1 year's usage), but errors are better avoided. Short of low-level C API possibilities, the two solutions seem to be to move (os.rename) to a temp file and delete from there, or fall into a brief wait loop to watch for the file removal to be finalized. The former is subject to some os.rename oddness (see ahead), and the latter is used in Python's own test system for rmtree calls. Adopt the latter here -- this function is a callback on errors in shutil.rmtree(), and retries the rmdir in a wait-but-bounded loop. Note that this applies to, and is used by, _both_ backup folder removals here and general content-tree removals in mergeall.py for unique dirs in the TO tree; it's here because it was first observed during backup folder removals. The fix here mimics Python's test system's wait-timing technique of exponentially increasing delay times up to half a second. Usually 0 or 1, but at most 10, retries are run, with delays from .001 to .512 seconds (to see how this is computed, run code [x = 0.001, while x < 1.0: print(x); x *= 2]). Caveats: Deletes that are only pending seem a curious property for a filesystem, and this fix feels hackish. But this is harmless (it kicks in only on os.rmdir failures, adding a minor delay), and there's no budget for further research... Could watch for not empty (ENOTEMPTY) only, but other errors are not inconceivable. 
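    As a minimal sketch of the wait-timing schedule described above (not this
    module's handler itself): the names backoff_delays and retry_with_backoff are
    hypothetical, and the sketch treats any OSError as retriable, whereas the real
    handler also returns early when the directory is already gone (ENOENT).

```python
import time

def backoff_delays(start=0.001, limit=1.0):
    # Yield doubling delays: .001, .002, ... .512 (10 values for the defaults).
    delay = start
    while delay < limit:
        yield delay
        delay *= 2

def retry_with_backoff(action, delays, sleep=time.sleep):
    # Hypothetical helper: call action() until it succeeds or delays run out.
    # Returns True on success, False if every attempt raised OSError.
    for delay in delays:
        try:
            action()
        except OSError:
            sleep(delay)        # wait, then retry with the next (longer) delay
        else:
            return True
    return False
```

    For instance, retry_with_backoff(lambda: os.rmdir(path), backoff_delays()) would
    retry a failed rmdir at most 10 times, sleeping about one second in total.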

    Related threads (though something more authoritative from Microsoft would be nice):
        http://stackoverflow.com/questions/3764072/c-win32-how-to-wait-for-a-pending-delete-to-complete
        http://bugs.python.org/issue19811

    Note: shutil.rmtree could also be replaced with the following (sans some Unix cruft):

        for (root, dirs, files) in os.walk(top, topdown=False):
            for name in files:
                os.remove(os.path.join(root, name))
            for name in dirs:
                os.rmdir(os.path.join(root, name))

    but this doesn't fix the Windows pending-deletes issue, and may be less robust and
    portable than shutil's time-honed alternative.

    UPDATE: the retry loop has now been seen to fire in more use cases: both when a
    file is truly in use (in which case, the loop and removal ultimately fail), and
    when it is not (in which case, the loop generally runs one or two times, and the
    removal succeeds). Strange, but true...

    ---------------------------------------------------------------------------------------
    [3.0] READ-ONLY PERMISSION FAILURES (pass):

    As a different issue, rmtree operations can also fail due to read-only files in the
    tree. To work around this, an onerror handler like the following can be used, which
    is portable to both Unix-en and Windows, and works like a Unix "rm -rf":

        import stat
        def onerror(func, path, exc_info):          # this is portable code
            if not os.access(path, os.W_OK):        # read-only permission?
                os.chmod(path, stat.S_IWRITE)       # change to allow writes
                func(path)                          # and retry operation
            else:
                raise

    This workaround was also incorporated into the general onerror handler below as a
    first step, before attempting the retry loop described above. However, this code
    is currently DISABLED (via the False), because it makes no sense to override
    read-only permissions in this single context only (what about simple file deletes?),
    and the user may have marked an item read-only on purpose to protect it. Instead,
    users are expected to fix their read-only permissions and run mergeall again.

    It can be argued that permission changes may be okay in this context, because users
    understand that mergeall intends to remove items, and read-only items will leave
    trash behind if not fixed. Moreover, files on Windows are sometimes marked read-only
    without any user action (e.g., camera card copies), causing rmtree to fail
    unexpectedly. This argument was rejected in the end, in favor of mergeall's
    overarching policy that your data is your property; read-only files should never be
    deleted without user intervention, even if this incurs extra manual steps.

    ---------------------------------------------------------------------------------------
    [3.0]..[3.3] FILE-NOT-FOUND ERRORS:

    On macOS, a hidden AppleDouble "._xxx" resource-fork file is automatically deleted
    with its "xxx" data-fork (real) file. This occasionally causes exceptions in Python's
    shutil.rmtree, if the AppleDouble is deleted automatically _before_ rmtree gets
    around to deleting it manually. Though rare, this has been seen to happen both on
    fast internal SSD and slow USB flash drives. Skip the exception (and file) here for
    Mac "._" names only.

    A similar coupling can occur on Windows (e.g., for web-site folders and their HTML
    files, though this may be an Explorer kludge). As of [3.2], this case is now handled
    explicitly for folders before the pending-deletion loop.

    Should this simply always ignore all file-not-found errors? A file is gone if it's
    gone, but file-not-found might be triggered in other contexts that shouldn't be
    muted (e.g., long paths).

    UPDATE: Windows long paths seem moot here. They can trigger file-not-found too, but
    this should never arise in Mergeall. It prefixes all paths by FWP() to lift the
    default path-length limit, before they are passed to system calls. Hence, this
    context seems impossible, though file-not-found may arise otherwise.

    UPDATE: per testing, Windows auto-deletion of folders with their same-named files
    appears to happen in Explorer only, and not in the shell or Python.
    Conversely, macOS auto-deletion of "._" AppleDouble files with their non-"._" files
    IS automatic in all contexts tested, including the shell and Python (create and view
    a file on an exFAT or FAT drive to see for yourself). This seems to be embedded
    deeply in macOS's libraries, and likely reflects the odd and proprietary
    resource/data-fork files split on that platform--a topic generally best ignored...
    except in POSIX tree-removal code like this.

    UPDATE [3.3]: per later research in the Android Deltas Sync system, Windows
    auto-deletes are part of a Windows concept known as Connected Files, which can be
    tweaked in the registry; Explorer just happens to recognize this concept. Still,
    auto-deletes do NOT happen in either the Windows shell or Python file ops, so the
    test here is moot, and has been disabled as of [3.3]. For a fuller exploration,
    visit the parallel common.rmtreeworkaround() tools in A-D-S, at
    https://learning-python.com/android-deltas-sync/common.py.
    ---------------------------------------------------------------------------------------
    """

    """
    # NO: why chmod in this context only?
    # try to fix read-only errors first? [3.0]
    if (not os.access(fullpath, os.W_OK) and function in [os.rmdir, os.remove, os.unlink]):
        try:
            print(indent2 + 'fixing read-only on', fullpath)
            os.chmod(fullpath, stat.S_IWRITE)
            function(fullpath)
        except:
            pass      # fail: try other workaround below (or not? tbd)
        else:
            return    # okay: this fix worked, proceed with shutil.rmtree
    """

    # Windows only, directory deletes only [2.0]
    if RunningOnWindows and function == os.rmdir:

        """
        # NO: moot per above - disable till use case arises [3.3]
        # assume auto-removed with associated file by Windows, or other [3.2]
        if exc_info[0] == FileNotFoundError:
            msg = '**Note: ignored FileNotFoundError for Windows dir'
            print(msg, fullpath)
            return                                   # folder deleted with file: proceed
        """

        # wait for pending deletes of contents
        timeout = 0.001                              # nit: need to try iff ENOTEMPTY
        while timeout < 1.0:                         # 10 tries only, increasing delays
            print(indent2 + 'retrying rmdir')        # set off, but not just for pruning
            try:
                os.rmdir(fullpath)                   # rerun the failed delete (post FWP!)
            except os.error as exc:
                if exc.errno == errno.ENOENT:        # no such file (not-empty=ENOTEMPTY)
                    return                           # it's now gone: proceed with rmtree
                else:
                    time.sleep(timeout)              # wait for a fraction of second (.001=1 msec)
                    timeout *= 2                     # and try again, with longer delay
            else:
                return                               # it's now gone: proceed with rmtree

    # macOS only, ignore file-not-found for AppleDouble files [3.0]
    if exc_info[0] == FileNotFoundError:
        if RunningOnMac:
            itemname = os.path.basename(fullpath)
            if itemname.startswith('._'):
                # assume auto-removed with associated file by macOS, or other
                print('**Note: ignored FileNotFoundError for AppleDouble', fullpath)
                return

        elif RunningOnWindows:
            # NO: this seems moot - isdir() should be False if FileNotFound [3.3];
            # if this once had a purpose in a prior coding, it's been lost to time;
            # it may have been for files auto-deleted with folders - that's moot too;
            pass
            """
            if os.path.isdir(fullpath):
                # assume removed by Windows, or other
                print('**Note: ignored FileNotFoundError for Windows dir', fullpath)
                return
            # or just ignore all not-found excs on Windows?
            """

    raise  # all other cases, or wait loop end: reraise exception to kill rmtree caller



'''DEFUNCT
==============================================================================================
THE FOLLOWING FUNCTIONS ARE NO LONGER USED (but retained as examples and lessons)

Instead of the following, restructured mergeall's recursive comparison algorithm
to skip '__bkp__' items in the top-level os.listdir result only.

Testing in Python 3.X showed os.rename to be unreliable. On Windows, it fails when
the directories are on different devices (e.g., a USB stick and C:, possibly due to
differing file systems). It also generated unexplainable permission errors on one
Windows test machine, even when the source and destination were on the same file system.

As recursive copies and deletes are slow, recoded comparisons to skip the folders in
pure Python code instead. The shutil.move call tries os.rename and falls back on
copy+delete too, but it's prone to the same issues seen here.
==============================================================================================

import tempfile, stat

def excludebkpdirs(toroot, fromroot):
    """
    ---------------------------------------------------------------------------------------
    Remove both archive's __bkp__ dirs from consideration, before diffs detection begins.

    To avoid complicating and slowing (or rewriting) change detection, simply move
    (rename) these out to a temp dir, and restore them after change detection finishes.
    They will not register changes, and so won't be propagated to any other archives.

    This and restorebkpdirs are coded fairly defensively, as this requires system calls;
    os.rename has been seen to fail for a true temp dir on Windows due to permissions
    (for no readily apparent reason...), so resort to program's cwd as a fallback option.

    On Windows, the destination of os.rename cannot exist, even for dirs; use new subdirs.
    mkdtemp adds random 6-character sequences to dir name till unique; add pid to be sure.
    Caveat: this may run up against directory path-length limits on some platforms?
    ---------------------------------------------------------------------------------------
    """
    global tempdir, temptobkp, tempfrombkp
    tempdir = temptobkp = tempfrombkp = None
    exists, join = os.path.exists, os.path.join
    try:
        tobkp = join(toroot, '__bkp__')
        frombkp = join(fromroot, '__bkp__')

        if exists(tobkp) or exists(frombkp):
            try:
                tempdir = tempfile.mkdtemp(prefix='mergeall-', suffix=str(os.getpid()))
                if RunningOnWindows:
                    os.chmod(tempdir, stat.S_IWRITE)       # may require force writeable?
                open('temp.txt', 'w').write('try rename\n')
                try:
                    os.rename('temp.txt', join(tempdir, 'temp.txt'))
                except:
                    os.remove('temp.txt')
                    raise                                  # reraise
                else:
                    os.remove(join(tempdir, 'temp.txt'))
            except:
                print('using cwd as temp dir fallback')    # show sys.exc_info()[0,1]?
                print(sys.exc_info()[0], sys.exc_info()[1])
                tempdir = os.getcwd()                      # temp unusable; or os.curdir

        if exists(tobkp):
            print('excluding', tobkp)
            temptobkp = join(tempdir, 'to.__bkp__')
            os.rename(tobkp, temptobkp)                    # quick move, not copy

        if exists(frombkp):
            print('excluding', frombkp)
            tempfrombkp = join(tempdir, 'from.__bkp__')
            os.rename(frombkp, tempfrombkp)                # either can exist or not
    except:
        print('Cannot move __bkp__ to temp: rerun after manually moving out of archive')
        assert False, 'mergeall changes cancelled'


def restorebkpdirs(toroot, fromroot):
    """
    ---------------------------------------------------------------------------------------
    Restore __bkp__ folders from temp dir, after diffs detection, and before changes begin.
    See excludebkpdirs above for more details.
    ---------------------------------------------------------------------------------------
    """
    global tempdir, temptobkp, tempfrombkp
    join = os.path.join
    try:
        if temptobkp:
            tobkp = join(toroot, '__bkp__')
            print('restoring', tobkp)
            os.rename(temptobkp, tobkp)

        if tempfrombkp:
            frombkp = join(fromroot, '__bkp__')
            print('restoring', frombkp)
            os.rename(tempfrombkp, frombkp)
    except:
        print('Cannot restore __bkp__ from temp, changes cancelled: restore from %s' % tempdir)
        assert False, 'mergeall changes cancelled'
    else:
        if tempdir != None and tempdir != os.getcwd():     # not if fallback to cwd!
            os.rmdir(tempdir)


def isbkpdir(path, archroot):
    """
    ---------------------------------------------------------------------------------------
    Original idea:
    Call this from mergeall to skip a TO or FROM __bkp__ path during diffs detection phase.
    Because these are skipped, they won't trigger any changes in the changes phase.
    This is required to avoid including __bkp__ in synchronization (see top docstring).
    Later replaced with os.rename moves, which was later replaced with comparison recoding.
    ---------------------------------------------------------------------------------------
    """
    bkproot = os.path.join(archroot, '__bkp__')
    return os.path.normpath(path).startswith(os.path.normpath(bkproot))   # equate / and \; case?

    # that is...
    #return path[(len(archroot) + len(os.sep)):] == '__bkp__'

DEFUNCT'''