mergeall — Folder Synchronization for Manual "Clouds"

Last changed Aug-31-2016.

This is the high-level usage guide. If you are looking for brief start-up or version details, see Readme.html and its Quick Start.

This page describes and gives usage pointers for mergeall—an open source Python 3.X/2.X script and tkinter GUI useful for managing backups and changes in multiple copies of large data sets (a.k.a. archives) stored in directory trees (a.k.a. folders). mergeall is specifically targeted at quickly synchronizing changes in content mirrored across multiple devices such as laptops, tablets and USB flashdrives, and in some contexts can provide a manual alternative to cloud-based storage.


The Short Story

This system makes a destination folder the same as a source folder quickly.

It first detects differences by walking the two directory trees in full, comparing their structure, and checking their files' timestamps and sizes, with an optional limited content test. It then resolves variances by automatically or selectively running in-place changes for differing items only—copying changed and unique items in the source to the destination, and pruning unique items in the destination. These changes are applied to both files and folders, as described in more detail ahead.
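The timestamp-and-size test described above can be sketched for a single pair of same-named files as follows. This is only an illustration, not mergeall's own code: the function name files_differ and the slack_seconds tolerance are assumptions made for this example.

```python
import os

def files_differ(path_from, path_to, slack_seconds=2):
    """Hypothetical helper: True if two same-named files look different.

    Only sizes and modification times are compared; no file bytes are read.
    slack_seconds is an assumed tolerance for filesystems with coarse
    timestamps (FAT, for example, records write times in 2-second steps).
    """
    stat_from = os.stat(path_from)
    stat_to = os.stat(path_to)
    if stat_from.st_size != stat_to.st_size:
        return True
    return abs(stat_from.st_mtime - stat_to.st_mtime) > slack_seconds
```

Because a check like this never opens the files, its cost depends on the number of files rather than their total size, which is why it is so much quicker than a byte-for-byte compare.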

The net result is a fast one-way synchronization that makes the entire destination tree identical to the source, without requiring full-tree copies. This may be used to synchronize multiple folder copies to each other directly, or to and from a common base (e.g., a USB stick or local network drive). In the latter mode, the base device serves the same intermediary role as cloud storage, and program runs achieve the same effect as cloud transfers.

Though broader in scope, the most common role for this system is mirroring large archives across multiple devices—on changes, run mergeall once to synchronize to a USB flashdrive (or other), run again to propagate to other computers as desired, and run diffall occasionally to verify archive integrity, as covered ahead here and here.

A Few Details

The system spans 3,518 source-code lines. It consists of a 1,318-line main script (roughly half of which is docs); a 680-line threaded tkinter/Tkinter GUI launcher; a 303-line interactive console launcher; and new and modified utility scripts and modules that span 1,217 lines. It's related to, and reuses parts of, PP4E's diffall and cpall examples, but is designed to merge trees quickly, without byte-for-byte compares or exhaustive copies. To protect data, the system can also automatically make backup copies of items changed, and restore a tree's prior version altogether if needed (recent upgrades described ahead).

For its author's perhaps pathological use case—a currently 73G archive with 45K files and 2.6K folders—full copies and compares can run for hours, regardless of the volume of changes made. Because the mergeall in-place merge system updates only for actual differences, it typically finishes in just 1 minute when changes are moderate. Running twice to merge changes both to and from an intermediary storage device serves to synchronize two computers, and usually takes just 5 minutes or less. These times reflect USB 2.0 ports on some devices and might be better for all-USB 3.0 usage, but merges are strikingly quicker than copies either way (see more timing details here and here).

Sync versus Merge

It may be useful to note that mergeall is not the same as a Windows file explorer folder merge. Unlike mergeall, Windows 7's merge does not automatically skip unchanged items (or even process them distinctly); handle mixed-mode names; or prune unique items in the destination. It also doesn't report on its plans, back up changes outside the archive, or support true rollbacks. Windows 8's merge improves on this by allowing users to manually skip all unchanged files whose timestamps and sizes match; this requires complex and error-prone user interaction, though, and still does not remove unique items in the destination, which means that renames and deletions in the source are not propagated.

In other words, Windows explorer merge does not synchronize two trees; it simply combines them, and depending on user choices, it forms only their sum or union. By contrast, mergeall does not combine trees; it quickly makes a destination the same as a source. This makes its role overlap much more closely with cloud storage, which seeks to unify a single archive across multiple devices. With mergeall, archive copies are unified to and from local storage devices instead of a remote server, but the goal and effect are similar. The key difference is that cloud storage generally resides on devices owned by a third party that fully controls access and price; with mergeall, your media is your own.

Why mergeall?

In short: privacy and control. The mergeall system supports just one approach to archive copy synchronization, based on manual whole-archive merges (see the usage notes ahead); may require changes on some platforms with more exotic file types (see its limitations ahead); and does not address the more difficult problem of multiple differing copies of the same file. On the other hand, it might just help you avoid giving away your personal property to cloud providers (and/or advertisers and intelligence agencies!). There's more on the tradeoffs of cloud storage in the conclusion, after we explore mergeall usage details.

 

Code, Docs, and Screenshots

This section summarizes available resources for readers who prefer to jump into a program right away (and others who might more shrewdly return here after perusing the usage modes guide and other recommendations ahead).

 

Code and Docs

Fetch the mergeall distribution package at the following listed link, and please mind its usage warnings—this system changes a directory permanently by design, though 2.0's backups option also saves prior versions of items replaced or deleted:

Some of the important bits in the package:

You can view the entire contents of the zipfile, including all its source code, either on your own computer after unpacking, or online at this site. mergeall is also available on this code index page.

 

Screenshots and Examples

The package also contains screen captures, logfiles, and sessions that help document the system's behavior. The links below characterize these resources. For a quick look at all of this system's screenshots, see its screenshots folder. This folder and its subfolders also have newer thumbnail index pages displayed automatically on a server; click the "index.html" files along the way when viewing these offline.

The latest GUI captures:
Version 2.4 on Windows 7. A new switch and toggle suppress arguably-superfluous per-file backup messages.

Recent GUI captures:
Version 2.2 on Windows 7 and Windows 10. In the latter, 2.2 compares trees 10X faster with Python 3.5 (right) than 3.4 (left).

Primary GUI behavior—mostly version 2.0 on Windows 7, but unchanged in versions 2.1 and 2.2:
main window | with changes | folder browse | confirm run | report only | finish run | main+log+help+quit

Additional GUI modes and examples:
console mode | selective mode | minimal widgets | delete retries | path errors | cancel run | version 1.7

Example logfiles and sessions (all but one run in version 2.1):
2.2 synch/backup logfile | synch+restore session | rollback.py session | synch+restore stress test

An example use case:
mergeall is used regularly to sync a large data set maintained redundantly on a Windows 7 Ultrabook and a Windows 8.1 Tablet, via a USB (or other) drive intermediary. To make files, folders, and sizes match in Properties this way, select all of an archive except its unique __bkp__ folder, described in the next section. The minor cross-machine size difference reflects mergeall's own bytecode files nested in the archive.

Linux portability:
mergeall is also used to back up archives to USB flash and external hard drives, and mirror them to a Linux (Gnome) desktop.

For more examples:
There are additional screenshots and example run logfiles and sessions in the package's examples folders.

The following sections explain how all these examples work.

 

Recent Upgrades

This section documents the most noteworthy enhancements made to the system in recent releases. It currently covers releases 2.0, 2.1, and 2.2.

 

Release 2.0: Automatic Backups for Changes

Version 2.0 adds an automatic backups option for changes. When enabled, this option makes backup copies of all files and directories in the destination directory that will be destructively replaced or removed in-place during a mergeall synchronization run, and notes new items added. This makes mergeall runs generally safer, as unwanted or failed changes can be later undone by restoring backup copies.

Specifically, in both the automatic and selective updates modes described ahead, the prior versions of items about to be changed in the destination (TO) tree are saved in an automatically-created __bkp__ folder. This folder resides at the top of the destination archive; has one date/time-stamped subfolder for each mergeall run with backups; and recreates the full directory paths of items stored within it. Backup folders are local to an archive copy and not synchronized across trees by mergeall, but their per-run subfolders are automatically pruned by age when their number exceeds a changeable limit. As of version 2.1, new additions are also listed in __added__.txt files in __bkp__ run subfolders; see the next section.

For instance, an archive copy rooted at D:\MY-STUFF will have backup data of the following form after serving as the TO folder for a mergeall run with backups enabled (specific date/time values in folder names allow for by-name sorts):

    D:\MY-STUFF\__bkp__                                          # all backups for this copy
    D:\MY-STUFF\__bkp__\dateyymmdd-timehhmmss                    # a run's subfolder
    D:\MY-STUFF\__bkp__\dateyymmdd-timehhmmss\__added__.txt      # items added, by pathname (2.1)
    D:\MY-STUFF\__bkp__\dateyymmdd-timehhmmss\subfolders         # items removed and replaced 
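As a rough illustration of the naming scheme above, a run's subfolder path could be built with the standard time module. The helper name backup_run_folder is an assumption for this sketch; only the __bkp__ folder and the dateyymmdd-timehhmmss pattern come from this guide.

```python
import os
import time

def backup_run_folder(archive_root, when=None):
    """Hypothetical helper: path of one run's subfolder under __bkp__.

    when is a time.struct_time; the current local time is used if omitted.
    """
    when = time.localtime() if when is None else when
    stamp = time.strftime('date%y%m%d-time%H%M%S', when)
    return os.path.join(archive_root, '__bkp__', stamp)
```

Because every field in the name is fixed-width and ordered from year down to second, a plain sort of these names is also a chronological sort.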

Backups are particularly useful when using mergeall with a common base device to synchronize changes between multiple computers, as backups for changes are maintained independently on both the base device and each target machine. For example, if you configure mergeall to save 15 backups, each archive copy's __bkp__ can contain the 15 most-recent backup copies of frequently-changed files—one for each backups-enabled mergeall run in which the archive copy was the TO destination.
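Pruning per-run subfolders once their number exceeds a limit might look like the following sketch. The function name prune_backups and its max_backups parameter are hypothetical, and the default of 15 simply echoes the example above; see backup.py for what the system really does.

```python
import os
import shutil

def prune_backups(bkp_root, max_backups=15):
    """Hypothetical helper: keep only the newest max_backups run subfolders.

    Relies on the date/time folder names sorting oldest-first by name.
    Returns the list of subfolder names that were removed.
    """
    runs = sorted(name for name in os.listdir(bkp_root)
                  if name.startswith('date')
                  and os.path.isdir(os.path.join(bkp_root, name)))
    excess = runs[:max(0, len(runs) - max_backups)]
    for name in excess:
        shutil.rmtree(os.path.join(bkp_root, name))
    return excess
```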

Backups also serve as a record of changes made that is an alternative to the logfile, and perhaps more easily inspected, and are required for version 2.1 restores (see the next section). The only downsides to change backups are that they take up extra space, and may slow the merge's resolution phase for extra copies; these penalties are incurred only for items recently changed, though, and are generally far outweighed by the extra data safety that backups provide.

To enable backups for changes, use either the new -backup command-line argument in mergeall itself, or the corresponding widgets and replies in its GUI and console launchers. When this option is used, if you ever need to restore prior versions of files, you can choose from the __bkp__ folders of any of the latest backup-enabled mergeall runs, on any of your archive copies. You can also change backup folders arbitrarily (e.g., deleting if too large), and can generally ignore any diffall.py differences generated by their per-run subfolders. See backup.py for implementation details, and Readme.html for more on this and other 2.0 changes.

Usage update: as of version 2.4, a new "-quiet" script flag (and corresponding toggles in the GUI and replies in the console launchers) suppresses per-file backup log messages for space. See version 2.4 release notes for details.

 

Release 2.1: Automatic Restores from Backups

Version 2.0's automatic change backups described in the preceding section were intended to allow one or a small group of files or folders to be restored by manual copies or nested subfolder merges, in the unlikely event that some mergeall changes went askew or were unwanted. This suffices for most cases, but doesn't help much if there are very many changes to back out. If, for example, a user inadvertently swaps the FROM and TO folders when using an old backup—a worst case user error—there may be hundreds or thousands of changes to undo, and the entire run should be backed out.

mergeall could formerly almost handle this itself by merging from an archive's backup folder to the archive's root, except that the merge normally deletes items unique to the TO tree—which would include all content not changed and hence not recorded in the backup. Merging the other way, from root to backup, won't help either, as the merge simply mirrors source to destination, and would thus erase saved prior versions in the backup tree. Moreover, neither merge would do anything about backing out new items added to the TO tree, as they are not recorded in backup folders.

Version 2.1 addresses this very rare case with full restores (a.k.a. rollbacks) that take the form of a merge from backup folder (FROM) to archive root (TO). To make this work, it enhances the 2.0 -backup command-line option available in all launch modes, and adds a new -restore option available in the main mergeall.py script only:

In automatic updates mode (described ahead), the combined net effect is a complete rollback of all changes made in a preceding run—restoring all items replaced or removed, and deleting all items added. To back out all the changes made by a prior run with backups enabled, simply locate your archive's most recent backup by its date/time name in the archive's __bkp__ folder, and run mergeall with the archive root as TO, and the latest backup's subfolder as FROM, with a command line of one of the following forms:

    mergeall.py archiveroot\__bkp__\dateyymmdd-timehhmmss archiveroot -auto -restore    # automatic rollback (see ahead)
    mergeall.py archiveroot\__bkp__\dateyymmdd-timehhmmss archiveroot -restore          # selective rollback (see ahead)

Run this in mergeall's source directory, or give the script's full path. You can also delete the __added__.txt file in __bkp__ first if you wish to back out only replacements or removals, and may use older backups (though they are best used with the selective updates mode described in the next section, as they may be arbitrarily out of synch with the current tree). Here's a more concrete example for reference and cut-paste-edit, backing out a run from a USB drive and saving the log:

    mergeall.py D:\MY-STUFF\__bkp__\date150325-time165817 D:\MY-STUFF -auto -restore > C:\...\Desktop\logfile.txt

If you're not a fan of complicated command lines, version 2.1 also includes a convenience script—rollback.py—that builds and runs an automatic updates mode restore command line, by globbing and sorting to find the archive's latest backups folder automatically. This script also verifies the run and its inputs for safety. Run it with just the root path, or with no arguments to be asked for the root interactively:

    rollback.py archiveroot        # convenience script, one argument or input
    rollback.py                    # input root path interactively: command line or click 
    rollback.py > logfile.txt      # save mergeall output (only) to a logfile 
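The globbing-and-sorting idea behind the convenience script can be sketched as below. This is not rollback.py's own code, and it omits the script's safety checks; the names latest_backup and restore_command are assumptions for the example. The command built follows the automatic rollback form shown earlier.

```python
import glob
import os

def latest_backup(archive_root):
    """Hypothetical helper: newest run subfolder in __bkp__, or None."""
    pattern = os.path.join(archive_root, '__bkp__', 'date*-time*')
    runs = sorted(path for path in glob.glob(pattern) if os.path.isdir(path))
    return runs[-1] if runs else None

def restore_command(archive_root):
    """Hypothetical helper: argument list for an automatic rollback run."""
    backup = latest_backup(archive_root)
    if backup is None:
        raise ValueError('no backups found in ' + archive_root)
    return ['mergeall.py', backup, archive_root, '-auto', '-restore']
```

The sort works here only because the date/time folder names order chronologically by name.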

You can also click this script's filename or icon to run it on Windows (like this) and skip the command line altogether. However, you may still want to use command lines to save mergeall's output to a logfile with ">" (the script's interactive prompts go to the console only), or to use selective updates mode during the restore (this requires a manual command line). See this script's example session for more details.

However they are invoked, restores generally assume that:

  1. You used -backup in the prior run. This is a requirement for restores in all usage modes; without backups, there is nothing to restore.

  2. You have not made additional changes to the tree since the run you're rolling back. Restoring after any additional changes are made in TO won't fully reset the tree's prior state—and may erase more recent work in automatic updates mode. Devices used only for backups or transfers, however, may retain their restorability indefinitely.

If your tree meets those criteria, there are three additional usage notes to be aware of:

Finally, keep in mind that -restore is just a failsafe, designed primarily to be used immediately after a run you wish to undo. Luckily, you will probably never need it if you use mergeall with a normal amount of care. As a guideline, you should generally be cautious with FROM/TO selection to avoid having to restore, and should ideally run with -report before -auto to see what will be changed. Backups of changes in __bkp__ are still intended mainly for manual piecemeal restores, though its new __added__.txt also serves as additional run documentation.

For an example of the restore option at work that demonstrates its general usage, see this example session and others in 2.1's examples folder. For an example __bkp__ folder with its new __added__.txt file, see this backups folder in the shipped test folders (if it's still present in your copy). For backup and restore implementation-level details, see Readme.html's change notes, and backup.py where most of the code resides.

Usage update: restores should generally be run on the same platform as the prior mergeall, for reasons detailed in the Readme.html note.

 

Release 2.2: Faster Execution via Python 3.5's os.scandir()

Version 2.2 is a performance optimization. As of this version, mergeall uses Python 3.5's os.scandir(), if available, to speed up tree comparisons radically. This new function eliminates system calls for some attributes of files, and is available both as a standard library tool in 3.5, and a PyPI package install. When this function is present, the mergeall comparison phase uses a custom implementation that leverages the new call instead of os.listdir(), which is retained to support older Pythons.

So how much faster is mergeall with the new call? Timing results on Windows show that the new function speeds tree comparisons—a major time component in most mergeall runs—by a factor of 5 to 10 depending on devices. This can shave dozens of seconds off total mergeall runtime for larger trees, and perhaps more. For an example use-case archive that's now 78G and has 50k files in 3k folders:

With a different archive, comparisons clocked in at 10x faster on a Windows 10 tablet, as captured here—with Python 3.4 on the left and 3.5 on the right (run with command lines of the form "py -3.X launch-mergeall-GUI.pyw" to test specific 3.X's on your own).

This speed gain is fully automatic, but requires that a scandir() be present. You can satisfy this requirement and take advantage of the mergeall 2.2 optimization by either:

  1. Running mergeall and its GUI with Python 3.5 or newer, where scandir() is standard in the os module
  2. Installing the PyPI package version of scandir() that supplies the call for older Pythons, including 2.7 and older 3.X

Because the current PyPI package version of scandir() requires a C code compile for most Pythons, using Python 3.5+ (option #1) may be the simplest way to speed your merges, but watch the PyPI package for new developments on this front. You can use Python 3.5 for mergeall without breaking programs that rely on older Pythons; see this note for pointers.

If you do neither of the options listed above, mergeall falls back on the original os.listdir() scheme so that it still runs on older Pythons, albeit with the original and slower speed. Given its potentially major speed boost, though, a scandir() is now recommended for most mergeall users who manage non-trivial trees.
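The fallback logic described here can be approximated as follows. This is a minimal sketch rather than mergeall's comparison code: the list_entries function is invented for illustration, though os.scandir(), the PyPI scandir package, and os.listdir() are the real calls named above.

```python
import os

try:
    scandir = os.scandir                # standard in Python 3.5 and later
except AttributeError:
    try:
        from scandir import scandir     # PyPI package for older Pythons
    except ImportError:
        scandir = None                  # neither present: use os.listdir()

def list_entries(dirpath):
    """Hypothetical helper: sorted (name, is_dir) pairs for one folder."""
    if scandir is not None:
        # Entry objects can often answer is_dir() without another system call
        entries = [(entry.name, entry.is_dir()) for entry in scandir(dirpath)]
    else:
        # Original scheme: one extra call per name to learn its type
        entries = [(name, os.path.isdir(os.path.join(dirpath, name)))
                   for name in os.listdir(dirpath)]
    return sorted(entries)
```

Either branch returns the same result; only the number of system calls behind it differs.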

For more about this change, see Readme.html's version 2.2 notes, including its list of related links.

 

Usage Modes Guide

Feel free to use and modify this system as you wish, but this section provides some pointers on its intended roles. If you're looking for quick advice, skip ahead to the recommended usage modes below.

 

What mergeall Does

In general, this system is designed to make an entire destination (TO) directory tree the same as a source (FROM) directory tree much more quickly than brute-force copies. It achieves this by first scanning both trees to detect differences (using structural inspection for folders and primarily modification times for files), and then running the following updates in this order:

  1. Differing same-named files are copied from FROM to TO.
  2. Unique items (files and folders) in TO are removed from TO.
  3. Unique items (files and folders) in FROM are copied to TO.
  4. Mixed-mode same-named items (file or folder) are replaced in TO by their FROM version.
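The four categories in this list can be illustrated for a single folder level with the sketch below. It is not mergeall's comparetrees(); the function name classify_level and the dictionary keys are assumptions for this example. A full comparison would also recurse into the common folders and test the common files by timestamp and size.

```python
import os

def classify_level(dir_from, dir_to):
    """Hypothetical helper: sort one folder level's names into categories."""
    names_from = set(os.listdir(dir_from))
    names_to = set(os.listdir(dir_to))
    plan = {'unique_from': sorted(names_from - names_to),   # copy to TO
            'unique_to': sorted(names_to - names_from),     # remove from TO
            'common_files': [],    # candidates for the timestamp/size test
            'common_dirs': [],     # folders to descend into
            'mixed': []}           # file in one tree, folder in the other
    for name in sorted(names_from & names_to):
        from_is_dir = os.path.isdir(os.path.join(dir_from, name))
        to_is_dir = os.path.isdir(os.path.join(dir_to, name))
        if from_is_dir and to_is_dir:
            plan['common_dirs'].append(name)
        elif from_is_dir or to_is_dir:
            plan['mixed'].append(name)
        else:
            plan['common_files'].append(name)
    return plan
```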

Along the way, changes are backed up as described earlier. The net result mirrors the FROM tree to TO. How you utilize this tool, though, involves choices between automatic and selective update modes, and common base devices or direct transfers. The following sections define these terms and explore intended patterns of usage.

 

A Few Definitions

To describe usage patterns, we need to first define some terms:

Common base device

This refers to a storage device (a USB flashdrive, local network drive, or other) that will be required in most cases to serve as an intermediary between different computers. Data is uploaded to the common base, and from there downloaded to other devices to synchronize them. Besides supporting such indirect transfers, a common base device also serves as a backup copy.

Direct transfer

This means a data transfer between trees performed without a common base—possible for synchronizing folders on the same machine, and for some types of device interfaces. A device that appears as a drive when connected by USB, for instance, allows for direct transfers.

Automatic updates mode

This is a mergeall option which automatically resolves tree differences without user intervention. This mode automatically applies all data-set changes from one tree or device to another, making a destination folder the same as a source. This can be useful both for quick backups, and for synchronizing multiple trees. Because this mode mirrors one whole tree to another, it generally requires that you work in only one archive or subfolder copy at a time, to avoid erasing another copy's changes on later full tree merges. In exchange, this mode provides the simplest and least error-prone option.

Automatic updates mode can be invoked in manual mergeall command lines, the console launcher, or the GUI launcher—which by design supports this updates mode only, along with reports. Choose this mode by using -auto (and omitting -report) in command lines, or by using inputs and widgets in the launchers.

Selective updates mode

This is a mergeall option which asks the interactive console user to approve or skip each individual file or folder update. Though more user-intensive and error-prone, this mode allows you to work in multiple trees simultaneously, and reconcile their changes in a more ad-hoc file-by-file fashion. It still requires that change sets be disjoint, to avoid changing files already changed in other trees but not yet synchronized. Unlike automatic updates mode, though, it can incorporate multiple and arbitrary change sets without having to treat entire archives or subfolders as locked while one copy is modified, albeit at a substantial cost in user interaction requirements.

Selective updates mode can be invoked in manual mergeall command lines, or the console launcher—whose selective mode interaction is captured here. Choose this mode by omitting both -auto and -report in command lines, or by using launcher inputs.
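How the two flags select a mode can be summarized in a short sketch. The function update_mode is hypothetical and is not how mergeall.py parses its arguments; in particular, giving -report precedence when both flags appear is an assumption made here, since this guide describes only the cases where one or neither is present.

```python
def update_mode(args):
    """Hypothetical helper: name the mode implied by command-line flags."""
    if '-report' in args:
        return 'report'        # assumed to win if -auto is also given
    if '-auto' in args:
        return 'automatic'     # resolve all differences without asking
    return 'selective'         # neither flag: ask about each update
```

For example, update_mode(['from', 'to', '-auto', '-backup']) returns 'automatic', while update_mode(['from', 'to']) returns 'selective'.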

 

How to Use the System

With the preceding definitions in mind, this section describes usage patterns: approaches to using the mergeall system to manage your data. Some use automatic updates, some selective, and some may use both. All may be applied with or without a common base. One note up front: in the first two sections that follow, the notion of "tree" generally refers to a whole archive copy, but this need not always be so; the third section on subfolders will tighten up this concept.

 

Working in One Archive Copy at a Time

This system was designed in part for automatic merging of all data-set changes from one tree or device to another—the automatic updates mode described earlier. This mode can be useful both for general backups, and for synchronizing multiple trees. As a backup tool, for example, merging changes to a device in automatic updates mode will change only items modified since the prior backup.

In its synchronization role, because automatic updates mode makes the destination tree a mirror copy of the source, it works best if you're careful to make changes in only one copy at a time. When you want to work in another tree copy or device, first synchronize to propagate changes as follows:

  1. When using a common base, run mergeall to automatically upload changes from the changed tree to the base; then run again to automatically download the changes from the base to other trees to synchronize.

  2. When no common base is used, the upload step effectively goes away, as you'll run mergeall to automatically transfer the changed tree's updates to other trees directly.

This might become a frequent task if you work in multiple trees or devices often, but the merge steps run quickly (see the example run times above), and can use a simple USB stick or shared network drive as a common base. Moreover, you need to perform synchronization runs only when an entire batch of changes is ready for transfer, not on changes to each individual file. For example, in mode #1, when you're ready to make changes on a different device, simply run mergeall to upload from the last active device and download to the next active device; in between these transfers, no synchronization tasks are required.

The chief downside of this approach is that it requires some discipline to follow properly. Automatic updates mode assumes the destination tree should mirror the source tree exactly and in full, no matter how the destination may have changed. If you change multiple copies without synchronizing, automatically uploading or downloading from one tree may overwrite and thus erase the changes made in other trees. This can occur even if the trees' change sets are disjoint, because automatic merges work on a whole-tree basis. All changes, additions, deletions, and renames made in one tree are propagated to other copies, regardless of other trees' states.

That is, automatic updates works well and is simple to use, but requires some procedural diligence to avoid losing prior changes if multiple trees are modified but not synchronized between uploads. Specifically: You must treat the entire tree or device you're working in as the effectively "locked" copy until its changes are propagated; other copies must be treated as "read-only" until they incorporate the locked tree's updates by mergeall runs. Because mergeall makes synchronizations relatively quick and easy, though, this isn't necessarily more difficult than interfacing with cloud services on changes, and need not be run at all until switching active devices or propagating data for viewing (there's more on clouds in the wrap-up).

 

Working in Multiple Archive Copies at the Same Time

This system also has a selective updates mode described earlier, which allows you to choose updates to be applied. This mode supports working in multiple trees or devices simultaneously, and combining the changes made in them since the latest synchronization step on a file-by-file basis. Unlike automatic updates mode, it can incorporate multiple and arbitrary disjoint change sets without having to treat an entire tree as locked while any one copy is modified. At the same time, it also requires much more user interaction, and is much more prone to user error.

Like its automatic relative, selective updates mode can be used with or without a common base, and supports a variety of usage patterns. To reconcile two changed trees, do the following (and generalize these procedures for more than two trees):

  3. When using a common base, run once to selectively upload just the first tree's changes to the base; run again to selectively do the same for the second tree; and then run again to download the resulting combined base to each tree (perhaps in automatic updates mode, as the base should have both change sets).

  4. With no common base, simply run mergeall twice: once to selectively merge just the first tree's changes to the second; and once more with swapped from/to roles to merge just the second tree's changes to the first (perhaps via automatic updates mode, as only the second tree's changes should remain as differences).

To broadcast just one tree's changes to multiple possibly-changed trees:

  5. When using a common base, run once to selectively upload just the changed tree's changes to the base; then run again to selectively download just those changes to each other tree.

  6. With no common base, run mergeall to selectively transfer just the changed tree's changes into each other tree.

With a common base, you can also defer downloading changes, but this seems a recipe for disaster. Merges will grow more complex over time, as the base will grow more and more different from individual copies. Synchronizing from the base immediately when changes are integrated will minimize the risk of accidentally losing its changes in later mergeall runs, or changing a file already changed in another tree but not yet synchronized—a worst case scenario for shared data sets, and a state this more piecemeal mode seems likely to foster.

In other words, selective usage patterns require some diligence too, to integrate changes before trees grow too out of synch to reconcile. In fact, you still must treat the entire set of all modified files in any tree as "locked" until they are transferred to other trees or devices; other copies of these files should be "read-only" till synchronized. This constraint doesn't apply to an entire tree (as it does in automatic updates mode), but it's an inherent consequence of working in multiple copies simultaneously. Selective updates ultimately trade procedure for reliance on user memory—you don't have to restrict edits to one tree copy at a time, but you do have to keep track of which files in which tree are current.

Selective updates also requires careful choice of updates to apply, and is manual to be sure, but reconciling two arbitrarily disparate trees by nature requires some sort of manual human intervention. See file peer-to-peer-merge-run.txt in this system's examples/_older/other folder for a console log of a peer-level merge, performed by running mergeall twice as suggested by mode #4 above. This may be useful in limited contexts, but seems too manual to be a primary synchronization technique.

Sidebar: Selective Updates Alternatives

A future variant of this script could support the preceding's peer merges more directly instead of requiring multiple runs—by asking which version of changed files to use, and whether unique items in either tree should be copied over or pruned—but awaits some end-user experience. It's not clear whether this would be less or more confusing than separate one-way runs, and the merit of selective-mode usage in general remains to be shown. On the other hand, a direct peer merge would avoid analyzing differences twice. To try this extension as an exercise yourself, see mergeall.py's reusable comparetrees(), which already does half the work.

An automatic peer-to-peer merge, however, is impossible; without user input, it could not choose from differing same-named files, and could produce only the union of two trees' unique items in response to deletions or renames. A merge could, perhaps as an option, use the newest version whenever two same-named files differ, regardless of which tree it belongs to. This would pick up the latest changes, but was not pursued as it seems highly prone to error—it makes the extreme assumption that any change in any copy should invalidate all others, regardless of divergence since the last merge. It's also unclear in this scheme which tree to prefer for unique items (are they deletions or additions?). A more manual selective approach that asks the user about each difference seems more rational and safe.
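
To make the "newest version wins" option above concrete, here is a minimal Python sketch. It is not part of mergeall: the name pick_newest is hypothetical, and it assumes modification times are directly comparable across both trees, which is not always the case.

```python
import os

def pick_newest(path_a, path_b):
    """Return whichever of two same-named files has the later modtime.

    Hypothetical helper for illustration only; not part of mergeall.
    Ties go to path_a, which is itself an arbitrary choice.
    """
    mtime_a = os.path.getmtime(path_a)
    mtime_b = os.path.getmtime(path_b)
    return path_a if mtime_a >= mtime_b else path_b
```

Note what the function cannot know: whether the older file holds edits that the newer one lacks. That missing information is why this policy was judged error-prone.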

 

Working in Multiple Subfolders at the Same Time

So far, we've seen how to apply automatic updates to work in one archive tree copy at a time, and selective updates to work in multiple trees simultaneously, but the automatic/selective dichotomy isn't quite as orthogonal as this may imply, and other schemes are possible.

For example, although mergeall makes an entire tree the same as another, this doesn't necessarily have to include every piece of data you've accumulated since the dawn of digital time. It's always possible to use automatic updates to synchronize just selected subfolders nested within an archive—rather than the whole archive tree—to and from a common base (or to another copy directly). This has some advantages, but they come with cautions:

In fact, this scheme is essentially the same as the preceding section's topic—with subfolders representing change sets, and automatic updates on subfolders replacing selective updates on a broader tree. To synchronize, simply use the prior section's modes #3 through #6 with these translations, and be sure to treat the currently-active subfolder copy as "locked" and all others as "read-only" just as for the prior section's more arbitrary change sets.

Subfolders have an advantage over the prior section's approach: parallel changes are easier to manage when limited to specific tree locales, and automatic updates mode is much easier than selective updates from a user's perspective. However, subfolder synchronization also comes with most of the same burdens and dangers: unlike full-tree approaches, keeping track of tree changes still becomes more your task than the system's.

Hence, selected subfolder synching is not generally recommended, except in limited cases. You're more likely to keep trees in synch if your automatic updates are made on a whole-archive basis, and you restrict your edits to one full-tree copy at a time. That said, your merges will run faster if you organize your data wisely, with rarely-changed files in archive trees that need rarely be merged. If you do wish to make changes on different machines in parallel, though, you'll have to exercise some caution to avoid losing changes.

 

Recommended Usage Modes

Because of the complexities—and perils—of both selective updates mode and changing multiple trees simultaneously, mergeall's automatic updates mode and usage patterns #1 or #2 listed in the preceding section are generally recommended for most users (and frankly, have been the only techniques used in practice by the system's original developer). See the earlier example use case for screenshots of this approach in action with a common base. To summarize the model:

For data sets shared by multiple devices:

When changes are made on one device, run mergeall's automatic updates mode to upload them to a common base device, and run mergeall's automatic updates mode again to download them from the base to other devices when needed. You can make changes on just one device at a time, but need to synchronize this way only when switching to another device for edits, or propagating current data for viewing on other devices.

For data sets shared by multiple folders on the same device:

When changes are made in one folder, run mergeall's automatic updates mode to transfer them to other folders directly when needed. You can make changes in just one folder copy at a time, but need to synchronize this way only when switching to another folder for edits, or propagating current data for viewing in other folders.

Although you can apply these procedures to any subfolder nested in an archive's directory tree, it's generally simpler and recommended to run them on a whole archive. That way, mergeall is responsible for locating changes anywhere in the tree; for most real-world usage, this is much easier than keeping track of them yourself.

For multiple devices, this model is essentially a manual emulation of some cloud storage interfaces, where mergeall runs replace network transactions to and from a cloud server, and a local device used as the common base replaces remote cloud storage. Especially when augmented by version 2.0's automatic backup of items changed on each device, the common base's role becomes functionally very similar to many cloud services (again, more on clouds ahead).

The recommended automatic usage modes listed above offer the simplest and least error-prone solution, where their procedural requirements can be met. If you really must work on disjoint file sets in multiple trees or devices at the same time, though, be sure to synchronize regularly to avoid version skew—transfer your changes to other tree copies as soon as possible (if not immediately), per modes #3 through #6 above. You can perform these transfers with automatic updates mode if your changes are isolated in disjoint subfolders, but must use selective updates mode if they are more haphazard.

Put more strongly, version 2.0's automatic backup of changes helps protect your data in both automatic and selective updates modes, and 2.1's rollbacks provide a failsafe for catastrophic mistakes, but there's nothing mergeall can do if the same file is changed in two trees without synchronizing—a case that seems more likely when using a more sporadic simultaneous changes model. Following the recommendations above is the simplest way to avoid this situation.

Because it's generally easier, automatic updates mode is the only updates mode supported by the mergeall GUI launcher—the recommended way to use this system for most users. Selective updates mode is available in both the console launcher and manual mergeall command lines, which are more powerful alternatives for more advanced use cases.

 

Other Usage Recommendations

Apart from the preceding section's usage pattern suggestions, a variety of general techniques can help make mergeall more effective for your data. Here's a quick rundown of additional usage suggestions:

Experiment with the GUI live:

There is no formal usage guide for mergeall's GUI, because it is simple enough to qualify as self-explanatory. The screenshots above give a static picture of the GUI, but your best bet may be to experiment with it live—open the GUI launcher script, launch-mergeall-GUI.pyw; select your FROM and TO folders; choose a report-only or auto-updates run; make your logfile and backups choices as appropriate for your run; and press the "GO" button at the GUI screen's bottom to start the mergeall process. mergeall output appears in the GUI's text area, and the GUI changes its structure to present only items relevant to the selections you make. Be sure to start out with report-only mode, and use a TO folder you don't mind changing in auto-updates mode.

Design your archives wisely:

As a rule, all files and media that you wish to be managed by mergeall should be saved in your mergeall archive tree (or trees), not in any platform-specific default folders; this requires some discipline, but allows for quick copies and backups. To make merges faster, store infrequently changed data in a different archive tree than data you typically change; that way, you can run mergeall on just the regular-changes tree and skip the rest. Decade-old photo collections, for example, are unlikely to change often enough to warrant regular mergeall inspection. On the other hand, any data that may change should be in a folder mergeall visits so that updates are propagated, and version 2.2's speed optimization can make comparisons much faster for larger trees. Also note that your archive trees must be no larger than the storage space of devices to which they will be propagated; split up your tree if it's too big for your external drives.
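
As a rough way to check the size constraint just described, the following Python sketch totals an archive tree and compares it to a device's capacity. It is an illustration only, not mergeall code; the names tree_size and fits_on_device are hypothetical, and filesystem overhead (cluster rounding, directory entries) is ignored.

```python
import os
import shutil

def tree_size(root):
    """Total size in bytes of all regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total

def fits_on_device(archive_root, device_path):
    """True if the archive's files would fit in the device's total capacity."""
    return tree_size(archive_root) <= shutil.disk_usage(device_path).total
```

Here device_path is any folder on the target drive. Because overhead is ignored, treat a near-miss as "too big" and split the tree.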

View reports before updating:

Especially when first using the system, it's a good idea to run it in report-only mode before running it to perform updates—automatic updates in particular. The report shows differences found and describes the changes that automatic updates would make, allowing you to preview and verify the plan. In command-line usage, this means run with -report before -auto; in the launchers, use inputs and widgets to report first.

Use automatic backups:

As of version 2.0, for data safety it is recommended to always use mergeall's automatic backups option for changes described earlier, in both automatic and selective updates modes. While not foolproof, this option allows unwanted or erroneous mergeall run changes to be backed out if needed. Because this helps protect your archives (which are your digital property), it's enabled by default in the GUI; don't disable it unless backup copies would be too large or slow for your devices. Backups are also required for the next bullet's restores.

Use automatic restores if needed:

Though primarily intended for piecemeal restores, the prior bullet's backups also allow for complete rollbacks of immediately preceding runs as of version 2.1, in unlikely but catastrophic scenarios (if you mix up FROM and TO folders, for example). For details, see the new restore option described earlier. You should generally be cautious with folder selection to avoid restores altogether, and full rollbacks should be very rarely required; restores provide a failsafe recovery option if ever needed.

Keep multiple copies:

To protect your data further, keep multiple archive copies, and rotate their mergeall updates by age (always merge to the oldest copy). This way, you'll have additional backups to fall back on in case of rare but catastrophic device failures.
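
One way to decide which copy is "oldest" is sketched below in Python. This is a heuristic and an assumption, not something mergeall does for you: it ranks each copy by the most recent file modification time found in its tree, and picks the copy whose most recent file is the least recent. The names newest_mtime and oldest_copy are hypothetical.

```python
import os

def newest_mtime(root):
    """Latest file modification time found anywhere under root (0.0 if none)."""
    latest = 0.0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                mtime = os.path.getmtime(os.path.join(dirpath, name))
            except OSError:        # unreadable item or broken link: skip it
                continue
            latest = max(latest, mtime)
    return latest

def oldest_copy(copies):
    """Return the archive copy whose newest file is the least recent."""
    return min(copies, key=newest_mtime)
```

A simple written log of merge dates per device works just as well, and avoids walking every copy.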

Run diffall.py:

For further archive fidelity, run the accompanying diffall.py script occasionally, to verify the integrity of archive copies by byte-for-byte comparisons. Unlike mergeall, diffall compares full file content instead of just file modification times, and so gives a slower but more complete proof of data equality. To run diffall, use mergeall's -verify command-line option or direct command lines; see manual-commands-cheat.txt for examples. For more on diffall, see its script's docstring, this 2.1 example session that uses it, and its -recent option documented in version 2.0 change notes in Readme.html. Also note that some differences are normal, including __bkp__ per-run subfolders used for change backups, and files changed trivially by Excel as discussed in Lessons-Learned.html and a version 1.4 usage note in Readme.html.
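
The difference between the two kinds of check can be sketched with Python's standard library. This is not the code of either script; same_by_stat and same_by_content are hypothetical names that only illustrate why a quick test on size and modtime can pass while a byte-for-byte test fails.

```python
import filecmp
import os

def same_by_stat(path_a, path_b):
    """Quick check: equal size and equal modtime (to the second)."""
    a, b = os.stat(path_a), os.stat(path_b)
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)

def same_by_content(path_a, path_b):
    """Slow check: read and compare the full content of both files."""
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Two files of the same length and modtime but different bytes satisfy the first function and fail the second, which is the kind of silent difference an occasional full comparison is meant to catch.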

Fix file permissions:

Some file permissions preclude mergeall updates. This includes read-only and hidden/system files; some may be copied over to a destination, but cannot be updated there on changes. As the system does not modify your files' permissions automatically (your files are your property), you may want to change these yourself if they register as errors and skips in the mergeall log. In-use file errors can be addressed by rerunning. See the related usage note in Readme.html for more details.
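
To locate candidates before they show up as errors in a log, a scan like the following Python sketch may help. It is an illustration under stated assumptions, not part of mergeall: read_only_files is a hypothetical name, and "read-only" is taken to mean that the owner write bit is off, which covers the Windows read-only attribute as Python reports it but not hidden/system flags or access-control lists.

```python
import os
import stat

def read_only_files(root):
    """Yield paths of files under root whose owner write bit is off."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.stat(path).st_mode & stat.S_IWUSR:
                yield path
```

The scan only reports; whether to change any permission remains your decision.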

Handle DST rollover:

If you use FAT devices (e.g., most flashdrives) on Windows, you'll probably want to adopt a policy for dealing with the 1-hour modtime skew that occurs at Daylight Saving Time (DST) rollovers. See the version 1.4 usage note in Readme.html for options (including the new script workaround in 2.0), and Lessons-Learned.html for additional context. This is easy to handle, but the default policy means that your FAT archive copies will be rewritten in full twice a year.
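
For a sense of what the skew looks like in code, here is a small Python sketch that tests whether two modtimes differ by roughly one hour. It is not mergeall's policy or its workaround script; differs_only_by_dst is a hypothetical name, and the default tolerance assumes FAT's 2-second timestamp granularity.

```python
def differs_only_by_dst(mtime_a, mtime_b, tolerance=2.0):
    """True if two modtimes (in seconds) are about one hour apart.

    Hypothetical helper for illustration; tolerance allows for FAT's
    2-second timestamp granularity.
    """
    return abs(abs(mtime_a - mtime_b) - 3600.0) <= tolerance
```

A test like this cannot distinguish DST skew from a genuine edit made an hour later, so it is better suited to reporting than to skipping updates automatically.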

Use shorter names and paths:

On most systems (including Windows), there is a limit on both filename and directory path length, above which mergeall updates may fail. To avoid this, try to avoid excessively long filenames and excessively deep directory trees. If items fail due to length, you may need to manually shorten or prune them, or move them closer to the archive root; saved web pages are notorious in this department. On the upside, mergeall can handle any filename that works on your platform (including those containing spaces and other odd characters on Windows), even though some may be difficult to use in other platforms' shells; and properly propagates mixed-case file renames, even on platforms whose filenames are case-insensitive (including Windows).
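
To find offenders ahead of time, a scan along these lines may be useful. It is a Python sketch rather than mergeall functionality: long_paths is a hypothetical name, and the default limit of 260 characters is the classic Windows path limit, used here only as an assumed threshold that you should adjust for your platform.

```python
import os

def long_paths(root, limit=260):
    """Yield (length, path) for items under root whose full path exceeds limit.

    limit=260 is an assumed default (classic Windows path limit).
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.abspath(os.path.join(dirpath, name))
            if len(path) > limit:
                yield len(path), path
```

Keep in mind that the length that matters is the one on the destination, so a longer TO root folder can push a path over the limit even if it fits in the FROM tree.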

Log to different drives:

Routing mergeall's output to a logfile that is located in the TO destination folder may cause mergeall to run substantially slower due to the extra writes, especially on flashdrives. Make sure your logfiles are routed to a different drive (e.g., on Windows, use C: for the log if D: is the TO destination tree).
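
A quick guard against the slow case is to check whether a planned logfile path falls inside the TO tree. The Python sketch below is an assumption-laden illustration, not part of mergeall: log_inside_tree is a hypothetical name, it relies on os.path.commonpath (Python 3.5 and later), and it detects only a log inside the TO folder, not a log elsewhere on the same drive.

```python
import os

def log_inside_tree(log_path, to_dir):
    """True if log_path is located inside the to_dir folder tree."""
    log_path = os.path.abspath(log_path)
    to_dir = os.path.abspath(to_dir)
    try:
        return os.path.commonpath([log_path, to_dir]) == to_dir
    except ValueError:          # paths on different Windows drives
        return False
```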

Set your shell Unicode type:

Be sure to set Python's Unicode environment variable PYTHONIOENCODING to UTF8 (or other) in your shell or Control Panel if you receive Unicode errors when scripts like mergeall.py attempt to print non-ASCII filenames on your platform. This manual setting is not required for the GUI launcher—it automatically sets and propagates this variable to its mergeall.py subprocess, and does not route text to a console (only to a GUI and bytes-mode logfile). However, this setting may be required for both the console launcher, and mergeall.py when run directly from a command line—because both print filenames to the console, visiting any file with a non-ASCII name may otherwise abort these scripts, especially in 3.X. For more on this variable, see PP4E or LP5E.
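
If you start scripts from your own Python code, the variable can be set for just the child process, in the same spirit as the GUI launcher. The sketch below is a generic illustration and not the launcher's actual code; run_with_utf8_io is a hypothetical name, and the child command here is a stand-in one-liner rather than mergeall.py.

```python
import os
import subprocess
import sys

def run_with_utf8_io(args):
    """Run a command with PYTHONIOENCODING=utf8 and return its stdout bytes."""
    env = dict(os.environ, PYTHONIOENCODING="utf8")
    return subprocess.check_output(args, env=env)

# Demonstration: a child Python prints a non-ASCII filename without error.
output = run_with_utf8_io([sys.executable, "-c", "print('caf\\u00e9.txt')"])
```

The returned bytes are UTF-8 encoded text, regardless of the console's default encoding.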

Consider using Python 3.X and 3.5+:

As described in the Readme's release notes, it's generally recommended that mergeall and its launchers be run under Python 3.X instead of 2.X for trees having many non-ASCII filenames, and under Python 3.5 or later for larger trees. Using 3.X avoids some minor display issues, and using 3.5+ allows mergeall to run much faster. Note that you should be able to install these Pythons without breaking programs that rely on prior releases; on Windows, install without filename associations, and run mergeall and its launchers from command lines instead of clicks (e.g., "py -3.5 launch-mergeall-GUI.pyw").

 

Limitations and Cautions

Though mergeall works as intended and continues to see regular action, it's not without the usual dark corners inherent in system-related tools. None of these are unique to mergeall; in fact, cloud providers and other backup systems must deal with many of the same issues. This section's subsections summarize the set, though, so you can be fully aware of issues that may crop up:

 

Same-File Differences

Keep in mind that change sets must be disjoint to reconcile two trees at all (e.g., working on only website files in one tree, and spreadsheet files in another). The recommended usage modes described earlier avoid this issue altogether by limiting changes to one copy at a time. If you change the same file in two or more trees without synchronizing, though, you'll have to select a single version, and may have to manually reconcile in-file changes.
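
Whether two change sets are disjoint can be checked before merging, provided you know when the trees were last synchronized. The following Python sketch assumes you record that time yourself (nothing in this guide says mergeall does); changed_since and overlapping_changes are hypothetical names, and change detection relies on modtimes alone.

```python
import os

def changed_since(root, last_sync_time):
    """Set of relative paths under root modified after last_sync_time."""
    changed = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_sync_time:
                changed.add(os.path.relpath(path, root))
    return changed

def overlapping_changes(tree_a, tree_b, last_sync_time):
    """Files changed in both trees since the last synchronization."""
    return changed_since(tree_a, last_sync_time) & changed_since(tree_b, last_sync_time)
```

An empty result means the change sets are disjoint by this measure; any path returned is a file for which you must pick one version or reconcile the contents by hand.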

This is a dilemma that source control systems aim to address, and for which some products may attempt to apply proprietary solutions for a limited number of file types, but it remains an unavoidable potential pitfall in the general case. Whether you merge local copies with this system or resort to a network cloud, you must still be careful to avoid changing the same file in multiple trees or devices without synchronizing the file to all copies after each set of changes, by either manual or automated transfers.

 

Other Limitations

Beyond basic usage models, this system also comes with some open issues and caveats, described in the docstrings at the top of its code files; search for TBD and CAVEAT for details. Among them:

For notes on unusual file types, see version 1.5. For more on device failures, see Lessons-Learned.html in the docs folder. For more on timestamp and filesystem issues, see Lessons-Learned.html; the CAVEATS section in the docstring of the main mergeall.py script; and the usage notes in the version history of the top-level Readme.html file. The latter of these also includes specific usage notes and workarounds, some of which are omitted here; among its coverage and solutions:

The good news is that update failures are not generally harmful—they produce error messages in the log, but simply leave a difference to be resolved either manually or on the next mergeall run. See also other recommendations earlier, for pointers on dealing with some of these limitations.

 

Use With Care

However you use this system, also keep in mind that it may change its destination tree in-place. Moreover, by default it does so without making backups of files or directories added, replaced, or deleted (see the next paragraph). This is all by design, to optimize speed and space requirements; after all, the goal of this system is to synchronize large trees faster than brute-force copies. If in doubt, though, please try it on a temporary copy first, and make manual backups as needed. There is also a more formal warning in the Readme.html file that you should read before use.

Update: for added data safety, see the 2.0 automatic backup for changes option described earlier. When enabled, this option mitigates some data loss risk by automatically saving all files and directories replaced or deleted in-place, and noting all files added. This allows you to back out changes if needed—either by manual piecemeal copies, or by version 2.1's complete rollbacks—and should generally always be used. However, this should still not be considered foolproof, given the many ways that storage devices can fail. See the preceding recommended usage and other recommendations if you haven't already, for more pointers on promoting archive integrity.

 

Manual Merges versus Cloud Storage

In closing, here are a few words on this system's purpose. As noted earlier, mergeall's recommended usage mode corresponds closely to cloud services, where program runs replace cloud server transfers, a local base device replaces remote cloud storage, and backed-up changes on each destination device can help protect your data.

Compared to some of the current claims of cloud storage providers, though, the recommended mergeall usage model may require extra manual steps to synchronize; its on-demand whole-archive resolution must be run only when needed, but it must be run. On the other hand, some cloud services come with interface tasks of their own, and may not be quite as automatic as their marketing may imply. More crucially, cloud servers are controlled, in most cases, by financially-interested third parties, on which your digital property becomes wholly dependent—a massive downside, and a primary motivation for starting this project.

To be clear, data stored on commercial and/or public clouds—including Google Drive, Dropbox, Microsoft's OneDrive (formerly SkyDrive), Apple's iCloud, and Amazon's Cloud Drive:

If that's not enough to raise a red flag or two, also keep in mind that clouds are not a panacea for all the issues inherent in data storage (despite the Orwellian language on some of their web sites). Multiple copies or devices raise difficult problems that require careful resolution under any regime. The more a cloud promises simple solutions, the less likely it is to deliver them.

This issue has grown more acute as the ongoing computer revolution has coaxed more of us to move important personal property to digital storage. This is convenient to be sure, but comes with substantial tradeoffs and risks. Uploading your photo libraries to a public cloud is no different than giving a shoebox full of them to a complete stranger you met on the bus, for safe keeping. Some such strangers may not only pass along your shoebox to others without getting your okay, they might just hold it for ransom in the future. If you wouldn't do this in the "real" world, why would you do so on the web?

Regardless of how you proceed, please be careful out there. Trusting your personal digital property to a third party is inherently perilous, especially when that party is laden with agendas. For better or worse, the computer industry at present seems to have no shortage of companies jockeying to establish points of control that can be used to squeeze nickels out of people with fewer nickels left to be squeezed. Random example: a company that abruptly adopts an advertising-or-subscription model for a game that had been freely available for over two decades may not have your best interest at heart (see Windows 8 Solitaire!).

Postscripts

May-17-14: Adobe's Creative Cloud goes offline for a day leaving subscribers in the dark, as reported here, here, and here.

Mar-13-15: Per the web, Apple, Amazon, Google, Microsoft, Dropbox, and Facebook aren't immune to service outages either.

Mar-18-15: For an example of how sensitive an issue cloud storage can be, see this controversy regarding a cloud provider.

Mar-28-15: Speaking of changing the rules after you've become dependent, see Amazon's change here and here.


© Mark Lutz