File: mergeall-products/unzipped/

# -*- coding: utf-8 -*-

   Normalize Unicode filenames for matching (part of the Mergeall system [3.3]).


This module defines tools to handle differing code-point representations of 
the same filename--an odd border case allowed under the Unicode standard.  If
these equivalent but different variants are not mapped (a.k.a. normalized) to
the same common form for comparisons, filename matching may fail, causing 
potentially lossy skew in syncs.  

To avoid this, matching now uses normalized filenames, but filesystem access 
still uses names' original, unnormalized forms.  This rule is applied in the
mergeall, deltas, and diffall scripts.  In addition, symlink paths require a
similar adjustment for compares; components in pathnames recorded from one 
device must be mapped to others; and the "-quiet" flag has been overloaded to
also suppress normalization messages in all three scripts.
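
As a minimal sketch of this rule (not Mergeall's actual code: the helper name
normalized_index and the sample listings are assumptions made for illustration,
and Python 3.X is assumed), names can be keyed by their NFC form for matching,
while their original spellings are kept for later file access:

```python
import unicodedata

def normalized_index(names):
    """Map each name's NFC form to the original name as listed."""
    return {unicodedata.normalize('NFC', name): name for name in names}

# Hypothetical listings: FROM stores the decomposed form, TO the composed form
names_from = [b'Lin\xcc\x83ux.png'.decode('utf-8'), 'notes.txt']
names_to   = [b'Li\xc3\xb1ux.png'.decode('utf-8'), 'notes.txt']

index_from = normalized_index(names_from)
index_to   = normalized_index(names_to)

# Match on normalized keys, but keep the original spellings for file access
common = sorted(set(index_from) & set(index_to))
pairs  = [(index_from[key], index_to[key]) for key in common]

for orig_from, orig_to in pairs:
    print(ascii(orig_from), '<=>', ascii(orig_to))
```

Both names pair up here even though the first pair's spellings differ; any
open() or remove() call would then use the spelling from its own side.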

This file largely documents this subtle issue, but also codes fixes. 


Name normalization is required because the Unicode standard has a massive hole: 
it allows the same text to be represented with different code-point strings.
This in turn requires equality to be generalized, to avoid issues when the same
string uses different Unicode forms across platforms or devices.  This impacts 
all tools that search or compare text, but can lead to content skew in sync 
tools like Mergeall if Unicode-variant filenames are present and not normalized.

As an example, the filename string 'Liñux.png' may take two equivalent but 
different decoded forms per the Unicode standard, which will not match by 
simple equality (e.g., '==' and 'in' in Python):

    >>> l = b'Lin\xcc\x83ux.png'.decode('utf-8')   # decomposed (NFD)
    >>> m = b'Li\xc3\xb1ux.png'.decode('utf-8')    # composed (NFC)
    >>> l, m
    ('Liñux.png', 'Liñux.png')     # equivalent text
    >>> len(l), len(m)             # len(code points)
    (10, 9)
    >>> l == m                     # same but different!
    False

The first of these uses the longer NFD format; notice its encoded 'n\x...'.
The second instead uses the shorter NFC form, with a dedicated ñ code point.
Because different tools and platforms may use either form, both may be 
present, which can wreak havoc with programs that process text or filenames.
To do better, programs must normalize to a common form prior to matches:

    >>> from unicodedata import normalize             # Python stdlib tool
    >>> normalize('NFC', l) == normalize('NFC', m)    # or 'NFD': any common
    True

Python's unicodedata module provides the required conversions to make text
agree.  For more background on this very dark Unicode corner, see:

This ultimately seems like a failure of the Unicode standard to fix a huge
interoperability problem, but is now accommodated in Mergeall 3.3 per above.


In terms of Mergeall specifically, this issue arose when applying a 
changes set to another device and platform via the Android Deltas Scripts 
package, which uses Mergeall as a nested tool.  See the package's docs here:

In that system, deltas are computed between a PC and proxy drive, and later 
applied on a phone from a copied zipfile.  Because filenames on the PC and 
proxy may be arbitrarily different from those on the phone, Unicode-variant 
skew is a potential whenever applying deltas between devices with differing 
policies.  Moreover, zipfiles may skirt name mapping used in other contexts.

While this issue materialized in delta sets, its risks prior to 3.3's 
normalization fix vary by the use case for which Mergeall is used:

Basic syncs:
    In basic runs for direct merges of full content copies, the worst outcome
    appears to be one-time pointless copies of folders.  Basic file changes work 
    in 3.2: skewed names will be classified as unique in FROM and TO, but the FROM
    version will still replace the TO via a delete and copy, instead of a replace.

    Folder names are more problematic: a FROM folder with a varying name may be 
    copied in its entirety to TO as a unique item, rather than being traversed 
    as a same-name match to a nested diff.  The net result is simply a single 
    pointless copy of a FROM folder to TO: because TO is also deleted as unique,
    FROM and TO will match thereafter.  Even so, this is subpar; although it does
    not damage content, it may still be costly for large folders.

    In -restore runs to rollback prior changes, there is little risk of damage. 
    Name-variant mismatches are not generally possible, because FROM backup sets
    are applied to the same TO content tree from which they were generated.
    Although a TO tree may be relocated to a different platform before its
    rollback, this seems unlikely, and would be equivalent to the next item.

Delta sets:
    In -restore runs to apply change sets created by the deltas script, the risk of
    content skew in 3.2 looms largest.  In this mode, FROM reflects both new 
    items, as well as changes to be applied later to matching TO names.  Unlike 
    full syncs, delta applications compare just deltas, not full content.  As a 
    result, when applying FROM delta sets, divergent Unicode names in TO will be 
    classified as unique and unchanged, and hence skipped rather than removed.

    The net effect will be duplicate files and folders for Unicode variants 
    in FROM: the FROM version will be copied over to TO, but the prior version
    in TO will linger as an out-of-sync copy.  Moreover, direct Mergeall 3.2 
    syncs that could repair the damage by removing the older and now-unique TOs 
    are not available on some platforms (e.g., Android 11+: see Android Deltas 
    Scripts).  Subtly, deltas' __added__.txt removal lists must also be adjusted
    for later TOs, because components of Unicode paths may differ arbitrarily.

This cropped up just once after nearly 8 years of Mergeall use, and may require 
a platform-specific context to appear at all (there's more on the failure below).
Nevertheless, Mergeall's developer does not use many non-ASCII filenames, and this
may be a larger concern for those who do.  By normalizing filenames for matching
in the mergeall, deltas, and diffall scripts, 3.3 avoids all the potential
3.2 issues listed above, and ensures data integrity more broadly.
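
To illustrate the classification problem described above, the following sketch
compares two hypothetical folder listings with and without normalization.  The
classify function and its sample names are illustrative assumptions, not code
from Mergeall, and Python 3.X is assumed:

```python
import unicodedata

def classify(names_from, names_to, normalize=False):
    """Return (unique_from, unique_to, common) as sorted lists of match keys."""
    def key(name):
        return unicodedata.normalize('NFC', name) if normalize else name
    keys_from = {key(name) for name in names_from}
    keys_to   = {key(name) for name in names_to}
    return (sorted(keys_from - keys_to),
            sorted(keys_to - keys_from),
            sorted(keys_from & keys_to))

nfd = b'Lin\xcc\x83ux.png'.decode('utf-8')    # decomposed form in FROM
nfc = b'Li\xc3\xb1ux.png'.decode('utf-8')     # composed form in TO

raw   = classify([nfd, 'a.txt'], [nfc, 'a.txt'])
fixed = classify([nfd, 'a.txt'], [nfc, 'a.txt'], normalize=True)

print([len(part) for part in raw])      # plain equality
print([len(part) for part in fixed])    # normalized keys
```

With plain equality, the variant name lands in both unique lists, which is
what leads to the pointless copies and lingering duplicates described above;
with normalized keys, it is classified as a common name instead.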

Fine print: Mergeall 3.3 uses Unicode's "canonical" equivalence (e.g., NFC) but 
not its "compatibility" (e.g., NFKC), because the program operates in the realm 
of engineering, not linguistics.  It makes sense to equate code-point sequences
defined to stand for the same character.  It does not make sense to guess human 
intent: syncs must be deterministic.  For more on this topic, see links above.
Also see Additional Notes ahead for more on 3.3's assumptions and choices.
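
The difference between the two kinds of equivalence can be seen in a short
sketch (the sample strings are chosen here for illustration, and Python 3.X is
assumed).  Canonical forms equate the two spellings of ñ, but leave a
typographic ligature alone; compatibility forms also fold the ligature into
plain letters, which is the sort of guess Mergeall 3.3 avoids:

```python
import unicodedata

def same(form, a, b):
    """Compare two strings after normalizing both to the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed   = 'Li\u00f1ux'     # precomposed n-with-tilde code point
decomposed = 'Lin\u0303ux'    # 'n' followed by a combining tilde
ligature   = '\ufb01le'       # starts with the U+FB01 "fi" ligature
plain      = 'file'

print(same('NFC',  composed, decomposed))    # canonical: equal
print(same('NFC',  ligature, plain))         # canonical: still different
print(same('NFKC', ligature, plain))         # compatibility: folded together
```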


Unicode variants actually caused a doppelgänger (duplicate) for a changed file 
with a non-ASCII name in the examples/unicode folder of the thumbspage program.
This file's name was the same, but
used only _decomposed_ form on both macOS and Android 11, and _composed_ form 
on Android 10--until a propagate via zipfile from macOS stored the decomposed 
form redundantly on Android 10 only.  Here's the post-propagate scene in the 
same content folder on all three platforms, as revealed by Python 3.X:

On macOS APFS internal drive (MacBook Pro, Catalina):

    >>> for f in glob.glob('frigcal-L*.png'):          # display encoded forms
    ...     print(f, '=>', f.encode('utf8'))           # just one name flavor
    frigcal-Liññux.png => b'frigcal-Lin\xcc\x83n\xcc\x83ux.png'

On Android 11 shared storage (Samsung Z Fold3):

    >>> for f in glob.glob('frigcal-L*.png'):          # same as macOS: decomp
    ...     print(f, '=>', f.encode('utf8'))
    frigcal-Liññux.png => b'frigcal-Lin\xcc\x83n\xcc\x83ux.png'

On Android 10 shared storage (Samsung Note20 Ultra):

    >>> for f in glob.glob('frigcal-L*.png'):          # same-but-different names
    ...     print(f, '=>', f.encode('utf8'))
    frigcal-Liññux.png => b'frigcal-Li\xc3\xb1\xc3\xb1ux.png'
    frigcal-Liññux.png => b'frigcal-Lin\xcc\x83n\xcc\x83ux.png'
    >>> for f in glob.glob('frigcal-L*.png'):
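    ...     print('-'.join(str(ord(c)) for c in f))    # decoded code points
    102-114-105-103-99-97-108-45-76-105-241-241-117-120-46-112-110-103
    102-114-105-103-99-97-108-45-76-105-110-771-110-771-117-120-46-112-110-103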

The same sort of doppelgängers were generated for filename 'pymailgui-spÄÄÄm.png'
and the thumbspage HTML files for each in multiple locations--yielding 21 dups
after the files in question were changed and synced via a deltas set.


More generally, all three of the last section's platforms differ in their 
handling of Unicode filename variants, and some automatically map one form 
to another.  This effectively requires normalization for interoperability. 

macOS, for example, allows either form, but stores just one, and auto-maps the 
other via name normalization to the form first stored (at least on the current 
APFS filesystem; HFS+ formerly enforced a decomposed-only rule):


    >>> os.chdir('/Users/me/Desktop/temp')
    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    AAA-Liñux => b'AAA-Li\xc3\xb1ux'

Android 11 shared storage (on a Samsung Z Fold3, at least) similarly 
maps names to a single stored form for writes, but fully disallows the 
composed form by itself in some contexts.  The latter seems an Android 
11 bug; the failing open() call below succeeds in app-specific storage 
(and on other platforms), and may fail only if both name forms are 
created and mapped together first as follows (the composed form has been
seen to work by itself otherwise).  This bug is also out of scope here, 
but see the Unicode converter note ahead for a work-around if needed:


    >>> os.chdir('/sdcard/temp')
    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

    # ANDROID 11 BUG!

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    FileNotFoundError: [Errno 2] No such file or directory: 'AAA-Liñux'

    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

Android 10 shared storage, by contrast, happily stores both forms on writes,
and doesn't perform any normalization mapping on its own (per a long story,
Android 11 swapped in a new FUSE-based implementation of shared storage to 
implement access restrictions, so all bets on 10/11 compatibility are off):


    >>> os.chdir('/sdcard/temp')
    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'
    AAA-Liñux => b'AAA-Li\xc3\xb1ux'

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    AAA-Liñux => b'AAA-Li\xc3\xb1ux'
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

And on all three devices, the internal code-point representation of a text
literal depends on what is used for the literal.  Hence, what is created
by an open() with a literal varies--what you paste is what you'll get:

    >>> len('AAA-Liñux'), 'AAA-Liñux'.encode('utf8')   # pasted NFD form
    (10, b'AAA-Lin\xcc\x83ux')
    >>> len('AAA-Liñux'), 'AAA-Liñux'.encode('utf8')   # pasted NFC form
    (9, b'AAA-Li\xc3\xb1ux')
    >>> open('AAA-Liñux', 'w').close()                 # varies per literal
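
Because pasted literals look identical on screen, it can help to test which
form a string is actually in.  The following is a small sketch under Python
3.X; the unicode_forms helper is a name made up for this illustration:

```python
import unicodedata

def unicode_forms(name):
    """List the canonical forms (NFC, NFD) that name already satisfies."""
    return [form for form in ('NFC', 'NFD')
            if unicodedata.normalize(form, name) == name]

composed_forms   = unicode_forms('AAA-Li\u00f1ux')     # dedicated code point
decomposed_forms = unicode_forms('AAA-Lin\u0303ux')    # 'n' + combining tilde
ascii_forms      = unicode_forms('AAA-Linux')          # pure ASCII text

print(composed_forms, decomposed_forms, ascii_forms)
```

A pure ASCII name satisfies both forms, and a name that mixes composed and
decomposed characters satisfies neither.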

In fact, it's even possible to generate mixed-format paths on macOS;
the first of the following has both decomposed and composed components,
and was generated by an 'echo' in the Bash shell:

    >>> for f in glob.glob('FROM/dir-Liñux/*'):  print(f, '=>', f.encode('utf8'))
    FROM/dir-Liñux/nested3-Liñux.txt => b'FROM/dir-Lin\xcc\x83ux/nested3-Li\xc3\xb1ux.txt'
    FROM/dir-Liñux/nested1-Liñux.txt => b'FROM/dir-Lin\xcc\x83ux/nested1-Lin\xcc\x83ux.txt'

Finally, from the anti-interoperability department, macOS's exFAT driver 
changes composed-form filenames to decomposed form on file writes.  This 
guarantees a perpetual diff in Mergeall 3.2 when composed names are copied
to exFAT (a later sync will remove the decomposed form--and then rewrite it 
again), and fully requires Mergeall 3.3 normalization to ignore the skew:

    >>> os.chdir('/Volumes/T7/_test')
    >>> for f in glob.glob('file3*'): os.remove(f)
    >>> len(b'\xc3\xb1'.decode('utf8'))
    1
    >>> open(b'file3-Li\xc3\xb1ux.txt'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('file3*'):
    ...     print(f, '=>', f.encode('utf8'))
    file3-Liñux.txt => b'file3-Lin\xcc\x83ux.txt'

For the record, Windows, Linux, Android 11 app-specific storage, and Pixel
Android 12 shared storage were not involved in the failure, but all four 
store both name forms just like Android 10 shared storage's results above.
Given the reach of these platforms, name mismatches may be common.  Also 
for the record, all devices tested are configured to use US English in 
their settings, though this seems unlikely to influence these results.

Some additional variables (e.g., FAT32 drives) were not tested for this 
analysis, because they are in some sense irrelevant: whatever its causes,
Unicode skew is a vendor-supported reality that must be accommodated.


In the observed failure, auto-mapping like that demoed in the preceding 
section may have masked filenames' Unicode discord until the forgery was 
taken out of the picture by zipfiles.  But this is clearly a much broader 
platform-, version-, and filesystem-specific nightmare.  Alas, rather than
removing this Babel, the Unicode standard seems to have simply enabled it.

It's less clear how the unique filename forms arose on Android 10 alone.  
The subject files have traversed many platforms and drives--including FAT32
and exFAT, which were both used to sync the Android 10 device in the recent 
past, and the fully decomposed HFS+, which was used on macOS devices in the 
further past.  Android shared storage's quasi emulation of FAT32 is always a 
prime suspect in such border cases too: Android 10's lack of normalization 
may have been a defect, and Android 11 may have been fixed (and broken!), 
but Android's frequent auto-updates make forensics nearly impossible.

Ultimately, though, a change to the files on macOS created the duplicate in
Mergeall 3.2: the macOS and Android 10 filenames did not match while applying
updates recorded in the deltas zipfile, and macOS's version was classified as
unique instead of changed--and thus copied redundantly alongside the original.  

While a direct sync in Mergeall 3.2 would remove the former TO variant and 
so repair this skew specifically, this is not a solution generally (large 
folder mismatches and their copies are still problematic in syncs); won't 
help for blatant skewers like macOS's exFAT driver (whose diffs are forever); 
and is not an available option on Android 11+ (see Android Deltas Scripts).

Mergeall 3.3's normalization is the only full fix for this problem.  By 
neutralizing filename skew in comparisons, 3.3 sidesteps an arguable failing
of the Unicode standard, at a small cost in backward-incompatible behavior.


Bytes API: 
    Passing bytes to Python file interfaces to avoid Unicode decoding won't 
    help here.  Filenames that differ are different in both their encoded 
    bytes format, and their decoded code-point format.  This is true 
    regardless of the encoding scheme used to convert to/from bytes.
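
The Bytes API note above can be checked with a short sketch (Python 3.X
assumed; the three encodings are arbitrary choices for the demo).  The two
forms of the same name differ as bytes in each encoding tried, and agree only
after normalization:

```python
import unicodedata

nfd = 'Lin\u0303ux.png'    # decomposed form
nfc = 'Li\u00f1ux.png'     # composed form

# Compare the raw encoded bytes of the two forms in several encodings
bytes_equal = {enc: nfd.encode(enc) == nfc.encode(enc)
               for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}
print(bytes_equal)

# Only normalizing the decoded text first makes the encoded bytes agree
normalized_equal = (unicodedata.normalize('NFC', nfd).encode('utf-8') ==
                    unicodedata.normalize('NFC', nfc).encode('utf-8'))
print(normalized_equal)
```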

Nonportable filenames: 
    Unicode skew is similar to the problem of nonportable filename
    characters, but the auto-adjustment here is safer.  Auto-mangling 
    nonportable characters for comparisons can lead to data loss when the 
    mangled form is the same as another filename coincidentally.  No known 
    use case requires both a composed and decomposed Unicode-format name.

Normalization assumption:
    To expand on the prior point, 3.3's Unicode normalization assumes that
    the same name is not intentionally present in multiple Unicode forms 
    in the same folder.  If two equivalent forms are present, syncs will 
    arbitrarily update either of them.  This is potentially lossy, but seems
    far less likely than forms differing across content copies inadvertently.

    Indeed, the Unicode standard defines differing forms to be canonically 
    equivalent, so their redundant presence is really an error state allowed
    by some filesystems.  It seems astronomically unlikely that such a state 
    would be created by design, but please report contrary use cases.
    Note: direct syncs, where possible, can remove doppelgängers, but only
    if you run the syncs with Mergeall 3.2; 3.3 will equate names instead. 
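
To check whether a folder violates this assumption, names can be grouped by
their normalized form.  The following is only a sketch: find_form_dups is a
hypothetical helper, not part of Mergeall, and the listing is hardcoded here,
where a real check would use something like os.listdir() (Python 3.X assumed):

```python
import unicodedata
from collections import defaultdict

def find_form_dups(names):
    """Group names by NFC form; return only groups with more than one name."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize('NFC', name)].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}

# Hypothetical listing holding both forms of one name, as seen on Android 10
listing = ['frigcal-Li\u00f1\u00f1ux.png',
           'frigcal-Lin\u0303n\u0303ux.png',
           'index.html']

dups = find_form_dups(listing)
for key, group in dups.items():
    print(ascii(key), '=>', [ascii(name) for name in group])
```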

Unicode converter:
    A new utility script which changes the Unicode format of all items in a 
    folder tree (e.g., to NFD) is available in Android Deltas Scripts here:

    This isn't shipped in Mergeall, as it's not required for normalization,
    but may be generally useful when propagating to Android 11 shared storage.
    Run it if needed to work around the bug illustrated above (see "BUG!").
    Unlike syncs, this utility script skips files with multiple name forms.
    Note: this script can also be used to display names not in a selected 
    format, and hence isolate any doppelgängers in your content.

Deltas contexts:
    The normalization code here runs twice for deltas: at creation and 
    apply times, which handle skew at the origin and destination, respectively.
    Nit: the resolution phase logic used for both deltas and rollbacks could
    be recoded separately (it simply deletes __added__.txt items and copies 
    over FROM items).  This would skip a pointless comparison phase, make run
    logs simpler, and eliminate some convolution for normal syncs, but would 
    also have to handle normalization and more separately and redundantly.

Change lists: 
    The "diffs" changes list must now record both FROM and TO names.  The
    original TO is fetched on demand in the comparison phase, but this won't
    work in the resolution phase because each tree level has its own
    originals dicts.  Hence, the changes-save code required mods too (even
    though it simply saves FROM items to a separate folder, whose content
    will be mapped here to a TO later).  Unique-item lists required no such
    structural change for names (they are FROM/TO specific), but must record
    unique names in unnormalized form for use in resolution.

Backups fix:
    backups.removeprioradds() required a related fix, because names saved 
    in __added__.txt at deltas-create time may not match unnormalized TOs at
    deltas-apply time.  As names may differ arbitrarily when applying a deltas 
    set to a different device, that function must inspect every component of 
    a saved pathname and adjust as needed.  For implementation details, see 
    this file's new matchUnicodePathnames() below.
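
For comparison only, here is a sketch of a different way to resolve such a
pathname: instead of swapping each failing component between NFC and NFD, it
looks each component up in the actual directory listing by normalized form.
This is not the approach coded in this file; find_actual_path and the sample
names are assumptions for illustration, and Python 3.X is assumed:

```python
import os, tempfile, unicodedata

def find_actual_path(root, relpath):
    """
    Walk relpath under root one component at a time, replacing each component
    with the listed name that has the same NFC form.  Returns the adjusted
    path, or None if any component has no match.
    """
    def nfc(text):
        return unicodedata.normalize('NFC', text)

    current = root
    for part in relpath.split(os.sep):
        if not os.path.isdir(current):
            return None
        matches = [name for name in os.listdir(current) if nfc(name) == nfc(part)]
        if not matches:
            return None
        # Prefer the exact spelling if it is listed, else the stored variant
        current = os.path.join(current, part if part in matches else matches[0])
    return current

with tempfile.TemporaryDirectory() as root:
    stored = os.path.join(root, 'dir-Lin\u0303ux')           # decomposed on disk
    os.mkdir(stored)
    open(os.path.join(stored, 'file.txt'), 'w').close()

    recorded = os.path.join('dir-Li\u00f1ux', 'file.txt')    # composed in the list
    found = find_actual_path(root, recorded)
    ok = found is not None and os.path.exists(found)
    missing = find_actual_path(root, os.path.join('nodir', 'file.txt'))
    print(ok, missing)
```

The listing-based lookup costs a directory read per component, so a quick
existence test of the full path first, as in matchUnicodePathnames(), would
still be the sensible fast path.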

Diffall fix:
    The diffall script imports and uses this function too, to normalize
    filenames for comparison of directory lists.  It's a very broad issue.


from __future__ import print_function    # Py 2.X compatibility
import sys, os, unicodedata              # stdlib normalizer tool
from fixlongpaths import FWP             # Windows long paths
error = print                            # always display error messages

def normalizeUnicode(name):
    """
    [3.3] Normalize a decoded Unicode filename string into its composed form
    (NFC).  This was split off to a separate function from the next, because it
    is also used to compare symlink paths: they may differ between FROM and TO.
    Also used by script mode.

    Caveat: Python 2.X requires decoded unicode in normalize(), not encoded str,
    but 2.X is using str (a.k.a. bytes in 3.X) everywhere, obtained from file
    APIs.  It's unclear what encoding to use here: UTF-8 is broad but may not
    match some platform or drive defaults, yet platform defaults may not apply
    to files on external drives.  This guesses, but Mergeall now very strongly
    recommends Python 3.X for content with non-ASCII Unicode filenames--and
    defers to 3.X for proper, or at least best possible, filename decoding.
    """
    if sys.version_info[0] >= 3:
        # python 3.X: all unicode
        return unicodedata.normalize('NFC', name)    # play well with others
    else:
        # python 2.X: str=>unicode=>str, partly a guess
        try:
            tryenc = sys.getfilesystemencoding()     # or hardcode utf-8?
            return unicodedata.normalize('NFC', name.decode(tryenc)).encode(tryenc)
        except Exception:
            error('**Error normalizing 2.X Unicode name:', name)
            raise    # reraise error to stop and avoid data loss

def normalizeUnicodeFilenames(names, namesdir, trace=print):
    """
    [3.3] Normalize all file/folder/link names in the argument list to NFC
    (composed) Unicode representation, and return a dictionary mapping any
    normalized names back to their original unnormalized forms.  All filename
    comparisons are performed with the normalized (equated) forms, but all
    file operations map back to the unnormalized forms (for both to and from):

      Name comparison:
          namesto, origtos = normalizeUnicodeFilenames(namesto, dirto)
          # compare namesto

      File access:
          origto = origtos.get(nameto, nameto)
          # access origto

    This function is used by both syncs and comparisons.  Both clients must map
    back to original names to access the filesystem.  Syncs must also delete a
    TO file before copying over a matched FROM if the two items' names differ;
    in this case, simply creating the FROM file in TO is not enough to replace,
    and will produce a duplicate.  Important: the initial delete should be run
    only when names differ, which is typically rare.  An initial coding that
    always deleted before copying ran a deltas sync some 14X slower on Android
    11 shared storage for a comparable changes set (28 minutes versus 2, @10k
    changes), due to that filesystem's deletion sloth.

    Note: this does not detect form dups, per "Normalization assumption" above;
    it could check for names membership, but the callers' response is unclear.
    See this file's top docstring for more info, and "[3.3]" in code for usage.
    """
    orignames = {}
    newnames  = []

    for name in names:
        altname = normalizeUnicode(name)            # convert to common form to compare
        if altname == name:                         # unless it's already there
            newnames.append(name)
        else:
            newnames.append(altname)                # save original form for file access
            orignames[altname] = name
            trace('--Unicode normalized for name:', name, 'in:', namesdir)

    return newnames, orignames

def matchUnicodePathnames(delpath, trace=print, verbose=True):
    """
    [3.3] When applying deltas with -restore, any part of an item's pathname in
    __added__.txt may be equivalent but different from its counterpart in TO.
    For example, the filename 'Liñux.png' has two different Unicode formats, and may
    appear anywhere along the listed delpath pathname.  This can happen because TO in
    delta applies is not necessarily the same TO device used to make the deltas, and
    platforms and tools may treat such same but unequal filenames differently.

    To adjust, scan the path to find the true names of each part in the new TO.  This
    is a no-op most of the time, and is useless but harmless for -restore rollbacks
    (where, unlike deltas, TO is always the same TO that was changed earlier).  See
    also the docstring at the top of this file, and the backups.removeprioradds() call.
    """
    debug1, debug2 = False, False  # force loop/norm on macos
    exists, join = os.path.exists, os.path.join

    if exists(FWP(delpath)) and not debug1:
        return delpath                        # skip the drama for most cases
    else:
        delparts = delpath.split(os.sep)      # already changed to use os.sep
        altpath = ''
        while delparts:
            front, delparts = delparts[0], delparts[1:]   # pop front part (2.X-safe syntax)
            tmppath = join(altpath, front)

            if exists(FWP(tmppath)) and not debug2:
                # this part okay
                altpath = tmppath
                if verbose: trace('--Path okay:', tmppath)
            else:
                # fix this part
                nfc = unicodedata.normalize('NFC', front)
                nfd = unicodedata.normalize('NFD', front)
                if front == nfc:
                    altpath = join(altpath, nfd)     # failed as NFC: replace with NFD
                else:
                    altpath = join(altpath, nfc)     # failed as NFD: replace with NFC

                if exists(FWP(altpath)):
                    trace('--Unicode morphed for name:', front, 'in:', altpath)
                else:
                    error('--Unicode morph failed for name:', front, 'in:', altpath)
                    altpath = delpath
                    break                            # still not found?: punt

        return altpath

©M.Lutz