File: mergeall-products/unzipped/fixunicodedups.py

# -*- coding: utf-8 -*-

r"""
================================================================================
fixunicodedups.py:
   Normalize Unicode filenames for matching (part of the Mergeall system [3.3]).


SYNOPSIS
========

This module defines tools to handle differing code-point representations of 
the same filename--an odd border case allowed under the Unicode standard.  If
these equivalent but different variants are not mapped (a.k.a. normalized) to
the same common form for comparisons, filename matching may fail, causing 
potentially lossy skew in syncs.  

To avoid this, matching now uses normalized filenames, but filesystem access 
still uses names' original, unnormalized forms.  This rule is applied in the
mergeall, deltas, and diffall scripts.  In addition, symlink paths require a
similar adjustment for compares; components in pathnames recorded from one 
device must be mapped to others; and the "-quiet" flag has been overloaded to
also suppress normalization messages in all three scripts.
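
As a rough illustration of this rule (compare on normalized names, access 
files by original names), here is a minimal sketch.  It assumes Python 3.X, 
and its classify() helper is hypothetical: it is not Mergeall's own code, 
which uses normalizeUnicodeFilenames() below and per-folder name lists.

```python
from unicodedata import normalize

def classify(names_from, names_to):
    # Hypothetical helper: map each normalized (NFC) name back to the
    # original name as stored, so file access can use the original.
    orig_from = {normalize('NFC', n): n for n in names_from}
    orig_to = {normalize('NFC', n): n for n in names_to}

    # Compare on normalized keys only.
    common = sorted(set(orig_from) & set(orig_to))
    unique_from = sorted(set(orig_from) - set(orig_to))
    unique_to = sorted(set(orig_to) - set(orig_from))

    # Report original (unnormalized) names for filesystem operations.
    pairs = [(orig_from[k], orig_to[k]) for k in common]
    return (pairs,
            [orig_from[k] for k in unique_from],
            [orig_to[k] for k in unique_to])

names_from = ['Lin\u0303ux.png', 'new.txt']    # decomposed (NFD) in FROM
names_to = ['Li\u00f1ux.png', 'old.txt']       # composed (NFC) in TO
pairs, only_from, only_to = classify(names_from, names_to)
```

Here the two Liñux.png variants pair up as the same item despite failing 
'==', while new.txt and old.txt remain unique to FROM and TO, respectively.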

This file largely documents this subtle issue, but also codes fixes, including 
both name mappers and a path-normalization walker. 


BACKGROUND
==========

Name normalization is required because the Unicode standard has a massive hole: 
it allows the same text to be represented with different code-point strings.
This in turn requires equality to be generalized, to avoid issues when the same
string uses different Unicode forms across platforms or devices.  This impacts 
all tools that search or compare text, but can lead to content skew in sync 
tools like Mergeall if Unicode-variant filenames are present and not normalized.

As an example, the filename string 'Liñux.png' may take two equivalent but 
different decoded forms per the Unicode standard, which will not match by 
simple equality (e.g., '==' and 'in' in Python):

    >>> l = b'Lin\xcc\x83ux.png'.decode('utf-8')   # decomposed (NFD)
    >>> m = b'Li\xc3\xb1ux.png'.decode('utf-8')    # composed (NFC)
    
    >>> l, m
    ('Liñux.png', 'Liñux.png')     # equivalent text
    
    >>> len(l), len(m)             # len(code points)
    (10, 9)
    
    >>> l == m                     # same but different!
    False

The first of these uses the longer NFD format; notice its encoded 'n\x...'.
The second instead uses the shorter NFC form, with a dedicated ñ code point.
Because different tools and platforms may use either form, both may be 
present, which can wreak havoc with programs that process text or filenames.
To do better, programs must normalize to a common form prior to matches:

    >>> from unicodedata import normalize             # Python stdlib tool
    >>> normalize('NFC', l) == normalize('NFC', m)    # or 'NFD': any common form
    True
 
Python's unicodedata module provides the required conversions to make text
agree.  For more background on this very dark Unicode corner, see:

    https://en.wikipedia.org/wiki/Unicode_equivalence
    https://docs.python.org/3.6/library/unicodedata.html#unicodedata.normalize

This ultimately seems like a failure of the Unicode standard to fix a huge
interoperability problem, but is now accommodated in Mergeall 3.3 per above.


MERGEALL IMPACT
===============

In terms of Mergeall specifically, this issue arose when applying a deltas.py 
changes set to another device and platform via the Android Deltas Sync 
package, which uses Mergeall as a nested tool.  See the package's docs here:

    https://learning-python.com/android-deltas-syncs/_README.html

In that system, deltas are computed between a PC and proxy drive, and later 
applied on a phone from a copied zipfile.  Because filenames on the PC and 
proxy may be arbitrarily different from those on the phone, Unicode-variant 
skew is possible whenever applying deltas between devices with differing 
policies.  Moreover, zipfiles may skirt name mapping used in other contexts.

While this issue materialized in delta sets, its risks prior to 3.3's 
normalization fix vary by the use case for which mergeall.py is used:

Basic syncs:
    In basic runs for direct merges of full content copies, the worst outcome
    appears to be one-time pointless copies of folders.  Basic file changes work 
    in 3.2: skewed names will be classified as unique in FROM and TO, but the FROM
    version will still replace the TO via a delete and copy, instead of a replace.

    Folder names are more problematic: a FROM folder with a varying name may be 
    copied in its entirety to TO as a unique item, rather than being traversed 
    as a same-name match to a nested diff.  The net result is simply a single 
    pointless copy of a FROM folder to TO: because TO is also deleted as unique,
    FROM and TO will match thereafter.  Even so, this is subpar; although it does
    not damage content, it may still be costly for large folders.

Rollbacks:
    In -restore runs to roll back prior changes, there is little risk of damage. 
    Name-variant mismatches are not generally possible, because FROM backup sets
    are applied to the same TO content tree from which they were generated.
    Although a TO tree may be relocated to a different platform before its
    rollback, this seems unlikely, and would be equivalent to the next item.

Delta sets:
    In -restore runs to apply change sets created by deltas.py, the risk of
    content skew in 3.2 looms largest.  In this mode, FROM reflects both new 
    items, as well as changes to be applied later to matching TO names.  Unlike 
    full syncs, delta applications compare just deltas, not full content.  As a 
    result, when applying FROM delta sets, divergent Unicode names in TO will be 
    classified as unique and unchanged, and hence skipped rather than removed.

    The net effect will be duplicate files and folders for Unicode variants 
    in FROM: the FROM version will be copied over to TO, but the prior version
    in TO will linger as an out-of-sync copy.  Moreover, direct Mergeall 3.2 
    syncs that could repair the damage by removing the older and now-unique TOs 
    are not available on some platforms (e.g., Android 11+: see Android Deltas 
    Sync).  Subtly, deltas' __added__.txt removal lists must also be adjusted
    for later TOs, because components of Unicode paths may differ arbitrarily.

This cropped up just once after nearly 8 years of Mergeall use, and may require 
a platform-specific context to appear at all (there's more on the failure below).
Nevertheless, Mergeall's developer does not use many non-ASCII filenames, and this
may be a larger concern for those who do.  By normalizing filenames for matching
in the mergeall.py, diffall.py, and deltas.py scripts, 3.3 avoids all the potential
3.2 issues listed above, and ensures data integrity more broadly.

Fine print: Mergeall 3.3 uses Unicode's "canonical" equivalence (e.g., NFC) but 
not its "compatibility" (e.g., NFKC), because the program operates in the realm 
of engineering, not linguistics.  It makes sense to equate code-point sequences
defined to stand for the same character.  It does not make sense to guess human 
intent: syncs must be deterministic.  For more on this topic, see links above.
Also see Additional Notes ahead for more on 3.3's assumptions and choices.
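
To make the canonical-versus-compatibility distinction concrete, the sketch
below compares a name that starts with the single-character "fi" ligature 
(U+FB01) to its plain-ASCII spelling.  The filenames are made up for this
illustration, and Python 3.X is assumed.

```python
from unicodedata import normalize

ligature = '\ufb01le.txt'    # begins with U+FB01, LATIN SMALL LIGATURE FI
plain = 'file.txt'           # begins with the two letters 'f' and 'i'

# Canonical forms (NFC/NFD) leave the ligature alone: names stay distinct.
canonical_equal = normalize('NFC', ligature) == normalize('NFC', plain)

# Compatibility forms (NFKC/NFKD) fold the ligature to 'fi': names merge.
compat_equal = normalize('NFKC', ligature) == normalize('NFKC', plain)

print(canonical_equal, compat_equal)
```

Under NFC, as used by Mergeall 3.3, these two remain different files; a
compatibility form would instead treat them as the same item in a sync.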


FAILURE DETAILS
===============

Unicode variants actually caused a doppelgänger (duplicate) for a changed file 
with a non-ASCII name in the examples/unicode folder of the thumbspage program 
(at learning-python.com/thumbspage.html).  This file's name was the same, but 
used only _decomposed_ form on both macOS and Android 11, and _composed_ form 
on Android 10--until a propagate via zipfile from macOS stored the decomposed 
form redundantly on Android 10 only.  Here's the post-propagate scene in the 
same content folder on all three platforms, as revealed by Python 3.X:

On macOS APFS internal drive (MacBook Pro, Catalina):

    >>> for f in glob.glob('frigcal-L*.png'):          # display encoded forms
    ...     print(f, '=>', f.encode('utf8'))           # just one name flavor
    ...
    frigcal-Liññux.png => b'frigcal-Lin\xcc\x83n\xcc\x83ux.png'

On Android 11 shared storage (Samsung Z Fold3):

    >>> for f in glob.glob('frigcal-L*.png'):          # same as macOS: decomp
    ...     print(f, '=>', f.encode('utf8'))
    ...
    frigcal-Liññux.png => b'frigcal-Lin\xcc\x83n\xcc\x83ux.png'
 
On Android 10 shared storage (Samsung Note20 Ultra):

    >>> for f in glob.glob('frigcal-L*.png'):          # same-but-different names
    ...     print(f, '=>', f.encode('utf8'))
    ...
    frigcal-Liññux.png => b'frigcal-Li\xc3\xb1\xc3\xb1ux.png'
    frigcal-Liññux.png => b'frigcal-Lin\xcc\x83n\xcc\x83ux.png'
    
    >>> for f in glob.glob('frigcal-L*.png'):
    ...     print('-'.join(str(ord(c)) for c in f))    # decoded code points
    ...
    102-114-105-103-99-97-108-45-76-105-241-241-117-120-46-112-110-103
    102-114-105-103-99-97-108-45-76-105-110-771-110-771-117-120-46-112-110-103

The same sort of doppelgängers were generated for filename 'pymailgui-spÄÄÄm.png'
and the thumbspage HTML files for each in multiple locations--yielding 21 dups
after the files in question were changed and synced via a deltas set.
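
Doppelgängers like these can be located by grouping a folder's names on 
their normalized form.  The following is only a sketch under assumptions: 
find_form_dups() is a hypothetical helper, not part of Mergeall, and it 
takes a plain list of names (e.g., from os.listdir) in Python 3.X.

```python
from collections import defaultdict
from unicodedata import normalize

def find_form_dups(names):
    # Hypothetical helper: group names by their NFC form, and keep only
    # the groups where two or more stored names share that form.
    groups = defaultdict(list)
    for name in names:
        groups[normalize('NFC', name)].append(name)
    return {key: stored for key, stored in groups.items() if len(stored) > 1}

listing = ['frigcal-Li\u00f1\u00f1ux.png',         # composed (NFC)
           'frigcal-Lin\u0303n\u0303ux.png',       # decomposed (NFD)
           'plain.png']
dups = find_form_dups(listing)
for key, stored in dups.items():
    print(key, '=>', [name.encode('utf8') for name in stored])
```

Run on the Android 10 listing above, a check of this sort would report one 
group holding both stored variants, and nothing for single-form names.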


PLATFORM VARIATION
==================

More generally, all three of the last section's platforms differ in their 
handling of Unicode filename variants, and some automatically map one form 
to another.  This effectively requires normalization for interoperability. 

macOS, for example, allows either form, but stores just one, and auto-maps the 
other via name normalization to the form first stored (at least on the current 
APFS filesystem; HFS+ formerly enforced a decomposed-only rule):

    # MACOS AUTO-MAPPING (APFS)

    >>> os.chdir('/Users/me/Desktop/temp')
    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ... 
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ... 
    AAA-Liñux => b'AAA-Li\xc3\xb1ux'

Android 11 shared storage (on a Samsung Z Fold3, at least) similarly 
maps names to a single stored form for writes, but fully disallows the 
composed form by itself in some contexts.  The latter seems an Android 
11 bug; the failing open() call below succeeds in app-specific storage 
(and on other platforms), and may fail only if both name forms are 
created and mapped together first as follows (the composed form has been
seen to work by itself otherwise).  This bug is also out of scope here, 
but see the Unicode converter note ahead for a work-around if needed:

    # ANDROID 11 AUTO-MAPPING

    >>> os.chdir('/sdcard/temp')
    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

    # ANDROID 11 BUG!

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    FileNotFoundError: [Errno 2] No such file or directory: 'AAA-Liñux'

    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

Android 10 shared storage, by contrast, happily stores both forms on writes,
and doesn't perform any normalization mapping on its own (per a long story,
Android 11 swapped in a new FUSE-based implementation of shared storage to 
implement access restrictions, so all bets on 10/11 compatibility are off):

    # ANDROID 10 DUPLICATES

    >>> os.chdir('/sdcard/temp')
    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'
    AAA-Liñux => b'AAA-Li\xc3\xb1ux'

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Li\xc3\xb1ux'
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

And on all three devices, the internal code-point representation of a text
literal depends on what is used for the literal.  Hence, what is created
by an open() with a literal varies--what you paste is what you'll get:

    >>> len('AAA-Liñux'), 'AAA-Liñux'.encode('utf8')   # pasted NFD form
    (10, b'AAA-Lin\xcc\x83ux')
    >>> 
    >>> len('AAA-Liñux'), 'AAA-Liñux'.encode('utf8')   # pasted NFC form
    (9, b'AAA-Li\xc3\xb1ux')
    >>>
    >>> open('AAA-Liñux', 'w').close()                 # varies per literal

In fact, it's even possible to generate mixed-format paths on macOS;
the first of the following has both decomposed and composed components,
and was generated by an 'echo' in the Bash shell:

    >>> for f in glob.glob('FROM/dir-Liñux/*'):  print(f, '=>', f.encode('utf8'))
    ... 
    FROM/dir-Liñux/nested3-Liñux.txt => b'FROM/dir-Lin\xcc\x83ux/nested3-Li\xc3\xb1ux.txt'
    FROM/dir-Liñux/nested1-Liñux.txt => b'FROM/dir-Lin\xcc\x83ux/nested1-Lin\xcc\x83ux.txt'

Finally, from the anti-interoperability department, macOS's exFAT driver 
changes composed-form filenames to decomposed form on file writes.  This 
guarantees a perpetual diff in Mergeall 3.2 when composed names are copied
to exFAT (a later sync will remove the decomposed form--and then rewrite it 
again), and fully requires Mergeall 3.3 normalization to ignore the skew:

    >>> os.chdir('/Volumes/T7/_test')
    >>> for f in glob.glob('file3*'): os.remove(f)
    >>> len(b'\xc3\xb1'.decode('utf8'))
    1
    >>> open(b'file3-Li\xc3\xb1ux.txt'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('file3*'):
    ...     print(f, '=>', f.encode('utf8'))
    ... 
    file3-Liñux.txt => b'file3-Lin\xcc\x83ux.txt'

For the record, Windows, Linux, Android 11 app-specific storage, and Pixel
Android 12 shared storage were not involved in the failure, but all four 
store both name forms just like Android 10 shared storage's results above.
Given the reach of these platforms, name mismatches may be common.  Also 
for the record, all devices tested are configured to use US English in 
their settings, though this seems unlikely to influence these results.

Some additional variables (e.g., FAT32 drives) were not tested for this 
analysis, because they are in some sense irrelevant: whatever its causes,
Unicode skew is a vendor-supported reality that must be accommodated.


CONCLUSIONS
===========

In the observed failure, auto-mapping like that demoed in the preceding 
section may have masked filenames' Unicode discord until the forgery was 
taken out of the picture by zipfiles.  But this is clearly a much broader 
platform-, version-, and filesystem-specific nightmare.  Alas, rather than
removing this Babel, the Unicode standard seems to have simply enabled it.

It's less clear how the unique filename forms arose on Android 10 alone.  
The subject files have traversed many platforms and drives--including FAT32
and exFAT, which were both used to sync the Android 10 device in the recent 
past, and the fully decomposed HFS+, which was used on macOS devices in the 
further past.  Android shared storage's quasi emulation of FAT32 is always a 
prime suspect in such border cases too: Android 10's lack of normalization 
may have been a defect, and Android 11 may have been fixed (and broken!), 
but Android's frequent auto-updates make forensics nearly impossible.

Ultimately, though, a change to the files on macOS created the duplicate in
Mergeall 3.2: the macOS and Android 10 filenames did not match while applying
updates recorded in the deltas zipfile, and macOS's version was classified as
unique instead of changed--and thus copied redundantly alongside the original.  

While a direct sync in Mergeall 3.2 would remove the former TO variant and 
so repair this skew specifically, this is not a solution generally (large 
folder mismatches and their copies are still problematic in syncs); won't 
help for blatant skewers like macOS's exFAT driver (whose diffs are forever); 
and is not an available option on Android 11+ (see Android Deltas Sync).

Mergeall 3.3's normalization is the only full fix for this problem.  By 
neutralizing filename skew in comparisons, 3.3 sidesteps an arguable failing
of the Unicode standard, at a small cost in backward-incompatible behavior.


ADDITIONAL NOTES
================

Bytes API: 
    Passing bytes to Python file interfaces to avoid Unicode decoding won't 
    help here.  Filenames that differ are different in both their encoded 
    bytes format, and their decoded code-point format.  This is true 
    regardless of the encoding scheme used to convert to/from bytes.

Nonportable filenames: 
    Unicode skew is similar to the problem of nonportable filename
    characters, but the auto-adjustment here is safer.  Auto-mangling 
    nonportable characters for comparisons can lead to data loss when the 
    mangled form is the same as another filename coincidentally.  No known 
    use case requires both a composed and decomposed Unicode-format name.

Normalization assumption:
    To expand on the prior point, 3.3's Unicode normalization assumes that
    the same name is not intentionally present in multiple Unicode forms 
    in the same folder.  If two equivalent forms are present, syncs will 
    arbitrarily update either of them.  This is potentially lossy, but seems
    far less likely than forms differing across content copies inadvertently.

    Indeed, the Unicode standard defines differing forms to be canonically 
    equivalent, so their redundant presence is really an error state allowed
    by some filesystems.  It seems astronomically unlikely that such a state 
    would be created by design, but please report contrary use cases.
    Note: direct syncs, where possible, can remove doppelgängers, but only
    if you run the syncs with Mergeall 3.2; 3.3 will equate names instead. 

Unicode converter:
    A new utility script which changes the Unicode format of all items in a 
    folder tree (e.g., to NFD) is available in Android Deltas Sync here:
    https://learning-python.com/android-deltas-syncs/_etc/convertunicode.py

    This isn't shipped in Mergeall, as it's not required for normalization,
    but may be generally useful when propagating to Android 11 shared storage.
    Run it if needed to work around the bug illustrated above (see "BUG!").
    Unlike syncs, this utility script skips files with multiple name forms.
    Note: this script can also be used to display names not in a selected 
    format, and hence isolate any doppelgängers in your content.

Deltas contexts:
    The normalization code here runs twice for deltas: at creation and 
    apply times, which handle skew at the origin and destination, respectively.
    Nit: the resolution phase logic used for both deltas and rollbacks could
    be recoded separately (it simply deletes __added__.txt items and copies 
    over FROM items).  This would skip a pointless comparison phase, make run
    logs simpler, and eliminate some convolution for normal syncs, but would 
    also have to handle normalization and more separately and redundantly.

Change lists: 
    The "diffs" changes list in mergeall.py must now record both FROM and TO
    names.  The original TO is fetched on demand in the comparison phase, but 
    this won't work in the resolution phase because each tree level has its 
    own originals dicts.  Hence, the changes-save code of deltas.py required 
    mods too (even though it simply saves FROM items to a separate folder, 
    whose content will be mapped here to a TO later).  Unique-item lists 
    required no such structural change for names (they are FROM/TO specific),
    but must record unique names in unnormalized form for use in resolution.
    
Paths-normalization fix: 
    backup.py's removeprioradds() required a related fix, because names saved 
    in __added__.txt at deltas-create time may not match unnormalized TOs at
    deltas-apply time.  As names may differ arbitrarily when applying a deltas 
    set to a different device, that function must inspect every component of 
    a saved pathname and adjust as needed.  In Android Deltas Sync, for 
    example, deltas are created for a proxy drive but applied later to a phone.
    For implementation details, see this file's matchUnicodePathnames() below.

Diffall fix: 
    diffall.py imports and uses this file's normalizeUnicodeFilenames() too, to
    normalize filenames for comparison of directory lists.  It's a very broad issue.
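
The component-by-component matching described under "Paths-normalization fix" 
above can be sketched as follows.  This is a simplified illustration only, 
not this file's matchUnicodePathnames(): the match_components() name and its 
arguments are hypothetical, it assumes Python 3.X, and it ignores the Windows
path flavors and tracing that the real function must handle.

```python
import os
from unicodedata import normalize

def match_components(root, relparts):
    # Hypothetical sketch: walk relparts below root, replacing any part
    # that is absent as given with a stored name equal to it under NFC.
    # Returns the matched path, or None if some part has no equivalent.
    current = root
    for part in relparts:
        candidate = os.path.join(current, part)
        if not os.path.exists(candidate):
            wanted = normalize('NFC', part)
            try:
                stored = os.listdir(current)
            except OSError:
                return None                       # parent is not a folder
            matches = [name for name in stored
                       if normalize('NFC', name) == wanted]
            if not matches:
                return None                       # truly absent in TO
            candidate = os.path.join(current, matches[0])
        current = candidate
    return current
```

Each part must be tested in turn because the stored form of one folder says
nothing about the form of the names nested inside it.  On platforms that 
auto-normalize, the existence test succeeds and no listing is needed.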
================================================================================
"""




##################
# CODE STARTS HERE
##################



from __future__ import print_function    # Py 2.X compatibility
import sys, os, unicodedata              # stdlib normalizer tool
from fixlongpaths import FWP             # Windows long paths
error = print                            # always display error messages




def normalizeUnicode(name):
    """
    ----------------------------------------------------------------------------
    [3.3] Normalize a decoded Unicode filename string into its composed form 
    (NFC).  This was split off to a separate function from the next, because it
    is also used to compare symlink paths: they may differ between FROM and TO,
    and in both mergeall.py and diffall.py.  Also used by dirdiff.py script mode.

    Caveat: Python 2.X requires decoded unicode in normalize(), not encoded str,
    but 2.X is using str (a.k.a. bytes in 3.X) everywhere, obtained from file 
    APIs.  It's unclear what encoding to use here: UTF-8 is broad but may not 
    match some platform or drive defaults, yet platform defaults may not apply 
    to files on external drives.  This guesses, but Mergeall now very strongly 
    recommends Python 3.X for content with non-ASCII Unicode filenames--and 
    defers to 3.X for proper, or at least best possible, filename decoding.
    ----------------------------------------------------------------------------
    """

    if sys.version_info[0] >= 3:
        # python 3.X: all unicode
        return unicodedata.normalize('NFC', name)    # play well with others
    else:
        # python 2.X: str=>unicode=>str, partly a guess
        try:
            tryenc = sys.getfilesystemencoding()     # or hardcode utf-8?
            return unicodedata.normalize('NFC', name.decode(tryenc)).encode(tryenc)
        except:
            error('**Error normalizing 2.X Unicode name:', name)
            raise    # reraise error to stop and avoid data loss




def normalizeUnicodeFilenames(names, namesdir, trace=print):
    """
    ----------------------------------------------------------------------------
    [3.3] Normalize all file/folder/link names in the argument list to NFC 
    (composed) Unicode representation, and return a dictionary mapping any
    normalized names back to their original unnormalized forms.  All filename
    comparisons are performed with the normalized (equated) forms, but all 
    file operations map back to the unnormalized forms (for both to and from):

      Name comparison:
          namesto, origtos = normalizeUnicodeFilenames(namesto, dirto)
          # compare namesto

      File access:
          origto = origtos.get(nameto, nameto)
          # access origto

    This function is used by both mergeall.py syncs and diffall.py comparisons.
    Both clients must map back to original names to access the filesystem.

    mergeall.py must also delete a TO file before copying over a matched FROM
    if the two items' names differ; in this case, simply creating the FROM file
    in TO is not enough to replace, and will produce a duplicate.  Important:
    the initial delete should be run only when names differ, which is typically
    rare.  An initial coding that always deleted before copying ran a deltas sync
    some 14X slower on Android 11 shared storage for a comparable changes set 
    (28 minutes versus 2, @10k changes), due to that filesystem's deletion sloth.

    Note: this does not detect form dups, per "Normalization assumption" above;
    it could check for names membership, but the callers' response is unclear.
    See this file's top docstring for more info, and "[3.3]" in code for usage.
    ----------------------------------------------------------------------------
    """

    orignames = {}
    newnames  = []

    for name in names:
        altname = normalizeUnicode(name)            # convert to common form to compare
        if altname == name:                         # unless it's already there
            newnames.append(name)
        else:
            newnames.append(altname)                # save original form for file access
            orignames[altname] = name
            trace('--Unicode normalized for name:', name, 'in:', namesdir)

    return newnames, orignames




def matchUnicodePathnames(delpath, trace=print, verbose=True):
    r"""
    ------------------------------------------------------------------------------------
    [3.3] Normalize Unicode components in delpath to match the same path 
    in TO, so delpath can be deleted from TO on all platforms. 

    Return delpath unchanged and immediately if it exists in TO initially;
    this applies to both paths without any Unicode variants, and platforms 
    which auto-normalize paths.  Else, return a changed delpath with its 
    components morphed to Unicode NFC/NFD variants that match those in the 
    path of the item physically stored in TO.
 
    delpath is the result of joining the TO content-tree's root with a 
    relative pathname in __added__.txt.  __added__.txt paths name a TO 
    item to be deleted (an add for rollbacks, and a unique TO for deltas),
    and are always partial, and relative to both FROM and TO content 
    roots (they record just the path below the content root folders).
    The full delpath may be absolute or relative, and its path separators 
    have already been changed to that of the host platform for portability.

    When applying delta sets with -restore, any part of a to-be-deleted 
    item's pathname in __added__.txt may be equivalent but different from 
    its counterpart in TO.  For example, the filename 'Liñux.png' has two 
    different Unicode formats, and either may appear anywhere along both 
    delpath and its counterpart stored in TO.  This can happen because TO 
    in delta applies is not necessarily the same TO device used to make the 
    __added__.txt paths, and platforms and tools through which content has
    passed may treat such same-but-unequal filenames arbitrarily (more ahead).

    To adjust, scan the path to find the true names of each part stored 
    in TO, by checking existence along the way.  This is a no-op most 
    of the time, and is generally useless but harmless for -restore 
    rollbacks (where, unlike deltas, TO should be the same TO that was 
    changed earlier).  To support both deltas and worst cases, though, the
    algorithm here handles arbitrarily mixed NFC/NFD names anywhere along 
    the same path on both FROM and TO.  See also the docstring at the top
    of this file, and the call to this in backup.removeprioradds().

    ----
    Why this is necessary: 

    Although Unicode-variant skew can arise from program use in general, 
    Android Deltas Sync (ADS) is the main risk.  When applying delta 
    sets for that system with -restore, the unique TO paths stored in 
    __added__.txt will be those of the _proxy_ drive, not the actual TO 
    device.  Because TO (the "phone") may be Android, Windows, or anything,
    Unicode variants in the proxy's saved paths may arbitrarily differ when
    trying to delete them from TO.  The Unicode policies of the FROM platform 
    and the medium used to store proxy paths may both vary from those on TO.

    This skew can arise for delta sets in general because their syncs are 
    transitive and deferred, but ADS is the most common deltas use case.  
    The impact depends on both platform and program Unicode policies and may 
    be rare in practice, but ADS allows any Python-capable platform to serve
    as a later TO.  While it's possible that every platform + filesystem
    combination in use today auto-normalizes paths so existence and opens 
    work in all cases, Mergeall prefers to be proactive on resolution here.

    Update - platform normalization findings:

    - macOS (filesystem APFS) and Android (shared storage) _do_ auto-normalize 
      path filenames, so Unicode variants match automatically without help
    
    - Windows 11 (NTFS) and Linux (ext4) do _not_ auto-normalize paths, and 
      require the filename normalization logic here to match Unicode variants

    The latter seems ample cause for normalizing paths manually here.
    The former seems a bad policy--platform automatic normalization is 
    proprietary and non-interoperable; among other things, it can yield 
    duplicates in syncs more naive than Mergeall 3.3's.

    Update - the Android normalization story:

    Android, being Android, implements rules that vary per storage type, and 
    possibly even vendor.  On a Samsung Android 12L Fold4, both shared storage 
    (e.g., /sdcard/folder) and app-specific storage (e.g., /sdcard/Android/data/app)
    _do_ auto-normalize, but app-private storage (e.g., /data/data/app) does _not_.  
    Background coverage:
 
      learning-python.com/android-deltas-syncs/_README.html#Android%20Storage

    In sum, among the platforms tested for Unicode variants in 2022:

    - macOS, Android shared, and Android app-specific auto-normalize
    - Windows, Linux, and Android app-private do not auto-normalize

    Except where this varies by filesystem, vendor, or act of God...

    ----
    UPDATE, Oct-22: the path-scanning algorithm here was recoded in light 
    of recent Windows failures when files to be deleted in __added__.txt had
    been atypically removed from TO before the run (by an earlier -restore 
    with the same DELTAS).  The new algorithm fixes the Windows failures as
    well as a similar potential on Unix, and accommodates all these path types:

        - Windows absolute (C:\xxx)
        - Windows drive-CWD-relative (C:xxx)
        - Windows drive-relative (\xxx)
        - Windows CWD-relative (xxx) 
        - Windows device paths (\\?\C:xxx, \\.\C:xxx)
        - Windows network (\\server\...)
        - Unix absolute (/xxx) 
        - Unix CWD-relative (xxx)
        - Relative path syntax on both platforms: '.', '..'

    Any rare and unusual path not recognized (e.g., Windows "\\?\UNC\") 
    simply skips Unicode path normalization here, and won't be deleted by
    the mergeall -restore run (a log message is generated to alert users).
    Please send user feedback if this policy impacts your use case.

    The prior coding was happily naive about Windows' many flavors of path 
    syntax (which seem ad-hoc alternatives to Unix mount points, and Windows
    drive-letter association).  This code was triggered inadvertently, but 
    can kick in any time components' Unicode differs between that saved in 
    __added__.txt and that stored in the TO tree/device.  For more info, see 
    the prototype for the new coding, especially its py-split-join.py docstring:

      ./test/test-path-normalization-3.3/prototype-recoding-oct22/ 

    Notes:

    - os.path is ntpath on Windows and posixpath on Unix; use the latter 
      two to test platform-specific path handling on either platform.
    
    - Adding tracer prints can help demystify this wack code (Windows 
      makes it bad enough to qualify as an interview question, imho).

    - This cannot simply normalize the full path all at once, because 
      each of its components may have arbitrary Unicode-variant skew.

    - There is no way to tell what the path looks like in TO without 
      walking components to test and possibly normalize each part.
    
    - This will print an arguably confusing Unicode-morph failure message 
      and return delpath unchanged if delpath is fully absent in TO.  This 
      can happen only if delpath was manually deleted outside normal script
      usage, and delpath must ultimately be skipped as a sync error anyhow.
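
    The notes on per-component skew can be sketched with plain strings.
    This is an illustration only, under assumed names: the two-part path
    is invented, and comparing against the stored string stands in for
    the filesystem existence test used by the real code:

```python
import unicodedata

def nfc(text): return unicodedata.normalize('NFC', text)
def nfd(text): return unicodedata.normalize('NFD', text)

# Hypothetical path as stored in TO: first component NFC, second NFD.
on_disk = nfc('Liñux') + '/' + nfd('café.txt')

# Hypothetical path as recorded elsewhere: forms swapped in each component.
recorded = nfd('Liñux') + '/' + nfc('café.txt')

# Normalizing the full path at once cannot reproduce the mixed stored form.
whole_nfc_matches = nfc(recorded) == on_disk
whole_nfd_matches = nfd(recorded) == on_disk

# Walking the components and trying each variant per part does.
fixed = []
for part, stored in zip(recorded.split('/'), on_disk.split('/')):
    candidates = (part, nfc(part), nfd(part))
    fixed.append(next(c for c in candidates if c == stored))
per_part_matches = '/'.join(fixed) == on_disk

print(whole_nfc_matches, whole_nfd_matches, per_part_matches)
```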

    Testing/Results:

    - debug1/debug2 force the path normalization loop and component mods,
      respectively, on platforms that auto-normalize filenames to match 
      (e.g., macOS, Android shared).  On these, True/False generates 
      a path walk and "--Path okay" messages, and True/True yields the 
      same but with "--Unicode morphed" messages.

      Elsewhere, on platforms that do not auto-normalize (e.g., Windows, 
      Linux), these switches aren't useful: the loop already runs normally
      for mismatches, and the switches cause the loop to fail when a matching
      name is forcibly changed.  Leave them both False, and instead run the
      test script below, which forces path diffs and true Unicode path
      normalization on platforms that require it, and further documents testing:

        ./test/test-path-normalization-3.3/test-path-normalization-walks/_TEST.py

      For production use, the switches should always be False/False, and 
      the code here will do the right thing for the platform hosting it.
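
    For quick experiments on platforms that auto-normalize, one option is
    to replace the filesystem with an in-memory set of paths.  The sketch
    below is a simplified stand-in, not the code in this function: the
    names walk_and_fix and make_tree are hypothetical, only Unix-style
    paths are handled, and each part is tried as is, then as NFC, then as
    NFD, instead of only the opposite form:

```python
import posixpath, unicodedata

def walk_and_fix(delpath, exists):
    # Simplified walk: Unix-style paths only, no drives and no tracing.
    if exists(delpath):
        return delpath
    sofar = '/' if delpath.startswith('/') else ''
    for part in delpath.strip('/').split('/'):
        for candidate in (part,
                          unicodedata.normalize('NFC', part),
                          unicodedata.normalize('NFD', part)):
            newpath = posixpath.join(sofar, candidate)
            if exists(newpath):
                sofar = newpath
                break
        else:
            return delpath      # no variant found: return the original path
    return sofar

def make_tree(path):
    # Fake filesystem: the set of every leading prefix of an absolute path.
    parts = path.strip('/').split('/')
    return {'/' + '/'.join(parts[:i]) for i in range(1, len(parts) + 1)}

# Hypothetical stored path: first name NFC, second name NFD.
stored = ('/to/' + unicodedata.normalize('NFC', 'Liñux') +
          '/' + unicodedata.normalize('NFD', 'café.txt'))
fake_tree = make_tree(stored)

# Hypothetical recorded path: the opposite form in each component.
recorded = ('/to/' + unicodedata.normalize('NFD', 'Liñux') +
            '/' + unicodedata.normalize('NFC', 'café.txt'))

fixed = walk_and_fix(recorded, fake_tree.__contains__)
missing = walk_and_fix('/to/absent.txt', fake_tree.__contains__)
print(fixed == stored, missing)
```

    Passing the existence test as an argument keeps the walk independent
    of the host platform's own filename normalization.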
    ------------------------------------------------------------------------------------
    """    

    # force loop/normalization to test (else all variants exist on macos/android)
    debug1, debug2 = False, False

    # ntpath xor posixpath tools
    exists, join, sep, splitdrive = \
        os.path.exists, os.path.join, os.path.sep, os.path.splitdrive

    if exists(FWP(delpath)) and not debug1:
        return delpath                              # skip the drama for most cases

    else:
        drive, rest = splitdrive(delpath)           # splitdrive for Windows shenanigans
        if drive and rest.startswith(sep):          # drive for abs, drive-relative, etc.
            sofar = drive                           # rest starts with sep iff abs
            parts = rest.split(sep)
        else:
            sofar = ''                              # no drive, normal components
            parts = delpath.split(sep)              # also for Windows drive-relative 

        if parts[0] == '':                          # empty for abs path/rest, win+ux
            sofar += sep                            # make join() work, skip the empty
            parts = parts[1:]
        
        while parts:                                # add each part to sofar and test
            part, *parts = parts                    # normalize each part if needed
            newpath = join(sofar, part)
 
            # test/mod this part's extension to sofar
            if exists(FWP(newpath)) and not debug2:
                # this part okay
                sofar = newpath
                if verbose: trace('--Path okay:', sofar)

            else:
                # fix this part
                nfc = unicodedata.normalize('NFC', part)
                nfd = unicodedata.normalize('NFD', part)
                if part == nfc:
                    sofar = join(sofar, nfd)        # failed as NFC: replace with NFD
                else:
                    sofar = join(sofar, nfc)        # failed as NFD: replace with NFC

                if exists(FWP(sofar)):
                    trace('--Unicode morphed for name:', part, 'in:', sofar)
                else:
                    error('--Unicode morph failed for name:', part, 'in:', sofar)
                    sofar = delpath 
                    break                           # not found with mod: punt now

        return sofar    # normalized/joined result, or original if part failed


