File: mergeall-products/unzipped/fixunicodedups.py
# -*- coding: utf-8 -*-
r"""
================================================================================
fixunicodedups.py: Normalize Unicode filenames for matching
(part of the Mergeall system [3.3]).


SYNOPSIS
========

This module defines tools to handle differing code-point representations of
the same filename--an odd border case allowed under the Unicode standard.

If these equivalent but different variants are not mapped (a.k.a. normalized)
to the same common form for comparisons, filename matching may fail, causing
potentially lossy skew in syncs. To avoid this, matching now uses normalized
filenames, but filesystem access still uses names' original, unnormalized forms.

This rule is applied in the mergeall, deltas, and diffall scripts. In addition,
symlink paths require a similar adjustment for compares; components in pathnames
recorded from one device must be mapped to others; and the "-quiet" flag has
been overloaded to also suppress normalization messages in all three scripts.

This file largely documents this subtle issue, but also codes fixes.


BACKGROUND
==========

Name normalization is required because the Unicode standard has a massive hole:
it allows the same text to be represented with different code-point strings.
This in turn requires equality to be generalized, to avoid issues when the same
string uses different Unicode forms across platforms or devices. This impacts
all tools that search or compare text, but can lead to content skew in sync
tools like Mergeall if Unicode-variant filenames are present and not normalized.

As an example, the filename string 'Liñux.png' may take two equivalent but
different decoded forms per the Unicode standard, which will not match by
simple equality (e.g., '==' and 'in' in Python):

    >>> l = b'Lin\xcc\x83ux.png'.decode('utf-8')       # decomposed (NFD)
    >>> m = b'Li\xc3\xb1ux.png'.decode('utf-8')        # composed (NFC)

    >>> l, m
    ('Liñux.png', 'Liñux.png')                         # equivalent text

    >>> len(l), len(m)                                 # len(code points)
    (10, 9)

    >>> l == m                                         # same but different!
    False

The first of these uses the longer NFD format; notice its encoded 'n\x...'.
The second instead uses the shorter NFC form, with a dedicated ñ code point.
Because different tools and platforms may use either form, both may be present,
which can wreak havoc with programs that process text or filenames.

To do better, programs must normalize to a common form prior to matches:

    >>> from unicodedata import normalize              # Python stdlib tool
    >>> normalize('NFC', l) == normalize('NFC', m)     # or 'NFD': any common
    True

Python's unicodedata module provides the required conversions to make text
agree. For more background on this very dark Unicode corner, see:

    https://en.wikipedia.org/wiki/Unicode_equivalence
    https://docs.python.org/3.6/library/unicodedata.html#unicodedata.normalize

This ultimately seems like a failure of the Unicode standard to fix a huge
interoperability problem, but is now accommodated in Mergeall 3.3 per above.


MERGEALL IMPACT
===============

In terms of Mergeall specifically, this issue arose when applying a deltas.py
changes set to another device and platform via the Android Deltas Scripts
package, which uses Mergeall as a nested tool. See the package's docs here:
https://learning-python.com/android-deltas-scripts/_README.html.

In that system, deltas are computed between a PC and proxy drive, and later
applied on a phone from a copied zipfile. Because filenames on the PC and proxy
may be arbitrarily different from those on the phone, Unicode-variant skew is
possible whenever applying deltas between devices with differing policies.
Moreover, zipfiles may skirt name mapping used in other contexts.

While this issue materialized in delta sets, its risks prior to 3.3's
normalization fix vary by the use case for which mergeall.py is used:

Basic syncs:
    In basic runs for direct merges of full content copies, the worst outcome
    appears to be one-time pointless copies of folders. Basic file changes
    work in 3.2: skewed names will be classified as unique in FROM and TO, but
    the FROM version will still replace the TO via a delete and copy, instead
    of a replace.

    Folder names are more problematic: a FROM folder with a varying name may be
    copied in its entirety to TO as a unique item, rather than being traversed
    as a same-name match to a nested diff. The net result is simply a single
    pointless copy of a FROM folder to TO: because TO is also deleted as unique,
    FROM and TO will match thereafter. Even so, this is subpar; although it
    does not damage content, it may still be costly for large folders.

Rollbacks:
    In -restore runs to roll back prior changes, there is little risk of damage.
    Name-variant mismatches are not generally possible, because FROM backup sets
    are applied to the same TO content tree from which they were generated.
    Although a TO tree may be relocated to a different platform before its
    rollback, this seems unlikely, and would be equivalent to the next item.

Delta sets:
    In -restore runs to apply change sets created by deltas.py, the risk of
    content skew in 3.2 looms largest. In this mode, FROM reflects both new
    items and changes to be applied later to matching TO names. Unlike full
    syncs, delta applications compare just deltas, not full content.

    As a result, when applying FROM delta sets, divergent Unicode names in TO
    will be classified as unique and unchanged, and hence skipped rather than
    removed. The net effect will be duplicate files and folders for Unicode
    variants in FROM: the FROM version will be copied over to TO, but the prior
    version in TO will linger as an out-of-sync copy. Moreover, direct Mergeall
    3.2 syncs that could repair the damage by removing the older and now-unique
    TOs are not available on some platforms (e.g., Android 11+: see Android
    Deltas Scripts).

    Subtly, deltas' __added__.txt removal lists must also be adjusted for later
    TOs, because components of Unicode paths may differ arbitrarily.

This cropped up just once after nearly 8 years of Mergeall use, and may require
a platform-specific context to appear at all (there's more on the failure
below). Nevertheless, Mergeall's developer does not use many non-ASCII
filenames, and this may be a larger concern for those who do.

By normalizing filenames for matching in the mergeall.py, diffall.py, and
deltas.py scripts, 3.3 avoids all the potential 3.2 issues listed above, and
ensures data integrity more broadly.

Fine print: Mergeall 3.3 uses Unicode's "canonical" equivalence (e.g., NFC) but
not its "compatibility" (e.g., NFKC), because the program operates in the realm
of engineering, not linguistics. It makes sense to equate code-point sequences
defined to stand for the same character. It does not make sense to guess human
intent: syncs must be deterministic. For more on this topic, see links above.
Also see Additional Notes ahead for more on 3.3's assumptions and choices.


FAILURE DETAILS
===============

Unicode variants actually caused a doppelgänger (duplicate) for a changed file
with a non-ASCII name in the examples/unicode folder of the thumbspage program
(at learning-python.com/thumbspage.html). This file's name was the same, but
used only _decomposed_ form on both macOS and Android 11, and _composed_ form
on Android 10--until a propagate via zipfile from macOS stored the decomposed
form redundantly on Android 10 only.

Here's the post-propagate scene in the same content folder on all three
platforms, as revealed by Python 3.X:

On macOS APFS internal drive (MacBook Pro, Catalina):

    >>> for f in glob.glob('frigcal-L*.png'):          # display encoded forms
    ...     print(f, '=>', f.encode('utf8'))           # just one name flavor
    ...
    frigcal-Liññux.png => b'frigcal-Lin\xcc\x83n\xcc\x83ux.png'

On Android 11 shared storage (Samsung Z Fold3):

    >>> for f in glob.glob('frigcal-L*.png'):          # same as macOS: decomp
    ...     print(f, '=>', f.encode('utf8'))
    ...
    frigcal-Liññux.png => b'frigcal-Lin\xcc\x83n\xcc\x83ux.png'

On Android 10 shared storage (Samsung Note20 Ultra):

    >>> for f in glob.glob('frigcal-L*.png'):          # same-but-different names
    ...     print(f, '=>', f.encode('utf8'))
    ...
    frigcal-Liññux.png => b'frigcal-Li\xc3\xb1\xc3\xb1ux.png'
    frigcal-Liññux.png => b'frigcal-Lin\xcc\x83n\xcc\x83ux.png'

    >>> for f in glob.glob('frigcal-L*.png'):
    ...     print('-'.join(str(ord(c)) for c in f))    # decoded code points
    ...
    102-114-105-103-99-97-108-45-76-105-241-241-117-120-46-112-110-103
    102-114-105-103-99-97-108-45-76-105-110-771-110-771-117-120-46-112-110-103

The same sort of doppelgängers were generated for filename
'pymailgui-spÄÄÄm.png' and the thumbspage HTML files for each in multiple
locations--yielding 21 dups after the files in question were changed and
synced via a deltas set.


PLATFORM VARIATION
==================

More generally, all three of the last section's platforms differ in their
handling of Unicode filename variants, and some automatically map one form to
another. This effectively requires normalization for interoperability.

macOS, for example, allows either form, but stores just one, and auto-maps the
other via name normalization to the form first stored (at least on the current
APFS filesystem; HFS+ formerly enforced a decomposed-only rule):

    # MACOS AUTO-MAPPING (APFS)

    >>> os.chdir('/Users/me/Desktop/temp')

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Li\xc3\xb1ux'

Android 11 shared storage (on a Samsung Z Fold3, at least) similarly maps names
to a single stored form for writes, but fully disallows the composed form by
itself in some contexts. The latter seems an Android 11 bug; the failing open()
call below succeeds in app-specific storage (and on other platforms), and may
fail only if both name forms are created and mapped together first as follows
(the composed form has been seen to work by itself otherwise). This bug is also
out of scope here, but see the Unicode converter note ahead for a work-around
if needed:

    # ANDROID 11 AUTO-MAPPING

    >>> os.chdir('/sdcard/temp')

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

    # ANDROID 11 BUG!

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    FileNotFoundError: [Errno 2] No such file or directory: 'AAA-Liñux'

    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

Android 10 shared storage, by contrast, happily stores both forms on writes,
and doesn't perform any normalization mapping on its own (per a long story,
Android 11 swapped in a new FUSE-based implementation of shared storage to
implement access restrictions, so all bets on 10/11 compatibility are off):

    # ANDROID 10 DUPLICATES

    >>> os.chdir('/sdcard/temp')

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'
    AAA-Liñux => b'AAA-Li\xc3\xb1ux'

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Li\xc3\xb1ux'
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

And on all three devices, the internal code-point representation of a text
literal depends on what is used for the literal. Hence, what is created by an
open() with a literal varies--what you paste is what you'll get:

    >>> len('AAA-Liñux'), 'AAA-Liñux'.encode('utf8')    # pasted NFD form
    (10, b'AAA-Lin\xcc\x83ux')
    >>>
    >>> len('AAA-Liñux'), 'AAA-Liñux'.encode('utf8')      # pasted NFC form
    (9, b'AAA-Li\xc3\xb1ux')
    >>>
    >>> open('AAA-Liñux', 'w').close()                    # varies per literal

In fact, it's even possible to generate mixed-format paths on macOS; the first
of the following has both decomposed and composed components, and was generated
by an 'echo' in the Bash shell:

    >>> for f in glob.glob('FROM/dir-Liñux/*'): print(f, '=>', f.encode('utf8'))
    ...
    FROM/dir-Liñux/nested3-Liñux.txt => b'FROM/dir-Lin\xcc\x83ux/nested3-Li\xc3\xb1ux.txt'
    FROM/dir-Liñux/nested1-Liñux.txt => b'FROM/dir-Lin\xcc\x83ux/nested1-Lin\xcc\x83ux.txt'

Finally, from the anti-interoperability department, macOS's exFAT driver
changes composed-form filenames to decomposed form on file writes. This
guarantees a perpetual diff in Mergeall 3.2 when composed names are copied to
exFAT (a later sync will remove the decomposed form--and then rewrite it
again), and fully requires Mergeall 3.3 normalization to ignore the skew:

    >>> os.chdir('/Volumes/T7/_test')
    >>> for f in glob.glob('file3*'): os.remove(f)

    >>> len(b'\xc3\xb1'.decode('utf8'))
    1
    >>> open(b'file3-Li\xc3\xb1ux.txt'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('file3*'):
    ...     print(f, '=>', f.encode('utf8'))
    ...
    file3-Liñux.txt => b'file3-Lin\xcc\x83ux.txt'

For the record, Windows, Linux, Android 11 app-specific storage, and Pixel
Android 12 shared storage were not involved in the failure, but all four store
both name forms just like Android 10 shared storage's results above. Given the
reach of these platforms, name mismatches may be common.

Also for the record, all devices tested are configured to use US English in
their settings, though this seems unlikely to influence these results. Some
additional variables (e.g., FAT32 drives) were not tested for this analysis,
because they are in some sense irrelevant: whatever its causes, Unicode skew
is a vendor-supported reality that must be accommodated.


CONCLUSIONS
===========

In the observed failure, auto-mapping like that demoed in the preceding section
may have masked filenames' Unicode discord until the forgery was taken out of
the picture by zipfiles. But this is clearly a much broader platform-,
version-, and filesystem-specific nightmare. Alas, rather than removing this
Babel, the Unicode standard seems to have simply enabled it.

It's less clear how the unique filename forms arose on Android 10 alone. The
subject files have traversed many platforms and drives--including FAT32 and
exFAT, which were both used to sync the Android 10 device in the recent past,
and the fully decomposed HFS+, which was used on macOS devices in the further
past. Android shared storage's quasi emulation of FAT32 is always a prime
suspect in such border cases too: Android 10's lack of normalization may have
been a defect, and Android 11 may have been fixed (and broken!), but Android's
frequent auto-updates make forensics nearly impossible.

Ultimately, though, a change to the files on macOS created the duplicate in
Mergeall 3.2: the macOS and Android 10 filenames did not match while applying
updates recorded in the deltas zipfile, and macOS's version was classified as
unique instead of changed--and thus copied redundantly alongside the original.

While a direct sync in Mergeall 3.2 would remove the former TO variant and so
repair this skew specifically, this is not a solution generally (large folder
mismatches and their copies are still problematic in syncs); won't help for
blatant skewers like macOS's exFAT driver (whose diffs are forever); and is
not an available option on Android 11+ (see Android Deltas Scripts).

Mergeall 3.3's normalization is the only full fix for this problem. By
neutralizing filename skew in comparisons, 3.3 sidesteps an arguable failing
of the Unicode standard, at a small cost in backward-incompatible behavior.


ADDITIONAL NOTES
================

Bytes API:
    Passing bytes to Python file interfaces to avoid Unicode decoding won't
    help here. Filenames that differ are different in both their encoded bytes
    format, and their decoded code-point format. This is true regardless of
    the encoding scheme used to convert to/from bytes.

Nonportable filenames:
    Unicode skew is similar to the problem of nonportable filename characters,
    but the auto-adjustment here is safer. Auto-mangling nonportable characters
    for comparisons can lead to data loss when the mangled form is the same as
    another filename coincidentally. No known use case requires both a composed
    and decomposed Unicode-format name.

Normalization assumption:
    To expand on the prior point, 3.3's Unicode normalization assumes that the
    same name is not intentionally present in multiple Unicode forms in the
    same folder. If two equivalent forms are present, syncs will arbitrarily
    update either of them. This is potentially lossy, but seems far less
    likely than forms differing across content copies inadvertently.

    Indeed, the Unicode standard defines differing forms to be canonically
    equivalent, so their redundant presence is really an error state allowed by
    some filesystems. It seems astronomically unlikely that such a state would
    be created by design, but please report contrary use cases.

    Note: direct syncs, where possible, can remove doppelgängers, but only if
    you run the syncs with Mergeall 3.2; 3.3 will equate names instead.

Unicode converter:
    A new utility script which changes the Unicode format of all items in a
    folder tree (e.g., to NFD) is available in Android Deltas Scripts here:
    https://learning-python.com/android-deltas-scripts/_etc/convertunicode.py.

    This isn't shipped in Mergeall, as it's not required for normalization,
    but may be generally useful when propagating to Android 11 shared storage.
    Run it if needed to work around the bug illustrated above (see "BUG!").
    Unlike syncs, this utility script skips files with multiple name forms.

    Note: this script can also be used to display names not in a selected
    format, and hence isolate any doppelgängers in your content.

Deltas contexts:
    The normalization code here runs twice for deltas: at creation and apply
    times, which handle skew at the origin and destination, respectively.

    Nit: the resolution phase logic used for both deltas and rollbacks could be
    recoded separately (it simply deletes __added__.txt items and copies over
    FROM items). This would skip a pointless comparison phase, make run logs
    simpler, and eliminate some convolution for normal syncs, but would also
    have to handle normalization and more separately and redundantly.

Change lists:
    The "diffs" changes list in mergeall.py must now record both FROM and TO
    names. The original TO is fetched on demand in the comparison phase, but
    this won't work in the resolution phase because each tree level has its
    own originals dicts. Hence, the changes-save code of deltas.py required
    mods too (even though it simply saves FROM items to a separate folder,
    whose content will be mapped here to a TO later).

    Unique-item lists required no such structural change for names (they are
    FROM/TO specific), but must record unique names in unnormalized form for
    use in resolution.

Backups fix:
    backup.py's removeprioradds() required a related fix, because names saved
    in __added__.txt at deltas-create time may not match unnormalized TOs at
    deltas-apply time. As names may differ arbitrarily when applying a deltas
    set to a different device, that function must inspect every component of a
    saved pathname and adjust as needed. For implementation details, see this
    file's new matchUnicodePathnames() below.
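
Doppelgänger check:
    The "Normalization assumption" note above can be checked on a folder's
    names before a sync. What follows is only a minimal sketch, and is not
    part of Mergeall: the function name findFormDups is hypothetical, and the
    code assumes Python 3.X decoded str filenames. It groups names by their
    NFC form and reports each form that has more than one spelling.

```python
import unicodedata
from collections import defaultdict

def findFormDups(names):
    # hypothetical helper (not in Mergeall): map each NFC form to the
    # distinct original spellings that normalize to it
    groups = defaultdict(set)
    for name in names:
        groups[unicodedata.normalize('NFC', name)].add(name)
    # keep only the forms that have multiple spellings, as sorted lists
    return {nfc: sorted(origs)
            for (nfc, origs) in groups.items() if len(origs) > 1}

nfd = b'Lin\xcc\x83ux.png'.decode('utf8')      # decomposed spelling
nfc = b'Li\xc3\xb1ux.png'.decode('utf8')       # composed spelling
dups = findFormDups([nfd, nfc, 'plain.txt'])
for (form, origs) in dups.items():
    print(form, '=>', [orig.encode('utf8') for orig in origs])
```

    In practice the names would likely come from os.listdir() on each folder
    in a tree walk; an empty result means the assumption holds for that folder.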

Diffall fix:
    diffall.py imports and uses this file's normalizeUnicodeFilenames() too,
    to normalize filenames for comparison of directory lists. It's a very
    broad issue.
================================================================================
"""


##################
# CODE STARTS HERE
##################


from __future__ import print_function         # Py 2.X compatibility
import sys, os, unicodedata                   # stdlib normalizer tool
from fixlongpaths import FWP                  # Windows long paths

error = print                                 # always display error messages



def normalizeUnicode(name):
    """
    ----------------------------------------------------------------------------
    [3.3] Normalize a decoded Unicode filename string into its composed form
    (NFC).

    This was split off to a separate function from the next, because it is
    also used to compare symlink paths: they may differ between FROM and TO,
    and in both mergeall.py and diffall.py. Also used by dirdiff.py script mode.

    Caveat: Python 2.X requires decoded unicode in normalize(), not encoded str,
    but 2.X is using str (a.k.a. bytes in 3.X) everywhere, obtained from file
    APIs. It's unclear what encoding to use here: UTF-8 is broad but may not
    match some platform or drive defaults, yet platform defaults may not apply
    to files on external drives. This guesses, but Mergeall now very strongly
    recommends Python 3.X for content with non-ASCII Unicode filenames--and
    defers to 3.X for proper, or at least best possible, filename decoding.
    ----------------------------------------------------------------------------
    """
    if sys.version_info[0] >= 3:
        # python 3.X: all unicode
        return unicodedata.normalize('NFC', name)          # play well with others
    else:
        # python 2.X: str=>unicode=>str, partly a guess
        try:
            # getfilesystemencoding() may be None on some 2.X Unix setups
            tryenc = sys.getfilesystemencoding() or 'utf-8'
            return unicodedata.normalize('NFC', name.decode(tryenc)).encode(tryenc)
        except:
            error('**Error normalizing 2.X Unicode name:', name)
            raise   # reraise error to stop and avoid data loss



def normalizeUnicodeFilenames(names, namesdir, trace=print):
    """
    ----------------------------------------------------------------------------
    [3.3] Normalize all file/folder/link names in the argument list to NFC
    (composed) Unicode representation, and return a dictionary mapping any
    normalized names back to their original unnormalized forms.

    All filename comparisons are performed with the normalized (equated) forms,
    but all file operations map back to the unnormalized forms (for both to and
    from):

    Name comparison:
        namesto, origtos = normalizeUnicodeFilenames(namesto, dirto)
        # compare namesto

    File access:
        origto = origtos.get(nameto, nameto)
        # access origto

    This function is used by both mergeall.py syncs and diffall.py comparisons.
    Both clients must map back to original names to access the filesystem.

    mergeall.py must also delete a TO file before copying over a matched FROM
    if the two items' names differ; in this case, simply creating the FROM file
    in TO is not enough to replace, and will produce a duplicate.

    Important: the initial delete should be run only when names differ, which
    is typically rare. An initial coding that always deleted before copying
    ran a deltas sync some 14X slower on Android 11 shared storage for a
    comparable changes set (28 minutes versus 2, @10k changes), due to that
    filesystem's deletion sloth.

    Note: this does not detect form dups, per "Normalization assumption" above;
    it could check for names membership, but the callers' response is unclear.

    See this file's top docstring for more info, and "[3.3]" in code for usage.
    ----------------------------------------------------------------------------
    """
    orignames = {}
    newnames = []
    for name in names:
        altname = normalizeUnicode(name)       # convert to common form to compare
        if altname == name:                    # unless it's already there
            newnames.append(name)
        else:
            newnames.append(altname)           # save original form for file access
            orignames[altname] = name
            trace('--Unicode normalized for name:', name, 'in:', namesdir)
    return newnames, orignames



def matchUnicodePathnames(delpath, trace=print, verbose=True):
    u"""
    ------------------------------------------------------------------------------------
    [3.3] When applying deltas with -restore, any part of an item's pathname in
    __added__.txt may be equivalent but different from its counterpart in TO.
    For example, the filename 'Liñux.png' has two different Unicode formats, and
    may appear anywhere along the listed delpath pathname.

    This can happen because TO in delta applies is not necessarily the same TO
    device used to make the deltas, and platforms and tools may treat such same
    but unequal filenames differently. To adjust, scan the path to find the true
    names of each part in the new TO.

    This is a no-op most of the time, and is useless but harmless for -restore
    rollbacks (where, unlike deltas, TO is always the same TO that was changed
    earlier). See also the docstring at the top of this file, and the
    backup.removeprioradds() call.
    ------------------------------------------------------------------------------------
    """
    debug1, debug2 = False, False                 # force loop/norm on macos
    exists, join = os.path.exists, os.path.join

    if exists(FWP(delpath)) and not debug1:
        return delpath                            # skip the drama for most cases
    else:
        delparts = delpath.split(os.sep)          # already changed to use os.sep
        altpath = ''
        while delparts:
            # pop front part off delparts (no star-unpacking: 2.X can't parse it)
            front, delparts = delparts[0], delparts[1:]

            if front == '' and altpath == '':
                # leading '' from an absolute path: start at the root,
                # else join('', '') never exists and the scan punts at once
                altpath = os.sep
                continue

            tmppath = join(altpath, front)
            if exists(FWP(tmppath)) and not debug2:
                # this part okay
                altpath = tmppath
                if verbose: trace('--Path okay:', tmppath)
            else:
                # fix this part
                nfc = unicodedata.normalize('NFC', front)
                nfd = unicodedata.normalize('NFD', front)
                if front == nfc:
                    altpath = join(altpath, nfd)  # failed as NFC: replace with NFD
                else:
                    altpath = join(altpath, nfc)  # failed as NFD: replace with NFC

                if exists(FWP(altpath)):
                    trace('--Unicode morphed for name:', front, 'in:', altpath)
                else:
                    error('--Unicode morph failed for name:', front, 'in:', altpath)
                    altpath = delpath
                    break   # still not found?: punt
        return altpath