# -*- coding: utf-8 -*-
r"""
================================================================================
fixunicodedups.py:
  Normalize Unicode filenames for matching (part of the Mergeall system [3.3]).


SYNOPSIS
========

This module defines tools to handle differing code-point representations
of the same filename--an odd border case allowed under the Unicode standard.
If these equivalent but different variants are not mapped (a.k.a. normalized)
to the same common form for comparisons, filename matching may fail, causing
potentially lossy skew in syncs.

To avoid this, matching now uses normalized filenames, but filesystem access
still uses names' original, unnormalized forms.  This rule is applied in the
mergeall, deltas, and diffall scripts.

In addition, symlink paths require a similar adjustment for compares;
components in pathnames recorded from one device must be mapped to others;
and the "-quiet" flag has been overloaded to also suppress normalization
messages in all three scripts.

This file largely documents this subtle issue, but also codes fixes,
including both name mappers and a path-normalization walker.


BACKGROUND
==========

Name normalization is required because the Unicode standard has a massive
hole: it allows the same text to be represented with different code-point
strings.  This in turn requires equality to be generalized, to avoid issues
when the same string uses different Unicode forms across platforms or devices.
This impacts all tools that search or compare text, but can lead to content
skew in sync tools like Mergeall if Unicode-variant filenames are present and
not normalized.

As an example, the filename string 'Liñux.png' may take two equivalent but
different decoded forms per the Unicode standard, which will not match by
simple equality (e.g., '==' and 'in' in Python):

    >>> l = b'Lin\xcc\x83ux.png'.decode('utf-8')      # decomposed (NFD)
    >>> m = b'Li\xc3\xb1ux.png'.decode('utf-8')       # composed (NFC)

    >>> l, m
    ('Liñux.png', 'Liñux.png')                        # equivalent text

    >>> len(l), len(m)                                # len(code points)
    (10, 9)

    >>> l == m                                        # same but different!
    False

The first of these uses the longer NFD format; notice its encoded 'n\x...'.
The second instead uses the shorter NFC form, with a dedicated ñ code point.
Because different tools and platforms may use either form, both may be
present, which can wreak havoc with programs that process text or filenames.

To do better, programs must normalize to a common form prior to matches:

    >>> from unicodedata import normalize             # Python stdlib tool
    >>> normalize('NFC', l) == normalize('NFC', m)    # or 'NFD': any common
    True

Python's unicodedata module provides the required conversions to make text
agree.  For more background on this very dark Unicode corner, see:

    https://en.wikipedia.org/wiki/Unicode_equivalence
    https://docs.python.org/3.6/library/unicodedata.html#unicodedata.normalize

This ultimately seems like a failure of the Unicode standard to fix a huge
interoperability problem, but is now accommodated in Mergeall 3.3 per above.


MERGEALL IMPACT
===============

In terms of Mergeall specifically, this issue arose when applying a deltas.py
changes set to another device and platform via the Android Deltas Sync
package, which uses Mergeall as a nested tool.  See the package's docs here:
https://learning-python.com/android-deltas-syncs/_README.html.  In that
system, deltas are computed between a PC and proxy drive, and later applied
on a phone from a copied zipfile.
Because filenames on the PC and proxy may be arbitrarily different from those
on the phone, Unicode-variant skew is a potential whenever applying deltas
between devices with differing policies.  Moreover, zipfiles may skirt name
mapping used in other contexts.

While this issue materialized in delta sets, its risks prior to 3.3's
normalization fix vary by the use case for which mergeall.py is used:

Basic syncs:
    In basic runs for direct merges of full content copies, the worst outcome
    appears to be one-time pointless copies of folders.

    Basic file changes work in 3.2: skewed names will be classified as unique
    in FROM and TO, but the FROM version will still replace the TO via a
    delete and copy, instead of a replace.

    Folder names are more problematic: a FROM folder with a varying name may
    be copied in its entirety to TO as a unique item, rather than being
    traversed as a same-name match to a nested diff.  The net result is simply
    a single pointless copy of a FROM folder to TO: because TO is also deleted
    as unique, FROM and TO will match thereafter.  Even so, this is subpar;
    although it does not damage content, it may still be costly for large
    folders.

Rollbacks:
    In -restore runs to rollback prior changes, there is little risk of
    damage.  Name-variant mismatches are not generally possible, because FROM
    backup sets are applied to the same TO content tree from which they were
    generated.  Although a TO tree may be relocated to a different platform
    before its rollback, this seems unlikely, and would be equivalent to the
    next item.

Delta sets:
    In -restore runs to apply change sets created by deltas.py, the risk of
    content skew in 3.2 looms largest.  In this mode, FROM reflects both new
    items, as well as changes to be applied later to matching TO names.
    Unlike full syncs, delta applications compare just deltas, not full
    content.  As a result, when applying FROM delta sets, divergent Unicode
    names in TO will be classified as unique and unchanged, and hence skipped
    rather than removed.
    The net effect will be duplicate files and folders for Unicode variants
    in FROM: the FROM version will be copied over to TO, but the prior version
    in TO will linger as an out-of-sync copy.  Moreover, direct Mergeall 3.2
    syncs that could repair the damage by removing the older and now-unique
    TOs are not available on some platforms (e.g., Android 11+: see Android
    Deltas Sync).

    Subtly, deltas' __added__.txt removal lists must also be adjusted for
    later TOs, because components of Unicode paths may differ arbitrarily.

This cropped up just once after nearly 8 years of Mergeall use, and may
require a platform-specific context to appear at all (there's more on the
failure below).  Nevertheless, Mergeall's developer does not use many
non-ASCII filenames, and this may be a larger concern for those who do.

By normalizing filenames for matching in the mergeall.py, diffall.py, and
deltas.py scripts, 3.3 avoids all the potential 3.2 issues listed above,
and ensures data integrity more broadly.

Fine print: Mergeall 3.3 uses Unicode's "canonical" equivalence (e.g., NFC)
but not its "compatibility" (e.g., NFKC), because the program operates in the
realm of engineering, not linguistics.  It makes sense to equate code-point
sequences defined to stand for the same character.  It does not make sense to
guess human intent: syncs must be deterministic.  For more on this topic, see
links above.  Also see Additional Notes ahead for more on 3.3's assumptions
and choices.


FAILURE DETAILS
===============

Unicode variants actually caused a doppelgänger (duplicate) for a changed file
with a non-ASCII name in the examples/unicode folder of the thumbspage program
(at learning-python.com/thumbspage.html).  This file's name was the same, but
used only _decomposed_ form on both macOS and Android 11, and _composed_ form
on Android 10--until a propagate via zipfile from macOS stored the decomposed
form redundantly on Android 10 only.

Here's the post-propagate scene in the same content folder on all three
platforms, as revealed by Python 3.X:

On macOS APFS internal drive (MacBook Pro, Catalina):

    >>> for f in glob.glob('frigcal-L*.png'):         # display encoded forms
    ...     print(f, '=>', f.encode('utf8'))          # just one name flavor
    ...
    frigcal-Liññux.png => b'frigcal-Lin\xcc\x83n\xcc\x83ux.png'

On Android 11 shared storage (Samsung Z Fold3):

    >>> for f in glob.glob('frigcal-L*.png'):         # same as macOS: decomp
    ...     print(f, '=>', f.encode('utf8'))
    ...
    frigcal-Liññux.png => b'frigcal-Lin\xcc\x83n\xcc\x83ux.png'

On Android 10 shared storage (Samsung Note20 Ultra):

    >>> for f in glob.glob('frigcal-L*.png'):         # same-but-different names
    ...     print(f, '=>', f.encode('utf8'))
    ...
    frigcal-Liññux.png => b'frigcal-Li\xc3\xb1\xc3\xb1ux.png'
    frigcal-Liññux.png => b'frigcal-Lin\xcc\x83n\xcc\x83ux.png'

    >>> for f in glob.glob('frigcal-L*.png'):
    ...     print('-'.join(str(ord(c)) for c in f))   # decoded code points
    ...
    102-114-105-103-99-97-108-45-76-105-241-241-117-120-46-112-110-103
    102-114-105-103-99-97-108-45-76-105-110-771-110-771-117-120-46-112-110-103

The same sort of doppelgängers were generated for filename
'pymailgui-spÄÄÄm.png' and the thumbspage HTML files for each in multiple
locations--yielding 21 dups after the files in question were changed and
synced via a deltas set.


PLATFORM VARIATION
==================

More generally, all three of the last section's platforms differ in their
handling of Unicode filename variants, and some automatically map one form
to another.  This effectively requires normalization for interoperability.

macOS, for example, allows either form, but stores just one, and auto-maps
the other via name normalization to the form first stored (at least on the
current APFS filesystem; HFS+ formerly enforced a decomposed-only rule):

    # MACOS AUTO-MAPPING (APFS)

    >>> os.chdir('/Users/me/Desktop/temp')

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Li\xc3\xb1ux'

Android 11 shared storage (on a Samsung Z Fold3, at least) similarly maps
names to a single stored form for writes, but fully disallows the composed
form by itself in some contexts.  The latter seems an Android 11 bug; the
failing open() call below succeeds in app-specific storage (and on other
platforms), and may fail only if both name forms are created and mapped
together first as follows (the composed form has been seen to work by itself
otherwise).  This bug is also out of scope here, but see the Unicode converter
note ahead for a work-around if needed:

    # ANDROID 11 AUTO-MAPPING

    >>> os.chdir('/sdcard/temp')

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

    # ANDROID 11 BUG!
    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    FileNotFoundError: [Errno 2] No such file or directory: 'AAA-Liñux'
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

Android 10 shared storage, by contrast, happily stores both forms on writes,
and doesn't perform any normalization mapping on its own (per a long story,
Android 11 swapped in a new FUSE-based implementation of shared storage to
implement access restrictions, so all bets on 10/11 compatibility are off):

    # ANDROID 10 DUPLICATES

    >>> os.chdir('/sdcard/temp')

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'
    AAA-Liñux => b'AAA-Li\xc3\xb1ux'

    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Li\xc3\xb1ux'
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

And on all three devices, the internal code-point representation of a text
literal depends on what is used for the literal.
Hence, what is created by an open() with a literal varies--what you paste is
what you'll get:

    >>> len('AAA-Liñux'), 'AAA-Liñux'.encode('utf8')      # pasted NFD form
    (10, b'AAA-Lin\xcc\x83ux')
    >>>
    >>> len('AAA-Liñux'), 'AAA-Liñux'.encode('utf8')      # pasted NFC form
    (9, b'AAA-Li\xc3\xb1ux')
    >>>
    >>> open('AAA-Liñux', 'w').close()                    # varies per literal

In fact, it's even possible to generate mixed-format paths on macOS; the
first of the following has both decomposed and composed components, and was
generated by an 'echo' in the Bash shell:

    >>> for f in glob.glob('FROM/dir-Liñux/*'): print(f, '=>', f.encode('utf8'))
    ...
    FROM/dir-Liñux/nested3-Liñux.txt => b'FROM/dir-Lin\xcc\x83ux/nested3-Li\xc3\xb1ux.txt'
    FROM/dir-Liñux/nested1-Liñux.txt => b'FROM/dir-Lin\xcc\x83ux/nested1-Lin\xcc\x83ux.txt'

Finally, from the anti-interoperability department, macOS's exFAT driver
changes composed-form filenames to decomposed form on file writes.  This
guarantees a perpetual diff in Mergeall 3.2 when composed names are copied
to exFAT (a later sync will remove the decomposed form--and then rewrite it
again), and fully requires Mergeall 3.3 normalization to ignore the skew:

    >>> os.chdir('/Volumes/T7/_test')
    >>> for f in glob.glob('file3*'): os.remove(f)

    >>> len(b'\xc3\xb1'.decode('utf8'))
    1
    >>> open(b'file3-Li\xc3\xb1ux.txt'.decode('utf8'), 'w').close()

    >>> for f in glob.glob('file3*'):
    ...     print(f, '=>', f.encode('utf8'))
    ...
    file3-Liñux.txt => b'file3-Lin\xcc\x83ux.txt'

For the record, Windows, Linux, Android 11 app-specific storage, and Pixel
Android 12 shared storage were not involved in the failure, but all four
store both name forms just like Android 10 shared storage's results above.
Given the reach of these platforms, name mismatches may be common.

Also for the record, all devices tested are configured to use US English in
their settings, though this seems unlikely to influence these results.
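
For readers who want to repeat the platform checks above elsewhere, here is a
minimal sketch of a probe.  It is not part of Mergeall: the function name
probe_autonormalize and the probe filename are hypothetical, and it assumes
write access to the folder tested.  It reports only what os.path.exists()
sees, so a platform that stores both forms separately (like Android 10 shared
storage above) should yield False:

```python
import os
import shutil
import tempfile
import unicodedata


def probe_autonormalize(folder):
    # Hypothetical helper, not part of Mergeall: return True if the
    # filesystem hosting folder finds an NFD-stored name by its NFC form.
    nfc = unicodedata.normalize('NFC', 'probe-Liñux.tmp')    # composed form
    nfd = unicodedata.normalize('NFD', nfc)                  # decomposed form
    workdir = tempfile.mkdtemp(dir=folder)                   # scratch subfolder
    try:
        open(os.path.join(workdir, nfd), 'w').close()        # store NFD only
        return os.path.exists(os.path.join(workdir, nfc))    # NFC matches it?
    finally:
        shutil.rmtree(workdir, ignore_errors=True)           # always clean up


if __name__ == '__main__':
    print('auto-normalizes:', probe_autonormalize(tempfile.gettempdir()))
```

The result applies only to the filesystem hosting the folder passed in; as
the findings above show, it may differ per drive and storage type on the
same device.
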
Some additional variables (e.g., FAT32 drives) were not tested for this
analysis, because they are in some sense irrelevant: whatever its causes,
Unicode skew is a vendor-supported reality that must be accommodated.


CONCLUSIONS
===========

In the observed failure, auto-mapping like that demoed in the preceding
section may have masked filenames' Unicode discord until the forgery was
taken out of the picture by zipfiles.  But this is clearly a much broader
platform-, version-, and filesystem-specific nightmare.  Alas, rather than
removing this Babel, the Unicode standard seems to have simply enabled it.

It's less clear how the unique filename forms arose on Android 10 alone.
The subject files have traversed many platforms and drives--including FAT32
and exFAT, which were both used to sync the Android 10 device in the recent
past, and the fully decomposed HFS+, which was used on macOS devices in the
further past.  Android shared storage's quasi emulation of FAT32 is always a
prime suspect in such border cases too: Android 10's lack of normalization
may have been a defect, and Android 11 may have been fixed (and broken!),
but Android's frequent auto-updates make forensics nearly impossible.

Ultimately, though, a change to the files on macOS created the duplicate in
Mergeall 3.2: the macOS and Android 10 filenames did not match while applying
updates recorded in the deltas zipfile, and macOS's version was classified as
unique instead of changed--and thus copied redundantly alongside the original.

While a direct sync in Mergeall 3.2 would remove the former TO variant and so
repair this skew specifically, this is not a solution generally (large folder
mismatches and their copies are still problematic in syncs); won't help for
blatant skewers like macOS's exFAT driver (whose diffs are forever); and is
not an available option on Android 11+ (see Android Deltas Sync).

Mergeall 3.3's normalization is the only full fix for this problem.
By neutralizing filename skew in comparisons, 3.3 sidesteps an arguable
failing of the Unicode standard, at a small cost in backward-incompatible
behavior.


ADDITIONAL NOTES
================

Bytes API:
    Passing bytes to Python file interfaces to avoid Unicode decoding won't
    help here.  Filenames that differ are different in both their encoded
    bytes format, and their decoded code-point format.  This is true
    regardless of the encoding scheme used to convert to/from bytes.

Nonportable filenames:
    Unicode skew is similar to the problem of nonportable filename characters,
    but the auto-adjustment here is safer.  Auto-mangling nonportable
    characters for comparisons can lead to data loss when the mangled form is
    the same as another filename coincidentally.  No known use case requires
    both a composed and decomposed Unicode-format name.

Normalization assumption:
    To expand on the prior point, 3.3's Unicode normalization assumes that
    the same name is not intentionally present in multiple Unicode forms in
    the same folder.  If two equivalent forms are present, syncs will
    arbitrarily update either of them.  This is potentially lossy, but seems
    far less likely than forms differing across content copies inadvertently.

    Indeed, the Unicode standard defines differing forms to be canonically
    equivalent, so their redundant presence is really an error state allowed
    by some filesystems.  It seems astronomically unlikely that such a state
    would be created by design, but please report contrary use cases.

    Note: direct syncs, where possible, can remove doppelgängers, but only
    if you run the syncs with Mergeall 3.2; 3.3 will equate names instead.

Unicode converter:
    A new utility script which changes the Unicode format of all items in a
    folder tree (e.g., to NFD) is available in Android Deltas Sync here:
    https://learning-python.com/android-deltas-syncs/_etc/convertunicode.py.
    This isn't shipped in Mergeall, as it's not required for normalization,
    but may be generally useful when propagating to Android 11 shared storage.
    Run it if needed to work around the bug illustrated above (see "BUG!").
    Unlike syncs, this utility script skips files with multiple name forms.

    Note: this script can also be used to display names not in a selected
    format, and hence isolate any doppelgängers in your content.

Deltas contexts:
    The normalization code here runs twice for deltas: at creation and apply
    times, which handle skew at the origin and destination, respectively.

    Nit: the resolution phase logic used for both deltas and rollbacks could
    be recoded separately (it simply deletes __added__.txt items and copies
    over FROM items).  This would skip a pointless comparison phase, make run
    logs simpler, and eliminate some convolution for normal syncs, but would
    also have to handle normalization and more separately and redundantly.

Change lists:
    The "diffs" changes list in mergeall.py must now record both FROM and TO
    names.  The original TO is fetched on demand in the comparison phase, but
    this won't work in the resolution phase because each tree level has its
    own originals dicts.  Hence, the changes-save code of deltas.py required
    mods too (even though it simply saves FROM items to a separate folder,
    whose content will be mapped here to a TO later).

    Unique-item lists required no such structural change for names (they are
    FROM/TO specific), but must record unique names in unnormalized form for
    use in resolution.

Paths-normalization fix:
    backup.py's removeprioradds() required a related fix, because names saved
    in __added__.txt at deltas-create time may not match unnormalized TOs at
    deltas-apply time.  As names may differ arbitrarily when applying a deltas
    set to a different device, that function must inspect every component of a
    saved pathname and adjust as needed.  In Android Deltas Sync, for example,
    deltas are created for a proxy drive but applied later to a phone.
    For implementation details, see this file's matchUnicodePathnames() below.

Diffall fix:
    diffall.py imports and uses this function too, to normalize filenames
    for comparison of directory lists.  It's a very broad issue.
================================================================================
"""


##################
# CODE STARTS HERE
##################


from __future__ import print_function         # Py 2.X compatibility
import sys, os, unicodedata                   # stdlib normalizer tool
from fixlongpaths import FWP                  # Windows long paths
error = print                                 # always display error messages



def normalizeUnicode(name):
    """
    ----------------------------------------------------------------------------
    [3.3] Normalize a decoded Unicode filename string into its composed form
    (NFC).  This was split off to a separate function from the next, because
    it is also used to compare symlink paths: they may differ between FROM
    and TO, and in both mergeall.py and diffall.py.  Also used by dirdiff.py
    script mode.

    Caveat: Python 2.X requires decoded unicode in normalize(), not encoded
    str, but 2.X is using str (a.k.a. bytes in 3.X) everywhere, obtained from
    file APIs.  It's unclear what encoding to use here: UTF-8 is broad but may
    not match some platform or drive defaults, yet platform defaults may not
    apply to files on external drives.  This guesses, but Mergeall now very
    strongly recommends Python 3.X for content with non-ASCII Unicode
    filenames--and defers to 3.X for proper, or at least best possible,
    filename decoding.
    ----------------------------------------------------------------------------
    """
    if sys.version_info[0] >= 3:
        # python 3.X: all unicode
        return unicodedata.normalize('NFC', name)      # play well with others
    else:
        # python 2.X: str=>unicode=>str, partly a guess
        try:
            tryenc = sys.getfilesystemencoding()       # or hardcode utf-8?
            return unicodedata.normalize('NFC', name.decode(tryenc)).encode(tryenc)
        except:
            error('**Error normalizing 2.X Unicode name:', name)
            raise   # reraise error to stop and avoid data loss



def normalizeUnicodeFilenames(names, namesdir, trace=print):
    """
    ----------------------------------------------------------------------------
    [3.3] Normalize all file/folder/link names in the argument list to NFC
    (composed) Unicode representation, and return a dictionary mapping any
    normalized names back to their original unnormalized forms.

    All filename comparisons are performed with the normalized (equated)
    forms, but all file operations map back to the unnormalized forms (for
    both to and from):

    Name comparison:
        namesto, origtos = normalizeUnicodeFilenames(namesto, dirto)
        # compare namesto
    File access:
        origto = origtos.get(nameto, nameto)
        # access origto

    This function is used by both mergeall.py syncs and diffall.py comparisons.
    Both clients must map back to original names to access the filesystem.

    mergeall.py must also delete a TO file before copying over a matched FROM
    if the two items' names differ; in this case, simply creating the FROM
    file in TO is not enough to replace, and will produce a duplicate.

    Important: the initial delete should be run only when names differ, which
    is typically rare.  An initial coding that always deleted before copying
    ran a deltas sync some 14X slower on Android 11 shared storage for a
    comparable changes set (28 minutes versus 2, @10k changes), due to that
    filesystem's deletion sloth.

    Note: this does not detect form dups, per "Normalization assumption"
    above; it could check for names membership, but the callers' response is
    unclear.  See this file's top docstring for more info, and "[3.3]" in code
    for usage.
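
The form-dups check mentioned in the note above could be sketched roughly as
follows.  This is an illustration under assumptions, not Mergeall code:
find_form_dups is a hypothetical name, and what a caller should do with its
result remains the open question noted above.

```python
import unicodedata
from collections import defaultdict


def find_form_dups(names):
    # Hypothetical helper, not part of Mergeall: group names by their NFC
    # form, and keep only groups holding more than one original name.
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize('NFC', name)].append(name)
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


composed = unicodedata.normalize('NFC', 'Liñux.png')       # 9 code points
decomposed = unicodedata.normalize('NFD', 'Liñux.png')     # 10 code points
dups = find_form_dups([composed, decomposed, 'plain.txt'])
print(len(dups), 'name(s) present in multiple Unicode forms')
```

Given one folder's listing, the keys of the result are the normalized names
that collide, and each value lists the original forms stored on disk.
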
    ----------------------------------------------------------------------------
    """
    orignames = {}
    newnames = []
    for name in names:
        altname = normalizeUnicode(name)       # convert to common form to compare
        if altname == name:                    # unless it's already there
            newnames.append(name)
        else:
            newnames.append(altname)           # save original form for file access
            orignames[altname] = name
            trace('--Unicode normalized for name:', name, 'in:', namesdir)
    return newnames, orignames



def matchUnicodePathnames(delpath, trace=print, verbose=True):
    r"""
    ------------------------------------------------------------------------------------
    [3.3] Normalize Unicode components in delpath to match the same path in
    TO, so delpath can be deleted from TO on all platforms.  Return delpath
    unchanged and immediately if it exists in TO initially; this applies to
    both paths without any Unicode variants, and platforms which
    auto-normalize paths.  Else, return a changed delpath with its components
    morphed to Unicode NFC/NFD variants that match those in the path of the
    item physically stored in TO.

    delpath is the result of joining the TO content-tree's root with a
    relative pathname in __added__.txt.  __added__.txt paths name a TO item to
    be deleted (an add for rollbacks, and a unique TO for deltas), and are
    always partial, and relative to both FROM and TO content roots (they
    record just the path below the content root folders).  The full delpath
    may be absolute or relative, and its path separators have already been
    changed to that of the host platform for portability.

    When applying delta sets with -restore, any part of a to-be-deleted item's
    pathname in __added__.txt may be equivalent but different from its
    counterpart in TO.  For example, the filename 'Liñux.png' has two
    different Unicode formats, and either may appear anywhere along both
    delpath and its counterpart stored in TO.
    This can happen because TO in delta applies is not necessarily the same TO
    device used to make the __added__.txt paths, and platforms and tools
    through which content has passed may treat such same-but-unequal filenames
    arbitrarily (more ahead).

    To adjust, scan the path to find the true names of each part stored in TO,
    by checking existence along the way.  This is a no-op most of the time,
    and is generally useless but harmless for -restore rollbacks (where,
    unlike deltas, TO should be the same TO that was changed earlier).  To
    support both deltas and worst cases, though, the algorithm here handles
    arbitrarily mixed NFC/NFD names anywhere along the same path on both FROM
    and TO.

    See also the docstring at the top of this file, and the call to this in
    backup.removeprioradds().

    ----
    Why this is necessary:

    Although Unicode-variant skew can arise from program use in general,
    Android Deltas Sync (ADS) is the main risk.  When applying delta sets for
    that system with -restore, the unique TO paths stored in __added__.txt
    will be those of the _proxy_ drive, not the actual TO device.

    Because TO (the "phone") may be Android, Windows, or anything, Unicode
    variants in the proxy's saved paths may arbitrarily differ when trying to
    delete them from TO.  The Unicode policies of the FROM platform and the
    medium used to store proxy paths may both vary from those on TO.

    This skew can arise for delta sets in general because their syncs are
    transitive and deferred, but ADS is the most common deltas use case.  The
    impact depends on both platform and program Unicode policies and may be
    rare in practice, but ADS allows any Python-capable platform to serve as a
    later TO.  While it's possible that every platform + filesystem
    combination in use today auto-normalizes paths so existence and opens work
    in all cases, Mergeall prefers to be proactive on resolution here.

    Update - platform normalization findings:

    - macOS (filesystem APFS) and Android (shared storage) _do_ auto-normalize
      path filenames, so Unicode variants match automatically without help

    - Windows 11 (NTFS) and Linux (ext4) do _not_ auto-normalize paths, and
      require the filename normalization logic here to match Unicode variants

    The latter seems ample cause for normalizing paths manually here.  The
    former seems a bad policy--platform automatic normalization is proprietary
    and non-interoperable; among other things, it can yield duplicates in
    syncs more naive than Mergeall 3.3's.

    Update - the Android normalization story:

    Android, being Android, implements rules that vary per storage type, and
    possibly even vendor.  On a Samsung Android 12L Fold4, both shared storage
    (e.g., /sdcard/folder) and app-specific storage (e.g.,
    /sdcard/Android/data/app) _do_ auto-normalize, but app-private storage
    (e.g., /data/data/app) does _not_.  Background coverage:
    learning-python.com/android-deltas-syncs/_README.html#Android%20Storage

    In sum, among the platforms tested for Unicode variants in 2022:

    - macOS, Android shared, and Android app-specific auto-normalize
    - Windows, Linux, and Android app-private do not auto-normalize

    Except where this varies by filesystem, vendor, or act of God...

    ----
    UPDATE, Oct-22: the path-scanning algorithm here was recoded in light of
    recent Windows failures when files to be deleted in __added__.txt had been
    atypically removed from TO before the run (by an earlier -restore with the
    same DELTAS).  The new algorithm fixes the Windows failures as well as a
    similar potential on Unix, and accommodates all these path types:

    - Windows absolute (C:\xxx)
    - Windows drive-CWD-relative (C:xxx)
    - Windows drive-relative (\xxx)
    - Windows CWD-relative (xxx)
    - Windows device paths (\\?\C:xxx, \\.\C:xxx)
    - Windows network (\\server\...)
    - Unix absolute (/xxx)
    - Unix CWD-relative (xxx)
    - Relative path syntax on both platforms: '.', '..'

    Any rare and unusual path not recognized (e.g., Windows "\\?\UNC\") simply
    skips Unicode path normalization here, and won't be deleted by the
    mergeall -restore run (a log message is generated to alert users).  Please
    send user feedback if this policy impacts your use case.

    The prior coding was happily naive about Windows' many flavors of path
    syntax (which seem ad-hoc alternatives to Unix mount points, and Windows
    drive-letter association).  This code was triggered inadvertently, but can
    kick in any time components' Unicode differs between that saved in
    __added__.txt and that stored in the TO tree/device.

    For more info, see the prototype for the new coding, especially its
    py-split-join.py docstring:
        ./test/test-path-normalization-3.3/prototype-recoding-oct22/

    Notes:

    - os.path is ntpath on Windows and posixpath on Unix; use the latter two
      to test platform-specific path handling on either platform.

    - Adding tracer prints can help demystify this wack code (Windows makes it
      bad enough to qualify as an interview question, imho).

    - This cannot simply normalize the full path all at once, because each of
      its components may have arbitrary Unicode-variant skew.

    - There is no way to tell what the path looks like in TO without walking
      components to test and possibly normalize each part.

    - This will print an arguably confusing Unicode-morph failure message and
      return delpath unchanged if delpath is fully absent in TO.  This can
      happen only if delpath was manually deleted outside normal script usage,
      and delpath must ultimately be skipped as a sync error anyhow.

    Testing/Results:

    - debug1/debug2 force the path normalization loop and component mods,
      respectively, on platforms that auto-normalize filenames to match (e.g.,
      macOS, Android shared).  On these, True/False generates a path walk and
      "--Path okay" messages, and True/True yields the same but with
      "--Unicode morphed" messages.
      Elsewhere, on platforms that do not auto-normalize (e.g., Windows,
      Linux), these switches aren't useful; the loop is run for mismatches
      normally, and the switches cause the loop to fail when a matching name
      is forcibly changed.  Leave them both False, and run the test script
      that forces path diffs and true Unicode path normalization on platforms
      which require it, and further documents testing, at:
          ./test/test-path-normalization-3.3/test-path-normalization-walks/_TEST.py

    For production use, the switches should always be False/False, and the
    code here will do the right thing for the platform hosting it.
    ------------------------------------------------------------------------------------
    """

    # force loop/normalization to test (else all variants exist on macos/android)
    debug1, debug2 = False, False

    # ntpath xor posixpath tools
    exists, join, sep, splitdrive = \
        os.path.exists, os.path.join, os.path.sep, os.path.splitdrive

    if exists(FWP(delpath)) and not debug1:
        return delpath                        # skip the drama for most cases

    else:
        drive, rest = splitdrive(delpath)     # splitdrive for Windows shenanigans
        if drive and rest.startswith(sep):    # drive for abs, drive-relative, etc.
            sofar = drive                     # rest starts with sep iff abs
            parts = rest.split(sep)
        else:
            sofar = ''                        # no drive, normal components
            parts = delpath.split(sep)        # also for Windows drive-relative

        if parts[0] == '':                    # empty for abs path/rest, win+ux
            sofar += sep                      # make join() work, skip the empty
            parts = parts[1:]

        while parts:                          # add each part to sofar and test
            part = parts.pop(0)               # 2.X-safe; was 3.X-only: next, *parts = parts
            newpath = join(sofar, part)       # test/mod part extension to sofar

            if exists(FWP(newpath)) and not debug2:
                # this part okay
                sofar = newpath
                if verbose: trace('--Path okay:', sofar)

            else:
                # fix this part
                nfc = unicodedata.normalize('NFC', part)
                nfd = unicodedata.normalize('NFD', part)
                if part == nfc:
                    sofar = join(sofar, nfd)  # failed as NFC: replace with NFD
                else:
                    sofar = join(sofar, nfc)  # failed as NFD: replace with NFC

                if exists(FWP(sofar)):
                    trace('--Unicode morphed for name:', part, 'in:', sofar)
                else:
                    error('--Unicode morph failed for name:', part, 'in:', sofar)
                    sofar = delpath
                    break                     # not found with mod: punt now

        return sofar    # normalized/joined result, or original if part failed