File: mergeall-products/unzipped/docetc/MoreDocs/fix-nonportable-filenames-orig.txt
# Original documentation of the macOS munge, trimmed from fix-nonportable-filenames.py

This script mostly applies to macOS users. The need for it is complex, and arose only after using Mergeall extensively and cross-platform for 7 years without the issue. Normally, intermediate drives used to transfer content between platforms mute filename interoperability issues. In fact, both exFAT and FAT32 drives on macOS seem to allow writing otherwise-invalid characters, which typically crop up in saved web pages.

An issue has been seen to arise, however, when both:

1) Content is being synced between macOS and a file server. This was observed for a WebDAV server running on an Android phone, but it's likely also problematic for both some other file server types and Windows and Linux server hosts.

2) The content being synced originated from a FAT32 or exFAT drive populated by macOS, and was propagated from there to the platform running the server.

For FAT32 and exFAT drives, macOS follows Cygwin's model of automatically mapping forbidden filename characters to and from Unicode private characters on writes and reads, respectively. This magic mapping works well if the drives are only ever used on macOS, but fails if the mapped filenames are propagated to another platform: the forbidden characters are stored elsewhere as their mapped Unicode characters (not the original illegals), and will not match the unmapped originals on the Mac if returned by a server running elsewhere.

For example, filenames with '|' on macOS sync across platforms using a FAT32 drive without error: when macOS stores on the drive it replaces the '|' with Unicode '\uf027', which is also propagated and stored on other platforms transitively. In later syncs, other platforms' '\uf027' match the same on FAT32 drives, and the drive's '\uf027' is mapped back to '|' when read on macOS.
It looks like the FAT32 drive harbors a '|' on Macs alone, but it's really a '\uf027' outside the Mac clubhouse, and other illegals are smuggled the same way (e.g., '"' becomes '\uf020').

Though scant, you can find evidence of macOS's munge (called "SFM") in the code of Linux's 'cifs_unicode.h' on the web. Or, watch it live in Python 3.X on macOS:

    # EXFAT AND FAT32 USB DRIVES ARE MUNGED

    >>> import os
    >>> os.chdir('/Volumes/SSDT3')
    >>> os.mkdir('test')
    >>> os.chdir('test')
    >>> open('file\uf027name\uf020here', 'w').write('hmm')     # write munged chars
    3
    >>> os.listdir('.')                                        # receive illegals
    ['file|name"here']
    >>>
    >>> open('file|name"here', 'r').read()                     # either name works
    'hmm'
    >>> open('file\uf027name\uf020here', 'r').read()           # but only on macOS
    'hmm'
    >>>
    >>> open('file|name"here', 'w').write('hmm more')          # write different name
    8
    >>> open('file|name"here', 'r').read()
    'hmm more'
    >>> open('file\uf027name\uf020here', 'r').read()           # two names for same file
    'hmm more'
    >>> os.listdir()                                           # this is not cool
    ['file|name"here']

    # BUT INTERNAL APFS DRIVES ARE NOT

    >>> os.chdir('/Users/me/Documents')
    >>> os.mkdir('test')
    >>> os.chdir('test')
    >>> open('file\uf027name\uf020here', 'w').write('hmm')
    3
    >>> os.listdir('.')
    ['file\uf027name\uf020here']                               # what most servers return
    >>>
    >>> open('file\uf027name\uf020here', 'r').read()
    'hmm'
    >>> open('file|name"here', 'r').read()                     # behavior diff: surprise!
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    FileNotFoundError: [Errno 2] No such file or directory: 'file|name"here'

The net effect is that syncs work between macOS and the drives (due to the magic mapping); as well as between the drives and other platforms (both of which store and compare the mapped Unicode characters).
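
The mapping described above can be imitated in pure Python. The following is a minimal sketch, not macOS's actual implementation: it covers only the two character pairs named in this text ('|' with '\uf027', and '"' with '\uf020'), and the function names munge and unmunge are hypothetical.

```python
# Sketch only: imitates the FAT32/exFAT filename mapping described in the text.
# Only the two pairs cited there are included; the real mapping covers more.
ILLEGAL_TO_PRIVATE = {'|': '\uf027', '"': '\uf020'}

TO_PRIVATE = str.maketrans(ILLEGAL_TO_PRIVATE)
FROM_PRIVATE = str.maketrans({v: k for k, v in ILLEGAL_TO_PRIVATE.items()})

def munge(name):
    """Map forbidden characters to private-use ones (the form that is stored)."""
    return name.translate(TO_PRIVATE)

def unmunge(name):
    """Map private-use characters back to the originals (the form macOS lists)."""
    return name.translate(FROM_PRIVATE)

mac_name = 'file|name"here'        # as listed on macOS for these drives
stored_name = munge(mac_name)      # as stored on the drive and propagated

print(ascii(stored_name))                  # show the private-use escapes
print(stored_name == mac_name)             # False: raw comparison sees two names
print(unmunge(stored_name) == mac_name)    # True once the mapping is undone
```

A raw string comparison treats the two spellings as different names; they match only where the translation is applied. macOS applies the equivalent translation for FAT32 and exFAT drives, so names round-trip there and syncs through the drives work.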
Syncs fail, however, when filenames are served from another platform and compared to macOS: though this may vary per server (e.g., cifs on Linux undoes Mac filename mapping for Samba drives), the Unicode characters from the other platform generally won't match the unmapped originals on macOS. In Mergeall, the result is that all filenames with forbidden characters mapped by macOS register as unique in both FROM and TO when served from another platform, and will be deleted as unique TOs and recopied pointlessly from FROM. Depending on the server's behavior, these diffs may trigger outright failures or repeated updates.

I.e., magic kills. macOS should instead raise an error for invalid filename characters, so users have a chance to avoid this interoperability collision. Cloaking problems is not the same as fixing them. As is, other platforms silently receive and record what looks like garbage bytes; may propagate them elsewhere; and fail server-based syncs back to macOS - only. Barring a macOS policy change, its users must be diligent to remove illegal characters before propagation, with tools like this script.

WHY NOT IN MERGEALL: NAME COLLISIONS
------------------------------------

Running this script to sanitize such filenames fixes this problem, but it's a manual step, and its benefit lasts only until the next nonportable name is added. This is subpar, but supporting such filenames in Mergeall itself is complex, and would be a heuristic gamble.

Filename comparisons would need to be generalized to ignore nonportable characters, but the original filename would still need to be used for writes to avoid penalizing platforms and filesystems that allow more characters (Mergeall's cardinal rule is to do content no harm--including its names).

But even if Mergeall were changed this way, there is a remote but real risk that a filename munged for comparisons would match another in syncs only coincidentally, resulting in an erroneous update.
For instance, an 'a_b' or similar name created internally to compare a Unix 'a|b' may be the name of entirely different content in another copy. Because Mergeall generally avoids such "almost okay" policies, some user intervention is warranted, and manual sanitizing with this script (or similar techniques) is the better solution today.

CAUTION REDUX
-------------

This script avoids overwriting same-named files in the local content copy by making filenames unique as noted earlier, but it's still possible that a sanitized filename may unintentionally match an unrelated file in a different copy. Though unlikely, this could yield invalid changes on syncs.

Because of this risk, you should always inspect the results of this script, and run Mergeall initially in report-only mode when propagating files renamed here to other copies of your content. This risk is the same as munging names within Mergeall, but you at least control its impacts.

This script may also replace filename characters you deliberately use for effect or testing (e.g., '"' for inches, which works on Unix web servers). Be sure to ALWAYS INSPECT ITS INTENTIONS with a second argument before renaming your files. Some users may very well prefer to avoid this script after previewing its changes.
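
The inspect-before-rename workflow urged above can be sketched as follows. This is not the code of fix-nonportable-filenames.py: the character set in NONPORTABLE, the '_' replacement, the '__N' suffix scheme, and the names sanitize and plan_renames are all assumptions made for illustration. The sketch only computes a rename plan, and never touches files.

```python
import os

# Assumed set: characters Windows-style filesystems reject, plus the two
# private-use stand-ins named in this text. The real script's set may differ.
NONPORTABLE = '<>:"|?*\\' + '\uf027\uf020'

def sanitize(name, replacement='_'):
    """Return name with each nonportable character replaced."""
    return ''.join(replacement if ch in NONPORTABLE else ch for ch in name)

def plan_renames(names):
    """Map each name needing a change to a new name unused in this folder.

    Uniqueness uses a hypothetical scheme: append __N before the extension.
    """
    taken = set(names)
    plan = {}
    for name in sorted(names):
        fixed = sanitize(name)
        if fixed == name:
            continue                      # already portable: leave it alone
        root, ext = os.path.splitext(fixed)
        candidate, count = fixed, 1
        while candidate in taken:         # avoid clobbering a same-named file
            candidate = '%s__%d%s' % (root, count, ext)
            count += 1
        taken.add(candidate)
        plan[name] = candidate
    return plan

# Preview only: print intended changes for one folder's listing.
listing = ['a|b.txt', 'a_b.txt', 'plain.txt', '6" nail.html']
for old, new in plan_renames(listing).items():
    print(ascii(old), '=>', ascii(new))
```

Printing the plan first mirrors the advice to inspect intended changes before renaming anything. Uniqueness is checked only within the one folder listing passed in, so the cross-copy collision risk described above remains, and a report-only Mergeall run is still the safer first step.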

REFERENCES
----------

Drives
    FAT32:  https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits
    exFAT:  https://docs.microsoft.com/en-us/windows/win32/fileio/exfat-specification#773-filename-field

The munge
    Samba:  https://www.google.com/search?q=%22cifs_unicode.h%22
    Samba:  https://wiki.samba.org/index.php/SMB3-Linux#No_reserved_path_characters
    Cygwin: https://www.cygwin.com/cygwin-ug-net/using-specialnames.html

Servers
    Samba:  https://play.google.com/store/search?q=samba%20sever&c=apps
    WebDAV: https://play.google.com/store/search?q=webdav%20sever&c=apps
    Tested: https://play.google.com/store/apps/details?id=com.theolivetree.webdavserverpro
    MacOS:  https://duckduckgo.com/?q=access+webdav+on+mac+os
    Linux:  https://savannah.nongnu.org/projects/davfs2
    Mount:  http://manpages.ubuntu.com/manpages/precise/man8/mount.davfs.8.html