File: mergeall-products/unzipped/docetc/MoreDocs/fix-nonportable-filenames-orig.txt

# Original documentation of the macOS munge, trimmed from fix-nonportable-filenames.py

This script mostly applies to macOS users.  The need for it is complex, and arose only 
after using Mergeall extensively and cross platform for 7 years without the issue.  
Normally, intermediate drives used to transfer content between platforms mute filename
interoperability issues.  In fact, both exFAT and FAT32 drives on macOS seem to allow
writing otherwise-invalid characters, which typically crop up in saved web pages.

An issue has been seen to arise, however, when both:

1) Content is being synced between macOS and a file server.  This was observed for 
   a WebDAV server running on an Android phone, but it's likely also problematic for 
   some other file server types, and for Windows and Linux server hosts.  

2) The content being synced originated from a FAT32 or exFAT drive populated by 
   macOS, and was propagated from there to the platform running the server. 

For FAT32 and exFAT drives, macOS follows Cygwin's model of automatically mapping 
forbidden filename characters to and from Unicode private characters on writes and 
reads, respectively.  This magic mapping works well if the drives are only ever used 
on macOS, but fails if the mapped filenames are propagated to another platform: the 
forbidden characters are stored elsewhere as their mapped Unicode characters (not the 
original illegals), and will not match the unmapped originals on the Mac if returned 
by a server running elsewhere.

For example, a filename with '|' on macOS syncs across platforms using a FAT32 drive
without error: when macOS stores on the drive it replaces the '|' with Unicode '\uf027',
which is also propagated and stored on other platforms transitively.  In later syncs, 
other platforms' '\uf027' match the same on FAT32 drives, and the drive's '\uf027' is 
mapped back to '|' when read on macOS.  It looks like the FAT32 drive harbors a '|' on 
Macs alone, but it's really a '\uf027' outside the Mac clubhouse, and other illegals are 
smuggled the same way (e.g., '"' becomes '\uf020').  
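
The round trip just described can be sketched in Python.  This is only a minimal 
illustration, and assumes just the two pairs named above ('|' and '"'); macOS maps 
other forbidden characters too, which are not listed here.  The names SFM_PAIRS, 
as_stored_on_drive, and as_seen_on_macos are hypothetical, and are not part of 
the script or of macOS:

```python
# Hypothetical sketch: only the two pairs cited in this document are listed;
# macOS maps additional forbidden characters that are not shown here.
SFM_PAIRS = {'|': '\uf027', '"': '\uf020'}

TO_MAPPED = str.maketrans(SFM_PAIRS)
FROM_MAPPED = str.maketrans({mapped: orig for orig, mapped in SFM_PAIRS.items()})

def as_stored_on_drive(name):
    """Name a FAT32/exFAT drive records when macOS writes this name."""
    return name.translate(TO_MAPPED)

def as_seen_on_macos(name):
    """Name macOS reports when it reads a stored name back from the drive."""
    return name.translate(FROM_MAPPED)

stored = as_stored_on_drive('file|name"here')
print(ascii(stored))                  # what other platforms receive and keep
print(as_seen_on_macos(stored))       # what the Mac shows for the same file
print(stored == 'file|name"here')     # False: the two forms never compare equal
```

The last line is the crux: any platform that compares names as plain strings 
sees two different files.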

Evidence of macOS's munge (called "SFM") is scant, but you can find it in the code of 
Linux's 'cifs_unicode.h' on the web.  Or, watch it live in Python 3.X on macOS:

# EXFAT AND FAT32 USB DRIVES ARE MUNGED

  >>> import os
  >>> os.chdir('/Volumes/SSDT3')
  >>> os.mkdir('test')
  >>> os.chdir('test')
  >>> open('file\uf027name\uf020here', 'w').write('hmm')    # write munged chars
  3
  >>> os.listdir('.')                                       # receive illegals
  ['file|name"here']
  >>>
  >>> open('file|name"here', 'r').read()                    # either name works
  'hmm'
  >>> open('file\uf027name\uf020here', 'r').read()          # but only on macOS
  'hmm'
  >>> 
  >>> open('file|name"here', 'w').write('hmm more')         # write different name
  8
  >>> open('file|name"here', 'r').read()
  'hmm more'
  >>> open('file\uf027name\uf020here', 'r').read()          # two names for same file
  'hmm more'
  >>> os.listdir()                                          # this is not cool
  ['file|name"here']

# BUT INTERNAL APFS DRIVES ARE NOT

  >>> os.chdir('/Users/me/Documents')
  >>> os.mkdir('test')
  >>> os.chdir('test')
  >>> open('file\uf027name\uf020here', 'w').write('hmm')
  3
  >>> os.listdir('.')
  ['file\uf027name\uf020here']                              # what most servers return
  >>> 
  >>> open('file\uf027name\uf020here', 'r').read()
  'hmm'
  >>> open('file|name"here', 'r').read()                    # behavior diff: surprise!
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  FileNotFoundError: [Errno 2] No such file or directory: 'file|name"here'

The net effect is that syncs work between macOS and the drives (due to the magic 
mapping), as well as between the drives and other platforms (both of which store and 
compare the mapped Unicode characters).  Syncs fail, however, when filenames are served
from another platform and compared to macOS: though this may vary per server (e.g., 
cifs on Linux undoes Mac filename mapping for Samba drives), the Unicode characters 
from the other platform generally won't match the unmapped originals on macOS.

In Mergeall, the result is that all filenames with forbidden characters mapped by
macOS register as unique in both FROM and TO when served from another platform, and 
will be deleted as unique TOs, and recopied pointlessly from FROM.  Depending on the
server's behavior, these diffs may trigger outright failures or repeated updates.
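
To make this concrete, here is a rough model of a name-based comparison.  It is 
not Mergeall's actual code: the two name sets are invented for illustration, and 
simple set differences stand in for its tree comparison:

```python
# Hypothetical sketch of a name-based comparison; not Mergeall's actual code.
from_names = {'notes.txt', 'page|saved.html'}        # FROM on macOS: unmapped names
to_names = {'notes.txt', 'page\uf027saved.html'}     # TO as served from elsewhere

unique_from = from_names - to_names    # would be copied from FROM again
unique_to = to_names - from_names      # would be deleted from TO

print(sorted(unique_from))
print(ascii(sorted(unique_to)))
```

The same file shows up once in each difference, so it is both removed and 
recopied, and the next sync finds the same mismatch again.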

I.e., magic kills.  macOS should instead raise an error for invalid filename characters,
so users have a chance to avoid this interoperability collision.  Cloaking problems is 
not the same as fixing them.  As is, other platforms silently receive and record what 
looks like garbage bytes; may propagate them elsewhere; and fail server-based syncs 
back to macOS - only.  Barring a macOS policy change, its users must be diligent to
remove illegal characters before propagation, with tools like this script.  


WHY NOT IN MERGEALL: NAME COLLISIONS
------------------------------------
Running this script to sanitize such filenames fixes this problem, but it's a manual 
step, and its benefit lasts only until the next nonportable name is added.  

This is subpar, but supporting such filenames in Mergeall itself is complex, and 
would be a heuristic gamble.  Filename comparisons would need to be generalized
to ignore nonportable characters, but the original filename would still need to be 
used for writes to avoid penalizing platforms and filesystems that allow more 
characters (Mergeall's cardinal rule is to do content no harm--including its names).

But even if Mergeall were changed this way, there is a remote but real risk that a 
filename munged for comparisons would match another in syncs only coincidentally, 
resulting in an erroneous update.  For instance, an 'a_b' or other name created internally 
to compare a Unix 'a|b' may be the name of entirely different content in another copy.  
Because Mergeall generally avoids such "almost okay" policies, some user intervention 
is warranted, and manual sanitizing with this script (or similar techniques) is the 
better solution today. 
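
The coincidental match is easy to reproduce.  In the sketch below, compare_key is 
a hypothetical stand-in for the generalized comparison discussed above, and the 
'_' replacement and the set of illegals are assumptions made for illustration:

```python
def compare_key(name, illegals='|"', stand_in='_'):
    """Hypothetical comparison key: replace nonportable characters with '_'."""
    return ''.join(stand_in if ch in illegals else ch for ch in name)

copy_one = ['a|b']    # Unix copy: a file with a nonportable name
copy_two = ['a_b']    # other copy: unrelated content that happens to be named a_b

keys_one = {compare_key(name) for name in copy_one}
keys_two = {compare_key(name) for name in copy_two}

print(keys_one & keys_two)    # {'a_b'}: two different files look like one
```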


CAUTION REDUX
-------------
This script avoids overwriting same-named files in the local content copy by making 
filenames unique as noted earlier, but it's still possible that a sanitized 
filename may unintentionally match an unrelated file in a different copy.  Though 
unlikely, this could yield invalid changes on syncs.  

Because of this risk, you should always inspect the results of this script, and run
Mergeall initially in report-only mode when propagating files renamed here to other
copies of your content.  This risk is the same as munging names within Mergeall, 
but you at least control its impacts.
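
The idea of planning renames before making them can be sketched as follows.  This 
is not the script's own logic: the sanitize and plan_renames functions, the '_' 
replacement, and the '__N' suffix used for uniqueness are all assumptions made 
for illustration.  The sketch works on a list of names only, and touches no files:

```python
import os

def sanitize(name, illegals='|"', stand_in='_'):
    """Hypothetical sanitizer: replace nonportable characters with '_'."""
    return ''.join(stand_in if ch in illegals else ch for ch in name)

def plan_renames(names):
    """Return {old: new} for names that would change, without renaming anything."""
    taken = set(names)
    plan = {}
    for old in sorted(names):
        new = sanitize(old)
        if new == old:
            continue                              # already portable: leave alone
        root, ext = os.path.splitext(new)
        count = 1
        while new in taken:                       # never reuse a name in this folder
            count += 1
            new = f'{root}__{count}{ext}'         # assumed suffix scheme
        taken.add(new)
        plan[old] = new
    return plan

print(plan_renames(['a|b', 'a_b', 'ok.txt']))     # {'a|b': 'a_b__2'}
```

A plan like this guards only the local folder.  It cannot know whether 'a_b__2' 
already names something unrelated in another copy, which is why a report-only 
sync is still the safer first step.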

This script may also replace filename characters you deliberately use for effect 
or testing (e.g., '"' for inches, which works on Unix web servers).  Be sure to 
ALWAYS INSPECT ITS INTENTIONS with a second argument before renaming your files.
Some users may very well prefer to avoid this script after previewing its changes.


REFERENCES
----------
Drives
  FAT32:  https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits
  exFAT:  https://docs.microsoft.com/en-us/windows/win32/fileio/exfat-specification#773-filename-field
 
The munge 
  Samba:  https://www.google.com/search?q=%22cifs_unicode.h%22
  Samba:  https://wiki.samba.org/index.php/SMB3-Linux#No_reserved_path_characters
  Cygwin: https://www.cygwin.com/cygwin-ug-net/using-specialnames.html

Servers
  Samba:  https://play.google.com/store/search?q=samba%20sever&c=apps
  WebDAV: https://play.google.com/store/search?q=webdav%20sever&c=apps
  Tested: https://play.google.com/store/apps/details?id=com.theolivetree.webdavserverpro
  MacOS:  https://duckduckgo.com/?q=access+webdav+on+mac+os
  Linux:  https://savannah.nongnu.org/projects/davfs2
  Mount:  http://manpages.ubuntu.com/manpages/precise/man8/mount.davfs.8.html


