File: mergeall-products/unzipped/fix-nonportable-filenames.py

#!/usr/bin/env python3
r"""
=======================================================================================
fix-nonportable-filenames.py:
    Replace all nonportable characters with "_" in all file, folder, and 
    symlink names in an entire folder tree, or list nonportable items only.

Version:  a Mergeall and ziptools utility, Sep-26-2021
License:  provided freely, but with no warranties of any kind
Author:   © M. Lutz (https://learning-python.com), 2021
Runs on:  any Python 3.X, and any host platform
Status:   available in all Mergeall packages, as well as ziptools

IMPORTANT:
    It is strongly recommended that this script be run before propagating 
    content from Unix to platforms and drives which limit filename characters.
    This includes transfers to Windows, some Androids' shared storage, and 
    FAT32, exFAT, and BDR drives.  

    Otherwise, nonportable filenames may fail and be skipped, both in Mergeall and 
    other tools.  Worse, some tools' automatic handling causes subtle problems: 
    backslashes in Unix filenames may generate unintended folders on Windows; 
    and filenames automatically mangled on Windows and drives to enable saves 
    may both trigger file overwrites, and fail to match their originals on the 
    source in later syncs unless source files are also mangled the same way.

    To avoid all such issues, run this script to make filenames portable before 
    transfers to Windows and other limited contexts.  This satisfies Mergeall's
    filename assumptions, and sidesteps issues inherent in the auto-mangling
    of names performed by the embedded ziptools when its -nomangle is not used. 

USAGE: 
    python3 fix-nonportable-filenames.py folderrootpath [listonlyany]

SUMMARY: 
    This script replaces any of [\x00 / \ | < > ? * : "] with [_] in all
    file, folder, and symlink names in a folder tree.  Use it to report or fix
    nonportable filenames for interoperability, before transferring content from
    Unix (e.g., macOS and Linux) to platforms and filesystems with character 
    constraints.  This includes transfers to Windows, but also some Androids' 
    shared-storage (e.g., at /sdcard), as well as FAT32, exFAT, and BDR drives. 

UPDATE: 
    This script was originally coded to address a macOS auto-mangle issue (see
    ahead), but it is broadly useful before transferring content from Unix to 
    more restrictive platforms or drives, and is recommended by both ziptools 
    and Mergeall.  ziptools mangles have overwrite potential and are not applied
    in all use cases, and Mergeall assumes that FROM and TO names have both been
    adjusted as needed to match; running this script satisfies both apps' goals.

CAUTION:
    When run with one argument, this script may rename files in a folder tree 
    and has no automatic rollback of its changes.  Read this docstring before 
    use, and always test with a second argument for list-only mode before 
    renames.  Because cross-copy name collisions cannot be ruled out (see the 
    next section), also run Mergeall in its "-report" report-only mode initially
    when propagating changes to contexts with filename restrictions.


USAGE DETAILS
-------------
This script replaces nonportable characters with a single '_' in all file, folder,
and symlink names in an entire folder tree.  Run it with one or two command-line 
arguments: pass the folder tree's pathname as the first argument; pass an optional
second argument (of any value) to list rename candidates but not rename them.  
This script also confirms the run first: it prints a prompt to standard output and
reads a y/n reply from standard input before doing anything.

All characters in the _nonportables_ string below are considered nonportable and
are replaced.  On Unix, only NUL ('\x00') and '/' are invalid, but this varies per
filesystem and OS; _nonportables_ accommodates the filename rules of Windows and 
various filesystems, including FAT32 and exFAT, and some Androids' shared storage. 
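
As a minimal sketch of the replacement step, the following mirrors the table-based
mapping used in this script's code; the make_portable helper name is hypothetical
and is not part of this script:

```python
# Sketch only: make_portable is a hypothetical helper, not part of this script.
nonportables = '\x00/\\|<>?*:"'
replacements = {ord(c): ord('_') for c in nonportables}

def make_portable(name):
    # Map every nonportable character to '_' in a single pass.
    return name.translate(replacements)

print(make_portable('file?name|here.txt'))    # file_name_here.txt
```

Each nonportable character maps to exactly one '_', so the name's length is unchanged.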

This script also avoids generating duplicate filenames within the content copy it
is run on.  It does so by appending a duplicate ID number where needed to make each
changed name unique within its folder.  For example, 'a|b' and 'a:b' both map to 
'a_b', which would cause overwrites or failures unless made unique; this is 
resolved here by saving them as 'a_b' and 'a_b__2', respectively.
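
The duplicate-ID scheme can be sketched as follows.  This is an illustration under
stated assumptions, not this script's exact code: the unique_path name and its
exists parameter are hypothetical, added so the logic can be tried without touching
the disk, and the os.path.lexists default is this sketch's own choice:

```python
import os

def unique_path(path, exists=os.path.lexists):
    # Hypothetical helper: append __2, __3, ... before the extension
    # until the candidate name is not already taken.
    base, ext = os.path.splitext(path)
    candidate, numdup = path, 1
    while exists(candidate):
        numdup += 1
        candidate = base + '__' + str(numdup) + ext
    return candidate

taken = {'a_b', 'notes_v1.txt'}                                  # names already in use
print(unique_path('a_b', exists=taken.__contains__))            # a_b__2
print(unique_path('notes_v1.txt', exists=taken.__contains__))   # notes_v1__2.txt
```

The ID goes before the extension so that a renamed file keeps its original file type.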

The ID number prevents overwrites during this script's run, but a name changed here
may still coincide with an existing name in another content copy.  This is very
unlikely, but run tools in report-only mode to be sure before syncing content between
different copies.  Also run this script's list-only mode first to preview its 
intentions; some nonportable characters may be deliberate.


USAGE EXAMPLES
--------------

==List-only mode==

  /Code/mergeall$ python3 fix-nonportable-filenames.py ~/testfolder -
  FINDING the following without making changes: ['\x00', '/', '\\', '|', '<', '>', '?', '*', ':', '"']
  Continue (y or n)? y

   /Users/me/testfolder/Subfolder/file?name|here.txt =>
   /Users/me/testfolder/Subfolder/file_name_here.txt

   /Users/me/testfolder/Subfolder/duptest/a<b|c =>
   /Users/me/testfolder/Subfolder/duptest/a_b_c__2

   ****Duplicate to be resolved by filename

  Visited 12 files and 3 folders
  Total nonportable names found but unchanged: 2


==Replacements mode==

  ~/MY-STUFF/Websites$ python3 $C/mergeall/fix-nonportable-filenames.py UNION
  REPLACING the following with "_" in all names: ['\x00', '/', '\\', '|', '<', '>', '?', '*', ':', '"']
  Continue (y or n)? y

   UNION/android-tkinter/etc/query-[Tkinter-discuss] tkinter on android?.html =>
   UNION/android-tkinter/etc/query-[Tkinter-discuss] tkinter on android_.html

   UNION/site-mobile-screenshots/8-ios-4"-safari.PNG =>
   UNION/site-mobile-screenshots/8-ios-4_-safari.PNG

   ...etc...

  Visited 11477 files and 1275 folders
  Total nonportable names found and changed: 11


THE macOS MUNGE
---------------
This script's original motivation was a macOS issue: nonportable filename characters
are silently mapped to and from Unicode private codes on FAT32 and exFAT drives by 
macOS, but won't match names on macOS if served from another platform, thereby 
breaking back syncs.  Because this script is now used more broadly, this original 
rationale's description has been trimmed here; see its online coverage at:

  learning-python.com/post-release-updates.html#nonportablefilenames

The original macOS munge coverage trimmed here is also available in Mergeall at:
    
  ./docetc/MoreDocs/fix-nonportable-filenames-orig.txt

Similar name mangling occurs when writing to BDR optical drives, and Linux 
raises errors and refuses to copy nonportable filenames to Windows-filesystem
drives (e.g., exFAT) in both file explorers and command lines.  Run this script
before content copies to avoid all such issues.


OTHER MANGLERS
--------------
The related ziptools program (learning-python.com/ziptools.html) also auto 
mangles names that fail on unzips, but only on Windows.  Names listed for 
removal by Mergeall's deltas.py are also auto mangled on Windows if they fail
in unmangled form when applied.  Both use cases have rare data-loss (overwrite)
risks that can be avoided by running this script before transferring content 
to platforms with filename constraints.  See ziptools' related coverage here:

  learning-python.com/ziptools/ziptools/_README.html#nomangle

Mergeall itself does not mangle names of files copied from FROM to TO by syncs, 
only names deleted from deltas.py __added__.txt lists.  This policy avoids 
out-of-sync and data-loss potentials.  Instead, Mergeall assumes that both FROM 
and TO content names have been mangled as needed, whether by this script, transfer
to an external drive, copies in file explorers, or unzipping with tools like ziptools.
See Mergeall's brief related coverage here:

  learning-python.com/mergeall-products/unzipped/UserGuide.html#filenameportability

=======================================================================================
"""

# CODE

import sys, os
help = 'Usage: python3 fix-nonportable-filenames.py folderrootpath [listonlyany]'

nonportables = ' \x00 / \\ | < > ? * : " '.replace(' ', '')    # tbd: % \' + [ ] (^=fat?)
replacements = {ord(c): ord('_') for c in nonportables}

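# get args: explicit checks, because asserts are skipped under "python -O"
# and a bare except would also hide unrelated errors
if len(sys.argv) < 2 or not os.path.isdir(sys.argv[1]):
    print(help)
    sys.exit(1)
root = sys.argv[1]
listonly = len(sys.argv) > 2                               # any second arg: list only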

# verify run
display = [str(c) for c in nonportables]
if listonly:
    print('FINDING the following without making changes: %s' % display)
else:
    print('REPLACING the following with "_" in all names: %s' % display)
if input('Continue (y or n)? ').lower() not in ['y', 'yes']:
    print('Run aborted.')
    sys.exit(1)
print()

# walk folder tree
numrenames = numfiles = numdirs = 0

for (thisdir, subshere, fileshere) in os.walk(root):       # for all folders in tree 
    numdirs  += len(subshere)                              # subs/fileshere include links
    numfiles += len(fileshere)

    for name in subshere + fileshere:                      # for all subfolders and files
        if any(c in name for c in nonportables):

            # replace illegals
            newname = name.translate(replacements)         # re.sub would work here too
            newpath = os.path.join(thisdir, newname)       # topdown: mods parents first
            oldpath = os.path.join(thisdir, name)

            # avoid duplicates: 'a|b' and 'a:b' both map to 'a_b'
            numdup = 1
            newbase, newext = os.path.splitext(newpath)
            trypath = newpath
            while os.path.lexists(trypath):                # lexists: counts broken symlinks too
                numdup += 1
                trypath = newbase + '__' + str(numdup) + newext    # __dup# before .ext
            newpath = trypath

            print('', oldpath, '=>\n', newpath, end='\n\n')
            if numdup > 1:
                # in list-only mode this prints only for names already on disk:
                # no renames have been made yet, so new duplicates go undetected
                when = 'to be' if listonly else 'was'
                print('', '****Duplicate %s resolved by filename\n' % when)

            # rename file or dir
            numrenames += 1
            if not listonly:
                os.rename(oldpath, newpath)

            # tell the walker about the new name for the next step; use the final
            # name from newpath, which may carry a __dup# suffix that newname lacks
            if name in subshere and not listonly:
                subshere.remove(name)
                subshere.append(os.path.basename(newpath))    # in-place okay: "+" made a copy

action = 'but unchanged' if listonly else 'and changed'
print('Visited %d files and %d folders' % (numfiles, numdirs))
print('Total nonportable names found %s: %d' % (action, numrenames))


