File: android-deltas-sync/_etc/convertunicode.py

#!/usr/bin/env python3
"""
=======================================================================================
convertunicode.py:
    Rename all items in a folder tree which are not already in a chosen Unicode format.

Version:  an Android Deltas Sync utility, added Dec-2021 [1.1]
License:  provided freely, but with no warranties of any kind
Author:   © M. Lutz (https://learning-python.com), 2021
Runs on:  any Python 3.X, and any host platform

USAGE
    $ python3 convertunicode.py folderrootpath unicodeformat listonly?

    Where:
    - folderrootpath is the pathname of the tree to be converted
    - unicodeformat is the required Unicode format: NFC, NFD, NFKC, or NFKD
    - listonly is any text; if present, show names to be renamed without renaming
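
    As a minimal sketch (an illustration only, not part of this script's logic), the
    four formats behave as follows in Python's standard unicodedata module.  The
    sample strings are assumptions made up for this demo.  Note that the compatibility
    formats (NFKC and NFKD) may also replace characters such as ligatures, so they
    can change names more than NFC and NFD do:

```python
from unicodedata import normalize

composed   = 'Li\u00f1ux'      # n-with-tilde as one code point (NFC form)
decomposed = 'Lin\u0303ux'     # plain n + combining tilde (NFD form)
ligature   = '\ufb01le'        # starts with the single-character "fi" ligature

# canonical formats convert between the composed and decomposed forms
assert normalize('NFD', composed) == decomposed
assert normalize('NFC', decomposed) == composed

# canonical formats keep the ligature; compatibility formats expand it
assert normalize('NFC', ligature) == ligature
assert normalize('NFKC', ligature) == 'file'

# the same sample name in each of the four formats
forms = {form: normalize(form, ligature + '-' + composed)
         for form in ('NFC', 'NFD', 'NFKC', 'NFKD')}
```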

PURPOSE

This is a rarely needed utility script.  It's a work-around for an atypical
use case in which non-ASCII filenames may generate errors on Android 11.

Per recent research conducted for the underlying Mergeall system (at
https://learning-python.com/mergeall.html), shared storage in Android 11 may
be unable to store files having non-ASCII Unicode filenames recorded in the 
composed (e.g., NFC) code-point format.  For background coverage, see Mergeall 
3.3's new module and release note, as well as the overview page on Wikipedia:

    https://learning-python.com/mergeall-products/unzipped/fixunicodedups.py
    https://learning-python.com/mergeall-products/unzipped/docetc/MoreDocs/Revisions.html#version33
    https://en.wikipedia.org/wiki/Unicode_equivalence

Search for "# ANDROID 11 BUG!" in the first of these.  In short, Unicode 
allows the same name to be represented by different code-point sequences:

    >>> l = b'Lin\xcc\x83ux.png'.decode('utf-8')   # decomposed form (NFD)
    >>> m = b'Li\xc3\xb1ux.png'.decode('utf-8')    # composed form (NFC)
    >>> l, m
    ('Liñux.png', 'Liñux.png')
    >>> l == m                    # equivalent but different
    False
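
A minimal sketch of how such names can be compared reliably: normalize both to one
common format before testing equality.  The helper name samename is hypothetical, 
and is used here only for illustration:

```python
from unicodedata import normalize

def samename(name1, name2, form='NFC'):
    # hypothetical helper: true if the names are canonically equivalent
    return normalize(form, name1) == normalize(form, name2)

l = b'Lin\xcc\x83ux.png'.decode('utf-8')    # decomposed form (NFD)
m = b'Li\xc3\xb1ux.png'.decode('utf-8')     # composed form (NFC)

rawsame  = (l == m)           # code-point sequences differ
normsame = samename(l, m)     # equal once both are normalized
```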

Android 11 shared storage, unlike both app storage and other platforms, 
may fail when writing the composed version of such filenames:

    # equivalents are equated
    >>> os.chdir('/sdcard/temp')
    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
    ...
    AAA-Liñux => b'AAA-Lin\xcc\x83ux'

    # composed form alone fails
    >>> for f in glob.glob('AAA*'): os.remove(f)
    >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    FileNotFoundError: [Errno 2] No such file or directory: 'AAA-Liñux'

This is a defect of unknown scope in Android 11 (it was seen only on a Samsung 
device), and does not occur in app-specific storage (which is a fallback option). 
It also may occur only if both forms are first created and mapped together as 
above; oddly, the composed form has been seen to work by itself otherwise.

But if your non-ASCII filenames fail in copies or syncs to shared storage, run 
this script from a console command line as described above, to adjust filenames 
as needed for a target device prior to copies or syncs.  For instance, run it to
convert any NFC names in your content to NFD for Android 11 shared storage.

This script changes only filenames, not file content.  When run with the optional 
listonly argument, it displays filenames that require conversion, without making
changes; this might be used to check for duplicates that differ only in Unicode
filename form.  Also note that running this is not required for Mergeall 3.3's 
automatic Unicode normalization for filename matching used during syncs.
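
The duplicates check mentioned above can also be sketched directly.  The following 
is an illustration only, not this script's code, and the function name variantgroups 
is hypothetical.  It groups the names in one folder listing by their normalized 
form, and keeps only the groups that have more than one member:

```python
from collections import defaultdict
from unicodedata import normalize

def variantgroups(names, form='NFC'):
    # hypothetical helper: map normalized name => all raw names that share it
    groups = defaultdict(list)
    for name in names:
        groups[normalize(form, name)].append(name)
    return {key: vals for (key, vals) in groups.items() if len(vals) > 1}

# sample listing for the demo: in practice, pass os.listdir(folder) instead
names = ['Li\u00f1ux.png', 'Lin\u0303ux.png', 'plain.txt']
dups = variantgroups(names)
```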

Border case: this script skips (with a message) the unconverted-name version of 
files having multiple differing name forms in the same folder.  An NFD-form name, 
for example, is skipped if an equivalent NFC-form name already exists during a 
conversion to NFC.  Otherwise, the rename may fail on some platforms, and overwrite
content on others.  This is technically a filesystem error state, because the
Unicode standard defines such multiply-represented filenames to be canonically 
equivalent.  If this occurs anyhow, this script avoids overwrites.
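
The skip rule can be sketched as a pure function over one folder's names.  This is 
a simplified illustration which assumes that a folder listing is a plain list of 
strings; planrenames is a hypothetical name, and no files are touched:

```python
from unicodedata import normalize

def planrenames(names, form):
    # hypothetical helper: split names into (old, new) renames and skipped names
    renames, skipped = [], []
    for name in names:
        normname = normalize(form, name)
        if name == normname:
            continue                       # already in the target format
        if normname in names:
            skipped.append(name)           # target form exists: avoid overwrite
        else:
            renames.append((name, normname))
    return renames, skipped

# sample listing: one name in both forms, plus one decomposed-only name
names = ['Lin\u0303ux.txt', 'Li\u00f1ux.txt', 'cafe\u0301.txt']
renames, skipped = planrenames(names, 'NFC')
```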

Caveat: filename behavior is platform dependent, but this script has proved to
work well in usage so far (see its macOS results at the bottom of this file).
The Android 11 bug may also be temporary, but such repairs tend to languish.
=======================================================================================
"""


# this is based on Mergeall's fix-nonportable-filenames.py

import sys, os
from unicodedata import normalize

help = 'Usage: python3 convertunicode.py folderrootpath unicodeformat listonly?'
normalforms = ('NFC', 'NFD', 'NFKC', 'NFKD')

# get args
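try:
    root = sys.argv[1]
    norm = sys.argv[2]
    listonly = len(sys.argv) > 3
    if not os.path.isdir(root) or norm not in normalforms:
        raise ValueError('bad arguments')     # explicit test: asserts are dropped by -O
except (IndexError, ValueError):
    print(help)
    sys.exit(1)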

# verify run
intent = 'filenames not in %s Unicode format' % norm
if listonly:
    print('FINDING ' + intent)
else:
    print('REPLACING ' + intent)
if input('Continue (y or n)? ').lower() not in ['y', 'yes']:
    print('Run aborted.')
    sys.exit(1)
print()

# walk folder tree
numrenames = numfiles = numdirs = 0

for (thisdir, subshere, fileshere) in os.walk(root):      # for all folders in tree 
    numdirs  += len(subshere)                             # subshere includes links to dirs
    numfiles += len(fileshere)                            # fileshere includes links to files

    allhere = subshere + fileshere                        # subfolders + (files+links)
    for name in allhere:                                  # for all items at this level
        normname = normalize(norm, name)
        if name != normname:

            # skip variants in same folder
            if normname in allhere:
                msg = 'Skipped same name in multiple forms' 
                print('%s: %s in: %s' % 
                      (msg, name, thisdir), end='\n\n')
                continue

            # rename to target format
            newpath = os.path.join(thisdir, normname)     # topdown: mods parents first
            oldpath = os.path.join(thisdir, name)
            print('', oldpath.encode('utf8'), '=>\n', 
                      newpath.encode('utf8'), end='\n\n')

            # rename file or dir
            numrenames += 1
            if not listonly:
                os.rename(oldpath, newpath)

            # tell the walker about the new name for the next step
            if name in subshere and not listonly:
                subshere.remove(name)
                subshere.append(normname)    # okay to change in-place: "+" made a copy

action = 'but unchanged' if listonly else 'and changed'
print('Visited %d file(s) and %d folder(s)' % (numfiles, numdirs))
print('Total variant names found %s: %d' % (action, numrenames))


"""
===============================================================================

EXAMPLE OUTPUT:


# Convert to composed
$ python3 $C/android-deltas-sync/_etc/convertunicode.py FROM NFC
REPLACING filenames not in NFC Unicode format
Continue (y or n)? y

 b'FROM/dir-Lin\xcc\x83ux' =>
 b'FROM/dir-Li\xc3\xb1ux'

 b'FROM/file1-Lin\xcc\x83ux.txt' =>
 b'FROM/file1-Li\xc3\xb1ux.txt'

 b'FROM/file3-Lin\xcc\x83ux.txt' =>
 b'FROM/file3-Li\xc3\xb1ux.txt'

 b'FROM/link-Lin\xcc\x83ux' =>
 b'FROM/link-Li\xc3\xb1ux'

 b'FROM/dir-Li\xc3\xb1ux/nested3-Lin\xcc\x83ux.txt' =>
 b'FROM/dir-Li\xc3\xb1ux/nested3-Li\xc3\xb1ux.txt'

 b'FROM/dir-Li\xc3\xb1ux/nested1-Lin\xcc\x83ux.txt' =>
 b'FROM/dir-Li\xc3\xb1ux/nested1-Li\xc3\xb1ux.txt'

Visited 7 file(s) and 1 folder(s)
Total variant names found and changed: 6


# List non-decomposed
$ python3 $C/android-deltas-sync/_etc/convertunicode.py FROM NFD -
FINDING filenames not in NFD Unicode format
Continue (y or n)? y

 b'FROM/dir-Li\xc3\xb1ux' =>
 b'FROM/dir-Lin\xcc\x83ux'

 b'FROM/file1-Li\xc3\xb1ux.txt' =>
 b'FROM/file1-Lin\xcc\x83ux.txt'

 b'FROM/file3-Li\xc3\xb1ux.txt' =>
 b'FROM/file3-Lin\xcc\x83ux.txt'

 b'FROM/link-Li\xc3\xb1ux' =>
 b'FROM/link-Lin\xcc\x83ux'

 b'FROM/dir-Li\xc3\xb1ux/nested3-Li\xc3\xb1ux.txt' =>
 b'FROM/dir-Li\xc3\xb1ux/nested3-Lin\xcc\x83ux.txt'

 b'FROM/dir-Li\xc3\xb1ux/nested1-Li\xc3\xb1ux.txt' =>
 b'FROM/dir-Li\xc3\xb1ux/nested1-Lin\xcc\x83ux.txt'

Visited 7 file(s) and 1 folder(s)
Total variant names found but unchanged: 6


# Forced in code on macOS (which maps equivalent names together)
$ python3 $C/android-deltas-sync/_etc/convertunicode.py FROM NFC
REPLACING filenames not in NFC Unicode format
Continue (y or n)? y

Skipped same name in multiple forms: dir-Liñux in: FROM

Visited 7 file(s) and 1 folder(s)
Total variant names found and changed: 0

===============================================================================
"""


