File: android-deltas-sync/_etc/convertunicode.py
#!/usr/bin/env python3
"""
=======================================================================================
convertunicode.py:
Rename all items in a folder tree which are not already in a chosen Unicode format.
Version: an Android Deltas Sync utility, added Dec-2021 [1.1]
License: provided freely, but with no warranties of any kind
Author: © M. Lutz (https://learning-python.com), 2021
Runs on: any Python 3.X, and any host platform
USAGE
$ python3 convertunicode.py folderrootpath unicodeformat listonly?
Where:
- folderrootpath is the pathname of the tree to be converted
- unicodeformat is the required Unicode format: NFC, NFD, NFKC, or NFKD
- listonly is optional and may be any text; if present, names to be renamed are shown but not renamed
PURPOSE
This is a rarely needed utility script. It's a work-around for an atypical
use case in which non-ASCII filenames may generate errors on Android 11.
Per recent research conducted for the underlying Mergeall system (at
https://learning-python.com/mergeall.html), shared storage in Android 11 may
be unable to store files having non-ASCII Unicode filenames recorded in the
composed (e.g., NFC) code-point format. For background coverage, see Mergeall
3.3's new module and release note, as well as the overview page on Wikipedia:
https://learning-python.com/mergeall-products/unzipped/fixunicodedups.py
https://learning-python.com/mergeall-products/unzipped/docetc/MoreDocs/Revisions.html#version33
https://en.wikipedia.org/wiki/Unicode_equivalence
Search for "# ANDROID 11 BUG!" in the first of these. In short, Unicode
allows the same name to be represented by different code-point sequences:
>>> l = b'Lin\xcc\x83ux.png'.decode('utf-8') # decomposed form (NFD)
>>> m = b'Li\xc3\xb1ux.png'.decode('utf-8') # composed form (NFC)
>>> l, m
('Liñux.png', 'Liñux.png')
>>> l == m # equivalent but different
False
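As a minimal sketch (illustrative only, and separate from this script's own
logic), the standard library's unicodedata.normalize maps both forms to a
common one, after which the names compare equal. The variable names here are
assumptions chosen for the example:

```python
from unicodedata import normalize

decomposed = b'Lin\xcc\x83ux.png'.decode('utf-8')    # 'n' + combining tilde (NFD)
composed = b'Li\xc3\xb1ux.png'.decode('utf-8')       # precomposed n-tilde (NFC)

# the raw strings differ in both length and code points
print(len(decomposed), len(composed))                # 10 9
print(decomposed == composed)                        # False

# once both are in the same normal form, they match
print(normalize('NFC', decomposed) == composed)      # True
print(normalize('NFD', composed) == decomposed)      # True
```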
Android 11 shared storage, unlike both app storage and other platforms,
may fail when writing the composed version of such filenames:
# equivalents are equated
>>> os.chdir('/sdcard/temp')
>>> for f in glob.glob('AAA*'): os.remove(f)
>>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
>>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
>>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
...
AAA-Liñux => b'AAA-Lin\xcc\x83ux'
# composed form alone fails
>>> for f in glob.glob('AAA*'): os.remove(f)
>>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'AAA-Liñux'
This is a defect of unknown scope in Android 11 (it was seen only on a Samsung
device), and does not occur in app-specific storage (which is a fallback option).
It also may occur only if both forms are first created and mapped together as
above; oddly, the composed form has been seen to work by itself otherwise.
But if your non-ASCII filenames fail in copies or syncs to shared storage, run
this script from a console command line as described above, to adjust filenames
as needed for a target device prior to copies or syncs. For instance, run it to
convert any NFC names in your content to NFD for Android 11 shared storage.
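To make the conversion test concrete, here is a small sketch of the per-name
check: a name needs conversion when normalizing it to the target form changes
it. The needs_conversion helper and the sample names are hypothetical, used
only for illustration:

```python
from unicodedata import normalize

def needs_conversion(name, form):
    # hypothetical helper: True if name is not already in the given form
    return normalize(form, name) != name

# ASCII-only, composed (NFC), and decomposed (NFD) sample names
names = ['plain.txt', 'Li\u00f1ux.png', 'Lin\u0303ux.png']

for form in ('NFC', 'NFD'):
    print(form, [needs_conversion(n, form) for n in names])

# NFC [False, False, True]
# NFD [False, True, False]
```

ASCII-only names are unchanged by every normal form, so they are never
reported.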
This script changes only filenames, not file content. When run with the optional
listonly argument, it displays filenames that require conversion, without making
changes; this might be used to check for duplicates that differ only in Unicode
filename form. Also note that Mergeall 3.3's automatic Unicode normalization for
filename matching during syncs works without running this script.
Border case: this script skips (with a message) the unconverted-name version of
files having multiple differing name forms in the same folder. An NFD-form name,
for example, is skipped if an equivalent NFC-form name already exists during a
conversion to NFC. Otherwise, the rename may fail on some platforms and overwrite
content on others. This is technically a filesystem error state, because the
Unicode standard defines such multiply-represented filenames to be canonically
equivalent. If this occurs anyhow, this script avoids overwrites.
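The border case can be illustrated outside the script with a rough sketch that
groups the names in one folder listing by their normalized form; any group
with more than one member is a set of equivalent names in differing forms. The
find_equivalent_names helper and the sample listing are assumptions made for
this example, not part of the script:

```python
from collections import defaultdict
from unicodedata import normalize

def find_equivalent_names(names, form='NFC'):
    # hypothetical helper: group names that normalize to the same string
    groups = defaultdict(list)
    for name in names:
        groups[normalize(form, name)].append(name)
    # keep only the groups holding more than one spelling
    return {key: group for key, group in groups.items() if len(group) > 1}

# composed and decomposed spellings of one name, plus an unrelated file
listing = ['Li\u00f1ux.png', 'Lin\u0303ux.png', 'notes.txt']

dups = find_equivalent_names(listing)
print(len(dups))                                     # 1
for key, group in dups.items():
    print([name.encode('utf8') for name in group])
```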
Caveat: filename behavior is platform dependent, but this script has proved to
work well in usage so far (see its macOS results at the bottom of this file).
The Android 11 bug may also be temporary, but such repairs tend to languish.
=======================================================================================
"""
# this is based on Mergeall's fix-nonportable-filenames.py
import sys, os
from unicodedata import normalize
help = 'Usage: python3 convertunicode.py folderrootpath unicodeformat listonly?'
normalforms = ('NFC', 'NFD', 'NFKC', 'NFKD')
# get args
try:
    root = sys.argv[1]
    assert os.path.isdir(root)
    norm = sys.argv[2]
    assert norm in normalforms
    listonly = len(sys.argv) > 3
except (IndexError, AssertionError):        # missing or invalid argument
    print(help)
    sys.exit(1)
# verify run
intent = 'filenames not in %s Unicode format' % norm
if listonly:
    print('FINDING ' + intent)
else:
    print('REPLACING ' + intent)
if input('Continue (y or n)? ').lower() not in ['y', 'yes']:
    print('Run aborted.')
    sys.exit(1)
print()
# walk folder tree
numrenames = numfiles = numdirs = 0
for (thisdir, subshere, fileshere) in os.walk(root):    # for all folders in tree
    numdirs += len(subshere)                            # fileshere includes symlinks
    numfiles += len(fileshere)
    allhere = subshere + fileshere                      # subfolders + (files+links)

    for name in allhere:                                # for all items at this level
        normname = normalize(norm, name)
        if name != normname:

            # skip variants in same folder
            if normname in allhere:
                msg = 'Skipped same name in multiple forms'
                print('%s: %s in: %s' %
                      (msg, name, thisdir), end='\n\n')
                continue

            # rename to target format
            newpath = os.path.join(thisdir, normname)   # topdown: mods parents first
            oldpath = os.path.join(thisdir, name)
            print('', oldpath.encode('utf8'), '=>\n',
                  newpath.encode('utf8'), end='\n\n')

            # rename file or dir
            numrenames += 1
            if not listonly:
                os.rename(oldpath, newpath)

            # tell the walker about the new name for the next step
            if name in subshere and not listonly:
                subshere.remove(name)
                subshere.append(normname)   # okay to change in-place: "+" made a copy

action = 'but unchanged' if listonly else 'and changed'
print('Visited %d file(s) and %d folder(s)' % (numfiles, numdirs))
print('Total variant names found %s: %d' % (action, numrenames))
"""
===============================================================================
EXAMPLE OUTPUT:
# Convert to composed
$ python3 $C/android-deltas-sync/_etc/convertunicode.py FROM NFC
REPLACING filenames not in NFC Unicode format
Continue (y or n)? y
b'FROM/dir-Lin\xcc\x83ux' =>
b'FROM/dir-Li\xc3\xb1ux'
b'FROM/file1-Lin\xcc\x83ux.txt' =>
b'FROM/file1-Li\xc3\xb1ux.txt'
b'FROM/file3-Lin\xcc\x83ux.txt' =>
b'FROM/file3-Li\xc3\xb1ux.txt'
b'FROM/link-Lin\xcc\x83ux' =>
b'FROM/link-Li\xc3\xb1ux'
b'FROM/dir-Li\xc3\xb1ux/nested3-Lin\xcc\x83ux.txt' =>
b'FROM/dir-Li\xc3\xb1ux/nested3-Li\xc3\xb1ux.txt'
b'FROM/dir-Li\xc3\xb1ux/nested1-Lin\xcc\x83ux.txt' =>
b'FROM/dir-Li\xc3\xb1ux/nested1-Li\xc3\xb1ux.txt'
Visited 7 file(s) and 1 folder(s)
Total variant names found and changed: 6
# List non-decomposed
$ python3 $C/android-deltas-sync/_etc/convertunicode.py FROM NFD -
FINDING filenames not in NFD Unicode format
Continue (y or n)? y
b'FROM/dir-Li\xc3\xb1ux' =>
b'FROM/dir-Lin\xcc\x83ux'
b'FROM/file1-Li\xc3\xb1ux.txt' =>
b'FROM/file1-Lin\xcc\x83ux.txt'
b'FROM/file3-Li\xc3\xb1ux.txt' =>
b'FROM/file3-Lin\xcc\x83ux.txt'
b'FROM/link-Li\xc3\xb1ux' =>
b'FROM/link-Lin\xcc\x83ux'
b'FROM/dir-Li\xc3\xb1ux/nested3-Li\xc3\xb1ux.txt' =>
b'FROM/dir-Li\xc3\xb1ux/nested3-Lin\xcc\x83ux.txt'
b'FROM/dir-Li\xc3\xb1ux/nested1-Li\xc3\xb1ux.txt' =>
b'FROM/dir-Li\xc3\xb1ux/nested1-Lin\xcc\x83ux.txt'
Visited 7 file(s) and 1 folder(s)
Total variant names found but unchanged: 6
# Forced in code on macOS (which maps equivalent names together)
$ python3 $C/android-deltas-sync/_etc/convertunicode.py FROM NFC
REPLACING filenames not in NFC Unicode format
Continue (y or n)? y
Skipped same name in multiple forms: dir-Liñux in: FROM
Visited 7 file(s) and 1 folder(s)
Total variant names found and changed: 0
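
A list-only pass like the second run above can be approximated in a few lines.
This is a rough sketch rather than the script's own code: list_variants is a
hypothetical helper, and the temporary folder and its filenames are invented
for illustration. Results depend on how the host filesystem stores names, so
no output is shown here:

```python
import os
import tempfile
from unicodedata import normalize

def list_variants(root, form):
    # hypothetical helper: collect (oldpath, newpath) pairs without renaming
    pairs = []
    for thisdir, subshere, fileshere in os.walk(root):
        for name in subshere + fileshere:
            normname = normalize(form, name)
            if normname != name:
                pairs.append((os.path.join(thisdir, name),
                              os.path.join(thisdir, normname)))
    return pairs

# build a throwaway tree holding one decomposed (NFD) name and one ASCII name
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, 'file1-Lin\u0303ux.txt'), 'w').close()
    open(os.path.join(root, 'plain.txt'), 'w').close()
    for old, new in list_variants(root, 'NFC'):
        print(os.path.basename(old).encode('utf8'), '=>',
              os.path.basename(new).encode('utf8'))
```

Because nothing is renamed, this can be run safely against real content to
preview what a conversion would touch.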
===============================================================================
"""