File: android-deltas-sync/_etc/convertunicode.py
#!/usr/bin/env python3
"""
=======================================================================================
convertunicode.py:
  Rename all items in a folder tree which are not already in a chosen Unicode format.

Version:  an Android Deltas Sync utility, added Dec-2021 [1.1]
License:  provided freely, but with no warranties of any kind
Author:   © M. Lutz (https://learning-python.com), 2021
Runs on:  any Python 3.X, and any host platform

USAGE

  $ python3 convertunicode.py folderrootpath unicodeformat listonly?

  Where:
    - folderrootpath is the pathname of the tree to be converted
    - unicodeformat is the required Unicode format: NFC, NFD, NFKC, or NFKD
    - listonly is any text; if present, show names to be renamed without renaming

PURPOSE

This is a rarely needed utility script. It's a work-around for an atypical
use case in which non-ASCII filenames may generate errors on Android 11.

Per recent research conducted for the underlying Mergeall system (at
https://learning-python.com/mergeall.html), shared storage in Android 11 may
be unable to store files having non-ASCII Unicode filenames recorded in the
composed (e.g., NFC) code-point format.

For background coverage, see Mergeall 3.3's new module and release note, as
well as the overview page on Wikipedia:

  https://learning-python.com/mergeall-products/unzipped/fixunicodedups.py
  https://learning-python.com/mergeall-products/unzipped/docetc/MoreDocs/Revisions.html#version33
  https://en.wikipedia.org/wiki/Unicode_equivalence

Search for "# ANDROID 11 BUG!" in the first of these.

In short, Unicode allows the same name to be represented by different
code-point sequences:

  >>> l = b'Lin\xcc\x83ux.png'.decode('utf-8')    # decomposed form (NFD)
  >>> m = b'Li\xc3\xb1ux.png'.decode('utf-8')     # composed form (NFC)
  >>> l, m
  ('Liñux.png', 'Liñux.png')
  >>> l == m                                      # equivalent but different
  False

Android 11 shared storage, unlike both app storage and other platforms,
may fail when writing the composed version of such filenames:

  # equivalents are equated
  >>> os.chdir('/sdcard/temp')
  >>> for f in glob.glob('AAA*'): os.remove(f)
  >>> open(b'AAA-Lin\xcc\x83ux'.decode('utf8'), 'w').close()
  >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
  >>> for f in glob.glob('AAA*'): print(f, '=>', f.encode('utf8'))
  ...
  AAA-Liñux => b'AAA-Lin\xcc\x83ux'

  # composed form alone fails
  >>> for f in glob.glob('AAA*'): os.remove(f)
  >>> open(b'AAA-Li\xc3\xb1ux'.decode('utf8'), 'w').close()
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  FileNotFoundError: [Errno 2] No such file or directory: 'AAA-Liñux'

This is a defect of unknown scope in Android 11 (it was seen only on a
Samsung device), and does not occur in app-specific storage (which is a
fallback option). It also may occur only if both forms are first created
and mapped together as above; oddly, the composed form has been seen to
work by itself otherwise.

But if your non-ASCII filenames fail in copies or syncs to shared storage,
run this script from a console command line as described above, to adjust
filenames as needed for a target device prior to copies or syncs. For
instance, run to convert any NFC names in your content to NFD for Android 11
shared storage.

This script changes only filenames, not file content. When run with the
optional listonly argument, it displays filenames that require conversion,
without making changes; this might be used to check for duplicates that
differ only in Unicode filename form.
Also note that running this is not required for Mergeall 3.3's automatic
Unicode normalization for filename matching used during syncs.

Border case: this script skips (with a message) the unconverted-name version
of files having multiple differing name forms in the same folder. An NFD-form
name, for example, is skipped if an equivalent NFC-form name already exists
during a conversion to NFC. Else, the rename may fail on some platforms, and
overwrite content elsewhere. This is technically a filesystem error state,
because the Unicode standard defines such multiply-represented filenames to be
canonically equivalent. If this occurs anyhow, this script avoids overwrites.

Caveat: filename behavior is platform dependent, but this script has proved to
work well in usage so far (see its macOS results at the bottom of this file).
The Android 11 bug may also be temporary, but such repairs tend to languish.
=======================================================================================
"""

# this is based on Mergeall's fix-nonportable-filenames.py

import sys, os
from unicodedata import normalize

help = 'Usage: python3 convertunicode.py folderrootpath unicodeformat listonly?'
normalforms = ('NFC', 'NFD', 'NFKC', 'NFKD')

# get args
try:
    root = sys.argv[1]
    assert os.path.isdir(root)
    norm = sys.argv[2]
    assert norm in normalforms
    listonly = len(sys.argv) > 3
except:
    print(help)
    sys.exit(1)

# verify run
intent = 'filenames not in %s Unicode format' % norm
if listonly:
    print('FINDING ' + intent)
else:
    print('REPLACING ' + intent)
if input('Continue (y or n)? ').lower() not in ['y', 'yes']:
    print('Run aborted.')
    sys.exit(1)
print()

# walk folder tree
numrenames = numfiles = numdirs = 0
for (thisdir, subshere, fileshere) in os.walk(root):    # for all folders in tree
    numdirs  += len(subshere)                           # fileshere includes symlinks
    numfiles += len(fileshere)
    allhere = subshere + fileshere                      # subfolders + (files+links)

    for name in allhere:                                # for all items at this level
        normname = normalize(norm, name)
        if name != normname:

            # skip variants in same folder
            if normname in allhere:
                msg = 'Skipped same name in multiple forms'
                print('%s: %s in: %s' % (msg, name, thisdir), end='\n\n')
                continue

            # rename to target format
            newpath = os.path.join(thisdir, normname)   # topdown: mods parents first
            oldpath = os.path.join(thisdir, name)
            print('', oldpath.encode('utf8'), '=>\n', newpath.encode('utf8'), end='\n\n')

            # rename file or dir
            numrenames += 1
            if not listonly:
                os.rename(oldpath, newpath)

            # tell the walker about the new name for the next step
            if name in subshere and not listonly:
                subshere.remove(name)
                subshere.append(normname)    # okay to change in-place: "+" made a copy

action = 'but unchanged' if listonly else 'and changed'
print('Visited %d file(s) and %d folder(s)' % (numfiles, numdirs))
print('Total variant names found %s: %d' % (action, numrenames))



"""
===============================================================================
EXAMPLE OUTPUT:


# Convert to composed

$ python3 $C/android-deltas-sync/_etc/convertunicode.py FROM NFC
REPLACING filenames not in NFC Unicode format
Continue (y or n)? y

 b'FROM/dir-Lin\xcc\x83ux' =>
 b'FROM/dir-Li\xc3\xb1ux'

 b'FROM/file1-Lin\xcc\x83ux.txt' =>
 b'FROM/file1-Li\xc3\xb1ux.txt'

 b'FROM/file3-Lin\xcc\x83ux.txt' =>
 b'FROM/file3-Li\xc3\xb1ux.txt'

 b'FROM/link-Lin\xcc\x83ux' =>
 b'FROM/link-Li\xc3\xb1ux'

 b'FROM/dir-Li\xc3\xb1ux/nested3-Lin\xcc\x83ux.txt' =>
 b'FROM/dir-Li\xc3\xb1ux/nested3-Li\xc3\xb1ux.txt'

 b'FROM/dir-Li\xc3\xb1ux/nested1-Lin\xcc\x83ux.txt' =>
 b'FROM/dir-Li\xc3\xb1ux/nested1-Li\xc3\xb1ux.txt'

Visited 7 file(s) and 1 folder(s)
Total variant names found and changed: 6


# List non-decomposed

$ python3 $C/android-deltas-sync/_etc/convertunicode.py FROM NFD -
FINDING filenames not in NFD Unicode format
Continue (y or n)? y

 b'FROM/dir-Li\xc3\xb1ux' =>
 b'FROM/dir-Lin\xcc\x83ux'

 b'FROM/file1-Li\xc3\xb1ux.txt' =>
 b'FROM/file1-Lin\xcc\x83ux.txt'

 b'FROM/file3-Li\xc3\xb1ux.txt' =>
 b'FROM/file3-Lin\xcc\x83ux.txt'

 b'FROM/link-Li\xc3\xb1ux' =>
 b'FROM/link-Lin\xcc\x83ux'

 b'FROM/dir-Li\xc3\xb1ux/nested3-Li\xc3\xb1ux.txt' =>
 b'FROM/dir-Li\xc3\xb1ux/nested3-Lin\xcc\x83ux.txt'

 b'FROM/dir-Li\xc3\xb1ux/nested1-Li\xc3\xb1ux.txt' =>
 b'FROM/dir-Li\xc3\xb1ux/nested1-Lin\xcc\x83ux.txt'

Visited 7 file(s) and 1 folder(s)
Total variant names found but unchanged: 6


# Forced in code on macOS (which maps equivalent names together)

$ python3 $C/android-deltas-sync/_etc/convertunicode.py FROM NFC
REPLACING filenames not in NFC Unicode format
Continue (y or n)? y

Skipped same name in multiple forms: dir-Liñux in: FROM

Visited 7 file(s) and 1 folder(s)
Total variant names found and changed: 0
===============================================================================
"""
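
The border case described in the script's docstring (one name present in several Unicode forms in the same folder) can be checked without touching the filesystem. The following is a minimal sketch, not part of the script itself: the helper name `find_equivalent_names` and the sample filenames are hypothetical, and it assumes that grouping names by their normalized form is an adequate test for equivalence.

```python
from collections import defaultdict
from unicodedata import normalize

def find_equivalent_names(names, form='NFC'):
    """Hypothetical helper: group names that become identical once normalized to form."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(form, name)].append(name)
    # keep only groups holding more than one distinct code-point spelling
    return {key: variants for (key, variants) in groups.items()
            if len(set(variants)) > 1}

decomposed = b'Lin\xcc\x83ux.png'.decode('utf-8')    # n + combining tilde (NFD)
composed   = b'Li\xc3\xb1ux.png'.decode('utf-8')     # precomposed n-tilde (NFC)

clashes = find_equivalent_names([decomposed, composed, 'plain.txt'])
for (key, variants) in clashes.items():
    # show bytes, as the script does, so the differing forms are visible
    print(key.encode('utf8'), '<=', [v.encode('utf8') for v in variants])
```

Passing the result of `os.listdir(folder)` as `names` would apply the same check to a real folder; on platforms that map equivalent names together (as the macOS run above suggests), such clashes may never appear in a listing.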