File: mergeall-products/unzipped/skipcruft.py
"""
==================================================================================
skipcruft.py: skip system cruft files in both FROM and TO
(part of the mergeall system [3.0])

--See mergeall_configs.py for the cruft patterns used here--

Given a folder listing (in one of two forms), filter out filenames matching
cruft (a.k.a. metadata) patterns defined in mergeall_configs.py.  This code
is factored here because it is used by programs mergeall, diffall, and cpall
(including the latter's copytree() called from mergeall for bulk copies of
unique folders in FROM).  Any tuning here will be picked up by all three.

Provides variants for the two common directory listers.  mergeall uses either
os.listdir() or os.scandir(), depending on Python version.  diffall uses only
os.listdir(), as its time is dominated by reading files in full.  cpall also
uses os.listdir() because its file writes far outweigh its code's speed.

CODING NOTES:

Uses comprehensions for speed; any() is a nested loop, but exits (i.e., short
circuits) on the first True result from an iterator:

>>> def f1(): print('F1'); return False
>>> def f2(): print('F2'); return True
>>> def f3(): print('F3'); return True
>>> x = any(f() for f in [f1, f2, f3])
F1
F2
>>> x
True

Speed matters here, because every filename in both trees must be run through
the filters (there can be over 50k files per tree in an author's use case).
List comprehensions and built-ins like any() generally beat manual code, but
performance tuning for alternative codings is an open exercise.  If this proves
slow, fnmatch can be replaced by '==' for literal filenames.  Most likely,
file-system call time will overshadow gains from tuning here.

UPDATE: in mergeall, an 87G 58k-file archive compares in 8 seconds without
cruft skipping and 11 with - a trivial and acceptable 3-second penalty.
There is no noticeable speed difference in the IO-bound diffall or cpall.
Code of the following sort could be used to print filtered cruft item names,
but was deemed to be useful for testing only, given the many clients (TBD?):

    if filtered != filedirentrys:
        print('???',
            sorted( set(dirent.name for dirent in filedirentrys) -
                    set(dirent.name for dirent in filtered)))

CASE SENSITIVITY:

By itself, fnmatch.fnmatch() is case-insensitive on Windows, but not on Mac or
Linux: "Desktop.ini" matches "desktop.ini" on Windows only.  When running on
Unix, this doesn't make sense for detecting Windows cruft; it may for Unix's
own cruft files, but it seems very unlikely that a cruft filename like ".Trash"
would be used for non-cruft roles with a different case.

Here, case-insensitive matching is forced by default, by converting both
filename and pattern to lower case, and using fnmatch.fnmatchcase() which does
no case mapping itself.  Using different matchers for Unix and Windows patterns
may work too, but seems complex overkill here.  If this proves to be an issue,
set CRUFTCASENEUTRAL to False, and code Windows patterns to allow for all cases
on Unix too (e.g., "[dD]esktop.ini").

SEE ALSO:

See mergeall_configs.py for more details on cruft processing and patterns.
See nuke-cruft-files.py for an on-demand, brute-force cruft cleaner script.
==================================================================================
"""

#
# Get changeable patterns from user-configs file
#

try:
    from mergeall_configs import CRUFTCASENEUTRAL
    from mergeall_configs import CRUFT_SKIP, CRUFT_KEEP
except Exception as why:
    # this must be fatal, else mergeall may copy/delete/remove cruft files
    print('**Error in config file:', why)
    print('Cruft file patterns cannot be used - exiting program.')
    raise

#
# Neutralize case: by default (CRUFTCASENEUTRAL), use case-insensitive
# matching on all platforms; else defer to fnmatch's platform-specific rule.
#

from fnmatch import fnmatch         # case-mapping, but Windows only!
from fnmatch import fnmatchcase     # non-case-mapping version

if CRUFTCASENEUTRAL:
    def match(filename, cruftpatt):
        """
        neutralize case differences here, on all platforms
        """
        return fnmatchcase(filename.lower(), cruftpatt.lower())
else:
    # neutralize case differences in fnmatch, on Windows only
    match = fnmatch

#
# Main tools
#

def filterCruftNames(filenames):
    """
    filter out cruft, for file/folder name strings returned by os.listdir();
    returns list of non-cruft items: either on keep or not on skip lists;
    """
    return [filename for filename in filenames
                if any(match(filename, cruftpatt) for cruftpatt in CRUFT_KEEP)
                   or
                   not any(match(filename, cruftpatt) for cruftpatt in CRUFT_SKIP)
           ]

def filterCruftDirentrys(filedirentrys):
    """
    filter out cruft, for file/folder dirent objects returned by 3.5+ os.scandir();
    returns list of non-cruft items: either on keep or not on skip lists;
    """
    return [filedirentry for filedirentry in filedirentrys
                if any(match(filedirentry.name, cruftpatt) for cruftpatt in CRUFT_KEEP)
                   or
                   not any(match(filedirentry.name, cruftpatt) for cruftpatt in CRUFT_SKIP)
           ]
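
#
# Usage sketch (not part of the original module): a minimal, standalone
# illustration of the keep-or-not-skip rule and the case-neutral matching
# described above.  The names DEMO_SKIP, DEMO_KEEP, demo_match, and demo_filter
# are hypothetical, and the patterns are assumed for illustration only; the
# real CRUFT_SKIP and CRUFT_KEEP lists live in mergeall_configs.py and may differ.
#

```python
from fnmatch import fnmatchcase

# Hypothetical patterns, for illustration only (not the real config lists)
DEMO_SKIP = ['.*', 'Desktop.ini', 'Thumbs.db', '~*']
DEMO_KEEP = ['.htaccess']

def demo_match(filename, cruftpatt):
    # case-neutral on every platform: lower both sides, then match as-is
    return fnmatchcase(filename.lower(), cruftpatt.lower())

def demo_filter(filenames, keep=DEMO_KEEP, skip=DEMO_SKIP):
    # an item survives if it is on the keep list, or is not on the skip list
    return [name for name in filenames
            if any(demo_match(name, patt) for patt in keep)
            or not any(demo_match(name, patt) for patt in skip)]

listing = ['notes.txt', '.DS_Store', 'DESKTOP.INI', '.htaccess', 'Photos']
kept = demo_filter(listing)
print(kept)
```

# In this sketch the keep list wins over the skip list: '.htaccess' matches the
# '.*' skip pattern but survives because it is also on the assumed keep list.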