File: mergeall-products/unzipped/skipcruft.py

"""
==================================================================================
skipcruft.py:
  skip system cruft files in both FROM and TO (part of the mergeall system [3.0])

--See mergeall_configs.py for the cruft patterns used here--

Given a folder listing (in one of two forms), filter out filenames matching
cruft (a.k.a. metadata) patterns defined in mergeall_configs.py.  This code
is factored here because it is used by programs mergeall, diffall, and cpall
(including the latter's copytree() called from mergeall for bulk copies of
unique folders in FROM).  Any tuning here will be picked up by all three.

Provides variants for the two common directory listers.  mergeall uses either
os.listdir() or os.scandir(), depending on Python version.  diffall uses only
os.listdir(), as its time is dominated by reading files in full.  cpall also
uses os.listdir() because its file writes far outweigh its code's speed.
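
A minimal sketch of the two variants side by side.  The SKIP and KEEP lists
and the is_cruft() helper are hypothetical stand-ins for illustration only;
the real patterns live in mergeall_configs.py and may differ:

```python
import os
import tempfile
from fnmatch import fnmatchcase

# Hypothetical stand-ins for the patterns in mergeall_configs.py
SKIP = ['.*', 'Thumbs.db', 'desktop.ini']
KEEP = ['.htaccess']

def is_cruft(name):
    # cruft = matches a skip pattern and no keep pattern (case-neutral)
    low = name.lower()
    kept = any(fnmatchcase(low, patt.lower()) for patt in KEEP)
    skipped = any(fnmatchcase(low, patt.lower()) for patt in SKIP)
    return skipped and not kept

def demo():
    with tempfile.TemporaryDirectory() as folder:
        for name in ['notes.txt', '.DS_Store', '.htaccess', 'thumbs.db']:
            open(os.path.join(folder, name), 'w').close()

        # Variant 1: os.listdir() yields plain name strings
        names = sorted(n for n in os.listdir(folder) if not is_cruft(n))

        # Variant 2: os.scandir() yields DirEntry objects; filter on .name
        with os.scandir(folder) as entries:
            kept = [e for e in entries if not is_cruft(e.name)]
        entry_names = sorted(e.name for e in kept)

    return names, entry_names

names, entry_names = demo()
print(names)
print(entry_names)
```

Both variants should report only 'notes.txt' and '.htaccess': the latter
matches the skip pattern '.*' but is rescued by the keep list.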

CODING NOTES:

Uses comprehensions for speed; any() is a nested loop, but exits (i.e., short
circuits) on the first True result from an iterator:

>>> def f1(): print('F1'); return False
>>> def f2(): print('F2'); return True
>>> def f3(): print('F3'); return True

>>> x = any(f() for f in [f1, f2, f3])
F1
F2
>>> x
True

Speed matters here, because every filename in both trees must be run through
the filters (there can be over 50k files per tree in the author's use case).
List comprehensions and built-ins like any() generally beat manual code,
but performance tuning for alternative codings is an open exercise.  If
this proves slow, fnmatch can be replaced by '==' for literal filenames.
Most likely, file-system call time will overshadow gains from tuning here.
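
One way the '==' idea above might be coded, as an untimed sketch only.  The
SKIP list and all names here are hypothetical, not part of this module, and
since fnmatch caches its compiled patterns the gain may be small:

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; the real lists are in mergeall_configs.py
SKIP = ['.*', 'Thumbs.db', 'desktop.ini', '*.pyc']

# Characters that give a pattern wildcard meaning in fnmatch
WILDCARDS = set('*?[')

# Split once up front: literal names go in a set, the rest stay patterns
literal_skips = {p.lower() for p in SKIP if not WILDCARDS & set(p)}
wildcard_skips = [p.lower() for p in SKIP if WILDCARDS & set(p)]

def is_skipped(name):
    low = name.lower()
    if low in literal_skips:        # '==' style test, via set membership
        return True
    return any(fnmatchcase(low, patt) for patt in wildcard_skips)

print([n for n in ['Desktop.ini', 'a.PYC', '.git', 'readme.txt']
       if is_skipped(n)])
```

Only patterns with no '*', '?', or '[' are treated as literals, so the
result should be the same as running every pattern through fnmatchcase().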

UPDATE: in mergeall, an 87G 58k-file archive compares in 8 seconds without
cruft skipping and 11 with - a trivial and acceptable 3-second penalty.
There is no noticeable speed difference in the IO-bound diffall or cpall.

Code of the following sort could be used to print filtered cruft item names,
but was deemed useful for testing only, given the many clients (TBD?):
    if filtered != filedirentrys:
        print('???', sorted(
            set(dirent.name for dirent in filedirentrys) -
            set(dirent.name for dirent in filtered)))

CASE SENSITIVITY:

By itself, fnmatch.fnmatch() is case-insensitive on Windows, but not on
Mac or Linux: "Desktop.ini" matches "desktop.ini" on Windows only.  When
running on Unix, this doesn't make sense for detecting Windows cruft;
it may for Unix's own cruft files, but it seems very unlikely that a cruft
filename like ".Trash" would be used for non-cruft roles with a different case.

Here, case-insensitive matching is forced by default, by converting both
filename and pattern to lower case, and using fnmatch.fnmatchcase() which
does no case mapping itself.  Using different matchers for Unix and Windows
patterns may work too, but seems complex overkill here.  If this proves to
be an issue, set CRUFTCASENEUTRAL to False, and code Windows patterns to
allow for all cases on Unix too (e.g., "[dD]esktop.ini").
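
The difference between the two approaches can be seen in a short sketch;
match_neutral() is an illustrative name that mirrors the case-neutral
matcher coded below, not an additional tool in this module:

```python
from fnmatch import fnmatchcase

def match_neutral(filename, cruftpatt):
    # lower-case both sides, then match without any platform case mapping
    return fnmatchcase(filename.lower(), cruftpatt.lower())

# Case-neutral: matches on every platform
print(match_neutral('Desktop.ini', 'desktop.ini'))      # True
print(match_neutral('DESKTOP.INI', 'Desktop.ini'))      # True

# Strict fnmatchcase: case differences do not match
print(fnmatchcase('Desktop.ini', 'desktop.ini'))        # False

# Pattern coded to allow both cases of the first letter only
print(fnmatchcase('Desktop.ini', '[dD]esktop.ini'))     # True
print(fnmatchcase('desktop.ini', '[dD]esktop.ini'))     # True
print(fnmatchcase('DESKTOP.INI', '[dD]esktop.ini'))     # False
```

The last line shows the cost of the bracket approach: each letter that may
vary in case needs its own character class in the pattern.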

SEE ALSO:

See mergeall_configs.py for more details on cruft processing and patterns.
See nuke-cruft-files.py for an on-demand, brute-force cruft cleaner script.
==================================================================================
"""

#
# Get changeable patterns from user-configs file
#

try:
    from mergeall_configs import CRUFTCASENEUTRAL
    from mergeall_configs import CRUFT_SKIP, CRUFT_KEEP
except Exception as why:
    # this must be fatal, else mergeall may copy/delete/remove cruft files
    print('**Error in config file:', why)
    print('Cruft file patterns cannot be used - exiting program.')
    raise


#
# Neutralize case: by default (CRUFTCASENEUTRAL), use case-insensitive
# matching on all platforms; else defer to fnmatch's per-platform behavior.
#

from fnmatch import fnmatch       # case-mapping, but Windows only!
from fnmatch import fnmatchcase   # non-case-mapping version

if CRUFTCASENEUTRAL:
    def match(filename, cruftpatt):
        """
        neutralize case differences here, on all platforms
        """
        return fnmatchcase(filename.lower(), cruftpatt.lower())
else:
    # neutralize case differences in fnmatch, on Windows only
    match = fnmatch


#
# Main tools
#


def filterCruftNames(filenames):
    """
    filter out cruft, for file/folder name strings returned by os.listdir();
    returns list of non-cruft items: either on keep or not on skip lists;
    """
    return [filename for filename in filenames 
                if any(match(filename, cruftpatt) for cruftpatt in CRUFT_KEEP)
                or
                not any(match(filename, cruftpatt) for cruftpatt in CRUFT_SKIP)
           ]


def filterCruftDirentrys(filedirentrys):
    """
    filter out cruft, for file/folder dirent objects returned by 3.5+ os.scandir();
    returns list of non-cruft items: either on keep or not on skip lists;
    """
    return [filedirentry for filedirentry in filedirentrys 
                if any(match(filedirentry.name, cruftpatt) for cruftpatt in CRUFT_KEEP)
                or
                not any(match(filedirentry.name, cruftpatt) for cruftpatt in CRUFT_SKIP)
           ]


