Identifying and packaging only the necessary Python files from a large, un-packaged library for distribution can be achieved through static code analysis using Python’s ast (Abstract Syntax Tree) module. This approach extracts the explicit import statements from each file and resolves them to file paths within your library directory, without executing any of the code. Imports performed dynamically (for example, via importlib.import_module) are not visible to it.

The Problem

Developers often maintain a collection of utility scripts or tools that draw on a shared codebase which, for convenience, has never been structured into formal Python packages (i.e., it has no __init__.py files or setup.py definitions). The difficulty comes when distributing an individual tool: it should ship with only the subset of the shared library that it imports, directly or transitively. Tracking these dependencies by hand is error-prone, and solutions like Nuitka or trydep may not align with the distribution goals or the existing un-packaged structure.

The Solution

The following Python script uses the ast module to parse a target script, collect its import statements, and resolve them transitively within a specified library root directory. It returns a sorted list of absolute paths covering the start file and every library file it depends on.

import ast
import os
import sys
from collections import deque

def resolve_module_path(base_dir, import_name):
    """
    Attempts to resolve an import name (e.g., 'my_lib.utils', 'helper')
    to an absolute file path within the specified base directory.
    This approximates Python's import logic for simple file-based modules;
    it returns None for anything that does not map to a file under base_dir.
    """
    parts = import_name.split('.')
    candidates = [parts]
    # Imports are often written with the library's own folder name as the
    # first component (e.g. 'my_library.utils' with base_dir '.../my_library').
    # In that case, also try the name with that leading component removed.
    root_name = os.path.basename(os.path.normpath(base_dir))
    if len(parts) > 1 and parts[0] == root_name:
        candidates.append(parts[1:])

    for candidate in candidates:
        module_path = os.path.join(base_dir, *candidate)
        # A plain module file: base_dir/a/b.py
        if os.path.isfile(module_path + '.py'):
            return os.path.abspath(module_path + '.py')
        # A regular package, if the directory happens to contain __init__.py
        init_path = os.path.join(module_path, '__init__.py')
        if os.path.isfile(init_path):
            return os.path.abspath(init_path)

    # A bare directory (no __init__.py) has no file of its own to collect.
    return None

class DependencyCollector(ast.NodeVisitor):
    """
    Collects the library files referenced by 'import' and
    'from ... import ...' statements.
    """
    def __init__(self, library_root):
        self.library_root = library_root
        self.dependencies = set()
        # Modules compiled into the interpreter (e.g. 'sys') never map to a file.
        self.system_modules = set(sys.builtin_module_names)

    def visit_Import(self, node):
        for alias in node.names:
            self._process_import_name(alias.name)
        self.generic_visit(node)

    def visit_ImportFrom(self, node):
        # node.module is None for `from . import foo`, 'my_module' for
        # `from .my_module import foo`, and 'pkg.mod' for `from pkg.mod import foo`.
        # node.level counts the leading dots. This collector does not know the
        # importing file's location, so relative imports are approximated by
        # resolving them from library_root, like absolute imports.
        if node.module:
            self._process_import_name(node.module)
        for alias in node.names:
            if alias.name == '*':
                continue
            # `from my_library import helpers` may name a submodule
            # (my_library/helpers.py) rather than an attribute, so try that too.
            # Names that are plain attributes simply fail to resolve.
            if node.module:
                self._process_import_name(f"{node.module}.{alias.name}")
            else:
                self._process_import_name(alias.name)
        self.generic_visit(node)

    def _process_import_name(self, import_name):
        # Skip interpreter built-ins early. Other standard library and
        # third-party modules are filtered implicitly: they do not resolve
        # to a file under library_root.
        if '.' not in import_name and import_name in self.system_modules:
            return

        resolved_path = resolve_module_path(self.library_root, import_name)
        if resolved_path:
            self.dependencies.add(resolved_path)
def discover_dependencies(start_file_path, library_root_path):
    """
    Discovers all direct and transitive Python file dependencies for a given
    start_file_path within the specified library_root_path.
    """
    start_file_path = os.path.abspath(start_file_path)
    library_root_path = os.path.abspath(library_root_path)

    if not os.path.exists(start_file_path):
        raise FileNotFoundError(f"Start file not found: {start_file_path}")
    if not os.path.isdir(library_root_path):
        raise NotADirectoryError(f"Library root not found: {library_root_path}")

    queue = deque([start_file_path])
    processed_files = set()
    all_dependencies = set([start_file_path]) # Include the starting file itself

    while queue:
        current_file = queue.popleft()
        if current_file in processed_files:
            continue
        processed_files.add(current_file)

        try:
            with open(current_file, 'r', encoding='utf-8') as f:
                tree = ast.parse(f.read(), filename=current_file)
        except Exception as e:
            print(f"Error parsing {current_file}: {e}", file=sys.stderr)
            continue

        collector = DependencyCollector(library_root_path)
        collector.visit(tree)

        for dep_path in collector.dependencies:
            if dep_path not in all_dependencies:
                all_dependencies.add(dep_path)
                queue.append(dep_path)

    return sorted(all_dependencies)

# Example Usage:
if __name__ == "__main__":
    # Create a dummy library and tool for demonstration
    # You would replace these with your actual paths.

    # Setup dummy environment:
    # my_library/
    #   utils.py
    #     - imports my_library.helpers
    #   helpers.py
    #     - imports my_library.data_ops
    #   data_ops.py
    #   subfolder/
    #     another_util.py
    #
    # my_tool.py
    #   - imports my_library.utils
    #   - imports my_library.subfolder.another_util

    temp_dir = "temp_library_and_tool_for_demo"
    os.makedirs(os.path.join(temp_dir, "my_library", "subfolder"), exist_ok=True)

    with open(os.path.join(temp_dir, "my_library", "utils.py"), "w") as f:
        f.write("import os\nfrom my_library import helpers\ndef do_something(): return helpers.helper_func()")
    with open(os.path.join(temp_dir, "my_library", "helpers.py"), "w") as f:
        f.write("from my_library import data_ops\ndef helper_func(): return data_ops.get_data()")
    with open(os.path.join(temp_dir, "my_library", "data_ops.py"), "w") as f:
        f.write("def get_data(): return 'sample data'")
    with open(os.path.join(temp_dir, "my_library", "subfolder", "another_util.py"), "w") as f:
        f.write("def do_more(): return 'more stuff'")
    with open(os.path.join(temp_dir, "my_tool.py"), "w") as f:
        f.write("from my_library import utils\nfrom my_library.subfolder import another_util\ndef main(): print(utils.do_something(), another_util.do_more())\nif __name__ == '__main__': main()")

    library_root = os.path.join(temp_dir, "my_library")
    tool_path = os.path.join(temp_dir, "my_tool.py")

    print(f"Analyzing tool: {tool_path}")
    print(f"Using library root: {library_root}\n")

    try:
        required_files = discover_dependencies(tool_path, library_root)
        print("Required files for distribution:")
        for f in required_files:
            print(f"- {f}")

        # Example: Create a metadata file listing dependencies
        metadata_file = os.path.join(temp_dir, "my_tool_dependencies.txt")
        with open(metadata_file, "w") as meta_f:
            for f in required_files:
                meta_f.write(f + "\n")
        print(f"\nMetadata file created: {metadata_file}")

        # Clean up dummy environment
        import shutil
        shutil.rmtree(temp_dir)

    except (FileNotFoundError, NotADirectoryError) as e:
        print(f"Error: {e}")
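
Once the list of required files is known, a natural next step is to copy them into a staging folder that mirrors the original layout. The sketch below is one possible way to do that, not part of the script above: stage_files, source_root, and dest_dir are hypothetical names, and it assumes every path in the list lies under source_root.

```python
import os
import shutil

def stage_files(file_paths, source_root, dest_dir):
    """Copy each file into dest_dir, keeping its path relative to source_root.

    Hypothetical helper: assumes every path in file_paths is inside
    source_root and raises ValueError otherwise.
    Returns the destination paths in sorted order.
    """
    source_root = os.path.abspath(source_root)
    staged = []
    for path in file_paths:
        rel = os.path.relpath(os.path.abspath(path), source_root)
        # A relative path that climbs upward means the file is outside the root.
        if rel == os.pardir or rel.startswith(os.pardir + os.sep):
            raise ValueError(f"{path} is outside {source_root}")
        target = os.path.join(dest_dir, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(path, target)  # copy2 also preserves timestamps
        staged.append(target)
    return sorted(staged)
```

With the demo layout above, a call such as stage_files(required_files, temp_dir, "dist") would reproduce my_tool.py and the needed my_library files under dist/, provided it runs before the temporary directory is removed.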

Why It Works

  • Abstract Syntax Tree (AST): The ast module parses Python source code into a tree of node objects (instances of ast.AST subclasses such as ast.Import and ast.ImportFrom). This tree represents the code’s structure, allowing programmatic inspection of language constructs like import statements instead of relying on fragile text-based pattern matching (e.g., regular expressions).
  • ast.NodeVisitor: By subclassing ast.NodeVisitor and overriding methods like visit_Import and visit_ImportFrom, the script efficiently traverses the AST. This precisely identifies where modules are being imported, distinguishing them from similar strings in comments or literal values.
  • Recursive Dependency Resolution: The discover_dependencies function utilizes a queue-based approach (breadth-first search) to identify not just the direct dependencies of the starting tool file but also the transitive dependencies of those imported modules. Each newly discovered user-defined module is added to the queue for further analysis until all reachable local dependencies are mapped.
  • Path Resolution (resolve_module_path): This critical helper function translates the module names found in import statements (e.g., my_library.utils) into actual file system paths (e.g., my_library/utils.py). It approximates Python’s import mechanism by joining the module name parts with the library_root and appending .py. This enables the script to locate the physical files corresponding to the imports, even without formal package declarations (__init__.py).
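  • Filtering System Modules: The DependencyCollector skips modules listed in sys.builtin_module_names, which covers modules compiled into the interpreter (e.g., sys). Other standard library modules (e.g., os) and third-party packages are excluded implicitly, because their names do not resolve to a file under library_root. As a result, only files from your specified library_root are collected for distribution.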
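
To make the first two points concrete, here is a small, self-contained sketch, separate from the script above (ImportNames, imported_names, and SAMPLE are illustrative names). It shows that walking the AST picks up only real import statements, including ones nested inside functions, and ignores look-alike text in comments and string literals.

```python
import ast

class ImportNames(ast.NodeVisitor):
    """Records the module name of each import statement, in source order."""
    def __init__(self):
        self.names = []

    def visit_Import(self, node):
        self.names.extend(alias.name for alias in node.names)

    def visit_ImportFrom(self, node):
        # node.level counts leading dots; node.module is None for `from . import x`
        self.names.append("." * node.level + (node.module or ""))

def imported_names(source):
    collector = ImportNames()
    collector.visit(ast.parse(source))
    return collector.names

SAMPLE = '''
# import not_real_1
text = "import not_real_2"
import os.path, json as j
from my_library import helpers
from . import sibling

def lazy():
    import late_module
'''

print(imported_names(SAMPLE))
# ['os.path', 'json', 'my_library', '.', 'late_module']
```

A regular expression searching for the word "import" would also match the comment and the string literal; the parser never produces import nodes for them.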

Reference