3 Technical_Decomp_Harvest
Miguel Jacq edited this page 2025-12-27 20:59:16 -06:00

enroll/harvest.py

harvest.harvest() is the producer:

  • uses UnitInfo / TimerInfo (systemd introspection)
  • uses IgnorePolicy + PathFilter/CompiledPathPattern to decide what files are safe to copy
  • emits ServiceSnapshot, PackageSnapshot, UsersSnapshot, etc.
  • emits ManagedFile and ExcludedFile entries inside each of those snapshots
  • writes everything into state.json, and file copies into artifacts/<role>/...

ManagedFile (dataclass)

Purpose: describes one file that harvest successfully copied into the bundle.

Fields:

  • path: absolute original path on host
  • src_rel: relative path used inside artifacts/<role>/... (almost always path.lstrip("/"))
  • owner, group, mode: captured from stat_triplet()
  • reason: classification string explaining why it was captured. Examples:
    • systemd_dropin, systemd_envfile
    • modified_conffile, modified_packaged_file
    • custom_unowned, custom_specific_path
    • authorized_keys, ssh_public_key
    • usr_local_bin_script, usr_local_etc_custom
    • user_include (from --include-path)
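
As an illustration of the fields above, here is a minimal sketch of what such a dataclass could look like. The field types (in particular mode as a string) and the make_managed_file() helper are assumptions made for this example, not taken from harvest.py; only the field names and the path.lstrip("/") convention come from the description above.

```python
from dataclasses import asdict, dataclass


@dataclass
class ManagedFile:
    path: str      # absolute original path on the host
    src_rel: str   # relative path inside artifacts/<role>/...
    owner: str
    group: str
    mode: str      # assumption: stored as an octal string such as "0644"
    reason: str    # classification string, e.g. "modified_conffile"


def make_managed_file(path, owner, group, mode, reason):
    """Hypothetical helper: derive src_rel the way the text describes."""
    return ManagedFile(
        path=path,
        src_rel=path.lstrip("/"),
        owner=owner,
        group=group,
        mode=mode,
        reason=reason,
    )


mf = make_managed_file(
    "/etc/nginx/nginx.conf", "root", "root", "0644", "modified_conffile"
)
record = asdict(mf)  # plain dict, suitable for serializing into JSON
```

Converting with asdict() is one straightforward way to get a JSON-friendly record; whether harvest.py serializes its dataclasses this way is not stated here.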

Where it's used:

Written into snapshots in state.json.

manifest.py reads these to generate Ansible tasks (copy/template actions).

diff.py reads these to detect changes and to locate the artifact content.


ExcludedFile (dataclass)

Purpose: records a file that was considered but not included, plus why.

Fields:

  • path
  • reason: a concise reason code, typically:
    • user_excluded (PathFilter)
    • ignore policy reasons like denied_path, binary_like, sensitive_content, too_large, unreadable, etc.

Where it's used:

Stored in each snapshot's excluded list in state.json.

Mostly informational (helps explain why something didn't get harvested).
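
Because these entries are informational, a typical use is to tally them by reason. The sketch below assumes a two-field dataclass matching the description above; summarize_exclusions() is a hypothetical helper and the sample paths are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ExcludedFile:
    path: str
    reason: str


def summarize_exclusions(excluded):
    """Hypothetical helper: count excluded files per reason code."""
    return Counter(item.reason for item in excluded)


# Illustrative data only; not taken from a real harvest run.
excluded = [
    ExcludedFile("/etc/example/private.key", "denied_path"),
    ExcludedFile("/etc/example/cache.bin", "binary_like"),
    ExcludedFile("/etc/example/secrets.d/token", "denied_path"),
]
summary = summarize_exclusions(excluded)
```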


ServiceSnapshot (dataclass)

Purpose: captures everything enroll learned about one enabled systemd service unit.

Fields:

  • unit: e.g. nginx.service
  • role_name: derived role name (sanitized service-ish identifier)
  • packages: Debian package names inferred as belonging to the service
  • active_state, sub_state, unit_file_state, condition_result: copied from systemctl show fields via systemd.get_unit_info()
  • managed_files: list of ManagedFile harvested for this role
  • excluded: list of ExcludedFile not harvested
  • notes: warnings or anomalies (e.g. failure to query unit info)

How it's computed in harvest:

  • Enumerate enabled services: systemd.list_enabled_services().
  • For each unit:
    • gather unit metadata (fragment file, dropins, env files, exec paths)
    • infer owning packages via dpkg_owner() on:
      • the unit fragment
      • ExecStart paths
    • consider candidate /etc files from:
      • systemd dropins/envfiles (only under /etc)
      • modified dpkg conffiles or packaged files under /etc (by md5 compare)
      • service-specific “unowned” files under /etc/<hint> trees
    • filter each candidate through:
      • user exclude patterns (PathFilter.is_excluded)
      • IgnorePolicy.deny_reason
      • readability + regular-file checks
    • copy accepted files into artifacts/<role>/<src_rel>
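
The filtering step above can be sketched as a single function that returns a reason code when a candidate must be skipped. This is an assumption-laden illustration: is_excluded and deny_reason are plain callables standing in for PathFilter.is_excluded and IgnorePolicy.deny_reason (whose real signatures are not shown here), and the not_regular_file code is hypothetical.

```python
import os
import stat
import tempfile
from typing import Callable, Optional


def exclusion_reason(
    path: str,
    is_excluded: Callable[[str], bool],
    deny_reason: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return a reason code if the candidate must be skipped, else None."""
    if is_excluded(path):              # user exclude patterns come first
        return "user_excluded"
    denied = deny_reason(path)         # then the ignore policy
    if denied:
        return denied
    try:
        st = os.lstat(path)
    except OSError:
        return "unreadable"
    if not stat.S_ISREG(st.st_mode):   # symlinks, dirs, devices are skipped
        return "not_regular_file"      # hypothetical reason code
    if not os.access(path, os.R_OK):
        return "unreadable"
    return None


with tempfile.TemporaryDirectory() as tmp:
    conf = os.path.join(tmp, "app.conf")
    with open(conf, "w") as fh:
        fh.write("key = value\n")
    allow_all = lambda p: False
    no_denial = lambda p: None
    accepted = exclusion_reason(conf, allow_all, no_denial)
    user_skipped = exclusion_reason(conf, lambda p: True, no_denial)
    policy_skipped = exclusion_reason(conf, allow_all, lambda p: "too_large")
    directory_skipped = exclusion_reason(tmp, allow_all, no_denial)
```

A None result would lead to a copy into artifacts/<role>/<src_rel> and a ManagedFile entry; any other result would become an ExcludedFile with that reason.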

Why this class matters:

It is the core unit of “role inference” for enabled services.


PackageSnapshot (dataclass)

Purpose: captures “manual packages” (from apt-mark showmanual) that weren't already covered by any service snapshot.

Fields:

  • package: package name
  • role_name: computed role name (e.g. pkg_postfix)
  • managed_files, excluded, notes

How it's computed:

  • list_manual_packages() returns the packages marked as manually installed.
  • Anything already mentioned in any ServiceSnapshot.packages is skipped (recorded in manual_packages_skipped in state.json).
  • For remaining packages:
    • detect modified conffiles / modified packaged files under /etc via hashes
    • capture associated timer overrides if the timer is attributable to that package
    • scan for custom/unowned files under /etc/<topdir> trees for the package
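
The skip rule in the second step can be sketched as follows. split_manual_packages() is a hypothetical name, and service snapshots are represented as plain dicts with a "packages" key purely for this example.

```python
def split_manual_packages(manual_packages, service_snapshots):
    """Split manual packages into (remaining, skipped).

    A package is skipped when any service snapshot already lists it.
    """
    covered = {
        pkg for snap in service_snapshots for pkg in snap["packages"]
    }
    remaining = [p for p in manual_packages if p not in covered]
    skipped = [p for p in manual_packages if p in covered]
    return remaining, skipped


services = [
    {"unit": "nginx.service", "packages": ["nginx", "nginx-common"]},
    {"unit": "ssh.service", "packages": ["openssh-server"]},
]
remaining, skipped = split_manual_packages(
    ["nginx", "postfix", "openssh-server", "htop"], services
)
```

Here the skipped list corresponds to what the text says is recorded as manual_packages_skipped, and only the remaining packages would get their own PackageSnapshot.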

UsersSnapshot (dataclass)

Purpose: captures non-system users and safe SSH public artifacts.

Fields:

  • role_name: always "users" in current code
  • users: list of dicts derived from UserRecord
  • managed_files: copied ssh public material (as ManagedFile)
  • excluded: skipped ssh files (as ExcludedFile)
  • notes: errors (e.g. couldn't enumerate users)

AptConfigSnapshot (dataclass)

Purpose: captures APT configuration and key material.

Fields:

  • role_name: "apt_config"
  • managed_files, excluded, notes

How it's populated:

  • Uses _iter_apt_capture_paths() (in harvest.py) to produce specific key APT paths (sources lists, keyrings, etc.).
  • Each candidate is filtered via PathFilter + IgnorePolicy, then copied.

EtcCustomSnapshot (dataclass)

Purpose: “catch-all” role for remaining config-ish files under /etc that weren't already attributed to a service/package/users/apt.

Fields:

  • role_name: "etc_custom"
  • managed_files, excluded, notes

How it's populated:

  • Build a set of “already captured” files from other roles.
  • Add certain “system essentials” even if package-owned (_iter_system_capture_paths()).
  • Walk /etc and include unowned files that look “config-ish” (_is_confish()), subject to caps.
  • Extra logic: if a file is in a shared snippet dir like /etc/cron.d/ or /etc/logrotate.d/, harvest attempts to re-attach it to an existing role by filename matching (so it doesn't pollute etc_custom).
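
The re-attachment idea in the last step can be sketched like this. The exact matching rule used by harvest.py is not described above, so the rule here (snippet filename equals a role name after lowercasing and replacing dashes with underscores) is an assumption, as are the reattach_role() name and the two-entry directory list.

```python
import os
from typing import Iterable, Optional

# Only the two directories named in the text; the real list may be longer.
SHARED_SNIPPET_DIRS = ("/etc/cron.d", "/etc/logrotate.d")


def reattach_role(path: str, role_names: Iterable[str]) -> Optional[str]:
    """Return the role a shared snippet likely belongs to, or None."""
    parent, filename = os.path.split(path)
    if parent not in SHARED_SNIPPET_DIRS:
        return None
    # Assumed normalization: role names use underscores, filenames dashes.
    wanted = filename.replace("-", "_").lower()
    for role in role_names:
        if role.lower() == wanted:
            return role
    return None


roles = ["nginx", "php_fpm"]
matched = reattach_role("/etc/logrotate.d/nginx", roles)
normalized = reattach_role("/etc/cron.d/php-fpm", roles)
unmatched = reattach_role("/etc/cron.d/backup", roles)
outside = reattach_role("/etc/hosts", ["hosts"])
```

Files that return None would stay in etc_custom; matched files would be added to the named role's managed_files instead.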

UsrLocalCustomSnapshot (dataclass)

Purpose: captures custom local admin content from /usr/local.

Fields:

  • role_name: "usr_local_custom"
  • managed_files, excluded, notes

How it's populated:

  • Scans /usr/local/etc (collects regular files, subject to IgnorePolicy)
  • Scans /usr/local/bin but only collects executable files (checks that the mode has any execute bit set)
  • Caps the number of files collected per scan so that a large tree cannot bloat the bundle.
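
The execute-bit check and the per-scan cap can be sketched as below. collect_executables() and its cap parameter are hypothetical names; the example runs against a temporary directory rather than /usr/local/bin, and it leaves out the IgnorePolicy filtering that the real scan applies.

```python
import os
import stat
import tempfile


def collect_executables(root: str, cap: int):
    """Collect regular files under root with any execute bit, up to cap."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # deterministic traversal order
        for name in sorted(filenames):
            if len(found) >= cap:            # stop once the cap is reached
                return found
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)
            except OSError:
                continue
            # 0o111 covers the user, group, and other execute bits.
            if stat.S_ISREG(st.st_mode) and st.st_mode & 0o111:
                found.append(full)
    return found


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "backup.sh")
    notes = os.path.join(tmp, "README")
    for p in (script, notes):
        with open(p, "w") as fh:
            fh.write("#!/bin/sh\n")
    os.chmod(script, 0o755)
    os.chmod(notes, 0o644)
    collected_names = [
        os.path.basename(p) for p in collect_executables(tmp, cap=10)
    ]
    capped = collect_executables(tmp, cap=0)
```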

ExtraPathsSnapshot (dataclass)

Purpose: captures user-requested extra files from --include-path (and records include/exclude patterns used).

Fields:

  • role_name: "extra_paths"
  • include_patterns, exclude_patterns: as provided on CLI
  • managed_files, excluded, notes

How it's populated:

  • Uses PathFilter.iter_include_patterns() + expand_includes() to turn patterns into concrete file paths.
  • For each included file not already captured elsewhere:
    • filter via exclude + IgnorePolicy
    • copy into artifacts/extra_paths/...
    • record ManagedFile(reason="user_include")
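
The expansion and de-duplication steps above can be sketched with the standard glob module. This is not the PathFilter or expand_includes() implementation; expand_patterns() and artifact_path() are hypothetical stand-ins, and the pattern syntax accepted by the real --include-path option may differ from glob's.

```python
import glob
import os
import tempfile


def expand_patterns(patterns, already_captured):
    """Turn glob patterns into concrete files not captured elsewhere."""
    seen = set(already_captured)
    result = []
    for pattern in patterns:
        for match in sorted(glob.glob(pattern, recursive=True)):
            if os.path.isfile(match) and match not in seen:
                seen.add(match)
                result.append(match)
    return result


def artifact_path(path: str) -> str:
    """Destination inside the bundle for an extra_paths file."""
    return os.path.join("artifacts", "extra_paths", path.lstrip("/"))


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.conf", "b.conf", "c.txt"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("x\n")
    pattern = os.path.join(glob.escape(tmp), "*.conf")
    already = {os.path.join(tmp, "b.conf")}   # pretend another role has it
    included = expand_patterns([pattern], already)
    included_names = [os.path.basename(p) for p in included]

example_dest = artifact_path("/opt/app/app.conf")
```

Each surviving path would still pass through the exclude patterns and IgnorePolicy before being copied and recorded with reason="user_include".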