From 3a21e25d27ecd459659c43bf4f0d7166d3239475 Mon Sep 17 00:00:00 2001
From: Miguel Jacq
Date: Sat, 27 Dec 2025 20:45:00 -0600
Subject: [PATCH] Add Technical_Decomp_Harvest

---
 Technical_Decomp_Harvest.md | 191 ++++++++++++++++++++++++++++++++++++
 1 file changed, 191 insertions(+)
 create mode 100644 Technical_Decomp_Harvest.md

diff --git a/Technical_Decomp_Harvest.md b/Technical_Decomp_Harvest.md
new file mode 100644
index 0000000..c02db6a
--- /dev/null
+++ b/Technical_Decomp_Harvest.md
@@ -0,0 +1,191 @@
+## enroll/harvest.py
+
+All of these are dataclasses that act as the schema for state.json. harvest.harvest() creates them, then serializes them with asdict().
+
+### ManagedFile (dataclass)
+
+#### Purpose: describes one file that harvest successfully copied into the bundle.
+
+#### Fields:
+
+- path: absolute original path on host
+- src_rel: relative path used inside artifacts//... (almost always path.lstrip("/"))
+- owner, group, mode: captured from stat_triplet()
+- reason: classification string explaining why it was captured (examples):
+  - systemd_dropin, systemd_envfile
+  - modified_conffile, modified_packaged_file
+  - custom_unowned, custom_specific_path
+  - authorized_keys, ssh_public_key
+  - usr_local_bin_script, usr_local_etc_custom
+  - user_include (from --include-path)
+
+#### Where it’s used:
+
+Written into snapshots in state.json.
+
+manifest.py reads these to generate Ansible tasks (copy/template actions).
+
+diff.py reads these to detect changes and to locate the artifact content.
+___________________
+
+### ExcludedFile (dataclass)
+
+#### Purpose: records a file that was considered but not included, plus why.
+
+#### Fields:
+
+- path
+- reason: a concise reason code, typically:
+  - user_excluded (PathFilter)
+  - ignore policy reasons like denied_path, binary_like, sensitive_content, too_large, unreadable, etc.
+
+#### Where it’s used:
+
+Stored in each snapshot’s excluded list in state.json.
+
+Mostly informational (helps explain why something didn’t get harvested).
+
+_____________________
+
+### ServiceSnapshot (dataclass)
+
+#### Purpose: captures everything enroll learned about one enabled systemd service unit.
+
+#### Fields:
+
+- unit: e.g. nginx.service
+- role_name: derived role name (sanitized service-ish identifier)
+- packages: Debian package names inferred as belonging to the service
+- active_state, sub_state, unit_file_state, condition_result:
+  - copied from systemctl show fields via systemd.get_unit_info()
+- managed_files: list of ManagedFile harvested for this role
+- excluded: list of ExcludedFile not harvested
+- notes: warnings or anomalies (e.g. failure to query unit info)
+
+#### How it’s “computed” in harvest:
+
+- Enumerate enabled services: systemd.list_enabled_services().
+- For each unit:
+  - gather unit metadata (fragment file, dropins, env files, exec paths)
+  - infer owning packages via dpkg_owner() on:
+    - the unit fragment
+    - ExecStart paths
+  - consider candidate /etc files from:
+    - systemd dropins/envfiles (only under /etc)
+    - modified dpkg conffiles or packaged files under /etc (by md5 compare)
+    - service-specific “unowned” files under /etc/ trees
+  - filter each candidate through:
+    - user exclude patterns (PathFilter.is_excluded)
+    - IgnorePolicy.deny_reason
+    - readability + regular-file checks
+  - copy accepted files into artifacts//
+
+#### Why this class matters:
+
+It is the core unit of “role inference” for enabled services.
+
+______________________
+
+### PackageSnapshot (dataclass)
+
+#### Purpose: captures “manual packages” (from apt-mark showmanual) that weren’t already covered by any service snapshot.
+
+#### Fields:
+
+- package: package name
+- role_name: computed role name (e.g. pkg_postfix)
+- managed_files, excluded, notes
+
+#### How it’s computed:
+
+- list_manual_packages() returns the manually installed packages.
+- Anything already mentioned in any ServiceSnapshot.packages is skipped (recorded in manual_packages_skipped in state.json).
+- For remaining packages:
+  - detect modified conffiles / modified packaged files under /etc via hashes
+  - capture associated timer overrides if the timer is attributable to that package
+  - scan for custom/unowned files under /etc/ trees for the package
+
+______________
+
+### UsersSnapshot (dataclass)
+
+#### Purpose: captures non-system users and safe SSH public artifacts.
+
+#### Fields:
+
+- role_name: always "users" in current code
+- users: list of dicts derived from UserRecord
+- managed_files: copied SSH public material (as ManagedFile)
+- excluded: skipped SSH files (as ExcludedFile)
+- notes: errors (e.g. couldn’t enumerate users)
+
+__________________
+
+### AptConfigSnapshot (dataclass)
+
+#### Purpose: captures APT configuration and key material.
+
+#### Fields:
+
+- role_name: "apt_config"
+- managed_files, excluded, notes
+
+#### How it’s populated:
+
+- Uses _iter_apt_capture_paths() (in harvest.py) to produce specific key APT paths (sources lists, keyrings, etc.).
+- Each candidate is filtered via PathFilter + IgnorePolicy, then copied.
+
+__________________
+
+### EtcCustomSnapshot (dataclass)
+
+#### Purpose: “catch-all” role for remaining config-ish files under /etc that weren’t already attributed to a service/package/users/apt.
+
+#### Fields:
+
+- role_name: "etc_custom"
+- managed_files, excluded, notes
+
+#### How it’s populated:
+
+- Build a set of “already captured” files from other roles.
+- Add certain “system essentials” even if package-owned (_iter_system_capture_paths()).
+- Walk /etc and include unowned files that look “config-ish” (_is_confish()), subject to caps.
+- Extra logic: if a file is in a shared snippet dir like /etc/cron.d/ or /etc/logrotate.d/, harvest attempts to re-attach it to an existing role by filename matching (so it doesn’t pollute etc_custom).
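The re-attachment step for shared snippet directories can be illustrated with a small sketch. The helper name `reattach_role`, the `SHARED_SNIPPET_DIRS` tuple, and the matching rule (the snippet's filename stem equals a role name, optionally after dropping a `pkg_` prefix) are assumptions made for illustration. The description above only says that matching is done by filename, so the actual rule in harvest.py may differ.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional

# Assumed list of shared snippet directories; the text names these two as examples.
SHARED_SNIPPET_DIRS = ("/etc/cron.d", "/etc/logrotate.d")


def reattach_role(path: str, role_names: Iterable[str]) -> Optional[str]:
    """Return the existing role a shared snippet should join, or None.

    None means the file stays in the catch-all etc_custom role.
    """
    p = PurePosixPath(path)
    if str(p.parent) not in SHARED_SNIPPET_DIRS:
        return None
    # Hypothetical rule: compare the part of the filename before the first dot.
    stem = p.name.split(".")[0]
    for role in role_names:
        # Hypothetical rule: package roles such as "pkg_postfix" match "postfix".
        candidate = role[4:] if role.startswith("pkg_") else role
        if candidate == stem:
            return role
    return None
```

Under these assumptions, `/etc/logrotate.d/nginx` would join a role named `nginx`, while a snippet with no matching role would fall through to etc_custom.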
+
+______________
+
+### UsrLocalCustomSnapshot (dataclass)
+
+#### Purpose: captures custom local admin content from /usr/local.
+
+#### Fields:
+
+- role_name: "usr_local_custom"
+- managed_files, excluded, notes
+
+#### How it’s populated:
+
+- Scans /usr/local/etc (collect regular files, subject to IgnorePolicy)
+- Scans /usr/local/bin but only collects executable files (checks mode has any execute bit)
+- Caps per scan to avoid explosion.
+
+_______________
+
+### ExtraPathsSnapshot (dataclass)
+
+#### Purpose: captures user-requested extra files from --include-path (and records include/exclude patterns used).
+
+#### Fields:
+
+- role_name: "extra_paths"
+- include_patterns, exclude_patterns: as provided on CLI
+- managed_files, excluded, notes
+
+#### How it’s populated:
+
+- Uses PathFilter.iter_include_patterns() + expand_includes() to turn patterns into concrete file paths.
+- For each included file not already captured elsewhere:
+  - filter via exclude + IgnorePolicy
+  - copy into artifacts/extra_paths/...
+  - record ManagedFile(reason="user_include")
\ No newline at end of file
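The ExtraPathsSnapshot flow described above (expand patterns, filter, record) can be sketched roughly as follows. This is a simplified illustration, not the code in harvest.py: `collect_extra_paths` is a hypothetical name, `glob` and `fnmatch` stand in for `expand_includes()` and `PathFilter`, the IgnorePolicy check and the copy into artifacts are omitted, and numeric uid/gid strings stand in for whatever `stat_triplet()` returns. The dataclass fields follow the ManagedFile and ExcludedFile descriptions; their types are assumed.

```python
import fnmatch
import glob
import os
import stat
from dataclasses import asdict, dataclass
from typing import List, Set, Tuple


@dataclass
class ManagedFile:
    # Field names follow the ManagedFile section; the str types are assumed.
    path: str
    src_rel: str
    owner: str
    group: str
    mode: str
    reason: str


@dataclass
class ExcludedFile:
    path: str
    reason: str


def collect_extra_paths(
    include_patterns: List[str],
    exclude_patterns: List[str],
    already_captured: Set[str],
) -> Tuple[List[ManagedFile], List[ExcludedFile]]:
    """Expand include patterns and classify each file as managed or excluded."""
    managed: List[ManagedFile] = []
    excluded: List[ExcludedFile] = []
    seen = set(already_captured)
    for pattern in include_patterns:
        # glob stands in for expand_includes(); sorted for stable ordering.
        for path in sorted(glob.glob(pattern, recursive=True)):
            if path in seen or not os.path.isfile(path):
                continue
            seen.add(path)
            if any(fnmatch.fnmatch(path, ex) for ex in exclude_patterns):
                excluded.append(ExcludedFile(path=path, reason="user_excluded"))
                continue
            st = os.stat(path)
            managed.append(
                ManagedFile(
                    path=path,
                    src_rel=path.lstrip("/"),
                    owner=str(st.st_uid),  # simplification: numeric uid, not a name
                    group=str(st.st_gid),  # simplification: numeric gid, not a name
                    mode=format(stat.S_IMODE(st.st_mode), "04o"),
                    reason="user_include",
                )
            )
    return managed, excluded
```

Calling `asdict()` on each returned record yields plain dicts of the kind that would be serialized into the snapshot in state.json.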