Add Technical_Decomp_Harvest

Miguel Jacq 2025-12-27 20:45:00 -06:00
parent 81e29bf75a
commit 3a21e25d27

Technical_Decomp_Harvest.md Normal file

@ -0,0 +1,191 @@
## enroll/harvest.py
All of these are dataclasses that act as the schema for state.json. harvest.harvest() creates them, then serializes them with asdict().
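
As a rough illustration of that flow, the sketch below builds two cut-down dataclasses and serializes them with dataclasses.asdict(). The field types, the reduced field set, and the top-level "services" key are assumptions made for the example, not the actual state.json layout.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ManagedFile:
    # Field names follow this document; the types are assumed.
    path: str
    src_rel: str
    owner: str
    group: str
    mode: str
    reason: str


@dataclass
class ServiceSnapshot:
    # Cut-down version with only three of the documented fields.
    unit: str
    role_name: str
    managed_files: List[ManagedFile] = field(default_factory=list)


snap = ServiceSnapshot(
    unit="nginx.service",
    role_name="nginx",
    managed_files=[
        ManagedFile(
            path="/etc/nginx/nginx.conf",
            src_rel="etc/nginx/nginx.conf",
            owner="root",
            group="root",
            mode="0644",
            reason="modified_conffile",
        )
    ],
)

# asdict() recurses into nested dataclasses and lists, so the result is
# made of plain dicts and lists that json.dumps() accepts directly.
state = {"services": [asdict(snap)]}  # "services" is a hypothetical key
print(json.dumps(state, indent=2))
```

Because asdict() produces plain containers, readers such as manifest.py or diff.py could load the JSON back without importing the dataclasses at all.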
### ManagedFile (dataclass)
#### Purpose: describes one file that harvest successfully copied into the bundle.
#### Fields:
- path: absolute original path on host
- src_rel: relative path used inside artifacts/<role>/... (almost always path.lstrip("/"))
- owner, group, mode: captured from stat_triplet()
- reason: classification string explaining why it was captured (examples):
  - systemd_dropin, systemd_envfile
  - modified_conffile, modified_packaged_file
  - custom_unowned, custom_specific_path
  - authorized_keys, ssh_public_key
  - usr_local_bin_script, usr_local_etc_custom
  - user_include (from --include-path)
#### Where it's used:
- Written into snapshots in state.json.
- manifest.py reads these to generate Ansible tasks (copy/template actions).
- diff.py reads these to detect changes and to locate the artifact content.
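
To make the owner, group, mode and src_rel fields more concrete, here is a minimal sketch of how they could be derived. The body of stat_triplet() below is a hypothetical re-implementation (only its name comes from this document), and src_rel_for() is an illustrative helper for the path.lstrip("/") rule.

```python
import grp
import os
import pwd
import stat
import tempfile
from typing import Tuple


def stat_triplet(path: str) -> Tuple[str, str, str]:
    """Return (owner, group, mode) for path. Hypothetical sketch."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid without a passwd entry
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)  # gid without a group entry
    # Keep only the permission bits and render them as 4-digit octal.
    mode = format(stat.S_IMODE(st.st_mode), "04o")
    return owner, group, mode


def src_rel_for(path: str) -> str:
    """Relative path used inside the bundle (illustrative helper)."""
    return path.lstrip("/")


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.conf")
    with open(sample, "w") as fh:
        fh.write("key = value\n")
    os.chmod(sample, 0o640)
    sample_owner, sample_group, sample_mode = stat_triplet(sample)

sample_rel = src_rel_for("/etc/nginx/nginx.conf")
```

Storing the mode as an octal string rather than an integer is an assumption here; it is convenient because Ansible's copy module accepts modes in that form.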
___________________
### ExcludedFile (dataclass)
#### Purpose: records a file that was considered but not included, plus why.
#### Fields:
- path
- reason: a concise reason code, typically:
  - user_excluded (PathFilter)
  - ignore policy reasons like denied_path, binary_like, sensitive_content, too_large, unreadable, etc.
#### Where it's used:
- Stored in each snapshot's excluded list in state.json.
- Mostly informational (helps explain why something didn't get harvested).
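
The reason codes above suggest a policy function that returns either a code or nothing. The sketch below shows one way such a check could look. The thresholds, the deny list, and the content markers are all assumptions chosen for illustration and are not taken from IgnorePolicy itself.

```python
import os
import tempfile
from typing import Optional

MAX_BYTES = 256 * 1024                                  # assumed size cap
DENIED_PREFIXES = ("/etc/shadow", "/etc/ssl/private")   # assumed deny list
SENSITIVE_MARKERS = (b"PRIVATE KEY", b"password")       # assumed markers


def deny_reason(path: str) -> Optional[str]:
    """Return a reason code if path should be skipped, else None."""
    if path.startswith(DENIED_PREFIXES):
        return "denied_path"
    try:
        if os.path.getsize(path) > MAX_BYTES:
            return "too_large"
        with open(path, "rb") as fh:
            data = fh.read()
    except OSError:
        return "unreadable"
    if b"\x00" in data:
        # A NUL byte is a cheap heuristic for binary content.
        return "binary_like"
    if any(marker in data for marker in SENSITIVE_MARKERS):
        return "sensitive_content"
    return None


# Demonstration on temporary files.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "plain": b"listen 80\n",
        "binary": b"\x00\x01\x02",
        "secret": b"-----BEGIN PRIVATE KEY-----\n",
    }
    results = {}
    for name, content in samples.items():
        target = os.path.join(tmp, name)
        with open(target, "wb") as fh:
            fh.write(content)
        results[name] = deny_reason(target)
    results["missing"] = deny_reason(os.path.join(tmp, "does-not-exist"))

results["shadow"] = deny_reason("/etc/shadow")
```

Each non-None result would become the reason field of an ExcludedFile.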
_____________________
### ServiceSnapshot (dataclass)
#### Purpose: captures everything enroll learned about one enabled systemd service unit.
#### Fields:
- unit: e.g. nginx.service
- role_name: derived role name (sanitized service-ish identifier)
- packages: Debian package names inferred as belonging to the service
- active_state, sub_state, unit_file_state, condition_result:
  - copied from systemctl show fields via systemd.get_unit_info()
- managed_files: list of ManagedFile harvested for this role
- excluded: list of ExcludedFile not harvested
- notes: warnings or anomalies (e.g. failure to query unit info)
#### How it's “computed” in harvest:
- Enumerate enabled services: systemd.list_enabled_services().
- For each unit:
  - gather unit metadata (fragment file, dropins, env files, exec paths)
  - infer owning packages via dpkg_owner() on:
    - the unit fragment
    - ExecStart paths
  - consider candidate /etc files from:
    - systemd dropins/envfiles (only under /etc)
    - modified dpkg conffiles or packaged files under /etc (by md5 compare)
    - service-specific “unowned” files under /etc/<hint> trees
  - filter each candidate through:
    - user exclude patterns (PathFilter.is_excluded)
    - IgnorePolicy.deny_reason
    - readability + regular-file checks
  - copy accepted files into artifacts/<role>/<src_rel>
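
The filter-then-copy part of the steps above can be sketched as a single loop. Everything here is a simplified stand-in: harvest_candidates() is a hypothetical name, fnmatch globs stand in for PathFilter.is_excluded, the deny_reason callable stands in for IgnorePolicy.deny_reason, and the "not_regular_file" code is invented for the example.

```python
import fnmatch
import os
import shutil
import tempfile
from typing import Callable, Iterable, List, Optional, Tuple


def harvest_candidates(
    candidates: Iterable[str],
    exclude_globs: List[str],
    deny_reason: Callable[[str], Optional[str]],
    artifacts_role_dir: str,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (managed paths, [(excluded path, reason), ...])."""
    managed: List[str] = []
    excluded: List[Tuple[str, str]] = []
    for path in candidates:
        # 1. user exclude patterns
        if any(fnmatch.fnmatch(path, g) for g in exclude_globs):
            excluded.append((path, "user_excluded"))
            continue
        # 2. ignore policy
        reason = deny_reason(path)
        if reason:
            excluded.append((path, reason))
            continue
        # 3. regular-file and readability checks
        if not os.path.isfile(path):
            excluded.append((path, "not_regular_file"))  # assumed code
            continue
        if not os.access(path, os.R_OK):
            excluded.append((path, "unreadable"))
            continue
        # 4. copy into <artifacts>/<role>/<src_rel>
        dst = os.path.join(artifacts_role_dir, path.lstrip("/"))
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(path, dst)
        managed.append(path)
    return managed, excluded


# Demonstration with a throwaway tree.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "etc")
    os.makedirs(os.path.join(src, "subdir"))
    keep = os.path.join(src, "app.conf")
    backup = os.path.join(src, "app.conf.bak")
    big = os.path.join(src, "big.conf")
    for p in (keep, backup, big):
        with open(p, "w") as fh:
            fh.write("x\n")
    out = os.path.join(tmp, "artifacts", "nginx")
    managed, excluded = harvest_candidates(
        [keep, backup, big, os.path.join(src, "subdir")],
        exclude_globs=["*.bak"],
        deny_reason=lambda p: "too_large" if p.endswith("big.conf") else None,
        artifacts_role_dir=out,
    )
    copied_ok = os.path.isfile(os.path.join(out, keep.lstrip("/")))
    managed_names = [os.path.basename(p) for p in managed]
    excluded_reasons = [(os.path.basename(p), r) for p, r in excluded]
```

Running the cheap checks first and copying last means a rejected candidate never touches the bundle, and every rejection carries exactly one reason.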
#### Why this class matters:
It is the core unit of “role inference” for running services.
______________________
### PackageSnapshot (dataclass)
#### Purpose: captures “manual packages” (from apt-mark showmanual) that weren't already covered by any service snapshot.
#### Fields:
- package: package name
- role_name: computed role name (e.g. pkg_postfix)
- managed_files, excluded, notes
#### How it's computed:
- list_manual_packages() returns the packages marked as manually installed.
- Anything already mentioned in any ServiceSnapshot.packages is skipped (recorded in manual_packages_skipped in state.json).
- For remaining packages:
  - detect modified conffiles / modified packaged files under /etc via hashes
  - capture associated timer overrides if the timer is attributable to that package
  - scan for custom/unowned files under /etc/<topdir> trees for the package
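
Two of the steps above lend themselves to a short sketch: splitting manual packages into "remaining" and "skipped", and detecting a modified file by comparing its MD5 with a recorded one. The helper names are hypothetical, and where the recorded hash comes from (dpkg's metadata in the real tool) is left out.

```python
import hashlib
import os
import tempfile
from typing import Iterable, List, Set, Tuple


def split_manual_packages(
    manual: Iterable[str], service_packages: Iterable[Iterable[str]]
) -> Tuple[List[str], List[str]]:
    """Return (remaining, skipped), preserving the input order."""
    covered: Set[str] = set()
    for pkgs in service_packages:
        covered.update(pkgs)
    remaining: List[str] = []
    skipped: List[str] = []
    for pkg in manual:
        (skipped if pkg in covered else remaining).append(pkg)
    return remaining, skipped


def file_md5(path: str) -> str:
    """Hex MD5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_modified(path: str, recorded_md5: str) -> bool:
    """True when the on-disk content no longer matches the recorded hash."""
    return file_md5(path) != recorded_md5


remaining, skipped = split_manual_packages(
    ["nginx", "postfix", "vim"],
    [["nginx", "nginx-common"]],
)

with tempfile.TemporaryDirectory() as tmp:
    conf = os.path.join(tmp, "main.cf")
    with open(conf, "wb") as fh:
        fh.write(b"myhostname = a\n")
    recorded = file_md5(conf)          # stands in for the packaged hash
    before_edit = is_modified(conf, recorded)
    with open(conf, "ab") as fh:
        fh.write(b"relayhost = b\n")
    after_edit = is_modified(conf, recorded)
```

MD5 is used here only as a change detector matching what the package database records, not as a security measure.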
______________
### UsersSnapshot (dataclass)
#### Purpose: captures non-system users and safe SSH public artifacts.
#### Fields:
- role_name: always "users" in current code
- users: list of dicts derived from UserRecord
- managed_files: copied ssh public material (as ManagedFile)
- excluded: skipped ssh files (as ExcludedFile)
- notes: errors (e.g. couldn't enumerate users)
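
How enroll decides that a user is "non-system" and which SSH files count as "safe" is not spelled out here, so the following is only a plausible sketch. It assumes a UID range of 1000 to 60000 (the usual Debian login.defs defaults), a small set of no-login shells, and that only authorized_keys and *.pub files are public material. PasswdEntry and both helpers are hypothetical names.

```python
from typing import Iterable, List, NamedTuple


class PasswdEntry(NamedTuple):
    name: str
    uid: int
    home: str
    shell: str


NOLOGIN_SHELLS = {"/usr/sbin/nologin", "/sbin/nologin", "/bin/false"}


def is_non_system_user(entry: PasswdEntry,
                       uid_min: int = 1000,
                       uid_max: int = 60000) -> bool:
    """Assumed rule: regular UID range and an interactive shell."""
    return (uid_min <= entry.uid <= uid_max
            and entry.shell not in NOLOGIN_SHELLS)


def safe_ssh_artifacts(filenames: Iterable[str]) -> List[str]:
    """Keep only public material; private keys never match."""
    return [n for n in filenames
            if n == "authorized_keys" or n.endswith(".pub")]


entries = [
    PasswdEntry("root", 0, "/root", "/bin/bash"),
    PasswdEntry("www-data", 33, "/var/www", "/usr/sbin/nologin"),
    PasswdEntry("alice", 1000, "/home/alice", "/bin/bash"),
    PasswdEntry("nobody", 65534, "/nonexistent", "/usr/sbin/nologin"),
]
humans = [e.name for e in entries if is_non_system_user(e)]

ssh_files = safe_ssh_artifacts(
    ["authorized_keys", "id_ed25519", "id_ed25519.pub", "known_hosts"]
)
```

An allow-list for SSH files is the safer design choice: a new private key format would be skipped by default rather than captured by accident.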
__________________
### AptConfigSnapshot (dataclass)
#### Purpose: captures APT configuration and key material.
#### Fields:
- role_name: "apt_config"
- managed_files, excluded, notes
#### How it's populated:
- Uses _iter_apt_capture_paths() (in harvest.py) to yield the key APT paths (sources lists, keyrings, etc.).
- Each candidate is filtered via PathFilter + IgnorePolicy, then copied.
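
A generator along the following lines would fit that description. The exact set of paths covered by _iter_apt_capture_paths() is not listed in this document, so the directories below are an assumption based on standard APT locations, and the root parameter exists only to make the sketch testable.

```python
import os
import tempfile
from typing import Iterator


def iter_apt_capture_paths(root: str = "/") -> Iterator[str]:
    """Yield candidate APT config files under root (hypothetical sketch)."""
    apt_dir = os.path.join(root, "etc", "apt")
    single = os.path.join(apt_dir, "sources.list")
    if os.path.isfile(single):
        yield single
    # Assumed directory list; the real helper may cover more or fewer.
    for sub in ("sources.list.d", "keyrings", "trusted.gpg.d"):
        base = os.path.join(apt_dir, sub)
        if not os.path.isdir(base):
            continue
        for name in sorted(os.listdir(base)):
            full = os.path.join(base, name)
            if os.path.isfile(full):
                yield full


# Demonstration against a fake root.
with tempfile.TemporaryDirectory() as tmp:
    apt = os.path.join(tmp, "etc", "apt")
    os.makedirs(os.path.join(apt, "sources.list.d"))
    os.makedirs(os.path.join(apt, "keyrings"))
    for rel in ("sources.list",
                os.path.join("sources.list.d", "docker.list"),
                os.path.join("keyrings", "docker.asc")):
        with open(os.path.join(apt, rel), "w") as fh:
            fh.write("# placeholder\n")
    apt_paths = [os.path.relpath(p, tmp)
                 for p in iter_apt_capture_paths(tmp)]
```

Each yielded path would still go through the PathFilter and IgnorePolicy checks before being copied.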
__________________
### EtcCustomSnapshot (dataclass)
#### Purpose: “catch-all” role for remaining config-ish files under /etc that weren't already attributed to a service/package/users/apt.
#### Fields:
- role_name: "etc_custom"
- managed_files, excluded, notes
#### How it's populated:
- Build a set of “already captured” files from other roles.
- Add certain “system essentials” even if package-owned (_iter_system_capture_paths()).
- Walk /etc and include unowned files that look “config-ish” (_is_confish()), subject to caps.
- Extra logic: if a file is in a shared snippet dir like /etc/cron.d/ or /etc/logrotate.d/, harvest attempts to re-attach it to an existing role by filename matching (so it doesn't pollute etc_custom).
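
The re-attachment step can be pictured as a lookup from a snippet's filename to a known role. The matching rule below (compare the part of the filename before the first dot with a per-role set of hint names) is an assumption, and reattach_role() and role_hints are hypothetical names.

```python
import os
from typing import Dict, Optional, Set

SHARED_SNIPPET_DIRS = ("/etc/cron.d", "/etc/logrotate.d")


def reattach_role(path: str,
                  role_hints: Dict[str, Set[str]]) -> Optional[str]:
    """Return the role that should own path, or None to leave it alone."""
    if os.path.dirname(path) not in SHARED_SNIPPET_DIRS:
        return None
    # "nginx" for /etc/logrotate.d/nginx, "php" for /etc/cron.d/php.cron
    stem = os.path.basename(path).split(".")[0].lower()
    for role, hints in role_hints.items():
        if stem in hints:
            return role
    return None


role_hints = {
    "nginx": {"nginx", "nginx-common"},
    "pkg_postfix": {"postfix"},
}

owner_logrotate = reattach_role("/etc/logrotate.d/nginx", role_hints)
owner_cron = reattach_role("/etc/cron.d/postfix", role_hints)
owner_unknown = reattach_role("/etc/cron.d/backup-job", role_hints)
owner_outside = reattach_role("/etc/hosts", role_hints)
```

Files for which the lookup returns None would stay in etc_custom.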
______________
### UsrLocalCustomSnapshot (dataclass)
#### Purpose: captures custom local admin content from /usr/local.
#### Fields:
- role_name: "usr_local_custom"
- managed_files, excluded, notes
#### How it's populated:
- Scans /usr/local/etc and collects regular files, subject to IgnorePolicy.
- Scans /usr/local/bin but collects only executable files (those whose mode has any execute bit set).
- Caps each scan to avoid an explosion in the number of captured files.
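
The executable-only scan with a cap might look like the sketch below. collect_executables() is a hypothetical helper, the cap value is arbitrary, and the IgnorePolicy check is omitted to keep the example short.

```python
import os
import stat
import tempfile
from typing import List

ANY_EXEC = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def collect_executables(root: str, cap: int = 200) -> List[str]:
    """Return up to cap regular files under root with any execute bit."""
    found: List[str] = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)  # do not follow symlinks
            except OSError:
                continue
            if not stat.S_ISREG(st.st_mode):
                continue
            if st.st_mode & ANY_EXEC:
                found.append(full)
                if len(found) >= cap:
                    return found
    return found


# Demonstration with three files of different modes.
with tempfile.TemporaryDirectory() as tmp:
    for name, mode in (("deploy.sh", 0o755),
                       ("group-run", 0o650),
                       ("notes.txt", 0o644)):
        target = os.path.join(tmp, name)
        with open(target, "w") as fh:
            fh.write("#!/bin/sh\n")
        os.chmod(target, mode)
    all_execs = [os.path.basename(p) for p in collect_executables(tmp)]
    capped = collect_executables(tmp, cap=1)
```

Checking the cap inside the loop stops the walk early instead of listing a large tree and truncating afterwards.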
_______________
### ExtraPathsSnapshot (dataclass)
#### Purpose: captures user-requested extra files from --include-path (and records include/exclude patterns used).
#### Fields:
- role_name: "extra_paths"
- include_patterns, exclude_patterns: as provided on CLI
- managed_files, excluded, notes
#### How it's populated:
- Uses PathFilter.iter_include_patterns() + expand_includes() to turn patterns into concrete file paths.
- For each included file not already captured elsewhere:
  - filter via exclude patterns + IgnorePolicy
  - copy into artifacts/extra_paths/...
  - record ManagedFile(reason="user_include")
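
As a sketch of the expansion step, the function below turns patterns into concrete files and skips anything already captured. It assumes shell-style globs and that a matched directory is expanded recursively; the real PathFilter pattern syntax may differ, and this expand_includes() signature is invented for the example.

```python
import glob
import os
import tempfile
from typing import Iterable, List, Set


def expand_includes(patterns: Iterable[str],
                    already_captured: Set[str]) -> List[str]:
    """Expand glob patterns to files, dropping duplicates and known paths."""
    seen = set(already_captured)
    result: List[str] = []
    for pattern in patterns:
        for match in sorted(glob.glob(pattern, recursive=True)):
            if os.path.isdir(match):
                # A matched directory contributes every file beneath it.
                files = []
                for dirpath, dirnames, filenames in os.walk(match):
                    dirnames.sort()
                    files.extend(os.path.join(dirpath, n)
                                 for n in sorted(filenames))
            else:
                files = [match]
            for path in files:
                if os.path.isfile(path) and path not in seen:
                    seen.add(path)
                    result.append(path)
    return result


# Demonstration: b.conf is treated as already captured by another role.
with tempfile.TemporaryDirectory() as tmp:
    app = os.path.join(tmp, "app")
    os.makedirs(os.path.join(app, "sub"))
    for rel in ("a.conf", "b.conf", os.path.join("sub", "c.conf")):
        with open(os.path.join(app, rel), "w") as fh:
            fh.write("x\n")
    expanded = expand_includes(
        [os.path.join(app, "*.conf"), os.path.join(app, "sub")],
        already_captured={os.path.join(app, "b.conf")},
    )
    expanded_rel = [os.path.relpath(p, app) for p in expanded]
```

Each path that survives would then be filtered, copied, and recorded with reason="user_include".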