Compare commits

10 commits
0.1.0 ... main

Author SHA1 Message Date
a4a15fba57
Add URL to pyproject
All checks were successful
CI / test (push) Successful in 2m28s
Lint / test (push) Successful in 28s
Trivy / test (push) Successful in 22s
2026-01-02 14:21:40 +11:00
4315503d19
ssh bandit 2026-01-02 14:18:24 +11:00
55a815564f
* Add `--bypass-csp` option to ignore an existing enforcing CSP to avoid it skewing results
* Add `--evaluate` option to test a proposed CSP without needing to install it (best used in conjunction with `--bypass-csp`)
Some checks failed
CI / test (push) Successful in 2m29s
Lint / test (push) Failing after 29s
Trivy / test (push) Successful in 23s
2026-01-02 14:09:56 +11:00
16cd1e4b40
Update README
All checks were successful
CI / test (push) Successful in 2m26s
Lint / test (push) Successful in 30s
Trivy / test (push) Successful in 23s
2026-01-02 11:04:39 +11:00
052187d308
0.1.1
All checks were successful
Lint / test (push) Successful in 29s
Trivy / test (push) Successful in 23s
CI / test (push) Successful in 2m30s
2026-01-02 10:56:05 +11:00
25d50c375b
nosec
Some checks failed
CI / test (push) Has been cancelled
Lint / test (push) Successful in 27s
Trivy / test (push) Has been cancelled
2026-01-02 10:55:08 +11:00
9c9ab92a8d
update README 2026-01-02 10:53:18 +11:00
2f2eccf053
Fix detection of Python for AppImage if it needs to install browsers via playwright
Some checks failed
CI / test (push) Successful in 2m29s
Lint / test (push) Failing after 29s
Trivy / test (push) Successful in 23s
2026-01-02 10:50:53 +11:00
bfa16a145a
Add --ignore-non-html option to skip pages that weren't HTML (which might trigger Chromium's 'sha256-4Su6mBWzEIFnH4pAGMOuaeBrstwJN4Z3pq/s1Kn4/KQ=' hash)
All checks were successful
CI / test (push) Successful in 2m48s
Lint / test (push) Successful in 31s
Trivy / test (push) Successful in 23s
2026-01-02 10:41:57 +11:00
09aa2ded5e
Fix prog name 2026-01-02 10:28:46 +11:00
6 changed files with 403 additions and 37 deletions

14
CHANGELOG.md Normal file
View file

@ -0,0 +1,14 @@
## 0.1.2
* Add `--bypass-csp` option to ignore an existing enforcing CSP to avoid it skewing results
* Add `--evaluate` option to test a proposed CSP without needing to install it (best used in conjunction with `--bypass-csp`)
## 0.1.1
* Fix prog name
* Add --ignore-non-html option to skip pages that weren't HTML (which might trigger Chromium's 'sha256-4Su6mBWzEIFnH4pAGMOuaeBrstwJN4Z3pq/s1Kn4/KQ=' hash)
* Fix detection of Python for AppImage if it needs to install browsers via playwright
## 0.1.0
* Initial release

View file

@ -18,11 +18,14 @@ This is meant as a **starting point**. Review and tighten the resulting policy b
## Requirements ## Requirements
- Python 3.10+ - Python 3.10+
- Poetry
- Playwright's Chromium browser binaries (auto-installed by this tool if missing) - Playwright's Chromium browser binaries (auto-installed by this tool if missing)
## Install ## Install
If using my artifacts from the Releases page, you may wish to verify the GPG signatures with the key.
It can be found at https://mig5.net/static/mig5.asc . The fingerprint is `00AE817C24A10C2540461A9C1D7CDE0234DB458D`.
### Poetry ### Poetry
```bash ```bash
@ -42,7 +45,7 @@ Download the CSPresso.AppImage from the releases page, make it executable with `
## Run ## Run
```bash ```bash
poetry run cspresso https://example.com --max-pages 10 cspresso https://example.com --max-pages 10
``` ```
The tool will: The tool will:
@ -51,6 +54,15 @@ The tool will:
3) crawl same-origin links up to the page limit 3) crawl same-origin links up to the page limit
4) print the visited URLs and a CSP header 4) print the visited URLs and a CSP header
### Avoiding an existing enforcing CSP header during analysis
**NOTE**: If your site already serves an enforcing CSP header, this could skew `cspresso`'s
analysis of what the page loads. Consider adding `--bypass-csp` to ignore the current CSP
(noting that if your site is compromised, doing so could put your machine at risk, since the
browser may then evaluate malicious JavaScript, CSS, etc.).
See also the `--evaluate` option below.
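As a rough illustration of what `--bypass-csp` and `--evaluate` do to a document's response headers, here is a minimal sketch. The function name `rewrite_csp_headers` is hypothetical and not part of `cspresso`; it only mirrors the header handling described for these two options, under the assumption that headers arrive as a plain dict.

```python
from typing import Optional


def rewrite_csp_headers(
    headers: dict,
    *,
    bypass_csp: bool = False,
    evaluate_policy: Optional[str] = None,
) -> dict:
    """Return a copy of `headers` with CSP headers stripped and/or replaced.

    Hypothetical helper, for illustration only.
    """
    # Header names are case-insensitive, so normalise them to lowercase first.
    out = {k.lower(): v for k, v in headers.items()}
    if bypass_csp:
        # Drop both the enforcing and the report-only variants.
        out.pop("content-security-policy", None)
        out.pop("content-security-policy-report-only", None)
    if evaluate_policy:
        # The candidate policy is only ever added in Report-Only mode.
        out["content-security-policy-report-only"] = evaluate_policy
    return out


original = {
    "Content-Type": "text/html",
    "Content-Security-Policy": "default-src 'self'",
}
rewritten = rewrite_csp_headers(
    original, bypass_csp=True, evaluate_policy="default-src 'none';"
)
print(rewritten)
```

Because the candidate policy is injected as Report-Only, the page still loads normally while violations are logged, which is why combining the two options gives the cleanest picture.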
## Where Playwright installs browsers ## Where Playwright installs browsers
By default, this project installs Playwright browsers into a local folder: `./.pw-browsers`. By default, this project installs Playwright browsers into a local folder: `./.pw-browsers`.
@ -63,7 +75,7 @@ You can override with `--browsers-path` or by setting `PLAYWRIGHT_BROWSERS_PATH`
If Chromium fails to start due to missing system libraries, try: If Chromium fails to start due to missing system libraries, try:
```bash ```bash
poetry run cspresso https://example.com --with-deps cspresso https://example.com --with-deps
``` ```
That runs `python -m playwright install --with-deps chromium` (may require sudo depending on your environment). That runs `python -m playwright install --with-deps chromium` (may require sudo depending on your environment).
@ -75,14 +87,65 @@ Default output is a single CSP header line.
For JSON: For JSON:
```bash ```bash
poetry run cspresso https://example.com --json cspresso https://example.com --json
```
## Evaluate a proposed CSP without installing it
You can use `cspresso` to evaluate a *proposed* CSP against a site. When you do this, `cspresso` rewrites
the site's HTML document responses to add a `Content-Security-Policy-Report-Only` header containing the CSP
you supplied to `--evaluate`. If it detects any violations, it reports them and exits with code 1,
which may be useful in CI.
**NOTE**: It is highly recommended to use `--bypass-csp` in addition to `--evaluate`, so that your
results are not influenced by any existing CSP's enforcement.
**Example:**
```bash
poetry run cspresso https://mig5.net --evaluate "default-src 'none'" --bypass-csp --json
{
"csp": "base-uri 'self'; default-src 'self'; form-action 'self'; frame-ancestors 'self'; object-src 'none'; style-src 'self' 'sha256-4Su6mBWzEIFnH4pAGMOuaeBrstwJN4Z3pq/s1Kn4/KQ=' 'unsafe-hashes'; style-src-attr 'sha256-4Su6mBWzEIFnH4pAGMOuaeBrstwJN4Z3pq/s1Kn4/KQ=' 'unsafe-hashes';",
"directives": {},
"evaluated_policy": "default-src 'none'",
"nonce_detected": false,
"notes": [
"Detected inline attribute code (style=\"...\" and/or on*=\"...\"). Hashes for these require 'unsafe-hashes' (and modern browsers may use style-src-attr/script-src-attr)."
],
"violations": [
{
"console": true,
"disposition": "report",
"documentURI": "https://mig5.net/",
"text": "Loading the stylesheet 'https://mig5.net/style.css' violates the following Content Security Policy directive: \"default-src 'none'\". Note that 'style-src-elem' was not explicitly set, so 'default-src' is used as a fallback. The policy is report-only, so the violation has been logged but no further action has been taken.",
"type": "info"
},
{
"console": true,
"disposition": "report",
"documentURI": "https://mig5.net/static/mig5.asc",
"text": "Applying inline style violates the following Content Security Policy directive 'default-src 'none''. Either the 'unsafe-inline' keyword, a hash ('sha256-4Su6mBWzEIFnH4pAGMOuaeBrstwJN4Z3pq/s1Kn4/KQ='), or a nonce ('nonce-...') is required to enable inline execution. Note that hashes do not apply to event handlers, style attributes and javascript: navigations unless the 'unsafe-hashes' keyword is present. Note also that 'style-src' was not explicitly set, so 'default-src' is used as a fallback. The policy is report-only, so the violation has been logged but no further action has been taken.",
"type": "info"
}
],
"visited": [
"https://mig5.net",
"https://mig5.net/",
"https://mig5.net/static/mig5.asc"
]
}
cspresso on  main [!] via 🐍 v3.13.5 took 18s
echo $?
1
``` ```
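In a CI job, the JSON output shown above could be post-processed before failing the build. The following is a minimal sketch, assuming the `violations`, `evaluated_policy`, `documentURI` and `text` keys seen in the example output. `summarize_violations` is a hypothetical helper, not part of `cspresso`, and the sample report is abbreviated for illustration.

```python
import json


def summarize_violations(report_json: str):
    """Return (exit_code, summary_lines) for a `cspresso --json` report.

    Hypothetical helper, for illustration only.
    """
    report = json.loads(report_json)
    violations = report.get("violations") or []
    # One short line per violation: the document it occurred on plus
    # the first 80 characters of the message.
    lines = [
        f"{v.get('documentURI', '?')}: {(v.get('text') or '')[:80]}"
        for v in violations
    ]
    # Same rule as the CLI: fail only when a policy was evaluated
    # and at least one violation was recorded.
    exit_code = 1 if (report.get("evaluated_policy") and violations) else 0
    return exit_code, lines


# Abbreviated sample in the shape of the output above (assumed, not real data).
sample = json.dumps(
    {
        "evaluated_policy": "default-src 'none'",
        "violations": [
            {
                "console": True,
                "disposition": "report",
                "documentURI": "https://example.com/",
                "text": "Loading the stylesheet violates the policy.",
                "type": "info",
            }
        ],
    }
)

code, lines = summarize_violations(sample)
for line in lines:
    print(line)
print("exit code:", code)
```

Checking the process exit status directly is usually enough. Parsing the JSON is only worthwhile if the job should print or filter the individual violations.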
## Full usage info ## Full usage info
``` ```
usage: csp-crawl [-h] [--max-pages MAX_PAGES] [--timeout-ms TIMEOUT_MS] [--settle-ms SETTLE_MS] [--headed] [--no-install] [--with-deps] [--browsers-path BROWSERS_PATH] [--allow-blob] [--unsafe-eval] usage: cspresso [-h] [--max-pages MAX_PAGES] [--timeout-ms TIMEOUT_MS] [--settle-ms SETTLE_MS] [--headed] [--no-install] [--with-deps] [--browsers-path BROWSERS_PATH] [--allow-blob] [--unsafe-eval]
[--upgrade-insecure-requests] [--include-sourcemaps] [--json] [--upgrade-insecure-requests] [--include-sourcemaps] [--bypass-csp] [--evaluate CSP] [--ignore-non-html] [--json]
url url
Crawl up to N pages (same-origin) with Playwright and generate a draft CSP. Crawl up to N pages (same-origin) with Playwright and generate a draft CSP.
@ -108,5 +171,8 @@ options:
--upgrade-insecure-requests --upgrade-insecure-requests
Add upgrade-insecure-requests directive Add upgrade-insecure-requests directive
--include-sourcemaps Analyze JS/CSS for sourceMappingURL and add map origins to connect-src --include-sourcemaps Analyze JS/CSS for sourceMappingURL and add map origins to connect-src
--bypass-csp Strip any existing CSP/CSP-Report-Only response headers from HTML documents (useful for discovery or evaluation).
--evaluate CSP Inject the provided CSP string as Content-Security-Policy-Report-Only on HTML documents and exit 1 if any Report-Only violations are detected. Quote the value.
--ignore-non-html Ignore non-HTML pages that get crawled (which might trigger Chromium's word-wrap hash: https://stackoverflow.com/a/69838710)
--json Output JSON instead of a header line --json Output JSON instead of a header line
``` ```

View file

@ -1,11 +1,12 @@
[tool.poetry] [tool.poetry]
name = "cspresso" name = "cspresso"
version = "0.1.0" version = "0.1.2"
description = "Crawl a website with a headless browser and generate a draft Content-Security-Policy (CSP)." description = "Crawl a website with a headless browser and generate a draft Content-Security-Policy (CSP)."
authors = ["Miguel Jacq <mig@mig5.net>"] authors = ["Miguel Jacq <mig@mig5.net>"]
readme = "README.md" readme = "README.md"
packages = [{ include = "cspresso", from = "src" }] packages = [{ include = "cspresso", from = "src" }]
license = "GPL-3.0-or-later" license = "GPL-3.0-or-later"
homepage = "https://cspresso.cafe"
repository = "https://git.mig5.net/mig5/cspresso" repository = "https://git.mig5.net/mig5/cspresso"
[tool.poetry.dependencies] [tool.poetry.dependencies]

View file

@ -1,4 +1,5 @@
import sys
from .crawl import main from .crawl import main
if __name__ == "__main__": if __name__ == "__main__":
main() sys.exit(main())

View file

@ -48,6 +48,13 @@ def sha256_base64(s: str) -> str:
return base64.b64encode(h).decode("ascii") return base64.b64encode(h).decode("ascii")
def normalize_csp_string(csp: str) -> str:
s = (csp or "").strip()
if not s:
return s
return s if s.endswith(";") else s + ";"
async def collect_inline(page, *, max_attr_hashes: int = 2000): async def collect_inline(page, *, max_attr_hashes: int = 2000):
""" """
Collect inline <script> (no src), <style> blocks, plus: Collect inline <script> (no src), <style> blocks, plus:
@ -291,6 +298,7 @@ class CrawlResult:
nonce_detected: bool nonce_detected: bool
directives: dict[str, list[str]] directives: dict[str, list[str]]
notes: list[str] notes: list[str]
violations: list[dict]
async def crawl_and_generate_csp( async def crawl_and_generate_csp(
@ -307,6 +315,9 @@ async def crawl_and_generate_csp(
allow_unsafe_eval: bool = False, allow_unsafe_eval: bool = False,
upgrade_insecure_requests: bool = False, upgrade_insecure_requests: bool = False,
include_sourcemaps: bool = False, include_sourcemaps: bool = False,
ignore_non_html: bool = False,
bypass_csp: bool = False,
evaluate: str | None = None, # CSP string to inject as Report-Only and evaluate
) -> CrawlResult: ) -> CrawlResult:
start_url, _ = urldefrag(start_url) start_url, _ = urldefrag(start_url)
base_origin = origin_of(start_url) base_origin = origin_of(start_url)
@ -334,10 +345,48 @@ async def crawl_and_generate_csp(
allow_data_font = False allow_data_font = False
notes: list[str] = [] notes: list[str] = []
evaluate_policy = normalize_csp_string(evaluate) if evaluate else None
# Captured CSP violations (Report-Only) when --evaluate is used.
violations: list[dict] = []
async with async_playwright() as p: async with async_playwright() as p:
browser = await p.chromium.launch(headless=headless) browser = await p.chromium.launch(headless=headless)
context = await browser.new_context() context = await browser.new_context()
# Optionally strip any existing CSP headers, and/or inject a Report-Only CSP for evaluation.
# NOTE: This operates on *document response headers* only.
if bypass_csp or evaluate_policy:
async def _route_handler(route, request):
try:
if request.resource_type != "document":
return await route.continue_()
resp = await route.fetch()
hdrs = {k.lower(): v for k, v in (resp.headers or {}).items()}
if bypass_csp:
hdrs.pop("content-security-policy", None)
hdrs.pop("content-security-policy-report-only", None)
if evaluate_policy:
hdrs["content-security-policy-report-only"] = evaluate_policy
try:
return await route.fulfill(response=resp, headers=hdrs)
except TypeError:
body = await resp.body()
return await route.fulfill(
status=resp.status, headers=hdrs, body=body
)
except Exception:
try:
return await route.continue_()
except Exception:
return
await context.route("**/*", _route_handler)
def on_request(req): def on_request(req):
""" """
Playwright sometimes classifies "connect-like" activity as resource_type == "other". Playwright sometimes classifies "connect-like" activity as resource_type == "other".
@ -379,6 +428,59 @@ async def crawl_and_generate_csp(
page = await context.new_page() page = await context.new_page()
# If evaluating a candidate CSP, capture Report-Only violations.
if evaluate_policy:
def _record_violation(_source, payload):
try:
if (
isinstance(payload, dict)
and payload.get("disposition") == "report"
):
violations.append(payload)
except Exception:
return
try:
await page.expose_binding("__cspresso_violation", _record_violation)
# Playwright for Python evaluates the script string as-is, so the arrow function
# must invoke itself (IIFE); otherwise the listener is never registered.
await page.add_init_script(
"(() => { try { window.addEventListener('securitypolicyviolation', (e) => { "
"const payload = {documentURI:e.documentURI, referrer:e.referrer, blockedURI:e.blockedURI, "
"violatedDirective:e.violatedDirective, effectiveDirective:e.effectiveDirective, originalPolicy:e.originalPolicy, "
"disposition:e.disposition, sourceFile:e.sourceFile, lineNumber:e.lineNumber, columnNumber:e.columnNumber, "
"statusCode:e.statusCode, sample:e.sample}; "
"if (typeof window.__cspresso_violation === 'function') { window.__cspresso_violation(payload); }"
"}, true); } catch(_){} })();"
)
except Exception:
pass # nosec
def _on_console(msg):
try:
t = msg.text or ""
tl = t.lower()
if (
"content security policy" in tl
or "content-security-policy" in tl
) and (
"would violate" in tl
or "report-only" in tl
or "report only" in tl
):
violations.append(
{
"console": True,
"type": msg.type,
"text": t,
"documentURI": page.url,
"disposition": "report",
}
)
except Exception:
return
page.on("console", _on_console)
pending: set[asyncio.Task] = set() pending: set[asyncio.Task] = set()
if include_sourcemaps: if include_sourcemaps:
@ -402,7 +504,6 @@ async def crawl_and_generate_csp(
directives.setdefault("connect-src", set()).add(o) directives.setdefault("connect-src", set()).add(o)
except Exception: except Exception:
# If you want to debug failures, print(traceback.format_exc())
return return
def on_response(resp): def on_response(resp):
@ -413,7 +514,18 @@ async def crawl_and_generate_csp(
page.on("response", on_response) page.on("response", on_response)
try: try:
await page.goto(url, wait_until="networkidle", timeout=timeout_ms) resp = await page.goto(
url, wait_until="networkidle", timeout=timeout_ms
)
ct = ""
if resp is not None:
ct = (await resp.header_value("content-type") or "").lower()
is_html = ("text/html" in ct) or ("application/xhtml+xml" in ct)
if not is_html and ignore_non_html:
# Still count as visited, but don't hash inline attrs / don't extract links.
continue
# Give the page a moment to run hydration / delayed fetches. # Give the page a moment to run hydration / delayed fetches.
if settle_ms > 0: if settle_ms > 0:
@ -488,18 +600,41 @@ async def crawl_and_generate_csp(
) )
directives_out = {k: sorted(v) for k, v in directives.items() if v} directives_out = {k: sorted(v) for k, v in directives.items() if v}
# De-duplicate violations to keep output stable. Console-derived entries carry only
# a message text, so the text is part of the key to keep distinct messages apart.
if violations:
seen = set()
uniq: list[dict] = []
for v in violations:
if not isinstance(v, dict):
continue
key = (
v.get("documentURI"),
v.get("effectiveDirective") or v.get("violatedDirective"),
v.get("blockedURI"),
v.get("sourceFile"),
v.get("lineNumber"),
v.get("columnNumber"),
v.get("text"),
)
if key in seen:
continue
seen.add(key)
uniq.append(v)
violations = uniq
return CrawlResult( return CrawlResult(
visited=sorted(visited), visited=sorted(visited),
csp=csp, csp=csp,
nonce_detected=nonce_detected, nonce_detected=nonce_detected,
directives=directives_out, directives=directives_out,
notes=notes, notes=notes,
violations=violations,
) )
def _parse_args(argv: list[str] | None = None) -> argparse.Namespace: def _parse_args(argv: list[str] | None = None) -> argparse.Namespace:
ap = argparse.ArgumentParser( ap = argparse.ArgumentParser(
prog="csp-crawl", prog="cspresso",
description="Crawl up to N pages (same-origin) with Playwright and generate a draft CSP.", description="Crawl up to N pages (same-origin) with Playwright and generate a draft CSP.",
) )
ap.add_argument("url", help="Start URL (e.g. https://example.com)") ap.add_argument("url", help="Start URL (e.g. https://example.com)")
@ -565,13 +700,31 @@ def _parse_args(argv: list[str] | None = None) -> argparse.Namespace:
default=False, default=False,
help="Analyze JS/CSS for sourceMappingURL and add map origins to connect-src", help="Analyze JS/CSS for sourceMappingURL and add map origins to connect-src",
) )
ap.add_argument(
"--bypass-csp",
action="store_true",
help="Strip any existing CSP/CSP-Report-Only response headers from HTML documents (useful for discovery or evaluation).",
)
ap.add_argument(
"--evaluate",
metavar="CSP",
default=None,
help="Inject the provided CSP string as Content-Security-Policy-Report-Only on HTML documents and exit 1 if any Report-Only violations are detected. Quote the value.",
)
ap.add_argument(
"--ignore-non-html",
action="store_true",
default=False,
help="Ignore non-HTML pages that get crawled (which might trigger Chromium's word-wrap hash: https://stackoverflow.com/a/69838710)",
)
ap.add_argument( ap.add_argument(
"--json", action="store_true", help="Output JSON instead of a header line" "--json", action="store_true", help="Output JSON instead of a header line"
) )
return ap.parse_args(argv) return ap.parse_args(argv)
def main(argv: list[str] | None = None) -> None: def main(argv: list[str] | None = None) -> int:
args = _parse_args(argv) args = _parse_args(argv)
browsers_path = Path(args.browsers_path).resolve() if args.browsers_path else None browsers_path = Path(args.browsers_path).resolve() if args.browsers_path else None
@ -589,6 +742,9 @@ def main(argv: list[str] | None = None) -> None:
allow_unsafe_eval=args.unsafe_eval, allow_unsafe_eval=args.unsafe_eval,
upgrade_insecure_requests=args.upgrade_insecure_requests, upgrade_insecure_requests=args.upgrade_insecure_requests,
include_sourcemaps=args.include_sourcemaps, include_sourcemaps=args.include_sourcemaps,
bypass_csp=args.bypass_csp,
evaluate=args.evaluate,
ignore_non_html=args.ignore_non_html,
) )
) )
@ -601,12 +757,14 @@ def main(argv: list[str] | None = None) -> None:
"csp": result.csp, "csp": result.csp,
"directives": result.directives, "directives": result.directives,
"notes": result.notes, "notes": result.notes,
"violations": result.violations,
"evaluated_policy": args.evaluate,
}, },
indent=2, indent=2,
sort_keys=True, sort_keys=True,
) )
) )
return return 1 if (args.evaluate and result.violations) else 0
# Default: print header + visited pages as comments. # Default: print header + visited pages as comments.
for u in result.visited: for u in result.visited:
@ -615,6 +773,24 @@ def main(argv: list[str] | None = None) -> None:
print(f"# NOTE: {n}") print(f"# NOTE: {n}")
print("Content-Security-Policy:", result.csp) print("Content-Security-Policy:", result.csp)
if args.evaluate:
if result.violations:
print("# CSP Report-Only violations detected:")
for v in result.violations:
try:
blocked = v.get("blockedURI")
eff = v.get("effectiveDirective") or v.get("violatedDirective")
doc = v.get("documentURI")
print(f"# - {eff} blocked={blocked} on {doc}")
except Exception:
print(f"# - {v}")
return 1
return 0
return 0
if __name__ == "__main__": if __name__ == "__main__":
main() import sys
sys.exit(main())

View file

@ -1,14 +1,18 @@
from __future__ import annotations from __future__ import annotations
import os import os
import sys import shutil
import time
import subprocess # nosec import subprocess # nosec
import sys
import tempfile
import time
from dataclasses import dataclass from dataclasses import dataclass
from pathlib import Path from pathlib import Path
from playwright.async_api import async_playwright, Error as PlaywrightError from playwright.async_api import async_playwright, Error as PlaywrightError
__all__ = ["EnsureResult", "ensure_chromium_installed"]
@dataclass(frozen=True) @dataclass(frozen=True)
class EnsureResult: class EnsureResult:
@ -16,9 +20,93 @@ class EnsureResult:
installed: bool installed: bool
def _user_cache_dir() -> Path:
"""
Cross-platform cache dir without extra deps.
Linux: $XDG_CACHE_HOME or ~/.cache
macOS: ~/Library/Caches
Windows: %LOCALAPPDATA%
"""
if os.name == "nt":
base = os.environ.get("LOCALAPPDATA") or str(Path.home() / "AppData" / "Local")
return Path(base)
if sys.platform == "darwin":
return Path.home() / "Library" / "Caches"
return Path(os.environ.get("XDG_CACHE_HOME", str(Path.home() / ".cache")))
def _default_browsers_path() -> Path: def _default_browsers_path() -> Path:
# Project-local by default. Override with PLAYWRIGHT_BROWSERS_PATH or CLI flag. """
return Path(__file__).resolve().parents[2] / ".pw-browsers" If PLAYWRIGHT_BROWSERS_PATH is set, honor it (Playwright-standard).
Otherwise use a user-writable cache path (safe for AppImage/pip installs).
"""
env = os.environ.get("PLAYWRIGHT_BROWSERS_PATH")
if env and env.strip() and env.strip() != "0":
return Path(env).expanduser()
return _user_cache_dir() / "cspresso" / "pw-browsers"
def _looks_like_python(path: str) -> bool:
p = Path(path)
name = p.name.lower()
return (
p.exists()
and os.access(str(p), os.X_OK)
and name.startswith("python")  # covers "python", "python3", "python3.x"
)
def _find_python_executable() -> str:
"""
In AppImage bundles, sys.executable may be the AppImage itself.
We need the embedded python binary so we can run: python -m playwright install chromium
"""
# 1) Normal venv/system case
if _looks_like_python(sys.executable):
return sys.executable
# 2) Sometimes present
base = getattr(sys, "_base_executable", None)
if base and _looks_like_python(base):
return base
# 3) Embedded python typically lives under sys.prefix/bin
bindir = "Scripts" if os.name == "nt" else "bin"
candidates = [
Path(sys.prefix)
/ bindir
/ f"python{sys.version_info.major}.{sys.version_info.minor}",
Path(sys.prefix) / bindir / f"python{sys.version_info.major}",
Path(sys.prefix) / bindir / "python3",
Path(sys.prefix) / bindir / "python",
Path(sys.base_prefix)
/ bindir
/ f"python{sys.version_info.major}.{sys.version_info.minor}",
Path(sys.base_prefix) / bindir / f"python{sys.version_info.major}",
Path(sys.base_prefix) / bindir / "python3",
Path(sys.base_prefix) / bindir / "python",
]
for c in candidates:
if _looks_like_python(str(c)):
return str(c)
# 4) Last resort: host python on PATH
for name in (
f"python{sys.version_info.major}.{sys.version_info.minor}",
"python3",
"python",
):
p = shutil.which(name)
if p and _looks_like_python(p):
return p
# Fallback (won't fix AppImage, but avoids crashing)
return sys.executable
def _env_with_browsers_path(browsers_path: Path) -> dict[str, str]: def _env_with_browsers_path(browsers_path: Path) -> dict[str, str]:
@ -27,14 +115,20 @@ def _env_with_browsers_path(browsers_path: Path) -> dict[str, str]:
return env return env
def _is_writable_dir(path: Path) -> bool:
try:
path.mkdir(parents=True, exist_ok=True)
probe = path / ".write_probe"
probe.write_text("x", encoding="utf-8")
probe.unlink(missing_ok=True)
return True
except OSError:
return False
def _acquire_install_lock( def _acquire_install_lock(
lock_path: Path, timeout_s: float = 120.0, poll_s: float = 0.2 lock_path: Path, timeout_s: float = 120.0, poll_s: float = 0.2
) -> None: ) -> None:
"""Very small cross-platform lock using atomic file creation.
Avoids concurrent Playwright installs when multiple processes start at once.
Not perfect, but good enough for most CLI usage.
"""
start = time.time() start = time.time()
while True: while True:
try: try:
@ -49,14 +143,16 @@ def _acquire_install_lock(
def _release_install_lock(lock_path: Path) -> None: def _release_install_lock(lock_path: Path) -> None:
try: try:
lock_path.unlink(missing_ok=True) # Python 3.8+ lock_path.unlink(missing_ok=True)
except Exception: except Exception:
pass # nosec pass # nosec
def _install_chromium(browsers_path: Path, with_deps: bool = False) -> None: def _install_chromium(browsers_path: Path, with_deps: bool = False) -> None:
env = _env_with_browsers_path(browsers_path) env = _env_with_browsers_path(browsers_path)
cmd = [sys.executable, "-m", "playwright", "install"] py = _find_python_executable()
cmd = [py, "-m", "playwright", "install"]
if with_deps: if with_deps:
cmd.append("--with-deps") cmd.append("--with-deps")
cmd.append("chromium") cmd.append("chromium")
@ -65,7 +161,6 @@ def _install_chromium(browsers_path: Path, with_deps: bool = False) -> None:
async def _can_launch_chromium(browsers_path: Path) -> bool: async def _can_launch_chromium(browsers_path: Path) -> bool:
# Ensure this process uses the same path too.
os.environ["PLAYWRIGHT_BROWSERS_PATH"] = str(browsers_path) os.environ["PLAYWRIGHT_BROWSERS_PATH"] = str(browsers_path)
try: try:
async with async_playwright() as p: async with async_playwright() as p:
@ -82,23 +177,36 @@ async def ensure_chromium_installed(
with_deps: bool = False, with_deps: bool = False,
lock_timeout_s: float = 120.0, lock_timeout_s: float = 120.0,
) -> EnsureResult: ) -> EnsureResult:
"""Ensure Playwright's Chromium is installed and launchable.
Strategy:
- Attempt a tiny headless launch.
- If it fails, acquire a lock and run `python -m playwright install chromium` (optionally --with-deps).
- Retry launch once.
""" """
bp = browsers_path or _default_browsers_path() Ensure Playwright Chromium is installed and launchable.
bp.mkdir(parents=True, exist_ok=True)
- Honors PLAYWRIGHT_BROWSERS_PATH if set.
- Defaults to a user cache dir (safe for AppImage readonly mounts).
- Uses embedded python to run playwright installer when sys.executable is the AppImage.
"""
explicit = browsers_path is not None
bp = browsers_path or _default_browsers_path()
# If it already works, do nothing.
if await _can_launch_chromium(bp): if await _can_launch_chromium(bp):
return EnsureResult(browsers_path=bp, installed=False) return EnsureResult(browsers_path=bp, installed=False)
# If we need to install and the chosen dir isn't writable, fall back (unless explicit).
if not explicit and not _is_writable_dir(bp):
bp = _user_cache_dir() / "cspresso" / "pw-browsers"
if not _is_writable_dir(bp):
bp = Path(tempfile.gettempdir()) / "cspresso" / "pw-browsers"
bp.mkdir(parents=True, exist_ok=True)
if explicit and not _is_writable_dir(bp):
raise OSError(
f"Browsers path is not writable: {bp}\n"
"Choose a writable directory via --browsers-path or set PLAYWRIGHT_BROWSERS_PATH."
)
lock_path = bp / ".install.lock" lock_path = bp / ".install.lock"
_acquire_install_lock(lock_path, timeout_s=lock_timeout_s) _acquire_install_lock(lock_path, timeout_s=lock_timeout_s)
try: try:
# Another process might have installed while we waited; check again.
if await _can_launch_chromium(bp): if await _can_launch_chromium(bp):
return EnsureResult(browsers_path=bp, installed=False) return EnsureResult(browsers_path=bp, installed=False)
@ -106,7 +214,7 @@ async def ensure_chromium_installed(
if not await _can_launch_chromium(bp): if not await _can_launch_chromium(bp):
raise RuntimeError( raise RuntimeError(
"Playwright Chromium install completed, but Chromium still failed to launch. " "Chromium install completed, but Chromium still failed to launch. "
"On Linux, you may need additional system dependencies." "On Linux, you may need additional system dependencies."
) )