diff --git a/.gitignore b/.gitignore index 1d4390d..191b622 100644 --- a/.gitignore +++ b/.gitignore @@ -100,3 +100,5 @@ dist /README-temp.md .DS_Store /.vscode/snipsnap.code-snippets +.vercel +.env*.local diff --git a/docs/plans/app-discovery-d1-automation.md b/docs/plans/app-discovery-d1-automation.md new file mode 100644 index 0000000..addef70 --- /dev/null +++ b/docs/plans/app-discovery-d1-automation.md @@ -0,0 +1,216 @@ +# Original Prompt + +> Let's start a draft plan in docs/plans to get the app to a place where we can pull in these app data from sources automatically. +> +> We'll need to migrate all the existing and new data from Google Sheets to Cloudflare D1. +> +> We'll need to set up and run the app discovery pipeline locally on this machine. +> +> We'll need to test if we can get the app scanner to be isomorphic so that we can scan from both node/bun and the browser. +> +> We'll need to audit and fix the pending app compatibility issue for the scanner. +> +> We'll need to set up an automated app discovery either on GitHub Actions or Cloudflare. +> +> Read through our docs and let me know any other work we need to add to this list + +# Goal + +Get Does It ARM to a local-first, automatable app discovery and scan pipeline that uses Cloudflare D1 as the canonical data store, can be run on this machine, can backfill legacy data from Google Sheets and other current feeds, can later run on a scheduler without breaking the existing site build, and lives inside a maintainable public monorepo with explicit app, package, database, and infrastructure boundaries. + +# Non-Goals + +- Rebuild the frontend or replace Astro/Netlify in the first pass. +- Automate every source class on day one. +- Force full browser and Node/Bun archive parity before the feasibility spike is complete. +- Remove the manual README/list flow before D1-backed equivalents exist. +- Switch package manager/runtime just because the inspiration repo uses Bun. 
+ +# Repo Findings + +- There is already one scanner-focused plan in `docs/plans/app-test-typescript-refactor.md`, but it is narrower than this request and mainly covers Playwright safety rails plus incremental TypeScript conversion. +- The repo is still organized as a flat root with mixed app, build, worker, helper, and infra concerns rather than as a workspace monorepo. +- The app-test UI lives in `pages/apple-silicon-app-test.vue` and is mounted by `src/pages/apple-silicon-app-test.astro`. +- There are two scanner surfaces today: +- `helpers/app-files-scanner.js` is the legacy path still used by the page by default. +- `helpers/scanner/scan.ts`, `helpers/scanner/client.ts`, and `helpers/scanner/worker.ts` implement the newer worker-based scanner exposed behind `?version=2`. +- Browser coverage already exists for both scanner variants in `test/playwright/apple-silicon-app-test.playwright.ts`. +- That browser coverage is not currently green in local execution: on April 4, 2026, `pnpm exec vitest run --config vitest.playwright.config.mjs test/playwright/apple-silicon-app-test.playwright.ts` timed out waiting for the Astro dev server. +- Worker-scanner coverage already exists in `test/scanner/client.test.ts`. +- The site build still assembles static output from remote/env-backed sources rather than a local database: +- `helpers/build-app-list.js` reads `COMMITS_SOURCE`, `SCANS_SOURCE`, and `VFUNCTIONS_URL`. +- `helpers/build-homebrew-list.js` reads `HOMEBREW_SOURCE`. +- `helpers/build-game-list.js` reads `GAMES_SOURCE`. +- `helpers/build-device-list.js` reads `VFUNCTIONS_URL`. +- `build-lists.js` composes the generated app, game, homebrew, device, and video outputs from those inputs. +- `scripts/scan-new-apps.js` is not a working discovery pipeline yet. It has `runScans = false`, exits on Linux, and only spins up a local server stub. +- The repo has Cloudflare deployment automation, but not app-level D1 plumbing in source control. 
`.github/workflows/deploy-cloudflare-workers.yml` deploys `doesitarm-default/` and `workers/analytics/`, and writes Wrangler config from GitHub secrets at CI time. +- That same Cloudflare workflow currently prints secret-derived `.env` and `wrangler.toml` files to CI logs, which should be treated as an immediate security fix. +- There is a local `.env` with the current runtime contract, but there is no checked-in bootstrap/setup doc for another machine or another session. + +# External Research + +- Read alongside `docs/research/public-repo-security-and-monorepo-patterns-2026-04-04.md`. +- The Notion research backlog repeatedly describes the target ingestion loop as: discover source page or asset URL, download the archive, recursively extract ZIP/DMG/PKG contents, find the `.app` or Mach-O payload, scan architecture/metadata, and persist results. +- Notion still points to Homebrew as a primary source class. `Download and scan Homebrew Casks` and `Source from Homebrew Casks URLs` emphasize starting with ZIPs, then DMGs, then known extensions, while filtering already-known/native apps, batching releases, and adding timeouts/retries. +- Additional documented source classes include generic download pages, GitHub app lists and release assets, Mac App Store data, Product Hunt, Nix, and MacPorts. +- The newer `Data Source Priorities` note from May 8, 2024 still lists `Homebrew Cask Scans`, `Homebrew Formulae API`, `App Scans`, and `Product Hunt Apps/API` as important sources. +- Notion’s App Test notes map closely to the repo’s current scanner work: archive URL submission, download-page scanning, recursive extraction, and writing scan data to a store. +- I did not find explicit Playwright or agent-browser planning in Notion. That looks like a newer execution approach layered on top of the older source-discovery research rather than a documented historical plan. 
+- The new security/monorepo research recommends keeping the repo public, keeping secrets and raw operational data private, using a Kriasoft-style `apps/` + `packages/` + `db/` + `infra/` layout, defaulting GitHub Actions to read-only tokens and GitHub-hosted runners, and avoiding privileged `pull_request_target` flows that checkout PR code. + +# Recommendation + +- Make D1 the canonical store for discovered apps, source observations, scan runs, and site-facing aggregated records. +- Refactor toward a Kriasoft-style public monorepo before adding more discovery and infra code to the flat root. Adapt the layout, not the entire toolchain. +- Keep the repo public, but keep secrets, raw imports, downloaded archives, quarantine material, and privileged operational state private. +- Add a public-repo security-hardening stage before expanding automation: remove secret logging, tighten GitHub Actions permissions, guard workflow/infra/db paths with `CODEOWNERS`, ignore local secret/state files, and keep public CI on GitHub-hosted runners. +- Stage the work in this order: monorepo target and boundaries, public-repo security hardening, data model and D1 migration contract, scanner stabilization, local discovery pipeline, D1-backed read-path migration, then scheduled automation. +- Keep the scan core runtime-agnostic where feasible, but do not block delivery on proving identical browser and Node/Bun support for every archive type. If DMG/PKG support is impractical in the browser, make that an explicit server-side fallback instead of an accidental limitation. +- Use GitHub Actions as the first scheduled runner because the repo already has scheduled workflows and no committed D1/Wrangler local workflow yet. Keep Cloudflare as the longer-term execution target if a D1-bound ingestion worker becomes the canonical runtime. + +# Rollout Plan + +1. Define the monorepo target and carve the repository into stable boundaries. 
+- Design the target workspace layout in the style of `kriasoft/react-starter-kit`, adapted to the current pnpm repo: +- `apps/web/` for the Astro site and app-test UI +- `apps/default-worker/` and `apps/analytics-worker/` for current Cloudflare workers +- `apps/discovery/` for discovery/ingest entry points if a dedicated app surface is warranted +- `packages/scanner-core/`, `packages/source-runners/`, `packages/data-model/`, and `packages/site-build/` for shared logic +- `db/` for D1 migrations, seeds, import scripts, and local DB helpers +- `infra/` for Wrangler config, deployment helpers, and infra policy docs +- Keep the package manager as pnpm unless a later decision explicitly changes it. +- Decide which existing root paths move first so the refactor stays reviewable and does not block all feature work at once. + +2. Establish the public-repo security baseline before adding more automation. +- Remove secret printing from CI and rotate any secrets that may have been exposed in logs. +- Adopt a public template env pattern: +- commit a placeholder-only `.env.example` or equivalent template +- keep real credentials in ignored local files and secret managers +- ignore `.env.local`, `.dev.vars`, `.wrangler/`, and other new local-secret/state paths introduced by D1/Workers work +- Set the default `GITHUB_TOKEN` posture to read-only and grant additional permissions per job only where needed. +- Add `CODEOWNERS` or equivalent review protection for `.github/workflows/`, `infra/`, and `db/`. +- Prefer GitHub-hosted runners for public CI and scheduled jobs; do not use self-hosted runners for public PR workflows. +- Add or enable repository security defaults suitable for public OSS: secret scanning, push protection, private vulnerability reporting, dependency review, and workflow/action pinning policy. + +3. Define the source-of-truth data model and migration contract. 
+- Inventory every current sheet/feed that contributes app, scan, bundle, device, game, and Homebrew data. +- Define D1 tables for apps, app versions, bundles, source types, source observations, discovered assets, scan runs, import jobs, and sync checkpoints. +- Define provenance and dedupe rules for slug, bundle ID, download URL, version string, scan hash, source type, and timestamps like `first_seen_at` and `last_seen_at`. +- Write a field-mapping document from Google Sheets and current remote JSON feeds into the D1 schema. +- Decide which current static outputs remain derived snapshots and which should become D1-backed runtime/API reads. +- Add Wrangler/D1 project config, migrations, and local/staging/prod database environments. +- Build import scripts for existing Google Sheets data and any current remote feeds that must be preserved for continuity. +- Add validation output for import counts, duplicate merges, and skipped/error records. +- Document the local bootstrap on this machine: required env keys, auth/setup steps, migration commands, import commands, and reset/rollback commands. + +4. Stabilize the scanner and resolve the pending compatibility issue. +- Identify the concrete pending scanner compatibility issue, capture a reproducible sample or fixture, and turn it into an automated regression test. +- Audit the legacy `helpers/app-files-scanner.js` path against the worker scanner in `helpers/scanner/*` and decide the deprecation/default path. +- Expand scanner fixtures beyond the current happy-path native ZIP to include Intel-only, malformed, nested, DMG, and PKG examples where legally redistributable. +- Decide whether “isomorphic scanner” means one shared scan core with environment-specific file loaders or truly identical archive support in browser and Node/Bun. +- If full parity is not practical, document and test the split: browser handles `.app` and `.zip`; Node/Bun handles heavier formats like `.dmg` and `.pkg`. + +5. 
Build the local app discovery pipeline. +- Implement a bounded CLI or script entry point that can run locally on this machine and execute source discovery, download, extraction, scan, and D1 persistence end to end. +- Start with the highest-value documented sources: Homebrew Casks and Homebrew source URLs, then GitHub releases/lists, then generic download pages. +- Normalize all source runners into one contract: source page -> discovered asset URLs -> fetched artifact -> extracted app candidates -> scan result -> persisted record. +- Add rate limiting, retry/backoff, timeout, file-size guards, duplicate suppression, and quarantine/error buckets for failed downloads or unsupported archives. +- Persist per-run audit data so reruns can skip already-processed successes and focus on failures or changed sources. + +6. Move site and build reads onto D1-backed interfaces. +- Replace or wrap the current remote feed dependencies (`SCANS_SOURCE`, `HOMEBREW_SOURCE`, `GAMES_SOURCE`, `VFUNCTIONS_URL`, `PUBLIC_API_DOMAIN`) with D1-backed queries or exported snapshots generated from D1. +- Define how App Test submissions and automated discovery scans appear on app pages, listings, and API outputs without duplicate records. +- Introduce compatibility adapters so the site can cut over source-by-source instead of all at once. +- Verify whether any normalization currently lives behind the existing remote endpoints and port that logic before removing those dependencies. + +7. Add scheduled automation and operational controls. +- Choose the first scheduler architecture: +- GitHub Actions if the job mostly runs scripts or triggers a small ingestion endpoint. +- Cloudflare if a D1-bound worker becomes the canonical ingestion runtime. +- Add a staging/dry-run lane before production writes. +- Emit run summaries for discovered apps, downloaded assets, successful scans, failed scans, D1 writes, duplicates skipped, and retries exhausted. 
+- Add pause/resume controls per source plus playbooks for reruns, backfills, and bad-import recovery. +- If Cloudflare becomes the later execution target, evaluate Workflows/Queues plus Workers Logs/Traces as the durable background-processing and observability surface. + +# Validation Gates + +- Monorepo scaffolding is internally consistent: +- workspace install succeeds +- shared packages resolve from the intended workspace paths +- moved apps still build from their new locations +- Security baseline changes are reviewed and verified: +- CI no longer prints secret-derived config to logs +- workflow permissions are explicitly declared +- protected paths (`.github/workflows/`, `infra/`, `db/`) have review ownership +- Existing scanner coverage passes: +- `pnpm exec vitest run test/scanner/client.test.ts` +- `pnpm exec vitest run --config vitest.playwright.config.mjs test/playwright/apple-silicon-app-test.playwright.ts` +- Site/build health passes after read-path changes: +- `pnpm typecheck` +- `pnpm build` +- `pnpm test` +- Migration validation artifact exists and is reviewed: +- source row counts vs D1 row counts +- duplicate/merge report for slugs, bundle IDs, versions, and scan hashes +- sample record spot checks across apps, scans, Homebrew, devices, and games +- Local discovery dry run succeeds against a bounded batch: +- at least 10 Homebrew apps +- at least 3 non-Homebrew direct-download pages +- persisted run summary saved to a reviewable artifact +- Automation validation succeeds in staging: +- one scheduled run completes +- rerunning the same batch is idempotent +- no uncontrolled duplicate writes are produced + +# Deliverables + +- A repo-local execution plan in `docs/plans/app-discovery-d1-automation.md` +- A repo-local research memo in `docs/research/public-repo-security-and-monorepo-patterns-2026-04-04.md` +- A target monorepo layout and migration sequence for the current flat-root repo +- A public-repo security baseline for workflows, secrets, and 
local operational state +- D1 schema and migration design +- Local bootstrap documentation for discovery plus D1 +- Import/backfill scripts and validation report shape +- Scanner audit and regression-coverage plan +- Local discovery pipeline plan and runnable entry point +- Scheduled automation decision with staging rollout path + +# Risks And Open Questions + +- The pending scanner compatibility issue is not named in repo docs, so the first task in that stage is to identify the exact failing app/archive and capture it as a test case. +- Browser-safe DMG/PKG support may not be realistic. Forcing parity could delay delivery more than an explicit split runtime. +- The monorepo refactor can easily sprawl if it tries to move every surface at once; the first slice should focus on boundaries that unblock scanner, D1, and discovery work. +- The current site mixes manual README content, build-time remote feeds, and app-test scan data. The D1 cutover boundary needs an explicit decision. +- Generic site downloads and third-party source scraping need a clear policy for rate limits, robots/TOS, timeouts, and file-size caps. +- Existing remote endpoints may contain normalization logic that is not visible in this repo. That logic has to be audited before those feeds are replaced. +- Cloudflare deployment exists today, but D1 local/prod workflow is not checked in. Local reproducibility depends on adding that missing surface. +- Cloudflare deployment may continue to require long-lived tokens in GitHub Actions even after hardening, so environment scoping and review gates matter. +- If the worker scanner becomes default, the legacy scanner should not remain as an unowned fallback indefinitely. 
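The rollout plan above asks for one source-runner contract plus duplicate suppression so reruns can skip already-processed assets. A minimal TypeScript sketch of that idea follows. Every name in it (`DiscoveredAsset`, `SourceRunner`, `dedupeKey`, `planBatch`) is hypothetical and does not exist in the repo today, and the choice of source type, asset URL, and version as the dedupe key is an assumption, not a decided provenance rule.

```typescript
// Hypothetical shape of one discovered download candidate.
interface DiscoveredAsset {
  sourceType: string   // e.g. 'homebrew-cask' (illustrative value)
  sourcePage: string   // page or feed where the asset was found
  assetUrl: string     // direct archive URL (ZIP/DMG/PKG)
  version?: string     // version string, when the source exposes one
}

// Hypothetical runner contract: each source only has to discover assets.
// Download, extraction, scan, and persistence would live in shared code.
interface SourceRunner {
  sourceType: string
  discover(): Promise<DiscoveredAsset[]>
}

// Assumed dedupe rule: source type + asset URL + version.
// URLs are trimmed but not lowercased because paths can be case-sensitive.
function dedupeKey(asset: DiscoveredAsset): string {
  return [asset.sourceType, asset.assetUrl.trim(), asset.version ?? ''].join('|')
}

// Split a discovered batch into assets that still need scanning and a count
// of assets skipped because they were already processed or repeat in-batch.
function planBatch(
  assets: DiscoveredAsset[],
  alreadyProcessed: Set<string>
): { toScan: DiscoveredAsset[]; skipped: number } {
  const seen = new Set(alreadyProcessed)
  const toScan: DiscoveredAsset[] = []
  let skipped = 0

  for (const asset of assets) {
    const key = dedupeKey(asset)
    if (seen.has(key)) {
      skipped += 1
      continue
    }
    seen.add(key)
    toScan.push(asset)
  }

  return { toScan, skipped }
}
```

In a real run the `alreadyProcessed` set would presumably be loaded from the D1 scan-run table, and `skipped` would feed the "duplicates skipped" figure in the run summary.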
+ +# Sources + +- `docs/research/public-repo-security-and-monorepo-patterns-2026-04-04.md` +- `docs/plans/app-test-typescript-refactor.md` +- `docs/app-flow.md` +- `build-lists.js` +- `helpers/build-app-list.js` +- `helpers/build-homebrew-list.js` +- `helpers/build-game-list.js` +- `helpers/build-device-list.js` +- `helpers/app-files-scanner.js` +- `helpers/scanner/scan.ts` +- `helpers/scanner/client.ts` +- `test/playwright/apple-silicon-app-test.playwright.ts` +- `test/scanner/client.test.ts` +- `.github/workflows/deploy-api.yaml` +- `.github/workflows/deploy-cloudflare-workers.yml` +- `.github/workflows/deploy-frontend.yaml` +- `.github/workflows/deploy-functions.yaml` +- Notion: Download and scan Homebrew Casks +- Notion: Source from Homebrew Casks URLs +- Notion: Source from generic App/Download page links +- Notion: App Tester +- Notion: Get App Test Working in Node +- Notion: Detect and decompressed ZIP, DMG, and PKG +- Notion: Data Source Priorities diff --git a/docs/plans/cloudflare-dual-deploy-shadow.md b/docs/plans/cloudflare-dual-deploy-shadow.md new file mode 100644 index 0000000..4cbf9f2 --- /dev/null +++ b/docs/plans/cloudflare-dual-deploy-shadow.md @@ -0,0 +1,198 @@ +# Original Prompt + +> Okay, I've got, we do have environment variables locally, and we may have the same ones on Netlify. Let's go ahead and build a plan to set up this dual deploy to Cloudflare to use the local environment variables and see how that works. Cloudflare, the doesitarm.com is on Cloudflare, so we can just set up the subdomain and it should work. + +# Goal + +Set up a reversible dual-deploy path where the same repo and same commit can be deployed to both Netlify and Cloudflare, with Netlify remaining primary during the migration and Cloudflare serving a shadow subdomain for parity testing, environment validation, and incremental runtime cleanup. + +# Non-Goals + +- Cut production traffic over to Cloudflare in the first pass. 
+- Migrate all site data dependencies to D1 as part of the initial dual-deploy spike. +- Remove Netlify-specific build and deploy paths before Cloudflare parity exists. +- Introduce a separate runtime or package-manager change just because the deployment target changes. + +# Repo Findings + +- The app currently uses the Netlify adapter in `astro.config.mjs` via `@astrojs/netlify`. +- Astro only uses one adapter per build, so dual deploy will require two build configurations rather than one config that targets both platforms simultaneously. +- Astro CLI supports `--config`, which makes a dual-build setup practical without splitting the repo immediately. +- Current environment keys present in local `.env` are: +- `ALL_UPDATE_SUBSCRIBE` +- `BUILD_ID` +- `COMMITS_SOURCE` +- `GAMES_SOURCE` +- `GOOGLE_API_KEY` +- `HOMEBREW_SOURCE` +- `PUBLIC_API_DOMAIN` +- `PUBLIC_URL` +- `SCANS_SOURCE` +- `TEST_RESULT_STORE` +- `URL` +- `VFUNCTIONS_URL` +- `VIDEO_SOURCE` +- The current frontend/build path is still Netlify-shaped: +- `astro.config.mjs` uses `@astrojs/netlify` +- `netlify.toml` defines redirects and build behavior +- `helpers/astro/request.js` calls `helpers/config-node.js`, which reads `netlify.toml` from disk at runtime +- `package.json` scripts are still centered on `netlify-build` +- Existing workflows are split across deploy hooks for functions/frontend and a separate Cloudflare worker deploy lane. +- There is already a Cloudflare zone for `doesitarm.com`, so attaching a Worker or custom domain to a subdomain should be straightforward once the Worker build exists. + +# External Research + +- Astro’s deployment model is adapter-based. Official docs indicate one adapter per build, and the CLI supports `--config <path>`, which is the cleanest way to run a Netlify build and a Cloudflare build from the same repo. +- Cloudflare recommends `wrangler.jsonc` for new projects and supports named environments under `env.<name>`. 
+- Cloudflare local development can load secrets from either `.dev.vars` or `.env`, but Cloudflare explicitly says to choose one or the other rather than relying on both at once. +- Cloudflare supports environment-specific `.env.<environment>` files and merges `.env` files by precedence during local development. +- Cloudflare Workers can be attached to a subdomain either by a route or by `custom_domain`, and custom domains are the preferred fit when the Worker is the origin for that subdomain. +- Astro’s Cloudflare adapter exposes Cloudflare environment variables and bindings through the Worker runtime, and current docs also support importing environment bindings from `cloudflare:workers`. + +# Recommendation + +- Use a dual-build, dual-deploy model from the same commit: +- Netlify remains the primary production deploy target. +- Cloudflare gets a shadow deploy on a subdomain such as `cf-preview.doesitarm.com` or `edge-preview.doesitarm.com`. +- Create a dedicated Cloudflare Astro config rather than trying to make one config branch internally on host. +- Reuse the existing local root `.env` for the first local Cloudflare spike only if there is no `.dev.vars` in play, because Cloudflare supports `.env`-based local loading. +- Treat that reuse as transitional. The durable state should be: +- `.env.example` for committed placeholders +- `.env` or `.env.local` for current app-local development +- Cloudflare-specific `.env.cloudflare`, `.env.staging`, or `.dev.vars` strategy chosen explicitly +- Wrangler-managed secrets for deployed Cloudflare environments +- Keep the first Cloudflare deployment reading the same external sources that Netlify uses today. Do not bundle D1 into the first shadow deploy unless it is required for the Cloudflare app to boot. + +# Rollout Plan + +1. Define the dual-deploy shape and naming. +- Pick the shadow subdomain and document its purpose. +- Decide whether Cloudflare shadow traffic uses a route or `custom_domain`. 
+- Decide the first dual-deploy branch policy: +- deploy both on `master` +- or deploy Cloudflare only on a dedicated branch/tag until parity work stabilizes +- Define success criteria for “Cloudflare shadow is viable” before any production cutover discussion. + +2. Split Astro configuration into host-specific build targets. +- Keep the existing Netlify config as the current baseline. +- Add a Cloudflare-specific Astro config file that swaps `@astrojs/netlify` for `@astrojs/cloudflare`. +- Add build scripts that make the host target explicit, for example: +- `build:netlify` +- `build:cloudflare` +- `preview:cloudflare` +- Keep shared Vite/integration logic in shared modules so the configs differ only where platform behavior truly differs. + +3. Define the environment-variable contract for the Cloudflare shadow. +- Inventory each current env key by purpose: +- public site URL values +- external data-source URLs +- tokens/secrets +- build-only values +- runtime values +- Decide which values will live in Wrangler `vars` versus Wrangler secrets. +- For the first local spike, allow Wrangler to load from the existing root `.env` if the config stays at the repo root and there is no `.dev.vars`. +- After the spike, decide and document the long-term local convention: +- continue using `.env` for shared local values +- or move Cloudflare-specific sensitive values to `.dev.vars` +- Add a checked-in env template artifact so Cloudflare setup is reproducible without copying the real `.env`. +- Ensure no secret-bearing values are printed by CI or committed into Wrangler config. + +4. Make the runtime path adapter-neutral enough for shadow deploy. +- Remove runtime dependency on `netlify.toml` from the request path. +- Extract redirects into a platform-neutral source of truth that both Netlify and Cloudflare can consume. +- Audit any assumptions that rely on Netlify SSR behavior, filesystem layout, or deploy-hook conventions. 
+- Confirm whether `helpers/url.js` and public runtime config are still correct under the Cloudflare adapter. + +5. Add a local Cloudflare development lane. +- Add Wrangler config and local run instructions. +- Add a local dev command that exercises the Cloudflare target with the current env contract. +- Verify that the app boots locally on the Cloudflare target using local env values without editing secrets into source control. +- Add binding/type generation if the Cloudflare adapter path needs `wrangler types`. + +6. Add CI dual deploy without cutting over traffic. +- Keep existing Netlify deploy flow intact. +- Add a Cloudflare build-and-deploy workflow from the same commit SHA. +- Deploy the Cloudflare build to the shadow subdomain. +- Keep the Cloudflare workflow isolated from public PR execution until secrets and permissions are hardened. +- Make the deploy output report the Netlify URL, Cloudflare shadow URL, commit SHA, and environment used. + +7. Add parity checks for the shadow environment. +- Run smoke checks against both Netlify and Cloudflare deployments. +- Verify: +- homepage +- one dynamic app page +- one formula page +- app-test page +- redirects +- static search assets +- Compare response behavior, major console/runtime errors, and critical page content between the two hosts. +- Log differences in a durable artifact so the cleanup work is visible before cutover. + +8. Decide the cutover gate. +- Define the minimum parity bar: +- stable deploys +- no runtime-only Cloudflare failures +- environment values mapped cleanly +- redirects equivalent +- key pages verified +- Only after that, decide whether to: +- switch DNS/subdomain roles +- proxy production traffic through Cloudflare +- or continue Netlify as origin while more data-layer migration work lands + +# Validation Gates + +- Local Cloudflare target boots successfully with local env values and no committed secrets. +- Both host-specific Astro builds complete from the same commit. 
+- Cloudflare shadow subdomain serves the app over the Cloudflare zone. +- Redirect behavior is equivalent on a representative sample of known redirects. +- Core routes render on both hosts: +- `/` +- one `/app/...` +- one `/formula/...` +- `/apple-silicon-app-test/` +- CI output for Cloudflare deploy contains no secret material. +- Netlify remains healthy throughout the shadow rollout. + +# Deliverables + +- A focused dual-deploy plan in `docs/plans/cloudflare-dual-deploy-shadow.md` +- A target naming decision for the shadow subdomain +- Host-specific Astro build configs and scripts plan +- An explicit Cloudflare env/secrets mapping plan based on current local keys +- A runtime-neutralization checklist for Netlify-specific request/config logic +- A parity verification checklist for Netlify vs Cloudflare + +# Risks And Open Questions + +- The current runtime read of `netlify.toml` is the biggest likely blocker for “works the same on both hosts.” +- Reusing the current root `.env` for the first Cloudflare spike is practical, but it may blur long-term ownership unless a dedicated Cloudflare local env convention is chosen quickly. +- Some current env keys may be build-time only or tied to Netlify/external APIs; they should not all be assumed to map 1:1 to Cloudflare runtime secrets. +- If the Cloudflare shadow deploy uses the same backend sources as Netlify, host parity is easier but data-layer migration remains deferred. +- If the Cloudflare deploy is allowed to drift from the Netlify build graph, the comparison loses value. +- The existing Cloudflare worker workflow currently prints secret-derived files to logs; this must be treated as a blocker to any secret-bearing Cloudflare app deploy workflow. 
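The parity checks in the rollout plan compare response behavior and redirects between the two hosts and log differences to a durable artifact. One way that comparison could be sketched in TypeScript is shown below. The names `RouteObservation`, `ParityDifference`, and `compareHosts` are hypothetical, and the sketch assumes the observations were already collected by a separate smoke-check step, with redirect targets normalized to host-relative paths.

```typescript
// Hypothetical record of one smoke-check request against one host.
interface RouteObservation {
  path: string                 // e.g. '/' or '/apple-silicon-app-test/'
  status: number               // HTTP status code
  location?: string            // redirect target, assumed host-relative
  hasExpectedContent: boolean  // whether a known marker string was found
}

// One reportable difference between the two hosts for a given path.
interface ParityDifference {
  path: string
  field: 'missing' | 'status' | 'location' | 'content'
  netlify: string
  cloudflare: string
}

// Compare Netlify (the baseline) against the Cloudflare shadow, path by path.
function compareHosts(
  netlify: RouteObservation[],
  cloudflare: RouteObservation[]
): ParityDifference[] {
  const shadowByPath = new Map(cloudflare.map(o => [o.path, o] as const))
  const differences: ParityDifference[] = []

  for (const base of netlify) {
    const shadow = shadowByPath.get(base.path)
    if (!shadow) {
      differences.push({ path: base.path, field: 'missing', netlify: 'observed', cloudflare: 'not observed' })
      continue
    }
    if (base.status !== shadow.status) {
      differences.push({ path: base.path, field: 'status', netlify: String(base.status), cloudflare: String(shadow.status) })
    }
    if ((base.location ?? '') !== (shadow.location ?? '')) {
      differences.push({ path: base.path, field: 'location', netlify: base.location ?? '', cloudflare: shadow.location ?? '' })
    }
    if (base.hasExpectedContent !== shadow.hasExpectedContent) {
      differences.push({ path: base.path, field: 'content', netlify: String(base.hasExpectedContent), cloudflare: String(shadow.hasExpectedContent) })
    }
  }

  return differences
}
```

Serializing the returned array to JSON per deploy would give a reviewable artifact; an empty array would indicate parity on the sampled routes only, not across the whole site.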
+ +# Sources + +- `astro.config.mjs` +- `package.json` +- `.github/workflows/deploy-cloudflare-workers.yml` +- `.github/workflows/deploy-frontend.yaml` +- `.github/workflows/deploy-functions.yaml` +- `netlify.toml` +- `helpers/astro/request.js` +- `helpers/config-node.js` +- Cloudflare Wrangler configuration docs: + https://developers.cloudflare.com/workers/wrangler/configuration/ +- Cloudflare environments docs: + https://developers.cloudflare.com/workers/wrangler/environments/ +- Cloudflare local environment variables docs: + https://developers.cloudflare.com/workers/configuration/environment-variables/ +- Cloudflare custom domains docs: + https://developers.cloudflare.com/workers/configuration/routing/custom-domains +- Astro CLI docs: + https://docs.astro.build/en/reference/cli-reference/ +- Astro Cloudflare adapter docs: + https://docs.astro.build/en/guides/integrations-guide/cloudflare/ +- Astro Netlify adapter docs: + https://docs.astro.build/en/guides/integrations-guide/netlify/ diff --git a/docs/research/desktop-app-compatibility-data-strategy-2026-04-04.md b/docs/research/desktop-app-compatibility-data-strategy-2026-04-04.md new file mode 100644 index 0000000..0084a7e --- /dev/null +++ b/docs/research/desktop-app-compatibility-data-strategy-2026-04-04.md @@ -0,0 +1,600 @@ +# Desktop App Compatibility Data Strategy For doesitarm + +Tease: In 2026, the winning play is not "AI SEO." It is publishing a high-trust, machine-readable compatibility corpus that is genuinely useful on the open web and selectively keeping the operational and proprietary layers private. + +Lede: For `doesitarm` on 2026-04-04, the best-fit strategy is to treat the site as an entity-and-evidence graph for desktop software compatibility: publish canonical app pages, provenance-rich evidence, structured exports, and selective machine-readable surfaces for discovery; keep raw crawls, binary artifacts, candidate matches, scoring logic, and operational intelligence private. 
+ +Why it matters: +- This project can outlive the Apple Silicon transition if the core model is "desktop software compatibility knowledge," not just "Apple Silicon list posts." +- Google's 2025 AI search guidance still rewards the same fundamentals: unique content, crawlable pages, textual clarity, and trustworthy evidence, not special AI-only tricks. +- OpenAI and Anthropic now expose separate search, user-action, and training bots, which means "open versus closed" is no longer binary. You can choose visibility, training access, and operational exposure separately. + +Go deeper: +- Think of the public site as a citation layer and decision-support layer, not as the full warehouse. +- Publish public facts, provenance, timestamps, and curated exports. Keep raw ingestion, low-confidence candidates, and monetizable workflow intelligence private. +- Treat `llms.txt` and Markdown exports as helpful secondary surfaces, not as the core strategy. The core strategy is still clean HTML, canonical URLs, structured data, sitemaps, and useful pages. + +Date: 2026-04-04 + +## Scope + +Research how to think about a long-lived desktop app compatibility database as a +content, SEO, and AI-discoverability system in 2026, including: + +- best practices for public content architecture +- how LLM-driven discovery changes the picture +- what data should likely stay public versus private +- what audiences this data can serve +- tradeoffs between more-open and more-closed approaches + +## Short Answer + +Build `doesitarm` as a public knowledge product with a private operating system +underneath it. 
+ +Publicly, publish: + +- canonical app pages +- compatibility status by platform/environment +- evidence summaries and source links +- timestamps, changelogs, and history +- stable IDs, taxonomy, and machine-readable metadata +- a limited public API or snapshot exports for high-value reuse + +Privately, keep: + +- raw crawls and downloaded binaries +- candidate entities before review +- normalization, dedupe, and confidence logic +- crawler logs, abuse rules, and infrastructure controls +- enrichment that creates monetizable leverage rather than user value on the open web + +The biggest strategic shift from 2018 to 2026 is this: + +1. Search still rewards useful original pages. +2. AI discovery mostly rides on those same pages. +3. Separate crawler controls now let you be open for search while staying more closed for training. +4. The moat is less "having any compatibility data at all" and more: + verification quality, provenance, freshness, historical depth, and workflow speed. + +Inference: +No single source states that exact four-part conclusion. It is the synthesis that +best fits the repo state plus current Google, OpenAI, Anthropic, Cloudflare, +HN, and Lobsters evidence. + +## What The Repo Already Knows + +- The project already acts like a compatibility corpus, not just a blog: + [README.md](/Users/athena/Code/doesitarm/README.md) is a manually curated, + source-linked compatibility list. +- The repo already has a plan to move toward a canonical database and discovery + pipeline: + [docs/plans/app-discovery-d1-automation.md](/Users/athena/Code/doesitarm/docs/plans/app-discovery-d1-automation.md). +- The public site already exposes crawlable pages, a sitemap, and permissive + crawling: + [static/robots.txt](/Users/athena/Code/doesitarm/static/robots.txt), + [static/sitemap-index.xml](/Users/athena/Code/doesitarm/static/sitemap-index.xml). 
+- The current public JSON already exposes useful app-level fields such as name, + aliases, status, bundle IDs, related links, scan details, and device support: + [static/api/app/spotify.json](/Users/athena/Code/doesitarm/static/api/app/spotify.json). +- The current structured data implementation is narrow and video-centric: + [helpers/structured-data.js](/Users/athena/Code/doesitarm/helpers/structured-data.js), + [helpers/listing-page.js](/Users/athena/Code/doesitarm/helpers/listing-page.js), + [helpers/config-node.js](/Users/athena/Code/doesitarm/helpers/config-node.js). +- I did not find a checked-in `llms.txt`, `llms-full.txt`, or per-page Markdown + export surface. +- I also did not find `SoftwareApplication` or `Dataset` structured data on app + or dataset pages. + +Inference: +`doesitarm` already has enough public data shape to become a strong +machine-readable corpus. The main gap is not "inventing the dataset." The gap is +formalizing and publishing the right layers of it. + +## What The Evidence Says + +### 1. Google AI search still wants normal SEO fundamentals, not special AI tricks + +Google's current AI-features guidance says there are no extra technical +requirements for AI Overviews or AI Mode beyond normal Search eligibility. +Google explicitly says you do not need new AI files or special schema just to +appear in AI features. + +What does matter: + +- crawl access +- internal links +- page experience +- important content in textual form +- structured data matching visible text +- unique, non-commodity content + +This is the strongest argument against building an "AI discoverability" strategy +around gimmicks alone. + +### 2. 
Large-scale thin template pages are a real risk + +Google's helpful-content and spam-policy guidance is directly relevant to +programmatic compatibility sites: + +- people-first content is favored +- pages made mainly to attract search visits are a warning sign +- scaled content abuse includes generating many low-value pages, including from + feeds or automated transformations + +That means a compatibility database can absolutely win in search, but only if +its pages add decision-making value. Thin pages that just restate a status field +are dangerous. + +### 3. Compatibility content should look more like tested reviews than like directory filler + +Google's reviews guidance is a good proxy for compatibility pages because users +often arrive with a purchase, migration, or workflow decision in mind. + +The guidance consistently rewards: + +- original research +- first-hand evidence +- quantitative measurements where relevant +- comparisons +- what changed across versions +- benefits and drawbacks + +For `doesitarm`, that maps cleanly onto: + +- status by environment +- last verified date +- evidence links +- scanner output or screenshots where appropriate +- "what changed" changelog notes +- comparison pages like native vs Rosetta vs virtualization vs cloud workaround + +### 4. Dataset markup is useful, but it should describe real dataset landing pages + +Google's dataset documentation recommends canonical landing pages plus dataset +metadata such as `sameAs`, `isBasedOn`, identifiers, license, and download +distribution metadata. + +That is a strong fit for curated exports such as: + +- a public daily or weekly compatibility snapshot +- a historical archive by date +- vendor- or category-specific exports +- a Windows-on-ARM or future-transition slice later on + +Important nuance: +Google's dataset docs are about Dataset Search discovery, not a substitute for +general web SEO. Dataset markup helps when you actually publish datasets. + +### 5. 
`SoftwareApplication` markup fits the entity model, but Google rich-result requirements are narrower + +Schema.org's `SoftwareApplication` type supports fields that are very relevant +here, including: + +- `applicationCategory` +- `downloadUrl` +- `featureList` +- `operatingSystem` +- `softwareRequirements` +- `softwareVersion` +- `supportingData` + +Google also has a software-app structured-data feature, but its rich-result +requirements are more commerce-shaped, including `offers.price` and +review/rating support. That means: + +- use `SoftwareApplication` semantics where they match the visible page truth +- do not invent store-like fields just to chase rich results +- use dataset markup for exports and software/entity markup for canonical app + pages + +### 6. AI discoverability is now bot-by-bot, not one global yes/no + +OpenAI and Anthropic both now distinguish between different AI access modes. + +OpenAI: + +- `OAI-SearchBot` is for search inclusion +- `GPTBot` is for training +- `ChatGPT-User` is for user-triggered actions + +Anthropic: + +- `ClaudeBot` is for training +- `Claude-SearchBot` is for search quality +- `Claude-User` is for user-triggered retrieval + +This is strategically important. You no longer need to choose only between: + +- fully public for every AI purpose +- fully blocked for every AI purpose + +You can allow discovery while disallowing training, or allow search while +tightly managing user-action access, depending on your goals. + +### 7. `llms.txt` is real, but it is still a secondary signal + +Cloudflare has implemented `llms.txt`, `llms-full.txt`, and per-page Markdown +exports, and Simon Willison has highlighted similar docs-map patterns as useful +for agent tooling. 
+ +That said: + +- Google explicitly says no special AI text files are required for AI features +- OpenAI's discoverability guidance focuses on crawler access, `noindex`, and + citation/linking, not `llms.txt` +- HN and Lobsters discussions show real skepticism around AI crawler incentives + and how consistently emerging conventions are respected + +Best interpretation: + +- `llms.txt` is worth adding because it is cheap and increasingly recognized +- it should not be treated as the core lever +- the core lever is still strong public pages plus clean machine-readable + content + +### 8. AI-friendly plain-text and Markdown surfaces do have practical value + +Cloudflare's docs work here is the clearest practical example: + +- per-page Markdown versions +- an index file +- bulk text export +- semantic HTML +- `noindex` on low-value or confusing pages + +This is less about search ranking and more about: + +- making retrieval cheaper and more accurate for agents +- improving citation quality +- reducing token waste +- giving your own future agents and partners a stable ingest format + +For a compatibility corpus, that suggests public Markdown or JSON exports are +worth doing for the canonical facts layer. + +### 9. Freshness and URL discovery matter more as the corpus grows + +Google recommends sitemaps and Search Console monitoring. +IndexNow gives faster change pings for engines that support it, including Bing. + +For a frequently updated compatibility corpus, this argues for: + +- canonical landing pages +- clean sitemap generation +- changelog feeds or update streams +- optional IndexNow support for faster non-Google discovery + +### 10. The crawl environment is getting more adversarial + +Cloudflare Radar reported AI and search crawling growth of 18% from May 2024 to +May 2025 across its measured cohort, with `GPTBot` up 305%. 
+HN and Lobsters operator discussions show why this matters in practice: + +- some AI crawlers create real infrastructure cost +- incentives are less aligned than classic web search +- operators increasingly need bot-specific controls, rate limiting, and + selective exposure + +This is the best evidence for keeping raw and high-cost surfaces private even if +you lean more open on the public facts layer. + +## Ways This Data Can Create Value + +### Human audiences + +- End users deciding whether they can keep using a favorite app on new hardware. +- IT, procurement, and upgrade planners deciding when a transition is safe. +- Developers and vendors tracking native support gaps and competitive pressure. +- Journalists and analysts covering platform transitions. +- Researchers and historians studying how ecosystems adapt to hardware changes. + +### Machine audiences + +- Search engines indexing canonical app, category, and comparison pages. +- LLM search products citing your pages as evidence. +- RAG systems consuming public snapshots or APIs. +- Agents answering migration, procurement, or troubleshooting questions. +- Internal `doesitarm` automation using the same canonical public layer as a + stable reference surface. + +### Business-model value + +- Audience growth from high-intent compatibility queries. +- Affiliate or sponsored monetization on truly decision-support pages. +- Paid APIs, bulk exports, or enterprise dashboards. +- Vendor intelligence and alerting. +- Historical transition data as a differentiated research asset. + +Inference: +The public facts are likely to commoditize over time. The durable value is the +combination of breadth, freshness, provenance, history, and tooling layered on +top of those facts. 
+ +## What Should Likely Stay Public + +Public-by-default fields: + +- stable app identifier and canonical URL +- app name, aliases, vendor, category, platform family +- compatibility status by environment +- environment dimensions such as CPU architecture, OS family/version, native + vs translation vs virtualization +- bundle IDs and installer/package metadata where safe and user-useful +- last verified date, first seen date, last changed date +- public evidence summary and source links +- changelog summary for status changes +- category and comparison pages built from real user tasks +- curated JSON, CSV, or Parquet snapshot exports +- public structured data and sitemaps + +Public page types that seem high-value: + +- canonical app pages +- category pages +- "best alternatives if not native yet" pages +- transition pages such as "best native DAWs on Apple Silicon" +- comparison pages by use case, hardware generation, and workaround path +- dataset landing pages for bulk exports + +## What Should Likely Stay Private + +Private-by-default fields: + +- raw crawled HTML and downloaded ZIP/DMG/PKG artifacts +- extracted binaries and quarantine samples +- low-confidence matches and candidate entities +- dedupe, normalization, and scoring heuristics +- reviewer notes, moderation notes, and dispute state +- crawler logs, IP intelligence, WAF rules, and abuse signatures +- affiliate economics, contact records, outreach state, and deal terms +- internal confidence models, embeddings, and experimental feature engineering +- unpublished source mappings and scrape recipes that are costly to build + +Why keep these private: + +- operational risk +- legal and hosting risk +- abuse resistance +- clearer moat +- lower copyability + +## Different Ways To Think About The Database + +### 1. 
Directory / programmatic SEO system + +Upside: +- fastest traffic growth if executed well + +Downside: +- easiest to drift into thin pages and scaled-content abuse +- weakest long-term moat + +Use this frame only if every template answers a real question better than a +generic directory would. + +### 2. Public knowledge graph with evidence + +Upside: +- strongest fit for search, citations, and trust +- best long-term reuse across Apple, Windows, and future transitions + +Downside: +- requires stronger data modeling and provenance discipline + +This is the best framing for `doesitarm`. + +### 3. Public publication layer over a private intelligence system + +Upside: +- best balance of discoverability and defensibility +- easiest path to enterprise/API products later + +Downside: +- more operational complexity + +This is the recommended operating model. + +### 4. Mostly closed database with selective public summaries + +Upside: +- strongest direct control over assets + +Downside: +- weakest SEO and AI discoverability +- hardest to build brand authority from the data itself + +This makes sense only if monetization depends more on closed workflows than on +being the public authority. + +## Open Vs Closed Strategy Options + +## Option 1. Open facts, private operations + +Publish: + +- canonical pages +- evidence summaries +- limited exports +- structured data + +Keep private: + +- raw ingestion +- candidate pipeline +- scoring and ops + +Tradeoff: +Best overall balance of discoverability, trust, and defensibility. + +## Option 2. Open pages, paid API / paid bulk data + +Publish: + +- strong pages for discovery and citations +- free lightweight API or delayed snapshots + +Charge for: + +- real-time API +- higher limits +- historical depth +- enterprise filters and alerts + +Tradeoff: +Strong monetization path, but requires clearer product packaging. + +## Option 3. 
Fully open data commons + +Publish: + +- everything except unsafe raw binaries/secrets + +Tradeoff: +Maximum goodwill, citation, and reuse. +Minimum moat unless monetization shifts to services, sponsorship, or community +leadership. + +## Option 4. Selective access / crawler monetization layer + +Publish: + +- normal web pages + +Control: + +- which bots crawl +- whether training is allowed +- whether some crawlers must pay + +Tradeoff: +Promising middle path, especially as crawler monetization standards mature, but +still early and not something to build the whole strategy around yet. + +## Recommendation + +For `doesitarm`, use Option 1 now, with a path to Option 2 later. + +Concretely: + +1. Treat the database as transition-agnostic. + Use dimensions like `platform_family`, `cpu_arch`, `translation_layer`, + `virtualization_layer`, `os_version`, `artifact_type`, and + `verification_method` so the same model can cover Apple Silicon, Windows on + ARM, or the next Apple transition. + +2. Build a public canonical facts layer. + Each app should have a canonical page with: + status, environments, timestamps, evidence links, and short synthesis. + +3. Build a public dataset layer. + Publish periodic snapshots with dataset landing pages, license, provenance, + versioning, and download metadata. + +4. Keep ingestion and raw evidence private. + Store raw downloads, scrape traces, matching logic, and low-confidence + candidates outside the public repo and public site. + +5. Add public machine-readable surfaces in this order: + - `SoftwareApplication`-style entity markup where it truthfully matches page content + - dataset landing pages plus `Dataset` / `DataDownload` metadata for exports + - stable JSON or CSV snapshots + - `llms.txt` and Markdown exports as secondary aids + +6. Make public pages citation-friendly. + Add clear authorship, methodology, "how we know", last verified date, and + source links. + +7. Avoid index bloat. 
+ Keep canonical entity and high-intent comparison pages indexable. + Use `noindex` or crawl controls for low-value filter permutations and stale + or confusing pages. + +8. Measure before deciding how open to be. + Track: + - Search Console web traffic + - ChatGPT referral traffic via `utm_source=chatgpt.com` + - bot traffic by user agent + - crawl cost versus referral value + +Inference: +The best long-term moat is not withholding all facts. It is being the most +trusted and most reusable source for those facts, while keeping the expensive +and differentiating machinery private. + +## Near-Term Next Steps For doesitarm + +1. Add a public data-contract document describing the canonical app entity, + environment entity, evidence entity, and snapshot dataset. +2. Expand app pages from "status page" to "evidence page": + include methodology, last verified date, change history, and source + attribution. +3. Add structured data intentionally: + entity markup for app pages, dataset markup for exports, not generic markup + everywhere. +4. Add a public snapshot export and a dataset landing page. +5. Add a bot-policy matrix to `robots.txt` planning: + Google search, OpenAI search, Anthropic search, training bots, and user bots. +6. Add `llms.txt` only after the public canonical and export layers are clean. +7. Keep filters/search-result pages from becoming the primary indexable surface. 
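
The bot-policy matrix in step 5 could be kept as data and rendered into `robots.txt`, so the per-bot decisions stay reviewable in one place. The following is a minimal sketch, not the repo's implementation: the `BotPolicy` shape, the `renderRobotsTxt` helper, the allow/deny choices, and the `example.com` sitemap URL are all hypothetical. Only the user-agent names come from the OpenAI and Anthropic guidance summarized above.

```typescript
// Hypothetical bot-policy matrix. The allow/deny values are illustrative only;
// the real choices depend on the measurement step described in the memo.
type BotPurpose = "search" | "training" | "user-action";

interface BotPolicy {
  userAgent: string;
  purpose: BotPurpose;
  allow: boolean;
}

const botPolicies: BotPolicy[] = [
  { userAgent: "OAI-SearchBot", purpose: "search", allow: true },
  { userAgent: "GPTBot", purpose: "training", allow: false },
  { userAgent: "ChatGPT-User", purpose: "user-action", allow: true },
  { userAgent: "Claude-SearchBot", purpose: "search", allow: true },
  { userAgent: "ClaudeBot", purpose: "training", allow: false },
  { userAgent: "Claude-User", purpose: "user-action", allow: true },
];

// Render one robots.txt group per bot, then a permissive fallback group
// and a sitemap pointer. The sitemap URL is supplied by the caller.
function renderRobotsTxt(policies: BotPolicy[], sitemapUrl: string): string {
  const groups = policies.map((policy) =>
    [
      `# ${policy.purpose}`,
      `User-agent: ${policy.userAgent}`,
      policy.allow ? "Allow: /" : "Disallow: /",
    ].join("\n")
  );
  const fallback = ["User-agent: *", "Allow: /"].join("\n");
  return [...groups, fallback, `Sitemap: ${sitemapUrl}`].join("\n\n") + "\n";
}

// Placeholder sitemap URL (assumption, not the production domain).
const robotsTxt = renderRobotsTxt(
  botPolicies,
  "https://example.com/sitemap-index.xml"
);
```

Keeping the matrix as data means a change such as allowing a training bot becomes a one-line, reviewable diff rather than a hand edit to the generated file.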
+ +## Source Links + +- Repo context: + - [README.md](/Users/athena/Code/doesitarm/README.md) + - [docs/plans/app-discovery-d1-automation.md](/Users/athena/Code/doesitarm/docs/plans/app-discovery-d1-automation.md) + - [static/robots.txt](/Users/athena/Code/doesitarm/static/robots.txt) + - [helpers/structured-data.js](/Users/athena/Code/doesitarm/helpers/structured-data.js) + - [helpers/listing-page.js](/Users/athena/Code/doesitarm/helpers/listing-page.js) + - [helpers/config-node.js](/Users/athena/Code/doesitarm/helpers/config-node.js) + - [static/api/app/spotify.json](/Users/athena/Code/doesitarm/static/api/app/spotify.json) + +- Google AI features and AI search: + - https://developers.google.com/search/docs/appearance/ai-features + - https://developers.google.com/search/blog/2025/05/succeeding-in-ai-search + - https://developers.google.com/search/docs/fundamentals/creating-helpful-content + - https://developers.google.com/search/docs/essentials/spam-policies + - https://developers.google.com/search/docs/fundamentals/using-gen-ai-content + +- Google review and structured-data guidance: + - https://developers.google.com/search/docs/appearance/reviews-system + - https://developers.google.com/search/docs/specialty/ecommerce/write-high-quality-reviews + - https://developers.google.com/search/docs/appearance/structured-data/software-app + - https://developers.google.com/search/docs/appearance/structured-data/dataset + - https://developers.google.com/search/docs/appearance/structured-data/sd-policies + - https://developers.google.com/search/docs/crawling-indexing/crawling-managing-faceted-navigation + - https://developers.google.com/search/docs/crawling-indexing/block-indexing + +- Schema and dataset modeling: + - https://schema.org/SoftwareApplication + +- OpenAI: + - https://help.openai.com/en/articles/12627856-publishers-and-developers-faq + - https://help.openai.com/en/articles/9237897-chatgpt-search + - https://platform.openai.com/docs/gptbot + +- Anthropic: + - 
https://support.claude.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler + - https://docs.anthropic.com/en/docs/build-with-claude/search-results + +- Cloudflare: + - https://developers.cloudflare.com/style-guide/how-we-docs/ai-consumability/ + - https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/ + - https://blog.cloudflare.com/introducing-pay-per-crawl/ + +- Discovery and freshness: + - https://www.indexnow.org/index + +- Practitioner and discussion context: + - https://simonwillison.net/2025/Oct/24/claude-code-docs-map/ + - https://news.ycombinator.com/item?id=41072549 + - https://lobste.rs/s/dmuad3/mitigating_sourcehut_s_partial_outage + +## Source Quality Notes + +- Google Search Central, OpenAI, Anthropic, Schema.org, IndexNow, and Cloudflare + were the primary sources for current guidance. +- The HN and Lobsters links were useful for operator sentiment and failure modes, + not as primary authority for ranking behavior. +- `llms.txt` appears real and increasingly implemented, but the strongest + current evidence still says it is supplemental rather than foundational. diff --git a/docs/research/private-public-repo-sync-patterns-2026-04-04.md b/docs/research/private-public-repo-sync-patterns-2026-04-04.md new file mode 100644 index 0000000..da63dee --- /dev/null +++ b/docs/research/private-public-repo-sync-patterns-2026-04-04.md @@ -0,0 +1,222 @@ +# Private/Public Repo Sync Patterns For doesitarm + +Tease: Git can push to a second remote, but it does not natively maintain a safe long-lived branch that "merges everything except `docs/`." + +Lede: For `doesitarm` on 2026-04-04, the best-fit pattern is a second repo/remote plus an automated one-way export from `origin/master` that rewrites out `docs/` and force-pushes the result. + +Why it matters: +- `docs/` is already present on `origin/master`, so excluding it from a future mirror does not make it private retroactively. 
+- `docs/` does not appear to participate in the app build, which makes export-time exclusion low-risk. +- Cross-repo automation on GitHub needs separate auth; the default workflow token is not enough for another private repo. + +Go deeper: +- Simple one-way sync: fresh clone -> `git filter-repo --path docs/ --invert-paths` -> force-push to a second remote. +- Faster repeated projection: evaluate `splitsh-lite` if export speed or repeatability becomes important. +- Advanced bidirectional filtered views: `josh` is the serious option, but it is heavier than this repo likely needs. + +Date: 2026-04-04 + +## Scope + +Research whether `doesitarm` should support a `private-main`-style branch or +remote that automatically tracks the public default branch while excluding +paths such as `docs/`, and identify better patterns if they exist. + +## Short Answer + +Yes, you can push to a different remote from the public repo. + +No, the durable pattern is not a long-lived `private-main` merge branch that +keeps deleting `docs/` after every merge. Git branches and merges are full-tree +operations, and sparse checkout does not change that. + +For this repo, the cleanest pattern is: + +1. Keep `origin/master` as the source branch. +2. Add a second remote or second repository for the private target. +3. Run a one-way export job on each push to `master`. +4. In that job, create a sanitized tree/history with `docs/` removed. +5. Force-push the sanitized result to the private remote branch. + +If the real requirement is that future docs stay private, the better topology +is the reverse: keep the canonical repo private and generate the public export. + +Inference: +That last recommendation is based on the current repo state: `docs/` is already +committed on `origin/master`, so a private mirror without `docs/` only changes +future distribution, not past public exposure. + +## What The Repo Already Knows + +- The default remote today is only `origin`, pointed at the public GitHub repo. 
+- The default tracked branch is `origin/master`, not `main`. +- There is no checked-in `.github/` workflow that already handles cross-repo + sync. +- `docs/` currently contains repo-local planning and research material: + [docs/app-flow.md](/Users/athena/Code/doesitarm/docs/app-flow.md), + [docs/plans/app-test-typescript-refactor.md](/Users/athena/Code/doesitarm/docs/plans/app-test-typescript-refactor.md), + and dated research memos under + [docs/research](/Users/athena/Code/doesitarm/docs/research). +- `docs/` is already present in `origin/master` history as of 2026-04-04. +- The build surface in + [package.json](/Users/athena/Code/doesitarm/package.json) + does not appear to depend on `docs/`. + +## What The Evidence Says + +- Different remote from the public repo: + yes. Git supports separate remotes, and the `git remote` docs explicitly say + that when fetch and push use different locations, you should use two separate + remotes rather than pretending they are the same remote. +- Long-lived branch with path exclusions: + not a native Git capability. Git merges operate on full trees, not + "everything except these directories." +- Sparse checkout: + not the answer here. The `git sparse-checkout` docs describe it as a working + directory reduction feature, and they note that operations such as merge or + rebase may still materialize paths outside the sparse specification. +- `git filter-repo`: + good fit for one-way export. The Git project now recommends it instead of + `filter-branch`, and its docs support `--invert-paths` for "keep everything + except these paths" rewrites. That matches "mirror the repo except `docs/`." +- `splitsh-lite`: + promising when you want repeatable projections into standalone repos and care + about performance. Its README supports split prefixes that can include + exclusions and uses a history cache, which is more appropriate than a manual + merge branch when this becomes a repeated sync lane. +- `josh`: + the advanced option. 
Its repo describes a proxy Git server that exposes + filtered histories as standalone repos and synchronizes between original and + filtered views. This is the closest thing to a "real" selective mirror + system, but it adds operational weight. +- GitHub Actions auth: + the default `github.token` is scoped to the current repository. If a workflow + in the public repo needs to push to a different private repo, you need a PAT, + deploy key, or GitHub App token instead. + +## What Works + +- A second remote or second repository for the private target. +- A one-way generated branch or repo, not a hand-maintained merge branch. +- Rebuilding the private export from `origin/master` every run. +- Treating the private mirror as generated output with force-pushes allowed. +- Keeping development on one source branch and one source-of-truth repo. + +## What To Avoid + +- Do not maintain `private-main` by repeatedly merging `master` and deleting + `docs/`. That creates unnecessary churn and eventual conflict debt. +- Do not use sparse checkout as if it were a publishing filter. +- Do not make the generated private mirror a peer source of truth unless you + also adopt a projection system designed for bidirectional sync. +- Do not rely on the default GitHub Actions token for pushes to another repo. +- Do not assume this setup hides `docs/` historically; those files are already + in the public remote history. + +## Best Patterns + +## 1. Best Fit For This Repo: one-way export to a second remote + +Use a second remote, for example `private`, pointing at a separate private +GitHub repository. On each push to `origin/master`, run an automation that: + +1. checks out `master` +2. authenticates to the private repo with a PAT, deploy key, or GitHub App +3. creates a fresh export clone or export worktree +4. rewrites out `docs/` and any other excluded paths +5. 
force-pushes the sanitized result to the private repo branch + +Why this is the best fit: + +- it matches the repo's current single-source workflow +- it does not depend on path-aware merges that Git does not have +- it keeps excluded-path logic in one place +- it is easy to reason about and recover from + +Tradeoffs: + +- exported commit SHAs will differ from public `master` +- the private mirror should be treated as generated/read-only + +## 2. Better If Repeated Projection Becomes Core: `splitsh-lite` + +If you end up publishing multiple filtered mirrors or need fast repeated +updates, `splitsh-lite` is worth a spike. It is built for turning repository +views into standalone histories and caching the work. + +Tradeoffs: + +- more specialized operational knowledge +- less obvious to future maintainers than a simple export script + +## 3. Better Only For Advanced Bidirectional Partial Views: `josh` + +If the real requirement becomes "developers commit through filtered views and +changes synchronize both directions," `josh` is the pattern to study. + +Tradeoffs: + +- significant infrastructure/runtime overhead +- far more complexity than `doesitarm` appears to need today + +## 4. Adjacent But Not The Same Problem: GitHub Private Mirrors + +GitHub's `private-mirrors` app is relevant if the goal is to collaborate +privately around a public repository and upstream later. It is not the right +answer for "same repo minus `docs/`," but it is worth noting as a neighboring +pattern. + +## Recommendation + +For `doesitarm`, use a separate private repository plus a generated sync job. +Name the target branch after the actual default branch in this repo, for +example `private-master` or simply `master` on the private remote. + +Do not implement this as a merge branch. + +If the aim is just "same code, different remote, minus `docs/`," a generated +one-way mirror is the right level of machinery. 
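
The generated one-way mirror can be sketched as a small planner that emits the Git commands for one export run. This is an illustration under stated assumptions rather than a finished sync job: `planOneWayExport`, the `ExportOptions` shape, the repository URLs, and the scratch directory are hypothetical, and authentication to the private repo is left out. The `git filter-repo --path docs/ --invert-paths` rewrite is the one described in this memo.

```typescript
// Hypothetical options for a single export run.
interface ExportOptions {
  sourceUrl: string; // public source repo (placeholder URL below)
  privateUrl: string; // private target repo (placeholder URL below)
  branch: string; // the memo notes the default branch is `master`
  workDir: string; // scratch directory for the fresh clone
  excludedPaths: string[]; // paths to strip from the exported history
}

type Command = string[];

// Build the argv list for each step: fresh clone, history rewrite,
// add the private remote, force-push the sanitized branch.
function planOneWayExport(options: ExportOptions): Command[] {
  const { sourceUrl, privateUrl, branch, workDir, excludedPaths } = options;
  if (excludedPaths.length === 0) {
    throw new Error("Refusing to plan an export with nothing excluded");
  }

  const pathArgs: string[] = [];
  for (const excludedPath of excludedPaths) {
    pathArgs.push("--path", excludedPath);
  }

  return [
    ["git", "clone", "--single-branch", "--branch", branch, sourceUrl, workDir],
    // --invert-paths keeps everything except the listed paths.
    ["git", "-C", workDir, "filter-repo", ...pathArgs, "--invert-paths"],
    // Add the private target as a separately named remote.
    ["git", "-C", workDir, "remote", "add", "private", privateUrl],
    ["git", "-C", workDir, "push", "--force", "private", `${branch}:${branch}`],
  ];
}

// Placeholder URLs and directory (assumptions for illustration).
const exportPlan = planOneWayExport({
  sourceUrl: "https://github.com/example/public-repo.git",
  privateUrl: "git@github.com:example/private-mirror.git",
  branch: "master",
  workDir: "/tmp/export",
  excludedPaths: ["docs/"],
});
```

Returning the plan as data, instead of running the commands directly, lets a CI job log or dry-run the steps before anything is force-pushed. A runner would execute each entry in order and stop on the first failure.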
+ +If the aim is "keep future internal docs private," move the source of truth to +a private repo and generate the public mirror from that private origin. + +## Missing Information + +- Whether the private target is intended to be read-only/generated or whether + anyone will commit directly to it. +- Whether `docs/` is the only excluded path or just the first example. +- Whether the real goal is secrecy, deployment hygiene, or private-only + collaboration before publishing. + +## Source Links + +- Git remote docs: + https://git-scm.com/docs/git-remote +- Git sparse-checkout docs: + https://git-scm.com/docs/git-sparse-checkout +- `git-filter-repo` repository: + https://github.com/newren/git-filter-repo +- `git-filter-repo` manual: + https://www.mankier.com/1/git-filter-repo +- `splitsh-lite` repository: + https://github.com/splitsh/lite +- `josh` repository: + https://github.com/josh-project/josh +- `actions/checkout` README: + https://github.com/actions/checkout +- GitHub App auth in GitHub Actions: + https://docs.github.com/en/enterprise-cloud@latest/apps/creating-github-apps/authenticating-with-a-github-app/making-authenticated-api-requests-with-a-github-app-in-a-github-actions-workflow +- GitHub deploy keys: + https://docs.github.com/v3/guides/managing-deploy-keys +- GitHub Private Mirrors app: + https://github.com/github-community-projects/private-mirrors +- Stack Overflow, partial sharing of Git repositories: + https://stackoverflow.com/questions/278270/partial-sharing-of-git-repositories + +## Source Quality Notes + +- HN and Lobsters searches on 2026-04-04 did not surface a clearly better + mainstream pattern than the Git/GitHub docs plus the specialized projection + tools above. +- Primary docs and project READMEs were materially more useful than forum + commentary for this question. 
diff --git a/docs/research/public-repo-security-and-monorepo-patterns-2026-04-04.md b/docs/research/public-repo-security-and-monorepo-patterns-2026-04-04.md new file mode 100644 index 0000000..8bf5bf3 --- /dev/null +++ b/docs/research/public-repo-security-and-monorepo-patterns-2026-04-04.md @@ -0,0 +1,396 @@ +# Public Repo Security And Monorepo Patterns For doesitarm + +Tease: The safest version of this plan keeps `doesitarm` public, but treats credentials, imports, downloaded app artifacts, and privileged automation as private operational surfaces. + +Lede: For `doesitarm` on 2026-04-04, the best-fit pattern is a Kriasoft-style public monorepo with clear `apps/`, `packages/`, `db/`, and `infra/` boundaries, plus hardened GitHub Actions, GitHub-hosted runners for public workflows, D1 local development via Wrangler, and private storage for secrets, backups, and quarantined artifacts. + +Why it matters: +- The current repo is about to add higher-risk surfaces: D1, automated app discovery, archive downloading, scheduled jobs, and more Cloudflare automation. +- In a public repo, CI/CD mistakes matter as much as application code mistakes. Workflow files, tokens, logs, and runner choices become part of the threat model. +- The current repo already has one immediate security problem: a workflow prints secret-derived files to CI logs. + +Go deeper: +- Keep the code public; keep secrets, raw data, and operational state private. +- Refactor toward a monorepo shape early so new ingestion, scanner, D1, and infra code do not spread across a flat root. +- Adopt OSS-friendly GitHub hardening: read-only default `GITHUB_TOKEN`, pinned actions, CODEOWNERS on workflow/infra/db paths, secret scanning, private vulnerability reporting, and no self-hosted runners for public PRs. 
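
One way to make the "pinned actions" item checkable is a small lint over workflow text that flags any `uses:` reference not pinned to a full commit SHA. The sketch below is an assumption-laden illustration, not an existing tool: `findUnpinnedActions` and the sample workflow are hypothetical, it matches lines with a regular expression instead of parsing YAML, and it treats a 40-character hexadecimal ref as pinned.

```typescript
interface UnpinnedAction {
  line: number; // 1-based line number in the workflow text
  reference: string; // the `owner/repo@ref` value that was found
}

// Flag `uses:` references whose ref is not a 40-character commit SHA.
// This is a line-based heuristic, not a YAML parser.
function findUnpinnedActions(workflowYaml: string): UnpinnedAction[] {
  const findings: UnpinnedAction[] = [];
  const usesPattern = /^\s*-?\s*uses:\s*["']?([^\s"'#]+)/;
  const pinnedPattern = /@[0-9a-f]{40}$/;
  const lines = workflowYaml.split("\n");

  for (let index = 0; index < lines.length; index++) {
    const match = usesPattern.exec(lines[index]);
    if (match === null) continue;

    const reference = match[1];
    // Local actions and container images are outside this check.
    if (reference.indexOf("./") === 0 || reference.indexOf("docker://") === 0) {
      continue;
    }
    if (!pinnedPattern.test(reference)) {
      findings.push({ line: index + 1, reference });
    }
  }
  return findings;
}

// Hypothetical workflow snippet: one tag-pinned action, one SHA-pinned action.
const sampleWorkflow = [
  "permissions:",
  "  contents: read",
  "jobs:",
  "  build:",
  "    runs-on: ubuntu-latest",
  "    steps:",
  "      - uses: actions/checkout@v4",
  "      - uses: example-org/example-action@0123456789abcdef0123456789abcdef01234567",
].join("\n");

const unpinnedFindings = findUnpinnedActions(sampleWorkflow);
```

A check like this could run in CI against `.github/workflows/` so that a tag-pinned action is caught in review, before it reaches the default branch.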
+ +Date: 2026-04-04 + +## Scope + +Research security considerations and common open-source repository patterns for a +setup like `doesitarm`: + +- public GitHub repository +- Cloudflare Workers and D1 +- scheduled automation +- automated downloading and scanning of third-party app archives +- prospective monorepo refactor in the style of + `kriasoft/react-starter-kit` + +This memo is intended to drive updates to +[app-discovery-d1-automation.md](/Users/athena/Code/doesitarm/docs/plans/app-discovery-d1-automation.md). + +## Short Answer + +Do not move the whole repo private. + +Instead: + +1. Keep the application and infrastructure code public. +2. Move secrets, imported raw data, D1 operational state, downloaded artifacts, + quarantined samples, and any sensitive fixtures to private systems. +3. Refactor into a monorepo early, using a Kriasoft-style structure adapted to + this repo's existing pnpm/Netlify/Astro/Workers setup. +4. Harden GitHub Actions before expanding automation. + +Best-fit recommendation: + +- Public monorepo with `apps/`, `packages/`, `db/`, `infra/`, `scripts/`, + and `docs/` +- GitHub-hosted runners for public workflows +- GitHub environment secrets with required reviewers for production deploys +- Cloudflare D1 local development and tests via Wrangler `--local`, + `preview_database_id`, and test harnesses like `unstable_dev()`/Miniflare +- Private object storage or equivalent for raw app archives, import dumps, + and quarantine material + +Inference: +This is the right fit because the repo is open source and community-facing, but +the risky parts are operational, not architectural. Public code is compatible +with good security here; public credentials and public operational data are not. + +## What The Repo Already Knows + +- The repo is currently flat-rooted, not organized as a workspace monorepo. +- There is no checked-in D1 configuration or local D1 bootstrap yet. 
+- There is Cloudflare deployment automation in + [deploy-cloudflare-workers.yml](/Users/athena/Code/doesitarm/.github/workflows/deploy-cloudflare-workers.yml). +- That workflow currently decodes secret-backed `.env` / `wrangler.toml` files + and prints them with `cat`, which is a real security issue in CI logs. +- The site build still depends on remote/env-backed feeds such as + `SCANS_SOURCE`, `COMMITS_SOURCE`, `HOMEBREW_SOURCE`, `GAMES_SOURCE`, and + `VFUNCTIONS_URL`. +- The scanner and planned discovery pipeline will process untrusted third-party + files, including archive formats like ZIP, DMG, and PKG. +- `.env` is ignored at the root, and per-worker `wrangler.toml` files are + already ignored in worker subdirectories. + +## What The Evidence Says + +### 1. Public repos can stay public if the operational boundary is private + +GitHub's own docs assume public repositories will use: + +- repository or environment secrets +- restricted organization secret access +- private vulnerability reporting +- automatic secret scanning on public repos + +That is strong evidence that the normal pattern is not "make the repo private"; +it is "keep sensitive operational material out of the repo and out of logs." + +### 2. Default GitHub Actions posture should be least privilege + +GitHub recommends: + +- minimum required `GITHUB_TOKEN` permissions +- default repository token permission set to read-only +- escalating permissions only per job +- using a GitHub App token if a job needs more than `GITHUB_TOKEN` can provide + +This matches what open-source repos increasingly do for deploy, release, and +cross-repo automation. + +### 3. 
Secrets are still easy to leak through logs and workflow behavior + +GitHub's secure-use docs explicitly warn that: + +- redaction is not guaranteed for transformed values +- structured blobs like JSON/YAML are poor secret formats +- sensitive values that are not stored as GitHub secrets should be masked + explicitly with `::add-mask::` +- secrets exposed in logs should trigger log deletion and secret rotation + +For `doesitarm`, this directly applies to the current workflow that prints +secret-derived config files into CI output. + +### 4. Public repos should avoid self-hosted runners for untrusted PRs + +GitHub explicitly recommends self-hosted runners only with private +repositories, because forks of public repositories can run dangerous code on +them through pull requests. + +For this repo, that means: + +- do not put public PR workflows on a local machine or other long-lived + self-hosted runner +- do not run untrusted archive-processing jobs on a self-hosted runner that + also holds production credentials + +### 5. `pull_request_target` remains a common footgun + +GitHub Security Lab's `Preventing pwn requests` guidance is still the clearest +implementation reference: + +- `pull_request_target` plus checking out/building PR code is dangerous +- untrusted PR code should run in an unprivileged `pull_request` workflow +- privileged follow-up actions should happen through `workflow_run` with + carefully handled artifacts + +HN discussion around real workflow exploits reinforces the same point: the +problem is not theoretical. + +### 6.
Common OSS hardening patterns for GitHub workflows are now well-defined + +GitHub secure-use guidance and OpenSSF best-practice guidance converge on: + +- pin actions to full commit SHAs +- restrict allowed actions where possible +- guard `.github/workflows/` with `CODEOWNERS` +- keep the default branch protected +- require reviews and passing checks +- use code scanning / dependency review / secret scanning / Dependabot +- use private vulnerability reporting for public repos + +These are standard public-repo practices, not enterprise-only overkill. + +### 7. Cloudflare D1 already supports local-first development and tests + +Cloudflare's D1 docs explicitly cover: + +- `wrangler dev` local mode +- `preview_database_id` +- `wrangler d1 migrations apply --local` +- test setups using Miniflare and `unstable_dev()` + +That means D1 does not require a private repo or remote-only workflow. It fits +the "run locally on this machine, then automate" plan well. + +### 8. Cloudflare Workflows and observability make Cloudflare a credible later home for ingestion + +Cloudflare Workflows is now positioned as durable multi-step execution +with retries, persisted state, and debugging. Workers Logs and Traces provide +native observability. That is enough evidence to treat Cloudflare as a viable +later landing zone for scheduled ingestion and scan orchestration. + +Inference: +GitHub Actions is still the easier first scheduler because it is already in the +repo, but Cloudflare Workflows has matured enough to stay in the plan as a +serious later option. + +### 9. Kriasoft's monorepo shape is a good architectural fit, but not every exact convention should be copied blindly + +`kriasoft/react-starter-kit` is a public monorepo with: + +- `apps/` +- `packages/` +- `db/` +- `docs/` +- `infra/` +- `scripts/` + +It also documents a public template env pattern where committed `.env` +contains placeholders/defaults and `.env.local` contains real credentials.
+ +That shape is a strong fit for `doesitarm`, but I would adapt the env pattern +slightly for safety and clarity: + +- keep a committed public template file such as `.env.example` +- keep real credentials in `.env.local`, `.dev.vars`, GitHub environment + secrets, and Cloudflare secrets + +Inference: +Kriasoft's folder layout is the part worth copying directly. The exact env-file +naming should follow the least-confusing safe convention for this repo. + +## Common Open-Source Patterns That Fit doesitarm + +### Public code, private state + +Keep public: + +- app code +- scanner code +- D1 schema and migrations +- workflow definitions +- docs and plans + +Keep private: + +- deploy credentials and tokens +- raw Google Sheets exports or database backups +- downloaded app archives +- quarantine samples +- private test fixtures that would create redistribution or abuse risk +- operational dashboards and alert destinations + +### Workspace monorepo with clear trust boundaries + +Best-fit structure for `doesitarm`: + +- `apps/web/` — Astro site and app-test UI +- `apps/default-worker/` — current `doesitarm-default` +- `apps/analytics-worker/` — current `workers/analytics` +- `apps/ingest/` or `apps/discovery/` — CLI/admin surface for discovery jobs +- `packages/scanner-core/` — shared scan engine and file-format logic +- `packages/source-runners/` — Homebrew/GitHub/download-page source runners +- `packages/data-model/` — shared D1 schema types, DTOs, validation +- `packages/site-build/` — list/build/export helpers +- `db/` — D1 migrations, seeds, import scripts, local test DB helpers +- `infra/` — Wrangler config, deploy config, policy docs +- `scripts/` — repo automation +- `docs/` — plans, research, operational docs + +### Repo template files, not repo secrets + +Common OSS pattern: + +- commit `.env.example` or placeholder-only `.env` +- ignore `.env.local`, `.dev.vars`, and `.wrangler/` +- keep Cloudflare secrets in Workers secrets / GitHub environment secrets + +### 
Hardened GitHub Actions for public forks + +Common OSS pattern: + +- default `permissions: { contents: read }` +- explicit per-job escalation only +- require approval for fork PR workflows where appropriate +- no self-hosted runners for public PRs +- no `pull_request_target` workflows that check out or build PR code + +### Supply-chain hygiene for workflows + +Common OSS pattern: + +- pin actions to full SHAs +- restrict allowed actions +- Dependabot for action updates +- CodeQL / code scanning for workflow vulnerabilities +- OpenSSF Scorecards for ongoing hygiene checks + +### Disclosure and scanning defaults + +Common OSS pattern: + +- enable private vulnerability reporting +- enable secret scanning and push protection +- keep a `SECURITY.md` policy + +## What Works + +- Keeping the repo public while moving secrets and sensitive data out of git +- Refactoring to a monorepo before adding more D1/discovery complexity +- Treating workflow files, `infra/`, and `db/` as protected surfaces with + `CODEOWNERS` +- Using GitHub-hosted runners for public CI and scheduled jobs +- Using environment-specific secrets with required reviewers for production + deployment jobs +- Using D1 local mode and local migrations as part of normal development +- Using Workers Logs/Traces or equivalent observability for scheduled jobs +- Storing raw archives and quarantine material in private object storage rather + than in the repo + +## What To Avoid + +- Do not move the whole repo private as a substitute for secrets hygiene +- Do not keep the current workflow behavior that prints secret-derived files to + CI logs +- Do not use self-hosted runners for public PR workflows +- Do not run archive downloads/extraction in privileged workflows that also have + deploy credentials +- Do not combine `pull_request_target` with explicit PR checkout/build steps +- Do not keep adding discovery/D1/worker code into the current flat root +- Do not commit raw import dumps, app archives, or structured secret
blobs + +## Recommendation + +For `doesitarm`, the strongest next-step package is: + +1. Refactor toward a Kriasoft-style monorepo shape adapted to pnpm. +2. Add a security-hardening stage before expanding automation. +3. Keep the repo public. +4. Keep secrets, raw operational data, and archive/quarantine material private. +5. Start scheduled discovery on GitHub-hosted runners with hardened workflows. +6. Keep Cloudflare Workflows as a second-phase target for durable ingestion. + +Immediate high-priority actions to capture in the plan: + +1. Remove secret printing from + [deploy-cloudflare-workers.yml](/Users/athena/Code/doesitarm/.github/workflows/deploy-cloudflare-workers.yml) + and rotate affected secrets. +2. Add repo policy and tooling for: + - read-only default `GITHUB_TOKEN` + - pinned actions + - `CODEOWNERS` for `.github/workflows/`, `infra/`, and `db/` + - secret scanning / push protection + - private vulnerability reporting +3. Add ignored local-secret files for the new D1/Workers workflow: + - `.env.local` + - `.dev.vars` + - `.wrangler/` +4. Keep public PR CI on GitHub-hosted runners only. +5. Store raw archives/import snapshots outside the repo. + +## Missing Information + +- Whether the future ingestion runtime is expected to stay GitHub-first or + eventually move fully to Cloudflare Workers/Workflows. +- Whether there are legal or vendor-policy constraints around storing downloaded + app archives long term. +- Whether the monorepo refactor should keep Netlify as-is or consolidate more + runtime surfaces onto Cloudflare. 
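
As an illustration of the "pinned actions" policy item in the Recommendation above, the following is a minimal TypeScript sketch of a repo check that flags `uses:` references not pinned to a full 40-character commit SHA. The function name `findUnpinnedActions`, the line-based matching (instead of real YAML parsing), and the sample workflow text with its placeholder SHA are assumptions for illustration, not existing code in this repo.

```typescript
// Hypothetical helper (not part of the repo): lists `uses:` references in a
// workflow file that are not pinned to a full 40-character commit SHA.
const FULL_SHA = /^[0-9a-f]{40}$/;

function findUnpinnedActions(workflowYaml: string): string[] {
  const unpinned: string[] = [];
  for (const line of workflowYaml.split("\n")) {
    // Match `uses: owner/repo@ref` or `- uses: owner/repo@ref`.
    // The capture stops at whitespace, quotes, or a trailing `#` comment.
    const reference = line.match(/^\s*(?:-\s*)?uses:\s*["']?([^\s"'#]+)/)?.[1];
    if (!reference) continue;
    // Local actions and Docker images are not referenced by commit SHA.
    if (reference.startsWith("./") || reference.startsWith("docker://")) continue;
    const at = reference.lastIndexOf("@");
    const ref = at === -1 ? "" : reference.slice(at + 1);
    if (!FULL_SHA.test(ref)) unpinned.push(reference);
  }
  return unpinned;
}

// Sample input: the 40-character SHA below is a placeholder, not a real commit.
const sampleWorkflow = [
  "jobs:",
  "  build:",
  "    steps:",
  "      - uses: actions/checkout@v4",
  "      - uses: actions/setup-node@0123456789abcdef0123456789abcdef01234567 # v4",
  "      - uses: ./.github/actions/local",
].join("\n");

console.log(findUnpinnedActions(sampleWorkflow));
```

A check like this could run as an unprivileged step in public PR CI, since it only reads workflow text and needs no secrets. A real implementation would probably parse the YAML properly and also cover reusable workflow references.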
+ +## Source Links + +- GitHub Docs, `GITHUB_TOKEN` least-privilege and GitHub App escalation: + https://docs.github.com/en/actions/tutorials/authenticate-with-github_token +- GitHub Docs, secrets in Actions, fork-secret behavior, environment reviewers, + OIDC, and masking: + https://docs.github.com/en/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions +- GitHub Docs, secure use reference, pinning actions, CODEOWNERS, code scanning, + Dependabot, and Scorecards: + https://docs.github.com/en/actions/reference/security/secure-use +- GitHub Docs, self-hosted runner warning for public repositories: + https://docs.github.com/en/actions/how-tos/manage-runners/self-hosted-runners/add-runners +- GitHub Docs, limiting self-hosted runners in organizations: + https://docs.github.com/en/organizations/managing-organization-settings/disabling-or-limiting-github-actions-for-your-organization +- GitHub Docs, approval requirements for fork PR workflows: + https://docs.github.com/en/actions/managing-workflow-runs-and-deployments/managing-workflow-runs/approving-workflow-runs-from-public-forks +- GitHub Docs, repository Actions settings and fork workflow controls: + https://docs.github.com/github/administering-a-repository/managing-repository-settings/disabling-or-limiting-github-actions-for-a-repository +- GitHub Docs, secret scanning for public repositories: + https://docs.github.com/github/administering-a-repository/about-token-scanning +- GitHub Docs, enabling secret scanning / push protection: + https://docs.github.com/en/code-security/how-tos/secure-your-secrets/detect-secret-leaks/enabling-secret-scanning-for-your-repository +- GitHub Docs, enabling push protection: + https://docs.github.com/en/code-security/secret-scanning/enabling-secret-scanning-features/enabling-push-protection-for-your-repository +- GitHub Docs, private vulnerability reporting: + 
https://docs.github.com/en/code-security/security-advisories/working-with-repository-security-advisories/configuring-private-vulnerability-reporting-for-a-repository +- GitHub Security Lab, `pull_request_target` / `workflow_run` guidance: + https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/ +- OpenSSF GitHub configuration best practices: + https://best.openssf.org/SCM-BestPractices/github/ +- Kriasoft React Starter Kit: + https://github.com/kriasoft/react-starter-kit +- Cloudflare D1 local development: + https://developers.cloudflare.com/d1/best-practices/local-development/ +- Cloudflare Workers observability: + https://developers.cloudflare.com/workers/observability/ +- Cloudflare Workers logs: + https://developers.cloudflare.com/workers/observability/logs/ +- Cloudflare Workers traces: + https://developers.cloudflare.com/workers/observability/traces/ +- Cloudflare Workflows overview: + https://developers.cloudflare.com/workflows/ + +## Source Quality Notes + +- Highest-confidence sources in this memo are GitHub Docs, GitHub Security Lab, + OpenSSF, Cloudflare Docs, and the Kriasoft repository itself. +- HN/Lobsters did not surface a materially better competing pattern in this + pass; the most useful HN signal reinforced GitHub Security Lab's warning on + `pull_request_target`. +- The recommendation to keep the repo public but move operational data private + is a synthesis from official guidance plus this repo's current shape and risk + surface.