docs(plan): add discovery and deploy follow-up research

Capture the next discovery, security, compatibility-data, and dual-deploy planning work, and ignore local Vercel/env state that should not be committed. This keeps the operational research with the repo while avoiding accidental local-config churn.

Constraint: Must not alter production runtime behavior
Rejected: Fold research notes into the runtime fix commit | obscures the user-facing app-test correction with planning-only material
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep .omx local state untracked even when committing broad workspace updates
Tested: Document review only
Not-tested: No runtime verification required for docs and ignore rules
2026-05-18 06:44:46 -07:00 · 2026-04-04 15:38:39 -05:00 · 2026-04-04 15:38:39 -05:00 · 1248c705b0
commit 1248c705b0
parent e667ab564e
6 changed files with 1634 additions and 0 deletions
--- a/docs/research/desktop-app-compatibility-data-strategy-2026-04-04.md
+++ b/docs/research/desktop-app-compatibility-data-strategy-2026-04-04.md
@@ -0,0 +1,600 @@
+# Desktop App Compatibility Data Strategy For doesitarm
+
+Tease: In 2026, the winning play is not "AI SEO." It is publishing a high-trust, machine-readable compatibility corpus that is genuinely useful on the open web and selectively keeping the operational and proprietary layers private.
+
+Lede: For `doesitarm` on 2026-04-04, the best-fit strategy is to treat the site as an entity-and-evidence graph for desktop software compatibility: publish canonical app pages, provenance-rich evidence, structured exports, and selective machine-readable surfaces for discovery; keep raw crawls, binary artifacts, candidate matches, scoring logic, and operational intelligence private.
+
+Why it matters:
+- This project can outlive the Apple Silicon transition if the core model is "desktop software compatibility knowledge," not just "Apple Silicon list posts."
+- Google's 2025 AI search guidance still rewards the same fundamentals: unique content, crawlable pages, textual clarity, and trustworthy evidence, not special AI-only tricks.
+- OpenAI and Anthropic now expose separate search, user-action, and training bots, which means "open versus closed" is no longer binary. You can choose visibility, training access, and operational exposure separately.
+
+Go deeper:
+- Think of the public site as a citation layer and decision-support layer, not as the full warehouse.
+- Publish public facts, provenance, timestamps, and curated exports. Keep raw ingestion, low-confidence candidates, and monetizable workflow intelligence private.
+- Treat `llms.txt` and Markdown exports as helpful secondary surfaces, not as the core strategy. The core strategy is still clean HTML, canonical URLs, structured data, sitemaps, and useful pages.
+
+Date: 2026-04-04
+
+## Scope
+
+Research how to think about a long-lived desktop app compatibility database as a
+content, SEO, and AI-discoverability system in 2026, including:
+
+- best practices for public content architecture
+- how LLM-driven discovery changes the picture
+- what data should likely stay public versus private
+- what audiences this data can serve
+- tradeoffs between more-open and more-closed approaches
+
+## Short Answer
+
+Build `doesitarm` as a public knowledge product with a private operating system
+underneath it.
+
+Publicly, publish:
+
+- canonical app pages
+- compatibility status by platform/environment
+- evidence summaries and source links
+- timestamps, changelogs, and history
+- stable IDs, taxonomy, and machine-readable metadata
+- a limited public API or snapshot exports for high-value reuse
+
+Privately, keep:
+
+- raw crawls and downloaded binaries
+- candidate entities before review
+- normalization, dedupe, and confidence logic
+- crawler logs, abuse rules, and infrastructure controls
+- enrichment that creates monetizable leverage rather than user value on the open web
+
+The biggest strategic shift from 2018 to 2026 is this:
+
+1. Search still rewards useful original pages.
+2. AI discovery mostly rides on those same pages.
+3. Separate crawler controls now let you be open for search while staying more closed for training.
+4. The moat is less "having any compatibility data at all" and more:
+   verification quality, provenance, freshness, historical depth, and workflow speed.
+
+Inference:
+No single source states that exact four-part conclusion. It is the synthesis that
+best fits the repo state plus current Google, OpenAI, Anthropic, Cloudflare,
+HN, and Lobsters evidence.
+
+## What The Repo Already Knows
+
+- The project already acts like a compatibility corpus, not just a blog:
+  [README.md](/Users/athena/Code/doesitarm/README.md) is a manually curated,
+  source-linked compatibility list.
+- The repo already has a plan to move toward a canonical database and discovery
+  pipeline:
+  [docs/plans/app-discovery-d1-automation.md](/Users/athena/Code/doesitarm/docs/plans/app-discovery-d1-automation.md).
+- The public site already exposes crawlable pages, a sitemap, and permissive
+  crawling:
+  [static/robots.txt](/Users/athena/Code/doesitarm/static/robots.txt),
+  [static/sitemap-index.xml](/Users/athena/Code/doesitarm/static/sitemap-index.xml).
+- The current public JSON already exposes useful app-level fields such as name,
+  aliases, status, bundle IDs, related links, scan details, and device support:
+  [static/api/app/spotify.json](/Users/athena/Code/doesitarm/static/api/app/spotify.json).
+- The current structured data implementation is narrow and video-centric:
+  [helpers/structured-data.js](/Users/athena/Code/doesitarm/helpers/structured-data.js),
+  [helpers/listing-page.js](/Users/athena/Code/doesitarm/helpers/listing-page.js),
+  [helpers/config-node.js](/Users/athena/Code/doesitarm/helpers/config-node.js).
+- I did not find a checked-in `llms.txt`, `llms-full.txt`, or per-page Markdown
+  export surface.
+- I also did not find `SoftwareApplication` or `Dataset` structured data on app
+  or dataset pages.
+
+Inference:
+`doesitarm` already has enough public data shape to become a strong
+machine-readable corpus. The main gap is not "inventing the dataset." The gap is
+formalizing and publishing the right layers of it.
+
+## What The Evidence Says
+
+### 1. Google AI search still wants normal SEO fundamentals, not special AI tricks
+
+Google's current AI-features guidance says there are no extra technical
+requirements for AI Overviews or AI Mode beyond normal Search eligibility.
+Google explicitly says you do not need new AI files or special schema just to
+appear in AI features.
+
+What does matter:
+
+- crawl access
+- internal links
+- page experience
+- important content in textual form
+- structured data matching visible text
+- unique, non-commodity content
+
+This is the strongest argument against building an "AI discoverability" strategy
+around gimmicks alone.
+
+### 2. Large-scale thin template pages are a real risk
+
+Google's helpful-content and spam-policy guidance is directly relevant to
+programmatic compatibility sites:
+
+- people-first content is favored
+- pages made mainly to attract search visits are a warning sign
+- scaled content abuse includes generating many low-value pages, including from
+  feeds or automated transformations
+
+That means a compatibility database can absolutely win in search, but only if
+its pages add decision-making value. Thin pages that only restate a status
+field risk being treated as scaled, low-value content.
+
+### 3. Compatibility content should look more like tested reviews than like directory filler
+
+Google's reviews guidance is a good proxy for compatibility pages because users
+often arrive with a purchase, migration, or workflow decision in mind.
+
+The guidance consistently rewards:
+
+- original research
+- first-hand evidence
+- quantitative measurements where relevant
+- comparisons
+- what changed across versions
+- benefits and drawbacks
+
+For `doesitarm`, that maps cleanly onto:
+
+- status by environment
+- last verified date
+- evidence links
+- scanner output or screenshots where appropriate
+- "what changed" changelog notes
+- comparison pages like native vs Rosetta vs virtualization vs cloud workaround
+
+### 4. Dataset markup is useful, but it should describe real dataset landing pages
+
+Google's dataset documentation recommends canonical landing pages plus dataset
+metadata such as `sameAs`, `isBasedOn`, identifiers, license, and download
+distribution metadata.
+
+That is a strong fit for curated exports such as:
+
+- a public daily or weekly compatibility snapshot
+- a historical archive by date
+- vendor- or category-specific exports
+- a Windows-on-ARM or future-transition slice later on
+
+Important nuance:
+Google's dataset docs cover Dataset Search discovery; they are not a substitute
+for general web SEO. Dataset markup helps only when you actually publish
+datasets.
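+
+Sketch:
+A dataset landing page for one snapshot could carry markup along these lines.
+This is a minimal illustration, not the repo's implementation: the function
+name, the `example.com` URLs, the description text, and the license choice are
+all assumptions. Only the `Dataset` and `DataDownload` types and the property
+names come from schema.org.
+
+```python
+import json
+from datetime import date
+
+
+def build_dataset_jsonld(snapshot_date, landing_url, download_url, license_url):
+    """Build schema.org Dataset markup for one public snapshot landing page."""
+    return {
+        "@context": "https://schema.org",
+        "@type": "Dataset",
+        "name": f"doesitarm compatibility snapshot {snapshot_date.isoformat()}",
+        "description": "Curated desktop app compatibility statuses with evidence links.",
+        "url": landing_url,
+        "license": license_url,
+        "dateModified": snapshot_date.isoformat(),
+        # One DataDownload entry per published file format.
+        "distribution": [
+            {
+                "@type": "DataDownload",
+                "encodingFormat": "application/json",
+                "contentUrl": download_url,
+            }
+        ],
+    }
+
+
+# Hypothetical URLs and license, for illustration only.
+markup = build_dataset_jsonld(
+    date(2026, 4, 4),
+    "https://example.com/datasets/compatibility-snapshot",
+    "https://example.com/exports/compatibility-2026-04-04.json",
+    "https://creativecommons.org/licenses/by/4.0/",
+)
+print(json.dumps(markup, indent=2))
+```
+
+The output would be embedded in a `<script type="application/ld+json">` tag on
+the landing page itself, not on every app page.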
+
+### 5. `SoftwareApplication` markup fits the entity model, but Google rich-result requirements are narrower
+
+Schema.org's `SoftwareApplication` type supports fields that are very relevant
+here, including:
+
+- `applicationCategory`
+- `downloadUrl`
+- `featureList`
+- `operatingSystem`
+- `softwareRequirements`
+- `softwareVersion`
+- `supportingData`
+
+Google also has a software-app structured-data feature, but its rich-result
+requirements are more commerce-shaped, including `offers.price` and
+review/rating support. That means:
+
+- use `SoftwareApplication` semantics where they match the visible page truth
+- do not invent store-like fields just to chase rich results
+- use dataset markup for exports and software/entity markup for canonical app
+  pages
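+
+Sketch:
+One way to follow the "match the visible page truth" rule is to emit only the
+properties an app record actually has. The record keys (`category`,
+`operating_system`, and so on) and the example app are hypothetical; they are
+not the fields of the current public JSON. The schema.org property names are
+the ones listed above.
+
+```python
+import json
+
+# Hypothetical record key -> schema.org SoftwareApplication property.
+FIELD_MAP = {
+    "name": "name",
+    "category": "applicationCategory",
+    "operating_system": "operatingSystem",
+    "version": "softwareVersion",
+    "requirements": "softwareRequirements",
+    "download_url": "downloadUrl",
+}
+
+
+def build_app_jsonld(app):
+    """Map an app record to SoftwareApplication markup.
+
+    Missing or empty fields are skipped, and no offers or ratings are
+    invented, so the markup never claims more than the page shows.
+    """
+    markup = {"@context": "https://schema.org", "@type": "SoftwareApplication"}
+    for record_key, schema_key in FIELD_MAP.items():
+        value = app.get(record_key)
+        if value:
+            markup[schema_key] = value
+    return markup
+
+
+example = {
+    "name": "Example Editor",
+    "category": "MultimediaApplication",
+    "operating_system": "macOS",
+    "requirements": "Apple Silicon (arm64), native",
+}
+print(json.dumps(build_app_jsonld(example), indent=2))
+```
+
+Because the example record has no version or download URL, neither property
+appears in the output. That is the intended behavior: absent data stays absent.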
+
+### 6. AI discoverability is now bot-by-bot, not one global yes/no
+
+OpenAI and Anthropic both now distinguish between different AI access modes.
+
+OpenAI:
+
+- `OAI-SearchBot` is for search inclusion
+- `GPTBot` is for training
+- `ChatGPT-User` is for user-triggered actions
+
+Anthropic:
+
+- `ClaudeBot` is for training
+- `Claude-SearchBot` is for search quality
+- `Claude-User` is for user-triggered retrieval
+
+This is strategically important. You no longer need to choose only between:
+
+- fully public for every AI purpose
+- fully blocked for every AI purpose
+
+You can allow discovery while disallowing training, or allow search while
+tightly managing user-action access, depending on your goals.
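+
+Sketch:
+A bot-by-bot policy can be kept as data and rendered into `robots.txt`. The
+user-agent names are the ones listed above; the allow/deny choices, the
+function name, and the sitemap URL are assumptions for illustration, not a
+recommendation of a specific policy.
+
+```python
+# True = allow crawling, False = disallow. These choices are illustrative.
+BOT_POLICY = {
+    "OAI-SearchBot": True,      # OpenAI search inclusion
+    "ChatGPT-User": True,       # OpenAI user-triggered actions
+    "GPTBot": False,            # OpenAI training
+    "Claude-SearchBot": True,   # Anthropic search quality
+    "Claude-User": True,        # Anthropic user-triggered retrieval
+    "ClaudeBot": False,         # Anthropic training
+}
+
+
+def render_robots_txt(policy, sitemap_url=None):
+    """Render one robots.txt group per bot, then a permissive default group."""
+    groups = []
+    for user_agent, allowed in policy.items():
+        rule = "Allow: /" if allowed else "Disallow: /"
+        groups.append(f"User-agent: {user_agent}\n{rule}")
+    groups.append("User-agent: *\nAllow: /")
+    text = "\n\n".join(groups) + "\n"
+    if sitemap_url:
+        text += f"\nSitemap: {sitemap_url}\n"
+    return text
+
+
+print(render_robots_txt(BOT_POLICY, "https://example.com/sitemap-index.xml"))
+```
+
+`robots.txt` only expresses a preference. Crawlers that ignore it still need
+rate limiting or other controls at the infrastructure layer.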
+
+### 7. `llms.txt` is real, but it is still a secondary signal
+
+Cloudflare has implemented `llms.txt`, `llms-full.txt`, and per-page Markdown
+exports, and Simon Willison has highlighted similar docs-map patterns as useful
+for agent tooling.
+
+That said:
+
+- Google explicitly says no special AI text files are required for AI features
+- OpenAI's discoverability guidance focuses on crawler access, `noindex`, and
+  citation/linking, not `llms.txt`
+- HN and Lobsters discussions show real skepticism around AI crawler incentives
+  and how consistently emerging conventions are respected
+
+Best interpretation:
+
+- `llms.txt` is worth adding because it is cheap and increasingly recognized
+- it should not be treated as the core lever
+- the core lever is still strong public pages plus clean machine-readable
+  content
+
+### 8. AI-friendly plain-text and Markdown surfaces do have practical value
+
+Cloudflare's docs work here is the clearest practical example:
+
+- per-page Markdown versions
+- an index file
+- bulk text export
+- semantic HTML
+- `noindex` on low-value or confusing pages
+
+This is less about search ranking and more about:
+
+- making retrieval cheaper and more accurate for agents
+- improving citation quality
+- reducing token waste
+- giving your own future agents and partners a stable ingest format
+
+For a compatibility corpus, that suggests public Markdown or JSON exports are
+worth doing for the canonical facts layer.
+
+### 9. Freshness and URL discovery matter more as the corpus grows
+
+Google recommends sitemaps and Search Console monitoring.
+IndexNow gives faster change pings for engines that support it, including Bing.
+
+For a frequently updated compatibility corpus, this argues for:
+
+- canonical landing pages
+- clean sitemap generation
+- changelog feeds or update streams
+- optional IndexNow support for faster non-Google discovery
+
+### 10. The crawl environment is getting more adversarial
+
+Cloudflare Radar reported AI and search crawling growth of 18% from May 2024 to
+May 2025 across its measured cohort, with `GPTBot` up 305%.
+HN and Lobsters operator discussions show why this matters in practice:
+
+- some AI crawlers create real infrastructure cost
+- incentives are less aligned than classic web search
+- operators increasingly need bot-specific controls, rate limiting, and
+  selective exposure
+
+This is the best evidence for keeping raw and high-cost surfaces private even if
+you lean more open on the public facts layer.
+
+## Ways This Data Can Create Value
+
+### Human audiences
+
+- End users deciding whether they can keep using a favorite app on new hardware.
+- IT, procurement, and upgrade planners deciding when a transition is safe.
+- Developers and vendors tracking native support gaps and competitive pressure.
+- Journalists and analysts covering platform transitions.
+- Researchers and historians studying how ecosystems adapt to hardware changes.
+
+### Machine audiences
+
+- Search engines indexing canonical app, category, and comparison pages.
+- LLM search products citing your pages as evidence.
+- RAG systems consuming public snapshots or APIs.
+- Agents answering migration, procurement, or troubleshooting questions.
+- Internal `doesitarm` automation using the same canonical public layer as a
+  stable reference surface.
+
+### Business-model value
+
+- Audience growth from high-intent compatibility queries.
+- Affiliate or sponsored monetization on truly decision-support pages.
+- Paid APIs, bulk exports, or enterprise dashboards.
+- Vendor intelligence and alerting.
+- Historical transition data as a differentiated research asset.
+
+Inference:
+The public facts are likely to commoditize over time. The durable value is the
+combination of breadth, freshness, provenance, history, and tooling layered on
+top of those facts.
+
+## What Should Likely Stay Public
+
+Public-by-default fields:
+
+- stable app identifier and canonical URL
+- app name, aliases, vendor, category, platform family
+- compatibility status by environment
+- environment dimensions such as CPU architecture, OS family/version, native
+  vs translation vs virtualization
+- bundle IDs and installer/package metadata where safe and user-useful
+- last verified date, first seen date, last changed date
+- public evidence summary and source links
+- changelog summary for status changes
+- category and comparison pages built from real user tasks
+- curated JSON, CSV, or Parquet snapshot exports
+- public structured data and sitemaps
+
+Public page types that seem high-value:
+
+- canonical app pages
+- category pages
+- "best alternatives if not native yet" pages
+- transition pages such as "best native DAWs on Apple Silicon"
+- comparison pages by use case, hardware generation, and workaround path
+- dataset landing pages for bulk exports
+
+## What Should Likely Stay Private
+
+Private-by-default fields:
+
+- raw crawled HTML and downloaded ZIP/DMG/PKG artifacts
+- extracted binaries and quarantine samples
+- low-confidence matches and candidate entities
+- dedupe, normalization, and scoring heuristics
+- reviewer notes, moderation notes, and dispute state
+- crawler logs, IP intelligence, WAF rules, and abuse signatures
+- affiliate economics, contact records, outreach state, and deal terms
+- internal confidence models, embeddings, and experimental feature engineering
+- unpublished source mappings and scrape recipes that are costly to build
+
+Why keep these private:
+
+- operational risk
+- legal and hosting risk
+- abuse resistance
+- clearer moat
+- lower copyability
+
+## Different Ways To Think About The Database
+
+### 1. Directory / programmatic SEO system
+
+Upside:
+- fastest traffic growth if executed well
+
+Downside:
+- easiest to drift into thin pages and scaled-content abuse
+- weakest long-term moat
+
+Use this frame only if every template answers a real question better than a
+generic directory would.
+
+### 2. Public knowledge graph with evidence
+
+Upside:
+- strongest fit for search, citations, and trust
+- best long-term reuse across Apple, Windows, and future transitions
+
+Downside:
+- requires stronger data modeling and provenance discipline
+
+This is the best framing for `doesitarm`.
+
+### 3. Public publication layer over a private intelligence system
+
+Upside:
+- best balance of discoverability and defensibility
+- easiest path to enterprise/API products later
+
+Downside:
+- more operational complexity
+
+This is the recommended operating model.
+
+### 4. Mostly closed database with selective public summaries
+
+Upside:
+- strongest direct control over assets
+
+Downside:
+- weakest SEO and AI discoverability
+- hardest to build brand authority from the data itself
+
+This makes sense only if monetization depends more on closed workflows than on
+being the public authority.
+
+## Open Vs Closed Strategy Options
+
+### Option 1. Open facts, private operations
+
+Publish:
+
+- canonical pages
+- evidence summaries
+- limited exports
+- structured data
+
+Keep private:
+
+- raw ingestion
+- candidate pipeline
+- scoring and ops
+
+Tradeoff:
+Best overall balance of discoverability, trust, and defensibility.
+
+### Option 2. Open pages, paid API / paid bulk data
+
+Publish:
+
+- strong pages for discovery and citations
+- free lightweight API or delayed snapshots
+
+Charge for:
+
+- real-time API
+- higher limits
+- historical depth
+- enterprise filters and alerts
+
+Tradeoff:
+Strong monetization path, but requires clearer product packaging.
+
+### Option 3. Fully open data commons
+
+Publish:
+
+- everything except unsafe raw binaries/secrets
+
+Tradeoff:
+Maximum goodwill, citation, and reuse.
+Minimum moat unless monetization shifts to services, sponsorship, or community
+leadership.
+
+### Option 4. Selective access / crawler monetization layer
+
+Publish:
+
+- normal web pages
+
+Control:
+
+- which bots crawl
+- whether training is allowed
+- whether some crawlers must pay
+
+Tradeoff:
+Promising middle path, especially as crawler monetization standards mature, but
+still early and not something to build the whole strategy around yet.
+
+## Recommendation
+
+For `doesitarm`, use Option 1 now, with a path to Option 2 later.
+
+Concretely:
+
+1. Treat the database as transition-agnostic.
+   Use dimensions like `platform_family`, `cpu_arch`, `translation_layer`,
+   `virtualization_layer`, `os_version`, `artifact_type`, and
+   `verification_method` so the same model can cover Apple Silicon, Windows on
+   ARM, or the next Apple transition.
+
+2. Build a public canonical facts layer.
+   Each app should have a canonical page with:
+   status, environments, timestamps, evidence links, and short synthesis.
+
+3. Build a public dataset layer.
+   Publish periodic snapshots with dataset landing pages, license, provenance,
+   versioning, and download metadata.
+
+4. Keep ingestion and raw evidence private.
+   Store raw downloads, scrape traces, matching logic, and low-confidence
+   candidates outside the public repo and public site.
+
+5. Add public machine-readable surfaces in this order:
+   - `SoftwareApplication`-style entity markup where it truthfully matches page content
+   - dataset landing pages plus `Dataset` / `DataDownload` metadata for exports
+   - stable JSON or CSV snapshots
+   - `llms.txt` and Markdown exports as secondary aids
+
+6. Make public pages citation-friendly.
+   Add clear authorship, methodology, "how we know", last verified date, and
+   source links.
+
+7. Avoid index bloat.
+   Keep canonical entity and high-intent comparison pages indexable.
+   Use `noindex` or crawl controls for low-value filter permutations and stale
+   or confusing pages.
+
+8. Measure before deciding how open to be.
+   Track:
+   - Search Console web traffic
+   - ChatGPT referral traffic via `utm_source=chatgpt.com`
+   - bot traffic by user agent
+   - crawl cost versus referral value
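+
+Sketch:
+The transition-agnostic dimensions named in step 1 could be modeled as one
+record per app per environment. The dimension names come from step 1. The
+class name, the `app_id`, `status`, and `last_verified` fields, and all example
+values are assumptions, not an agreed schema.
+
+```python
+from dataclasses import dataclass, asdict
+from typing import Optional
+
+
+@dataclass(frozen=True)
+class EnvironmentResult:
+    """One compatibility observation for one app in one environment."""
+
+    app_id: str
+    platform_family: str          # e.g. "macos", "windows"
+    cpu_arch: str                 # e.g. "arm64", "x86_64"
+    os_version: str
+    artifact_type: str            # e.g. "dmg", "pkg", "msi"
+    verification_method: str      # e.g. "binary-scan", "vendor-statement"
+    status: str                   # hypothetical values: "native", "translated", "unsupported"
+    translation_layer: Optional[str] = None       # e.g. "rosetta2"; None when running natively
+    virtualization_layer: Optional[str] = None
+    last_verified: Optional[str] = None           # ISO 8601 date
+
+
+# The same shape covers two different hardware transitions.
+mac_row = EnvironmentResult(
+    app_id="example-editor",
+    platform_family="macos",
+    cpu_arch="arm64",
+    os_version="15",
+    artifact_type="dmg",
+    verification_method="binary-scan",
+    status="native",
+    last_verified="2026-04-04",
+)
+windows_row = EnvironmentResult(
+    app_id="example-editor",
+    platform_family="windows",
+    cpu_arch="arm64",
+    os_version="11",
+    artifact_type="msi",
+    verification_method="vendor-statement",
+    status="translated",
+    translation_layer="prism",
+)
+print(asdict(mac_row))
+print(asdict(windows_row))
+```
+
+Keeping the transition-specific detail in field values rather than in field
+names is what lets one table serve Apple Silicon and Windows on ARM alike.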
+
+Inference:
+The best long-term moat is not withholding all facts. It is being the most
+trusted and most reusable source for those facts, while keeping the expensive
+and differentiating machinery private.
+
+## Near-Term Next Steps For doesitarm
+
+1. Add a public data-contract document describing the canonical app entity,
+   environment entity, evidence entity, and snapshot dataset.
+2. Expand app pages from "status page" to "evidence page":
+   include methodology, last verified date, change history, and source
+   attribution.
+3. Add structured data intentionally:
+   entity markup for app pages, dataset markup for exports, not generic markup
+   everywhere.
+4. Add a public snapshot export and a dataset landing page.
+5. Add a bot-policy matrix to `robots.txt` planning:
+   Google search, OpenAI search, Anthropic search, training bots, and user bots.
+6. Add `llms.txt` only after the public canonical and export layers are clean.
+7. Keep filters/search-result pages from becoming the primary indexable surface.
+
+## Source Links
+
+- Repo context:
+  - [README.md](/Users/athena/Code/doesitarm/README.md)
+  - [docs/plans/app-discovery-d1-automation.md](/Users/athena/Code/doesitarm/docs/plans/app-discovery-d1-automation.md)
+  - [static/robots.txt](/Users/athena/Code/doesitarm/static/robots.txt)
+  - [helpers/structured-data.js](/Users/athena/Code/doesitarm/helpers/structured-data.js)
+  - [helpers/listing-page.js](/Users/athena/Code/doesitarm/helpers/listing-page.js)
+  - [helpers/config-node.js](/Users/athena/Code/doesitarm/helpers/config-node.js)
+  - [static/api/app/spotify.json](/Users/athena/Code/doesitarm/static/api/app/spotify.json)
+
+- Google AI features and AI search:
+  - https://developers.google.com/search/docs/appearance/ai-features
+  - https://developers.google.com/search/blog/2025/05/succeeding-in-ai-search
+  - https://developers.google.com/search/docs/fundamentals/creating-helpful-content
+  - https://developers.google.com/search/docs/essentials/spam-policies
+  - https://developers.google.com/search/docs/fundamentals/using-gen-ai-content
+
+- Google review and structured-data guidance:
+  - https://developers.google.com/search/docs/appearance/reviews-system
+  - https://developers.google.com/search/docs/specialty/ecommerce/write-high-quality-reviews
+  - https://developers.google.com/search/docs/appearance/structured-data/software-app
+  - https://developers.google.com/search/docs/appearance/structured-data/dataset
+  - https://developers.google.com/search/docs/appearance/structured-data/sd-policies
+  - https://developers.google.com/search/docs/crawling-indexing/crawling-managing-faceted-navigation
+  - https://developers.google.com/search/docs/crawling-indexing/block-indexing
+
+- Schema and dataset modeling:
+  - https://schema.org/SoftwareApplication
+
+- OpenAI:
+  - https://help.openai.com/en/articles/12627856-publishers-and-developers-faq
+  - https://help.openai.com/en/articles/9237897-chatgpt-search
+  - https://platform.openai.com/docs/gptbot
+
+- Anthropic:
+  - https://support.claude.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
+  - https://docs.anthropic.com/en/docs/build-with-claude/search-results
+
+- Cloudflare:
+  - https://developers.cloudflare.com/style-guide/how-we-docs/ai-consumability/
+  - https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/
+  - https://blog.cloudflare.com/introducing-pay-per-crawl/
+
+- Discovery and freshness:
+  - https://www.indexnow.org/index
+
+- Practitioner and discussion context:
+  - https://simonwillison.net/2025/Oct/24/claude-code-docs-map/
+  - https://news.ycombinator.com/item?id=41072549
+  - https://lobste.rs/s/dmuad3/mitigating_sourcehut_s_partial_outage
+
+## Source Quality Notes
+
+- Google Search Central, OpenAI, Anthropic, Schema.org, IndexNow, and Cloudflare
+  were the primary sources for current guidance.
+- The HN and Lobsters links were useful for operator sentiment and failure modes,
+  not as primary authority for ranking behavior.
+- `llms.txt` appears real and increasingly implemented, but the strongest
+  current evidence still says it is supplemental rather than foundational.
--- a/docs/research/private-public-repo-sync-patterns-2026-04-04.md
+++ b/docs/research/private-public-repo-sync-patterns-2026-04-04.md
@@ -0,0 +1,222 @@
+# Private/Public Repo Sync Patterns For doesitarm
+
+Tease: Git can push to a second remote, but it does not natively maintain a safe long-lived branch that "merges everything except `docs/`."
+
+Lede: For `doesitarm` on 2026-04-04, the best-fit pattern is a second repo/remote plus an automated one-way export from `origin/master` that rewrites out `docs/` and force-pushes the result.
+
+Why it matters:
+- `docs/` is already present on `origin/master`, so excluding it from a future mirror does not make it private retroactively.
+- `docs/` does not appear to participate in the app build, which makes export-time exclusion low-risk.
+- Cross-repo automation on GitHub needs separate auth; the default workflow token is not enough for another private repo.
+
+Go deeper:
+- Simple one-way sync: fresh clone -> `git filter-repo --path docs/ --invert-paths` -> force-push to a second remote.
+- Faster repeated projection: evaluate `splitsh-lite` if export speed or repeatability becomes important.
+- Advanced bidirectional filtered views: `josh` is the serious option, but it is heavier than this repo likely needs.
+
+Date: 2026-04-04
+
+## Scope
+
+Research whether `doesitarm` should support a `private-main`-style branch or
+remote that automatically tracks the public default branch while excluding
+paths such as `docs/`, and identify better patterns if they exist.
+
+## Short Answer
+
+Yes, you can push to a different remote from the public repo.
+
+No, the durable pattern is not a long-lived `private-main` merge branch that
+keeps deleting `docs/` after every merge. Git branches and merges are full-tree
+operations, and sparse checkout does not change that.
+
+For this repo, the cleanest pattern is:
+
+1. Keep `origin/master` as the source branch.
+2. Add a second remote or second repository for the private target.
+3. Run a one-way export job on each push to `master`.
+4. In that job, create a sanitized tree/history with `docs/` removed.
+5. Force-push the sanitized result to the private remote branch.
+
+If the real requirement is that future docs stay private, the better topology
+is the reverse: keep the canonical repo private and generate the public export.
+
+Inference:
+That last recommendation is based on the current repo state: `docs/` is already
+committed on `origin/master`, so a private mirror without `docs/` only changes
+future distribution, not past public exposure.
+
+## What The Repo Already Knows
+
+- The only remote today is `origin`, pointed at the public GitHub repo.
+- The default tracked branch is `origin/master`, not `main`.
+- There is no checked-in `.github/` workflow that already handles cross-repo
+  sync.
+- `docs/` currently contains repo-local planning and research material:
+  [docs/app-flow.md](/Users/athena/Code/doesitarm/docs/app-flow.md),
+  [docs/plans/app-test-typescript-refactor.md](/Users/athena/Code/doesitarm/docs/plans/app-test-typescript-refactor.md),
+  and dated research memos under
+  [docs/research](/Users/athena/Code/doesitarm/docs/research).
+- `docs/` is already present in `origin/master` history as of 2026-04-04.
+- The build surface in
+  [package.json](/Users/athena/Code/doesitarm/package.json)
+  does not appear to depend on `docs/`.
+
+## What The Evidence Says
+
+- Different remote from the public repo:
+  yes. Git supports separate remotes, and the `git remote` docs explicitly say
+  that when fetch and push use different locations, you should use two separate
+  remotes rather than pretending they are the same remote.
+- Long-lived branch with path exclusions:
+  not a native Git capability. Git merges operate on full trees, not
+  "everything except these directories."
+- Sparse checkout:
+  not the answer here. The `git sparse-checkout` docs describe it as a working
+  directory reduction feature, and they note that operations such as merge or
+  rebase may still materialize paths outside the sparse specification.
+- `git filter-repo`:
+  good fit for one-way export. The Git project now recommends it instead of
+  `filter-branch`, and its docs support `--invert-paths` for "keep everything
+  except these paths" rewrites. That matches "mirror the repo except `docs/`."
+- `splitsh-lite`:
+  promising when you want repeatable projections into standalone repos and care
+  about performance. Its README supports split prefixes that can include
+  exclusions and uses a history cache, which is more appropriate than a manual
+  merge branch when this becomes a repeated sync lane.
+- `josh`:
+  the advanced option. Its repo describes a proxy Git server that exposes
+  filtered histories as standalone repos and synchronizes between original and
+  filtered views. This is the closest thing to a "real" selective mirror
+  system, but it adds operational weight.
+- GitHub Actions auth:
+  the default `github.token` is scoped to the current repository. If a workflow
+  in the public repo needs to push to a different private repo, you need a PAT,
+  deploy key, or GitHub App token instead.
+
+## What Works
+
+- A second remote or second repository for the private target.
+- A one-way generated branch or repo, not a hand-maintained merge branch.
+- Rebuilding the private export from `origin/master` every run.
+- Treating the private mirror as generated output with force-pushes allowed.
+- Keeping development on one source branch and one source-of-truth repo.
+
+## What To Avoid
+
+- Do not maintain `private-main` by repeatedly merging `master` and deleting
+  `docs/`. That creates unnecessary churn and eventual conflict debt.
+- Do not use sparse checkout as if it were a publishing filter.
+- Do not make the generated private mirror a peer source of truth unless you
+  also adopt a projection system designed for bidirectional sync.
+- Do not rely on the default GitHub Actions token for pushes to another repo.
+- Do not assume this setup hides `docs/` historically; those files are already
+  in the public remote history.
+
+## Best Patterns
+
+### 1. Best Fit For This Repo: one-way export to a second remote
+
+Use a second remote, for example `private`, pointing at a separate private
+GitHub repository. On each push to `origin/master`, run an automation that:
+
+1. checks out `master`
+2. authenticates to the private repo with a PAT, deploy key, or GitHub App
+3. creates a fresh export clone or export worktree
+4. rewrites out `docs/` and any other excluded paths
+5. force-pushes the sanitized result to the private repo branch
+
+Why this is the best fit:
+
+- it matches the repo's current single-source workflow
+- it does not depend on path-aware merges that Git does not have
+- it keeps excluded-path logic in one place
+- it is easy to reason about and recover from
+
+Tradeoffs:
+
+- exported commit SHAs will differ from public `master`
+- the private mirror should be treated as generated/read-only
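+
+Sketch:
+The export steps above could be expressed as a short command plan. This
+assumes `git-filter-repo` is installed and that authentication (step 2) is
+already configured in the environment. The function names, URLs, and working
+directory are hypothetical. The `--path` and `--invert-paths` flags are the
+ones cited from the `git filter-repo` docs.
+
+```python
+import shlex
+import subprocess
+
+
+def build_export_commands(source_url, private_url, workdir,
+                          branch="master", excluded=("docs/",)):
+    """Return the git commands for one export run, in execution order."""
+    # Drop every excluded path from the whole history of the fresh clone.
+    filter_cmd = ["git", "-C", workdir, "filter-repo", "--invert-paths"]
+    for path in excluded:
+        filter_cmd += ["--path", path]
+    return [
+        # A fresh clone each run, so the export is always rebuilt from source.
+        ["git", "clone", "--single-branch", "--branch", branch, source_url, workdir],
+        filter_cmd,
+        ["git", "-C", workdir, "remote", "add", "private", private_url],
+        # Rewritten history has new SHAs, so the push must be forced.
+        ["git", "-C", workdir, "push", "--force", "private", f"{branch}:{branch}"],
+    ]
+
+
+def run_export(commands):
+    """Execute the plan, stopping at the first failing command."""
+    for cmd in commands:
+        subprocess.run(cmd, check=True)
+
+
+# Hypothetical URLs and path; printed as a dry run rather than executed.
+plan = build_export_commands(
+    "https://example.com/public/doesitarm.git",
+    "git@example.com:private/doesitarm-export.git",
+    "/tmp/doesitarm-export",
+)
+for cmd in plan:
+    print(shlex.join(cmd))
+```
+
+Separating the plan from its execution keeps the excluded-path list in one
+place and makes the job easy to review before it is allowed to force-push.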
+
+### 2. Better If Repeated Projection Becomes Core: `splitsh-lite`
+
+If you end up publishing multiple filtered mirrors or need fast repeated
+updates, `splitsh-lite` is worth a spike. It is built for turning repository
+views into standalone histories and caching the work.
+
+Tradeoffs:
+
+- more specialized operational knowledge
+- less obvious to future maintainers than a simple export script
+
+### 3. Better Only For Advanced Bidirectional Partial Views: `josh`
+
+If the real requirement becomes "developers commit through filtered views and
+changes synchronize both directions," `josh` is the pattern to study.
+
+Tradeoffs:
+
+- significant infrastructure/runtime overhead
+- far more complexity than `doesitarm` appears to need today
+
+### 4. Adjacent But Not The Same Problem: GitHub Private Mirrors
+
+GitHub's `private-mirrors` app is relevant if the goal is to collaborate
+privately around a public repository and upstream later. It is not the right
+answer for "same repo minus `docs/`," but it is worth noting as a neighboring
+pattern.
+
+## Recommendation
+
+For `doesitarm`, use a separate private repository plus a generated sync job.
+Name the target branch after the actual default branch in this repo, for
+example `private-master` or simply `master` on the private remote.
+
+Do not implement this as a merge branch.
+
+If the aim is just "same code, different remote, minus `docs/`," a generated
+one-way mirror is the right level of machinery.
+
+If the aim is "keep future internal docs private," move the source of truth to
+a private repo and generate the public mirror from that private origin.
+
+## Missing Information
+
+- Whether the private target is intended to be read-only/generated or whether
+  anyone will commit directly to it.
+- Whether `docs/` is the only excluded path or just the first example.
+- Whether the real goal is secrecy, deployment hygiene, or private-only
+  collaboration before publishing.
+
+## Source Links
+
+- Git remote docs:
+  https://git-scm.com/docs/git-remote
+- Git sparse-checkout docs:
+  https://git-scm.com/docs/git-sparse-checkout
+- `git-filter-repo` repository:
+  https://github.com/newren/git-filter-repo
+- `git-filter-repo` manual:
+  https://www.mankier.com/1/git-filter-repo
+- `splitsh-lite` repository:
+  https://github.com/splitsh/lite
+- `josh` repository:
+  https://github.com/josh-project/josh
+- `actions/checkout` README:
+  https://github.com/actions/checkout
+- GitHub App auth in GitHub Actions:
+  https://docs.github.com/en/enterprise-cloud@latest/apps/creating-github-apps/authenticating-with-a-github-app/making-authenticated-api-requests-with-a-github-app-in-a-github-actions-workflow
+- GitHub deploy keys:
+  https://docs.github.com/v3/guides/managing-deploy-keys
+- GitHub Private Mirrors app:
+  https://github.com/github-community-projects/private-mirrors
+- Stack Overflow, partial sharing of Git repositories:
+  https://stackoverflow.com/questions/278270/partial-sharing-of-git-repositories
+
+## Source Quality Notes
+
+- HN and Lobsters searches on 2026-04-04 did not surface a clearly better
+  mainstream pattern than the Git/GitHub docs plus the specialized projection
+  tools above.
+- Primary docs and project READMEs were materially more useful than forum
+  commentary for this question.
--- a/docs/research/public-repo-security-and-monorepo-patterns-2026-04-04.md
+++ b/docs/research/public-repo-security-and-monorepo-patterns-2026-04-04.md
@ -0,0 +1,396 @@
+# Public Repo Security And Monorepo Patterns For doesitarm
+
+Tease: The safest version of this plan keeps `doesitarm` public, but treats credentials, imports, downloaded app artifacts, and privileged automation as private operational surfaces.
+
+Lede: For `doesitarm` on 2026-04-04, the best-fit pattern is a Kriasoft-style public monorepo with clear `apps/`, `packages/`, `db/`, and `infra/` boundaries, plus hardened GitHub Actions, GitHub-hosted runners for public workflows, D1 local development via Wrangler, and private storage for secrets, backups, and quarantined artifacts.
+
+Why it matters:
+- The current repo is about to add higher-risk surfaces: D1, automated app discovery, archive downloading, scheduled jobs, and more Cloudflare automation.
+- In a public repo, CI/CD mistakes matter as much as application code mistakes. Workflow files, tokens, logs, and runner choices become part of the threat model.
+- The current repo already has one immediate security problem: a workflow prints secret-derived files to CI logs.
+
+Go deeper:
+- Keep the code public; keep secrets, raw data, and operational state private.
+- Refactor toward a monorepo shape early so new ingestion, scanner, D1, and infra code do not spread across a flat root.
+- Adopt OSS-friendly GitHub hardening: read-only default `GITHUB_TOKEN`, pinned actions, CODEOWNERS on workflow/infra/db paths, secret scanning, private vulnerability reporting, and no self-hosted runners for public PRs.
+
+Date: 2026-04-04
+
+## Scope
+
+Research security considerations and common open-source repository patterns for a
+setup like `doesitarm`:
+
+- public GitHub repository
+- Cloudflare Workers and D1
+- scheduled automation
+- automated downloading and scanning of third-party app archives
+- prospective monorepo refactor in the style of
+  `kriasoft/react-starter-kit`
+
+This memo is intended to drive updates to
+[app-discovery-d1-automation.md](../plans/app-discovery-d1-automation.md).
+
+## Short Answer
+
+Do not move the whole repo private.
+
+Instead:
+
+1. Keep the application and infrastructure code public.
+2. Move secrets, imported raw data, D1 operational state, downloaded artifacts,
+   quarantined samples, and any sensitive fixtures to private systems.
+3. Refactor into a monorepo early, using a Kriasoft-style structure adapted to
+   this repo's existing pnpm/Netlify/Astro/Workers setup.
+4. Harden GitHub Actions before expanding automation.
+
+Best-fit recommendation:
+
+- Public monorepo with `apps/`, `packages/`, `db/`, `infra/`, `scripts/`,
+  and `docs/`
+- GitHub-hosted runners for public workflows
+- GitHub environment secrets with required reviewers for production deploys
+- Cloudflare D1 local development and tests via Wrangler `--local`,
+  `preview_database_id`, and test harnesses like `unstable_dev()`/Miniflare
+- Private object storage or equivalent for raw app archives, import dumps,
+  and quarantine material
+
+Inference:
+This is the right fit because the repo is open source and community-facing, but
+the risky parts are operational, not architectural. Public code is compatible
+with good security here; public credentials and public operational data are not.
+
+## What The Repo Already Knows
+
+- The repo is currently flat-rooted, not organized as a workspace monorepo.
+- There is no checked-in D1 configuration or local D1 bootstrap yet.
+- There is Cloudflare deployment automation in
+  [deploy-cloudflare-workers.yml](../../.github/workflows/deploy-cloudflare-workers.yml).
+- That workflow currently decodes secret-backed `.env` / `wrangler.toml` files
+  and prints them with `cat`, which is a real security issue in CI logs.
+- The site build still depends on remote/env-backed feeds such as
+  `SCANS_SOURCE`, `COMMITS_SOURCE`, `HOMEBREW_SOURCE`, `GAMES_SOURCE`, and
+  `VFUNCTIONS_URL`.
+- The scanner and planned discovery pipeline will process untrusted third-party
+  files, including archive formats like ZIP, DMG, and PKG.
+- `.env` is ignored at the root, and per-worker `wrangler.toml` files are
+  already ignored in worker subdirectories.
+
+## What The Evidence Says
+
+### 1. Public repos can stay public if the operational boundary is private
+
+GitHub's own docs assume public repositories will use:
+
+- repository or environment secrets
+- restricted organization secret access
+- private vulnerability reporting
+- automatic secret scanning on public repos
+
+That is strong evidence that the normal pattern is not "make the repo private";
+it is "keep sensitive operational material out of the repo and out of logs."
+
+### 2. Default GitHub Actions posture should be least privilege
+
+GitHub recommends:
+
+- minimum required `GITHUB_TOKEN` permissions
+- default repository token permission set to read-only
+- escalating permissions only per job
+- using a GitHub App token if a job needs more than `GITHUB_TOKEN` can provide
+
+This matches what open-source repos increasingly do for deploy, release, and
+cross-repo automation.
+
+### 3. Secrets are still easy to leak through logs and workflow behavior
+
+GitHub's secure-use docs explicitly warn that:
+
+- redaction is not guaranteed for transformed values
+- structured blobs like JSON/YAML are poor secret formats
+- sensitive values that are not stored as GitHub secrets should be masked
+  explicitly with `::add-mask::`
+- exposed secrets in logs should trigger deletion/rotation
+
+For `doesitarm`, this directly applies to the current workflow that prints
+secret-derived config files into CI output.
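As a rough illustration of the masking point, a step that derives a sensitive value at runtime could register it with the runner before anything else is logged. The helper name below is hypothetical. The sketch masks a multi-line value line by line, on the assumption that the runner matches masks per line. It complements, and does not replace, removing the `cat` of secret-derived files.

```python
from typing import List


def mask_commands(value: str) -> List[str]:
    """Build one ::add-mask:: workflow command per non-empty line of a value."""
    return [
        f"::add-mask::{line}"
        for line in value.splitlines()
        if line.strip()
    ]


def register_masks(value: str) -> None:
    """Print the mask commands so the Actions runner picks them up from stdout."""
    for command in mask_commands(value):
        print(command)
```

A derived value (for example, a token decoded from a base64 blob) would be passed to `register_masks` immediately after it is computed and before any step that might echo it.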
+
+### 4. Public repos should avoid self-hosted runners for untrusted PRs
+
+GitHub explicitly recommends using self-hosted runners only with private
+repositories, because forks of public repositories can run dangerous code on
+them through pull requests.
+
+For this repo, that means:
+
+- do not put public PR workflows on a local machine or other long-lived
+  self-hosted runner
+- do not run untrusted archive-processing jobs on a self-hosted runner that
+  also holds production credentials
+
+### 5. `pull_request_target` remains a common footgun
+
+GitHub Security Lab's `Preventing pwn requests` guidance is still the clearest
+implementation reference:
+
+- `pull_request_target` plus checking out/building PR code is dangerous
+- untrusted PR code should run in an unprivileged `pull_request` workflow
+- privileged follow-up actions should happen through `workflow_run` with
+  carefully handled artifacts
+
+HN discussion around real workflow exploits reinforces the same point: the
+problem is not theoretical.
+
+### 6. Common OSS hardening patterns for GitHub workflows are now well-defined
+
+GitHub secure-use guidance and OpenSSF best-practice guidance converge on:
+
+- pin actions to full commit SHAs
+- restrict allowed actions where possible
+- guard `.github/workflows/` with `CODEOWNERS`
+- keep default branch protected
+- require reviews and passing checks
+- use code scanning / dependency review / secret scanning / Dependabot
+- use private vulnerability reporting for public repos
+
+These are standard public-repo practices, not enterprise-only overkill.
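The "pin actions to full commit SHAs" item is easy to check mechanically. The following is a minimal sketch of such a check, assuming workflow files are read as plain text. The function name is hypothetical, and the line-based regex is a heuristic rather than a real YAML parse, so it will miss unusual formatting.

```python
import re
from typing import List

# Matches `uses: owner/repo@ref` with an optional list dash and quotes.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^'\"\s#]+)")
# A full commit SHA is 40 lowercase hex characters.
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> List[str]:
    """Return `uses:` targets whose ref is not a full 40-character commit SHA."""
    findings = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        target = match.group(1)
        # Local actions and Docker images are outside the scope of this check.
        if target.startswith("./") or target.startswith("docker://"):
            continue
        _, _, ref = target.partition("@")
        if not SHA_RE.match(ref):
            findings.append(target)
    return findings
```

Run over every file in `.github/workflows/`, a non-empty result could fail CI, which turns the pinning policy into an enforced check rather than a convention.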
+
+### 7. Cloudflare D1 already supports local-first development and tests
+
+Cloudflare's D1 docs explicitly support:
+
+- `wrangler dev` local mode
+- `preview_database_id`
+- `wrangler d1 migrations apply --local`
+- test setups using Miniflare and `unstable_dev()`
+
+That means D1 does not require a private repo or remote-only workflow. It fits
+the "run locally on this machine, then automate" plan well.
+
+### 8. Cloudflare Workflows and observability make Cloudflare a credible later home for ingestion
+
+Cloudflare Workflows now positions itself as durable multi-step execution
+with retries, persisted state, and debugging. Workers Logs and Traces provide
+native observability. That is enough evidence to treat Cloudflare as a viable
+later landing zone for scheduled ingestion and scan orchestration.
+
+Inference:
+GitHub Actions is still the easier first scheduler because it is already in the
+repo, but Cloudflare Workflows has matured enough to stay in the plan as a
+serious later option.
+
+### 9. Kriasoft's monorepo shape is a good architectural fit, but not every exact convention should be copied blindly
+
+`kriasoft/react-starter-kit` is a public monorepo with:
+
+- `apps/`
+- `packages/`
+- `db/`
+- `docs/`
+- `infra/`
+- `scripts/`
+
+It also documents a public template env pattern where committed `.env`
+contains placeholders/defaults and `.env.local` contains real credentials.
+
+That shape is a strong fit for `doesitarm`, but I would adapt the env pattern
+slightly for safety and clarity:
+
+- keep a committed public template file such as `.env.example`
+- keep real credentials in `.env.local`, `.dev.vars`, GitHub environment
+  secrets, and Cloudflare secrets
+
+Inference:
+Kriasoft's folder layout is the part worth copying directly. The exact env-file
+naming should follow the least-confusing safe convention for this repo.
+
+## Common Open-Source Patterns That Fit doesitarm
+
+### Public code, private state
+
+Keep public:
+
+- app code
+- scanner code
+- D1 schema and migrations
+- workflow definitions
+- docs and plans
+
+Keep private:
+
+- deploy credentials and tokens
+- raw Google Sheets exports or database backups
+- downloaded app archives
+- quarantine samples
+- private test fixtures that would create redistribution or abuse risk
+- operational dashboards and alert destinations
+
+### Workspace monorepo with clear trust boundaries
+
+Best-fit structure for `doesitarm`:
+
+- `apps/web/` — Astro site and app-test UI
+- `apps/default-worker/` — current `doesitarm-default`
+- `apps/analytics-worker/` — current `workers/analytics`
+- `apps/ingest/` or `apps/discovery/` — CLI/admin surface for discovery jobs
+- `packages/scanner-core/` — shared scan engine and file-format logic
+- `packages/source-runners/` — Homebrew/GitHub/download-page source runners
+- `packages/data-model/` — shared D1 schema types, DTOs, validation
+- `packages/site-build/` — list/build/export helpers
+- `db/` — D1 migrations, seeds, import scripts, local test DB helpers
+- `infra/` — Wrangler config, deploy config, policy docs
+- `scripts/` — repo automation
+- `docs/` — plans, research, operational docs
+
+### Repo template files, not repo secrets
+
+Common OSS pattern:
+
+- commit `.env.example` or placeholder-only `.env`
+- ignore `.env.local`, `.dev.vars`, and `.wrangler/`
+- keep Cloudflare secrets in Workers secrets / GitHub environment secrets
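A small guard can confirm the ignore rules are actually present before local secrets start appearing on disk. This is a sketch under simplifying assumptions: the helper name is hypothetical, and it only looks for exact lines in `.gitignore` rather than implementing full gitignore pattern matching.

```python
from typing import List, Sequence

# Paths this memo says should stay untracked.
REQUIRED_IGNORES = (".env.local", ".dev.vars", ".wrangler/")


def missing_ignore_rules(
    gitignore_text: str,
    required: Sequence[str] = REQUIRED_IGNORES,
) -> List[str]:
    """Return required entries that have no exact matching line in .gitignore."""
    rules = {
        line.strip()
        for line in gitignore_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }
    # Accept a directory rule written with or without the trailing slash.
    return [
        entry
        for entry in required
        if entry not in rules and entry.rstrip("/") not in rules
    ]
```

Because broader patterns such as `.env*` would also cover these files, a reported "missing" entry is a prompt to look at the file, not proof of a gap.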
+
+### Hardened GitHub Actions for public forks
+
+Common OSS pattern:
+
+- default `permissions: { contents: read }`
+- explicit per-job escalation only
+- require approval for fork PR workflows where appropriate
+- no self-hosted runners for public PRs
+- no `pull_request_target` workflows that check out or build PR code
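Two of the items above lend themselves to a quick text-level review aid. The sketch below is a heuristic with a hypothetical function name, not a substitute for reading the workflow. It flags a missing top-level `permissions:` block, and any workflow that both triggers on `pull_request_target` and uses `actions/checkout`, which deserves manual review because it is only dangerous when the PR head is what gets checked out.

```python
import re
from typing import List


def workflow_risks(workflow_text: str) -> List[str]:
    """Return heuristic findings for one workflow file's text."""
    risks = []
    lines = workflow_text.splitlines()
    # re.match anchors at column 0, so only a top-level key counts here.
    if not any(re.match(r"permissions:", line) for line in lines):
        risks.append("no top-level permissions block")
    uses_target = any(
        re.search(r"\bpull_request_target\b", line)
        and not line.lstrip().startswith("#")
        for line in lines
    )
    checks_out = any(
        re.search(r"uses:\s*actions/checkout@", line) for line in lines
    )
    if uses_target and checks_out:
        risks.append("pull_request_target combined with actions/checkout")
    return risks
```

A workflow that passes this check can still be unsafe. The value is in catching the obvious cases early, before a human reviewer or `CODEOWNERS` gate sees the change.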
+
+### Supply-chain hygiene for workflows
+
+Common OSS pattern:
+
+- pin actions to full SHAs
+- restrict allowed actions
+- Dependabot for action updates
+- CodeQL / code scanning for workflow vulnerabilities
+- OpenSSF Scorecards for ongoing hygiene checks
+
+### Disclosure and scanning defaults
+
+Common OSS pattern:
+
+- enable private vulnerability reporting
+- enable secret scanning and push protection
+- keep a `SECURITY.md` policy
+
+## What Works
+
+- Keeping the repo public while moving secrets and sensitive data out of git
+- Refactoring to a monorepo before adding more D1/discovery complexity
+- Treating workflow files, `infra/`, and `db/` as protected surfaces with
+  `CODEOWNERS`
+- Using GitHub-hosted runners for public CI and scheduled jobs
+- Using environment-specific secrets with required reviewers for production
+  deployment jobs
+- Using D1 local mode and local migrations as part of normal development
+- Using Workers Logs/Traces or equivalent observability for scheduled jobs
+- Storing raw archives and quarantine material in private object storage rather
+  than in the repo
+
+## What To Avoid
+
+- Do not move the whole repo private as a substitute for secrets hygiene
+- Do not keep the current workflow behavior that prints secret-derived files to
+  CI logs
+- Do not use self-hosted runners for public PR workflows
+- Do not run archive downloads/extraction in privileged workflows that also have
+  deploy credentials
+- Do not combine `pull_request_target` with explicit PR checkout/build steps
+- Do not keep adding discovery/D1/worker code into the current flat root
+- Do not commit raw import dumps, app archives, or structured secret blobs
+
+## Recommendation
+
+For `doesitarm`, the strongest next-step package is:
+
+1. Refactor toward a Kriasoft-style monorepo shape adapted to pnpm.
+2. Add a security-hardening stage before expanding automation.
+3. Keep the repo public.
+4. Keep secrets, raw operational data, and archive/quarantine material private.
+5. Start scheduled discovery on GitHub-hosted runners with hardened workflows.
+6. Keep Cloudflare Workflows as a second-phase target for durable ingestion.
+
+Immediate high-priority actions to capture in the plan:
+
+1. Remove secret printing from
+   [deploy-cloudflare-workers.yml](../../.github/workflows/deploy-cloudflare-workers.yml)
+   and rotate affected secrets.
+2. Add repo policy and tooling for:
+   - read-only default `GITHUB_TOKEN`
+   - pinned actions
+   - `CODEOWNERS` for `.github/workflows/`, `infra/`, and `db/`
+   - secret scanning / push protection
+   - private vulnerability reporting
+3. Add ignore rules for local-secret files used by the new D1/Workers workflow:
+   - `.env.local`
+   - `.dev.vars`
+   - `.wrangler/`
+4. Keep public PR CI on GitHub-hosted runners only.
+5. Store raw archives/import snapshots outside the repo.
+
+## Missing Information
+
+- Whether the future ingestion runtime is expected to stay GitHub-first or
+  eventually move fully to Cloudflare Workers/Workflows.
+- Whether there are legal or vendor-policy constraints around storing downloaded
+  app archives long term.
+- Whether the monorepo refactor should keep Netlify as-is or consolidate more
+  runtime surfaces onto Cloudflare.
+
+## Source Links
+
+- GitHub Docs, `GITHUB_TOKEN` least-privilege and GitHub App escalation:
+  https://docs.github.com/en/actions/tutorials/authenticate-with-github_token
+- GitHub Docs, secrets in Actions, fork-secret behavior, environment reviewers,
+  OIDC, and masking:
+  https://docs.github.com/en/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions
+- GitHub Docs, secure use reference, pinning actions, CODEOWNERS, code scanning,
+  Dependabot, and Scorecards:
+  https://docs.github.com/en/actions/reference/security/secure-use
+- GitHub Docs, self-hosted runner warning for public repositories:
+  https://docs.github.com/en/actions/how-tos/manage-runners/self-hosted-runners/add-runners
+- GitHub Docs, limiting self-hosted runners in organizations:
+  https://docs.github.com/en/organizations/managing-organization-settings/disabling-or-limiting-github-actions-for-your-organization
+- GitHub Docs, approval requirements for fork PR workflows:
+  https://docs.github.com/en/actions/managing-workflow-runs-and-deployments/managing-workflow-runs/approving-workflow-runs-from-public-forks
+- GitHub Docs, repository Actions settings and fork workflow controls:
+  https://docs.github.com/github/administering-a-repository/managing-repository-settings/disabling-or-limiting-github-actions-for-a-repository
+- GitHub Docs, secret scanning for public repositories:
+  https://docs.github.com/github/administering-a-repository/about-token-scanning
+- GitHub Docs, enabling secret scanning / push protection:
+  https://docs.github.com/en/code-security/how-tos/secure-your-secrets/detect-secret-leaks/enabling-secret-scanning-for-your-repository
+- GitHub Docs, enabling push protection:
+  https://docs.github.com/en/code-security/secret-scanning/enabling-secret-scanning-features/enabling-push-protection-for-your-repository
+- GitHub Docs, private vulnerability reporting:
+  https://docs.github.com/en/code-security/security-advisories/working-with-repository-security-advisories/configuring-private-vulnerability-reporting-for-a-repository
+- GitHub Security Lab, `pull_request_target` / `workflow_run` guidance:
+  https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/
+- OpenSSF GitHub configuration best practices:
+  https://best.openssf.org/SCM-BestPractices/github/
+- Kriasoft React Starter Kit:
+  https://github.com/kriasoft/react-starter-kit
+- Cloudflare D1 local development:
+  https://developers.cloudflare.com/d1/best-practices/local-development/
+- Cloudflare Workers observability:
+  https://developers.cloudflare.com/workers/observability/
+- Cloudflare Workers logs:
+  https://developers.cloudflare.com/workers/observability/logs/
+- Cloudflare Workers traces:
+  https://developers.cloudflare.com/workers/observability/traces/
+- Cloudflare Workflows overview:
+  https://developers.cloudflare.com/workflows/
+
+## Source Quality Notes
+
+- Highest-confidence sources in this memo are GitHub Docs, GitHub Security Lab,
+  OpenSSF, Cloudflare Docs, and the Kriasoft repository itself.
+- HN/Lobsters did not surface a materially better competing pattern in this
+  pass; the most useful HN signal reinforced GitHub Security Lab's warning on
+  `pull_request_target`.
+- The recommendation to keep the repo public but move operational data private
+  is a synthesis from official guidance plus this repo's current shape and risk
+  surface.