docs(plan): add discovery and deploy follow-up research

Capture the next discovery, security, compatibility-data, and dual-deploy planning work, and ignore local Vercel/env state that should not be committed. This keeps the operational research with the repo while avoiding accidental local-config churn.

Constraint: Must not alter production runtime behavior
Rejected: Fold research notes into the runtime fix commit | obscures the user-facing app-test correction with planning-only material
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep .omx local state untracked even when committing broad workspace updates
Tested: Document review only
Not-tested: No runtime verification required for docs and ignore rules
ThatGuySam 2026-04-04 15:38:39 -05:00
parent e667ab564e
commit 1248c705b0
6 changed files with 1634 additions and 0 deletions

@ -0,0 +1,600 @@
# Desktop App Compatibility Data Strategy For doesitarm
Tease: In 2026, the winning play is not "AI SEO." It is publishing a high-trust, machine-readable compatibility corpus that is genuinely useful on the open web and selectively keeping the operational and proprietary layers private.
Lede: For `doesitarm` on 2026-04-04, the best-fit strategy is to treat the site as an entity-and-evidence graph for desktop software compatibility: publish canonical app pages, provenance-rich evidence, structured exports, and selective machine-readable surfaces for discovery; keep raw crawls, binary artifacts, candidate matches, scoring logic, and operational intelligence private.
Why it matters:
- This project can outlive the Apple Silicon transition if the core model is "desktop software compatibility knowledge," not just "Apple Silicon list posts."
- Google's 2025 AI search guidance still rewards the same fundamentals: unique content, crawlable pages, textual clarity, and trustworthy evidence, not special AI-only tricks.
- OpenAI and Anthropic now expose separate search, user-action, and training bots, which means "open versus closed" is no longer binary. You can choose visibility, training access, and operational exposure separately.
Go deeper:
- Think of the public site as a citation layer and decision-support layer, not as the full warehouse.
- Publish public facts, provenance, timestamps, and curated exports. Keep raw ingestion, low-confidence candidates, and monetizable workflow intelligence private.
- Treat `llms.txt` and Markdown exports as helpful secondary surfaces, not as the core strategy. The core strategy is still clean HTML, canonical URLs, structured data, sitemaps, and useful pages.
Date: 2026-04-04
## Scope
Research how to think about a long-lived desktop app compatibility database as a
content, SEO, and AI-discoverability system in 2026, including:
- best practices for public content architecture
- how LLM-driven discovery changes the picture
- what data should likely stay public versus private
- what audiences this data can serve
- tradeoffs between more-open and more-closed approaches
## Short Answer
Build `doesitarm` as a public knowledge product with a private operations layer
underneath it.
Publicly, publish:
- canonical app pages
- compatibility status by platform/environment
- evidence summaries and source links
- timestamps, changelogs, and history
- stable IDs, taxonomy, and machine-readable metadata
- a limited public API or snapshot exports for high-value reuse
Privately, keep:
- raw crawls and downloaded binaries
- candidate entities before review
- normalization, dedupe, and confidence logic
- crawler logs, abuse rules, and infrastructure controls
- enrichment that creates monetizable leverage rather than user value on the open web
The biggest strategic shift from 2018 to 2026 is this:
1. Search still rewards useful original pages.
2. AI discovery mostly rides on those same pages.
3. Separate crawler controls now let you be open for search while staying more closed for training.
4. The moat is less "having any compatibility data at all" and more:
verification quality, provenance, freshness, historical depth, and workflow speed.
Inference:
No single source states that exact four-part conclusion. It is the synthesis that
best fits the repo state plus current Google, OpenAI, Anthropic, Cloudflare,
HN, and Lobsters evidence.
## What The Repo Already Knows
- The project already acts like a compatibility corpus, not just a blog:
[README.md](/Users/athena/Code/doesitarm/README.md) is a manually curated,
source-linked compatibility list.
- The repo already has a plan to move toward a canonical database and discovery
pipeline:
[docs/plans/app-discovery-d1-automation.md](/Users/athena/Code/doesitarm/docs/plans/app-discovery-d1-automation.md).
- The public site already exposes crawlable pages, a sitemap, and permissive
crawling:
[static/robots.txt](/Users/athena/Code/doesitarm/static/robots.txt),
[static/sitemap-index.xml](/Users/athena/Code/doesitarm/static/sitemap-index.xml).
- The current public JSON already exposes useful app-level fields such as name,
aliases, status, bundle IDs, related links, scan details, and device support:
[static/api/app/spotify.json](/Users/athena/Code/doesitarm/static/api/app/spotify.json).
- The current structured data implementation is narrow and video-centric:
[helpers/structured-data.js](/Users/athena/Code/doesitarm/helpers/structured-data.js),
[helpers/listing-page.js](/Users/athena/Code/doesitarm/helpers/listing-page.js),
[helpers/config-node.js](/Users/athena/Code/doesitarm/helpers/config-node.js).
- I did not find a checked-in `llms.txt`, `llms-full.txt`, or per-page Markdown
export surface.
- I also did not find `SoftwareApplication` or `Dataset` structured data on app
or dataset pages.
Inference:
`doesitarm` already has enough public data shape to become a strong
machine-readable corpus. The main gap is not "inventing the dataset." The gap is
formalizing and publishing the right layers of it.
## What The Evidence Says
### 1. Google AI search still wants normal SEO fundamentals, not special AI tricks
Google's current AI-features guidance says there are no extra technical
requirements for AI Overviews or AI Mode beyond normal Search eligibility.
Google explicitly says you do not need new AI files or special schema just to
appear in AI features.
What does matter:
- crawl access
- internal links
- page experience
- important content in textual form
- structured data matching visible text
- unique, non-commodity content
This is the strongest argument against building an "AI discoverability" strategy
around gimmicks alone.
### 2. Large-scale thin template pages are a real risk
Google's helpful-content and spam-policy guidance is directly relevant to
programmatic compatibility sites:
- people-first content is favored
- pages made mainly to attract search visits are a warning sign
- scaled content abuse includes generating many low-value pages, including from
feeds or automated transformations
That means a compatibility database can absolutely win in search, but only if
its pages add decision-making value. Thin pages that just restate a status field
risk being classified as scaled content abuse.
### 3. Compatibility content should look more like tested reviews than like directory filler
Google's reviews guidance is a good proxy for compatibility pages because users
often arrive with a purchase, migration, or workflow decision in mind.
The guidance consistently rewards:
- original research
- first-hand evidence
- quantitative measurements where relevant
- comparisons
- what changed across versions
- benefits and drawbacks
For `doesitarm`, that maps cleanly onto:
- status by environment
- last verified date
- evidence links
- scanner output or screenshots where appropriate
- "what changed" changelog notes
- comparison pages like native vs Rosetta vs virtualization vs cloud workaround
### 4. Dataset markup is useful, but it should describe real dataset landing pages
Google's dataset documentation recommends canonical landing pages plus dataset
metadata such as `sameAs`, `isBasedOn`, identifiers, license, and download
distribution metadata.
That is a strong fit for curated exports such as:
- a public daily or weekly compatibility snapshot
- a historical archive by date
- vendor- or category-specific exports
- a Windows-on-ARM or future-transition slice later on
Important nuance:
Google's dataset docs are about Dataset Search discovery, not a substitute for
general web SEO. Dataset markup helps when you actually publish datasets.
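A minimal sketch of what the metadata for one snapshot landing page could look like, assuming a hypothetical `buildDatasetJsonLd` helper and placeholder `example.com` URLs. The property names (`sameAs`, `isBasedOn`, `identifier`, `license`, `distribution`) come from the Dataset guidance described above; the input shape is an assumption, not the repo's actual data contract.

```javascript
// Hypothetical helper: builds schema.org Dataset JSON-LD for one snapshot export.
// Every URL, ID, and value passed in is illustrative, not real doesitarm data.
function buildDatasetJsonLd(snapshot) {
  return {
    '@context': 'https://schema.org',
    '@type': 'Dataset',
    name: snapshot.name,
    description: snapshot.description,
    url: snapshot.landingPageUrl, // canonical dataset landing page
    identifier: snapshot.identifier, // stable snapshot ID
    license: snapshot.licenseUrl,
    dateModified: snapshot.dateModified,
    // Emit optional provenance fields only when the landing page states them.
    ...(snapshot.sameAs ? { sameAs: snapshot.sameAs } : {}),
    ...(snapshot.isBasedOn ? { isBasedOn: snapshot.isBasedOn } : {}),
    // One DataDownload entry per downloadable file format.
    distribution: snapshot.files.map((file) => ({
      '@type': 'DataDownload',
      encodingFormat: file.encodingFormat,
      contentUrl: file.contentUrl,
    })),
  };
}

const datasetJsonLd = buildDatasetJsonLd({
  name: 'Desktop app compatibility snapshot',
  description: 'Weekly export of public compatibility facts.',
  landingPageUrl: 'https://example.com/datasets/compatibility/2026-04-04',
  identifier: 'compatibility-snapshot-2026-04-04',
  licenseUrl: 'https://example.com/license',
  dateModified: '2026-04-04',
  files: [
    { encodingFormat: 'text/csv', contentUrl: 'https://example.com/exports/2026-04-04.csv' },
    { encodingFormat: 'application/json', contentUrl: 'https://example.com/exports/2026-04-04.json' },
  ],
});

console.log(JSON.stringify(datasetJsonLd, null, 2));
```

The output would be embedded on the dataset landing page itself, not on every app page, in line with the nuance above.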
### 5. `SoftwareApplication` markup fits the entity model, but Google rich-result requirements are narrower
Schema.org's `SoftwareApplication` type supports fields that are very relevant
here, including:
- `applicationCategory`
- `downloadUrl`
- `featureList`
- `operatingSystem`
- `softwareRequirements`
- `softwareVersion`
- `supportingData`
Google also has a software-app structured-data feature, but its rich-result
requirements are more commerce-shaped: `offers.price` plus either a review or an
aggregate rating. That means:
- use `SoftwareApplication` semantics where they match the visible page truth
- do not invent store-like fields just to chase rich results
- use dataset markup for exports and software/entity markup for canonical app
pages
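One way to apply the "match the visible page truth" rule is to emit only properties that have real values and never synthesize `offers`. The sketch below assumes a hypothetical `buildSoftwareApplicationJsonLd` helper and an invented input shape; it is not the repo's existing `helpers/structured-data.js` code.

```javascript
// Schema.org properties from the list above that this sketch may emit.
// `supportingData` is left out here to keep the example short.
const OPTIONAL_APP_PROPERTIES = [
  'applicationCategory',
  'downloadUrl',
  'featureList',
  'operatingSystem',
  'softwareRequirements',
  'softwareVersion',
];

// Hypothetical helper: builds entity markup for one canonical app page.
function buildSoftwareApplicationJsonLd(app) {
  const jsonLd = {
    '@context': 'https://schema.org',
    '@type': 'SoftwareApplication',
    name: app.name,
    url: app.canonicalUrl,
  };
  for (const property of OPTIONAL_APP_PROPERTIES) {
    const value = app[property];
    const isEmptyArray = Array.isArray(value) && value.length === 0;
    // Skip anything the page cannot back up with visible content.
    if (value === undefined || value === null || value === '' || isEmptyArray) continue;
    jsonLd[property] = value;
  }
  // Deliberately no `offers`, `aggregateRating`, or `review`: those are not invented here.
  return jsonLd;
}

const appJsonLd = buildSoftwareApplicationJsonLd({
  name: 'Example App',
  canonicalUrl: 'https://example.com/app/example-app',
  operatingSystem: 'macOS',
  softwareVersion: '',
  featureList: [],
});

console.log(JSON.stringify(appJsonLd, null, 2));
```

The trade-off is that pages built this way may not qualify for Google's software-app rich result, which is acceptable if the goal is accurate entity markup rather than store-style snippets.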
### 6. AI discoverability is now bot-by-bot, not one global yes/no
OpenAI and Anthropic both now distinguish between different AI access modes.
OpenAI:
- `OAI-SearchBot` is for search inclusion
- `GPTBot` is for training
- `ChatGPT-User` is for user-triggered actions
Anthropic:
- `ClaudeBot` is for training
- `Claude-SearchBot` is for search quality
- `Claude-User` is for user-triggered retrieval
This is strategically important. You no longer need to choose only between:
- fully public for every AI purpose
- fully blocked for every AI purpose
You can allow discovery while disallowing training, or allow search while
tightly managing user-action access, depending on your goals.
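As a rough illustration, the per-purpose choice can be expressed as a small policy object that is rendered into `robots.txt` groups. The user-agent tokens are the ones listed above; the `buildRobotsTxt` helper and the policy shape are assumptions for this sketch, and `robots.txt` remains advisory for crawlers that choose to honor it.

```javascript
// User-agent tokens and purposes as listed in this section.
const AI_BOTS = [
  { userAgent: 'OAI-SearchBot', purpose: 'search' },
  { userAgent: 'GPTBot', purpose: 'training' },
  { userAgent: 'ChatGPT-User', purpose: 'user' },
  { userAgent: 'Claude-SearchBot', purpose: 'search' },
  { userAgent: 'ClaudeBot', purpose: 'training' },
  { userAgent: 'Claude-User', purpose: 'user' },
];

// Hypothetical helper: turns a per-purpose policy into robots.txt text.
function buildRobotsTxt(policy) {
  const groups = AI_BOTS.map(({ userAgent, purpose }) => {
    const rule = policy[purpose] ? 'Allow: /' : 'Disallow: /';
    return `User-agent: ${userAgent}\n${rule}`;
  });
  // Everything else (including classic search crawlers) stays allowed.
  groups.push('User-agent: *\nAllow: /');
  return groups.join('\n\n') + '\n';
}

// Example policy: open for search and user-triggered fetches, closed for training.
const robotsTxt = buildRobotsTxt({ search: true, training: false, user: true });
console.log(robotsTxt);
```

Keeping the policy in one object makes it easy to flip a single purpose later without hand-editing six groups.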
### 7. `llms.txt` is real, but it is still a secondary signal
Cloudflare has implemented `llms.txt`, `llms-full.txt`, and per-page Markdown
exports, and Simon Willison has highlighted similar docs-map patterns as useful
for agent tooling.
That said:
- Google explicitly says no special AI text files are required for AI features
- OpenAI's discoverability guidance focuses on crawler access, `noindex`, and
citation/linking, not `llms.txt`
- HN and Lobsters discussions show real skepticism around AI crawler incentives
and how consistently emerging conventions are respected
Best interpretation:
- `llms.txt` is worth adding because it is cheap and increasingly recognized
- it should not be treated as the core lever
- the core lever is still strong public pages plus clean machine-readable
content
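If `llms.txt` is added, it can be generated from the same canonical page list rather than maintained by hand. The sketch below assumes the commonly described layout (an H1 title, a blockquote summary, then H2 sections of Markdown links); the `buildLlmsTxt` helper, section names, and URLs are hypothetical.

```javascript
// Hypothetical helper: renders an llms.txt index from a list of sections.
function buildLlmsTxt({ title, summary, sections }) {
  const lines = [`# ${title}`, '', `> ${summary}`];
  for (const section of sections) {
    lines.push('', `## ${section.heading}`, '');
    for (const link of section.links) {
      // The note after the colon is optional.
      const note = link.note ? `: ${link.note}` : '';
      lines.push(`- [${link.title}](${link.url})${note}`);
    }
  }
  return lines.join('\n') + '\n';
}

const llmsTxt = buildLlmsTxt({
  title: 'doesitarm',
  summary: 'Desktop software compatibility facts with evidence and timestamps.',
  sections: [
    {
      heading: 'Apps',
      links: [
        { title: 'Example App', url: 'https://example.com/app/example-app.md', note: 'status and evidence' },
      ],
    },
    {
      heading: 'Datasets',
      links: [{ title: 'Weekly snapshot', url: 'https://example.com/datasets/compatibility' }],
    },
  ],
});

console.log(llmsTxt);
```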
### 8. AI-friendly plain-text and Markdown surfaces do have practical value
Cloudflare's docs work here is the clearest practical example:
- per-page Markdown versions
- an index file
- bulk text export
- semantic HTML
- `noindex` on low-value or confusing pages
This is less about search ranking and more about:
- making retrieval cheaper and more accurate for agents
- improving citation quality
- reducing token waste
- giving your own future agents and partners a stable ingest format
For a compatibility corpus, that suggests public Markdown or JSON exports are
worth doing for the canonical facts layer.
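A per-page Markdown export for the canonical facts layer could be as small as the following sketch. The `renderAppMarkdown` helper, the record shape, and the status labels are assumptions for illustration; the real fields would come from whatever the public data contract ends up defining.

```javascript
// Hypothetical helper: renders one app's public facts as plain Markdown.
function renderAppMarkdown(app) {
  const lines = [
    `# ${app.name}`,
    '',
    `Canonical URL: ${app.canonicalUrl}`,
    `Last verified: ${app.lastVerified}`,
    '',
    '## Compatibility',
    '',
  ];
  for (const environment of app.environments) {
    lines.push(`- ${environment.label}: ${environment.status}`);
  }
  lines.push('', '## Evidence', '');
  for (const source of app.evidence) {
    lines.push(`- [${source.title}](${source.url})`);
  }
  return lines.join('\n') + '\n';
}

const appMarkdown = renderAppMarkdown({
  name: 'Example App',
  canonicalUrl: 'https://example.com/app/example-app',
  lastVerified: '2026-04-04',
  environments: [
    { label: 'Apple Silicon (native)', status: 'supported' },
    { label: 'Apple Silicon (Rosetta 2)', status: 'supported' },
  ],
  evidence: [{ title: 'Vendor release notes', url: 'https://example.com/release-notes' }],
});

console.log(appMarkdown);
```

Because the export contains only public facts, the same text can serve agents, partners, and internal tooling without exposing the private layers.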
### 9. Freshness and URL discovery matter more as the corpus grows
Google recommends sitemaps and Search Console monitoring.
IndexNow gives faster change pings for engines that support it, including Bing.
For a frequently updated compatibility corpus, this argues for:
- canonical landing pages
- clean sitemap generation
- changelog feeds or update streams
- optional IndexNow support for faster non-Google discovery
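For the optional IndexNow step, a sketch of selecting recently changed pages and building the submission body might look like this. The field names (`host`, `key`, `keyLocation`, `urlList`) reflect the IndexNow JSON format as I understand it and should be checked against the protocol documentation before use; the helper name, key, and URLs are hypothetical, and the sketch only builds the payload without sending it.

```javascript
// Hypothetical helper: builds an IndexNow-style JSON body for pages changed since a timestamp.
function buildIndexNowPayload({ host, key, keyLocation, pages, since }) {
  const sinceTime = Date.parse(since);
  const urlList = pages
    .filter((page) => Date.parse(page.lastChanged) > sinceTime)
    .map((page) => page.url);
  return { host, key, keyLocation, urlList };
}

const payload = buildIndexNowPayload({
  host: 'example.com',
  key: 'placeholder-key', // assumption: a site-owned key published at keyLocation
  keyLocation: 'https://example.com/placeholder-key.txt',
  since: '2026-04-02T00:00:00Z',
  pages: [
    { url: 'https://example.com/app/changed-app', lastChanged: '2026-04-03T12:00:00Z' },
    { url: 'https://example.com/app/stale-app', lastChanged: '2026-04-01T00:00:00Z' },
  ],
});

console.log(JSON.stringify(payload, null, 2));
```

The same `lastChanged` timestamps can drive sitemap `lastmod` values and a changelog feed, so all three freshness signals stay consistent.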
### 10. The crawl environment is getting more adversarial
Cloudflare Radar reported AI and search crawling growth of 18% from May 2024 to
May 2025 across its measured cohort, with `GPTBot` up 305%.
HN and Lobsters operator discussions show why this matters in practice:
- some AI crawlers create real infrastructure cost
- incentives are less aligned than classic web search
- operators increasingly need bot-specific controls, rate limiting, and
selective exposure
This is the best evidence for keeping raw and high-cost surfaces private even if
you lean more open on the public facts layer.
## Ways This Data Can Create Value
### Human audiences
- End users deciding whether they can keep using a favorite app on new hardware.
- IT, procurement, and upgrade planners deciding when a transition is safe.
- Developers and vendors tracking native support gaps and competitive pressure.
- Journalists and analysts covering platform transitions.
- Researchers and historians studying how ecosystems adapt to hardware changes.
### Machine audiences
- Search engines indexing canonical app, category, and comparison pages.
- LLM search products citing your pages as evidence.
- RAG systems consuming public snapshots or APIs.
- Agents answering migration, procurement, or troubleshooting questions.
- Internal `doesitarm` automation using the same canonical public layer as a
stable reference surface.
### Business-model value
- Audience growth from high-intent compatibility queries.
- Affiliate or sponsored monetization on truly decision-support pages.
- Paid APIs, bulk exports, or enterprise dashboards.
- Vendor intelligence and alerting.
- Historical transition data as a differentiated research asset.
Inference:
The public facts are likely to commoditize over time. The durable value is the
combination of breadth, freshness, provenance, history, and tooling layered on
top of those facts.
## What Should Likely Stay Public
Public-by-default fields:
- stable app identifier and canonical URL
- app name, aliases, vendor, category, platform family
- compatibility status by environment
- environment dimensions such as CPU architecture, OS family/version, native
vs translation vs virtualization
- bundle IDs and installer/package metadata where safe and user-useful
- last verified date, first seen date, last changed date
- public evidence summary and source links
- changelog summary for status changes
- category and comparison pages built from real user tasks
- curated JSON, CSV, or Parquet snapshot exports
- public structured data and sitemaps
Public page types that seem high-value:
- canonical app pages
- category pages
- "best alternatives if not native yet" pages
- transition pages such as "best native DAWs on Apple Silicon"
- comparison pages by use case, hardware generation, and workaround path
- dataset landing pages for bulk exports
## What Should Likely Stay Private
Private-by-default fields:
- raw crawled HTML and downloaded ZIP/DMG/PKG artifacts
- extracted binaries and quarantine samples
- low-confidence matches and candidate entities
- dedupe, normalization, and scoring heuristics
- reviewer notes, moderation notes, and dispute state
- crawler logs, IP intelligence, WAF rules, and abuse signatures
- affiliate economics, contact records, outreach state, and deal terms
- internal confidence models, embeddings, and experimental feature engineering
- unpublished source mappings and scrape recipes that are costly to build
Why keep these private:
- operational risk
- legal and hosting risk
- abuse resistance
- clearer moat
- lower copyability
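The public/private split above can be enforced mechanically with an allowlist projection, so that a newly added internal field stays private unless someone deliberately publishes it. This is a minimal sketch; the `toPublicRecord` helper and all field names are hypothetical stand-ins for the lists above.

```javascript
// Allowlist of public-by-default fields (names are illustrative).
const PUBLIC_FIELDS = [
  'id',
  'canonicalUrl',
  'name',
  'aliases',
  'vendor',
  'category',
  'status',
  'lastVerified',
  'evidenceLinks',
];

// Hypothetical helper: copies only allowlisted fields into the published record.
function toPublicRecord(internalRecord) {
  const publicRecord = {};
  for (const field of PUBLIC_FIELDS) {
    if (field in internalRecord) {
      publicRecord[field] = internalRecord[field];
    }
  }
  return publicRecord;
}

const published = toPublicRecord({
  id: 'example-app',
  name: 'Example App',
  status: 'native',
  lastVerified: '2026-04-04',
  reviewerNotes: 'internal only', // private: never copied
  matchConfidence: 0.42, // private: never copied
});

console.log(published);
```

An allowlist is used instead of a denylist because a forgotten entry then fails closed (the field is withheld) rather than open (the field leaks).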
## Different Ways To Think About The Database
### 1. Directory / programmatic SEO system
Upside:
- fastest traffic growth if executed well
Downside:
- easiest to drift into thin pages and scaled-content abuse
- weakest long-term moat
Use this frame only if every template answers a real question better than a
generic directory would.
### 2. Public knowledge graph with evidence
Upside:
- strongest fit for search, citations, and trust
- best long-term reuse across Apple, Windows, and future transitions
Downside:
- requires stronger data modeling and provenance discipline
This is the best framing for `doesitarm`.
### 3. Public publication layer over a private intelligence system
Upside:
- best balance of discoverability and defensibility
- easiest path to enterprise/API products later
Downside:
- more operational complexity
This is the recommended operating model.
### 4. Mostly closed database with selective public summaries
Upside:
- strongest direct control over assets
Downside:
- weakest SEO and AI discoverability
- hardest to build brand authority from the data itself
This makes sense only if monetization depends more on closed workflows than on
being the public authority.
## Open Vs Closed Strategy Options
### Option 1. Open facts, private operations
Publish:
- canonical pages
- evidence summaries
- limited exports
- structured data
Keep private:
- raw ingestion
- candidate pipeline
- scoring and ops
Tradeoff:
Best overall balance of discoverability, trust, and defensibility.
### Option 2. Open pages, paid API / paid bulk data
Publish:
- strong pages for discovery and citations
- free lightweight API or delayed snapshots
Charge for:
- real-time API
- higher limits
- historical depth
- enterprise filters and alerts
Tradeoff:
Strong monetization path, but requires clearer product packaging.
### Option 3. Fully open data commons
Publish:
- everything except unsafe raw binaries/secrets
Tradeoff:
Maximum goodwill, citation, and reuse.
Minimum moat unless monetization shifts to services, sponsorship, or community
leadership.
### Option 4. Selective access / crawler monetization layer
Publish:
- normal web pages
Control:
- which bots crawl
- whether training is allowed
- whether some crawlers must pay
Tradeoff:
Promising middle path, especially as crawler monetization standards mature, but
still early and not something to build the whole strategy around yet.
## Recommendation
For `doesitarm`, use Option 1 now, with a path to Option 2 later.
Concretely:
1. Treat the database as transition-agnostic.
Use dimensions like `platform_family`, `cpu_arch`, `translation_layer`,
`virtualization_layer`, `os_version`, `artifact_type`, and
`verification_method` so the same model can cover Apple Silicon, Windows on
ARM, or the next Apple transition.
2. Build a public canonical facts layer.
Each app should have a canonical page with:
status, environments, timestamps, evidence links, and short synthesis.
3. Build a public dataset layer.
Publish periodic snapshots with dataset landing pages, license, provenance,
versioning, and download metadata.
4. Keep ingestion and raw evidence private.
Store raw downloads, scrape traces, matching logic, and low-confidence
candidates outside the public repo and public site.
5. Add public machine-readable surfaces in this order:
- `SoftwareApplication`-style entity markup where it truthfully matches page content
- dataset landing pages plus `Dataset` / `DataDownload` metadata for exports
- stable JSON or CSV snapshots
- `llms.txt` and Markdown exports as secondary aids
6. Make public pages citation-friendly.
Add clear authorship, methodology, "how we know", last verified date, and
source links.
7. Avoid index bloat.
Keep canonical entity and high-intent comparison pages indexable.
Use `noindex` or crawl controls for low-value filter permutations and stale
or confusing pages.
8. Measure before deciding how open to be.
Track:
- Search Console web traffic
- ChatGPT referral traffic via `utm_source=chatgpt.com`
- bot traffic by user agent
- crawl cost versus referral value
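The tracking items in step 8 can start as a simple classifier over request logs. The sketch below assumes a hypothetical log shape of `{ userAgent, url }` and matches the user-agent tokens named earlier in this memo plus the `utm_source=chatgpt.com` referral marker; real log formats and any additional bots would need their own handling.

```javascript
// Bot tokens and purposes as listed in the bot-by-bot section of this memo.
const BOT_PURPOSES = {
  'OAI-SearchBot': 'search',
  GPTBot: 'training',
  'ChatGPT-User': 'user',
  'Claude-SearchBot': 'search',
  ClaudeBot: 'training',
  'Claude-User': 'user',
};

// Hypothetical helper: labels one request as a known bot, a ChatGPT referral, or other.
function classifyRequest({ userAgent, url }) {
  const agent = userAgent.toLowerCase();
  for (const [token, purpose] of Object.entries(BOT_PURPOSES)) {
    if (agent.includes(token.toLowerCase())) {
      return { kind: 'bot', bot: token, purpose };
    }
  }
  // The base URL lets relative log paths parse; it is a placeholder.
  const source = new URL(url, 'https://example.com').searchParams.get('utm_source');
  if (source === 'chatgpt.com') {
    return { kind: 'referral', source };
  }
  return { kind: 'other' };
}

// Hypothetical helper: counts requests per category for a crawl-cost versus referral view.
function summarizeTraffic(requests) {
  const counts = {};
  for (const request of requests) {
    const result = classifyRequest(request);
    let key = 'other';
    if (result.kind === 'bot') key = `bot:${result.purpose}`;
    if (result.kind === 'referral') key = `referral:${result.source}`;
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

console.log(
  summarizeTraffic([
    { userAgent: 'Mozilla/5.0 (compatible; GPTBot/1.0)', url: '/app/example-app' },
    { userAgent: 'Mozilla/5.0 Safari', url: '/app/example-app?utm_source=chatgpt.com' },
  ])
);
```

User-agent strings can be spoofed, so counts like these are a starting signal for the open-versus-closed decision rather than an audit-grade measurement.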
Inference:
The best long-term moat is not withholding all facts. It is being the most
trusted and most reusable source for those facts, while keeping the expensive
and differentiating machinery private.
## Near-Term Next Steps For doesitarm
1. Add a public data-contract document describing the canonical app entity,
environment entity, evidence entity, and snapshot dataset.
2. Expand app pages from "status page" to "evidence page":
include methodology, last verified date, change history, and source
attribution.
3. Add structured data intentionally:
entity markup for app pages, dataset markup for exports, not generic markup
everywhere.
4. Add a public snapshot export and a dataset landing page.
5. Add a bot-policy matrix to `robots.txt` planning:
Google search, OpenAI search, Anthropic search, training bots, and user bots.
6. Add `llms.txt` only after the public canonical and export layers are clean.
7. Keep filters/search-result pages from becoming the primary indexable surface.
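For the data-contract document in step 1, the environment entity could be pinned down with the dimensions named in the recommendation. The sketch below is an assumption about shape only: the dimension names come from this memo, while the `environmentKey` helper, the example values, and the `none` placeholder are hypothetical.

```javascript
// Dimensions named in the recommendation for a transition-agnostic model.
const ENVIRONMENT_DIMENSIONS = [
  'platform_family',
  'cpu_arch',
  'translation_layer',
  'virtualization_layer',
  'os_version',
  'artifact_type',
  'verification_method',
];

// Hypothetical helper: validates an environment record and derives a stable key from it.
function environmentKey(environment) {
  const missing = ENVIRONMENT_DIMENSIONS.filter((dimension) => !(dimension in environment));
  if (missing.length > 0) {
    throw new Error(`Missing dimensions: ${missing.join(', ')}`);
  }
  // null or undefined means "not applicable" and is rendered as "none".
  return ENVIRONMENT_DIMENSIONS
    .map((dimension) => `${dimension}=${environment[dimension] ?? 'none'}`)
    .join(';');
}

const key = environmentKey({
  platform_family: 'macos',
  cpu_arch: 'arm64',
  translation_layer: 'rosetta2',
  virtualization_layer: null,
  os_version: '15',
  artifact_type: 'dmg',
  verification_method: 'binary_scan',
});

console.log(key);
```

Requiring every dimension to be present, even when its value is "not applicable", keeps Apple Silicon, Windows on ARM, and later transitions comparable under one schema.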
## Source Links
- Repo context:
- [README.md](/Users/athena/Code/doesitarm/README.md)
- [docs/plans/app-discovery-d1-automation.md](/Users/athena/Code/doesitarm/docs/plans/app-discovery-d1-automation.md)
- [static/robots.txt](/Users/athena/Code/doesitarm/static/robots.txt)
- [helpers/structured-data.js](/Users/athena/Code/doesitarm/helpers/structured-data.js)
- [helpers/listing-page.js](/Users/athena/Code/doesitarm/helpers/listing-page.js)
- [helpers/config-node.js](/Users/athena/Code/doesitarm/helpers/config-node.js)
- [static/api/app/spotify.json](/Users/athena/Code/doesitarm/static/api/app/spotify.json)
- Google AI features and AI search:
- https://developers.google.com/search/docs/appearance/ai-features
- https://developers.google.com/search/blog/2025/05/succeeding-in-ai-search
- https://developers.google.com/search/docs/fundamentals/creating-helpful-content
- https://developers.google.com/search/docs/essentials/spam-policies
- https://developers.google.com/search/docs/fundamentals/using-gen-ai-content
- Google review and structured-data guidance:
- https://developers.google.com/search/docs/appearance/reviews-system
- https://developers.google.com/search/docs/specialty/ecommerce/write-high-quality-reviews
- https://developers.google.com/search/docs/appearance/structured-data/software-app
- https://developers.google.com/search/docs/appearance/structured-data/dataset
- https://developers.google.com/search/docs/appearance/structured-data/sd-policies
- https://developers.google.com/search/docs/crawling-indexing/crawling-managing-faceted-navigation
- https://developers.google.com/search/docs/crawling-indexing/block-indexing
- Schema and dataset modeling:
- https://schema.org/SoftwareApplication
- OpenAI:
- https://help.openai.com/en/articles/12627856-publishers-and-developers-faq
- https://help.openai.com/en/articles/9237897-chatgpt-search
- https://platform.openai.com/docs/gptbot
- Anthropic:
- https://support.claude.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
- https://docs.anthropic.com/en/docs/build-with-claude/search-results
- Cloudflare:
- https://developers.cloudflare.com/style-guide/how-we-docs/ai-consumability/
- https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/
- https://blog.cloudflare.com/introducing-pay-per-crawl/
- Discovery and freshness:
- https://www.indexnow.org/index
- Practitioner and discussion context:
- https://simonwillison.net/2025/Oct/24/claude-code-docs-map/
- https://news.ycombinator.com/item?id=41072549
- https://lobste.rs/s/dmuad3/mitigating_sourcehut_s_partial_outage
## Source Quality Notes
- Google Search Central, OpenAI, Anthropic, Schema.org, IndexNow, and Cloudflare
were the primary sources for current guidance.
- The HN and Lobsters links were useful for operator sentiment and failure modes,
not as primary authority for ranking behavior.
- `llms.txt` appears real and increasingly implemented, but the strongest
current evidence still says it is supplemental rather than foundational.

@ -0,0 +1,222 @@
# Private/Public Repo Sync Patterns For doesitarm
Tease: Git can push to a second remote, but it does not natively maintain a safe long-lived branch that "merges everything except `docs/`."
Lede: For `doesitarm` on 2026-04-04, the best-fit pattern is a second repo/remote plus an automated one-way export from `origin/master` that rewrites out `docs/` and force-pushes the result.
Why it matters:
- `docs/` is already present on `origin/master`, so excluding it from a future mirror does not make it private retroactively.
- `docs/` does not appear to participate in the app build, which makes export-time exclusion low-risk.
- Cross-repo automation on GitHub needs separate auth; the default workflow token is not enough for another private repo.
Go deeper:
- Simple one-way sync: fresh clone -> `git filter-repo --path docs/ --invert-paths` -> force-push to a second remote.
- Faster repeated projection: evaluate `splitsh-lite` if export speed or repeatability becomes important.
- Advanced bidirectional filtered views: `josh` is the serious option, but it is heavier than this repo likely needs.
Date: 2026-04-04
## Scope
Research whether `doesitarm` should support a `private-main`-style branch or
remote that automatically tracks the public default branch while excluding
paths such as `docs/`, and identify better patterns if they exist.
## Short Answer
Yes, you can push to a different remote from the public repo.
No, the durable pattern is not a long-lived `private-main` merge branch that
keeps deleting `docs/` after every merge. Git branches and merges are full-tree
operations, and sparse checkout does not change that.
For this repo, the cleanest pattern is:
1. Keep `origin/master` as the source branch.
2. Add a second remote or second repository for the private target.
3. Run a one-way export job on each push to `master`.
4. In that job, create a sanitized tree/history with `docs/` removed.
5. Force-push the sanitized result to the private remote branch.
If the real requirement is that future docs stay private, the better topology
is the reverse: keep the canonical repo private and generate the public export.
Inference:
That last recommendation is based on the current repo state: `docs/` is already
committed on `origin/master`, so a private mirror without `docs/` only changes
future distribution, not past public exposure.
## What The Repo Already Knows
- The default remote today is only `origin`, pointed at the public GitHub repo.
- The default tracked branch is `origin/master`, not `main`.
- There is no checked-in `.github/` workflow that already handles cross-repo
sync.
- `docs/` currently contains repo-local planning and research material:
[docs/app-flow.md](/Users/athena/Code/doesitarm/docs/app-flow.md),
[docs/plans/app-test-typescript-refactor.md](/Users/athena/Code/doesitarm/docs/plans/app-test-typescript-refactor.md),
and dated research memos under
[docs/research](/Users/athena/Code/doesitarm/docs/research).
- `docs/` is already present in `origin/master` history as of 2026-04-04.
- The build surface in
[package.json](/Users/athena/Code/doesitarm/package.json)
does not appear to depend on `docs/`.
## What The Evidence Says
- Different remote from the public repo:
yes. Git supports separate remotes, and the `git remote` docs explicitly say
that when fetch and push use different locations, you should use two separate
remotes rather than pretending they are the same remote.
- Long-lived branch with path exclusions:
not a native Git capability. Git merges operate on full trees, not
"everything except these directories."
- Sparse checkout:
not the answer here. The `git sparse-checkout` docs describe it as a working
directory reduction feature, and they note that operations such as merge or
rebase may still materialize paths outside the sparse specification.
- `git filter-repo`:
good fit for one-way export. The Git project now recommends it instead of
`filter-branch`, and its docs support `--invert-paths` for "keep everything
except these paths" rewrites. That matches "mirror the repo except `docs/`."
- `splitsh-lite`:
promising when you want repeatable projections into standalone repos and care
about performance. Its README supports split prefixes that can include
exclusions and uses a history cache, which is more appropriate than a manual
merge branch when this becomes a repeated sync lane.
- `josh`:
the advanced option. Its repo describes a proxy Git server that exposes
filtered histories as standalone repos and synchronizes between original and
filtered views. This is the closest thing to a "real" selective mirror
system, but it adds operational weight.
- GitHub Actions auth:
the default `github.token` is scoped to the current repository. If a workflow
in the public repo needs to push to a different private repo, you need a PAT,
deploy key, or GitHub App token instead.
## What Works
- A second remote or second repository for the private target.
- A one-way generated branch or repo, not a hand-maintained merge branch.
- Rebuilding the private export from `origin/master` every run.
- Treating the private mirror as generated output with force-pushes allowed.
- Keeping development on one source branch and one source-of-truth repo.
## What To Avoid
- Do not maintain `private-main` by repeatedly merging `master` and deleting
`docs/`. That creates unnecessary churn and eventual conflict debt.
- Do not use sparse checkout as if it were a publishing filter.
- Do not make the generated private mirror a peer source of truth unless you
also adopt a projection system designed for bidirectional sync.
- Do not rely on the default GitHub Actions token for pushes to another repo.
- Do not assume this setup hides `docs/` historically; those files are already
in the public remote history.
## Best Patterns
### 1. Best Fit For This Repo: one-way export to a second remote
Use a second remote, for example `private`, pointing at a separate private
GitHub repository. On each push to `origin/master`, run an automation that:
1. checks out `master`
2. authenticates to the private repo with a PAT, deploy key, or GitHub App
3. creates a fresh export clone or export worktree
4. rewrites out `docs/` and any other excluded paths
5. force-pushes the sanitized result to the private repo branch
Why this is the best fit:
- it matches the repo's current single-source workflow
- it does not depend on path-aware merges that Git does not have
- it keeps excluded-path logic in one place
- it is easy to reason about and recover from
Tradeoffs:
- exported commit SHAs will differ from public `master`
- the private mirror should be treated as generated/read-only
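The export steps above can be sketched as a function that returns the Git commands as argument arrays, so an automation script can review or log them before running anything. The `buildExportPlan` helper, the remote name, and all URLs and paths are hypothetical; `git filter-repo --path ... --invert-paths` is the rewrite named in this memo, and it assumes `git-filter-repo` is installed on the runner.

```javascript
// Hypothetical helper: returns the ordered commands for a one-way sanitized export.
// Nothing is executed here; each entry is an argv array suitable for a process spawner.
function buildExportPlan({ sourceUrl, sourceBranch, privateUrl, privateBranch, excludedPaths, workDir }) {
  // One --path flag per excluded path, inverted so everything else is kept.
  const pathArgs = excludedPaths.flatMap((path) => ['--path', path]);
  return [
    // Fresh clone every run, so the export is always rebuilt from the source branch.
    ['git', 'clone', '--branch', sourceBranch, '--single-branch', sourceUrl, workDir],
    // Rewrite history without the excluded paths.
    ['git', '-C', workDir, 'filter-repo', ...pathArgs, '--invert-paths'],
    // Point at the private target (authentication is configured outside this sketch).
    ['git', '-C', workDir, 'remote', 'add', 'private', privateUrl],
    // The mirror is generated output, so a force-push is expected.
    ['git', '-C', workDir, 'push', '--force', 'private', `${sourceBranch}:${privateBranch}`],
  ];
}

const plan = buildExportPlan({
  sourceUrl: 'https://example.com/public/doesitarm.git',
  sourceBranch: 'master',
  privateUrl: 'https://example.com/private/doesitarm-mirror.git',
  privateBranch: 'master',
  excludedPaths: ['docs/'],
  workDir: '/tmp/doesitarm-export',
});

for (const command of plan) {
  console.log(command.join(' '));
}
```

Keeping `excludedPaths` as a single input matches the goal of holding excluded-path logic in one place, and adding a second path later does not change the shape of the job.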
### 2. Better If Repeated Projection Becomes Core: `splitsh-lite`
If you end up publishing multiple filtered mirrors or need fast repeated
updates, `splitsh-lite` is worth a spike. It is built for turning repository
views into standalone histories and caching the work.
Tradeoffs:
- more specialized operational knowledge
- less obvious to future maintainers than a simple export script
### 3. Better Only For Advanced Bidirectional Partial Views: `josh`
If the real requirement becomes "developers commit through filtered views and
changes synchronize both directions," `josh` is the pattern to study.
Tradeoffs:
- significant infrastructure/runtime overhead
- far more complexity than `doesitarm` appears to need today
### 4. Adjacent But Not The Same Problem: GitHub Private Mirrors
GitHub's `private-mirrors` app is relevant if the goal is to collaborate
privately around a public repository and upstream later. It is not the right
answer for "same repo minus `docs/`," but it is worth noting as a neighboring
pattern.
## Recommendation
For `doesitarm`, use a separate private repository plus a generated sync job.
Name the target branch after the actual default branch in this repo, for
example `private-master` or simply `master` on the private remote.
Do not implement this as a merge branch.
If the aim is just "same code, different remote, minus `docs/`," a generated
one-way mirror is the right level of machinery.
If the aim is "keep future internal docs private," move the source of truth to
a private repo and generate the public mirror from that private origin.
## Missing Information
- Whether the private target is intended to be read-only/generated or whether
anyone will commit directly to it.
- Whether `docs/` is the only excluded path or just the first example.
- Whether the real goal is secrecy, deployment hygiene, or private-only
collaboration before publishing.
## Source Links
- Git remote docs:
https://git-scm.com/docs/git-remote
- Git sparse-checkout docs:
https://git-scm.com/docs/git-sparse-checkout
- `git-filter-repo` repository:
https://github.com/newren/git-filter-repo
- `git-filter-repo` manual:
https://www.mankier.com/1/git-filter-repo
- `splitsh-lite` repository:
https://github.com/splitsh/lite
- `josh` repository:
https://github.com/josh-project/josh
- `actions/checkout` README:
https://github.com/actions/checkout
- GitHub App auth in GitHub Actions:
https://docs.github.com/en/enterprise-cloud@latest/apps/creating-github-apps/authenticating-with-a-github-app/making-authenticated-api-requests-with-a-github-app-in-a-github-actions-workflow
- GitHub deploy keys:
https://docs.github.com/v3/guides/managing-deploy-keys
- GitHub Private Mirrors app:
https://github.com/github-community-projects/private-mirrors
- Stack Overflow, partial sharing of Git repositories:
https://stackoverflow.com/questions/278270/partial-sharing-of-git-repositories
## Source Quality Notes
- HN and Lobsters searches on 2026-04-04 did not surface a clearly better
mainstream pattern than the Git/GitHub docs plus the specialized projection
tools above.
- Primary docs and project READMEs were materially more useful than forum
commentary for this question.

@ -0,0 +1,396 @@
# Public Repo Security And Monorepo Patterns For doesitarm
Tease: The safest version of this plan keeps `doesitarm` public, but treats credentials, imports, downloaded app artifacts, and privileged automation as private operational surfaces.
Lede: For `doesitarm` on 2026-04-04, the best-fit pattern is a Kriasoft-style public monorepo with clear `apps/`, `packages/`, `db/`, and `infra/` boundaries, plus hardened GitHub Actions, GitHub-hosted runners for public workflows, D1 local development via Wrangler, and private storage for secrets, backups, and quarantined artifacts.
Why it matters:
- The current repo is about to add higher-risk surfaces: D1, automated app discovery, archive downloading, scheduled jobs, and more Cloudflare automation.
- In a public repo, CI/CD mistakes matter as much as application code mistakes. Workflow files, tokens, logs, and runner choices become part of the threat model.
- The current repo already has one immediate security problem: a workflow prints secret-derived files to CI logs.
Go deeper:
- Keep the code public; keep secrets, raw data, and operational state private.
- Refactor toward a monorepo shape early so new ingestion, scanner, D1, and infra code do not spread across a flat root.
- Adopt OSS-friendly GitHub hardening: read-only default `GITHUB_TOKEN`, pinned actions, CODEOWNERS on workflow/infra/db paths, secret scanning, private vulnerability reporting, and no self-hosted runners for public PRs.
Date: 2026-04-04
## Scope
Research security considerations and common open-source repository patterns for a
setup like `doesitarm`:
- public GitHub repository
- Cloudflare Workers and D1
- scheduled automation
- automated downloading and scanning of third-party app archives
- prospective monorepo refactor in the style of
`kriasoft/react-starter-kit`
This memo is intended to drive updates to
[app-discovery-d1-automation.md](/docs/plans/app-discovery-d1-automation.md).
## Short Answer
Do not move the whole repo private.
Instead:
1. Keep the application and infrastructure code public.
2. Move secrets, imported raw data, D1 operational state, downloaded artifacts,
quarantined samples, and any sensitive fixtures to private systems.
3. Refactor into a monorepo early, using a Kriasoft-style structure adapted to
this repo's existing pnpm/Netlify/Astro/Workers setup.
4. Harden GitHub Actions before expanding automation.
Best-fit recommendation:
- Public monorepo with `apps/`, `packages/`, `db/`, `infra/`, `scripts/`,
and `docs/`
- GitHub-hosted runners for public workflows
- GitHub environment secrets with required reviewers for production deploys
- Cloudflare D1 local development and tests via Wrangler `--local`,
`preview_database_id`, and test harnesses like `unstable_dev()`/Miniflare
- Private object storage or equivalent for raw app archives, import dumps,
and quarantine material
Inference:
This is the right fit because the repo is open source and community-facing, but
the risky parts are operational, not architectural. Public code is compatible
with good security here; public credentials and public operational data are not.
## What The Repo Already Knows
- The repo is currently flat-rooted, not organized as a workspace monorepo.
- There is no checked-in D1 configuration or local D1 bootstrap yet.
- There is Cloudflare deployment automation in
[deploy-cloudflare-workers.yml](/.github/workflows/deploy-cloudflare-workers.yml).
- That workflow currently decodes secret-backed `.env` / `wrangler.toml` files
and prints them with `cat`, which is a real security issue in CI logs.
- The site build still depends on remote/env-backed feeds such as
`SCANS_SOURCE`, `COMMITS_SOURCE`, `HOMEBREW_SOURCE`, `GAMES_SOURCE`, and
`VFUNCTIONS_URL`.
- The scanner and planned discovery pipeline will process untrusted third-party
files, including archive formats like ZIP, DMG, and PKG.
- `.env` is ignored at the root, and per-worker `wrangler.toml` files are
already ignored in worker subdirectories.
## What The Evidence Says
### 1. Public repos can stay public if the operational boundary is private
GitHub's own docs assume public repositories will use:
- repository or environment secrets
- restricted organization secret access
- private vulnerability reporting
- automatic secret scanning on public repos
That is strong evidence that the normal pattern is not "make the repo private";
it is "keep sensitive operational material out of the repo and out of logs."
### 2. Default GitHub Actions posture should be least privilege
GitHub recommends:
- minimum required `GITHUB_TOKEN` permissions
- default repository token permission set to read-only
- escalating permissions only per job
- using a GitHub App token if a job needs more than `GITHUB_TOKEN` can provide
This matches what open-source repos increasingly do for deploy, release, and
cross-repo automation.
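
A minimal sketch of that posture in a workflow file. The file name, script
paths, and job names are hypothetical, `master` is assumed to be the default
branch, and the action SHA is a placeholder to fill in after review.

```yaml
# .github/workflows/ci.yml (hypothetical file name)
name: ci

on:
  pull_request:
  push:
    branches: [master] # assumption: default branch

# Workflow-wide default: the token may only read repository contents.
permissions:
  contents: read

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@<full-commit-sha> # placeholder
      - run: ./scripts/test.sh # hypothetical test entry point

  release-notes:
    # Only this job gets write access, and only on pushes to the default branch.
    if: github.event_name == 'push'
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - uses: actions/checkout@<full-commit-sha> # placeholder
      - run: ./scripts/publish-release-notes.sh # hypothetical
        env:
          GH_TOKEN: ${{ github.token }}
```

Setting the repository-level default to read-only as well means a workflow that
forgets its `permissions` block still starts from the restricted token.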
### 3. Secrets are still easy to leak through logs and workflow behavior
GitHub's secure-use docs explicitly warn that:
- redaction is not guaranteed for transformed values
- structured blobs like JSON/YAML are poor secret formats
- sensitive values that are not stored as GitHub secrets should be masked
explicitly with `::add-mask::`
- exposed secrets in logs should trigger deletion/rotation
For `doesitarm`, this directly applies to the current workflow that prints
secret-derived config files into CI output.
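
A sketch of how the decode step could look without printing anything. This is
a `steps` fragment, not the repo's actual workflow: the secret name
`WRANGLER_TOML_B64` and the `derive-token.sh` script are hypothetical.

```yaml
steps:
  - name: Write wrangler.toml from a base64-encoded secret
    env:
      WRANGLER_TOML_B64: ${{ secrets.WRANGLER_TOML_B64 }} # hypothetical secret
    run: |
      umask 077 # created file is readable by the runner user only
      printf '%s' "$WRANGLER_TOML_B64" | base64 --decode > wrangler.toml
      # Check that the file exists and is non-empty without echoing its content.
      test -s wrangler.toml

  - name: Mask a value derived at runtime
    run: |
      DERIVED_TOKEN="$(./scripts/derive-token.sh)" # hypothetical script
      # Register the value so later log lines that contain it are redacted.
      echo "::add-mask::$DERIVED_TOKEN"
      echo "DERIVED_TOKEN=$DERIVED_TOKEN" >> "$GITHUB_ENV"
```

The decoded file is a transformed value, so log redaction would not reliably
catch it if a later step printed it; the safer rule is to never `cat` it.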
### 4. Public repos should avoid self-hosted runners for untrusted PRs
GitHub explicitly recommends using self-hosted runners only with private
repositories, because forks of public repositories can run dangerous code on
them through pull requests.
For this repo, that means:
- do not put public PR workflows on a local machine or other long-lived
self-hosted runner
- do not run untrusted archive-processing jobs on a self-hosted runner that
also holds production credentials
### 5. `pull_request_target` remains a common footgun
GitHub Security Lab's `Preventing pwn requests` guidance is still the clearest
implementation reference:
- `pull_request_target` plus checking out/building PR code is dangerous
- untrusted PR code should run in an unprivileged `pull_request` workflow
- privileged follow-up actions should happen through `workflow_run` with
carefully handled artifacts
HN discussion around real workflow exploits reinforces the same point: the
problem is not theoretical.
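
A sketch of the split described above, shown as two workflow files separated by
`---`. File names, script paths, and the artifact name are hypothetical, the
action SHAs are placeholders, and the `run-id` and `github-token` inputs assume
`actions/download-artifact` v4.

```yaml
# File 1: .github/workflows/pr-check.yml (hypothetical), unprivileged
name: pr-check
on:
  pull_request:
permissions:
  contents: read
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@<full-commit-sha> # placeholder
      - run: ./scripts/build.sh # hypothetical; runs PR code with no secrets
      - name: Record the PR number as data for the follow-up workflow
        run: |
          mkdir -p pr
          echo "${{ github.event.number }}" > pr/number
      - uses: actions/upload-artifact@<full-commit-sha> # placeholder
        with:
          name: pr
          path: pr/
---
# File 2: .github/workflows/pr-followup.yml (hypothetical), privileged
name: pr-followup
on:
  workflow_run:
    workflows: [pr-check]
    types: [completed]
permissions:
  actions: read # needed to download another run's artifact
  pull-requests: write
jobs:
  comment:
    if: >
      github.event.workflow_run.event == 'pull_request' &&
      github.event.workflow_run.conclusion == 'success'
    runs-on: ubuntu-latest
    steps:
      # No checkout of PR code here. The artifact is treated as untrusted data.
      - uses: actions/download-artifact@<full-commit-sha> # placeholder
        with:
          name: pr
          path: pr
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ github.token }}
      - name: Validate the artifact before using it
        run: grep -Eq '^[0-9]+$' pr/number
```

The privileged workflow never executes anything from the pull request; it only
reads a small, validated data file.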
### 6. Common OSS hardening patterns for GitHub workflows are now well-defined
GitHub secure-use guidance and OpenSSF best-practice guidance converge on:
- pin actions to full commit SHAs
- restrict allowed actions where possible
- guard `.github/workflows/` with `CODEOWNERS`
- keep default branch protected
- require reviews and passing checks
- use code scanning / dependency review / secret scanning / Dependabot
- use private vulnerability reporting for public repos
These are standard public-repo practices, not enterprise-only overkill.
### 7. Cloudflare D1 already supports local-first development and tests
Cloudflare's D1 docs explicitly support:
- `wrangler dev` local mode
- `preview_database_id`
- `wrangler d1 migrations apply --local`
- test setups using Miniflare and `unstable_dev()`
That means D1 does not require a private repo or remote-only workflow. It fits
the "run locally on this machine, then automate" plan well.
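
A sketch of a CI job that exercises the local-first path. It assumes a Wrangler
config with a D1 binding for a database named `doesitarm-dev` and a migrations
directory already exist; the database name and the test script are
hypothetical, and the action SHA is a placeholder.

```yaml
jobs:
  d1-local:
    runs-on: ubuntu-latest
    permissions:
      contents: read
    steps:
      - uses: actions/checkout@<full-commit-sha> # placeholder
      - name: Apply migrations to the local D1 database
        # --local targets local state only, so no Cloudflare token is passed.
        run: npx --yes wrangler d1 migrations apply doesitarm-dev --local
      - name: Run tests against the local database
        run: ./scripts/test-d1.sh # hypothetical test entry point
```

The same `migrations apply --local` command works on a developer machine, so CI
and local development share one bootstrap path.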
### 8. Cloudflare Workflows and observability make Cloudflare a credible later home for ingestion
Cloudflare Workflows now positions itself as durable multi-step execution
with retries, persisted state, and debugging. Workers Logs and Traces provide
native observability. That is enough evidence to treat Cloudflare as a viable
later landing zone for scheduled ingestion and scan orchestration.
Inference:
GitHub Actions is still the easier first scheduler because it is already in the
repo, but Cloudflare Workflows has matured enough to stay in the plan as a
serious later option.
### 9. Kriasoft's monorepo shape is a good architectural fit, but not every exact convention should be copied blindly
`kriasoft/react-starter-kit` is a public monorepo with:
- `apps/`
- `packages/`
- `db/`
- `docs/`
- `infra/`
- `scripts/`
It also documents a public template env pattern where committed `.env`
contains placeholders/defaults and `.env.local` contains real credentials.
That shape is a strong fit for `doesitarm`, but I would adapt the env pattern
slightly for safety and clarity:
- keep a committed public template file such as `.env.example`
- keep real credentials in `.env.local`, `.dev.vars`, GitHub environment
secrets, and Cloudflare secrets
Inference:
Kriasoft's folder layout is the part worth copying directly. The exact env-file
naming should follow the least-confusing safe convention for this repo.
## Common Open-Source Patterns That Fit doesitarm
### Public code, private state
Keep public:
- app code
- scanner code
- D1 schema and migrations
- workflow definitions
- docs and plans
Keep private:
- deploy credentials and tokens
- raw Google Sheets exports or database backups
- downloaded app archives
- quarantine samples
- private test fixtures that would create redistribution or abuse risk
- operational dashboards and alert destinations
### Workspace monorepo with clear trust boundaries
Best-fit structure for `doesitarm`:
- `apps/web/` — Astro site and app-test UI
- `apps/default-worker/` — current `doesitarm-default`
- `apps/analytics-worker/` — current `workers/analytics`
- `apps/ingest/` or `apps/discovery/` — CLI/admin surface for discovery jobs
- `packages/scanner-core/` — shared scan engine and file-format logic
- `packages/source-runners/` — Homebrew/GitHub/download-page source runners
- `packages/data-model/` — shared D1 schema types, DTOs, validation
- `packages/site-build/` — list/build/export helpers
- `db/` — D1 migrations, seeds, import scripts, local test DB helpers
- `infra/` — Wrangler config, deploy config, policy docs
- `scripts/` — repo automation
- `docs/` — plans, research, operational docs
### Repo template files, not repo secrets
Common OSS pattern:
- commit `.env.example` or placeholder-only `.env`
- ignore `.env.local`, `.dev.vars`, and `.wrangler/`
- keep Cloudflare secrets in Workers secrets / GitHub environment secrets
### Hardened GitHub Actions for public forks
Common OSS pattern:
- default `permissions: { contents: read }`
- explicit per-job escalation only
- require approval for fork PR workflows where appropriate
- no self-hosted runners for public PRs
- no `pull_request_target` workflows that checkout/build PR code
### Supply-chain hygiene for workflows
Common OSS pattern:
- pin actions to full SHAs
- restrict allowed actions
- Dependabot for action updates
- CodeQL / code scanning for workflow vulnerabilities
- OpenSSF Scorecards for ongoing hygiene checks
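
Pinning to SHAs only stays maintainable if updates are automated. A minimal
Dependabot configuration for workflow actions could look like this; the weekly
interval is an arbitrary choice, not something the sources prescribe.

```yaml
# .github/dependabot.yml
version: 2
updates:
  # Opens PRs that bump the actions referenced under .github/workflows/.
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
```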
### Disclosure and scanning defaults
Common OSS pattern:
- enable private vulnerability reporting
- enable secret scanning and push protection
- keep a `SECURITY.md` policy
## What Works
- Keeping the repo public while moving secrets and sensitive data out of git
- Refactoring to a monorepo before adding more D1/discovery complexity
- Treating workflow files, `infra/`, and `db/` as protected surfaces with
`CODEOWNERS`
- Using GitHub-hosted runners for public CI and scheduled jobs
- Using environment-specific secrets with required reviewers for production
deployment jobs
- Using D1 local mode and local migrations as part of normal development
- Using Workers Logs/Traces or equivalent observability for scheduled jobs
- Storing raw archives and quarantine material in private object storage rather
than in the repo
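
As an illustration of the environment-secrets item above, a deploy job might be
shaped like this. It assumes a GitHub environment named `production` with
required reviewers already configured and a `CLOUDFLARE_API_TOKEN` secret
scoped to that environment; both names are assumptions, and the action SHA is a
placeholder.

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest # GitHub-hosted runner
    # The job waits for reviewer approval before environment secrets are released.
    environment: production
    permissions:
      contents: read
    steps:
      - uses: actions/checkout@<full-commit-sha> # placeholder
      - run: npx --yes wrangler deploy
        env:
          CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
```

Jobs that do not declare `environment: production` cannot read that
environment's secrets, which keeps PR and scan jobs away from deploy
credentials.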
## What To Avoid
- Do not move the whole repo private as a substitute for secrets hygiene
- Do not keep the current workflow behavior that prints secret-derived files to
CI logs
- Do not use self-hosted runners for public PR workflows
- Do not run archive downloads/extraction in privileged workflows that also have
deploy credentials
- Do not combine `pull_request_target` with explicit PR checkout/build steps
- Do not keep adding discovery/D1/worker code into the current flat root
- Do not commit raw import dumps, app archives, or structured secret blobs
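
One way to keep archive handling away from deploy credentials is to split a
scheduled workflow into an unprivileged scan job and a separate publish job.
This is a sketch under assumptions: both scripts, the artifact name, the cron
time, and the `production` environment are hypothetical, and the action SHAs
are placeholders.

```yaml
name: discovery-scan
on:
  schedule:
    - cron: "17 6 * * *" # arbitrary daily time, UTC
permissions:
  contents: read
jobs:
  scan:
    # Downloads and unpacks untrusted third-party archives. No secrets here.
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@<full-commit-sha> # placeholder
      - run: ./scripts/scan-archives.sh --out results.json # hypothetical
      - uses: actions/upload-artifact@<full-commit-sha> # placeholder
        with:
          name: scan-results
          path: results.json

  publish:
    # Holds credentials, but only consumes the scan output as data.
    needs: scan
    runs-on: ubuntu-latest
    environment: production # assumption: reviewed environment
    steps:
      - uses: actions/checkout@<full-commit-sha> # placeholder
      - uses: actions/download-artifact@<full-commit-sha> # placeholder
        with:
          name: scan-results
          path: results
      - run: ./scripts/publish-results.sh results/results.json # hypothetical
        env:
          CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
```

The publish job should still validate `results.json` before using it, since it
was produced by a job that handled untrusted files.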
## Recommendation
For `doesitarm`, the strongest next-step package is:
1. Refactor toward a Kriasoft-style monorepo shape adapted to pnpm.
2. Add a security-hardening stage before expanding automation.
3. Keep the repo public.
4. Keep secrets, raw operational data, and archive/quarantine material private.
5. Start scheduled discovery on GitHub-hosted runners with hardened workflows.
6. Keep Cloudflare Workflows as a second-phase target for durable ingestion.
Immediate high-priority actions to capture in the plan:
1. Remove secret printing from
[deploy-cloudflare-workers.yml](/.github/workflows/deploy-cloudflare-workers.yml)
and rotate affected secrets.
2. Add repo policy and tooling for:
- read-only default `GITHUB_TOKEN`
- pinned actions
- `CODEOWNERS` for `.github/workflows/`, `infra/`, and `db/`
- secret scanning / push protection
- private vulnerability reporting
3. Add ignored local-secret files for the new D1/Workers workflow:
- `.env.local`
- `.dev.vars`
- `.wrangler/`
4. Keep public PR CI on GitHub-hosted runners only.
5. Store raw archives/import snapshots outside the repo.
## Missing Information
- Whether the future ingestion runtime is expected to stay GitHub-first or
eventually move fully to Cloudflare Workers/Workflows.
- Whether there are legal or vendor-policy constraints around storing downloaded
app archives long term.
- Whether the monorepo refactor should keep Netlify as-is or consolidate more
runtime surfaces onto Cloudflare.
## Source Links
- GitHub Docs, `GITHUB_TOKEN` least-privilege and GitHub App escalation:
https://docs.github.com/en/actions/tutorials/authenticate-with-github_token
- GitHub Docs, secrets in Actions, fork-secret behavior, environment reviewers,
OIDC, and masking:
https://docs.github.com/en/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions
- GitHub Docs, secure use reference, pinning actions, CODEOWNERS, code scanning,
Dependabot, and Scorecards:
https://docs.github.com/en/actions/reference/security/secure-use
- GitHub Docs, self-hosted runner warning for public repositories:
https://docs.github.com/en/actions/how-tos/manage-runners/self-hosted-runners/add-runners
- GitHub Docs, limiting self-hosted runners in organizations:
https://docs.github.com/en/organizations/managing-organization-settings/disabling-or-limiting-github-actions-for-your-organization
- GitHub Docs, approval requirements for fork PR workflows:
https://docs.github.com/en/actions/managing-workflow-runs-and-deployments/managing-workflow-runs/approving-workflow-runs-from-public-forks
- GitHub Docs, repository Actions settings and fork workflow controls:
https://docs.github.com/github/administering-a-repository/managing-repository-settings/disabling-or-limiting-github-actions-for-a-repository
- GitHub Docs, secret scanning for public repositories:
https://docs.github.com/github/administering-a-repository/about-token-scanning
- GitHub Docs, enabling secret scanning / push protection:
https://docs.github.com/en/code-security/how-tos/secure-your-secrets/detect-secret-leaks/enabling-secret-scanning-for-your-repository
- GitHub Docs, enabling push protection:
https://docs.github.com/en/code-security/secret-scanning/enabling-secret-scanning-features/enabling-push-protection-for-your-repository
- GitHub Docs, private vulnerability reporting:
https://docs.github.com/en/code-security/security-advisories/working-with-repository-security-advisories/configuring-private-vulnerability-reporting-for-a-repository
- GitHub Security Lab, `pull_request_target` / `workflow_run` guidance:
https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/
- OpenSSF GitHub configuration best practices:
https://best.openssf.org/SCM-BestPractices/github/
- Kriasoft React Starter Kit:
https://github.com/kriasoft/react-starter-kit
- Cloudflare D1 local development:
https://developers.cloudflare.com/d1/best-practices/local-development/
- Cloudflare Workers observability:
https://developers.cloudflare.com/workers/observability/
- Cloudflare Workers logs:
https://developers.cloudflare.com/workers/observability/logs/
- Cloudflare Workers traces:
https://developers.cloudflare.com/workers/observability/traces/
- Cloudflare Workflows overview:
https://developers.cloudflare.com/workflows/
## Source Quality Notes
- Highest-confidence sources in this memo are GitHub Docs, GitHub Security Lab,
OpenSSF, Cloudflare Docs, and the Kriasoft repository itself.
- HN/Lobsters did not surface a materially better competing pattern in this
pass; the most useful HN signal reinforced GitHub Security Lab's warning on
`pull_request_target`.
- The recommendation to keep the repo public but move operational data private
is a synthesis from official guidance plus this repo's current shape and risk
surface.