mirror of
https://github.com/ThatGuySam/doesitarm.git
synced 2026-05-18 06:44:46 -07:00
docs(plan): add discovery and deploy follow-up research
Capture the next discovery, security, compatibility-data, and dual-deploy planning work, and ignore local Vercel/env state that should not be committed. This keeps the operational research with the repo while avoiding accidental local-config churn.

Constraint: Must not alter production runtime behavior
Rejected: Fold research notes into the runtime fix commit | obscures the user-facing app-test correction with planning-only material
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep .omx local state untracked even when committing broad workspace updates
Tested: Document review only
Not-tested: No runtime verification required for docs and ignore rules
This commit is contained in:
parent
e667ab564e
commit
1248c705b0
6 changed files with 1634 additions and 0 deletions
@@ -0,0 +1,600 @@
# Desktop App Compatibility Data Strategy For doesitarm

Tease: In 2026, the winning play is not "AI SEO." It is publishing a high-trust, machine-readable compatibility corpus that is genuinely useful on the open web and selectively keeping the operational and proprietary layers private.

Lede: For `doesitarm` on 2026-04-04, the best-fit strategy is to treat the site as an entity-and-evidence graph for desktop software compatibility: publish canonical app pages, provenance-rich evidence, structured exports, and selective machine-readable surfaces for discovery; keep raw crawls, binary artifacts, candidate matches, scoring logic, and operational intelligence private.

Why it matters:
- This project can outlive the Apple Silicon transition if the core model is "desktop software compatibility knowledge," not just "Apple Silicon list posts."
- Google's 2025 AI search guidance still rewards the same fundamentals: unique content, crawlable pages, textual clarity, and trustworthy evidence, not special AI-only tricks.
- OpenAI and Anthropic now expose separate search, user-action, and training bots, which means "open versus closed" is no longer binary. You can choose visibility, training access, and operational exposure separately.

Go deeper:
- Think of the public site as a citation layer and decision-support layer, not as the full warehouse.
- Publish public facts, provenance, timestamps, and curated exports. Keep raw ingestion, low-confidence candidates, and monetizable workflow intelligence private.
- Treat `llms.txt` and Markdown exports as helpful secondary surfaces, not as the core strategy. The core strategy is still clean HTML, canonical URLs, structured data, sitemaps, and useful pages.

Date: 2026-04-04

## Scope

Research how to think about a long-lived desktop app compatibility database as a
content, SEO, and AI-discoverability system in 2026, including:

- best practices for public content architecture
- how LLM-driven discovery changes the picture
- what data should likely stay public versus private
- what audiences this data can serve
- tradeoffs between more-open and more-closed approaches

## Short Answer

Build `doesitarm` as a public knowledge product with a private operating system
underneath it.

Publicly, publish:

- canonical app pages
- compatibility status by platform/environment
- evidence summaries and source links
- timestamps, changelogs, and history
- stable IDs, taxonomy, and machine-readable metadata
- a limited public API or snapshot exports for high-value reuse

Privately, keep:

- raw crawls and downloaded binaries
- candidate entities before review
- normalization, dedupe, and confidence logic
- crawler logs, abuse rules, and infrastructure controls
- enrichment that creates monetizable leverage rather than user value on the open web

The biggest strategic shift from 2018 to 2026 is this:

1. Search still rewards useful original pages.
2. AI discovery mostly rides on those same pages.
3. Separate crawler controls now let you be open for search while staying more closed for training.
4. The moat is less "having any compatibility data at all" and more:
   verification quality, provenance, freshness, historical depth, and workflow speed.

Inference:
No single source states that exact four-part conclusion. It is the synthesis that
best fits the repo state plus current Google, OpenAI, Anthropic, Cloudflare,
HN, and Lobsters evidence.

## What The Repo Already Knows

- The project already acts like a compatibility corpus, not just a blog:
  [README.md](/Users/athena/Code/doesitarm/README.md) is a manually curated,
  source-linked compatibility list.
- The repo already has a plan to move toward a canonical database and discovery
  pipeline:
  [docs/plans/app-discovery-d1-automation.md](/Users/athena/Code/doesitarm/docs/plans/app-discovery-d1-automation.md).
- The public site already exposes crawlable pages, a sitemap, and permissive
  crawling:
  [static/robots.txt](/Users/athena/Code/doesitarm/static/robots.txt),
  [static/sitemap-index.xml](/Users/athena/Code/doesitarm/static/sitemap-index.xml).
- The current public JSON already exposes useful app-level fields such as name,
  aliases, status, bundle IDs, related links, scan details, and device support:
  [static/api/app/spotify.json](/Users/athena/Code/doesitarm/static/api/app/spotify.json).
- The current structured data implementation is narrow and video-centric:
  [helpers/structured-data.js](/Users/athena/Code/doesitarm/helpers/structured-data.js),
  [helpers/listing-page.js](/Users/athena/Code/doesitarm/helpers/listing-page.js),
  [helpers/config-node.js](/Users/athena/Code/doesitarm/helpers/config-node.js).
- I did not find a checked-in `llms.txt`, `llms-full.txt`, or per-page Markdown
  export surface.
- I also did not find `SoftwareApplication` or `Dataset` structured data on app
  or dataset pages.

Inference:
`doesitarm` already has enough public data shape to become a strong
machine-readable corpus. The main gap is not "inventing the dataset." The gap is
formalizing and publishing the right layers of it.

## What The Evidence Says

### 1. Google AI search still wants normal SEO fundamentals, not special AI tricks

Google's current AI-features guidance says there are no extra technical
requirements for AI Overviews or AI Mode beyond normal Search eligibility.
Google explicitly says you do not need new AI files or special schema just to
appear in AI features.

What does matter:

- crawl access
- internal links
- page experience
- important content in textual form
- structured data matching visible text
- unique, non-commodity content

This is the strongest argument against building an "AI discoverability" strategy
around gimmicks alone.

### 2. Large-scale thin template pages are a real risk

Google's helpful-content and spam-policy guidance is directly relevant to
programmatic compatibility sites:

- people-first content is favored
- pages made mainly to attract search visits are a warning sign
- scaled content abuse includes generating many low-value pages, including from
  feeds or automated transformations

That means a compatibility database can absolutely win in search, but only if
its pages add decision-making value. Thin pages that just restate a status field
are dangerous.

### 3. Compatibility content should look more like tested reviews than like directory filler

Google's reviews guidance is a good proxy for compatibility pages because users
often arrive with a purchase, migration, or workflow decision in mind.

The guidance consistently rewards:

- original research
- first-hand evidence
- quantitative measurements where relevant
- comparisons
- what changed across versions
- benefits and drawbacks

For `doesitarm`, that maps cleanly onto:

- status by environment
- last verified date
- evidence links
- scanner output or screenshots where appropriate
- "what changed" changelog notes
- comparison pages like native vs Rosetta vs virtualization vs cloud workaround

### 4. Dataset markup is useful, but it should describe real dataset landing pages

Google's dataset documentation recommends canonical landing pages plus dataset
metadata such as `sameAs`, `isBasedOn`, identifiers, license, and download
distribution metadata.

That is a strong fit for curated exports such as:

- a public daily or weekly compatibility snapshot
- a historical archive by date
- vendor- or category-specific exports
- a Windows-on-ARM or future-transition slice later on

Important nuance:
Google's dataset docs are about Dataset Search discovery, not a substitute for
general web SEO. Dataset markup helps when you actually publish datasets.
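
A dataset landing page for one of these exports could carry JSON-LD along the following lines. This is a minimal sketch under stated assumptions, not the repo's implementation: the `build_dataset_jsonld` helper, the `example.com` URLs, the `/datasets/compatibility/` path, and the CC BY 4.0 license are placeholders. Only the property names come from Schema.org's `Dataset` and `DataDownload` types.

```python
import json
from typing import Optional


def build_dataset_jsonld(snapshot_date: str, base_url: str,
                         is_based_on: Optional[str] = None) -> dict:
    """Build Dataset JSON-LD for one snapshot landing page (hypothetical helper)."""
    # Assumed URL layout: one canonical landing page per dated snapshot.
    landing = f"{base_url}/datasets/compatibility/{snapshot_date}/"
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": f"Desktop app compatibility snapshot {snapshot_date}",
        "description": "Curated compatibility status per app and environment, "
                       "with source links and last-verified dates.",
        "url": landing,
        "identifier": landing,
        # Placeholder license; the real choice is a project decision.
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "dateModified": snapshot_date,
        "distribution": [
            {"@type": "DataDownload",
             "encodingFormat": "application/json",
             "contentUrl": f"{landing}apps.json"},
            {"@type": "DataDownload",
             "encodingFormat": "text/csv",
             "contentUrl": f"{landing}apps.csv"},
        ],
    }
    if is_based_on:
        # Only emit isBasedOn when the snapshot really derives from another work.
        doc["isBasedOn"] = is_based_on
    return doc


dataset_doc = build_dataset_jsonld("2026-04-04", "https://example.com")
print(json.dumps(dataset_doc, indent=2))
```

The markup only makes sense on a page that actually offers the listed downloads, which matches the nuance above.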

### 5. `SoftwareApplication` markup fits the entity model, but Google rich-result requirements are narrower

Schema.org's `SoftwareApplication` type supports fields that are very relevant
here, including:

- `applicationCategory`
- `downloadUrl`
- `featureList`
- `operatingSystem`
- `softwareRequirements`
- `softwareVersion`
- `supportingData`

Google also has a software-app structured-data feature, but its rich-result
requirements are more commerce-shaped, including `offers.price` and
review/rating support. That means:

- use `SoftwareApplication` semantics where they match the visible page truth
- do not invent store-like fields just to chase rich results
- use dataset markup for exports and software/entity markup for canonical app
  pages
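
One way to honor "match the visible page truth" is to emit only the properties the page actually has data for, and to leave out store-like fields entirely. The sketch below assumes a hypothetical app record shape (`name`, `category`, `operating_system`, and so on); those keys and the `build_app_jsonld` helper are illustrative, not the existing `helpers/structured-data.js` API.

```python
import json


def build_app_jsonld(app: dict) -> dict:
    """Map a hypothetical app record onto SoftwareApplication JSON-LD."""
    candidate = {
        "@context": "https://schema.org",
        "@type": "SoftwareApplication",
        "name": app.get("name"),
        "applicationCategory": app.get("category"),
        "operatingSystem": app.get("operating_system"),
        "softwareVersion": app.get("version"),
        "softwareRequirements": app.get("requirements"),
        "featureList": app.get("features"),
        "downloadUrl": app.get("download_url"),
    }
    # Drop anything the record does not actually know. No offers or ratings
    # are synthesized, since the page does not show prices or reviews.
    return {key: value for key, value in candidate.items() if value}


example_app = {
    "name": "Example App",          # illustrative record, not real data
    "category": "MultimediaApplication",
    "operating_system": "macOS",
    "version": None,                # unknown, so it is omitted from output
}
app_doc = build_app_jsonld(example_app)
print(json.dumps(app_doc, indent=2))
```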

### 6. AI discoverability is now bot-by-bot, not one global yes/no

OpenAI and Anthropic both now distinguish between different AI access modes.

OpenAI:

- `OAI-SearchBot` is for search inclusion
- `GPTBot` is for training
- `ChatGPT-User` is for user-triggered actions

Anthropic:

- `ClaudeBot` is for training
- `Claude-SearchBot` is for search quality
- `Claude-User` is for user-triggered retrieval

This is strategically important. You no longer need to choose only between:

- fully public for every AI purpose
- fully blocked for every AI purpose

You can allow discovery while disallowing training, or allow search while
tightly managing user-action access, depending on your goals.
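
As an illustration of a bot-by-bot policy, the sketch below renders a `robots.txt` that allows the search and user-action agents named above while disallowing the training agents. The allow/deny choices are one possible policy, not a recommendation made by this memo, and the `render_robots_txt` helper and sitemap URL are hypothetical.

```python
from typing import Dict, Optional

# True = allow crawling, False = disallow. Agent names are the ones listed
# above; the policy values are an assumed example.
BOT_POLICY: Dict[str, bool] = {
    "OAI-SearchBot": True,
    "ChatGPT-User": True,
    "GPTBot": False,
    "Claude-SearchBot": True,
    "Claude-User": True,
    "ClaudeBot": False,
}


def render_robots_txt(policy: Dict[str, bool],
                      sitemap_url: Optional[str] = None) -> str:
    """Render one User-agent group per bot, then a permissive default group."""
    lines = []
    for agent, allowed in policy.items():
        lines.append(f"User-agent: {agent}")
        lines.append("Allow: /" if allowed else "Disallow: /")
        lines.append("")
    lines += ["User-agent: *", "Allow: /"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


robots_text = render_robots_txt(BOT_POLICY, "https://example.com/sitemap-index.xml")
print(robots_text)
```

`robots.txt` is advisory, so this expresses intent rather than enforcement; enforcement would need server-side or CDN controls.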

### 7. `llms.txt` is real, but it is still a secondary signal

Cloudflare has implemented `llms.txt`, `llms-full.txt`, and per-page Markdown
exports, and Simon Willison has highlighted similar docs-map patterns as useful
for agent tooling.

That said:

- Google explicitly says no special AI text files are required for AI features
- OpenAI's discoverability guidance focuses on crawler access, `noindex`, and
  citation/linking, not `llms.txt`
- HN and Lobsters discussions show real skepticism around AI crawler incentives
  and how consistently emerging conventions are respected

Best interpretation:

- `llms.txt` is worth adding because it is cheap and increasingly recognized
- it should not be treated as the core lever
- the core lever is still strong public pages plus clean machine-readable
  content
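
Part of why `llms.txt` is cheap is that it is just a small Markdown index. The sketch below follows the commonly described layout (an H1 title, a blockquote summary, then H2 sections of annotated links); the section names, URLs, and the `render_llms_txt` helper are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str, str]  # (label, url, short note)


def render_llms_txt(title: str, summary: str,
                    sections: Dict[str, List[Link]]) -> str:
    """Render an llms.txt-style Markdown index (hypothetical helper)."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for label, url, note in links:
            lines.append(f"- [{label}]({url}): {note}")
        lines.append("")
    return "\n".join(lines)


llms_text = render_llms_txt(
    "doesitarm",
    "Desktop app compatibility status with evidence and last-verified dates.",
    {
        "Apps": [
            ("Example App", "https://example.com/app/example-app.md",
             "compatibility status and evidence"),
        ],
        "Datasets": [
            ("Latest snapshot", "https://example.com/datasets/compatibility/latest/",
             "bulk JSON and CSV export"),
        ],
    },
)
print(llms_text)
```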

### 8. AI-friendly plain-text and Markdown surfaces do have practical value

Cloudflare's docs work here is the clearest practical example:

- per-page Markdown versions
- an index file
- bulk text export
- semantic HTML
- `noindex` on low-value or confusing pages

This is less about search ranking and more about:

- making retrieval cheaper and more accurate for agents
- improving citation quality
- reducing token waste
- giving your own future agents and partners a stable ingest format

For a compatibility corpus, that suggests public Markdown or JSON exports are
worth doing for the canonical facts layer.
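
A per-page Markdown export for the canonical facts layer could be as small as the sketch below. The record shape (`status_by_environment`, `evidence`, `last_verified`) and the sample values are hypothetical and do not mirror the current `static/api/app/*.json` schema.

```python
def render_app_markdown(app: dict) -> str:
    """Render one app's public facts as a compact Markdown page."""
    lines = [
        f"# {app['name']}",
        "",
        f"Last verified: {app['last_verified']}",
        "",
        "## Status by environment",
        "",
    ]
    for environment, status in app["status_by_environment"].items():
        lines.append(f"- {environment}: {status}")
    lines += ["", "## Evidence", ""]
    for label, url in app["evidence"]:
        lines.append(f"- [{label}]({url})")
    return "\n".join(lines) + "\n"


example_page = render_app_markdown({
    "name": "Example App",  # illustrative record, not real data
    "last_verified": "2026-04-04",
    "status_by_environment": {
        "macOS / arm64 / native": "supported",
        "macOS / arm64 / translated": "supported",
    },
    "evidence": [("Vendor release notes", "https://example.com/release-notes")],
})
print(example_page)
```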

### 9. Freshness and URL discovery matter more as the corpus grows

Google recommends sitemaps and Search Console monitoring.
IndexNow gives faster change pings for engines that support it, including Bing.

For a frequently updated compatibility corpus, this argues for:

- canonical landing pages
- clean sitemap generation
- changelog feeds or update streams
- optional IndexNow support for faster non-Google discovery
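
The two freshness mechanisms above can share one list of changed URLs. The sketch below builds a sitemap with `lastmod` values and an IndexNow-style JSON body without sending anything. The host, key, and key-file location are placeholders, and the payload field names (`host`, `key`, `keyLocation`, `urlList`) reflect my reading of the IndexNow protocol and should be checked against its documentation before use.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Tuple

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: Iterable[Tuple[str, str]]) -> str:
    """Serialize (url, lastmod) pairs into a sitemap <urlset> document."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


def build_indexnow_payload(host: str, key: str, changed_urls: List[str]) -> dict:
    """Build the JSON body for a bulk change ping (nothing is sent here)."""
    return {
        "host": host,
        "key": key,
        # Assumes the key file is served from the site root.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(changed_urls),
    }


changed = [("https://example.com/app/example-app/", "2026-04-04")]
sitemap_xml = build_sitemap(changed)
ping_body = build_indexnow_payload("example.com", "placeholder-key",
                                   [loc for loc, _ in changed])
print(sitemap_xml)
print(ping_body)
```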

### 10. The crawl environment is getting more adversarial

Cloudflare Radar reported AI and search crawling growth of 18% from May 2024 to
May 2025 across its measured cohort, with `GPTBot` up 305%.
HN and Lobsters operator discussions show why this matters in practice:

- some AI crawlers create real infrastructure cost
- incentives are less aligned than classic web search
- operators increasingly need bot-specific controls, rate limiting, and
  selective exposure

This is the best evidence for keeping raw and high-cost surfaces private even if
you lean more open on the public facts layer.
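
Bot-specific controls start with knowing which crawlers are hitting the site. The sketch below tallies requests by declared purpose using substring matches on the user-agent header. The purpose labels follow the OpenAI and Anthropic bot roles described earlier; the sample user-agent strings are simplified illustrations, and because the header can be spoofed this is a measurement aid rather than an access control.

```python
from collections import Counter
from typing import Iterable

# Lowercased user-agent tokens mapped to the purpose each vendor documents.
BOT_PURPOSES = {
    "oai-searchbot": "search",
    "claude-searchbot": "search",
    "gptbot": "training",
    "claudebot": "training",
    "chatgpt-user": "user-action",
    "claude-user": "user-action",
}


def classify_user_agent(user_agent: str) -> str:
    """Return the declared purpose for a known AI bot, else 'other'."""
    lowered = user_agent.lower()
    for token, purpose in BOT_PURPOSES.items():
        if token in lowered:
            return purpose
    return "other"


def tally_by_purpose(user_agents: Iterable[str]) -> Counter:
    """Count requests per purpose so crawl cost can be compared to value."""
    return Counter(classify_user_agent(ua) for ua in user_agents)


sample_log = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",         # simplified examples,
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",      # not exact vendor strings
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (Macintosh) Safari/605.1.15",
]
purpose_counts = tally_by_purpose(sample_log)
print(purpose_counts)
```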

## Ways This Data Can Create Value

### Human audiences

- End users deciding whether they can keep using a favorite app on new hardware.
- IT, procurement, and upgrade planners deciding when a transition is safe.
- Developers and vendors tracking native support gaps and competitive pressure.
- Journalists and analysts covering platform transitions.
- Researchers and historians studying how ecosystems adapt to hardware changes.

### Machine audiences

- Search engines indexing canonical app, category, and comparison pages.
- LLM search products citing your pages as evidence.
- RAG systems consuming public snapshots or APIs.
- Agents answering migration, procurement, or troubleshooting questions.
- Internal `doesitarm` automation using the same canonical public layer as a
  stable reference surface.

### Business-model value

- Audience growth from high-intent compatibility queries.
- Affiliate or sponsored monetization on truly decision-support pages.
- Paid APIs, bulk exports, or enterprise dashboards.
- Vendor intelligence and alerting.
- Historical transition data as a differentiated research asset.

Inference:
The public facts are likely to commoditize over time. The durable value is the
combination of breadth, freshness, provenance, history, and tooling layered on
top of those facts.

## What Should Likely Stay Public

Public-by-default fields:

- stable app identifier and canonical URL
- app name, aliases, vendor, category, platform family
- compatibility status by environment
- environment dimensions such as CPU architecture, OS family/version, native
  vs translation vs virtualization
- bundle IDs and installer/package metadata where safe and user-useful
- last verified date, first seen date, last changed date
- public evidence summary and source links
- changelog summary for status changes
- category and comparison pages built from real user tasks
- curated JSON, CSV, or Parquet snapshot exports
- public structured data and sitemaps

Public page types that seem high-value:

- canonical app pages
- category pages
- "best alternatives if not native yet" pages
- transition pages such as "best native DAWs on Apple Silicon"
- comparison pages by use case, hardware generation, and workaround path
- dataset landing pages for bulk exports

## What Should Likely Stay Private

Private-by-default fields:

- raw crawled HTML and downloaded ZIP/DMG/PKG artifacts
- extracted binaries and quarantine samples
- low-confidence matches and candidate entities
- dedupe, normalization, and scoring heuristics
- reviewer notes, moderation notes, and dispute state
- crawler logs, IP intelligence, WAF rules, and abuse signatures
- affiliate economics, contact records, outreach state, and deal terms
- internal confidence models, embeddings, and experimental feature engineering
- unpublished source mappings and scrape recipes that are costly to build

Why keep these private:

- operational risk
- legal and hosting risk
- abuse resistance
- clearer moat
- lower copyability
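
The public/private split is easiest to enforce as an allowlist applied at export time, so a new internal field stays private until someone deliberately publishes it. The field names below are hypothetical stand-ins for the categories listed above, not an existing schema.

```python
# Allowlist of fields that may leave the private store (assumed names).
PUBLIC_FIELDS = frozenset({
    "app_id",
    "canonical_url",
    "name",
    "aliases",
    "vendor",
    "category",
    "status_by_environment",
    "last_verified",
    "evidence_links",
})


def to_public_record(internal_record: dict) -> dict:
    """Project an internal record onto the public allowlist.

    Anything not explicitly allowlisted is dropped, including fields that
    did not exist when the allowlist was written.
    """
    return {key: value for key, value in internal_record.items()
            if key in PUBLIC_FIELDS}


internal = {
    "app_id": "example-app",            # illustrative values only
    "name": "Example App",
    "status_by_environment": {"macOS / arm64 / native": "supported"},
    "match_confidence": 0.42,           # private: scoring heuristic output
    "reviewer_notes": "needs recheck",  # private: moderation state
}
public_record = to_public_record(internal)
print(public_record)
```

A denylist would be shorter to write, but it fails open when the internal schema grows, which is the wrong default for this split.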

## Different Ways To Think About The Database

### 1. Directory / programmatic SEO system

Upside:
- fastest traffic growth if executed well

Downside:
- easiest to drift into thin pages and scaled-content abuse
- weakest long-term moat

Use this frame only if every template answers a real question better than a
generic directory would.

### 2. Public knowledge graph with evidence

Upside:
- strongest fit for search, citations, and trust
- best long-term reuse across Apple, Windows, and future transitions

Downside:
- requires stronger data modeling and provenance discipline

This is the best framing for `doesitarm`.

### 3. Public publication layer over a private intelligence system

Upside:
- best balance of discoverability and defensibility
- easiest path to enterprise/API products later

Downside:
- more operational complexity

This is the recommended operating model.

### 4. Mostly closed database with selective public summaries

Upside:
- strongest direct control over assets

Downside:
- weakest SEO and AI discoverability
- hardest to build brand authority from the data itself

This makes sense only if monetization depends more on closed workflows than on
being the public authority.

## Open Vs Closed Strategy Options

### Option 1. Open facts, private operations

Publish:

- canonical pages
- evidence summaries
- limited exports
- structured data

Keep private:

- raw ingestion
- candidate pipeline
- scoring and ops

Tradeoff:
Best overall balance of discoverability, trust, and defensibility.

### Option 2. Open pages, paid API / paid bulk data

Publish:

- strong pages for discovery and citations
- free lightweight API or delayed snapshots

Charge for:

- real-time API
- higher limits
- historical depth
- enterprise filters and alerts

Tradeoff:
Strong monetization path, but requires clearer product packaging.

### Option 3. Fully open data commons

Publish:

- everything except unsafe raw binaries/secrets

Tradeoff:
Maximum goodwill, citation, and reuse.
Minimum moat unless monetization shifts to services, sponsorship, or community
leadership.

### Option 4. Selective access / crawler monetization layer

Publish:

- normal web pages

Control:

- which bots crawl
- whether training is allowed
- whether some crawlers must pay

Tradeoff:
Promising middle path, especially as crawler monetization standards mature, but
still early and not something to build the whole strategy around yet.

## Recommendation

For `doesitarm`, use Option 1 now, with a path to Option 2 later.

Concretely:

1. Treat the database as transition-agnostic.
   Use dimensions like `platform_family`, `cpu_arch`, `translation_layer`,
   `virtualization_layer`, `os_version`, `artifact_type`, and
   `verification_method` so the same model can cover Apple Silicon, Windows on
   ARM, or the next Apple transition.

2. Build a public canonical facts layer.
   Each app should have a canonical page with:
   status, environments, timestamps, evidence links, and short synthesis.

3. Build a public dataset layer.
   Publish periodic snapshots with dataset landing pages, license, provenance,
   versioning, and download metadata.

4. Keep ingestion and raw evidence private.
   Store raw downloads, scrape traces, matching logic, and low-confidence
   candidates outside the public repo and public site.

5. Add public machine-readable surfaces in this order:
   - `SoftwareApplication`-style entity markup where it truthfully matches page content
   - dataset landing pages plus `Dataset` / `DataDownload` metadata for exports
   - stable JSON or CSV snapshots
   - `llms.txt` and Markdown exports as secondary aids

6. Make public pages citation-friendly.
   Add clear authorship, methodology, "how we know", last verified date, and
   source links.

7. Avoid index bloat.
   Keep canonical entity and high-intent comparison pages indexable.
   Use `noindex` or crawl controls for low-value filter permutations and stale
   or confusing pages.

8. Measure before deciding how open to be.
   Track:
   - Search Console web traffic
   - ChatGPT referral traffic via `utm_source=chatgpt.com`
   - bot traffic by user agent
   - crawl cost versus referral value
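
To make the transition-agnostic model in step 1 concrete, here is a minimal sketch that uses the dimension names from that step as fields. The split into `Environment` and `CompatibilityFact`, the extra `app_id`, `status`, and `last_verified` fields, and all sample values are assumptions for illustration rather than the planned schema.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Environment:
    """Where an app was evaluated, independent of any single transition."""
    platform_family: str
    cpu_arch: str
    os_version: str
    translation_layer: Optional[str] = None      # None means native execution
    virtualization_layer: Optional[str] = None   # None means no VM involved


@dataclass(frozen=True)
class CompatibilityFact:
    """One verified statement about one app in one environment."""
    app_id: str
    environment: Environment
    status: str
    artifact_type: str
    verification_method: str
    last_verified: str


# The same types describe an Apple Silicon case and a Windows on ARM case.
mac_native = Environment("macos", "arm64", "15")
mac_translated = Environment("macos", "arm64", "15", translation_layer="rosetta-2")
windows_arm = Environment("windows", "arm64", "11", translation_layer="x64-emulation")

example_fact = CompatibilityFact(
    app_id="example-app",               # illustrative values only
    environment=mac_translated,
    status="supported",
    artifact_type="dmg",
    verification_method="binary-scan",
    last_verified="2026-04-04",
)
print(asdict(example_fact))
```

Keeping translation and virtualization as separate optional fields lets "native", "translated", and "virtualized" be derived from the data instead of stored as a platform-specific label.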

Inference:
The best long-term moat is not withholding all facts. It is being the most
trusted and most reusable source for those facts, while keeping the expensive
and differentiating machinery private.

## Near-Term Next Steps For doesitarm

1. Add a public data-contract document describing the canonical app entity,
   environment entity, evidence entity, and snapshot dataset.
2. Expand app pages from "status page" to "evidence page":
   include methodology, last verified date, change history, and source
   attribution.
3. Add structured data intentionally:
   entity markup for app pages, dataset markup for exports, not generic markup
   everywhere.
4. Add a public snapshot export and a dataset landing page.
5. Add a bot-policy matrix to `robots.txt` planning:
   Google search, OpenAI search, Anthropic search, training bots, and user bots.
6. Add `llms.txt` only after the public canonical and export layers are clean.
7. Keep filters/search-result pages from becoming the primary indexable surface.

## Source Links

- Repo context:
  - [README.md](/Users/athena/Code/doesitarm/README.md)
  - [docs/plans/app-discovery-d1-automation.md](/Users/athena/Code/doesitarm/docs/plans/app-discovery-d1-automation.md)
  - [static/robots.txt](/Users/athena/Code/doesitarm/static/robots.txt)
  - [helpers/structured-data.js](/Users/athena/Code/doesitarm/helpers/structured-data.js)
  - [helpers/listing-page.js](/Users/athena/Code/doesitarm/helpers/listing-page.js)
  - [helpers/config-node.js](/Users/athena/Code/doesitarm/helpers/config-node.js)
  - [static/api/app/spotify.json](/Users/athena/Code/doesitarm/static/api/app/spotify.json)

- Google AI features and AI search:
  - https://developers.google.com/search/docs/appearance/ai-features
  - https://developers.google.com/search/blog/2025/05/succeeding-in-ai-search
  - https://developers.google.com/search/docs/fundamentals/creating-helpful-content
  - https://developers.google.com/search/docs/essentials/spam-policies
  - https://developers.google.com/search/docs/fundamentals/using-gen-ai-content

- Google review and structured-data guidance:
  - https://developers.google.com/search/docs/appearance/reviews-system
  - https://developers.google.com/search/docs/specialty/ecommerce/write-high-quality-reviews
  - https://developers.google.com/search/docs/appearance/structured-data/software-app
  - https://developers.google.com/search/docs/appearance/structured-data/dataset
  - https://developers.google.com/search/docs/appearance/structured-data/sd-policies
  - https://developers.google.com/search/docs/crawling-indexing/crawling-managing-faceted-navigation
  - https://developers.google.com/search/docs/crawling-indexing/block-indexing

- Schema and dataset modeling:
  - https://schema.org/SoftwareApplication

- OpenAI:
  - https://help.openai.com/en/articles/12627856-publishers-and-developers-faq
  - https://help.openai.com/en/articles/9237897-chatgpt-search
  - https://platform.openai.com/docs/gptbot

- Anthropic:
  - https://support.claude.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
  - https://docs.anthropic.com/en/docs/build-with-claude/search-results

- Cloudflare:
  - https://developers.cloudflare.com/style-guide/how-we-docs/ai-consumability/
  - https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/
  - https://blog.cloudflare.com/introducing-pay-per-crawl/

- Discovery and freshness:
  - https://www.indexnow.org/index

- Practitioner and discussion context:
  - https://simonwillison.net/2025/Oct/24/claude-code-docs-map/
  - https://news.ycombinator.com/item?id=41072549
  - https://lobste.rs/s/dmuad3/mitigating_sourcehut_s_partial_outage

## Source Quality Notes

- Google Search Central, OpenAI, Anthropic, Schema.org, IndexNow, and Cloudflare
  were the primary sources for current guidance.
- The HN and Lobsters links were useful for operator sentiment and failure modes,
  not as primary authority for ranking behavior.
- `llms.txt` appears real and increasingly implemented, but the strongest
  current evidence still says it is supplemental rather than foundational.

222
docs/research/private-public-repo-sync-patterns-2026-04-04.md
Normal file
@@ -0,0 +1,222 @@
# Private/Public Repo Sync Patterns For doesitarm

Tease: Git can push to a second remote, but it does not natively maintain a safe long-lived branch that "merges everything except `docs/`."

Lede: For `doesitarm` on 2026-04-04, the best-fit pattern is a second repo/remote plus an automated one-way export from `origin/master` that rewrites out `docs/` and force-pushes the result.

Why it matters:
- `docs/` is already present on `origin/master`, so excluding it from a future mirror does not make it private retroactively.
- `docs/` does not appear to participate in the app build, which makes export-time exclusion low-risk.
- Cross-repo automation on GitHub needs separate auth; the default workflow token is not enough for another private repo.

Go deeper:
- Simple one-way sync: fresh clone -> `git filter-repo --path docs/ --invert-paths` -> force-push to a second remote.
- Faster repeated projection: evaluate `splitsh-lite` if export speed or repeatability becomes important.
- Advanced bidirectional filtered views: `josh` is the serious option, but it is heavier than this repo likely needs.

Date: 2026-04-04

## Scope

Research whether `doesitarm` should support a `private-main`-style branch or
remote that automatically tracks the public default branch while excluding
paths such as `docs/`, and identify better patterns if they exist.

## Short Answer

Yes, you can push to a different remote from the public repo.

No, the durable pattern is not a long-lived `private-main` merge branch that
keeps deleting `docs/` after every merge. Git branches and merges are full-tree
operations, and sparse checkout does not change that.

For this repo, the cleanest pattern is:

1. Keep `origin/master` as the source branch.
2. Add a second remote or second repository for the private target.
3. Run a one-way export job on each push to `master`.
4. In that job, create a sanitized tree/history with `docs/` removed.
5. Force-push the sanitized result to the private remote branch.
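
The export job in steps 3 to 5 can be sketched as three Git invocations driven from a small script. This is a minimal sketch under stated assumptions: `git-filter-repo` is installed separately, the repository URLs and working directory are placeholders, and pushing by URL assumes credentials for the private target are already configured. The `export_commands` and `run_export` helpers are hypothetical names, and the script defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List, Optional, Tuple

Command = Tuple[List[str], Optional[str]]  # (argv, working directory)


def export_commands(source_url: str, private_url: str, workdir: str,
                    branch: str = "master",
                    excluded_path: str = "docs/") -> List[Command]:
    """Return the clone, rewrite, and push steps for a one-way export."""
    return [
        # Fresh single-branch clone, so history is rewritten in a throwaway copy.
        (["git", "clone", "--single-branch", "--branch", branch,
          source_url, workdir], None),
        # Keep everything except the excluded path.
        (["git", "filter-repo", "--path", excluded_path, "--invert-paths"],
         workdir),
        # The export is generated output, so force-pushing is expected.
        (["git", "push", "--force", private_url,
          f"HEAD:refs/heads/{branch}"], workdir),
    ]


def run_export(commands: List[Command], dry_run: bool = True) -> None:
    """Print each step, and execute it only when dry_run is False."""
    for argv, cwd in commands:
        print("+", " ".join(argv), f"(cwd={cwd})" if cwd else "")
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=True)


steps = export_commands(
    "https://example.com/public/doesitarm.git",   # placeholder URLs
    "https://example.com/private/doesitarm.git",
    "/tmp/doesitarm-export",
)
run_export(steps, dry_run=True)
```

Rebuilding from a fresh clone on every run keeps the job stateless, at the cost of re-filtering the whole history each time.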

If the real requirement is that future docs stay private, the better topology
is the reverse: keep the canonical repo private and generate the public export.

Inference:
That last recommendation is based on the current repo state: `docs/` is already
committed on `origin/master`, so a private mirror without `docs/` only changes
future distribution, not past public exposure.

## What The Repo Already Knows

- The default remote today is only `origin`, pointed at the public GitHub repo.
- The default tracked branch is `origin/master`, not `main`.
- There is no checked-in `.github/` workflow that already handles cross-repo
  sync.
- `docs/` currently contains repo-local planning and research material:
  [docs/app-flow.md](/Users/athena/Code/doesitarm/docs/app-flow.md),
  [docs/plans/app-test-typescript-refactor.md](/Users/athena/Code/doesitarm/docs/plans/app-test-typescript-refactor.md),
  and dated research memos under
  [docs/research](/Users/athena/Code/doesitarm/docs/research).
- `docs/` is already present in `origin/master` history as of 2026-04-04.
- The build surface in
  [package.json](/Users/athena/Code/doesitarm/package.json)
  does not appear to depend on `docs/`.

## What The Evidence Says
|
||||
|
||||
- Different remote from the public repo:
|
||||
yes. Git supports separate remotes, and the `git remote` docs explicitly say
|
||||
that when fetch and push use different locations, you should use two separate
|
||||
remotes rather than pretending they are the same remote.
|
||||
- Long-lived branch with path exclusions:
|
||||
not a native Git capability. Git merges operate on full trees, not
|
||||
"everything except these directories."
|
||||
- Sparse checkout:
|
||||
not the answer here. The `git sparse-checkout` docs describe it as a working
|
||||
directory reduction feature, and they note that operations such as merge or
|
||||
rebase may still materialize paths outside the sparse specification.
|
||||
- `git filter-repo`:
|
||||
good fit for one-way export. The Git project now recommends it instead of
|
||||
`filter-branch`, and its docs support `--invert-paths` for "keep everything
|
||||
except these paths" rewrites. That matches "mirror the repo except `docs/`."
|
||||
- `splitsh-lite`:
|
||||
promising when you want repeatable projections into standalone repos and care
|
||||
about performance. Its README supports split prefixes that can include
|
||||
exclusions and uses a history cache, which is more appropriate than a manual
|
||||
merge branch when this becomes a repeated sync lane.
|
||||
- `josh`:
|
||||
the advanced option. Its repo describes a proxy Git server that exposes
|
||||
filtered histories as standalone repos and synchronizes between original and
|
||||
filtered views. This is the closest thing to a "real" selective mirror
|
||||
system, but it adds operational weight.
|
||||
- GitHub Actions auth:
|
||||
the default `github.token` is scoped to the current repository. If a workflow
|
||||
in the public repo needs to push to a different private repo, you need a PAT,
|
||||
deploy key, or GitHub App token instead.
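The "keep everything except these paths" behavior described in the `git filter-repo` bullet can be modeled in a few lines. This is only an illustration on a plain list of file names: `keep_path`, `project`, and the sample paths are hypothetical, and the real rewrite would be done by `git filter-repo`, not by this code.

```python
from typing import Iterable, List


def keep_path(path: str, excluded_prefixes: Iterable[str]) -> bool:
    """Return True when `path` is outside every excluded directory prefix."""
    for prefix in excluded_prefixes:
        directory = prefix.rstrip("/")
        # Match the directory itself or anything below it, but not siblings
        # that merely share a name prefix (docs/ must not match docs-site/).
        if path == directory or path.startswith(directory + "/"):
            return False
    return True


def project(paths: Iterable[str], excluded_prefixes: Iterable[str]) -> List[str]:
    """Model a one-way export: the same file list minus the excluded paths."""
    excluded = list(excluded_prefixes)
    return [p for p in paths if keep_path(p, excluded)]
```

The point of the model is that exclusion is a projection applied to a whole tree, which is why it fits a generated export better than a hand-maintained merge branch.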
|
||||
|
||||
## What Works
|
||||
|
||||
- A second remote or second repository for the private target.
|
||||
- A one-way generated branch or repo, not a hand-maintained merge branch.
|
||||
- Rebuilding the private export from `origin/master` every run.
|
||||
- Treating the private mirror as generated output with force-pushes allowed.
|
||||
- Keeping development on one source branch and one source-of-truth repo.
|
||||
|
||||
## What To Avoid
|
||||
|
||||
- Do not maintain `private-main` by repeatedly merging `master` and deleting
|
||||
`docs/`. That creates unnecessary churn and eventual conflict debt.
|
||||
- Do not use sparse checkout as if it were a publishing filter.
|
||||
- Do not make the generated private mirror a peer source of truth unless you
|
||||
also adopt a projection system designed for bidirectional sync.
|
||||
- Do not rely on the default GitHub Actions token for pushes to another repo.
|
||||
- Do not assume this setup hides `docs/` historically; those files are already
|
||||
in the public remote history.
|
||||
|
||||
## Best Patterns
|
||||
|
||||
### 1. Best Fit For This Repo: one-way export to a second remote
|
||||
|
||||
Use a second remote, for example `private`, pointing at a separate private
|
||||
GitHub repository. On each push to `origin/master`, run an automation that:
|
||||
|
||||
1. checks out `master`
|
||||
2. authenticates to the private repo with a PAT, deploy key, or GitHub App
|
||||
3. creates a fresh export clone or export worktree
|
||||
4. rewrites out `docs/` and any other excluded paths
|
||||
5. force-pushes the sanitized result to the private repo branch
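The steps above can be sketched as a small command builder. This is a minimal sketch, not a tested sync job: it assumes `git-filter-repo` is installed on the runner, that authentication to the private repo (step 2) is already configured outside this script, and that the URLs and working directory passed in are placeholders chosen by the caller. `build_export_commands` and `run_export` are hypothetical names.

```python
import subprocess
from typing import List, Sequence


def build_export_commands(
    source_url: str,
    private_url: str,
    branch: str,
    excluded_paths: Sequence[str],
    workdir: str,
) -> List[List[str]]:
    """Return the git commands for one export run, in order."""
    # Fresh single-branch clone: git filter-repo expects to run on a fresh clone.
    clone = ["git", "clone", "--branch", branch, "--single-branch", source_url, workdir]

    # Rewrite history so the excluded paths never existed in the export.
    rewrite = ["git", "-C", workdir, "filter-repo", "--invert-paths"]
    for path in excluded_paths:
        rewrite += ["--path", path]

    # The export is generated output, so overwriting the target is intended.
    refspec = f"refs/heads/{branch}:refs/heads/{branch}"
    push = ["git", "-C", workdir, "push", "--force", private_url, refspec]
    return [clone, rewrite, push]


def run_export(commands: Sequence[Sequence[str]]) -> None:
    """Run each command and stop at the first failure."""
    for argv in commands:
        subprocess.run(list(argv), check=True)
```

Keeping the command list separate from execution makes the excluded-path logic easy to review and to dry-run before any force-push happens.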
|
||||
|
||||
Why this is the best fit:
|
||||
|
||||
- it matches the repo's current single-source workflow
|
||||
- it does not depend on path-aware merges, which Git does not support
|
||||
- it keeps excluded-path logic in one place
|
||||
- it is easy to reason about and recover from
|
||||
|
||||
Tradeoffs:
|
||||
|
||||
- exported commit SHAs will differ from public `master`
|
||||
- the private mirror should be treated as generated/read-only
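The first tradeoff follows from how Git names objects: a commit records the ID of its root tree, and a tree ID is a hash over its entries, so removing `docs/` changes the tree and therefore every commit built on it. The sketch below illustrates this with hand-built objects; the file names and contents are made up, and it models only blobs and trees, not full commits.

```python
import hashlib
from typing import List, Tuple


def object_id(kind: str, body: bytes) -> str:
    """Hash an object the way Git does: SHA-1 over '<kind> <size>', a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_id(entries: List[Tuple[str, str, str]]) -> str:
    """Hash a tree built from (mode, name, child_id) entries."""
    def sort_key(entry: Tuple[str, str, str]) -> str:
        mode, name, _ = entry
        # Git orders directory entries as if their names ended with '/'.
        return name + "/" if mode == "40000" else name

    body = b"".join(
        f"{mode} {name}".encode() + b"\0" + bytes.fromhex(child)
        for mode, name, child in sorted(entries, key=sort_key)
    )
    return object_id("tree", body)


readme = object_id("blob", b"hello\n")
docs = tree_id([("100644", "plan.md", object_id("blob", b"internal notes\n"))])

# Same README in both trees; only the presence of docs/ differs.
public_tree = tree_id([("100644", "README.md", readme), ("40000", "docs", docs)])
export_tree = tree_id([("100644", "README.md", readme)])
```

Here `public_tree` and `export_tree` differ even though the shared file is byte-identical, which is why the export should be compared by content rather than by commit SHA.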
|
||||
|
||||
### 2. Better If Repeated Projection Becomes Core: `splitsh-lite`
|
||||
|
||||
If you end up publishing multiple filtered mirrors or need fast repeated
|
||||
updates, `splitsh-lite` is worth a spike. It is built for turning repository
|
||||
views into standalone histories and caching the work.
|
||||
|
||||
Tradeoffs:
|
||||
|
||||
- more specialized operational knowledge
|
||||
- less obvious to future maintainers than a simple export script
|
||||
|
||||
### 3. Better Only For Advanced Bidirectional Partial Views: `josh`
|
||||
|
||||
If the real requirement becomes "developers commit through filtered views and
|
||||
changes synchronize both directions," `josh` is the pattern to study.
|
||||
|
||||
Tradeoffs:
|
||||
|
||||
- significant infrastructure/runtime overhead
|
||||
- far more complexity than `doesitarm` appears to need today
|
||||
|
||||
### 4. Adjacent But Not The Same Problem: GitHub Private Mirrors
|
||||
|
||||
GitHub's `private-mirrors` app is relevant if the goal is to collaborate
|
||||
privately around a public repository and upstream later. It is not the right
|
||||
answer for "same repo minus `docs/`," but it is worth noting as a neighboring
|
||||
pattern.
|
||||
|
||||
## Recommendation
|
||||
|
||||
For `doesitarm`, use a separate private repository plus a generated sync job.
|
||||
Name the target branch after the actual default branch in this repo, for
|
||||
example `private-master` or simply `master` on the private remote.
|
||||
|
||||
Do not implement this as a merge branch.
|
||||
|
||||
If the aim is just "same code, different remote, minus `docs/`," a generated
|
||||
one-way mirror is the right level of machinery.
|
||||
|
||||
If the aim is "keep future internal docs private," move the source of truth to
|
||||
a private repo and generate the public mirror from that private origin.
|
||||
|
||||
## Missing Information
|
||||
|
||||
- Whether the private target is intended to be read-only/generated or whether
|
||||
anyone will commit directly to it.
|
||||
- Whether `docs/` is the only excluded path or just the first example.
|
||||
- Whether the real goal is secrecy, deployment hygiene, or private-only
|
||||
collaboration before publishing.
|
||||
|
||||
## Source Links
|
||||
|
||||
- Git remote docs:
|
||||
https://git-scm.com/docs/git-remote
|
||||
- Git sparse-checkout docs:
|
||||
https://git-scm.com/docs/git-sparse-checkout
|
||||
- `git-filter-repo` repository:
|
||||
https://github.com/newren/git-filter-repo
|
||||
- `git-filter-repo` manual:
|
||||
https://www.mankier.com/1/git-filter-repo
|
||||
- `splitsh-lite` repository:
|
||||
https://github.com/splitsh/lite
|
||||
- `josh` repository:
|
||||
https://github.com/josh-project/josh
|
||||
- `actions/checkout` README:
|
||||
https://github.com/actions/checkout
|
||||
- GitHub App auth in GitHub Actions:
|
||||
https://docs.github.com/en/enterprise-cloud@latest/apps/creating-github-apps/authenticating-with-a-github-app/making-authenticated-api-requests-with-a-github-app-in-a-github-actions-workflow
|
||||
- GitHub deploy keys:
|
||||
https://docs.github.com/v3/guides/managing-deploy-keys
|
||||
- GitHub Private Mirrors app:
|
||||
https://github.com/github-community-projects/private-mirrors
|
||||
- Stack Overflow, partial sharing of Git repositories:
|
||||
https://stackoverflow.com/questions/278270/partial-sharing-of-git-repositories
|
||||
|
||||
## Source Quality Notes
|
||||
|
||||
- HN and Lobsters searches on 2026-04-04 did not surface a clearly better
|
||||
mainstream pattern than the Git/GitHub docs plus the specialized projection
|
||||
tools above.
|
||||
- Primary docs and project READMEs were materially more useful than forum
|
||||
commentary for this question.
|
||||
|
|
@ -0,0 +1,396 @@
|
|||
# Public Repo Security And Monorepo Patterns For doesitarm
|
||||
|
||||
Tease: The safest version of this plan keeps `doesitarm` public, but treats credentials, imports, downloaded app artifacts, and privileged automation as private operational surfaces.
|
||||
|
||||
Lede: For `doesitarm` on 2026-04-04, the best-fit pattern is a Kriasoft-style public monorepo with clear `apps/`, `packages/`, `db/`, and `infra/` boundaries, plus hardened GitHub Actions, GitHub-hosted runners for public workflows, D1 local development via Wrangler, and private storage for secrets, backups, and quarantined artifacts.
|
||||
|
||||
Why it matters:
|
||||
- The current repo is about to add higher-risk surfaces: D1, automated app discovery, archive downloading, scheduled jobs, and more Cloudflare automation.
|
||||
- In a public repo, CI/CD mistakes matter as much as application code mistakes. Workflow files, tokens, logs, and runner choices become part of the threat model.
|
||||
- The current repo already has one immediate security problem: a workflow prints secret-derived files to CI logs.
|
||||
|
||||
Go deeper:
|
||||
- Keep the code public; keep secrets, raw data, and operational state private.
|
||||
- Refactor toward a monorepo shape early so new ingestion, scanner, D1, and infra code do not spread across a flat root.
|
||||
- Adopt OSS-friendly GitHub hardening: read-only default `GITHUB_TOKEN`, pinned actions, CODEOWNERS on workflow/infra/db paths, secret scanning, private vulnerability reporting, and no self-hosted runners for public PRs.
|
||||
|
||||
Date: 2026-04-04
|
||||
|
||||
## Scope
|
||||
|
||||
Research security considerations and common open-source repository patterns for a
|
||||
setup like `doesitarm`:
|
||||
|
||||
- public GitHub repository
|
||||
- Cloudflare Workers and D1
|
||||
- scheduled automation
|
||||
- automated downloading and scanning of third-party app archives
|
||||
- prospective monorepo refactor in the style of
|
||||
`kriasoft/react-starter-kit`
|
||||
|
||||
This memo is intended to drive updates to
|
||||
[app-discovery-d1-automation.md](/Users/athena/Code/doesitarm/docs/plans/app-discovery-d1-automation.md).
|
||||
|
||||
## Short Answer
|
||||
|
||||
Do not move the whole repo private.
|
||||
|
||||
Instead:
|
||||
|
||||
1. Keep the application and infrastructure code public.
|
||||
2. Move secrets, imported raw data, D1 operational state, downloaded artifacts,
|
||||
quarantined samples, and any sensitive fixtures to private systems.
|
||||
3. Refactor into a monorepo early, using a Kriasoft-style structure adapted to
|
||||
this repo's existing pnpm/Netlify/Astro/Workers setup.
|
||||
4. Harden GitHub Actions before expanding automation.
|
||||
|
||||
Best-fit recommendation:
|
||||
|
||||
- Public monorepo with `apps/`, `packages/`, `db/`, `infra/`, `scripts/`,
|
||||
and `docs/`
|
||||
- GitHub-hosted runners for public workflows
|
||||
- GitHub environment secrets with required reviewers for production deploys
|
||||
- Cloudflare D1 local development and tests via Wrangler `--local`,
|
||||
`preview_database_id`, and test harnesses like `unstable_dev()`/Miniflare
|
||||
- Private object storage or equivalent for raw app archives, import dumps,
|
||||
and quarantine material
|
||||
|
||||
Inference:
|
||||
This is the right fit because the repo is open source and community-facing, but
|
||||
the risky parts are operational, not architectural. Public code is compatible
|
||||
with good security here; public credentials and public operational data are not.
|
||||
|
||||
## What The Repo Already Knows
|
||||
|
||||
- The repo is currently flat-rooted, not organized as a workspace monorepo.
|
||||
- There is no checked-in D1 configuration or local D1 bootstrap yet.
|
||||
- There is Cloudflare deployment automation in
|
||||
[deploy-cloudflare-workers.yml](/Users/athena/Code/doesitarm/.github/workflows/deploy-cloudflare-workers.yml).
|
||||
- That workflow currently decodes secret-backed `.env` / `wrangler.toml` files
|
||||
and prints them with `cat`, which is a real security issue in CI logs.
|
||||
- The site build still depends on remote/env-backed feeds such as
|
||||
`SCANS_SOURCE`, `COMMITS_SOURCE`, `HOMEBREW_SOURCE`, `GAMES_SOURCE`, and
|
||||
`VFUNCTIONS_URL`.
|
||||
- The scanner and planned discovery pipeline will process untrusted third-party
|
||||
files, including archive formats like ZIP, DMG, and PKG.
|
||||
- `.env` is ignored at the root, and per-worker `wrangler.toml` files are
|
||||
already ignored in worker subdirectories.
|
||||
|
||||
## What The Evidence Says
|
||||
|
||||
### 1. Public repos can stay public if the operational boundary is private
|
||||
|
||||
GitHub's own docs assume public repositories will use:
|
||||
|
||||
- repository or environment secrets
|
||||
- restricted organization secret access
|
||||
- private vulnerability reporting
|
||||
- automatic secret scanning on public repos
|
||||
|
||||
That is strong evidence that the normal pattern is not "make the repo private";
|
||||
it is "keep sensitive operational material out of the repo and out of logs."
|
||||
|
||||
### 2. Default GitHub Actions posture should be least privilege
|
||||
|
||||
GitHub recommends:
|
||||
|
||||
- minimum required `GITHUB_TOKEN` permissions
|
||||
- default repository token permission set to read-only
|
||||
- escalating permissions only per job
|
||||
- using a GitHub App token if a job needs more than `GITHUB_TOKEN` can provide
|
||||
|
||||
This matches what open-source repos increasingly do for deploy, release, and
|
||||
cross-repo automation.
|
||||
|
||||
### 3. Secrets are still easy to leak through logs and workflow behavior
|
||||
|
||||
GitHub's secure-use docs explicitly warn that:
|
||||
|
||||
- redaction is not guaranteed for transformed values
|
||||
- structured blobs like JSON/YAML are poor secret formats
|
||||
- sensitive values that are not stored as GitHub secrets should be masked explicitly with `::add-mask::`
|
||||
- exposed secrets in logs should trigger deletion/rotation
|
||||
|
||||
For `doesitarm`, this directly applies to the current workflow that prints
|
||||
secret-derived config files into CI output.
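A safer shape for that step is to write the decoded file without ever echoing it, and to register any derived sensitive value with the runner before it can appear in output. The following is a hedged sketch, assuming the secret arrives base64-encoded as the current workflow implies; `write_secret_file` and `mask_commands` are hypothetical helper names, and emitting one mask command per line is a conservative choice rather than documented required behavior.

```python
import base64
import os
from typing import List


def write_secret_file(encoded: str, dest: str) -> int:
    """Decode a base64 secret into `dest` and return only the byte count."""
    data = base64.b64decode(encoded)
    # Create the file readable by the current user only; never echo `data`.
    fd = os.open(dest, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return len(data)


def mask_commands(value: str) -> List[str]:
    """Build one `::add-mask::` workflow command per non-empty line of `value`."""
    return [f"::add-mask::{line}" for line in value.splitlines() if line.strip()]
```

In a workflow step, the mask commands would be printed to stdout before any later step uses the value, and the only thing logged about the config file would be its size.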
|
||||
|
||||
### 4. Public repos should avoid self-hosted runners for untrusted PRs
|
||||
|
||||
GitHub explicitly recommends self-hosted runners only with private
|
||||
repositories, because forks of public repositories can run dangerous code on
|
||||
them through pull requests.
|
||||
|
||||
For this repo, that means:
|
||||
|
||||
- do not put public PR workflows on a local machine or other long-lived
|
||||
self-hosted runner
|
||||
- do not run untrusted archive-processing jobs on a self-hosted runner that
|
||||
also holds production credentials
|
||||
|
||||
### 5. `pull_request_target` remains a common footgun
|
||||
|
||||
GitHub Security Lab's `Preventing pwn requests` guidance is still the clearest
|
||||
implementation reference:
|
||||
|
||||
- `pull_request_target` plus checking out/building PR code is dangerous
|
||||
- untrusted PR code should run in an unprivileged `pull_request` workflow
|
||||
- privileged follow-up actions should happen through `workflow_run` with
|
||||
carefully handled artifacts
|
||||
|
||||
HN discussion around real workflow exploits reinforces the same point: the
|
||||
problem is not theoretical.
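A rough lint for this pattern can be written as a text heuristic. The sketch below is an assumption-heavy check, not a parser: it flags a workflow only when the text both mentions `pull_request_target` and references the pull request head through common expressions. It will miss indirect checkouts and can over-report, so it is a prompt for review rather than a verdict. `has_pwn_request_risk` is a hypothetical name.

```python
import re

# Expressions that resolve to the untrusted pull request head.
_PR_HEAD = re.compile(r"github\.event\.pull_request\.head\.(sha|ref)|github\.head_ref")
_TRIGGER = re.compile(r"\bpull_request_target\b")


def has_pwn_request_risk(workflow_text: str) -> bool:
    """Flag workflows that use pull_request_target and reference PR head code."""
    # Drop full-line comments so commented-out triggers are not counted.
    lines = [ln for ln in workflow_text.splitlines() if not ln.lstrip().startswith("#")]
    text = "\n".join(lines)
    return bool(_TRIGGER.search(text)) and bool(_PR_HEAD.search(text))
```

Running a check like this over `.github/workflows/` in CI gives reviewers an early signal before a risky trigger reaches the default branch.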
|
||||
|
||||
### 6. Common OSS hardening patterns for GitHub workflows are now well-defined
|
||||
|
||||
GitHub secure-use guidance and OpenSSF best-practice guidance converge on:
|
||||
|
||||
- pin actions to full commit SHAs
|
||||
- restrict allowed actions where possible
|
||||
- guard `.github/workflows/` with `CODEOWNERS`
|
||||
- keep the default branch protected
|
||||
- require reviews and passing checks
|
||||
- use code scanning / dependency review / secret scanning / Dependabot
|
||||
- use private vulnerability reporting for public repos
|
||||
|
||||
These are standard public-repo practices, not enterprise-only overkill.
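The first item, pinning to full commit SHAs, is easy to check mechanically. The following is a minimal sketch that scans workflow text line by line for `uses:` references that do not end in a 40-character hex SHA; it assumes one `uses:` per line and skips local and `docker://` references, and `unpinned_actions` is a hypothetical name.

```python
import re
from typing import List

_USES = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^'\"\s#]+)")
_FULL_SHA = re.compile(r"@[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> List[str]:
    """Return action references that are not pinned to a full commit SHA."""
    refs: List[str] = []
    for line in workflow_text.splitlines():
        match = _USES.match(line)
        if not match:
            continue
        ref = match.group(1)
        # Local actions and container images are not pinned by commit SHA.
        if ref.startswith("./") or ref.startswith("docker://"):
            continue
        if not _FULL_SHA.search(ref):
            refs.append(ref)
    return refs
```

A non-empty result could fail a lint job, with Dependabot then keeping the pinned SHAs current.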
|
||||
|
||||
### 7. Cloudflare D1 already supports local-first development and tests
|
||||
|
||||
Cloudflare's D1 docs explicitly support:
|
||||
|
||||
- `wrangler dev` local mode
|
||||
- `preview_database_id`
|
||||
- `wrangler d1 migrations apply --local`
|
||||
- test setups using Miniflare and `unstable_dev()`
|
||||
|
||||
That means D1 does not require a private repo or remote-only workflow. It fits
|
||||
the "run locally on this machine, then automate" plan well.
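Because D1 is built on SQLite, migration files can also be sanity-checked against a throwaway in-memory database before running them through Wrangler. The sketch below is an assumption-labeled illustration, not how Wrangler tracks migrations: the `applied_migrations` table, the file names, and the `apps` schema in the usage are all hypothetical, and D1-specific behavior still needs the real `--local` run.

```python
import sqlite3
from typing import Iterable, List, Tuple


def apply_migrations(
    conn: sqlite3.Connection, migrations: Iterable[Tuple[str, str]]
) -> List[str]:
    """Apply (name, sql) migrations in name order, skipping ones already applied."""
    conn.execute("CREATE TABLE IF NOT EXISTS applied_migrations (name TEXT PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT name FROM applied_migrations")}
    applied: List[str] = []
    for name, sql in sorted(migrations):
        if name in done:
            continue
        # executescript runs every statement in the migration file.
        conn.executescript(sql)
        conn.execute("INSERT INTO applied_migrations (name) VALUES (?)", (name,))
        applied.append(name)
    conn.commit()
    return applied
```

Numbered file names such as `0001_...` and `0002_...` make the sort order match the intended apply order, and re-running the function is a cheap idempotency check.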
|
||||
|
||||
### 8. Cloudflare Workflows and observability make Cloudflare a credible later home for ingestion
|
||||
|
||||
Cloudflare Workflows now positions itself as durable multi-step execution
|
||||
with retries, persisted state, and debugging. Workers Logs and Traces provide
|
||||
native observability. That is enough evidence to treat Cloudflare as a viable
|
||||
later landing zone for scheduled ingestion and scan orchestration.
|
||||
|
||||
Inference:
|
||||
GitHub Actions is still the easier first scheduler because it is already in the
|
||||
repo, but Cloudflare Workflows has matured enough to stay in the plan as a
|
||||
serious later option.
|
||||
|
||||
### 9. Kriasoft's monorepo shape is a good architectural fit, but not every exact convention should be copied blindly
|
||||
|
||||
`kriasoft/react-starter-kit` is a public monorepo with:
|
||||
|
||||
- `apps/`
|
||||
- `packages/`
|
||||
- `db/`
|
||||
- `docs/`
|
||||
- `infra/`
|
||||
- `scripts/`
|
||||
|
||||
It also documents a public template env pattern where committed `.env`
|
||||
contains placeholders/defaults and `.env.local` contains real credentials.
|
||||
|
||||
That shape is a strong fit for `doesitarm`, but I would adapt the env pattern
|
||||
slightly for safety and clarity:
|
||||
|
||||
- keep a committed public template file such as `.env.example`
|
||||
- keep real credentials in `.env.local`, `.dev.vars`, GitHub environment
|
||||
secrets, and Cloudflare secrets
|
||||
|
||||
Inference:
|
||||
Kriasoft's folder layout is the part worth copying directly. The exact env-file
|
||||
naming should follow the least-confusing safe convention for this repo.
|
||||
|
||||
## Common Open-Source Patterns That Fit doesitarm
|
||||
|
||||
### Public code, private state
|
||||
|
||||
Keep public:
|
||||
|
||||
- app code
|
||||
- scanner code
|
||||
- D1 schema and migrations
|
||||
- workflow definitions
|
||||
- docs and plans
|
||||
|
||||
Keep private:
|
||||
|
||||
- deploy credentials and tokens
|
||||
- raw Google Sheets exports or database backups
|
||||
- downloaded app archives
|
||||
- quarantine samples
|
||||
- private test fixtures that would create redistribution or abuse risk
|
||||
- operational dashboards and alert destinations
|
||||
|
||||
### Workspace monorepo with clear trust boundaries
|
||||
|
||||
Best-fit structure for `doesitarm`:
|
||||
|
||||
- `apps/web/` — Astro site and app-test UI
|
||||
- `apps/default-worker/` — current `doesitarm-default`
|
||||
- `apps/analytics-worker/` — current `workers/analytics`
|
||||
- `apps/ingest/` or `apps/discovery/` — CLI/admin surface for discovery jobs
|
||||
- `packages/scanner-core/` — shared scan engine and file-format logic
|
||||
- `packages/source-runners/` — Homebrew/GitHub/download-page source runners
|
||||
- `packages/data-model/` — shared D1 schema types, DTOs, validation
|
||||
- `packages/site-build/` — list/build/export helpers
|
||||
- `db/` — D1 migrations, seeds, import scripts, local test DB helpers
|
||||
- `infra/` — Wrangler config, deploy config, policy docs
|
||||
- `scripts/` — repo automation
|
||||
- `docs/` — plans, research, operational docs
|
||||
|
||||
### Repo template files, not repo secrets
|
||||
|
||||
Common OSS pattern:
|
||||
|
||||
- commit `.env.example` or placeholder-only `.env`
|
||||
- ignore `.env.local`, `.dev.vars`, and `.wrangler/`
|
||||
- keep Cloudflare secrets in Workers secrets / GitHub environment secrets
|
||||
|
||||
### Hardened GitHub Actions for public forks
|
||||
|
||||
Common OSS pattern:
|
||||
|
||||
- default `permissions: { contents: read }`
|
||||
- explicit per-job escalation only
|
||||
- require approval for fork PR workflows where appropriate
|
||||
- no self-hosted runners for public PRs
|
||||
- no `pull_request_target` workflows that checkout/build PR code
|
||||
|
||||
### Supply-chain hygiene for workflows
|
||||
|
||||
Common OSS pattern:
|
||||
|
||||
- pin actions to full SHAs
|
||||
- restrict allowed actions
|
||||
- Dependabot for action updates
|
||||
- CodeQL / code scanning for workflow vulnerabilities
|
||||
- OpenSSF Scorecards for ongoing hygiene checks
|
||||
|
||||
### Disclosure and scanning defaults
|
||||
|
||||
Common OSS pattern:
|
||||
|
||||
- enable private vulnerability reporting
|
||||
- enable secret scanning and push protection
|
||||
- keep a `SECURITY.md` policy
|
||||
|
||||
## What Works
|
||||
|
||||
- Keeping the repo public while moving secrets and sensitive data out of git
|
||||
- Refactoring to a monorepo before adding more D1/discovery complexity
|
||||
- Treating workflow files, `infra/`, and `db/` as protected surfaces with
|
||||
`CODEOWNERS`
|
||||
- Using GitHub-hosted runners for public CI and scheduled jobs
|
||||
- Using environment-specific secrets with required reviewers for production
|
||||
deployment jobs
|
||||
- Using D1 local mode and local migrations as part of normal development
|
||||
- Using Cloudflare Logs/Traces or equivalent observability for scheduled jobs
|
||||
- Storing raw archives and quarantine material in private object storage rather
|
||||
than in the repo
|
||||
|
||||
## What To Avoid
|
||||
|
||||
- Do not move the whole repo private as a substitute for secrets hygiene
|
||||
- Do not keep the current workflow behavior that prints secret-derived files to
|
||||
CI logs
|
||||
- Do not use self-hosted runners for public PR workflows
|
||||
- Do not run archive downloads/extraction in privileged workflows that also have
|
||||
deploy credentials
|
||||
- Do not combine `pull_request_target` with explicit PR checkout/build steps
|
||||
- Do not keep adding discovery/D1/worker code into the current flat root
|
||||
- Do not commit raw import dumps, app archives, or structured secret blobs
|
||||
|
||||
## Recommendation
|
||||
|
||||
For `doesitarm`, the strongest next-step package is:
|
||||
|
||||
1. Refactor toward a Kriasoft-style monorepo shape adapted to pnpm.
|
||||
2. Add a security-hardening stage before expanding automation.
|
||||
3. Keep the repo public.
|
||||
4. Keep secrets, raw operational data, and archive/quarantine material private.
|
||||
5. Start scheduled discovery on GitHub-hosted runners with hardened workflows.
|
||||
6. Keep Cloudflare Workflows as a second-phase target for durable ingestion.
|
||||
|
||||
Immediate high-priority actions to capture in the plan:
|
||||
|
||||
1. Remove secret printing from
|
||||
[deploy-cloudflare-workers.yml](/Users/athena/Code/doesitarm/.github/workflows/deploy-cloudflare-workers.yml)
|
||||
and rotate affected secrets.
|
||||
2. Add repo policy and tooling for:
|
||||
- read-only default `GITHUB_TOKEN`
|
||||
- pinned actions
|
||||
- `CODEOWNERS` for `.github/workflows/`, `infra/`, and `db/`
|
||||
- secret scanning / push protection
|
||||
- private vulnerability reporting
|
||||
3. Add ignored local-secret files for the new D1/Workers workflow:
|
||||
- `.env.local`
|
||||
- `.dev.vars`
|
||||
- `.wrangler/`
|
||||
4. Keep public PR CI on GitHub-hosted runners only.
|
||||
5. Store raw archives/import snapshots outside the repo.
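Action 3 can be enforced with a small repository check. This is a minimal sketch that only compares literal `.gitignore` lines and does not evaluate glob patterns or nested ignore files, so a passing result is a hint rather than proof; `missing_ignores` and `REQUIRED_IGNORES` are hypothetical names.

```python
from typing import Iterable, List

REQUIRED_IGNORES = (".env.local", ".dev.vars", ".wrangler/")


def missing_ignores(
    gitignore_text: str, required: Iterable[str] = REQUIRED_IGNORES
) -> List[str]:
    """Return required entries that do not appear as their own .gitignore line."""
    present = set()
    for raw in gitignore_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Treat '/.wrangler/' and '.wrangler' as covering '.wrangler/'.
        present.add(line.strip("/"))
    return [entry for entry in required if entry.strip("/") not in present]
```

Wiring this into a pre-commit hook or CI job would surface a missing ignore rule before a local secret file is ever staged.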
|
||||
|
||||
## Missing Information
|
||||
|
||||
- Whether the future ingestion runtime is expected to stay GitHub-first or
|
||||
eventually move fully to Cloudflare Workers/Workflows.
|
||||
- Whether there are legal or vendor-policy constraints around storing downloaded
|
||||
app archives long term.
|
||||
- Whether the monorepo refactor should keep Netlify as-is or consolidate more
|
||||
runtime surfaces onto Cloudflare.
|
||||
|
||||
## Source Links
|
||||
|
||||
- GitHub Docs, `GITHUB_TOKEN` least-privilege and GitHub App escalation:
|
||||
https://docs.github.com/en/actions/tutorials/authenticate-with-github_token
|
||||
- GitHub Docs, secrets in Actions, fork-secret behavior, environment reviewers,
|
||||
OIDC, and masking:
|
||||
https://docs.github.com/en/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions
|
||||
- GitHub Docs, secure use reference, pinning actions, CODEOWNERS, code scanning,
|
||||
Dependabot, and Scorecards:
|
||||
https://docs.github.com/en/actions/reference/security/secure-use
|
||||
- GitHub Docs, self-hosted runner warning for public repositories:
|
||||
https://docs.github.com/en/actions/how-tos/manage-runners/self-hosted-runners/add-runners
|
||||
- GitHub Docs, limiting self-hosted runners in organizations:
|
||||
https://docs.github.com/en/organizations/managing-organization-settings/disabling-or-limiting-github-actions-for-your-organization
|
||||
- GitHub Docs, approval requirements for fork PR workflows:
|
||||
https://docs.github.com/en/actions/managing-workflow-runs-and-deployments/managing-workflow-runs/approving-workflow-runs-from-public-forks
|
||||
- GitHub Docs, repository Actions settings and fork workflow controls:
|
||||
https://docs.github.com/github/administering-a-repository/managing-repository-settings/disabling-or-limiting-github-actions-for-a-repository
|
||||
- GitHub Docs, secret scanning for public repositories:
|
||||
https://docs.github.com/github/administering-a-repository/about-token-scanning
|
||||
- GitHub Docs, enabling secret scanning / push protection:
|
||||
https://docs.github.com/en/code-security/how-tos/secure-your-secrets/detect-secret-leaks/enabling-secret-scanning-for-your-repository
|
||||
- GitHub Docs, enabling push protection:
|
||||
https://docs.github.com/en/code-security/secret-scanning/enabling-secret-scanning-features/enabling-push-protection-for-your-repository
|
||||
- GitHub Docs, private vulnerability reporting:
|
||||
https://docs.github.com/en/code-security/security-advisories/working-with-repository-security-advisories/configuring-private-vulnerability-reporting-for-a-repository
|
||||
- GitHub Security Lab, `pull_request_target` / `workflow_run` guidance:
|
||||
https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/
|
||||
- OpenSSF GitHub configuration best practices:
|
||||
https://best.openssf.org/SCM-BestPractices/github/
|
||||
- Kriasoft React Starter Kit:
|
||||
https://github.com/kriasoft/react-starter-kit
|
||||
- Cloudflare D1 local development:
|
||||
https://developers.cloudflare.com/d1/best-practices/local-development/
|
||||
- Cloudflare Workers observability:
|
||||
https://developers.cloudflare.com/workers/observability/
|
||||
- Cloudflare Workers logs:
|
||||
https://developers.cloudflare.com/workers/observability/logs/
|
||||
- Cloudflare Workers traces:
|
||||
https://developers.cloudflare.com/workers/observability/traces/
|
||||
- Cloudflare Workflows overview:
|
||||
https://developers.cloudflare.com/workflows/
|
||||
|
||||
## Source Quality Notes
|
||||
|
||||
- Highest-confidence sources in this memo are GitHub Docs, GitHub Security Lab,
|
||||
OpenSSF, Cloudflare Docs, and the Kriasoft repository itself.
|
||||
- HN/Lobsters did not surface a materially better competing pattern in this
|
||||
pass; the most useful HN signal reinforced GitHub Security Lab's warning on
|
||||
`pull_request_target`.
|
||||
- The recommendation to keep the repo public but move operational data private
|
||||
is a synthesis from official guidance plus this repo's current shape and risk
|
||||
surface.
|
||||