mirror of
https://github.com/ThatGuySam/doesitarm.git
synced 2026-05-18 06:44:46 -07:00
Capture the next discovery, security, compatibility-data, and dual-deploy planning work, and ignore local Vercel/env state that should not be committed. This keeps the operational research with the repo while avoiding accidental local-config churn. Constraint: Must not alter production runtime behavior Rejected: Fold research notes into the runtime fix commit | obscures the user-facing app-test correction with planning-only material Confidence: high Scope-risk: narrow Reversibility: clean Directive: Keep .omx local state untracked even when committing broad workspace updates Tested: Document review only Not-tested: No runtime verification required for docs and ignore rules
600 lines
22 KiB
Markdown
# Desktop App Compatibility Data Strategy For doesitarm

Tease: In 2026, the winning play is not "AI SEO." It is publishing a high-trust, machine-readable compatibility corpus that is genuinely useful on the open web and selectively keeping the operational and proprietary layers private.

Lede: For `doesitarm` as of 2026-04-04, the best-fit strategy is to treat the site as an entity-and-evidence graph for desktop software compatibility: publish canonical app pages, provenance-rich evidence, structured exports, and selective machine-readable surfaces for discovery; keep raw crawls, binary artifacts, candidate matches, scoring logic, and operational intelligence private.

Why it matters:
- This project can outlive the Apple Silicon transition if the core model is "desktop software compatibility knowledge," not just "Apple Silicon list posts."
- Google's 2025 AI search guidance still rewards the same fundamentals: unique content, crawlable pages, textual clarity, and trustworthy evidence, not special AI-only tricks.
- OpenAI and Anthropic now expose separate search, user-action, and training bots, which means "open versus closed" is no longer binary. You can choose visibility, training access, and operational exposure separately.

Go deeper:
- Think of the public site as a citation layer and decision-support layer, not as the full warehouse.
- Publish public facts, provenance, timestamps, and curated exports. Keep raw ingestion, low-confidence candidates, and monetizable workflow intelligence private.
- Treat `llms.txt` and Markdown exports as helpful secondary surfaces, not as the core strategy. The core strategy is still clean HTML, canonical URLs, structured data, sitemaps, and useful pages.

Date: 2026-04-04

## Scope

Research how to think about a long-lived desktop app compatibility database as a
content, SEO, and AI-discoverability system in 2026, including:

- best practices for public content architecture
- how LLM-driven discovery changes the picture
- what data should likely stay public versus private
- what audiences this data can serve
- tradeoffs between more-open and more-closed approaches

## Short Answer

Build `doesitarm` as a public knowledge product with a private operating system
underneath it.

Publicly, publish:

- canonical app pages
- compatibility status by platform/environment
- evidence summaries and source links
- timestamps, changelogs, and history
- stable IDs, taxonomy, and machine-readable metadata
- a limited public API or snapshot exports for high-value reuse

Privately, keep:

- raw crawls and downloaded binaries
- candidate entities before review
- normalization, dedupe, and confidence logic
- crawler logs, abuse rules, and infrastructure controls
- enrichment that creates monetizable leverage rather than user value on the open web

The biggest strategic shift from 2018 to 2026 is this:

1. Search still rewards useful original pages.
2. AI discovery mostly rides on those same pages.
3. Separate crawler controls now let you be open for search while staying more closed for training.
4. The moat is less "having any compatibility data at all" and more:
   verification quality, provenance, freshness, historical depth, and workflow speed.

Inference:
No single source states that exact four-part conclusion. It is the synthesis that
best fits the repo state plus current Google, OpenAI, Anthropic, Cloudflare,
HN, and Lobsters evidence.

## What The Repo Already Knows

- The project already acts like a compatibility corpus, not just a blog:
  [README.md](/Users/athena/Code/doesitarm/README.md) is a manually curated,
  source-linked compatibility list.
- The repo already has a plan to move toward a canonical database and discovery
  pipeline:
  [docs/plans/app-discovery-d1-automation.md](/Users/athena/Code/doesitarm/docs/plans/app-discovery-d1-automation.md).
- The public site already exposes crawlable pages, a sitemap, and permissive
  crawling:
  [static/robots.txt](/Users/athena/Code/doesitarm/static/robots.txt),
  [static/sitemap-index.xml](/Users/athena/Code/doesitarm/static/sitemap-index.xml).
- The current public JSON already exposes useful app-level fields such as name,
  aliases, status, bundle IDs, related links, scan details, and device support:
  [static/api/app/spotify.json](/Users/athena/Code/doesitarm/static/api/app/spotify.json).
- The current structured data implementation is narrow and video-centric:
  [helpers/structured-data.js](/Users/athena/Code/doesitarm/helpers/structured-data.js),
  [helpers/listing-page.js](/Users/athena/Code/doesitarm/helpers/listing-page.js),
  [helpers/config-node.js](/Users/athena/Code/doesitarm/helpers/config-node.js).
- I did not find a checked-in `llms.txt`, `llms-full.txt`, or per-page Markdown
  export surface.
- I also did not find `SoftwareApplication` or `Dataset` structured data on app
  or dataset pages.

Inference:
`doesitarm` already has enough public data shape to become a strong
machine-readable corpus. The main gap is not "inventing the dataset." The gap is
formalizing and publishing the right layers of it.

## What The Evidence Says

### 1. Google AI search still wants normal SEO fundamentals, not special AI tricks

Google's current AI-features guidance says there are no extra technical
requirements for AI Overviews or AI Mode beyond normal Search eligibility.
Google explicitly says you do not need new AI files or special schema just to
appear in AI features.

What does matter:

- crawl access
- internal links
- page experience
- important content in textual form
- structured data matching visible text
- unique, non-commodity content

This is the strongest argument against building an "AI discoverability" strategy
around gimmicks alone.

### 2. Large-scale thin template pages are a real risk

Google's helpful-content and spam-policy guidance is directly relevant to
programmatic compatibility sites:

- people-first content is favored
- pages made mainly to attract search visits are a warning sign
- scaled content abuse includes generating many low-value pages, including from
  feeds or automated transformations

That means a compatibility database can absolutely win in search, but only if
its pages add decision-making value. Thin pages that just restate a status field
are exactly the pattern those policies target.

### 3. Compatibility content should look more like tested reviews than like directory filler

Google's reviews guidance is a good proxy for compatibility pages because users
often arrive with a purchase, migration, or workflow decision in mind.

The guidance consistently rewards:

- original research
- first-hand evidence
- quantitative measurements where relevant
- comparisons
- what changed across versions
- benefits and drawbacks

For `doesitarm`, that maps cleanly onto:

- status by environment
- last verified date
- evidence links
- scanner output or screenshots where appropriate
- "what changed" changelog notes
- comparison pages like native vs Rosetta vs virtualization vs cloud workaround

### 4. Dataset markup is useful, but it should describe real dataset landing pages

Google's dataset documentation recommends canonical landing pages plus dataset
metadata such as `sameAs`, `isBasedOn`, identifiers, license, and download
distribution metadata.

That is a strong fit for curated exports such as:

- a public daily or weekly compatibility snapshot
- a historical archive by date
- vendor- or category-specific exports
- a Windows-on-ARM or future-transition slice later on

Important nuance:
Google's dataset docs are about Dataset Search discovery, not a substitute for
general web SEO. Dataset markup helps when you actually publish datasets.

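
As a rough illustration of the dataset metadata described above, the sketch below builds schema.org `Dataset` JSON-LD for one dated snapshot. The URL layout, file names, identifier scheme, and license are all hypothetical placeholders, not anything the repo currently publishes; only the schema.org property names are standard.

```python
import json
from datetime import date


def build_dataset_jsonld(snapshot_date: date, base_url: str) -> dict:
    """Return schema.org Dataset metadata for one compatibility snapshot.

    The landing-page path, file names, and license below are assumptions
    made for illustration only.
    """
    stamp = snapshot_date.isoformat()
    landing_page = f"{base_url}/datasets/compatibility/{stamp}/"
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": f"Desktop app compatibility snapshot {stamp}",
        "description": "Curated compatibility status per app and environment.",
        "url": landing_page,
        "identifier": f"compatibility-snapshot-{stamp}",
        "license": "https://creativecommons.org/licenses/by/4.0/",  # assumed license
        "dateModified": stamp,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": "application/json",
                "contentUrl": f"{landing_page}apps.json",
            },
            {
                "@type": "DataDownload",
                "encodingFormat": "text/csv",
                "contentUrl": f"{landing_page}apps.csv",
            },
        ],
    }


if __name__ == "__main__":
    doc = build_dataset_jsonld(date(2026, 4, 4), "https://example.com")
    print(json.dumps(doc, indent=2))
```

The resulting object would be embedded in a `<script type="application/ld+json">` tag on the dataset landing page itself, not on individual app pages.
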
### 5. `SoftwareApplication` markup fits the entity model, but Google rich-result requirements are narrower

Schema.org's `SoftwareApplication` type supports fields that are very relevant
here, including:

- `applicationCategory`
- `downloadUrl`
- `featureList`
- `operatingSystem`
- `softwareRequirements`
- `softwareVersion`
- `supportingData`

Google also has a software-app structured-data feature, but its rich-result
requirements are more commerce-shaped, including `offers.price` and
review/rating support. That means:

- use `SoftwareApplication` semantics where they match the visible page truth
- do not invent store-like fields just to chase rich results
- use dataset markup for exports and software/entity markup for canonical app
  pages

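
One way to honor "match the visible page truth" is to emit only the properties the app record actually has. The sketch below assumes a hypothetical record shape (`canonical_url`, `operating_system`, and so on); the real fields in `static/api/app/*.json` may differ. It deliberately never adds `offers` or ratings.

```python
import json


def build_app_jsonld(app: dict) -> dict:
    """Map a (hypothetical) app record to SoftwareApplication JSON-LD.

    Only fields present and non-empty on the record are emitted, so the
    markup never claims more than the visible page does.
    """
    candidates = {
        "name": app.get("name"),
        "alternateName": app.get("aliases"),
        "url": app.get("canonical_url"),
        "applicationCategory": app.get("category"),
        "operatingSystem": app.get("operating_system"),
        "softwareVersion": app.get("version"),
        "downloadUrl": app.get("download_url"),
    }
    doc = {"@context": "https://schema.org", "@type": "SoftwareApplication"}
    # Drop missing or empty values instead of inventing placeholders.
    doc.update({key: value for key, value in candidates.items() if value})
    return doc


if __name__ == "__main__":
    record = {"name": "Example App", "aliases": [], "category": "MultimediaApplication"}
    print(json.dumps(build_app_jsonld(record), indent=2))
```
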
### 6. AI discoverability is now bot-by-bot, not one global yes/no

OpenAI and Anthropic both now distinguish between different AI access modes.

OpenAI:

- `OAI-SearchBot` is for search inclusion
- `GPTBot` is for training
- `ChatGPT-User` is for user-triggered actions

Anthropic:

- `ClaudeBot` is for training
- `Claude-SearchBot` is for search quality
- `Claude-User` is for user-triggered retrieval

This is strategically important. You no longer need to choose only between:

- fully public for every AI purpose
- fully blocked for every AI purpose

You can allow discovery while disallowing training, or allow search while
tightly managing user-action access, depending on your goals.

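
To make the bot-by-bot idea concrete, here is a minimal sketch that renders a `robots.txt` from a policy table. The specific policy (search and user-triggered agents allowed, training bots blocked) is one hypothetical choice, not a recommendation from any vendor, and the sitemap URL is a placeholder. The user-agent tokens are the ones named above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: allow search and user-triggered agents, block training.
BOT_POLICY = {
    "Googlebot": True,
    "OAI-SearchBot": True,
    "ChatGPT-User": True,
    "Claude-SearchBot": True,
    "Claude-User": True,
    "GPTBot": False,
    "ClaudeBot": False,
}


def render_robots_txt(policy: dict, sitemap_url: str) -> str:
    """Render one robots.txt group per user agent, then a default group."""
    blocks = []
    for agent, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        blocks.append(f"User-agent: {agent}\n{rule}")
    blocks.append("User-agent: *\nAllow: /")
    blocks.append(f"Sitemap: {sitemap_url}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    text = render_robots_txt(BOT_POLICY, "https://example.com/sitemap-index.xml")
    print(text)
    # Sanity-check the rendered rules with the standard-library parser.
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    print("GPTBot allowed:", parser.can_fetch("GPTBot", "https://example.com/"))
```

Keeping the policy as data makes it easy to review as a matrix and to change one bot's access without touching the others. Note that `robots.txt` only expresses a preference; enforcement against non-compliant crawlers needs separate controls.
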
### 7. `llms.txt` is real, but it is still a secondary signal

Cloudflare has implemented `llms.txt`, `llms-full.txt`, and per-page Markdown
exports, and Simon Willison has highlighted similar docs-map patterns as useful
for agent tooling.

That said:

- Google explicitly says no special AI text files are required for AI features
- OpenAI's discoverability guidance focuses on crawler access, `noindex`, and
  citation/linking, not `llms.txt`
- HN and Lobsters discussions show real skepticism around AI crawler incentives
  and how consistently emerging conventions are respected

Best interpretation:

- `llms.txt` is worth adding because it is cheap and increasingly recognized
- it should not be treated as the core lever
- the core lever is still strong public pages plus clean machine-readable
  content

### 8. AI-friendly plain-text and Markdown surfaces do have practical value

Cloudflare's docs work here is the clearest practical example:

- per-page Markdown versions
- an index file
- bulk text export
- semantic HTML
- `noindex` on low-value or confusing pages

This is less about search ranking and more about:

- making retrieval cheaper and more accurate for agents
- improving citation quality
- reducing token waste
- giving your own future agents and partners a stable ingest format

For a compatibility corpus, that suggests public Markdown or JSON exports are
worth doing for the canonical facts layer.

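
The index-file idea above can be sketched as a small generator. This follows the commonly described `llms.txt` layout (an H1 title, a blockquote summary, then H2 sections of Markdown links); that layout is a community proposal rather than a ratified standard, and the section names and URLs here are hypothetical.

```python
def render_llms_txt(site_name: str, summary: str, sections: dict) -> str:
    """Render an llms.txt-style index.

    `sections` maps a heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        render_llms_txt(
            "doesitarm",
            "Desktop app compatibility facts with evidence and timestamps.",
            {
                "Apps": [
                    (
                        "Spotify",
                        "https://example.com/app/spotify.md",
                        "Compatibility status and evidence",
                    )
                ]
            },
        )
    )
```

Generating the file from the same canonical records as the HTML pages keeps the two surfaces from drifting apart.
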
### 9. Freshness and URL discovery matter more as the corpus grows

Google recommends sitemaps and Search Console monitoring.
IndexNow gives faster change pings for engines that support it, including Bing.

For a frequently updated compatibility corpus, this argues for:

- canonical landing pages
- clean sitemap generation
- changelog feeds or update streams
- optional IndexNow support for faster non-Google discovery

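
A minimal sketch of "clean sitemap generation" is shown below, assuming each canonical page carries a last-verified or last-changed date that can feed `<lastmod>`. The page list is hypothetical; the element names and namespace come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list) -> str:
    """Build a sitemap from (canonical_url, last_changed_date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, last_changed in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # lastmod should reflect a real content change, not the build time.
        ET.SubElement(url, "lastmod").text = last_changed.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap([("https://example.com/app/spotify/", date(2026, 4, 4))]))
```

Driving `lastmod` from the status-change date rather than the deploy date keeps the freshness signal honest as the corpus grows.
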
### 10. The crawl environment is getting more adversarial

Cloudflare Radar reported AI and search crawling growth of 18% from May 2024 to
May 2025 across its measured cohort, with `GPTBot` up 305%.
HN and Lobsters operator discussions show why this matters in practice:

- some AI crawlers create real infrastructure cost
- incentives are less aligned than classic web search
- operators increasingly need bot-specific controls, rate limiting, and
  selective exposure

This is the best evidence for keeping raw and high-cost surfaces private even if
you lean more open on the public facts layer.

## Ways This Data Can Create Value

### Human audiences

- End users deciding whether they can keep using a favorite app on new hardware.
- IT, procurement, and upgrade planners deciding when a transition is safe.
- Developers and vendors tracking native support gaps and competitive pressure.
- Journalists and analysts covering platform transitions.
- Researchers and historians studying how ecosystems adapt to hardware changes.

### Machine audiences

- Search engines indexing canonical app, category, and comparison pages.
- LLM search products citing your pages as evidence.
- RAG systems consuming public snapshots or APIs.
- Agents answering migration, procurement, or troubleshooting questions.
- Internal `doesitarm` automation using the same canonical public layer as a
  stable reference surface.

### Business-model value

- Audience growth from high-intent compatibility queries.
- Affiliate or sponsored monetization on truly decision-support pages.
- Paid APIs, bulk exports, or enterprise dashboards.
- Vendor intelligence and alerting.
- Historical transition data as a differentiated research asset.

Inference:
The public facts are likely to commoditize over time. The durable value is the
combination of breadth, freshness, provenance, history, and tooling layered on
top of those facts.

## What Should Likely Stay Public

Public-by-default fields:

- stable app identifier and canonical URL
- app name, aliases, vendor, category, platform family
- compatibility status by environment
- environment dimensions such as CPU architecture, OS family/version, native
  vs translation vs virtualization
- bundle IDs and installer/package metadata where safe and user-useful
- last verified date, first seen date, last changed date
- public evidence summary and source links
- changelog summary for status changes
- category and comparison pages built from real user tasks
- curated JSON, CSV, or Parquet snapshot exports
- public structured data and sitemaps

Public page types that seem high-value:

- canonical app pages
- category pages
- "best alternatives if not native yet" pages
- transition pages such as "best native DAWs on Apple Silicon"
- comparison pages by use case, hardware generation, and workaround path
- dataset landing pages for bulk exports

## What Should Likely Stay Private

Private-by-default fields:

- raw crawled HTML and downloaded ZIP/DMG/PKG artifacts
- extracted binaries and quarantine samples
- low-confidence matches and candidate entities
- dedupe, normalization, and scoring heuristics
- reviewer notes, moderation notes, and dispute state
- crawler logs, IP intelligence, WAF rules, and abuse signatures
- affiliate economics, contact records, outreach state, and deal terms
- internal confidence models, embeddings, and experimental feature engineering
- unpublished source mappings and scrape recipes that are costly to build

Why keep these private:

- operational risk
- legal and hosting risk
- abuse resistance
- clearer moat
- lower copyability

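
The public and private field lists above suggest enforcing the split with an allowlist at the publishing boundary, so that any new internal field stays private until someone deliberately exposes it. The sketch below assumes hypothetical key names loosely based on those lists; the real internal schema is not defined in this document.

```python
# Hypothetical allowlist of keys that may appear in public exports.
PUBLIC_FIELDS = frozenset(
    {
        "app_id",
        "canonical_url",
        "name",
        "aliases",
        "vendor",
        "category",
        "platform_family",
        "status_by_environment",
        "bundle_ids",
        "first_seen",
        "last_verified",
        "last_changed",
        "evidence_summary",
        "source_links",
        "changelog_summary",
    }
)


def to_public_record(internal: dict) -> dict:
    """Keep only allowlisted keys; unknown or new keys stay private by default."""
    return {key: value for key, value in internal.items() if key in PUBLIC_FIELDS}


if __name__ == "__main__":
    internal_record = {
        "app_id": "spotify",
        "name": "Spotify",
        "last_verified": "2026-04-04",
        "reviewer_notes": "internal only",  # hypothetical private field
        "match_confidence": 0.62,  # hypothetical private field
    }
    print(to_public_record(internal_record))
```

An allowlist fails closed: forgetting to update it hides a field, whereas forgetting to update a denylist would leak one.
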
## Different Ways To Think About The Database

### 1. Directory / programmatic SEO system

Upside:
- fastest traffic growth if executed well

Downside:
- easiest to drift into thin pages and scaled-content abuse
- weakest long-term moat

Use this frame only if every template answers a real question better than a
generic directory would.

### 2. Public knowledge graph with evidence

Upside:
- strongest fit for search, citations, and trust
- best long-term reuse across Apple, Windows, and future transitions

Downside:
- requires stronger data modeling and provenance discipline

This is the best framing for `doesitarm`.

### 3. Public publication layer over a private intelligence system

Upside:
- best balance of discoverability and defensibility
- easiest path to enterprise/API products later

Downside:
- more operational complexity

This is the recommended operating model.

### 4. Mostly closed database with selective public summaries

Upside:
- strongest direct control over assets

Downside:
- weakest SEO and AI discoverability
- hardest to build brand authority from the data itself

This makes sense only if monetization depends more on closed workflows than on
being the public authority.

## Open Vs Closed Strategy Options

### Option 1. Open facts, private operations

Publish:

- canonical pages
- evidence summaries
- limited exports
- structured data

Keep private:

- raw ingestion
- candidate pipeline
- scoring and ops

Tradeoff:
Best overall balance of discoverability, trust, and defensibility.

### Option 2. Open pages, paid API / paid bulk data

Publish:

- strong pages for discovery and citations
- free lightweight API or delayed snapshots

Charge for:

- real-time API
- higher limits
- historical depth
- enterprise filters and alerts

Tradeoff:
Strong monetization path, but requires clearer product packaging.

### Option 3. Fully open data commons

Publish:

- everything except unsafe raw binaries/secrets

Tradeoff:
Maximum goodwill, citation, and reuse.
Minimum moat unless monetization shifts to services, sponsorship, or community
leadership.

### Option 4. Selective access / crawler monetization layer

Publish:

- normal web pages

Control:

- which bots crawl
- whether training is allowed
- whether some crawlers must pay

Tradeoff:
Promising middle path, especially as crawler monetization standards mature, but
still early and not something to build the whole strategy around yet.

## Recommendation

For `doesitarm`, use Option 1 now, with a path to Option 2 later.

Concretely:

1. Treat the database as transition-agnostic.
   Use dimensions like `platform_family`, `cpu_arch`, `translation_layer`,
   `virtualization_layer`, `os_version`, `artifact_type`, and
   `verification_method` so the same model can cover Apple Silicon, Windows on
   ARM, or the next Apple transition.

2. Build a public canonical facts layer.
   Each app should have a canonical page with:
   status, environments, timestamps, evidence links, and short synthesis.

3. Build a public dataset layer.
   Publish periodic snapshots with dataset landing pages, license, provenance,
   versioning, and download metadata.

4. Keep ingestion and raw evidence private.
   Store raw downloads, scrape traces, matching logic, and low-confidence
   candidates outside the public repo and public site.

5. Add public machine-readable surfaces in this order:
   - `SoftwareApplication`-style entity markup where it truthfully matches page content
   - dataset landing pages plus `Dataset` / `DataDownload` metadata for exports
   - stable JSON or CSV snapshots
   - `llms.txt` and Markdown exports as secondary aids

6. Make public pages citation-friendly.
   Add clear authorship, methodology, "how we know", last verified date, and
   source links.

7. Avoid index bloat.
   Keep canonical entity and high-intent comparison pages indexable.
   Use `noindex` or crawl controls for low-value filter permutations and stale
   or confusing pages.

8. Measure before deciding how open to be.
   Track:
   - Search Console web traffic
   - ChatGPT referral traffic via `utm_source=chatgpt.com`
   - bot traffic by user agent
   - crawl cost versus referral value

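
For the measurement step, a small classifier over access-log fields is enough to start comparing crawl cost with referral value. The sketch below is a simplification under stated assumptions: it matches the bot tokens named earlier by substring, which is not proof of identity (user agents can be spoofed), and the purpose labels are this document's shorthand.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Bot tokens from the OpenAI and Anthropic sections; labels are shorthand.
AI_BOT_PURPOSES = {
    "OAI-SearchBot": "search",
    "GPTBot": "training",
    "ChatGPT-User": "user-action",
    "Claude-SearchBot": "search",
    "ClaudeBot": "training",
    "Claude-User": "user-action",
}


def classify_bot(user_agent: str) -> Optional[str]:
    """Return the stated purpose of a known AI bot token, else None."""
    lowered = user_agent.lower()
    for token, purpose in AI_BOT_PURPOSES.items():
        if token.lower() in lowered:
            return purpose
    return None


def is_chatgpt_referral(landing_url: str) -> bool:
    """True when the landing URL carries utm_source=chatgpt.com."""
    query = parse_qs(urlparse(landing_url).query)
    return "chatgpt.com" in query.get("utm_source", [])


if __name__ == "__main__":
    print(classify_bot("Mozilla/5.0 (compatible; GPTBot/1.1)"))
    print(is_chatgpt_referral("https://example.com/app/spotify/?utm_source=chatgpt.com"))
```

Counting requests per purpose alongside referral visits gives a first-order view of which bots cost more than they return.
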
Inference:
The best long-term moat is not withholding all facts. It is being the most
trusted and most reusable source for those facts, while keeping the expensive
and differentiating machinery private.

## Near-Term Next Steps For doesitarm

1. Add a public data-contract document describing the canonical app entity,
   environment entity, evidence entity, and snapshot dataset.
2. Expand app pages from "status page" to "evidence page":
   include methodology, last verified date, change history, and source
   attribution.
3. Add structured data intentionally:
   entity markup for app pages, dataset markup for exports, not generic markup
   everywhere.
4. Add a public snapshot export and a dataset landing page.
5. Add a bot-policy matrix to `robots.txt` planning:
   Google search, OpenAI search, Anthropic search, training bots, and user bots.
6. Add `llms.txt` only after the public canonical and export layers are clean.
7. Keep filters/search-result pages from becoming the primary indexable surface.

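
As a starting point for the data-contract document in step 1, the entities could be sketched as plain dataclasses. The field names reuse the transition-agnostic dimensions from the recommendation; the class names, the `status` vocabulary, and the example values are assumptions for illustration, not an agreed schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Environment:
    """Where an app was evaluated; no field is Apple-specific."""

    platform_family: str
    cpu_arch: str
    os_version: str
    translation_layer: Optional[str] = None
    virtualization_layer: Optional[str] = None


@dataclass(frozen=True)
class Evidence:
    """One public, citable observation backing a status."""

    source_url: str
    verification_method: str
    artifact_type: str
    verified_on: date


@dataclass
class CompatibilityFact:
    """Status of one app in one environment, with its supporting evidence."""

    app_id: str
    environment: Environment
    status: str  # hypothetical vocabulary, e.g. "native" or "translated"
    evidence: list = field(default_factory=list)

    def last_verified(self) -> Optional[date]:
        """Most recent evidence date, or None when nothing is attached."""
        return max((item.verified_on for item in self.evidence), default=None)


if __name__ == "__main__":
    fact = CompatibilityFact(
        app_id="example-app",
        environment=Environment("macos", "arm64", "15"),
        status="native",
        evidence=[
            Evidence("https://example.com/release-notes", "vendor-statement", "dmg", date(2026, 4, 4))
        ],
    )
    print(fact.last_verified())
```

Deriving "last verified" from evidence rather than storing it separately keeps the displayed date tied to something a reader can check.
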
## Source Links

- Repo context:
  - [README.md](/Users/athena/Code/doesitarm/README.md)
  - [docs/plans/app-discovery-d1-automation.md](/Users/athena/Code/doesitarm/docs/plans/app-discovery-d1-automation.md)
  - [static/robots.txt](/Users/athena/Code/doesitarm/static/robots.txt)
  - [helpers/structured-data.js](/Users/athena/Code/doesitarm/helpers/structured-data.js)
  - [helpers/listing-page.js](/Users/athena/Code/doesitarm/helpers/listing-page.js)
  - [helpers/config-node.js](/Users/athena/Code/doesitarm/helpers/config-node.js)
  - [static/api/app/spotify.json](/Users/athena/Code/doesitarm/static/api/app/spotify.json)

- Google AI features and AI search:
  - https://developers.google.com/search/docs/appearance/ai-features
  - https://developers.google.com/search/blog/2025/05/succeeding-in-ai-search
  - https://developers.google.com/search/docs/fundamentals/creating-helpful-content
  - https://developers.google.com/search/docs/essentials/spam-policies
  - https://developers.google.com/search/docs/fundamentals/using-gen-ai-content

- Google review and structured-data guidance:
  - https://developers.google.com/search/docs/appearance/reviews-system
  - https://developers.google.com/search/docs/specialty/ecommerce/write-high-quality-reviews
  - https://developers.google.com/search/docs/appearance/structured-data/software-app
  - https://developers.google.com/search/docs/appearance/structured-data/dataset
  - https://developers.google.com/search/docs/appearance/structured-data/sd-policies
  - https://developers.google.com/search/docs/crawling-indexing/crawling-managing-faceted-navigation
  - https://developers.google.com/search/docs/crawling-indexing/block-indexing

- Schema and dataset modeling:
  - https://schema.org/SoftwareApplication

- OpenAI:
  - https://help.openai.com/en/articles/12627856-publishers-and-developers-faq
  - https://help.openai.com/en/articles/9237897-chatgpt-search
  - https://platform.openai.com/docs/gptbot

- Anthropic:
  - https://support.claude.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
  - https://docs.anthropic.com/en/docs/build-with-claude/search-results

- Cloudflare:
  - https://developers.cloudflare.com/style-guide/how-we-docs/ai-consumability/
  - https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/
  - https://blog.cloudflare.com/introducing-pay-per-crawl/

- Discovery and freshness:
  - https://www.indexnow.org/index

- Practitioner and discussion context:
  - https://simonwillison.net/2025/Oct/24/claude-code-docs-map/
  - https://news.ycombinator.com/item?id=41072549
  - https://lobste.rs/s/dmuad3/mitigating_sourcehut_s_partial_outage

## Source Quality Notes

- Google Search Central, OpenAI, Anthropic, Schema.org, IndexNow, and Cloudflare
  were the primary sources for current guidance.
- The HN and Lobsters links were useful for operator sentiment and failure modes,
  not as primary authority for ranking behavior.
- `llms.txt` is real and increasingly implemented, but the strongest
  current evidence still says it is supplemental rather than foundational.