Voices of Search Podcast
Episode: Blocking LLMs from Proprietary Data?
Date: April 2, 2026
Host: Tyson
Guest: Kaspar Siminsky, Senior Director at Search Brothers and former member of the Google Search team
Episode Overview
In this concise episode, Tyson and Kaspar Siminsky discuss the challenges and realities of keeping proprietary data away from Large Language Models (LLMs) and crawlers, especially for enterprise-level organizations. The conversation addresses the binary nature of web visibility and offers practical advice for those concerned about sensitive or proprietary content.
Key Discussion Points & Insights
The Binary Nature of Web Accessibility
Defining Proprietary Data
- Kaspar opens by questioning how proprietary the data really is: if it's truly sensitive or critical, it arguably shouldn't be available on the public web at all.
Visibility Equals Crawlability
- "If it's public, if it's accessible, it's going to get crawled." — Kaspar Siminsky (02:25)
- The web operates on a clear binary: content is either available to all, including bots and LLMs, or it's entirely restricted.
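As a concrete illustration of that binary (not something walked through in the episode), the "entirely restricted" side means an actual access-control gate at the server, not a crawler hint. A minimal nginx sketch, assuming a hypothetical /proprietary/ path and a standard htpasswd credentials file:

```
# Hypothetical sketch: content behind this gate sits on the "restricted"
# side of the binary -- every request, human or bot, must authenticate.
location /proprietary/ {
    auth_basic           "Restricted area";
    auth_basic_user_file /etc/nginx/.htpasswd;  # credentials checked on every request
}
```

Anything served without such a gate sits on the public side, where, as Kaspar notes, crawling is a question of when, not if.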
The Inevitability of Leaks
- Even with various crawling restrictions, there's always a risk:
- "If it's crawled by some bots, chances are it's going to leak." — Kaspar Siminsky (02:25)
Practical Approach for Enterprises
Protecting Truly Proprietary Data
- The most effective approach: don’t put sensitive content online if you want to guarantee it stays off LLMs and away from crawlers.
Limits of Technical Barriers
- Robots.txt, CAPTCHAs, and other barriers can help, but they aren't foolproof against determined actors or evolving crawling strategies.
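For reference (an illustration, not something covered step-by-step in the episode), the typical first barrier is a robots.txt file naming the AI crawlers' published user-agent tokens. The tokens below are the ones the major vendors document; crucially, robots.txt is a voluntary convention, which is exactly the weakness Kaspar points to:

```
# Hypothetical robots.txt sketch: asks documented AI crawlers to stay out.
# Compliance is voluntary -- this is a request, not an enforcement mechanism.
User-agent: GPTBot           # OpenAI
Disallow: /

User-agent: ClaudeBot        # Anthropic
Disallow: /

User-agent: CCBot            # Common Crawl
Disallow: /

User-agent: Google-Extended  # Google AI training opt-out
Disallow: /
```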
Notable Quotes & Memorable Moments
On the Hypothetical of Blocking LLMs:
- "If it's really proprietary and if it's something that we do not want to get scraped and crawled, ultimately then it shouldn't be accessible in the first place."
— Kaspar Siminsky (00:41)
- "If it's really proprietary and if it's something that we do not want to get scraped and crawled, ultimately then it shouldn't be accessible in the first place."
On Public Content Risks:
- "If it's public, if it's accessible, it's going to get crawled. Ultimately, it's kind of like a binary choice."
— Kaspar Siminsky (02:25)
- "If it's public, if it's accessible, it's going to get crawled. Ultimately, it's kind of like a binary choice."
Important Segment Timestamps
- 00:22: Episode premise; introduction of guest Kaspar Siminsky, his credentials, and the topic at hand.
- 00:41: Discussion begins on what it means to block LLMs from proprietary data.
- 02:25: Kaspar outlines the binary nature of online content and the risk of exposure if made public.
Tone & Takeaway
The conversation is frank and practical—Kaspar avoids technical jargon and opts for real-world logic: Unless you're willing to keep your proprietary data offline, you must assume that it could eventually be accessed by LLMs and crawlers. For enterprises especially, this means rethinking which assets are truly suitable for online exposure.
Bottom line: If you don't want it scraped, don’t let it be visible online—no technical fix is foolproof.
Summary At-a-Glance
- If it's on the web, it may be accessed by LLMs/crawlers—there’s no middle ground.
- Protect truly sensitive information by keeping it offline.
- Technical solutions offer only limited protection; the best defense is not publishing.
- Enterprise decision-makers must make careful, binary choices about data exposure.
For more information about Kaspar Siminsky, visit Search Brothers or check the show notes for his LinkedIn profile.
