Liberating Trapped Data
You own documents you can't search. Data you can see but can't use. The coherentist move: build the extraction pipeline once and stop paying friction tax forever.
You own it. You just can't use it.
There's a particular kind of frustration that accumulates in the background. You have a PDF—a contract, a research paper, a scanned receipt, tax documents from three years ago. You can see the text. You can read it with your eyes. But you can't search it. Can't quote it. Can't copy the relevant paragraph into the document you're actually writing.
The data is yours. You paid for it, created it, received it. And yet you can't access it in any useful way. It's trapped behind glass.
You experience this as a minor inconvenience. A few minutes here and there. Manually retyping what's already written. Squinting at a scan to find the number you need. It doesn't feel like a significant cost because each instance is small.
But here's what the small costs hide: they change your behavior.
The Invisible Narrowing
Friction doesn't just slow you down. It shapes what you attempt.
When searching a PDF takes more effort than it's worth, you stop searching PDFs. When quoting a scanned document means retyping it, you quote less often. When your archive is visible but not accessible, you use your archive less.
This is the real cost of trapped data—not the minutes lost to manual extraction, but the invisible narrowing of what you even try to do. The ideas you don't pursue because the source material is too annoying to work with. The connections you don't make because your documents can't talk to each other.
You adjust to the friction without noticing. The trap isn't just access. It's the slow forgetting that access was ever the point.
The Recurring Tax
Here's how you deal with trapped data now: you pay the friction tax every time you need something from it.
Need that clause from the contract? Open the PDF, squint, retype. Need to quote that research paper? Screenshot, OCR it with whatever tool is handy, clean up the errors, paste.
Each transaction is small. But it recurs. And recurring small costs compound into something significant—not just in time, but in the relationship you develop with your own information.
You start to think of your archive as a burden rather than a resource. Something to manage rather than something to use. The documents that should extend your memory become weights that slow your thinking.
This is the trap within the trap: you learn to work around your own data instead of with it.
The One-Time Payment
The coherentist move is different. Instead of paying the friction tax repeatedly, you pay once to eliminate it permanently.
Build the extraction pipeline. Run OCR against your archive. Transform the PDFs you can see into the text you can search. Convert trapped documents into accessible data.
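The steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive pipeline: `extract_text` is a hypothetical callable you would supply (for example, a thin wrapper around whichever OCR tool you choose), and the names `open_index`, `liberate`, and `search` are invented for illustration. It stores extracted text in SQLite and skips files whose content has not changed, so each document is processed only once.

```python
import hashlib
import sqlite3
from pathlib import Path
from typing import Callable, List


def open_index(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the table that holds extracted text."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs ("
        "path TEXT PRIMARY KEY, digest TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn


def liberate(archive: Path, conn: sqlite3.Connection,
             extract_text: Callable[[Path], str],
             pattern: str = "*.pdf") -> int:
    """Extract text from new or changed files; return how many were processed."""
    processed = 0
    for path in sorted(archive.rglob(pattern)):
        # Hash the file so an unchanged document is never extracted twice.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        row = conn.execute(
            "SELECT digest FROM docs WHERE path = ?", (str(path),)
        ).fetchone()
        if row is not None and row[0] == digest:
            continue  # unchanged since the last run: already paid for
        body = extract_text(path)  # hypothetical: your OCR wrapper goes here
        conn.execute(
            "INSERT OR REPLACE INTO docs (path, digest, body) VALUES (?, ?, ?)",
            (str(path), digest, body),
        )
        processed += 1
    conn.commit()
    return processed


def search(conn: sqlite3.Connection, phrase: str) -> List[str]:
    """Return paths whose text contains the phrase (ASCII case-insensitive)."""
    rows = conn.execute(
        "SELECT path FROM docs "
        "WHERE instr(lower(body), lower(?)) > 0 ORDER BY path",
        (phrase,),
    )
    return [row[0] for row in rows]
```

In practice `extract_text` might shell out to a command-line OCR tool or call a hosted OCR API. Which one is an open choice, and the sketch does not assume either. Passing a file path instead of `":memory:"` to `open_index` keeps the index between runs, which is what makes the payment one-time.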
Simon Willison has been mapping this territory for years. The tooling exists: browser-based OCR, command-line utilities that process entire folders, AI-powered extraction that handles messy scans. Mistral's specialist OCR model processes a thousand pages for a dollar. Open-source alternatives such as olmOCR run locally for the cost of electricity. The barrier isn't cost or complexity. It's recognizing that the barrier exists at all.
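To make the pricing concrete, here is a back-of-the-envelope estimate. It takes the thousand-pages-per-dollar figure quoted above at face value; actual pricing may differ or change. The function `estimate_cost` and the 12,000-page archive are hypothetical.

```python
def estimate_cost(pages: int, pages_per_dollar: float = 1000.0) -> float:
    """Rough OCR cost in dollars at a flat per-page rate (an assumption)."""
    if pages < 0 or pages_per_dollar <= 0:
        raise ValueError("pages must be >= 0 and pages_per_dollar must be > 0")
    return pages / pages_per_dollar


# Hypothetical archive: 400 documents averaging 30 pages each.
print(estimate_cost(400 * 30))  # 12.0
```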
This isn't about the specific tools. It's about the relationship with your own information. The question isn't "how do I extract text from this PDF?" It's "why am I still paying to access data I already own?"
The pattern: identify recurring friction, eliminate it permanently, never solve the same problem twice.
Liberation as Infrastructure
Consider what changes when your archive becomes searchable.
Your past thinking becomes available to your present thinking. The notes from three projects ago can surface when they're relevant. The contract clause you vaguely remember can be found instead of reconstructed. Your documents become companions instead of containers: they participate in your work instead of sitting inert until manually activated.
This is the infrastructure mindset: the effort you invest in access pays dividends every time you reach for a document. The hour spent building an extraction pipeline can save hundreds of hours of manual friction across every future use.
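The arithmetic behind that trade can be sketched as a break-even calculation. The numbers are illustrative assumptions, not measurements: a 60-minute build and 5 minutes lost per manual lookup. The sketch also assumes a lookup costs roughly nothing once the pipeline exists, which overstates the savings a little.

```python
import math


def break_even_uses(build_minutes: float, minutes_per_manual_lookup: float) -> int:
    """Number of lookups after which the one-time build has paid for itself."""
    if build_minutes < 0 or minutes_per_manual_lookup <= 0:
        raise ValueError("build_minutes must be >= 0 and lookup cost must be > 0")
    return math.ceil(build_minutes / minutes_per_manual_lookup)


def minutes_saved(uses: int, build_minutes: float,
                  minutes_per_manual_lookup: float) -> float:
    """Net minutes saved after `uses` lookups (negative before break-even)."""
    return uses * minutes_per_manual_lookup - build_minutes


print(break_even_uses(60, 5))     # 12
print(minutes_saved(200, 60, 5))  # 940
```

Under these assumptions the pipeline pays for itself on the twelfth lookup, and every lookup after that is pure return.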
But more than the time saved, there's a qualitative shift in how you work. When your data is accessible, you develop different habits. You search before you reconstruct. You connect before you isolate. You build on what you've already done instead of starting from scratch because starting from scratch is easier than finding what exists.
The infrastructure isn't just efficient. It changes what's possible.
The Compound Effect
Once you build the extraction pipeline, something shifts. You don't just have searchable documents. You develop a different relationship with information entirely.
You archive more generously, knowing archived data remains useful. You connect ideas across sources, because the sources are accessible. You build on past work, because past work isn't buried under friction.
The pipeline creates possibilities that create more pipelines. The habit of liberation compounds into a different way of being informed.
This is the coherentist principle at work: small structural changes create cascading effects. What looks like a utility—extracting text from PDFs—is actually an infrastructure decision about how your mind extends into your archive.
The Question
You have documents you can see but can't search. Data you own but can't use. Every time you manually extract what's already there, you're paying a tax on your own information.
The tools exist. The process is understood. The one-time investment is trivial compared to the recurring cost.
The only question is whether you keep paying the friction tax—or whether you build the pipeline once and liberate your data for good.
Solve once, reuse forever. That's not optimization. That's agency.
Sources: Simon Willison's work on OCR tools, Mistral OCR, olmOCR, and patterns of AI-powered document extraction