Casting Spells on Transcripts

Jan 12, 2026

Disclaimer: This article was copy-pasted from GPT 5.2 output, but it is a distillation of a VERY lengthy debate with it while I thought through what I wanted. TBH I think AI outputs are fine as long as they are a transformation of significant human input. Maybe worth an article on its own… anyway, on to the article…

Some days it’s “I need a better read-it-later.” Other days it’s “I need a bookmarking system.” Or “I should get serious about RSS again.” Or “maybe I just need a better notes app.” And if you squint hard enough, you can always convince yourself the answer is one more tool.

But the thing I actually want isn’t another place to put links.

What I want is a way to take a link — especially a long YouTube video — and turn it into useful, copyable text that I can operate on with all the self-hosted AI stuff I already run. I’m not trying to “save the internet.” I’m trying to compile it.

And I’m demanding about it in a way that makes most “product” answers feel wrong.

Because the moment the solution drifts into “store everything forever,” I can already see the future. I’ll be running a media archive. I’ll be dealing with retention policy and backups and disk growth and broken downloads and metadata drift. I’ll have invented my own part-time job and called it productivity.

I don’t want a second brain. I want a wand.

The shape of the itch

The mobile part matters more than people admit.

Most of my “this might be important” moments happen on my phone: I’m in YouTube, or a browser, or a random app that opens a link, and I want to send that thing somewhere without breaking my flow.

“Somewhere” is usually a queue. But I don’t want the queue to be the end state. I want the queue to be the front door to transformation.

The outcome I care about is embarrassingly simple:

full transcript, copyable
a decent TL;DR I can paste
maybe a few derived views (tasks, claims, glossary) depending on mood
optionally: embed it so later I can query it like a corpus

If I can do that from my phone, then my homelab GPU box stops being a science project and starts being a daily tool.

Two traps I’m trying not to walk into

The first trap is archive gravity.

It starts innocently: “I’ll just download the audio for Whisper.” Then you realize the tool can also download the video. And maybe you keep it. And maybe you keep the mp3 too. And you add thumbnails. And metadata. And now you’re building a library, not a pipeline.

I don’t want to preserve the original content. I want to preserve the work product: the transcript and whatever I derived from it.

The second trap is bespoke pipeline rot.

The more you build a fully custom ingestion system, the more you’re signing up to maintain it. If you’ve done enough infra you know the exact shape of the decay:

job orchestration
retries and idempotence
concurrency and backpressure
secret management
monitoring, failure modes, “why did this run twice”
schema changes, tool upgrades, scraped-site breakage

The first time you wire up yt-dlp → mp3 → whisper → summary → embeddings, you feel like you’re printing money.

The tenth time you’re debugging it from your phone while standing in line somewhere, it becomes obvious you built a thing that requires attention. And attention is the one resource this whole project is supposed to save.

So I keep looking for a design that stays small, stays legible, and doesn’t turn into an obsession.

The idea that finally clicked: spells

This is the part where it stopped feeling like “workflow automation” and started feeling like something I’d actually use.

I don’t want a system that tries to guess what I want and runs a bunch of stuff automatically.

I want buttons.

I want to be able to say: do this to that.

transcribe
summarize
extract tasks
ingest into RAG (maybe into different “collections”)
generate a few views I can paste elsewhere

Calling those “spells” is corny in the best way, because it forces the right constraints. A spell is explicit. It’s intentional. It’s something you ask for, not something that happens to you.

And once you accept that, the follow-on idea becomes obvious:

Spells should be composable.

Not in a “draw a spaghetti graph in a GUI” way. In a dependency sense.

If I press “RAG ingest,” the system shouldn’t shrug and fail because there’s no transcript yet. It should know that ingesting requires chunking, and chunking requires a transcript (or at least readable text), and it should build what it needs.

That’s not “AI automation.” That’s dependency resolution. That’s a build system.
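As a rough sketch of what that dependency resolution could look like in Python: the spell names and the REQUIRES table below are made up for illustration, and nothing here is implemented anywhere yet. It is just a depth-first walk that lists prerequisites before the spell that needs them.

```python
# Hypothetical spell graph: each spell lists the spells it depends on.
REQUIRES = {
    "transcribe": [],
    "summarize": ["transcribe"],
    "extract_tasks": ["transcribe"],
    "chunk": ["transcribe"],
    "rag_ingest": ["chunk"],
}


def plan(target, requires=REQUIRES):
    """Return the spells needed for `target`, prerequisites first."""
    ordered = []
    visiting = set()

    def visit(spell):
        if spell in ordered:
            return  # already scheduled
        if spell in visiting:
            raise ValueError(f"dependency cycle at {spell!r}")
        visiting.add(spell)
        for dep in requires[spell]:
            visit(dep)
        visiting.discard(spell)
        ordered.append(spell)

    visit(target)
    return ordered


print(plan("rag_ingest"))  # ['transcribe', 'chunk', 'rag_ingest']
```

Pressing "RAG ingest" on a fresh link would then run transcription and chunking first, without me having to ask for them.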

Which leads to the actual primitive I’ve been missing in all these tools:

Artifacts, not fields

Most bookmarking / read-later tools give you one place to put “notes,” and maybe one summary field if they’re feeling fancy.

That’s not enough. It’s the wrong shape.

I want a bookmark to accumulate named outputs:

transcript.txt — the thing I actually want to copy
transcript.vtt — timestamps, because timestamps are power
tldr.md — a summary that’s clearly a derived artifact, not “the truth”
tasks.md — if I’m in that mood
index.md — a tiny table-of-contents so the bookmark becomes a shelf, not a blob

If the system can attach multiple assets per bookmark, suddenly everything else falls into place:

the bookmark is the anchor (URL + metadata)
assets are the outputs (the only thing I care about keeping)
the pipeline produces assets
downstream spells consume assets

And because assets are explicit, the “skip if already exists” behavior becomes clean. The pipeline doesn’t need a mystical global state. It can look at what’s attached, decide what’s missing, and proceed.
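A minimal sketch of that "look at what's attached" check, assuming a bookmark is just a URL plus a dictionary of named assets. The Bookmark class and missing_assets helper are hypothetical names, not the API of any existing bookmarking tool.

```python
from dataclasses import dataclass, field


@dataclass
class Bookmark:
    url: str
    # Asset name -> content. In a real setup the values would more likely
    # be object-storage keys than inline text (an assumption of this sketch).
    assets: dict = field(default_factory=dict)


def missing_assets(bookmark, wanted):
    """Return the wanted asset names that are not attached yet, in order."""
    return [name for name in wanted if name not in bookmark.assets]


bm = Bookmark(url="https://example.com/watch?v=abc")
bm.assets["transcript.txt"] = "full transcript text"

print(missing_assets(bm, ["transcript.txt", "tldr.md", "tasks.md"]))
# ['tldr.md', 'tasks.md']
```

The only state consulted is the bookmark itself, which is what keeps "skip if already exists" from needing a separate job database.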

This is the moment I realized I wasn’t searching for a “better summary field.” I was searching for a place to put build outputs.

Thin inbox, thick outputs

Once I had the artifact idea, my earlier discomfort with “self-hosted everything” started to sharpen into a requirement.

The inbox has to stay thin.

I don’t want my bookmarking tool to become my archive. I don’t want it to silently store raw MP4s and MP3s forever. I don’t want “convenient defaults” that turn into a storage tax.

But I do want my derived outputs to persist. That’s the point. Those are small, stable, and actually useful later.

That’s why object storage suddenly becomes part of the story. If the system can attach assets and store them in something MinIO/S3-shaped, then the inbox stays light and the outputs scale without me thinking about disk layout.

The raw media can be ephemeral. Download, process, delete. The transcript is the durable artifact.

That feels right.
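One way to make "download, process, delete" hard to get wrong is to do the work inside a temporary directory that cleans itself up. The sketch below uses Python's standard tempfile module; the download_audio and transcribe callables are placeholders for whatever tools actually do the work (yt-dlp and Whisper in my head, but their invocation is deliberately left out here), and the fake versions exist only so the example runs.

```python
import tempfile
from pathlib import Path


def transcript_from_url(url, download_audio, transcribe):
    """Download into a scratch directory, transcribe, keep only the text.

    `download_audio(url, dest_dir)` must return the path of the audio file;
    `transcribe(path)` must return the transcript as a string. Both are
    hypothetical hooks, not real library calls.
    """
    with tempfile.TemporaryDirectory(prefix="spell-") as scratch:
        audio_path = download_audio(url, Path(scratch))
        text = transcribe(audio_path)
    # The scratch directory and the raw media are gone at this point.
    return text


# Stand-ins so the sketch runs without any network or GPU.
seen = []


def fake_download(url, dest_dir):
    path = dest_dir / "audio.mp3"
    path.write_bytes(b"not really audio")
    seen.append(path)
    return path


def fake_transcribe(path):
    return f"hello from {path.name}"


result = transcript_from_url("https://example.com/v", fake_download, fake_transcribe)
print(result)            # hello from audio.mp3
print(seen[0].exists())  # False
```

The transcript survives as a return value; the media never gets a chance to become part of an archive.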

The mobile UX problem: I refuse to type tags

There’s a version of this where “spells” are invoked by typing a tag like !transcribe.

That works. It’s also not how I want to live.

If the whole thing is meant to reduce friction, then typing little incantations on my phone is a self-own. It turns “quick share” into “mini admin task.”

So the UX I actually want is stupid-simple:

Share → a tiny “Spells” target opens → big buttons:

Transcript + TL;DR
Transcript only
RAG → Home
Tasks

Tap one. Done. Queue it. Go back to my life.

This is where a share-target PWA (or some minimal intermediate share receiver) becomes interesting. Not because I want yet another UI, but because I want the UI to be exactly the size of my intent.

A tiny front door that lets me pick a spell bundle without typing anything.

Everything else can happen on the server.

Where orchestration fits (and where it doesn’t)

At this point the temptation is to start building “the perfect runner.”

I’m trying to avoid that.

The runner’s job is boring:

accept an intent (“cast this bundle on that URL”)
resolve dependencies based on existing artifacts
run the necessary steps
write assets back

That’s it.
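Those four steps fit in a small function. This is a sketch under stated assumptions, not a design for Windmill or any other runner: the BUNDLES and NEEDS tables, the Intent class, and the builder callables are all invented for illustration, and NEEDS is assumed to be acyclic.

```python
from dataclasses import dataclass

# Hypothetical bundles: button label -> assets that bundle should produce.
BUNDLES = {
    "Transcript only": ["transcript.txt"],
    "Transcript + TL;DR": ["transcript.txt", "tldr.md"],
}

# Hypothetical prerequisites: asset -> assets it is built from (no cycles).
NEEDS = {"transcript.txt": [], "tldr.md": ["transcript.txt"]}


@dataclass(frozen=True)
class Intent:
    url: str
    bundle: str


def cast(intent, attached, builders):
    """Build whatever the bundle needs that is not attached yet.

    `attached` maps asset names to content and is updated in place.
    Returns the names of the assets built during this call.
    """
    built = []

    def ensure(name):
        if name in attached:
            return  # already there, so a re-run does nothing
        for dep in NEEDS[name]:
            ensure(dep)
        attached[name] = builders[name](intent.url, attached)
        built.append(name)

    for name in BUNDLES[intent.bundle]:
        ensure(name)
    return built


# Placeholder builders standing in for real transcription and summarization.
builders = {
    "transcript.txt": lambda url, assets: f"(transcript of {url})",
    "tldr.md": lambda url, assets: "TL;DR: " + assets["transcript.txt"],
}

attached = {}
intent = Intent(url="https://example.com/watch?v=abc", bundle="Transcript + TL;DR")
first = cast(intent, attached, builders)
second = cast(intent, attached, builders)
print(first)   # ['transcript.txt', 'tldr.md']
print(second)  # []
```

Running the same intent twice is harmless because the second pass finds everything already attached, which is the cheap kind of idempotence I'm after.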

I don’t need a cathedral. I need something restart-safe, idempotent, and not cute.

This is where Windmill entered the conversation for me: not because I want a workflow GUI, but because it already has the primitives that bespoke pipelines tend to re-implement poorly: jobs, retries, and a notion of a flow. If I can treat each spell as a reusable primitive and use flows as bundles, that’s attractive. It’s not magical; it’s just offloading plumbing.

But I’m intentionally keeping the design portable in my head. If I decide Windmill isn’t the right fit later, the mental model still stands: artifacts, dependencies, manual-first spells.

What I’ve actually decided (so far)

I haven’t implemented the stack. This is me admitting I’ve been thinking the issue to death and finally got to something that feels like a stable shape.

The shape looks like:

an inbox that can hold links and attach assets
thin by default (don’t hoard raw media)
assets stored in object storage
spells are manual-first and invoked from the phone
bundles are just “targets” that pull prerequisites
artifacts are the source of truth; “summary fields” are just projections if I want them

That’s the whole thing.

It’s not a tutorial, because I don’t have the scar tissue yet. It’s a distillation of what I’ve hashed out — including with ChatGPT — because the framing shift (from “bookmarks” to “build outputs”) is what I think is worth sharing.

I can tell when I’m onto something because it reduces the design, not expands it.

The moment I stopped trying to build a better place to save links and started trying to build a better way to produce artifacts from them, the rest of the decisions became almost boring.

And boring is exactly what I want from infrastructure that’s supposed to be used casually, from a phone, without turning me into a caretaker.

blog.bios.dev
