LinkToTale — YouTube cartoon to picture book (CLI)
TL;DR
Paste a YouTube link, get a self-contained HTML picture book. The CLI downloads the cartoon, extracts frames, detects scene changes, and uses GPT-4.1 to write a 25–35 page children’s story around the real timeline — in both Polish and English. Cost: ~$0.50 per 12-minute cartoon.
Source code: github.com/rafalmcichon/linktotale (MIT license)
Context
- Stack: Node.js, `yt-dlp`, `ffmpeg`, `sharp`, OpenAI API (GPT-4.1 + GPT-4.1-mini Vision)
- Single file: ~1,100 lines, no framework, no bundler
- Output: two standalone HTML files (`book-pl.html`, `book-en.html`) — open in a browser, no server needed
What I observed
My son loves his cartoons. But even 10–15 minutes a day was enough to make him noticeably restless afterward — harder to settle, harder to focus. Cutting cartoons entirely felt unrealistic (he already knew the characters, the stories meant something to him).
What worked better: we’d read a book about the same characters, then watch a short clip — or reverse the order. Afterward we’d rebuild the scenes with Lego, acting out what happened. That loop — read, watch, play — kept him calm and engaged in a way that pure screen time never did.
The gap: the stories he loved existed only as video. There was no book version to hold, read together, or use as a starting point for play.
What I built
A CLI pipeline that turns any subtitled YouTube cartoon into a picture book:
1) Download + extract
- `yt-dlp` pulls the video and both PL/EN subtitle tracks
- `ffmpeg` extracts one frame per second and overlays subtitle text on matching frames
2) Scene intelligence
- `sharp` computes perceptual hashes (16×16 grayscale) to detect scene changes (>12% diff)
- Dialogue gaps (3+ seconds of silence) are identified — these are action-only moments that need narration
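The two checks above are simple to express. A minimal sketch, assuming each frame's hash is the 16×16 grayscale raw buffer `sharp` can produce (e.g. `sharp(file).resize(16, 16).grayscale().raw().toBuffer()`), and interpreting ">12% diff" as the fraction of pixels that changed noticeably — both names and thresholds here are illustrative:

```javascript
// Fraction of the 256 hash pixels that differ by more than a per-pixel
// tolerance between two frames (pixelTol of 16 is an illustrative guess).
function frameDiff(a, b, pixelTol = 16) {
  let changed = 0;
  for (let i = 0; i < a.length; i++) {
    if (Math.abs(a[i] - b[i]) > pixelTol) changed++;
  }
  return changed / a.length;
}

// A new scene starts when more than 12% of the pixels changed.
const isSceneChange = (a, b) => frameDiff(a, b) > 0.12;

// Find silences of 3+ seconds between subtitle cues — the action-only
// moments the story model has to narrate without dialogue to lean on.
function dialogueGaps(cues, minGap = 3) {
  const gaps = [];
  for (let i = 1; i < cues.length; i++) {
    if (cues[i].start - cues[i - 1].end >= minGap) {
      gaps.push({ from: cues[i - 1].end, to: cues[i].start });
    }
  }
  return gaps;
}
```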
3) AI story generation
- GPT-4.1-mini Vision analyzes candidate frames: is this important (new action, emotion, location) or skippable (logo, credits, black)?
- GPT-4.1 receives the full chronological timeline + up to 10 key frames, then writes a proper children’s story as JSON (one page = one frame + text)
- A second Vision pass verifies each page got the best-matching frame, not just the default pick
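To make the frame-classification step concrete, here is an illustrative sketch of the kind of request body it implies, assuming the OpenAI chat-completions format with `image_url` content parts; the repo's actual prompts, schema, and model strings may differ:

```javascript
// Illustrative request builder for the "important or skippable?" Vision
// pass. No network call is made here; this only shapes the payload.
function classifyFrameRequest(frameDataUrl) {
  return {
    model: "gpt-4.1-mini",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text:
              "Is this cartoon frame important (new action, emotion, " +
              "location) or skippable (logo, credits, black screen)? " +
              'Reply as JSON: {"important": true|false, "reason": "..."}',
          },
          { type: "image_url", image_url: { url: frameDataUrl } },
        ],
      },
    ],
  };
}

// Shape of one generated page (one frame + bilingual text), as the
// "one page = one frame + text" JSON contract suggests:
const examplePage = { frame: "frame_0042.jpg", text_pl: "…", text_en: "…" };
```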
4) HTML output
- Two self-contained HTML books with page-by-page navigation, keyboard/swipe support, animated cover, and a styled “The End” screen
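Keyboard and swipe navigation both reduce to the same clamped page-flip logic, which the embedded script in each book needs. A minimal sketch (wiring names are illustrative):

```javascript
// Go delta pages forward/back, clamped to [0, totalPages - 1].
function flip(current, delta, totalPages) {
  return Math.min(totalPages - 1, Math.max(0, current + delta));
}

// In the generated HTML this would be wired up roughly like:
// document.addEventListener("keydown", (e) => {
//   if (e.key === "ArrowRight") page = flip(page, +1, pages.length);
//   if (e.key === "ArrowLeft")  page = flip(page, -1, pages.length);
//   render(page);
// });
```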
What makes it work
- Factual accuracy — the AI prompts enforce chronological fidelity and penalize hallucination; the story follows what actually happens in the cartoon
- Vision verification — every page is checked against surrounding frames to ensure visual-text alignment
- Caching — frame analyses are stored in JSON so reruns skip redundant API calls
- `--book` flag — regenerate just the book without re-downloading or re-extracting
Results
- A 12-minute cartoon produces a complete bilingual picture book in under 10 minutes for ~$0.50
- The books are something we actually use — we read them at bedtime, and my son recognizes every scene
- The read-watch-play loop became easier to sustain because the “read” part finally existed
What I took from it
- The problem wasn’t screen time in the abstract. It was that the content my son loved had no form he could interact with away from the screen.
- Converting existing stories into books — ones he already had feelings about — worked better than trying to replace them with something new.
- Minimal tech was the right call. Single file, no dependencies beyond what’s needed. Complexity was wrong here — for the tool and for the parenting approach behind it.
What I’d do next
- Add support for cartoons without subtitles (pure vision-based narration from frame analysis alone).
- Experiment with page-level audio narration (text-to-speech) for kids who can’t read yet.
- Let the child pick a character and generate a “what happens next” continuation as a new book.
- Package it as a simple web app so non-technical parents can use it.