LinkToTale — YouTube cartoon to picture book (CLI)
TL;DR
Paste a YouTube link, get a self-contained HTML picture book. The CLI downloads the cartoon, extracts frames, detects scene changes, and uses GPT-4.1 to write a 25–35 page children’s story around the real timeline — in both Polish and English. Cost: ~$0.50 per 12-minute cartoon.
Source code: github.com/rafalmcichon/linktotale (MIT license)
Context
- Stack: Node.js, `yt-dlp`, `ffmpeg`, `sharp`, OpenAI API (GPT-4.1 + GPT-4.1-mini Vision)
- Single file: ~1,100 lines, no framework, no bundler
- Output: two standalone HTML files (`book-pl.html`, `book-en.html`) — open in a browser, no server needed
What I observed
My son loves his cartoons. But even 10–15 minutes a day was enough to make him noticeably restless afterward — harder to settle, harder to focus. Cutting cartoons entirely felt unrealistic (he already knew the characters, the stories meant something to him).
What worked better: we’d read a book about the same characters, then watch a short clip — or reverse the order. Afterward we’d rebuild the scenes with Lego, acting out what happened. That loop — read, watch, play — kept him calm and engaged in a way that pure screen time never did.
The gap: the stories he loved existed only as video. There was no book version to hold, read together, or use as a starting point for play.
What I built
A CLI pipeline that turns any subtitled YouTube cartoon into a picture book:
1) Download + extract
- `yt-dlp` pulls the video and both PL/EN subtitle tracks
- `ffmpeg` extracts one frame per second and overlays subtitle text on matching frames
2) Scene intelligence
- `sharp` computes perceptual hashes (16×16 grayscale) to detect scene changes (>12% diff)
- Dialogue gaps (3+ seconds of silence) are identified — these are action-only moments that need narration
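The two checks above are simple to express. A minimal sketch, assuming each frame's hash is the 16×16 grayscale raw buffer `sharp` can produce (e.g. `sharp(file).resize(16, 16).grayscale().raw().toBuffer()`), and interpreting ">12% diff" as the fraction of pixels that changed noticeably — both names and thresholds here are illustrative:

```javascript
// Fraction of the 256 hash pixels that differ by more than a per-pixel
// tolerance between two frames (pixelTol of 16 is an illustrative guess).
function frameDiff(a, b, pixelTol = 16) {
  let changed = 0;
  for (let i = 0; i < a.length; i++) {
    if (Math.abs(a[i] - b[i]) > pixelTol) changed++;
  }
  return changed / a.length;
}

// A new scene starts when more than 12% of the pixels changed.
const isSceneChange = (a, b) => frameDiff(a, b) > 0.12;

// Find silences of 3+ seconds between subtitle cues — the action-only
// moments the story model has to narrate without dialogue to lean on.
function dialogueGaps(cues, minGap = 3) {
  const gaps = [];
  for (let i = 1; i < cues.length; i++) {
    if (cues[i].start - cues[i - 1].end >= minGap) {
      gaps.push({ from: cues[i - 1].end, to: cues[i].start });
    }
  }
  return gaps;
}
```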
3) AI story generation
- GPT-4.1-mini Vision analyzes candidate frames: is this important (new action, emotion, location) or skippable (logo, credits, black)?
- GPT-4.1 receives the full chronological timeline + up to 10 key frames, then writes a proper children’s story as JSON (one page = one frame + text)
- A second Vision pass verifies each page got the best-matching frame, not just the default pick
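To make the frame-classification step concrete, here is an illustrative sketch of the kind of request body it implies, assuming the OpenAI chat-completions format with `image_url` content parts; the repo's actual prompts, schema, and model strings may differ:

```javascript
// Illustrative request builder for the "important or skippable?" Vision
// pass. No network call is made here; this only shapes the payload.
function classifyFrameRequest(frameDataUrl) {
  return {
    model: "gpt-4.1-mini",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text:
              "Is this cartoon frame important (new action, emotion, " +
              "location) or skippable (logo, credits, black screen)? " +
              'Reply as JSON: {"important": true|false, "reason": "..."}',
          },
          { type: "image_url", image_url: { url: frameDataUrl } },
        ],
      },
    ],
  };
}

// Shape of one generated page (one frame + bilingual text), as the
// "one page = one frame + text" JSON contract suggests:
const examplePage = { frame: "frame_0042.jpg", text_pl: "…", text_en: "…" };
```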
4) HTML output
- Two self-contained HTML books with page-by-page navigation, keyboard/swipe support, animated cover, and a styled “The End” screen
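Keyboard and swipe navigation both reduce to the same clamped page-flip logic, which the embedded script in each book needs. A minimal sketch (wiring names are illustrative):

```javascript
// Go delta pages forward/back, clamped to [0, totalPages - 1].
function flip(current, delta, totalPages) {
  return Math.min(totalPages - 1, Math.max(0, current + delta));
}

// In the generated HTML this would be wired up roughly like:
// document.addEventListener("keydown", (e) => {
//   if (e.key === "ArrowRight") page = flip(page, +1, pages.length);
//   if (e.key === "ArrowLeft")  page = flip(page, -1, pages.length);
//   render(page);
// });
```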
What makes it work
- Factual accuracy — the AI prompts enforce chronological fidelity and penalize hallucination; the story follows what actually happens in the cartoon
- Vision verification — every page is checked against surrounding frames to ensure visual-text alignment
- Caching — frame analyses are stored in JSON so reruns skip redundant API calls
- `--book` flag — regenerate just the book without re-downloading or re-extracting
Results
- A 12-minute cartoon produces a complete bilingual picture book in under 10 minutes for ~$0.50
- The books are something we actually use — we read them at bedtime, and my son recognizes every scene
- The read-watch-play loop became easier to sustain because the “read” part finally existed
What I took from it
- The problem wasn’t screen time in the abstract. It was that the content my son loved had no form he could interact with away from the screen.
- Converting existing stories into books — ones he already had feelings about — worked better than trying to replace them with something new.
- Minimal tech was the right call. Single file, no dependencies beyond what’s needed. Complexity was wrong here — for the tool and for the parenting approach behind it.
What I’d do next
- Add support for cartoons without subtitles (pure vision-based narration from frame analysis alone).
- Experiment with page-level audio narration (text-to-speech) for kids who can’t read yet.
- Let the child pick a character and generate a “what happens next” continuation as a new book.
- Package it as a simple web app so non-technical parents can use it.