YouTube has captions on most videos, but getting them out as clean text is surprisingly painful. There is no official “download transcript” API for casual use, third-party tools are ad-ridden, and most Python libraries that once worked have been broken by YouTube’s constant anti-scraping changes.
So we built YouTube2Text — a free tool where you paste a YouTube link and get the full transcript in seconds. No signup, no API key, no nonsense. In this post, we’ll walk through exactly how we built it with Python, FastAPI, and a Cloudflare Worker proxy.
The Architecture at a Glance
YouTube2Text has three moving parts:
- FastAPI backend — receives a YouTube URL, extracts the video ID, fetches the transcript, parses it, and returns structured JSON.
- Cloudflare Worker — a lightweight proxy that makes YouTube requests from Cloudflare’s IP range (more on why later).
- React frontend — a single-page app where users paste a URL and see the transcript with copy/download options.
The whole thing is stateless — no database, no user accounts, no session storage. A request comes in, we fetch the transcript, return it, done.
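Concretely, the whole contract is one JSON object per request. Here is a sketch of a successful response with illustrative values (the field names match the Pydantic models shown later in the post):

```python
# Illustrative shape of a successful transcript response.
response = {
    "video_id": "dQw4w9WgXcQ",
    "text": "Never gonna give you up Never gonna let you down",
    "segments": [
        {"start": 0.0, "duration": 3.5, "text": "Never gonna give you up"},
        {"start": 3.5, "duration": 3.2, "text": "Never gonna let you down"},
    ],
    "segment_count": 2,
}

# The plain text is just the segment texts joined with spaces.
assert response["text"] == " ".join(s["text"] for s in response["segments"])
```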
The Hard Part: Getting Transcripts Out of YouTube
YouTube doesn’t have a public “get transcript” endpoint. Under the hood, captions are served through an internal API called InnerTube, the same API the YouTube app and website use. Here’s the challenge: YouTube actively blocks cloud server IPs from accessing InnerTube. If your server is on AWS, GCP, or any major cloud provider, you’ll get LOGIN_REQUIRED errors even for public videos.
The popular youtube-transcript-api Python library works fine from your laptop but fails on a VPS:
from youtube_transcript_api import YouTubeTranscriptApi
ytt_api = YouTubeTranscriptApi()
transcript = ytt_api.fetch("dQw4w9WgXcQ", languages=["en"])
# Works locally ✓
# Fails on cloud servers ✗ — LOGIN_REQUIRED
We needed a way to make these requests from IPs that YouTube doesn’t block.
The Cloudflare Worker Proxy
Cloudflare Workers run on Cloudflare’s edge network. Their IPs are treated differently by YouTube — they’re not flagged as “cloud datacenter” IPs the way AWS or Hetzner are. The free tier gives you 100,000 requests per day, more than enough for a transcript tool.
Our Worker does one thing: it POSTs to YouTube’s InnerTube API and returns the caption data. Here’s the core of it:
const INNERTUBE_URL =
  "https://www.youtube.com/youtubei/v1/player?key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8";

const INNERTUBE_CLIENTS = [
  { name: "ANDROID", version: "20.10.38", clientId: "3" },
  { name: "WEB", version: "2.20240313.05.00", clientId: "1" },
  { name: "MWEB", version: "2.20240313.05.00", clientId: "2" },
  { name: "ANDROID_EMBEDDED_PLAYER", version: "20.10.38", clientId: "55" },
  { name: "TVHTML5_SIMPLY_EMBEDDED_PLAYER", version: "2.0", clientId: "85" },
  { name: "IOS", version: "19.29.1", clientId: "5" },
];

async function handleGetTranscript({ videoId, lang = "en" }) {
  for (let attempt = 0; attempt < 6; attempt++) {
    // Rotate to a different client on each retry attempt.
    const client = INNERTUBE_CLIENTS[attempt % INNERTUBE_CLIENTS.length];
    const innerResp = await fetch(INNERTUBE_URL, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-Youtube-Client-Name": client.clientId,
        "X-Youtube-Client-Version": client.version,
      },
      body: JSON.stringify({
        videoId,
        context: {
          client: { clientName: client.name, clientVersion: client.version },
        },
      }),
    });

    const data = await innerResp.json();
    const tracks =
      data.captions?.playerCaptionsTracklistRenderer?.captionTracks;
    if (tracks?.length) {
      const track = tracks.find((t) => t.languageCode === lang) || tracks[0];
      const url = track.baseUrl.replace(/[&?]fmt=[^&]*/g, "") + "&fmt=json3";
      const ttResp = await fetch(url);
      const transcript = await ttResp.text();
      if (transcript) {
        return Response.json({ videoId, transcript });
      }
    }
  }
  return Response.json({ error: "transcript_failed" }, { status: 404 });
}
A few things to notice:
Client Rotation
YouTube’s InnerTube API responds differently depending on which “client” you claim to be, and some clients are blocked more often than others. Rotating through six of them (ANDROID, WEB, MWEB, iOS, and the embedded players) across retry attempts significantly improves the overall success rate.
Security
An open proxy on Cloudflare’s network is a bad idea. Every request must include an X-Proxy-Secret header that matches a secret stored in the Worker’s environment:
const secret = request.headers.get("X-Proxy-Secret");
if (secret !== env.PROXY_SECRET) {
  return Response.json({ error: "unauthorized" }, { status: 401 });
}
Browser Headers Matter
YouTube does bot detection. Including proper Sec-Fetch-* and Sec-Ch-Ua headers makes a big difference in whether you get a full response or a degraded one:
const BROWSER_HEADERS = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
  "Sec-Ch-Ua": '"Chromium";v="131", "Not_A Brand";v="24"',
  "Sec-Ch-Ua-Mobile": "?0",
  "Sec-Ch-Ua-Platform": '"Windows"',
  "Sec-Fetch-Dest": "document",
  "Sec-Fetch-Mode": "navigate",
  "Sec-Fetch-Site": "none",
  "Sec-Fetch-User": "?1",
};
The Python Backend
The FastAPI backend is minimal. One endpoint, one service module, clean Pydantic models.
Video ID Extraction
YouTube URLs come in many flavors: youtube.com/watch?v=, youtu.be/, /embed/, /shorts/, or even a raw 11-character video ID. One function handles them all:
import re

def extract_video_id(url: str) -> str | None:
    patterns = [
        r"(?:v=|/v/|youtu\.be/)([a-zA-Z0-9_-]{11})",
        r"(?:embed/)([a-zA-Z0-9_-]{11})",
        r"(?:shorts/)([a-zA-Z0-9_-]{11})",
    ]
    for pattern in patterns:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    # Maybe it's already a video ID
    if re.match(r"^[a-zA-Z0-9_-]{11}$", url):
        return url
    return None
Dual-Strategy Fetching
Even with the Worker proxy, individual requests can fail. We use a two-strategy approach with retries:
import time

import requests

def _fetch_via_worker(video_id: str, lang: str) -> dict:
    # Strategy 1: Worker does the full flow
    # (InnerTube POST → timedtext fetch → returns transcript)
    for attempt in range(3):
        resp = requests.post(
            settings.worker_proxy_url,
            json={
                "action": "get_transcript",
                "videoId": video_id,
                "lang": lang,
            },
            headers={
                "Content-Type": "application/json",
                "X-Proxy-Secret": settings.worker_proxy_secret,
            },
            timeout=30,
        )
        if resp.status_code == 200:
            content = resp.json().get("transcript", "")
            if content:
                return _parse_transcript_content(video_id, content)
        time.sleep(0.3)

    # Strategy 2: Worker gets caption track URLs,
    # backend fetches the actual transcript directly
    for attempt in range(3):
        resp = requests.post(
            settings.worker_proxy_url,
            json={
                "action": "get_caption_tracks",
                "videoId": video_id,
                "lang": lang,
            },
            headers={
                "Content-Type": "application/json",
                "X-Proxy-Secret": settings.worker_proxy_secret,
            },
            timeout=30,
        )
        if resp.status_code == 200:
            tracks = resp.json().get("tracks", [])
            if tracks:
                track = next(
                    (t for t in tracks if t.get("languageCode") == lang),
                    tracks[0],
                )
                content = _fetch_timedtext(track["baseUrl"])
                if content:
                    return _parse_transcript_content(video_id, content)
        time.sleep(0.3)

    raise ValueError(
        "YouTube is temporarily blocking transcript requests. "
        "Please try again in a few seconds."
    )
Why two strategies? The timedtext URLs that YouTube returns are signed — they contain auth tokens that work from any IP. So even if the Worker can get the caption track list but can’t fetch the actual transcript (maybe it hit a rate limit), our backend server can fetch the timedtext URL directly. Strategy 2 splits the work: Worker handles discovery, backend handles download.
Parsing Two Transcript Formats
YouTube serves transcripts in two formats: JSON3 (newer, more common) and XML (older). Our parser tries JSON3 first and falls back to XML:
import json
import html as html_module

def _parse_json3_transcript(content: str) -> list[dict]:
    data = json.loads(content)
    segments = []
    for event in data.get("events", []):
        if "segs" not in event:
            continue
        text = "".join(
            seg.get("utf8", "") for seg in event["segs"]
        ).strip()
        if not text or text == "\n":
            continue
        start_ms = event.get("tStartMs", 0)
        duration_ms = event.get("dDurationMs", 0)
        segments.append({
            "start": round(start_ms / 1000, 2),
            "duration": round(duration_ms / 1000, 2),
            "text": html_module.unescape(text),
        })
    return segments

def _parse_xml_transcript(content: str) -> list[dict]:
    from defusedxml import ElementTree as ET

    root = ET.fromstring(content)
    segments = []
    for elem in root.iter("text"):
        text = (elem.text or "").strip()
        if not text:
            continue
        segments.append({
            "start": round(float(elem.get("start", 0)), 2),
            "duration": round(float(elem.get("dur", 0)), 2),
            "text": html_module.unescape(text),
        })
    return segments

def _parse_transcript_content(video_id: str, content: str) -> dict:
    try:
        segments = _parse_json3_transcript(content)
    except (json.JSONDecodeError, KeyError):
        segments = _parse_xml_transcript(content)
    plain_text = " ".join(s["text"] for s in segments)
    return {
        "video_id": video_id,
        "segments": segments,
        "text": plain_text,
        "segment_count": len(segments),
    }
Note the use of defusedxml instead of the stdlib xml.etree.ElementTree. The standard library’s XML parser is vulnerable to billion-laughs and external-entity attacks. Since we’re parsing content from an external source (YouTube), defusedxml is the right call.
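To make the JSON3 format concrete, here is a miniature payload run through a condensed version of `_parse_json3_transcript` (renamed `parse_json3` so the snippet stands alone):

```python
import html
import json

def parse_json3(content: str) -> list[dict]:
    # Condensed version of _parse_json3_transcript above.
    segments = []
    for event in json.loads(content).get("events", []):
        text = "".join(s.get("utf8", "") for s in event.get("segs", [])).strip()
        if not text:
            continue
        segments.append({
            "start": round(event.get("tStartMs", 0) / 1000, 2),
            "duration": round(event.get("dDurationMs", 0) / 1000, 2),
            "text": html.unescape(text),
        })
    return segments

# Two caption events plus a newline-only event, which gets filtered out.
payload = json.dumps({"events": [
    {"tStartMs": 0, "dDurationMs": 1500,
     "segs": [{"utf8": "Never gonna "}, {"utf8": "give you up"}]},
    {"tStartMs": 1500, "segs": [{"utf8": "\n"}]},
    {"tStartMs": 2400, "dDurationMs": 900,
     "segs": [{"utf8": "Never gonna let you down"}]},
]})

assert parse_json3(payload) == [
    {"start": 0.0, "duration": 1.5, "text": "Never gonna give you up"},
    {"start": 2.4, "duration": 0.9, "text": "Never gonna let you down"},
]
```

Timestamps arrive in milliseconds (`tStartMs`, `dDurationMs`) and each event's text is split across `segs`, which is why the parser joins and rescales them.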
The API Endpoint
The FastAPI route is clean — Pydantic handles validation, and errors map to sensible HTTP status codes:
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

router = APIRouter(prefix="/api", tags=["transcript"])

class TranscriptRequest(BaseModel):
    url: str
    lang: str = "en"

class TranscriptSegment(BaseModel):
    start: float
    duration: float
    text: str

class TranscriptResponse(BaseModel):
    video_id: str
    text: str
    segments: list[TranscriptSegment]
    segment_count: int

@router.post("/transcript", response_model=TranscriptResponse)
async def fetch_transcript(req: TranscriptRequest):
    video_id = extract_video_id(req.url)
    if not video_id:
        raise HTTPException(
            status_code=400,
            detail="Invalid YouTube URL or video ID.",
        )
    try:
        result = get_transcript(video_id, lang=req.lang)
    except ValueError as e:
        raise HTTPException(status_code=422, detail=str(e)) from e
    return result
Key Learnings
Building YouTube2Text taught us a few things worth sharing:
1. Cloudflare Workers Are Great for IP Bypass
If you’re building anything that scrapes or proxies requests to services that block cloud IPs, Cloudflare Workers are a fantastic middle ground. They’re free (100K requests/day), fast (edge-deployed worldwide), and their IPs aren’t in the usual “datacenter” blocklists. YouTube doesn’t block them (at least not consistently). This pattern — backend on your VPS, Worker as a fetch proxy — is reusable for many scraping projects.
2. YouTube’s InnerTube API Has Multiple “Clients”
YouTube doesn’t have one API — it has many. The ANDROID client, the WEB client, the iOS client, the embedded player client — they all hit the same InnerTube endpoint but with different clientName and clientVersion values. Some clients are more aggressively rate-limited than others. Rotating through them on retries dramatically improves reliability.
3. Signed URLs Are Your Friend
The timedtext URLs that YouTube returns for caption tracks contain embedded authentication tokens. Once you have the URL, any IP can fetch it — the auth is in the URL itself, not tied to the requester’s IP. This is why our dual-strategy approach works: if the Worker can get the caption tracks but fails on the transcript fetch, we can fetch the timedtext URL directly from our backend. The signed URL doesn’t care where the request comes from.
4. Always Use defusedxml for External XML
Python’s built-in xml.etree.ElementTree is vulnerable to XML bombs and external entity injection. When parsing XML from the internet, always use defusedxml. It’s a drop-in replacement with the same API — there’s zero reason not to.
5. Stateless Is Underrated
YouTube2Text has no database, no user sessions, no caching layer, no background jobs. A request comes in, we fetch the transcript, we return it. The entire state lives in the request/response cycle. This makes deployment trivial (just a Docker container), scaling straightforward (add more containers), and debugging simple (every request is independent). Not every app can be stateless, but when yours can, embrace it.
Try It
You can try YouTube2Text right now — paste any YouTube link and get the transcript in seconds. No signup required, completely free.
The entire stack is Python (FastAPI) + React + a Cloudflare Worker, deployed with Docker. If you’re building something similar, the patterns above — especially the Worker proxy and dual-strategy retry — should save you some headaches.