SRT, SCC & WebVTT Parsing Workflows

Closed caption parsing is the first hard gate in any broadcast or OTT pipeline: before a file can be normalized, validated, muxed, or played out, its timecodes, control codes, and text payloads have to be decoded deterministically and checked against the regulatory thresholds set out in FCC 47 CFR § 79.1, the SMPTE ST 334 family, and the W3C WebVTT specification. This guide is the reference for engineers building those decoders in Python — covering Scenarist Closed Caption (SCC), SubRip Text (SRT), and Web Video Text Tracks (WebVTT) as a single ingest-to-playout workflow, with working code, the exact frame and character-rate limits each format must satisfy, and the production failure modes that break compliance audits.

The three formats are not interchangeable. SCC is a byte-level, state-machine-driven carrier for CEA-608 data with hard 32-character lines and drop-frame timecode; SRT is human-readable, untyped, and silent about positioning or frame cadence; WebVTT is structurally rich but carries CSS and region constructs that legacy decoders cannot render. Treating all three as “subtitle text” is the single most common architectural mistake in this domain. Each demands its own decoder, but all three converge on the same downstream contract: frame-quantized timestamps, enforced reading-rate limits, and a structured violation log that maps every defect back to a specific clause.

Regulatory & Engineering Context

The constraints on a caption parser are not stylistic preferences — they are statutory. In the United States, FCC 47 CFR § 79.1 requires that captions be accurate, synchronous, complete, and properly placed. The Commission’s caption quality rules define synchronicity as captions that “coincide with the corresponding spoken words and sounds to the greatest extent possible,” which production teams operationalize as a sync tolerance of roughly ±2 frames for pre-recorded content. Missing that window — by dropping a control code, mis-converting a drop-frame timecode, or snapping a cue to the wrong frame boundary — is a reportable compliance defect, not a cosmetic glitch. The detailed audit procedure for these rules lives in the FCC Part 79 compliance checklist.

At the transport layer, SMPTE ST 334-1 defines how CEA-608 and CEA-708 caption data is carried as vertical ancillary (VANC) data in the SDI domain, and SMPTE ST 12-1 defines the drop-frame timecode model that SCC files inherit. CEA-608 itself addresses a grid of 15 rows by 32 columns, of which at most four rows may be on screen at once; a parser must enforce both limits when it reconstructs roll-up and pop-on memory. On the web side, the W3C WebVTT specification governs cue syntax, region geometry, and the HH:MM:SS.mmm timestamp grammar, while IMSC1/TTML governs the XML-based exchange formats used in many OTT packagers. International regimes add parallel obligations: the Ofcom Code on Subtitling Standards and the CRTC’s Canadian requirements impose their own reading-speed and timing thresholds that a multi-territory pipeline must satisfy simultaneously.

The engineering consequence is that parsing is a stateful, timecode-bound metadata operation, not string manipulation. A compliant decoder reconstructs the presentation state of the caption stream frame by frame, validates each transition against the governing clause, and produces output that downstream stages — packaging, muxing, playout — can trust without re-checking. Everything below is built on that premise.

Pipeline Stage Map

Parsing does not exist in isolation; it is one stage in an end-to-end chain that moves a raw caption asset from ingest to playout. Understanding where each operation sits clarifies what a parser is responsible for and what it must hand off.

The ingest stage detects encoding and strips byte-order marks before a single cue is read — getting this wrong corrupts every downstream step. Parsing converts each format into a common internal cue model (start, end, text, position, style). Normalization snaps timestamps to valid frame boundaries, resolves overlapping cues, and throttles dense dialogue to the reading-rate limit. Validation runs the regulatory rule set and emits a structured violation log. Only then does the asset move to packaging and playout. High-volume operations run these stages concurrently across thousands of assets, which is why async batch caption processing is treated as a first-class concern rather than an afterthought.
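As a rough illustration of the common internal cue model mentioned above, the sketch below uses a frozen dataclass. The class name `Cue`, its field names, and the `(row, column)` convention for `position` are assumptions made for this example, not a schema the pipeline prescribes.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)
class Cue:
    """Format-neutral cue; every name here is illustrative, not a fixed schema."""
    start_ms: int                                # presentation start, milliseconds
    end_ms: int                                  # presentation end, milliseconds
    text: str                                    # decoded caption text
    position: Optional[Tuple[int, int]] = None   # (row, column) if the source has one
    style: dict = field(default_factory=dict)    # e.g. {"italic": True}

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


cue = Cue(start_ms=1000, end_ms=3000, text="Hello")
print(cue.duration_ms)
```

Keeping the model immutable means normalization has to build a new cue for every adjustment, which makes each change easy to log.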

Format & Standard Overview

The table below summarizes the structural and regulatory properties that drive parser design for each format. These are the values your code enforces; keep them in one place rather than scattering magic numbers through the implementation.

Property	SCC (Scenarist)	SRT (SubRip)	WebVTT
Underlying standard	CEA-608 / SMPTE ST 334-1	de facto (no formal spec)	W3C WebVTT
Encoding	ASCII hex byte-pairs	UTF-8 (often Windows-1252)	UTF-8 (BOM permitted)
Timecode form	`HH:MM:SS:FF` / `HH:MM:SS;FF` (drop-frame)	`HH:MM:SS,mmm`	`HH:MM:SS.mmm`
Frame-bound?	Yes — 29.97 DF native	No — arbitrary ms	No — arbitrary ms
Positioning	Preamble address codes (15 rows × 32 cols)	None	`line`/`position`/`align`, regions
Styling	Mid-row codes (limited)	Inline tags (non-standard)	CSS + `STYLE` blocks
Max chars/line	32 (hard, CEA-608)	none enforced	none enforced
Display modes	pop-on, roll-up, paint-on	static only	static + region scroll
Primary use	linear broadcast, cable, SDI	OTT/VOD ingest	adaptive bitrate web

Reference thresholds

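Every limit a parser enforces, with its governing source. Where the source is an FCC rule, the figure is the operational target teams use to satisfy a requirement that the rule itself states qualitatively: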

Threshold	Value	Source
Sync tolerance (pre-recorded)	±2 frames	FCC 47 CFR § 79.1 (synchronicity)
Frame duration @ 23.976p	41.7083 ms	SMPTE ST 12-1
Frame duration @ 25p (PAL)	40.000 ms	EBU / SMPTE ST 12-1
Frame duration @ 29.97 DF	33.3667 ms	SMPTE ST 12-1 (drop-frame)
Min display duration	1.0 s (pre-recorded)	FCC caption quality rules
Reading rate (US)	200–250 WPM	FCC accessibility guidance
Reading rate (Ofcom)	~160–180 WPM	Ofcom Code on Subtitling
Max chars/line (CEA-608)	32	CEA-608 / SMPTE ST 334-1
Max rows on screen	4	CEA-608
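One way to keep these values in a single place is a small constants module, sketched below. The names (`FRAME_RATES`, `wpm_to_cps`, and so on) are hypothetical, and the five-characters-per-word factor used to turn WPM into characters per second is an assumption rather than a regulatory figure.

```python
from fractions import Fraction

# Frame rates as exact rationals, so derived frame durations carry no rounding error.
FRAME_RATES = {
    "23.976p": Fraction(24000, 1001),
    "25p": Fraction(25, 1),
    "29.97df": Fraction(30000, 1001),
}

SYNC_TOLERANCE_FRAMES = 2          # pre-recorded sync target from the table above
MIN_DISPLAY_MS = 1000              # minimum display duration
MAX_CHARS_PER_LINE_608 = 32        # CEA-608 line width
MAX_ROWS_608 = 4                   # rows on screen at once
MAX_WPM = {"us": 250, "ofcom": 180}
AVG_CHARS_PER_WORD = 5             # assumption used only for the WPM -> CPS estimate


def frame_duration_ms(rate_key: str) -> float:
    """Duration of one frame in milliseconds for a named cadence."""
    return float(1000 / FRAME_RATES[rate_key])


def wpm_to_cps(wpm: float, chars_per_word: float = AVG_CHARS_PER_WORD) -> float:
    """Approximate characters-per-second ceiling for a words-per-minute limit."""
    return wpm * chars_per_word / 60


print(round(frame_duration_ms("23.976p"), 4))   # 41.7083
print(round(wpm_to_cps(200), 2))                # 16.67
```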

The deep-dive sections below each take one stage of the parse-and-normalize work, lead with the minimal working implementation, and then explain the architectural reasoning.

Decoding SCC: a stateful CEA-608 byte machine

The engineering problem with SCC is that the file does not contain captions; it contains a serialized stream of CEA-608 control events. A line such as 01:00:02:00\t9420 9420 94ae 94ae 942c 942c encodes three commands as hexadecimal byte-pairs with odd parity, each sent twice for redundancy: Resume Caption Loading (9420), Erase Non-Displayed Memory (94ae), and Erase Displayed Memory (942c). End Of Caption, the command that flips the buffers, is 942f. You cannot read the text without replaying the state machine that builds and flips the display buffer. The decoder must track the active display mode, mask the parity bit before interpreting any byte as a control code or as ASCII, and convert the timecode into a millisecond presentation time within the ±2-frame tolerance.

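import struct

# CEA-608 control codes, channel 1, after parity masking (word & 0x7F7F).
# With odd parity applied they appear in SCC files as 9420, 942f, 942c, 94ad.
RCL = 0x1420   # Resume Caption Loading (pop-on)
EOC = 0x142F   # End Of Caption (flip buffer)  -- note: 942f in the file, not 942c
EDM = 0x142C   # Erase Displayed Memory
CR  = 0x142D   # Carriage Return (roll-up)
FPS = 30000 / 1001  # 29.97 NTSC  -- SMPTE ST 12-1

def smpte_to_ms(tc: str) -> float:
    """'HH:MM:SS;FF' (drop-frame) or 'HH:MM:SS:FF' (non-drop) -> milliseconds.
    FCC 47 CFR §79.1 synchronicity requires this be exact to within ±2 frames."""
    drop_frame = ";" in tc
    hh, mm, ss, ff = (int(x) for x in tc.replace(";", ":").split(":"))
    frame_number = (hh * 3600 + mm * 60 + ss) * 30 + ff
    if drop_frame:
        total_minutes = 60 * hh + mm
        # SMPTE ST 12-1: drop 2 frame numbers every minute except every 10th minute
        frame_number -= 2 * (total_minutes - total_minutes // 10)
    return (frame_number / FPS) * 1000.0

def decode_608_pair(word: int) -> str:
    """Mask parity (bit 7) and return the two ASCII characters of a text pair."""
    hi = (word >> 8) & 0x7F   # strip odd-parity bit -- CEA-608 §
    lo = word & 0x7F
    return "".join(chr(c) for c in (hi, lo) if 0x20 <= c < 0x7F)

def parse_scc(path: str):
    """Simplified replay: pop-on text loads off-screen after RCL and goes on air
    at EOC; roll-up text (after CR) is shown as it arrives; EDM clears the screen."""
    cues, loading, shown, shown_ms, mode, prev = [], "", "", None, RCL, None

    def close(end_ms):                        # end whatever is on screen
        nonlocal shown, shown_ms
        if shown.strip() and shown_ms is not None:
            cues.append((shown_ms, end_ms, shown.strip()))
        shown, shown_ms = "", None

    with open(path, "r", encoding="ascii") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("Scenarist") or line.startswith(";"):
                continue
            tc, _, payload = line.replace("\t", " ").partition(" ")
            now = smpte_to_ms(tc)
            for hexword in payload.split():
                word = struct.unpack(">H", bytes.fromhex(hexword))[0]
                masked = word & 0x7F7F              # strip parity before matching
                is_control = 0x10 <= (masked >> 8) <= 0x1F
                if is_control and masked == prev:   # control codes are sent twice
                    prev = None
                    continue
                prev = masked if is_control else None
                if masked == RCL:                   # pop-on: load off-screen
                    mode = RCL
                elif masked == CR:                  # roll-up: close row, start next
                    close(now)
                    mode, shown_ms = CR, now
                elif masked == EOC:                 # flip: loaded text goes on air
                    close(now)
                    shown, shown_ms, loading = loading, now, ""
                elif masked == EDM:                 # erase displayed memory
                    close(now)
                elif is_control:                    # other control code -- ignore here
                    continue
                elif mode == CR:                    # roll-up text paints directly
                    shown_ms = now if shown_ms is None else shown_ms
                    shown += decode_608_pair(word)
                else:
                    loading += decode_608_pair(word)   # printable text pair
    return cues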

The architectural point is that the parser keys off control codes (RCL, EOC, EDM, CR), not whitespace or line breaks, and it masks parity (& 0x7F7F) before every comparison because SCC bytes carry odd parity that will otherwise corrupt both control-code matching and text extraction. The drop-frame conversion is the part that most often introduces sync defects: skipping the dropped correction produces a drift of two frames in nine minutes out of every ten (108 frames, about 3.6 seconds, per hour), which exceeds the ±2-frame tolerance by the second minute of program. Roll-up versus pop-on handling, mid-row formatting, and the full preamble-address grid are covered in depth in Parsing SCC with Python libraries, and the encoding pitfalls specific to legacy SCC exports are handled in fixing UTF-8 encoding errors in SCC files.
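The parity handling described above can be illustrated with a small standalone check. This is a minimal sketch: the helper names `has_odd_parity` and `check_word` are hypothetical, and the only assumption is that each byte of an SCC hex word should carry odd parity in bit 7.

```python
def has_odd_parity(byte: int) -> bool:
    """True when the 8-bit value contains an odd number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2 == 1


def check_word(hexword: str):
    """Return (parity-masked word, parity_ok) for one SCC hex word such as '94ae'."""
    word = int(hexword, 16)
    hi, lo = word >> 8, word & 0xFF
    return word & 0x7F7F, has_odd_parity(hi) and has_odd_parity(lo)


for hexword in ("9420", "94ae", "942e"):
    masked, ok = check_word(hexword)
    print(hexword, hex(masked), "ok" if ok else "parity error")
```

A word that fails the check is better logged and discarded than matched, since a single flipped bit can turn one control code into another.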

Normalizing SRT: frame-quantized timestamps & reading-rate limits

SRT is trivial to read and dangerous to trust. The format imposes no frame cadence, so a file produced by a transcription tool will contain timestamps like 00:00:04,233 that fall between valid frame boundaries for the target cadence. Feeding those un-quantized timestamps into an HLS or DASH packager produces cue drift and decoder desynchronization. The normalization problem is therefore: snap every boundary to the nearest valid frame, guarantee a non-negative duration that meets the minimum display window, and throttle any cue whose characters-per-second exceeds the reading-rate limit.

import math
import pysrt
import numpy as np

FPS = 24000 / 1001            # 23.976p  -- SMPTE ST 12-1
FRAME_MS = 1000.0 / FPS       # 41.7083 ms
MIN_DISPLAY_MS = 1000         # FCC caption quality rules -- 1.0 s minimum
MAX_CPS = 17                  # ~200 WPM reading-rate ceiling, FCC guidance

def snap_to_frame(ms: float) -> int:
    """Quantize a millisecond value to the nearest whole frame boundary.
    Integer frame index avoids float drift across long-form assets."""
    frame = int(np.rint(ms / FRAME_MS))      # nearest frame -- ±2 frame tolerance
    return int(round(frame * FRAME_MS))

def snap_up_to_frame(ms: float) -> int:
    """Quantize upward so an extended cue never falls short of its floor."""
    return int(round(math.ceil(ms / FRAME_MS) * FRAME_MS))

def normalize_srt(path: str) -> pysrt.SubRipFile:
    subs = pysrt.open(path, encoding="utf-8")
    for cue in subs:
        start = snap_to_frame(cue.start.ordinal)   # .ordinal = ms since 00:00
        end = snap_to_frame(cue.end.ordinal)
        chars = len(cue.text_without_tags.replace("\n", " "))
        # Shortest acceptable window: the display floor or the reading-rate
        # floor, whichever is longer. Extending the end keeps the text intact.
        floor_ms = max(MIN_DISPLAY_MS, chars / MAX_CPS * 1000)
        if end - start < floor_ms:
            end = snap_up_to_frame(start + floor_ms)   # stay on a frame boundary
        cue.start.ordinal, cue.end.ordinal = start, end
    return subs

The reason for integer-frame arithmetic — computing a frame index with np.rint and only then converting back to milliseconds — is that floating-point millisecond math accumulates error across a feature-length asset and eventually pushes late cues outside the ±2-frame sync window. The minimum-display and CPS clamps extend the cue’s end rather than truncating its text, preserving completeness as required by § 79.1. Overlap resolution between adjacent cues, drop-frame versus non-drop targets, and variable-frame-rate sources are detailed in SRT timestamp normalization; the lossless cross-format case is covered in converting SCC to SRT without timing loss.
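The integer-frame reasoning above can also be checked with exact rational arithmetic from the standard library. This is a sketch under stated assumptions: `ms_to_frame` and `frame_to_ms` are hypothetical helper names, and the target cadence is taken to be 23.976p as in the listing above.

```python
from fractions import Fraction

FPS = Fraction(24000, 1001)   # 23.976p as an exact rational


def ms_to_frame(ms: int, fps: Fraction = FPS) -> int:
    """Nearest whole frame index for a millisecond timestamp."""
    return round(Fraction(ms, 1000) * fps)


def frame_to_ms(frame: int, fps: Fraction = FPS) -> int:
    """Millisecond position of a frame boundary, rounded to the nearest ms."""
    return round(Fraction(frame) / fps * 1000)


frame = ms_to_frame(4233)          # the 00:00:04,233 example from the text
print(frame, frame_to_ms(frame))   # 101 4213
```

Because the frame index is an integer and the rate is a rational, converting a frame back to milliseconds and re-quantizing it lands on the same frame, however late in the asset the cue sits.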

Extracting & validating WebVTT cues

WebVTT is the richest of the three formats and therefore the easiest to ship in a non-compliant state. A .vtt file can carry an optional header, STYLE blocks containing arbitrary CSS, REGION definitions, and per-cue settings — most of which legacy hardware decoders ignore or choke on. The extraction problem is to separate the header and style blocks from cue payloads, validate every timestamp against the W3C HH:MM:SS.mmm grammar, strip unsupported styling for broadcast targets, and flag overlaps and out-of-bounds positioning before the cues are mapped onto a frame-accurate timeline.

import webvtt   # webvtt-py

LINE_MAX = 32   # CEA-608 safe-area cap -- SMPTE ST 334-1 when destined for 608 output

def ts_to_ms(ts: str) -> int:
    """Parse W3C WebVTT 'HH:MM:SS.mmm' (hours optional) to milliseconds."""
    parts = ts.split(":")
    if len(parts) == 2:               # MM:SS.mmm  -- W3C WebVTT §timestamp
        parts = ["0"] + parts
    h, m, rest = parts
    s, ms = rest.split(".")
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def validate_vtt(path: str):
    violations, prev_end = [], 0
    for cue in webvtt.read(path):
        start, end = ts_to_ms(cue.start), ts_to_ms(cue.end)
        if end <= start:                                  # zero/negative duration
            violations.append((cue.start, "non_positive_duration"))
        if start < prev_end:                              # overlapping cue windows
            violations.append((cue.start, "cue_overlap"))
        for ln in cue.text.splitlines():
            if len(ln) > LINE_MAX:                        # exceeds 608 safe area
                violations.append((cue.start, f"line_len={len(ln)}"))
        prev_end = max(prev_end, end)
    return violations

The validator treats the W3C grammar as an executable schema: a timestamp that does not parse, a non-positive duration, or a line that exceeds the 32-column CEA-608 safe area is a defect with a clause behind it, not a warning to be ignored. Stripping CSS and resolving percentage-based positioning for broadcast output, plus mapping cues onto SDI timecode, are covered in WebVTT cue extraction & validation and mapping WebVTT cues to broadcast timelines.
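For the broadcast-bound case, a minimal standard-library sketch of stripping styling is shown below. It assumes blocks are separated by blank lines and that a STYLE or REGION block begins with that keyword alone on its first line. The function name `strip_vtt_styling` is hypothetical, and inline cue tags are left untouched.

```python
import re


def strip_vtt_styling(vtt_text: str) -> str:
    """Drop STYLE/REGION blocks and per-cue settings; keep the header and cue text."""
    text = vtt_text.replace("\r\n", "\n").strip()
    blocks = re.split(r"\n(?:[ \t]*\n)+", text)      # blocks are blank-line separated
    kept = []
    for block in blocks:
        first_line = block.split("\n", 1)[0].strip()
        if first_line in ("STYLE", "REGION"):
            continue                                  # styling a 608 decoder cannot render
        # Remove cue settings (line:, position:, region:, ...) after the end timestamp.
        block = re.sub(r"^(\S+[ \t]+-->[ \t]+\S+)[ \t]+.*$", r"\1", block, flags=re.M)
        kept.append(block)
    return "\n\n".join(kept) + "\n"


sample = (
    "WEBVTT\n\n"
    "STYLE\n::cue { color: yellow }\n\n"
    "REGION\nid:r1\nwidth:40%\n\n"
    "1\n00:00:01.000 --> 00:00:03.000 region:r1 line:0\nHello\n"
)
print(strip_vtt_styling(sample))
```

Mapping the removed positioning onto the CEA-608 row and column grid is a separate step and is not attempted here.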

Scaling the workflow: encoding detection & concurrent batches

A single-file parser is correct but not operational. Production pipelines ingest thousands of assets per day from heterogeneous sources — legacy FTP drops, OTT vendor portals, archive restores — and those files arrive with mismatched encodings (UTF-8, Windows-1252, UTF-16 with a BOM) and truncated payloads. The scaling problem has two halves: detect and normalize encoding deterministically before parsing, and run the parsers concurrently without saturating CPU or disk. The first half is solved with charset_normalizer; the second with asyncio and a bounded worker pool.

import asyncio
from concurrent.futures import ProcessPoolExecutor
from charset_normalizer import from_bytes

def read_normalized(path: str) -> str:
    """Heuristically detect encoding and strip BOM before any parse runs."""
    with open(path, "rb") as fh:
        raw = fh.read()
    best = from_bytes(raw).best()                 # charset_normalizer detection
    text = str(best) if best else raw.decode("utf-8", "replace")
    return text.lstrip("\ufeff")                  # strip BOM -- avoids cue-1 corruption

async def run_batch(paths, parse_fn, max_workers=8):
    """Parse a batch concurrently with a bounded process pool (backpressure)."""
    loop = asyncio.get_running_loop()
    sem = asyncio.Semaphore(max_workers)          # cap concurrency -> bounded memory
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        async def worker(p):
            async with sem:
                results[p] = await loop.run_in_executor(pool, parse_fn, p)
        await asyncio.gather(*(worker(p) for p in paths))
    return results

The semaphore is the load-bearing element: it bounds in-flight work so memory stays flat even when the input set is large, which is what prevents the OOM kills that plague naive batch runners. Detecting encoding before parsing — rather than catching UnicodeDecodeError mid-stream — keeps the parser deterministic and lets the BOM strip happen exactly once. Worker-pool sizing, backpressure-aware queues, and quarantine routing for failed files are developed fully in async batch caption processing.
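Quarantine routing can be sketched in the same shape. The example below is an illustration under stated assumptions, not the batch runner itself: it uses `asyncio.to_thread` (Python 3.9+) instead of a process pool so it stays self-contained, and the names `run_batch_quarantine` and `fake_parse` are hypothetical.

```python
import asyncio


async def run_batch_quarantine(paths, parse_fn, max_in_flight=8):
    """Parse concurrently; route failures to a quarantine map instead of aborting."""
    sem = asyncio.Semaphore(max_in_flight)        # bound the number of in-flight parses
    parsed, quarantined = {}, {}

    async def worker(path):
        async with sem:
            try:
                parsed[path] = await asyncio.to_thread(parse_fn, path)
            except Exception as exc:              # one bad file must not sink the batch
                quarantined[path] = repr(exc)

    await asyncio.gather(*(worker(p) for p in paths))
    return parsed, quarantined


def fake_parse(path):
    """Stand-in parser for the demo: fails on '.bad' files."""
    if path.endswith(".bad"):
        raise ValueError("truncated payload")
    return [path]


parsed, quarantined = asyncio.run(
    run_batch_quarantine(["a.scc", "b.bad"], fake_parse, max_in_flight=2)
)
print(sorted(parsed), sorted(quarantined))
```

Threads suit I/O-bound reads. For CPU-bound decoding, a process pool as in the listing above is the more likely choice.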

Failure modes & gotchas

The defects below account for the large majority of caption parsing failures that reach production. Each has a cheap detection test and a known remediation.

BOM and encoding mismatch. A UTF-8 BOM or a Windows-1252 smart quote misread as UTF-8 corrupts the first cue or throws mid-parse. Detect: sniff the first bytes and run charset_normalizer before decoding. Fix: normalize to UTF-8 and lstrip("\ufeff") once, at ingest.
Drop-frame timecode mis-conversion. Treating a ;-delimited SCC timecode as non-drop omits the two-frames-per-minute correction, producing drift that exceeds the ±2-frame FCC tolerance after a few minutes. Detect: compare computed end-of-program time against asset duration. Fix: apply the SMPTE ST 12-1 drop formula (drop 2 frames every minute except every 10th).
Orphaned or parity-corrupted control codes. An SCC stream with a Resume Caption Loading code but no matching End Of Caption leaves the display buffer un-flushed, causing phantom captions or decoder lockups. Detect: assert balanced RCL/EOC pairs per state segment. Fix: mask parity before matching and discard unpaired control words with a logged warning.
Un-quantized SRT timestamps. Arbitrary millisecond boundaries that do not align to the target cadence cause cue drift once packaged into HLS/DASH. Detect: check each boundary modulo the frame duration. Fix: snap to the nearest integer frame as in the normalization step.
Cue overlap and sub-minimum duration. Overlapping windows stack captions; sub-second durations flicker below readability and breach the minimum-display rule. Detect: compare each cue’s start against the previous end and its duration against the 1.0 s floor. Fix: clamp the end forward and resolve overlaps deterministically.
CSS/region payloads in WebVTT bound for broadcast. STYLE and REGION constructs and percentage positioning cannot be rendered by legacy 608/708 decoders. Detect: scan for STYLE/REGION blocks and non-integer positions. Fix: strip styling and map to the CEA-608 row/column grid.
Reading rate above the accessibility ceiling. Dense dialogue exceeding ~200 WPM (or the stricter Ofcom limit) fails accessibility review even when timing is technically valid. Detect: compute characters-per-second per cue. Fix: extend the display window or, where editorially permitted, condense — see enforcing character-rate limits in QC.
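Three of the detection tests listed above are cheap enough to sketch with the standard library alone. The function names are hypothetical, cues are assumed to be `(start_ms, end_ms)` tuples sorted by start time, and the 1.0 s floor is the value from the thresholds table.

```python
import codecs

# BOM prefixes; UTF-32 comes first so its LE form is not mistaken for UTF-16 LE.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(raw: bytes):
    """Return a BOM-consuming codec name if the bytes start with a BOM, else None."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return codec
    return None


def df_drift_ms(hh: int, mm: int) -> float:
    """Error, in ms, from reading a drop-frame timecode at HH:MM as non-drop."""
    total_minutes = 60 * hh + mm
    skipped = 2 * (total_minutes - total_minutes // 10)   # SMPTE ST 12-1 drop rule
    return skipped * 1001 / 30                            # one 29.97 frame = 1001/30 ms


def find_timing_defects(cues, min_ms=1000):
    """Flag overlapping cues and cues shorter than the minimum display window."""
    defects, prev_end = [], 0
    for i, (start, end) in enumerate(cues):
        if start < prev_end:
            defects.append((i, "cue_overlap"))
        if end - start < min_ms:
            defects.append((i, "below_min_duration"))
        prev_end = max(prev_end, end)
    return defects


print(sniff_bom(b"\xef\xbb\xbfWEBVTT"))
print(round(df_drift_ms(1, 0), 1))
print(find_timing_defects([(0, 2000), (1500, 1800)]))
```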

Compliance telemetry & audit trail

Parsing and validation only deliver value if every defect is captured as machine-readable evidence that maps to the clause it breaches. Regulatory audits require immutable proof of validation, so the parser should emit one structured record per violation — keyed by asset, cue timecode, rule, and severity — to JSON for streaming consumers and Parquet for columnar analytics. The schema below is the contract the downstream scheduled QC report generation and automated sync drift detection stages depend on.

import json, hashlib, pyarrow as pa, pyarrow.parquet as pq

def emit_telemetry(asset_id: str, violations: list[dict], out_prefix: str):
    """Write violation records to JSON + Parquet with a content hash for WORM audit."""
    records = [
        {
            "asset_id": asset_id,
            "cue_tc": v["tc"],
            "rule": v["rule"],            # e.g. 'fcc_79_1_sync', 'cea608_line_len'
            "clause": v["clause"],        # e.g. 'FCC 47 CFR §79.1'
            "severity": v["severity"],    # 'block' | 'warn'
            "measured": v["measured"],
            "threshold": v["threshold"],
        }
        for v in violations
    ]
    blob = json.dumps(records, sort_keys=True).encode()
    digest = hashlib.sha256(blob).hexdigest()      # chain-of-custody hash
    with open(f"{out_prefix}.json", "wb") as fh:
        fh.write(blob)
    table = pa.Table.from_pylist(records)
    pq.write_table(table, f"{out_prefix}.parquet")  # columnar store for dashboards
    return digest

Hashing the serialized records gives each report a verifiable fingerprint that can be stored in a write-once repository, producing the unbroken chain of custody an FCC or Ofcom inquiry expects. Because every record names both the measured value and the threshold it failed, remediation tooling can prioritize blocking defects over warnings and route assets automatically. Wiring this telemetry into a release gate so non-compliant builds cannot ship is covered in CI/CD gating for caption builds.
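On the consuming side, an auditor can recompute the fingerprint from the stored JSON and compare it with the digest recorded at emit time. The sketch below assumes the report file holds exactly the bytes that were hashed; `verify_report` is a hypothetical helper name.

```python
import hashlib
import hmac


def verify_report(json_path: str, expected_digest: str) -> bool:
    """Recompute the SHA-256 of a stored report and compare it with the recorded digest."""
    with open(json_path, "rb") as fh:
        blob = fh.read()
    actual = hashlib.sha256(blob).hexdigest()
    return hmac.compare_digest(actual, expected_digest)   # constant-time comparison
```

Any byte-level change to the file, including re-serializing the JSON with different whitespace, changes the digest, so reports should be stored exactly as written.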

SCC parser state machine — Parsing SCC with Python libraries
Frame-quantized SRT timing — SRT timestamp normalization
W3C cue validation — WebVTT cue extraction & validation
Concurrent batch ingest — Async batch caption processing
Format selection tradeoffs — SCC vs SRT vs WebVTT architecture
Regulatory audit procedure — FCC Part 79 compliance checklist
Validation & reporting layer — Automated QC validation & reporting

Part of: Broadcast Media Closed Captioning & QC Automation — the parse and normalize stages of the end-to-end caption pipeline.
