Why not just use len(text) divided by duration?

len() counts Unicode code points, so it over-counts combining marks, ZWJ emoji, and multi-code-point graphemes in languages like Hindi or Arabic. A CEA-608 decoder paints one cell per grapheme cluster, so the regex \X token gives the count that matches decoder buffer occupancy.

How do overlapping cues change the character rate?

When two or more rows are on screen at once, their glyph load lands on a single CEA-608 channel, so their introduction rates must be summed. A per-cue average misses this; a sweep-line over all cue boundaries reports the combined rate that can overrun the decoder.

What is the difference between the burst and sustained limits?

The burst limit (30 cps) is the short-window envelope a decoder can absorb for a frame or two; the sustained limit (20 cps, or 17 cps under EBU-TT-D) is the rolling 1-second average a viewer can actually read. Enforcing both separately avoids failing legitimate momentary spikes while still catching dense walls of text.

How to enforce FCC character rate limits programmatically

The naive character-rate check — len(cue.text) / cue.duration per cue — passes files that a CEA-608 decoder cannot actually paint. It fails in exactly two cases that show up constantly in real broadcast assets: overlapping cues, where two or more rows are on screen at once and their glyph load sums onto a single decoder channel, and multi-language payloads, where len() over-counts combining marks and multi-code-point graphemes so a compliant Hindi or Arabic track is flagged while a genuinely dense one slips through. The constraint being enforced is FCC 47 CFR § 79.1 (captions must be complete and readable), bounded physically by the CEA-608 channel — two caption bytes per frame, ≈ 29.97 characters per second of airtime at 29.97 fps. The operational gate this page implements is 20 CPS sustained over any rolling 1-second window, with short bursts tolerated to 30 CPS. Getting there requires counting painted graphemes, not code points, and measuring rate against a merged timeline, not per-cue averages.

Minimal working implementation

import regex   # PyPI 'regex' 2023+: \X matches one full grapheme cluster
import pysrt   # pysrt 1.1.2+: SRT parsing, integer-millisecond ordinals
from dataclasses import dataclass

# FCC 47 CFR § 79.1 — captions must be complete and readable.
# CEA-608 physical ceiling at 29.97 fps is ~29.97 cps (2 bytes/frame).
SUSTAINED_CPS = 20.0    # operational sustained gate over a rolling window
BURST_CPS     = 30.0    # short-burst decoder-survival envelope (CEA-608)
WINDOW_MS     = 1000    # 1.0 s rolling window for the sustained check


@dataclass(frozen=True)
class Cue:
    start_ms: int
    end_ms: int
    glyphs: int          # visible grapheme count, not len(text)


def visible_glyphs(text: str) -> int:
    # Count painted grapheme clusters only: combining marks fold into their
    # base glyph (one decoder cell) and whitespace/control codes paint nothing,
    # mirroring how a CEA-608 decoder accounts for buffer occupancy.
    return sum(1 for g in regex.findall(r"\X", text) if not g.isspace())


def load_cues(path: str) -> list[Cue]:
    # utf-8-sig strips a leading BOM so the first cue is never corrupted.
    return [
        Cue(s.start.ordinal, s.end.ordinal, visible_glyphs(s.text))
        for s in pysrt.open(path, encoding="utf-8-sig")
        if s.end.ordinal > s.start.ordinal
    ]


def density_series(cues: list[Cue]) -> list[tuple[int, int, float]]:
    # Sweep-line over every cue edge so overlapping cues sum their introduction
    # rate onto one timeline (CEA-608 paints concurrent rows simultaneously).
    edges = sorted({e for c in cues for e in (c.start_ms, c.end_ms)})
    series = []
    for a, b in zip(edges, edges[1:]):
        rate = sum(
            c.glyphs / ((c.end_ms - c.start_ms) / 1000.0)
            for c in cues if c.start_ms < b and c.end_ms > a
        )
        series.append((a, b, rate))           # (seg_start, seg_end, cps)
    return series


def find_violations(cues: list[Cue]) -> list[dict]:
    series = density_series(cues)
    viols = []
    for a, b, rate in series:
        if rate > BURST_CPS:                   # instantaneous burst breach
            viols.append({"at_ms": a, "cps": round(rate, 1), "kind": "burst"})
        # Sustained: glyphs introduced in the 1.0 s window that opens at this
        # segment's start; with WINDOW_MS == 1000 that count equals cps.
        w0, w1 = a, a + WINDOW_MS
        sustained = sum(
            r * (min(e, w1) - max(s, w0)) / 1000.0
            for s, e, r in series if s < w1 and e > w0
        )
        if sustained > SUSTAINED_CPS:
            viols.append({"at_ms": a, "cps": round(sustained, 1),
                          "kind": "sustained"})
    return viols


if __name__ == "__main__":
    import sys
    bad = find_violations(load_cues(sys.argv[1]))
    for v in bad:
        print(f"{v['at_ms']:>9} ms  {v['cps']:>5} cps  {v['kind']}")
    sys.exit(1 if bad else 0)   # non-zero exit fails the build gate

Code walkthrough

Grapheme counting (visible_glyphs). The single most common false result in rate validation comes from treating len(text) as the glyph count. In Devanagari, Tamil, or Arabic, one painted character can span several Unicode code points: a base consonant plus combining vowel signs and viramas. The regex module’s \X token matches a complete extended grapheme cluster (Unicode UAX #29), so a base-plus-combining sequence counts as the one cell a decoder actually paints. Whitespace is excluded because a CEA-608 decoder’s buffer pressure tracks painted glyphs, not spaces; spaces are counted separately only for characters-per-line checks. The input is the same canonical, BOM-safe cue model that SRT timestamp normalization produces upstream, and load_cues opens the file with utf-8-sig to keep it that way: a stray BOM would otherwise inflate the first cue’s glyph count by one and skew its CPS.
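The two counting pitfalls in that paragraph can be seen in isolation with a small stdlib-only sketch. It deliberately does not use \X: approx_cells is a hypothetical helper that approximates cluster counting by skipping Unicode mark characters, which is enough for a simple base-plus-accent case but is no substitute for full UAX #29 segmentation.

```python
import unicodedata


def approx_cells(text: str) -> int:
    # Hypothetical helper for illustration only: count code points that are
    # neither whitespace nor Unicode marks (general category M*). The page's
    # visible_glyphs uses regex's \X instead, which covers many more cases.
    return sum(
        1 for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("M")
    )


accented = "cafe\u0301 au lait"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
raw = b"\xef\xbb\xbfHello"        # UTF-8 bytes that begin with a BOM

# len() sees 13 code points; only 10 of them are painted cells.
print(len(accented), approx_cells(accented))

# Plain utf-8 keeps the BOM as an extra code point; utf-8-sig strips it.
print(len(raw.decode("utf-8")), len(raw.decode("utf-8-sig")))
```

The second print shows why the decoding choice matters: the BOM is not whitespace, so a whitespace filter alone would count it as a painted cell.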

The sweep-line density series (density_series). This is the part a per-cue check cannot do. Each cue is modeled as introducing its glyphs linearly across its own span, giving it an introduction rate of glyphs / duration_seconds. The function collects every cue boundary into a sorted set of edges, then walks adjacent edge pairs to produce timeline segments where the active-cue set is constant. Within each segment it sums the rates of all overlapping cues — exactly mirroring CEA-608/708 behaviour, where two simultaneously displayed rows place their combined glyph load on one channel. A pop-on cue that overlaps the tail of a roll-up cue therefore produces a segment whose rate is the sum of both, which is the value that actually overruns the decoder buffer and the value § 79.1 readability hinges on.
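A worked example may make the summing concrete. The sketch below re-implements the same sweep on plain tuples (segment_rates is a stand-in name, not part of the listing above) and runs it on two invented cues that each sit at or under 20 CPS on their own but overlap for one second.

```python
def segment_rates(cues):
    # cues: list of (start_ms, end_ms, glyphs) tuples.
    # Returns [(seg_start_ms, seg_end_ms, cps)], one entry per stretch of the
    # timeline in which the set of active cues does not change.
    edges = sorted({edge for s, e, _ in cues for edge in (s, e)})
    out = []
    for a, b in zip(edges, edges[1:]):
        rate = sum(
            g / ((e - s) / 1000.0)
            for s, e, g in cues
            if s < b and e > a          # cue is active somewhere in [a, b)
        )
        out.append((a, b, rate))
    return out


# Invented sample data: cue A paints 40 glyphs over 2 s, cue B 30 over 2 s.
cues = [(0, 2000, 40), (1000, 3000, 30)]

per_cue = [g / ((e - s) / 1000.0) for s, e, g in cues]
merged = segment_rates(cues)

print(per_cue)   # each cue alone: 20.0 and 15.0 cps
print(merged)    # the overlap segment 1000..2000 ms carries 35.0 cps
```

A per-cue check against a 20 CPS gate passes both cues, while the merged series shows a one-second stretch at 35 CPS, above the 30 CPS burst envelope.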

The dual-rate gate (find_violations). Two ceilings are evaluated against that merged series. The burst test flags any single segment whose instantaneous rate exceeds 30 CPS, the short-window envelope a CEA-608 decoder can absorb before dropping glyphs. The sustained test integrates the rate over a 1-second window that opens at each segment’s start (w0 = a, w1 = a + WINDOW_MS): because the window is exactly 1.0 s, the integral of CPS over it is numerically equal to the glyphs introduced, so comparing that sum to 20 directly yields a CPS verdict. If WINDOW_MS is changed, the sum must be divided by the window length in seconds before the comparison. Splitting the two means a short spike of dense dialogue that stays at or under 30 CPS is not failed the same way a sustained wall of text is, which keeps the gate from generating noise that engineers learn to ignore.
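The window arithmetic can be checked by hand on a tiny series. The sketch below is an illustration, not the listing's code: window_cps is a hypothetical helper that integrates the segment rates over a window opening at w0_ms and, unlike the listing, divides by the window length so the result stays in CPS when the window is not exactly one second.

```python
def window_cps(series, w0_ms, window_ms=1000):
    # series: [(seg_start_ms, seg_end_ms, cps)] as produced by the sweep.
    # Returns the average cps over the window [w0_ms, w0_ms + window_ms).
    w1_ms = w0_ms + window_ms
    glyphs = sum(
        rate * (min(e, w1_ms) - max(s, w0_ms)) / 1000.0   # cps * seconds
        for s, e, rate in series
        if s < w1_ms and e > w0_ms                         # segment overlaps window
    )
    return glyphs / (window_ms / 1000.0)


# Invented series: 0.5 s at 30 cps followed by 1.0 s at 10 cps.
series = [(0, 500, 30.0), (500, 1500, 10.0)]

# 1 s window from 0 ms: 30 * 0.5 + 10 * 0.5 = 20 glyphs in 1 s.
print(window_cps(series, 0))

# 0.5 s window from 0 ms: 15 glyphs in 0.5 s, i.e. the burst rate itself.
print(window_cps(series, 0, 500))
```

With the default one-second window the division is by 1.0 and the result matches the listing's raw sum; the half-second call shows the case where skipping the division would under-report the rate by half.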

The exit contract. The __main__ block prints each violation with its onset, measured CPS, and kind, then exits non-zero when any violation exists. That non-zero exit is the entire integration surface a build gate needs — no log parsing, no JSON round-trip.
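As an illustration of how little a caller needs, here is a minimal sketch of a wrapper that only inspects the exit status. gate_passes is a hypothetical name, and the two commands are stand-ins that simply exit 0 or 1 so the sketch runs without a caption file; a real pipeline would substitute the checker's own command line.

```python
import subprocess
import sys


def gate_passes(cmd: list[str]) -> bool:
    # Zero exit status means the checker reported no violations.
    return subprocess.run(cmd).returncode == 0


# Stand-ins for the checker process (assumption: any command with the same
# exit-code contract behaves identically from the caller's point of view).
clean = [sys.executable, "-c", "import sys; sys.exit(0)"]
dirty = [sys.executable, "-c", "import sys; sys.exit(1)"]

print(gate_passes(clean), gate_passes(dirty))

# In a pipeline the command might look like
# [sys.executable, "check_cps.py", "episode.srt"]  (hypothetical file names).
```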

Edge cases and known gotchas

Zero-duration and inverted cues. A cue with end_ms == start_ms divides by zero in the rate calculation, and one with end_ms < start_ms has a negative duration and a meaningless rate. load_cues filters both out, but log them rather than dropping silently: an inverted cue is usually a drop-frame timecode wraparound that needs fixing at the parser, not here.
Combining-mark-only sequences. A cue that opens with a stray combining mark (no base) makes \X emit a grapheme cluster that is not whitespace but paints oddly. It still counts as one cell, which matches decoder behaviour, but flag such cues for review — they often signal an encoding fault upstream.
Emoji and ZWJ sequences. A family emoji is several code points joined by zero-width joiners, and a country flag is a pair of regional-indicator code points; \X folds each to one grapheme. Plain len() counts seven code points for a four-person family, manufacturing a phantom rate violation.
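Dense overlap stacking. Three or more roll-up rows briefly co-resident at a speaker change can push a single short segment well past 20 CPS for one or two frames. The burst-vs-sustained split absorbs this as long as the segment stays at or under 30 CPS and the 1-second window total stays at or under 20; a segment above 30 CPS is still reported as a burst violation however brief it is. Do not raise BURST_CPS to silence it.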
Jurisdictional ceilings. EBU-TT-D / BBC reader-comfort guidance caps sustained density nearer 17 CPS. Drive SUSTAINED_CPS from config and enforce the lowest ceiling the target market requires rather than hard-coding 20.
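Two of the recommendations in the list above, logging rejected cues and driving the sustained ceiling from configuration, could be sketched as follows. Everything named here is hypothetical: the logger name, the market keys, and both helpers are illustrative, and the only values taken from this page are the 20 and 17 CPS ceilings.

```python
import logging

log = logging.getLogger("caption_qc")   # hypothetical logger name

# Hypothetical market table; the values are the ceilings quoted on this page.
CEILINGS_CPS = {"us": 20.0, "uk_ebu": 17.0}


def usable_cues(cues):
    # cues: iterable of (start_ms, end_ms, text).
    # Keeps positive-duration cues and logs every cue it rejects.
    kept = []
    for start_ms, end_ms, text in cues:
        if end_ms > start_ms:
            kept.append((start_ms, end_ms, text))
        else:
            log.warning("rejecting cue %d..%d ms (zero or inverted duration)",
                        start_ms, end_ms)
    return kept


def sustained_ceiling(markets):
    # Strictest (lowest) sustained ceiling among the requested markets.
    return min(CEILINGS_CPS[m] for m in markets)


sample = [(0, 1000, "ok"), (2000, 2000, "zero"), (5000, 4000, "inverted")]
print(usable_cues(sample))                    # only the first cue survives
print(sustained_ceiling(["us", "uk_ebu"]))    # lowest of the two ceilings
```

Taking the minimum means a file cleared for the strictest market is cleared for all of them, at the cost of flagging material that a single-market delivery would accept.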

Threshold reference

Limit	Value	Window	Source
CEA-608 physical ceiling	≈ 29.97 cps	per second of airtime	CEA-608, 2 bytes/frame @ 29.97 fps
Operational sustained gate	20 cps	rolling 1.0 s	FCC 47 CFR § 79.1 (readability)
Short-burst envelope	30 cps	per segment	CEA-608 decoder buffer
Reader-comfort target	17 cps (≈ 160–180 WPM)	sustained	EBU-TT-D / BBC subtitle guidelines
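The CPS-to-WPM equivalence in the last row depends on an average word length, which the table does not state. The sketch below makes that assumption explicit: cps_to_wpm is a hypothetical helper, and the figure of roughly 6 characters per word (including the trailing space) is an assumption chosen for illustration, not a value from any of the cited sources.

```python
def cps_to_wpm(cps: float, chars_per_word: float) -> float:
    # characters/second * 60 seconds/minute / characters/word = words/minute
    return cps * 60.0 / chars_per_word


# Assumed average word lengths (characters per word, trailing space included).
for chars_per_word in (5.7, 6.0, 6.4):
    print(chars_per_word, round(cps_to_wpm(17.0, chars_per_word)))
```

Under these assumed word lengths 17 CPS falls inside the 160 to 180 WPM band; a different word-length convention, or counting without spaces, shifts the result.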

Integration hook

This routine is the glyph-accurate, overlap-aware core that drops into the general sliding-window validator described in enforcing character rate limits in QC: swap that validator’s per-cue CPS estimator for density_series, and feed the resulting violation list into the same gating verdict. Because it returns a plain list of dicts and exits non-zero on failure, it slots without modification into CI/CD gating for caption builds, and its per-onset CPS values are the raw signal that scheduled QC report generation aggregates into compliance dashboards. Run it before automated sync drift detection, since unresolved density spikes inflate timestamp jitter and produce false drift warnings.
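As a rough idea of what a reporting step might do with the violation list, the sketch below rolls it up per kind. summarize and its output shape are hypothetical; the only thing taken from the listing is the dict layout (at_ms, cps, kind) that find_violations returns, and the sample rows are invented.

```python
from collections import defaultdict


def summarize(viols):
    # viols: list of {"at_ms", "cps", "kind"} dicts.
    # Returns {kind: {"count", "max_cps", "first_ms"}}.
    summary = defaultdict(lambda: {"count": 0, "max_cps": 0.0, "first_ms": None})
    for v in viols:
        row = summary[v["kind"]]
        row["count"] += 1
        row["max_cps"] = max(row["max_cps"], v["cps"])
        if row["first_ms"] is None or v["at_ms"] < row["first_ms"]:
            row["first_ms"] = v["at_ms"]
    return dict(summary)


# Invented sample in the same shape the checker emits.
sample = [
    {"at_ms": 1000, "cps": 35.0, "kind": "burst"},
    {"at_ms": 1000, "cps": 27.5, "kind": "sustained"},
    {"at_ms": 4000, "cps": 22.0, "kind": "sustained"},
]
print(summarize(sample))
```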

Detecting sync drift in automated QC pipelines — PTS alignment that depends on clean, validated density signals.
Generating daily QC reports with Python — turning per-onset CPS verdicts into scheduled audit output.
SRT timestamp normalization — the millisecond-epoch cue model this check consumes.
FCC Part 79 compliance checklist — where the § 79.1 readability obligation this gate enforces is defined.
