SRT Timestamp Normalization

SRT timestamp normalization is the deterministic temporal gate that sits between raw subtitle ingest and any compliance check that assumes frame-accurate cues. The compliance gap it closes is specific: SRT (SubRip) timestamps carry millisecond precision with no notion of a frame grid, so a file authored in a NLE, exported from a speech-to-text engine, or round-tripped through a vendor tool routinely lands cue edges between video frames. Downstream that surfaces as decoder flicker, overlapping cues the muxer rejects, and synchronization that already burns part of the ±2-frame budget before FCC Part 79 compliance ever runs. Normalization removes that ambiguity by quantizing every start and end to the nearest frame boundary for the target rate (33.367 ms at 29.97 fps, 40.000 ms at 25 fps), forbidding cue overlap at a hard 0 ms tolerance, and preserving a minimum two-frame gap so consecutive cues clear the decoder cleanly.

This step is part of the SRT, SCC & WebVTT parsing workflows track. It runs after a file has been parsed into a cue model and before it is validated, converted, or embedded — every threshold below is read from the reference table so the same routine re-targets another frame rate by swapping one config value rather than editing arithmetic.

Pipeline stage & prerequisites

Normalization executes at the post-parse, pre-validate boundary. Upstream, the SRT has already been decoded to a cue list — either parsed directly or produced by converting SCC to SRT without timing loss, which is the most common source of off-grid timecodes because SCC frame counts get translated into milliseconds and inherit rounding error. Downstream, the normalized cues feed the static acceptance gates in FCC Part 79 compliance, cross-format parity against WebVTT cue extraction and validation, and continuous measurement by automated sync drift detection. Placing the gate here is what keeps it cheap: an off-grid edge fixed at ingest is a config-driven snap; the same edge caught at playout is a re-render.

The target frame rate is not guessed — read it from the companion media with ffprobe so the grid matches the actual timebase rather than a hard-coded default.

Tool / library	Version (tested)	Role in normalization
`pysrt`	1.1.2	Parse/serialize SRT, expose `SubRipTime.ordinal` in ms
`ffprobe` (FFmpeg)	6.0+	Read the exact `r_frame_rate` rational of the target video
`numpy`	1.26+	Vectorized snapping and overlap resolution at batch scale
Python	3.10+	`fractions.Fraction` for exact NTSC rationals, `dataclasses`

Step-by-step implementation

Step 1 — Build an exact frame grid from the source rate

The single most common normalization bug is treating 29.97 fps as 30 fps, which accumulates ~3.6 s of drift per hour. Derive frame duration from the exact SMPTE rational, never from a rounded 33.367 literal.

from fractions import Fraction

# SMPTE ST 12-1 — broadcast rates as exact rationals (never rounded floats)
FRAME_RATES = {
    "23.976": Fraction(24000, 1001),
    "24":     Fraction(24, 1),
    "25":     Fraction(25, 1),       # EBU / PAL
    "29.97":  Fraction(30000, 1001), # NTSC
    "30":     Fraction(30, 1),
    "59.94":  Fraction(60000, 1001),
    "60":     Fraction(60, 1),
}

def frame_ms(fps: Fraction) -> float:
    # Milliseconds per frame; one frame is the smallest legal time quantum
    return float(Fraction(1000, 1) / fps)

Pull the rate from the companion video instead of assuming one, so the grid matches the timebase the cues will actually be muxed against:

import subprocess, json
from fractions import Fraction

def probe_frame_rate(media_path: str) -> Fraction:
    # ffprobe reads stream metadata only — no decode — to get the exact rational
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=r_frame_rate", "-of", "json", media_path],
        capture_output=True, text=True, check=True,
    )
    num, den = json.loads(out.stdout)["streams"][0]["r_frame_rate"].split("/")
    return Fraction(int(num), int(den))   # e.g. 30000/1001 for NTSC

Step 2 — Parse cues with pysrt

pysrt handles the comma decimal separator, BOM-prefixed files, and index renumbering, and exposes each timestamp as SubRipTime whose .ordinal attribute is the absolute offset in milliseconds — the integer form every later step operates on.

import pysrt

def load_cues_ms(path: str):
    # utf-8-sig drops a leading BOM so the first index parses cleanly
    subs = pysrt.open(path, encoding="utf-8-sig")
    # .ordinal is total milliseconds; integer math avoids float drift
    return [(item.index, item.start.ordinal, item.end.ordinal, item.text)
            for item in subs], subs

Step 3 — Snap each cue edge to the frame grid

Round each edge to the nearest frame, then convert back to milliseconds. Rounding (not truncation) keeps the maximum introduced error to half a frame, well inside the FCC 47 CFR § 79.1(j) ±2-frame sync budget.

def snap_ms(t_ms: int, fpms: float) -> int:
    # Snap an absolute ms offset to the nearest frame boundary
    frame_index = round(t_ms / fpms)        # nearest whole frame
    return int(round(frame_index * fpms))   # back to ms, half-frame max error

Step 4 — Enforce minimum duration and the inter-cue gap

EBU Tech 3350 and ATSC A/53 prohibit overlapping cues outright (0 ms tolerance), and consumer decoders need a clear gap between cues to avoid flicker. Enforce a minimum on-screen duration and a two-frame gap after snapping.

MIN_GAP_FRAMES = 2          # decoder flicker guard between consecutive cues
MIN_DURATION_FRAMES = 15    # ~0.5 s readable floor at 30 fps grid

def enforce_duration(start_ms: int, end_ms: int, fpms: float) -> tuple[int, int]:
    min_dur = int(round(MIN_DURATION_FRAMES * fpms))
    if end_ms - start_ms < min_dur:        # too brief to read -> extend tail
        end_ms = start_ms + min_dur
    return start_ms, end_ms

Step 5 — Resolve overlaps in a single forward pass and write back

Walk the cues in order, carrying the previous cue’s end. If a snapped start lands before that end, push it forward by the required gap; re-apply the duration floor if the push collapsed the cue. Then write the integer milliseconds straight back into the pysrt objects so the serializer emits canonical HH:MM:SS,mmm timecodes.

import pysrt

def normalize_file(in_path: str, out_path: str, fps: Fraction) -> dict:
    fpms = frame_ms(fps)
    gap = int(round(MIN_GAP_FRAMES * fpms))   # 0 ms overlap tolerance + flicker guard
    subs = pysrt.open(in_path, encoding="utf-8-sig")

    prev_end = 0
    adjustments = 0
    for item in subs:
        start = snap_ms(item.start.ordinal, fpms)   # frame-snap both edges
        end   = snap_ms(item.end.ordinal, fpms)

        if start < prev_end + gap:                  # EBU 3350 — no overlap, keep gap
            start = prev_end + gap
        start, end = enforce_duration(start, end, fpms)

        if (start, end) != (item.start.ordinal, item.end.ordinal):
            adjustments += 1
        # SubRipItem accepts ms via the .ordinal setter on its time objects
        item.start.ordinal = start
        item.end.ordinal = end
        prev_end = end

    subs.clean_indexes()        # renumber 1..N after any reordering
    subs.save(out_path, encoding="utf-8")
    return {"cues": len(subs), "adjusted": adjustments}

At batch scale the per-cue loop is the bottleneck; the same logic vectorizes cleanly. Load edges into a numpy array, snap with one division, and resolve overlaps with a running maximum:

import numpy as np

def snap_array(edges_ms: np.ndarray, fpms: float) -> np.ndarray:
    # Vectorized equivalent of snap_ms over an (N, 2) start/end array
    return np.rint(np.rint(edges_ms / fpms) * fpms).astype(np.int64)

Threshold reference table

Every limit the routine reads lives here, not in prose, so a config swap re-points the same code at another frame rate or jurisdiction.

Parameter	Value	Spec clause	Notes
Frame duration (23.976)	41.708 ms	SMPTE ST 12-1	`Fraction(24000, 1001)`
Frame duration (25)	40.000 ms	EBU Tech 3264	PAL / EBU integer rate
Frame duration (29.97)	33.3667 ms	SMPTE ST 12-1	`Fraction(30000, 1001)`, never `33.367` literal
Frame duration (59.94)	16.6833 ms	SMPTE ST 12-1	`Fraction(60000, 1001)`
Sync tolerance	±2 frames (±66.7 ms @ 29.97)	47 CFR § 79.1(j)	Snapping introduces ≤ ½ frame error
Cue overlap tolerance	0 ms	EBU Tech 3350 / ATSC A/53	Hard prohibition on overlapping cues
Minimum inter-cue gap	2 frames (66.7 ms @ 29.97)	EBU Tech 3350	Decoder flicker guard
Minimum display duration	~0.5 s (15 frames @ 30)	Ofcom / EBU guidance	Readability floor
Maximum display duration	~7 s	EBU / Ofcom guidance	Flag cues held beyond this

Verification & test pattern

Lock each threshold behind a pytest fixture so a config change that loosens the grid fails CI instead of shipping off-grid cues. Test the boundary on both sides — an edge already on a frame must be unchanged, and an edge a millisecond off must snap.

import pytest
from fractions import Fraction
from normalize import frame_ms, snap_ms, MIN_GAP_FRAMES

@pytest.fixture
def ntsc_fpms():
    return frame_ms(Fraction(30000, 1001))   # ~33.3667 ms

def test_on_grid_edge_is_stable(ntsc_fpms):
    on_grid = int(round(100 * ntsc_fpms))    # exactly frame 100
    assert snap_ms(on_grid, ntsc_fpms) == on_grid

def test_off_grid_edge_snaps_to_nearest(ntsc_fpms):
    off = int(round(100 * ntsc_fpms)) + 5    # 5 ms past frame 100
    snapped = snap_ms(off, ntsc_fpms)
    assert abs(snapped - off) <= ntsc_fpms / 2   # ≤ half-frame error

def test_consecutive_cues_keep_two_frame_gap(ntsc_fpms):
    gap = int(round(MIN_GAP_FRAMES * ntsc_fpms))
    # cue B starting 1 ms after cue A ends must be pushed out by `gap`
    prev_end = int(round(100 * ntsc_fpms))
    start = prev_end + 1
    if start < prev_end + gap:
        start = prev_end + gap
    assert start - prev_end >= gap

For a full-file assertion, re-parse the output and confirm every edge is an exact multiple of the frame pitch and no two cues touch:

import pysrt
from fractions import Fraction
from normalize import frame_ms

def assert_normalized(path: str, fps: Fraction):
    fpms = frame_ms(fps)
    subs = pysrt.open(path, encoding="utf-8")
    prev_end = -1
    for item in subs:
        for edge in (item.start.ordinal, item.end.ordinal):
            assert abs(edge - round(edge / fpms) * fpms) < 1.0   # on-grid
        assert item.start.ordinal > prev_end                     # no overlap
        prev_end = item.end.ordinal

Troubleshooting / failure modes

Cumulative drift of ~3.6 s per hour after normalization : Root cause: the grid was built from 30.0 (or a rounded 33.367) instead of the exact NTSC rational. Fix: source frame duration from Fraction(30000, 1001) via probe_frame_rate, and keep all arithmetic on integer milliseconds until the final serialize.

First cue fails to parse / index is off by one : Root cause: a UTF-8 BOM glued to the first index, a known SubRip export artifact. Fix: open with encoding="utf-8-sig"; for deeper encoding repair see fixing UTF-8 encoding errors in SCC files, whose detection logic applies to SRT too.

Snapped cues now overlap even though the source did not : Root cause: two edges rounded toward the same frame, collapsing the gap below 0 ms. Fix: run the forward-pass gap enforcement (Step 5) after snapping, never before — snapping can create overlaps that only a post-pass resolves.

Output timecodes use a period instead of a comma : Root cause: the writer emitted WebVTT-style . separators into an .srt file. Fix: serialize through pysrt (subs.save), which always emits the SubRip HH:MM:SS,mmm form; reserve the period separator for WebVTT deliverables.

A 25 fps caption file drifts against a 23.976 fps mezzanine : Root cause: normalizing to the wrong grid because the rate was assumed, not probed. Fix: always read r_frame_rate from the target video; mismatched timebases are a conversion problem covered in SCC vs SRT vs WebVTT architecture.

Very short cues vanish on consumer decoders : Root cause: a sub-frame or single-frame cue snapped to zero or below the readable floor. Fix: apply the minimum-duration extension in Step 4; flag any cue whose original span was under one frame for editorial review rather than silently extending it.

Operational notes

Normalization is CPU-light and I/O-bound: the cost is reading and writing files and the one ffprobe probe per asset, not the arithmetic. Size a worker pool to roughly the physical core count and give each worker one asset at a time; a feature-length SRT is tens of thousands of cues but each cue is a few integers and a short string, so per-worker resident memory stays trivial. When normalizing a whole library against one known rate, batch the probe step — probe each distinct companion video once and cache the Fraction, rather than re-probing per cue file. For multi-file batch runs that fan normalization out across queues, the concurrency and backpressure patterns are covered in async batch caption processing.

Keep the routine stateless and idempotent: normalizing an already-normalized file must be a no-op (the adjusted count returns 0), which lets the QC layer re-run assets after an upstream fix without special-casing. Emit the per-file {"cues", "adjusted"} summary plus the largest single edge adjustment to a structured log; any adjustment exceeding one frame is a signal of source corruption or a frame-rate mismatch worth surfacing to automated sync drift detection rather than silently absorbing.

Converting SCC to SRT without timing loss — Upstream conversion whose frame→ms rounding is the most common source of off-grid SRT edges.
Parsing SCC with Python libraries — SCC state-machine decoding that feeds the cue model this step normalizes.
WebVTT cue extraction and validation — Apply the identical frame grid to keep SRT and WebVTT deliverables in temporal parity.
Async batch caption processing — Worker-pool and backpressure patterns for running normalization across a library.
FCC Part 79 compliance checklist — The downstream sync gate whose ±2-frame budget this step protects.

Part of: SRT, SCC & WebVTT Parsing Workflows.

SRT Timestamp Normalization

Continue reading

Related in Parsing Workflows