Detecting Sync Drift in Automated QC Pipelines

You need a stateless check that, given a caption file and its companion media, returns a deterministic pass/fail verdict for synchronization plus a drift curve you can archive as audit evidence, and that is cheap enough to run on every asset at ingest. The failure mode is gradual: caption onsets slip away from video presentation timestamps (PTS), so the asset passes a start-of-program spot check and then exceeds the synchronicity tolerance applied under FCC 47 CFR § 79.1 later in the program, on air. A commonly enforced limit derived from § 79.1 is ±2 frames, which at 29.97 fps is ±66.7 ms. The detector below separates that slow linear accumulation from sudden step jumps (encoder buffer resets, mid-roll re-anchoring) and gates both.

Minimal working drift detector

import json
import subprocess
from decimal import Decimal, ROUND_HALF_UP
from fractions import Fraction
from pathlib import Path

import pysrt  # pip install pysrt

# FCC 47 CFR § 79.1 — captions must be synchronous; a ±2-frame budget is the
# de-facto enforcement tolerance. At 29.97 fps, 2 frames = 66.7 ms.
FRAME_RATE = Fraction(30000, 1001)            # 29.97 fps NTSC rate (reference; not used in the verdict)
LINEAR_TOLERANCE_MS = Decimal("66.7")         # cumulative drift ceiling (±2 frames)
STEP_TOLERANCE_MS = Decimal("83.3")           # single-jump ceiling (~2.5 frames)
SAMPLE_INTERVAL_S = Decimal("10")             # probe density along the timeline


def probe_video_pts(media: Path) -> list[Decimal]:
    """Packet-level PTS in seconds, sorted; no frame decode, so it is ingest-cheap."""
    cmd = ["ffprobe", "-v", "quiet", "-print_format", "json",
           "-select_streams", "v:0", "-show_entries", "packet=pts_time", str(media)]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    packets = json.loads(out).get("packets", [])
    # ffprobe lists packets in demux (decode) order, so PTS can be out of order
    # when the stream has B-frames; sort before sampling.
    return sorted(Decimal(p["pts_time"]) for p in packets
                  if p.get("pts_time") not in (None, "N/A", ""))


def caption_onsets(srt_path: Path) -> list[Decimal]:
    """SRT cue start times in seconds, read from pysrt's SubRipTime.ordinal (ms)."""
    subs = pysrt.open(str(srt_path), encoding="utf-8-sig")  # tolerate UTF-8 BOM
    return [Decimal(s.start.ordinal) / 1000 for s in subs]  # ordinal is integer ms


def sample(series: list[Decimal], step: Decimal = SAMPLE_INTERVAL_S) -> list[Decimal]:
    """First value in each fixed program-time window, so both series align by index.

    Index alignment assumes each series has a value in every window.
    """
    picked, nxt = [], series[0] if series else Decimal(0)
    for t in series:
        if t >= nxt:
            picked.append(t)
            while nxt <= t:          # skip past empty windows so the grid stays fixed
                nxt += step
    return picked


def detect_drift(media: Path, srt_path: Path) -> dict:
    video = sample(probe_video_pts(media))
    caption = sample(caption_onsets(srt_path))
    n = min(len(video), len(caption))
    if n < 2:
        return {"compliant": None, "reason": "insufficient samples"}

    offsets = [caption[i] - video[i] for i in range(n)]
    baseline = offsets[0]                                  # allow a fixed lip-sync offset
    drift = [o - baseline for o in offsets]                # evaluate only deviation
    max_linear = max(abs(d) for d in drift) * 1000
    max_step = max(abs(drift[i] - drift[i - 1]) for i in range(1, n)) * 1000

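    def q(v: Decimal) -> Decimal:
        """Round to 0.1 ms for reporting; the verdict uses unrounded values."""
        return v.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

    return {
        "max_linear_drift_ms": q(max_linear),
        "max_step_drift_ms": q(max_step),
        "drift_curve_ms": [q(d * 1000) for d in drift],   # per-sample curve for the audit record
        "compliant": max_linear <= LINEAR_TOLERANCE_MS and max_step <= STEP_TOLERANCE_MS,
    }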


if __name__ == "__main__":
    print(detect_drift(Path("program.mxf"), Path("program.srt")))

Code walkthrough

probe_video_pts shells out to ffprobe and reads packet=pts_time rather than decoding frames. Packet PTS gives frame-accurate anchor points at a fraction of the cost of a decode, which is what makes the check affordable on every ingest instead of only on flagged assets. subprocess.run(..., check=True) makes a malformed container fail loudly so a worker quarantines it rather than silently emitting an empty series.
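For orientation, the JSON that this ffprobe command emits is an object with a packets array whose entries carry pts_time as a decimal string. The sketch below applies the same filter to a hand-written payload. The values are illustrative rather than captured from a real asset, and parse_pts is a hypothetical helper name used only for this example.

```python
import json
from decimal import Decimal

# Hand-written stand-in for ffprobe's JSON output; values are illustrative.
SAMPLE_OUTPUT = """
{"packets": [
  {"pts_time": "0.000000"},
  {"pts_time": "0.033367"},
  {"pts_time": "N/A"},
  {},
  {"pts_time": "0.066733"}
]}
"""


def parse_pts(raw: str) -> list[Decimal]:
    """Keep only packets with a usable pts_time, mirroring probe_video_pts."""
    packets = json.loads(raw).get("packets", [])
    return [Decimal(p["pts_time"]) for p in packets
            if p.get("pts_time") not in (None, "N/A", "")]


pts = parse_pts(SAMPLE_OUTPUT)
print(len(pts), "usable packets out of 5")
```

Parsing the strings straight into Decimal keeps the six-digit precision ffprobe prints, with no round trip through float.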

caption_onsets relies on pysrt’s documented API: pysrt.open returns a SubRipFile, and each item’s .start.ordinal is the cue onset as integer milliseconds, so dividing by 1000 yields exact seconds with no string parsing. Opening with encoding="utf-8-sig" strips a leading byte-order mark if present; a BOM left attached to the first cue index is a routine cause of a corrupt first onset. For SCC or WebVTT sources, swap this function for the matching parser (see parsing SCC with Python libraries, and WebVTT cue extraction and validation); the rest of the detector is format-agnostic.
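To make the swap concrete, here is a minimal sketch of a WebVTT replacement that uses only the standard library. It assumes well-formed cue timing lines and extracts start times only; it does not validate the WEBVTT header, cue settings, or cue ordering, and webvtt_onsets is a hypothetical name rather than part of any library.

```python
import re
from decimal import Decimal

# Matches the start of a WebVTT cue timing line; the hours field is optional.
_TIMING = re.compile(r"^(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+")


def webvtt_onsets(text: str) -> list[Decimal]:
    """Hypothetical stand-in for caption_onsets: cue start times in seconds."""
    onsets = []
    for line in text.splitlines():
        m = _TIMING.match(line.strip())
        if m:
            hours, minutes, seconds, millis = m.groups()
            total_ms = (((int(hours or 0) * 60 + int(minutes)) * 60
                         + int(seconds)) * 1000 + int(millis))
            onsets.append(Decimal(total_ms) / 1000)  # integer ms to exact seconds
    return onsets


VTT = """WEBVTT

00:00:01.000 --> 00:00:04.000
Hello.

01:00:10.500 --> 01:00:12.000
One hour in.
"""
print(webvtt_onsets(VTT))
```

Because the function returns the same list of Decimal seconds as caption_onsets, the sampling and drift logic would not need to change.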

sample reduces both timelines to one point per fixed program-time window so the two series align positionally by index. Interval sampling (every 10 s here) keeps I/O bounded on multi-hour assets while preserving enough resolution to catch both slow accumulation and abrupt steps.
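Index pairing only holds when every window contains a sample from both series, which sparse caption tracks do not guarantee. One possible variation, sketched here under the assumption that timestamps are non-negative, keys each pick by its window number and pairs only the windows both series share. sample_by_window and pair_windows are hypothetical names, not part of the detector above.

```python
from decimal import Decimal


def sample_by_window(series: list[Decimal],
                     step: Decimal = Decimal("10")) -> dict[int, Decimal]:
    """Earliest value in each fixed window, keyed by window index from t=0."""
    picked: dict[int, Decimal] = {}
    for t in sorted(series):
        window = int(t // step)          # assumes t >= 0
        picked.setdefault(window, t)     # keep the earliest value per window
    return picked


def pair_windows(a: dict[int, Decimal],
                 b: dict[int, Decimal]) -> list[tuple[int, Decimal, Decimal]]:
    """Pair values only for windows that both series populate."""
    return [(w, a[w], b[w]) for w in sorted(a.keys() & b.keys())]


# Synthetic timelines: the caption track has nothing in window 1 (10-20 s).
video = sample_by_window([Decimal(x) for x in ("0", "10", "20", "30")])
caption = sample_by_window([Decimal(x) for x in ("0.5", "25", "26", "31")])
pairs = pair_windows(video, caption)
print(pairs)
```

The empty caption window is skipped instead of shifting every later pair by one position.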

detect_drift is where the compliance logic lives. It pairs samples by index, then subtracts offsets[0] as a baseline: a constant lip-sync offset that never grows is allowed, so only deviation from that starting offset is judged. This is what stops a deliberate fixed offset from being flagged as a § 79.1 violation; the size of the baseline itself is not gated by this function. max_linear is the worst absolute deviation across the program; max_step is the largest jump between adjacent samples, which isolates encoder buffer resets and mid-roll re-anchoring from gradual timebase scaling error. All arithmetic uses Decimal to avoid the binary floating-point rounding errors that accumulate in frame-accurate timing math, and the verdict gates against both ceilings independently. The Fraction(30000, 1001) rate is exposed because the usual physical origin of linear drift is a 1.001× timebase mismatch: 29.97 fps material timed as 30.00 fps slips by about 3.6 seconds per hour (roughly 108 frames), or about 1 ms for every second of program time.
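The size of a 1.001× mismatch can be checked with exact rational arithmetic. The sketch below assumes the simplest case, where frames shot at 30000/1001 fps are timed as if they ran at exactly 30 fps for the whole program; the variable names are illustrative.

```python
from fractions import Fraction

NTSC = Fraction(30000, 1001)   # actual frame rate
NOMINAL = Fraction(30)         # rate wrongly assumed by the timebase

# Seconds of error accumulated per second of program time.
drift_per_s = 1 - NTSC / NOMINAL

# Accumulated error after one hour, in seconds and in 29.97 fps frames.
drift_per_hour_s = drift_per_s * 3600
drift_per_hour_frames = drift_per_hour_s * NTSC

# Program time needed to use up a 2-frame budget at this drift rate.
tolerance_s = 2 / NTSC
seconds_to_breach = tolerance_s / drift_per_s

print(f"drift per hour: {float(drift_per_hour_s):.3f} s "
      f"({float(drift_per_hour_frames):.1f} frames)")
print(f"2-frame budget exhausted after {float(seconds_to_breach):.1f} s")
```

Under that assumption the ±2-frame ceiling is crossed a little over a minute into the program, which is why a spot check at the very start can pass while the rest of the asset fails.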

Threshold reference

Metric	Limit	Frame equivalent (29.97 fps)	Source
Linear (cumulative) drift	±66.7 ms	±2 frames	FCC 47 CFR § 79.1 synchronicity
Step (single-jump) drift	±83.3 ms	~±2.5 frames	Operator hard ceiling
Sample interval	10 s	299–300 frames	Detector tuning
Timebase scaling drift	~3.6 s/hr	~108 frames/hr (1.001×)	29.97 DF → 30.00 NDF mismatch
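To see how the two ceilings behave on different drift shapes, the sketch below applies them to synthetic per-window offsets in milliseconds. The offsets are made up for illustration, and classify is a hypothetical helper that repeats the detector's baseline, linear, and step arithmetic on a plain list; it expects at least two samples.

```python
from decimal import Decimal

LINEAR_TOLERANCE_MS = Decimal("66.7")
STEP_TOLERANCE_MS = Decimal("83.3")


def classify(offsets_ms: list[Decimal]) -> dict:
    """Apply both ceilings to caption-minus-video offsets (needs >= 2 samples)."""
    baseline = offsets_ms[0]
    drift = [o - baseline for o in offsets_ms]          # deviation from baseline
    max_linear = max(abs(d) for d in drift)
    max_step = max(abs(b - a) for a, b in zip(drift, drift[1:]))
    return {
        "max_linear_drift_ms": max_linear,
        "max_step_drift_ms": max_step,
        "linear_ok": max_linear <= LINEAR_TOLERANCE_MS,
        "step_ok": max_step <= STEP_TOLERANCE_MS,
    }


# Synthetic offsets, not measurements.
steady = [Decimal(40)] * 8                                # fixed house offset
ramp = [Decimal(40) + Decimal(10) * i for i in range(9)]  # +10 ms per window
jump = [Decimal(40)] * 4 + [Decimal(140)] * 4             # one 100 ms step

for name, series in (("steady", steady), ("ramp", ramp), ("jump", jump)):
    print(name, classify(series))
```

The ramp breaches only the linear ceiling, since no single window moves more than 10 ms, while the 100 ms jump breaches both.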

Edge cases & known gotchas

Variable frame rate (VFR) sources: packet PTS spacing is uneven, so index-paired samples can straddle different real times. Probe r_frame_rate/avg_frame_rate first and reject or CFR-normalize VFR inputs before trusting the verdict.
Drop-frame vs non-drop confusion: the Fraction(30000, 1001) rate must match the asset. Assuming 30.00 against a 29.97 source manufactures phantom linear drift at exactly the 1.001× rate.
BOM in the first cue: a UTF-8 BOM glued to cue index 1 corrupts the first onset and poisons the baseline; encoding="utf-8-sig" neutralizes it.
Single-cue or sparse files: fewer than two aligned samples yield compliant: None rather than a false pass. Sparse caption tracks (sports lower-thirds, music programs) need a different sampling floor.
Intentional lip-sync offset: without baseline subtraction a constant 40 ms house offset reads as a permanent violation; the baseline term is load-bearing, not cosmetic.
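The VFR pre-check mentioned in the first item could look roughly like the sketch below. It treats a mismatch between ffprobe's r_frame_rate and avg_frame_rate as a heuristic signal, not proof, of variable frame rate. The 0.05% tolerance and the function names are assumptions made for this example, and unparseable rates are treated as suspect.

```python
import json
import subprocess
from fractions import Fraction
from pathlib import Path
from typing import Optional

RATE_TOLERANCE = Fraction(1, 2000)   # assumed 0.05% relative tolerance


def parse_rate(value: str) -> Optional[Fraction]:
    """Parse an ffprobe rate string such as '30000/1001'; None if unusable."""
    try:
        rate = Fraction(value)
    except (ValueError, ZeroDivisionError):   # 'N/A' or '0/0'
        return None
    return rate if rate > 0 else None


def looks_vfr(r_frame_rate: str, avg_frame_rate: str) -> bool:
    """Heuristic: flag the stream when the two reported rates disagree."""
    r, avg = parse_rate(r_frame_rate), parse_rate(avg_frame_rate)
    if r is None or avg is None:
        return True                  # cannot confirm CFR, so treat as suspect
    return abs(r - avg) / r > RATE_TOLERANCE


def probe_rates(media: Path) -> tuple[str, str]:
    """Read both rate fields for the first video stream via ffprobe."""
    cmd = ["ffprobe", "-v", "quiet", "-print_format", "json",
           "-select_streams", "v:0",
           "-show_entries", "stream=r_frame_rate,avg_frame_rate", str(media)]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    stream = json.loads(out)["streams"][0]
    return stream["r_frame_rate"], stream["avg_frame_rate"]


print(looks_vfr("30000/1001", "30000/1001"))   # identical rates
print(looks_vfr("30000/1001", "24000/1001"))   # clearly different rates
```

The tolerance is deliberately tighter than 0.1% so that a 30.00 versus 29.97 disagreement is still flagged.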

Where this plugs in

This detector is the per-asset core of automated sync drift detection — that parent overview covers the worker model, drift-curve artifact, and WORM audit store that wrap this function. Run it after SRT timestamp normalization has snapped cues to the frame grid (off-grid edges otherwise show up as small spurious step drift), and fan it out at library scale with the queue and backpressure patterns in async batch caption processing. Its boolean verdict and metrics feed the build gate and the aggregated report described below.
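As a rough illustration of the hand-off to a build gate, the sketch below maps a detect_drift-style result to a process exit code. The code values are an assumed convention (0 pass, 1 fail, 2 inconclusive), not something the detector or any CI system defines, and exit_code is a hypothetical helper.

```python
def exit_code(result: dict) -> int:
    """Map a detect_drift-style result dict to a process exit code."""
    compliant = result.get("compliant")
    if compliant is True:
        return 0     # within both ceilings
    if compliant is None:
        return 2     # assumed convention: inconclusive, route to manual review
    return 1         # drift ceiling exceeded; fail the build


# Example results shaped like the detector's return value.
print(exit_code({"compliant": True}))
print(exit_code({"compliant": False}))
print(exit_code({"compliant": None, "reason": "insufficient samples"}))
```

A wrapper script would pass the returned value to sys.exit so that the pipeline stops before mux on anything other than 0.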

Automated sync drift detection — Parent reference: the validator worker, drift curve, and audit trail this snippet sits inside.
CI/CD gating for caption builds — Turn the compliant boolean into a non-zero exit code that fails the build before mux.
Scheduled QC report generation — Aggregate drift metrics across asset groups to surface systemic encoder misconfiguration.
Enforcing character rate limits in QC — The sibling timing gate that runs alongside drift on the same parsed cue model.

Part of: Automated QC Validation & Reporting.