Generating Daily QC Reports with Python

You need a single batch job that consumes a day’s caption deliverables and emits one machine-readable verdict per asset — provenance digest, cue count, every reading-rate breach, and a pass/fail flag — serialized so the same input always produces the same bytes. The failure mode this prevents is unprovable compliance: a report that scores assets correctly today but cannot be reproduced months later when an FCC inquiry asks which exact file passed and against which threshold. The reading-rate ceiling enforced below is derived from FCC 47 CFR § 79.1(j) — captions must track the program’s pace — with a sustained cap of 20 characters per second (CPS) for adult-rate content, the value CEA-708 authoring guidance commonly applies. Determinism is the other hard requirement: the report must hash identically across machines and scheduler retries so it can be committed to write-once storage.

Minimal working implementation

import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

import pysrt  # pip install pysrt==1.1.2

# FCC 47 CFR § 79.1(j) — captions must match program reading rate. CEA-708
# authoring guidance caps sustained presentation at ~20 CPS for adult content.
MAX_CPS = 20.0
RUN_UTC = datetime.now(timezone.utc).replace(microsecond=0).isoformat()


def sha256_file(path: Path) -> str:
    """Chain-of-custody digest, streamed at 64 KiB so memory stays O(1) per asset."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def audit_asset(path: Path) -> dict:
    """Parse one SRT and flag every cue over the § 79.1 reading-rate ceiling."""
    subs = pysrt.open(str(path), encoding="utf-8-sig")   # encoding strips a UTF-8 BOM
    violations = []
    for cue in subs:
        dur_s = (cue.end.ordinal - cue.start.ordinal) / 1000   # .ordinal is integer ms
        chars = len(cue.text_without_tags.replace("\n", ""))   # ignore markup + breaks
        cps = chars / dur_s if dur_s > 0 else float("inf")     # zero-dur cue = hard fail
        if cps > MAX_CPS:
            violations.append({
                "cue": cue.index,
                "start_ms": cue.start.ordinal,
                "measured_cps": round(cps, 2),
            })
    return {
        "asset": path.name,
        "sha256": sha256_file(path),
        "cue_count": len(subs),
        "violations": violations,
        "compliant": not violations,
    }


def generate_daily_report(ingest_dir: Path, run_id: str) -> str:
    assets = sorted(ingest_dir.glob("*.srt"))            # stable, sorted input order
    results = [audit_asset(p) for p in assets]
    report = {
        "schema_version": "1.2.0",
        "run_id": run_id,
        "generated_utc": RUN_UTC,
        "threshold_cps": MAX_CPS,
        "assets": results,
        "summary": {
            "total": len(results),
            "passed": sum(r["compliant"] for r in results),
            "failed": sum(not r["compliant"] for r in results),
        },
    }
    # sort_keys=True => identical bytes for identical input, so the report hash is
    # stable across machines and scheduler retries (required for WORM audit storage).
    return json.dumps(report, indent=2, sort_keys=True, ensure_ascii=False)


if __name__ == "__main__":
    out = generate_daily_report(Path("/ingest/today"), run_id="2026-06-28")
    Path("qc-report-2026-06-28.json").write_text(out, encoding="utf-8")

Code walkthrough

sha256_file reads each asset in 64 KiB chunks through iter(lambda: fh.read(65536), b"") rather than fh.read(), so a single multi-hour deliverable never loads into memory whole. The digest is computed before parsing, which fixes the asset’s identity at the moment it entered the report — this is the chain-of-custody record a regulator needs to confirm that the file that passed is byte-for-byte the file that aired. Hashing belongs in the same pass as validation so the proof and the verdict cannot drift apart.

audit_asset uses pysrt’s real object model: pysrt.open returns a SubRipFile whose items expose .index, .start, .end, and .text. Each SubRipTime carries an integer .ordinal in milliseconds, so cue duration is exact integer subtraction with no float timestamp parsing. text_without_tags strips <i>/<b> markup before counting characters, and removing \n collapses the two display lines into one rate calculation — counting the line break would understate CPS on a tightly timed two-liner. A zero-duration cue yields inf and fails immediately, because a cue with no on-screen time is itself a § 79.1 synchronicity defect, not a divide-by-zero to swallow. Opening with encoding="utf-8-sig" discards a leading byte-order mark; a BOM glued to cue index 1 is a routine cause of a corrupted first cue and a poisoned character count.

generate_daily_report makes the artifact reproducible. sorted(ingest_dir.glob("*.srt")) removes filesystem enumeration order from the output, and json.dumps(..., sort_keys=True) guarantees that identical telemetry serializes to identical bytes — the property that lets you store the report’s own SHA-256 alongside it and detect any later tampering. ensure_ascii=False preserves accented and non-Latin caption characters as UTF-8 instead of \uXXXX escapes, keeping the report faithful to the source. The summary block gives a scheduler or dashboard a single integer triplet to alert on without re-parsing the per-asset detail.

Threshold reference

Value	Setting	Source
Reading rate (adult)	20 CPS sustained	FCC 47 CFR § 79.1(j) / CEA-708 authoring
Children’s content rate	~17 CPS	CEA-708 authoring guidance
Zero-duration cue	hard fail	§ 79.1 synchronicity defect
Report retention	13 months minimum	FCC recordkeeping for captioned programming
Digest algorithm	SHA-256	NIST FIP 180-4

Edge cases & known gotchas

Empty or missing cue text: text_without_tags on a blank cue gives an empty string and cps == 0 (a pass), but an empty cue is usually an authoring error. Add an explicit zero-character check if your house rules treat blank cues as defects.
Overlapping cues: pysrt reads cues positionally and does not detect overlap; two cues sharing screen time each pass their own CPS check while the combined on-screen text exceeds the readable rate. Run an overlap pass before trusting per-cue CPS.
Non-SRT deliverables in the folder: glob("*.srt") silently skips SCC, TTML, and WebVTT files. For mixed ingest, parse SCC and WebVTT with their own readers and normalize to the same cue model before scoring — see the parsing links below.
Wall-clock RUN_UTC: computed at import time, so a long-running batch stamps every asset with the start time, not its own completion. That is intentional for a “daily run” identity, but do not treat it as a per-asset timestamp.
Idempotency: writing qc-report-<date>.json will overwrite a prior run for the same date. Gate the write on a content hash, or fail loudly, if a scheduler retry must never replace an already-signed report.

Where this plugs in

This script is the aggregation-and-serialization core of scheduled QC report generation; that parent overview wraps it with the idempotent trigger, Parquet/JSON dual output, and immutable storage policy. In a full pipeline the per-cue rate check shown here is the same one isolated in enforcing character rate limits in QC, and the compliant flag feeds straight into CI/CD gating for caption builds. Run it after SRT timestamp normalization has snapped cue edges to the frame grid so durations are exact, and merge its output with the metrics from automated sync drift detection for a single per-asset verdict that satisfies FCC Part 79 compliance.

For library-scale ingest, fan audit_asset out across workers with the queue and backpressure model in async batch caption processing, keeping each worker on an isolated memory space, then merge the per-asset dicts before the single deterministic json.dumps.

Scheduled QC report generation — Parent reference: the scheduler trigger, output schema, and immutable storage this script slots into.
How to enforce FCC character rate limits programmatically — The standalone reading-rate gate behind the CPS check used here.
Detecting sync drift in automated QC pipelines — The companion timing detector whose metrics merge into the same daily report.
Integrating caption QC into GitHub Actions CI — Turn the report’s failed count into a build-failing exit code.

Part of: Automated QC Validation & Reporting.