Scheduled QC Report Generation

Scheduled QC report generation is the gate that turns scattered validation telemetry into a single, signed compliance verdict on a fixed cadence. Individual checks (drift, reading rate, syntax) each emit per-cue events, but a regulator does not ask “was cue 412 in drift?”; they ask “was this asset compliant when it aired, and can you prove it?” This step answers that question deterministically: it runs on a schedule, aggregates every QC dimension for an asset window, scores it against jurisdictional thresholds, and writes an immutable artifact whose every line traces back to a clause. It sits inside Automated QC Validation & Reporting as the reporting and audit layer downstream of the live validators.

Problem framing

The engineering failure this step prevents is unprovable compliance. A pipeline can run flawless per-cue checks and still fail an FCC or Ofcom inquiry if it cannot reconstruct, months later, the exact verdict an asset received and the threshold configuration that produced it. The report is the evidence, so it must be deterministic (the same telemetry always yields the same score), idempotent (a scheduler retry never produces a second conflicting report), and immutable once written.

Three constraints make this harder than a nightly cron one-liner:

Trigger idempotency. Airflow, Kubernetes CronJob, and AWS EventBridge retry or re-deliver on failure, which gives at-least-once execution, not exactly-once. A naive job that re-runs on retry double-counts violations and emits two reports for one window. The job must derive a deterministic run key from the asset UUID and validation-run ID and refuse to re-emit under that key.
Aggregation window alignment. A daily window that does not align to the ingest boundary either clips the tail of a late-arriving asset or double-counts cues that straddle midnight. The window is defined on the validation-event timestamp, not on wall-clock job-start time.
Score reproducibility. A composite score computed from a mutable threshold file is worthless as evidence. The exact threshold set — and a hash of it — must be embedded in the report so the verdict is reproducible against the rules that were live at airtime, even after the rules change.

A spike in any single dimension is only meaningful once aggregated against the others, which is why this report consumes the verdicts emitted by enforcing character rate limits in QC and automated sync drift detection rather than re-deriving them.

Pipeline stage & prerequisites

Report generation runs in the post-validation, pre-archive stage. By the time the scheduler fires, every upstream validator has already written its per-cue verdicts into a telemetry store — typically a time-series store (InfluxDB) or a relational schema (PostgreSQL). The report job never re-decodes media or re-parses caption files; it consumes the normalized verdict rows produced by the parsers described in SRT, SCC & WebVTT parsing workflows and the QC stages that follow them. Running the report before the validators have committed their rows is the classic ordering bug: the aggregate silently scores a partial window as a pass.
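One way to guard against that ordering bug is a readiness check that runs before any scoring. The sketch below is illustrative only: the AssetCompletion record and its field names are hypothetical, since the source says only that validators commit verdict rows and does not define a sentinel schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetCompletion:
    """Hypothetical sentinel row a validator writes once an asset is finished."""
    asset_id: str
    expected_cues: int         # cue count announced by the parser
    committed_cues: int        # verdict rows actually present in the store
    validation_complete: bool  # explicit "done" marker from the validator


def incomplete_assets(sentinels: list[AssetCompletion], asset_ids: list[str]) -> list[str]:
    """Return the asset IDs that are not yet safe to score, in manifest order."""
    by_id = {s.asset_id: s for s in sentinels}
    pending = []
    for asset_id in asset_ids:
        s = by_id.get(asset_id)
        # A missing sentinel counts as "not ready", never as a pass.
        if s is None or not s.validation_complete or s.committed_cues < s.expected_cues:
            pending.append(asset_id)
    return pending
```

If the returned list is non-empty, the job would defer the run rather than score a partial window.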

Required tooling:

Dependency	Minimum version	Role
Python	3.9+	`dataclasses`, `list[...]` generics, `hashlib`, `pathlib`
`pandas`	2.0+	Grouped aggregation of per-cue telemetry into per-asset rows; pandas 1.x `read_sql` predates SQLAlchemy 2.0 support
`pyarrow`	12.0+	Columnar Parquet telemetry output for the warehouse
`SQLAlchemy`	2.0+	Parameterized, connection-pooled reads from the verdict store
`croniter`	1.4+	Validating and aligning the schedule window to ingest boundaries

The scheduler itself (Airflow, Kubernetes CronJob, EventBridge) is orthogonal — the job is a plain idempotent callable, so it runs identically under any orchestrator. The verdict rows it reads are the same canonical model emitted by CI/CD gating for caption builds during the build-time pass, which is why a report can score both pre-broadcast batches and CI runs without branching.

Step-by-step implementation

1. Derive an idempotent run key from the trigger manifest

The scheduler delivers a manifest of asset UUIDs and a validation-run ID. Hash them into a deterministic run key so an at-least-once retry maps to the same key and can be deduplicated before any work happens.

import hashlib
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class RunManifest:
    run_id: str
    asset_ids: tuple[str, ...]
    window_start_ms: int
    window_end_ms: int

def load_manifest(path: Path) -> RunManifest:
    """Parse the scheduler-delivered manifest into an immutable run descriptor."""
    raw = json.loads(path.read_text(encoding="utf-8"))
    return RunManifest(
        run_id=raw["run_id"],
        asset_ids=tuple(sorted(raw["asset_ids"])),  # sort -> order-independent key
        window_start_ms=int(raw["window_start_ms"]),
        window_end_ms=int(raw["window_end_ms"]),
    )

def run_key(manifest: RunManifest) -> str:
    """Deterministic dedupe key — identical across at-least-once scheduler retries."""
    payload = f"{manifest.run_id}|{'|'.join(manifest.asset_ids)}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

2. Read the window’s verdict rows with a parameterized query

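Pull only the telemetry inside the aligned window, using bound parameters rather than string interpolation so asset IDs cannot inject SQL. The query targets the verdict store the validators write to, never the raw media. The = ANY(:asset_ids) array comparison is PostgreSQL syntax, so an InfluxDB-backed store needs its own query.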

import pandas as pd
from sqlalchemy import create_engine, text

def fetch_verdicts(dsn: str, manifest: RunManifest) -> pd.DataFrame:
    """Load per-cue QC verdicts for the manifest's assets and time window."""
    engine = create_engine(dsn, pool_size=4, pool_pre_ping=True)
    query = text(
        """
        SELECT asset_id, cue_index, cps, sync_drift_ms, line_length, ts_ms
        FROM qc_verdicts
        WHERE asset_id = ANY(:asset_ids)
          AND ts_ms >= :w_start AND ts_ms < :w_end   -- half-open window, no straddle
        """
    )
    with engine.connect() as conn:
        return pd.read_sql(
            query, conn,
            params={
                "asset_ids": list(manifest.asset_ids),
                "w_start": manifest.window_start_ms,
                "w_end": manifest.window_end_ms,
            },
        )
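The query assumes the manifest already carries an aligned window. As a minimal sketch of what "aligned" means for fixed-length periods, the helper below floors an event timestamp to its window boundary using integer arithmetic. The function name, the UTC-day default, and the offset parameter are assumptions for illustration; schedules expressed as cron expressions would need croniter instead, which is not shown here.

```python
DAY_MS = 24 * 60 * 60 * 1000


def aligned_window(event_ts_ms: int, period_ms: int = DAY_MS, offset_ms: int = 0) -> tuple[int, int]:
    """Return the half-open [start, end) window that contains event_ts_ms.

    offset_ms shifts the boundary away from 00:00 UTC, for example to an
    ingest cut-off that falls at 06:00.
    """
    if period_ms <= 0:
        raise ValueError("period_ms must be positive")
    # Floor division maps every timestamp to exactly one window, so a cue
    # sitting on a boundary belongs to the window that starts there.
    start = ((event_ts_ms - offset_ms) // period_ms) * period_ms + offset_ms
    return start, start + period_ms
```

Because the window is derived from the validation-event timestamp, a retried job computes the same bounds no matter when it starts.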

3. Aggregate per-cue telemetry into per-asset metrics

Collapse thousands of cue rows into one row per asset with groupby. Vectorized aggregation avoids a Python-level loop over cues, and the result is bounded by asset count, not cue count; the input cue frame still has to fit in memory unless it is fetched in chunks.

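def aggregate_by_asset(df: pd.DataFrame, line_length_limit: int = 32) -> pd.DataFrame:
    """Reduce per-cue verdict rows to per-asset summary metrics."""
    if df.empty:
        # Return the summary schema, not the per-cue schema, so callers see stable columns.
        return pd.DataFrame(
            columns=["asset_id", "cue_count", "max_cps", "max_drift_ms", "over_length"]
        )
    grouped = df.groupby("asset_id").agg(
        cue_count=("cue_index", "count"),
        max_cps=("cps", "max"),
        max_drift_ms=("sync_drift_ms", lambda s: s.abs().max()),  # signed -> magnitude
        # Default is the CEA-608 32-column cap; pass the threshold's
        # line_length_limit to keep aggregation and scoring in sync.
        over_length=("line_length", lambda s: int((s > line_length_limit).sum())),
    )
    return grouped.reset_index()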

4. Score each asset against an embedded threshold set

The score is computed from a frozen ComplianceThreshold whose hash is later stamped into the report. Each penalty cites the clause it enforces, so the verdict is self-documenting evidence.

from dataclasses import dataclass, asdict
from typing import Any

@dataclass(frozen=True)
class ComplianceThreshold:
    cps_warning: float = 20.0      # operational sustained ceiling
    cps_fail: float = 30.0         # short-burst decoder-survival envelope
    drift_tolerance_ms: float = 100.0  # live-to-tape budget; the reference table lists ±150 ms for pre-recorded
    line_length_limit: int = 32    # CEA-608 32 columns per row
    pass_floor: float = 85.0       # composite score required to pass

def threshold_hash(t: ComplianceThreshold) -> str:
    """Hash the active rule set so the verdict is reproducible post-hoc."""
    return hashlib.sha256(json.dumps(asdict(t), sort_keys=True).encode()).hexdigest()[:16]

def score_asset(row: pd.Series, t: ComplianceThreshold) -> dict[str, Any]:
    """Weighted composite score; each violation maps to its regulatory clause."""
    violations: list[dict[str, Any]] = []
    if row.max_cps > t.cps_fail:
        violations.append({"type": "CPS_HARD_FAIL", "value": float(row.max_cps),
                           "weight": -15, "clause": "FCC 47 CFR § 79.1 (completeness)"})
    elif row.max_cps > t.cps_warning:
        violations.append({"type": "CPS_WARNING", "value": float(row.max_cps),
                           "weight": -5, "clause": "EBU-TT-D reading-rate guidance"})
    if row.max_drift_ms > t.drift_tolerance_ms:
        violations.append({"type": "SYNC_DRIFT_EXCEEDED", "value": float(row.max_drift_ms),
                           "weight": -20, "clause": "FCC 47 CFR § 79.1 (synchronicity)"})
    if row.over_length > 0:
        violations.append({"type": "LINE_LENGTH_VIOLATION", "count": int(row.over_length),
                           "weight": -10, "clause": "CEA-608 32-column limit"})

    score = max(0.0, 100.0 + sum(v["weight"] for v in violations))
    return {
        "asset_id": row.asset_id,
        "compliance_score": score,
        "status": "PASS" if score >= t.pass_floor else "FAIL",
        "violations": violations,
    }

5. Serialize to the three output targets

The scored frame forks into three artifacts: JSON for CI/CD gating, Parquet for the warehouse, and CSV for vendor handoff. Keep scoring decoupled from rendering so a new output format never touches the scoring logic.

def write_report(
    scored: list[dict[str, Any]],
    manifest: RunManifest,
    thresholds: ComplianceThreshold,
    out_dir: Path,
) -> Path:
    """Emit JSON (CI gate), Parquet (warehouse), and CSV (vendor handoff) with an audit envelope."""
    out_dir.mkdir(parents=True, exist_ok=True)
    envelope = {
        "run_key": run_key(manifest),
        "run_id": manifest.run_id,
        "window_ms": [manifest.window_start_ms, manifest.window_end_ms],  # half-open [start, end)
        "thresholds": asdict(thresholds),  # the exact rule set, so the hash can be checked later
        "threshold_hash": threshold_hash(thresholds),
        "engine_version": "qc-report/2.3.0",
        "assets": scored,
        "verdict": "fail" if any(a["status"] == "FAIL" for a in scored) else "pass",
    }
    json_path = out_dir / f"{envelope['run_key']}.json"
    json_path.write_text(json.dumps(envelope, indent=2, sort_keys=True), encoding="utf-8")

    # Flat one-row-per-asset frame shared by the warehouse and vendor copies.
    flat = pd.DataFrame(
        [
            {"asset_id": a["asset_id"], "score": a["compliance_score"], "status": a["status"],
             "run_key": envelope["run_key"], "threshold_hash": envelope["threshold_hash"]}
            for a in scored
        ],
        columns=["asset_id", "score", "status", "run_key", "threshold_hash"],
    )
    flat.to_parquet(out_dir / f"{envelope['run_key']}.parquet", engine="pyarrow", index=False)
    flat.to_csv(out_dir / f"{envelope['run_key']}.csv", index=False)
    return json_path
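On the consuming side, a CI gate only needs the top-level verdict from the JSON envelope. The sketch below shows one way to map it to a process exit code. The function name is hypothetical, and failing closed on an empty asset list is an assumed policy rather than something the source specifies.

```python
import json
from pathlib import Path


def gate_exit_code(report_path: Path) -> int:
    """Map the envelope's verdict to a process exit code (0 = pass, 1 = block)."""
    envelope = json.loads(report_path.read_text(encoding="utf-8"))
    # Assumed policy: a report that scored no assets must not unblock a build.
    if not envelope.get("assets"):
        return 1
    # Anything other than an explicit "pass" fails closed, including a missing key.
    return 0 if envelope.get("verdict") == "pass" else 1
```

A gate script could then end with sys.exit(gate_exit_code(path)), so the build fails on any non-pass envelope.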

6. Write the report immutably and refuse duplicate runs

Before doing any work, check the run key against the archive; that dedupe check makes the whole job idempotent under scheduler retries. Immutability itself comes from the archive: WORM (write-once-read-many) storage rejects overwrites, which the local existence check below does not enforce on its own.

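from typing import Optional

# Optional[Path] rather than "Path | None": the X | Y annotation syntax needs Python 3.10+.
def generate_if_new(manifest: RunManifest, dsn: str, out_dir: Path) -> Optional[Path]: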
    """Idempotent entrypoint — a retried at-least-once trigger is a no-op."""
    key = run_key(manifest)
    if (out_dir / f"{key}.json").exists():
        return None  # report already emitted for this exact run — do not double-count
    thresholds = ComplianceThreshold()
    df = fetch_verdicts(dsn, manifest)
    scored = [score_asset(row, thresholds) for _, row in aggregate_by_asset(df).iterrows()]
    return write_report(scored, manifest, thresholds, out_dir)
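The existence check above leaves a small gap between "check" and "write" in which two concurrent retries could both proceed. A minimal sketch of closing that gap on a local filesystem is exclusive file creation, shown below with a hypothetical write_once helper. This is an assumption about how the write could be hardened, not the pipeline's confirmed method, and it does not replace WORM storage: it only refuses to overwrite through this code path.

```python
import json
from pathlib import Path
from typing import Any


def write_once(path: Path, envelope: dict[str, Any]) -> bool:
    """Create the report only if it does not exist; return False on a duplicate."""
    body = json.dumps(envelope, indent=2, sort_keys=True)
    try:
        # Mode "x" is exclusive creation: the open itself fails if the file
        # exists, so there is no window between the check and the write.
        with open(path, "x", encoding="utf-8") as fh:
            fh.write(body)
    except FileExistsError:
        return False
    return True
```

A retried run that loses the race gets False back and can exit as a no-op, leaving the first report's bytes untouched.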

The rendering of human-facing daily summaries — PDF layout, per-shift rollups, and chart embedding — is covered in generating daily QC reports with Python.

Threshold reference table

Constraint	Threshold	Source
Sync drift tolerance (pre-recorded)	± 150 ms	FCC 47 CFR § 79.1 (synchronicity)
Sync drift tolerance (live-to-tape)	± 100 ms	Broadcast operational practice
Sustained CPS (operational)	≤ 20 CPS	Broadcast operational practice
Burst CPS (transient)	≤ 30 CPS	Broadcast operational practice
Characters per line (CEA-608)	32	CEA-608
Composite pass floor	≥ 85 / 100	Pipeline policy (configurable)
Report retention (broadcast-ready)	≥ 90 days	FCC / CRTC archival practice
Report retention (compliance artifacts)	≥ 12 months	FCC / CRTC archival practice
Aggregation window	aligned to ingest boundary	Pipeline policy

The retention figures and drift ceilings vary by jurisdiction; the authoritative per-region values are enumerated in the FCC Part 79 compliance checklist and the Ofcom Code on subtitling standards. The pass floor and the penalty weights are pipeline policy, not regulation — embed their hash in every report so the verdict stays reproducible after the policy changes.
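To make the reproducibility claim concrete, the sketch below checks whether a candidate threshold set is the one a stored report was scored under. ComplianceThreshold and threshold_hash are repeated from step 4 so the block runs on its own; matches_report is a hypothetical helper. Note that a hash can only confirm a candidate set, it cannot recover the original values, so the set itself has to be archived too.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class ComplianceThreshold:
    cps_warning: float = 20.0
    cps_fail: float = 30.0
    drift_tolerance_ms: float = 100.0
    line_length_limit: int = 32
    pass_floor: float = 85.0


def threshold_hash(t: ComplianceThreshold) -> str:
    """Same hashing scheme as step 4: canonical JSON, SHA-256, first 16 hex chars."""
    return hashlib.sha256(json.dumps(asdict(t), sort_keys=True).encode()).hexdigest()[:16]


def matches_report(envelope: dict[str, Any], candidate: ComplianceThreshold) -> bool:
    """True if the candidate rule set hashes to the value stamped in the report."""
    return envelope.get("threshold_hash") == threshold_hash(candidate)
```

An auditor would load the archived rule set, call matches_report against the stored envelope, and only then recompute the score.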

Verification & test pattern

Validate the scorer and the run key against fixtures whose expected verdict you can compute by hand. The first test pins a deterministic FAIL, the second a clean PASS, and the third checks that the run key does not depend on the order of asset IDs in the manifest.

def _row(asset_id, max_cps, max_drift_ms, over_length, cue_count=10):
    return pd.Series({"asset_id": asset_id, "cue_count": cue_count,
                      "max_cps": max_cps, "max_drift_ms": max_drift_ms,
                      "over_length": over_length})

def test_score_drift_failure_is_deterministic():
    t = ComplianceThreshold()
    # 220 ms drift > 100 ms budget -> -20; clean otherwise -> 80 -> below 85 floor
    result = score_asset(_row("A1", max_cps=12.0, max_drift_ms=220.0, over_length=0), t)
    assert result["status"] == "FAIL"
    assert result["compliance_score"] == 80.0
    assert result["violations"][0]["clause"] == "FCC 47 CFR § 79.1 (synchronicity)"

def test_clean_asset_passes():
    t = ComplianceThreshold()
    result = score_asset(_row("A2", max_cps=15.0, max_drift_ms=40.0, over_length=0), t)
    assert result["status"] == "PASS" and result["compliance_score"] == 100.0

def test_run_key_is_order_independent():
    # load_manifest sorts asset_ids before building the manifest; mirror that here
    m1 = RunManifest("r1", tuple(sorted(("b", "a"))), 0, 1000)
    m2 = RunManifest("r1", ("a", "b"), 0, 1000)
    assert run_key(m1) == run_key(m2)

Wire these into the same suite that gates the build so a scoring regression fails CI before it ever reaches production reporting; the fixture pattern slots into the pytest setup described in CI/CD gating for caption builds.

Troubleshooting / failure modes

Duplicate reports on scheduler retry. Airflow/EventBridge re-fire after a transient worker crash, producing two reports for one window. Fix: gate every run behind the run_key existence check (step 6); the second run becomes a no-op.

Window straddle double-counts cues. A cue whose timestamp lands on the window boundary is counted in both the closing and opening windows. Fix: use a half-open interval (ts >= start AND ts < end) and align window_start_ms to the ingest boundary with croniter, never to job-start wall-clock.

Partial-window false pass. The report runs before a slow validator commits its rows, so the aggregate scores an incomplete window as a pass. Fix: require a per-asset completion sentinel (expected cue count or a “validation_complete” marker) before scoring; otherwise defer the run.

Irreproducible score after a policy change. A report scored under last quarter’s weights cannot be recomputed once the threshold file is edited. Fix: embed threshold_hash and engine_version in the envelope (step 5) so any verdict is reproducible against the rules live at airtime.

OOM on a large ingest window. A multi-hour batch pulls millions of cue rows into one DataFrame. Fix: chunk fetch_verdicts by asset and aggregate incrementally; the per-asset summary is tiny, so only the cue rows of one asset need to be resident.

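Silent NaN scores. An asset with zero matched cue rows never appears in the groupby output, and an asset whose metric columns are all null aggregates to NaN, which compares False against every threshold and therefore scores as a clean pass. Fix: treat the df.empty guard in step 3 as covering only the whole-window case, detect missing or all-null assets explicitly, and emit a NO_DATA status rather than a numeric score.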
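For the last failure mode, a minimal sketch of the NO_DATA handling is to reconcile the scored list against the manifest's asset IDs after scoring. The with_no_data helper and the shape of the NO_DATA entry are assumptions that mirror the dictionaries returned by score_asset.

```python
from typing import Any


def with_no_data(scored: list[dict[str, Any]], asset_ids: list[str]) -> list[dict[str, Any]]:
    """Append an explicit NO_DATA entry for every requested asset that has no score."""
    seen = {a["asset_id"] for a in scored}
    missing = [
        # No numeric score: None serializes to JSON null instead of a misleading 100.0.
        {"asset_id": asset_id, "compliance_score": None, "status": "NO_DATA", "violations": []}
        for asset_id in asset_ids
        if asset_id not in seen
    ]
    return scored + missing
```

The top-level verdict in step 5 only looks for FAIL, so a caller using this helper would also need to treat NO_DATA as a non-pass when deriving the run verdict.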

Operational notes

Memory must scale with asset count, not cue count. The per-cue rows are transient inside aggregate_by_asset; once grouped, each asset is a single row, so a window of thousands of assets still fits comfortably if the cue rows are streamed in per-asset chunks rather than materialized whole. For batch reporting across thousands of assets, drive the job with async batch caption processing and cap worker-pool concurrency so reads against the shared verdict store do not exhaust the connection pool — the pool_size=4 in step 2 is a per-worker budget, not a global one. Between asset batches, drop the intermediate DataFrames and let the allocator reclaim them; explicit del of the cue frame after aggregation keeps peak resident memory flat across a long run. Finally, treat the JSON envelope as the contract every downstream consumer reads: the CI gate parses verdict, the warehouse loads the Parquet copy, and the WORM archive stores the hashed envelope, so a single deterministic write feeds the build gate, analytics, and the audit trail at once.
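The memory argument can be illustrated without pandas: if cue rows arrive as a stream, a running fold keeps only one small summary per asset resident. The sketch below assumes rows are dictionaries with the same column names as the step 2 query; AssetSummary and aggregate_stream are hypothetical names, and a real job would feed it from a chunked or server-side cursor.

```python
from collections.abc import Iterable
from dataclasses import dataclass
from typing import Any


@dataclass
class AssetSummary:
    cue_count: int = 0
    max_cps: float = 0.0
    max_drift_ms: float = 0.0
    over_length: int = 0


def aggregate_stream(
    rows: Iterable[dict[str, Any]], line_length_limit: int = 32
) -> dict[str, AssetSummary]:
    """Fold per-cue rows into per-asset summaries without retaining the cue rows."""
    summaries: dict[str, AssetSummary] = {}
    for row in rows:
        s = summaries.setdefault(row["asset_id"], AssetSummary())
        s.cue_count += 1
        s.max_cps = max(s.max_cps, row["cps"])
        # Drift is signed in the store; the report scores its magnitude.
        s.max_drift_ms = max(s.max_drift_ms, abs(row["sync_drift_ms"]))
        if row["line_length"] > line_length_limit:
            s.over_length += 1
    return summaries
```

Each row is touched once and then discarded, so peak memory follows the number of assets in the window, which is the property the paragraph above asks for.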

Frequently Asked Questions

Why not just use cron directly? A bare cron entry gives at-most-once semantics with no retry, no manifest, and no idempotency. Production schedulers (Airflow, Kubernetes CronJob, EventBridge) retry on failure, which is why the job itself must dedupe on a deterministic run key rather than trusting the scheduler.

What goes in the report so it survives an audit? The asset verdicts, the half-open window, a hash of the active threshold set, and the engine version. Together they let an engineer recompute the exact score months later against the rules that were live at airtime.

JSON, Parquet, or CSV — which is canonical? The JSON envelope is canonical and feeds the CI gate; Parquet is a columnar copy for the warehouse and CSV is a convenience export for vendors. Scoring is decoupled from rendering, so all three derive from one scored frame.

How long must reports be retained? Broadcast-ready assets are typically held ≥ 90 days and compliance artifacts ≥ 12 months, but the binding figure is jurisdictional — confirm against the FCC Part 79 checklist and the Ofcom code for the target market.

Automated sync drift detection — the frame-accurate drift verdicts this report aggregates.
Enforcing character rate limits in QC — the CPS verdicts that feed the composite score.
CI/CD gating for caption builds — the stateless gate that consumes the report’s JSON verdict.
Generating daily QC reports with Python — PDF/CSV rendering and per-shift rollups for the human-facing summary.
Async batch caption processing — the concurrency model for driving the report across thousands of assets.

Part of: Automated QC Validation & Reporting — the broadcast caption QC reference.
