Why not just use errors='ignore' to get past the UnicodeDecodeError?

errors='ignore' deletes the offending bytes and errors='replace' substitutes U+FFFD, so accented characters and proper nouns silently vanish. That drops the file below the 99% character-accuracy floor of FCC 47 CFR 79.1(j)(2). Decode strictly and fall back to CP1252 instead.

Why try CP1252 before ISO-8859-1?

ISO-8859-1 assigns a character to every one of the 256 byte values, so it decodes any file without error, but it maps the 0x80-0x9F range to unprintable C1 control codes and would silently mask real content. CP1252 maps that same range to the smart quotes and dashes that legacy SCC tools actually emit, so it must be tried first; latin-1 stays last as the catch-all.

My SCC file decodes fine but has a stray character at the start. Why?

A UTF-8 BOM (EF BB BF) is valid UTF-8, so the decode succeeds and leaves a U+FEFF glyph before the first timecode line. The parser then fails to match HH:MM:SS:FF and drops the opening cue. Strip the BOM before decoding.

Is the high bit on SCC text bytes an encoding problem?

No. The high bit on SCC text bytes is CEA-608 odd parity, not a UTF-8 lead byte. Do not adjust it at the encoding stage; mask it with & 0x7F only inside the CEA-608 state machine after the file has been decoded.

Fixing UTF-8 Encoding Errors in SCC Files

A Scenarist Closed Caption (SCC) file that raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 is almost never corrupt; it is a Windows-1252 payload being read by a strict UTF-8 decoder. Legacy non-linear editors and caption authoring tools export extended punctuation and accented letters (en-dash 0x96, curly quotes 0x91–0x94, e-acute 0xE9) as single CP1252 bytes. None of these is valid UTF-8 on its own: 0x91–0x96 are continuation bytes with no lead byte, and 0xE9 is a lead byte that expects two continuation bytes, so the decoder halts at the first one. The wrong fix, errors='ignore' or errors='replace', silently deletes or mangles those characters, and dropping an accented proper noun puts the file under the 99% character-accuracy floor that FCC Part 79 compliance requires for English-language captions. This page gives a single decode-and-normalize function that recovers the bytes losslessly before the first hex word reaches your parser.

Minimal working implementation

import logging
import pathlib
from charset_normalizer import from_bytes

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

# Codecs tried strictly, in order: utf-8 first (target), then cp1252, the legacy
# encoding that dominates SCC exports. cp1252 differs from latin-1 only in the
# 0x80-0x9F range, which it maps to smart quotes and en/em dashes where latin-1
# has C1 control codes.
_FALLBACK_CODECS = ("utf-8", "cp1252")
# latin-1 accepts every byte and can never raise, so it must run last of all;
# placed any earlier it would make the detector below unreachable.
_CATCH_ALL_CODEC = "iso-8859-1"


def decode_scc_bytes(raw: bytes, name: str = "<bytes>") -> tuple[str, str]:
    """Return (text, codec_used). Every decode uses errors='strict', so no byte
    is ever dropped or replaced; each failed codec is logged for the audit."""
    # Strip a UTF-8 BOM (0xEF 0xBB 0xBF) some Windows tools prepend; left in
    # place it is decoded as a stray U+FEFF that breaks the first SCC line.
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
        logging.info("Stripped UTF-8 BOM from %s", name)

    for codec in _FALLBACK_CODECS:
        try:
            return raw.decode(codec, errors="strict"), codec
        except UnicodeDecodeError as exc:
            logging.warning("%s: %s failed at byte %d (%s)",
                            name, codec, exc.start, exc.reason)

    # Statistical fallback, reached when cp1252 hits one of its five undefined
    # bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D). charset_normalizer (an alternative
    # to chardet) picks an encoding; str(best) is its decode of the payload.
    best = from_bytes(raw).best()
    if best is not None:
        text = str(best)
        logging.error("%s: fell back to detected %s (confidence %.2f); needs review",
                      name, best.encoding, 1 - best.chaos)
        return text, best.encoding
    # Nothing detected: latin-1 always decodes, but the text needs manual review.
    logging.error("%s: no encoding detected, decoded as %s; needs review",
                  name, _CATCH_ALL_CODEC)
    return raw.decode(_CATCH_ALL_CODEC, errors="strict"), _CATCH_ALL_CODEC


def normalize_scc_file(src: pathlib.Path, dst: pathlib.Path) -> str:
    """Read any legacy SCC encoding, write canonical UTF-8 (no BOM, LF endings).
    Returns the source codec for the audit manifest."""
    text, codec = decode_scc_bytes(src.read_bytes(), src.name)
    # CRLF -> LF so chunked stream decoders never split a hex word on \r\n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # newline="" prevents Python re-expanding \n to \r\n on Windows runners.
    dst.write_text(text, encoding="utf-8", newline="")
    logging.info("Normalized %s -> %s (source codec: %s)", src.name, dst.name, codec)
    return codec

Code walkthrough

decode_scc_bytes is the core. It treats decoding as an ordered series of strict attempts rather than a single forgiving one. The critical design rule is that errors='strict' is used on every codec: a strict decode either maps every byte to a character or raises, so it can never silently drop or substitute one. Strictness alone does not prove the codec was the right one (a wrong codec that happens to accept every byte still returns mojibake), which is why the order of attempts matters as much as the error mode. This is what keeps the function compliant with FCC 47 CFR § 79.1, which counts every dropped or substituted character against the accuracy score; errors='ignore' and errors='replace' both violate that by design, so they never appear here.
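To make the difference between the error modes concrete, here is a small sketch that runs all three against the same bytes. The sample string is invented for illustration; only the standard bytes.decode call is used.

```python
# Hypothetical sample: CP1252 bytes with e-acute (0xE9) and an en-dash (0x96).
raw = b"Caf\xe9 \x96 Montr\xe9al"

# errors='ignore' drops the three offending bytes without any signal.
ignored = raw.decode("utf-8", errors="ignore")

# errors='replace' marks the positions with U+FFFD but loses the characters.
replaced = raw.decode("utf-8", errors="replace")

# errors='strict' refuses, and reports where the first bad byte sits.
try:
    raw.decode("utf-8", errors="strict")
    strict_failure = None
except UnicodeDecodeError as exc:
    strict_failure = (exc.start, exc.reason)

# The strict CP1252 fallback recovers every character.
recovered = raw.decode("cp1252", errors="strict")

print(repr(ignored))
print(repr(replaced))
print(strict_failure)
print(recovered)
```

The strict UTF-8 attempt is the only one of the three that tells the caller something went wrong, which is what makes the CP1252 retry possible.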

The BOM strip runs first because the byte-order mark (0xEF 0xBB 0xBF) is structurally valid UTF-8 — decode('utf-8') would succeed and leave a U+FEFF glyph at the head of the file. That glyph then sits in front of the first SCC timecode line, so the parser’s line tokenizer fails to match the HH:MM:SS:FF pattern and silently drops the opening cue. Removing it before any decode attempt is the only reliable fix.
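The effect described above can be reproduced in a few lines. This is a sketch under assumptions: the TIMECODE pattern is a hypothetical stand-in for whatever the real tokenizer uses, and the sample line is invented.

```python
import codecs
import re

# Hypothetical tokenizer pattern for an SCC timecode at the start of a line.
TIMECODE = re.compile(r"\d{2}:\d{2}:\d{2}[:;]\d{2}")

# codecs.BOM_UTF8 is the three bytes EF BB BF.
raw = codecs.BOM_UTF8 + b"00:00:01:00\t9420 9420 94ae 94ae"

# Decoding succeeds, but U+FEFF is now the first character of the text.
with_bom = raw.decode("utf-8")
bom_survived = with_bom.startswith("\ufeff")
matched_with_bom = TIMECODE.match(with_bom) is not None

# Stripping the BOM bytes first lets the same pattern match.
clean = raw[len(codecs.BOM_UTF8):] if raw.startswith(codecs.BOM_UTF8) else raw
matched_clean = TIMECODE.match(clean.decode("utf-8")) is not None

print(bom_survived, matched_with_bom, matched_clean)
```

Stripping at the byte level, rather than relying on a BOM-aware codec, keeps the behaviour identical for every codec in the fallback chain.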

The fallback order — utf-8 → cp1252 → iso-8859-1 — is deliberate. UTF-8 is the target every modern pipeline wants. Windows-1252 is tried next because it is the real-world source for the 0x80–0x9F range (smart quotes, en/em dashes, ellipsis) that legacy captioning tools emit; ISO-8859-1/latin-1 maps that same range to unprintable C1 control codes, so trying latin-1 first would “succeed” while turning an en-dash into a control character. Latin-1 stays last as the catch-all that can decode any byte.
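The ordering argument can be checked directly. The sketch below decodes the single byte 0x96 both ways and also lists the bytes in the 0x80-0x9F range that Python's cp1252 codec refuses to decode; it uses only the standard library.

```python
import unicodedata

byte = b"\x96"
as_cp1252 = byte.decode("cp1252")    # U+2013
as_latin1 = byte.decode("latin-1")   # U+0096

cp1252_name = unicodedata.name(as_cp1252)          # Unicode name of the character
latin1_category = unicodedata.category(as_latin1)  # "Cc" means control character

# Bytes in 0x80-0x9F that Python's cp1252 codec rejects under errors='strict'.
undefined_in_cp1252 = []
for value in range(0x80, 0xA0):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(value)

print(cp1252_name, latin1_category, [hex(v) for v in undefined_in_cp1252])
```

Because a handful of byte values are undefined in CP1252, a strict CP1252 attempt can still fail, which is the case the later fallbacks exist for.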

charset_normalizer.from_bytes(raw).best() is the statistical floor for files that are neither clean UTF-8 nor clean CP1252, typically multi-vendor concatenations that mix encodings. It scores each candidate encoding by how much “chaos” (implausible character sequences) the decoded text contains; 1 - best.chaos is logged as a confidence so operators can route low-confidence results to manual review rather than trusting them blindly. It is a maintained alternative to chardet (requests adopted it as its default detector) and is the same detector used elsewhere in parsing SCC with Python libraries.

normalize_scc_file wraps the decode with two normalizations the rest of the pipeline depends on. Line endings collapse to LF so that streaming/chunked decoders downstream never split a 16-bit hex word across a \r\n boundary. The write uses encoding="utf-8" (not utf-8-sig, so no BOM is re-added) and newline="" so a Windows CI runner does not re-expand every \n back to \r\n; note that Path.write_text accepts the newline argument only on Python 3.10 and later. It returns the detected source codec so a batch run can emit a per-file audit manifest.
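The two write-side normalizations can be verified by reading the output back as bytes. This sketch uses the built-in open() instead of Path.write_text so that it also runs on interpreters older than 3.10; the helper name to_lf and the sample text are assumptions made for the example.

```python
import pathlib
import tempfile


def to_lf(text: str) -> str:
    """Collapse CRLF and bare CR to LF (CRLF first, so it never becomes two LFs)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Hypothetical input mixing all three line-ending styles.
mixed = "Scenarist_SCC V1.0\r\n\r\n00:00:01:00\t9420 9420\r00:00:02:00\t942c 942c\n"
normalized = to_lf(mixed)

with tempfile.TemporaryDirectory() as tmp:
    dst = pathlib.Path(tmp) / "out.scc"
    # newline="" disables newline translation; "utf-8" (not "utf-8-sig") adds no BOM.
    with open(dst, "w", encoding="utf-8", newline="") as fh:
        fh.write(normalized)
    written = dst.read_bytes()

print(written.count(b"\r"), written.count(b"\n"))
```

Checking the raw bytes rather than the re-read text matters here, because reading in text mode would hide any \r\n that the platform had reintroduced.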

Threshold reference table

Constraint	Value	Source
Character accuracy floor (English captions)	99%	FCC 47 CFR § 79.1(j)(2)
UTF-8 BOM byte sequence	`EF BB BF`	Unicode 15.0 §23.8
CP1252 smart-quote / dash range	`0x80`–`0x9F`	WHATWG Encoding §5
Canonical SCC line ending	LF (`0x0A`)	pipeline convention
Manual-review confidence threshold	chaos > 0.10 (confidence < 0.90)	operational
Fallback-rate alert threshold	> 2% over rolling 24 h	operational
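The two operational thresholds in the table could be enforced along the following lines. This is a minimal sketch, not part of the function above: the names needs_manual_review and FallbackRateMonitor are hypothetical, and it assumes that "fallback" means any file whose source codec was not utf-8.

```python
from collections import deque
from datetime import datetime, timedelta

CHAOS_REVIEW_THRESHOLD = 0.10   # chaos > 0.10, i.e. confidence < 0.90
FALLBACK_ALERT_RATE = 0.02      # > 2% of files
WINDOW = timedelta(hours=24)    # rolling window


def needs_manual_review(chaos: float) -> bool:
    """True when a detector result is too uncertain to trust unreviewed."""
    return chaos > CHAOS_REVIEW_THRESHOLD


class FallbackRateMonitor:
    """Tracks the share of files in the last 24 h that were not clean UTF-8."""

    def __init__(self) -> None:
        self._events: deque[tuple[datetime, bool]] = deque()

    def record(self, when: datetime, source_codec: str) -> None:
        # Assumption: any codec other than utf-8 counts as a fallback.
        self._events.append((when, source_codec != "utf-8"))
        cutoff = when - WINDOW
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def rate(self) -> float:
        if not self._events:
            return 0.0
        return sum(fallback for _, fallback in self._events) / len(self._events)

    def should_alert(self) -> bool:
        return self.rate() > FALLBACK_ALERT_RATE


# Usage: 97 clean files and 3 CP1252 files inside one window.
monitor = FallbackRateMonitor()
start = datetime(2024, 1, 1)
for minute in range(97):
    monitor.record(start + timedelta(minutes=minute), "utf-8")
for minute in range(97, 100):
    monitor.record(start + timedelta(minutes=minute), "cp1252")
print(round(monitor.rate(), 2), monitor.should_alert())
```

The monitor assumes records arrive in timestamp order, which holds for a single batch runner but would need a sort or a heap if several workers report out of order.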

Edge cases & known gotchas

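Mixed-encoding concatenation: files stitched from multiple vendors can hold a UTF-8 segment followed by a CP1252 segment. UTF-8 rejects such a file at the first CP1252 byte, but CP1252 usually accepts it, because most UTF-8 multi-byte sequences are also legal CP1252 bytes and simply come out as mojibake (é becomes Ã©). Only a sequence containing one of CP1252's five undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) makes the CP1252 attempt fail and pushes the file further down the fallback chain, and a detector picks one encoding for the whole buffer anyway. Scan CP1252 results for telltale pairs such as Ã© or â€ and flag every hit for review.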
Latin-1 always “succeeds”: because ISO-8859-1 maps all 256 byte values, putting it before CP1252 would mask real Windows-1252 content by decoding 0x96 to a C1 control code instead of an en-dash. Keep it last.
Double BOM: a file run through a buggy converter can carry EF BB BF EF BB BF; the single-strip guard leaves the second BOM in place. Strip in a while loop if your sources are known offenders.
Replacement characters already baked in: if an upstream tool already wrote U+FFFD (�), no decoder can recover the original byte — scan post-normalization and reject the file rather than passing the lossy text on.
Parity vs encoding confusion: the high bit set on SCC text bytes is CEA-608 odd parity, not a UTF-8 lead byte. Do not “fix” parity at the encoding stage; mask it (& 0x7F) only inside the CEA-608 state machine, after decoding.
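Three of the guards above are short enough to sketch. The function names are hypothetical and none of this comes from the implementation earlier on the page; the parity helper covers only the basic letters whose CEA-608 codes coincide with ASCII, and it belongs in the state machine, not in the encoding stage.

```python
BOM = b"\xef\xbb\xbf"


def strip_all_boms(raw: bytes) -> bytes:
    """Remove every leading UTF-8 BOM, not just the first one."""
    while raw.startswith(BOM):
        raw = raw[len(BOM):]
    return raw


def reject_replacement_chars(text: str, name: str = "<text>") -> str:
    """Raise if an upstream tool already baked U+FFFD into the text."""
    position = text.find("\ufffd")
    if position != -1:
        raise ValueError(f"{name}: U+FFFD at offset {position}, original byte is lost")
    return text


def strip_parity(byte: int) -> int:
    """Check CEA-608 odd parity on one byte, then clear the high bit."""
    if bin(byte).count("1") % 2 != 1:
        raise ValueError(f"parity error in byte 0x{byte:02x}")
    return byte & 0x7F


def decode_hex_word(word: str) -> str:
    """Turn one 4-digit SCC hex word into two characters (ASCII-range only)."""
    high, low = bytes.fromhex(word)
    return chr(strip_parity(high)) + chr(strip_parity(low))


print(strip_all_boms(BOM + BOM + b"00:00:01:00"))
print(decode_hex_word("c8e5"))
```

Note that decode_hex_word operates on the hex text after the file has been decoded, so the parity bit never interacts with the codec choice.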

Integration hook

This function is the pre-ingest gate that runs before tokenization in parsing SCC with Python libraries: point its tokenizer at the normalized UTF-8 output of normalize_scc_file so the CEA-608 state machine never sees a raw CP1252 byte or a leading BOM. At archive scale, call it inside the worker pool described in async batch caption processing, and stream the returned source-codec values into the manifest consumed by scheduled QC report generation so encoding fallbacks are auditable.
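One way the batch wiring might look is sketched below. It is an illustration under assumptions: normalize_batch and the manifest field names are hypothetical, a plain thread pool stands in for the worker pool, and the normalizer is passed in as a callable with the same (src, dst) -> codec shape as normalize_scc_file. The demo uses a simplified stand-in normalizer so the block runs on its own.

```python
import pathlib
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Normalizer = Callable[[pathlib.Path, pathlib.Path], str]


def normalize_batch(sources: list[pathlib.Path], out_dir: pathlib.Path,
                    normalize: Normalizer, max_workers: int = 4) -> list[dict]:
    """Normalize each file and return one manifest row per source, in order."""
    out_dir.mkdir(parents=True, exist_ok=True)

    def _one(src: pathlib.Path) -> dict:
        try:
            codec = normalize(src, out_dir / src.name)
            return {"file": src.name, "source_codec": codec,
                    "fallback": codec != "utf-8", "error": None}
        except Exception as exc:  # keep the batch running, record the failure
            return {"file": src.name, "source_codec": None,
                    "fallback": None, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_one, sources))  # map() preserves input order


def _stand_in_normalize(src: pathlib.Path, dst: pathlib.Path) -> str:
    """Simplified stand-in for normalize_scc_file, for the demo only."""
    raw = src.read_bytes()
    try:
        text, codec = raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        text, codec = raw.decode("cp1252"), "cp1252"
    dst.write_bytes(text.encode("utf-8"))
    return codec


with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "a.scc").write_bytes(b"plain ascii")
    (root / "b.scc").write_bytes(b"caf\xe9")
    manifest = normalize_batch([root / "a.scc", root / "b.scc"],
                               root / "out", _stand_in_normalize)

for row in manifest:
    print(row)
```

Recording failures as rows instead of letting them propagate keeps one unreadable file from hiding the codec results for the rest of the batch.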

Parsing SCC with Python libraries — the CEA-608 tokenizer and state machine that consumes the clean UTF-8 this page produces.
SRT timestamp normalization — the sibling cleanup step that frame-quantizes timestamps once text is decoded.
SCC vs SRT vs WebVTT architecture — why SCC’s legacy byte assumptions differ from the UTF-8-native web formats.

Part of: SRT, SCC & WebVTT Parsing Workflows — the broadcast caption parsing reference.