Line 1 shows how BPE segments the full sequence. Line 2 zooms into the step right after
<ACGTA>: the same DNA prefix can be extended by five different vocabulary entries, but only
one matches the original segmentation. Click a candidate to see how teacher forcing scores it.
Actual tokenization
Equivalent tokenization options
Pick a row: BPE teacher forcing credits only <TCG> as correct, even though all five candidates
are exact prefixes of the next nine bases, TCGTATAGG.
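
The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer used in the figure: the vocabulary VOCAB is a hypothetical toy list chosen so that exactly five entries match, and the helper names valid_next_tokens and teacher_forcing_credit are made up for this example. Only the nine bases TCGTATAGG and the reference token TCG come from the text.

```python
# Hypothetical toy vocabulary (an assumption; the real BPE vocabulary is not given).
VOCAB = ["A", "C", "G", "T", "TC", "TCG", "TCGT", "TCGTATAGG", "ACGTA", "GG", "TA"]


def valid_next_tokens(remaining, vocab):
    """Return every vocabulary entry that is an exact prefix of the remaining bases."""
    return sorted((tok for tok in vocab if remaining.startswith(tok)), key=len)


def teacher_forcing_credit(candidates, target):
    """Mark a candidate as correct only if it equals the reference token."""
    return {tok: tok == target for tok in candidates}


remaining = "TCGTATAGG"  # the next nine bases, from the text
target = "TCG"           # the token in the original BPE segmentation, from the text

candidates = valid_next_tokens(remaining, VOCAB)
credit = teacher_forcing_credit(candidates, target)

for tok in candidates:
    label = "credited" if credit[tok] else "penalized"
    print(f"{tok:<10} matches the sequence, {label} by teacher forcing")
```

Every candidate returned by valid_next_tokens reproduces the upcoming bases without error, yet only the entry equal to the reference token receives credit, which is the mismatch the figure highlights.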