
Commit fecfdd1

Add simpler BPE, and make previous BPE better (#870)

* Add simpler BPE, and make previous BPE better
* update
* Update README.md

1 parent 1164cb3 · commit fecfdd1

File tree: 6 files changed, +1222 −121 lines


.gitignore

Lines changed: 2 additions & 0 deletions

@@ -85,6 +85,8 @@ Qwen3-0.6B/
 tokenizer-base.json
 tokenizer-reasoning.json
 tokenizer.json
+config.json
+bpe_merges.txt

 # Datasets
 the-verdict.txt

README.md

Lines changed: 1 addition & 1 deletion

@@ -158,7 +158,7 @@ Several folders contain optional materials as a bonus for interested readers:
   - [Installing Python Packages and Libraries Used In This Book](setup/02_installing-python-libraries)
   - [Docker Environment Setup Guide](setup/03_optional-docker-environment)
 - **Chapter 2: Working with text data**
-  - [Byte Pair Encoding (BPE) Tokenizer From Scratch](ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb)
+  - [Byte Pair Encoding (BPE) Tokenizer From Scratch](ch02/05_bpe-from-scratch/bpe-from-scratch-simple.ipynb)
   - [Comparing Various Byte Pair Encoding (BPE) Implementations](ch02/02_bonus_bytepair-encoder)
   - [Understanding the Difference Between Embedding Layers and Linear Layers](ch02/03_bonus_embedding-vs-matmul)
   - [Dataloader Intuition with Simple Numbers](ch02/04_bonus_dataloader-intuition)

ch02/05_bpe-from-scratch/README.md

Lines changed: 3 additions & 1 deletion

@@ -1,3 +1,5 @@
 # Byte Pair Encoding (BPE) Tokenizer From Scratch

-- [bpe-from-scratch.ipynb](bpe-from-scratch.ipynb) contains optional (bonus) code that explains and shows how the BPE tokenizer works under the hood.
+- [bpe-from-scratch-simple.ipynb](bpe-from-scratch-simple.ipynb) contains optional (bonus) code that explains and shows how the BPE tokenizer works under the hood; this version is geared toward simplicity and readability.
+
+- [bpe-from-scratch.ipynb](bpe-from-scratch.ipynb) implements a more sophisticated (and much more complicated) BPE tokenizer that behaves similarly to tiktoken with respect to all the edge cases; it also has additional functionality for loading the official GPT-2 vocab.
