Large language models (LLMs) are rapidly increasing in parameter count, but this growth is not matched by the availability of high-quality data. This discrepancy raises concerns about the sustainability of current approaches to language model improvement, especially as forecasts suggest a potential data shortage by the end of the decade. This study investigates the impact of different tokenization strategies on the performance of small transformer models (around 10 million parameters), evaluating three prominent subword tokenization methods: Byte-Pair Encoding (BPE), WordPiece, and SentencePiece. Additionally, we examine the trade-off between vocabulary size and embedding size and measure these factors' effects on language understanding and model efficiency within the BabyLM pipeline, using BLiMP and SuperGLUE scores. Our findings indicate that while the choice of tokenization strategy has minimal impact on model performance, the trade-off between vocabulary size and embedding size significantly affects both language understanding and efficiency. Increasing the vocabulary size beyond a certain threshold does not appear to enhance language understanding. This research improves our understanding of how tokenization influences the language modeling process, specifically within the context of small language models.
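As a rough illustration of the comparison the abstract describes, the sketch below trains the three tokenizer types at a fixed vocabulary size using the Hugging Face `tokenizers` library. The corpus file `train.txt`, the vocabulary size of 8,000, and the special tokens are illustrative assumptions, not settings reported in the paper; SentencePiece is approximated here by a Unigram model, one of the algorithms the SentencePiece toolkit implements.

```python
# Sketch: training BPE, WordPiece, and Unigram (SentencePiece-style) tokenizers
# at a fixed vocabulary size. File name, vocab size, and special tokens are
# assumptions for illustration, not details taken from the paper.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

VOCAB_SIZE = 8000                      # assumed; the study varies this factor
FILES = ["train.txt"]                  # assumed training corpus
SPECIALS = ["[UNK]", "[PAD]", "[BOS]", "[EOS]"]

def train_tokenizer(kind: str) -> Tokenizer:
    """Build and train one of the three subword tokenizers being compared."""
    if kind == "bpe":
        tok = Tokenizer(models.BPE(unk_token="[UNK]"))
        trainer = trainers.BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=SPECIALS)
    elif kind == "wordpiece":
        tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
        trainer = trainers.WordPieceTrainer(vocab_size=VOCAB_SIZE, special_tokens=SPECIALS)
    else:  # "unigram": the algorithm SentencePiece most commonly uses
        tok = Tokenizer(models.Unigram())
        trainer = trainers.UnigramTrainer(
            vocab_size=VOCAB_SIZE, special_tokens=SPECIALS, unk_token="[UNK]"
        )
    tok.pre_tokenizer = pre_tokenizers.Whitespace()  # simple whitespace pre-split
    tok.train(files=FILES, trainer=trainer)
    return tok

for kind in ("bpe", "wordpiece", "unigram"):
    tokenizer = train_tokenizer(kind)
    print(kind, tokenizer.get_vocab_size())
```

In the setting the abstract describes, the vocabulary size would be the swept variable, with the embedding size adjusted so the overall parameter budget stays near 10 million.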