
Priming: Multi-Stage Pretraining using Formal Languages with Ascending Complexity (2024)

Undergraduate: Tianyi Niu


Faculty Advisor: Shashank Srivastava
Department: Computer Science


This work proposes a novel three-stage approach to training language models in which the dataset used at each stage increases in grammatical complexity, as measured by the Chomsky hierarchy. Building on previous research in distilling inductive biases and non-linguistic pretraining, we propose an additional pretraining stage, called priming, in which the model is trained on a small synthetic dataset generated from a specific formal language. The model is then pretrained on a larger and more complex non-linguistic dataset, and finally finetuned on a natural language dataset. We investigate two main questions: (1) whether priming leads to stronger inductive biases, and (2) whether priming allows for more efficient pretraining. This study examines eleven priming datasets generated from formal languages of varying complexity. We find that priming does not enable more efficient pretraining; however, it does increase performance when the model is pretrained for a sufficient number of steps. Moreover, we find that priming languages that (1) do not involve token-to-token dependencies and (2) consist of simple structural patterns lead to better performance.
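To make the three-stage pipeline concrete, the sketch below shows one way the stages could be arranged: a small priming corpus is sampled from a formal language (a regular pattern with no token-to-token dependencies, and a context-free language a^n b^n as a more complex alternative), followed by pretraining on a larger non-linguistic corpus and finetuning on natural language. This is not the authors' code; the `train` function, step counts, and dataset sizes are hypothetical placeholders.

```python
import random

def regular_sample(max_len=64):
    """Regular language: repetitions of the pattern 'ab'.
    No long-range token-to-token dependencies."""
    n = random.randint(1, max_len // 2)
    return "ab" * n

def context_free_sample(max_n=32):
    """Context-free language a^n b^n: the second half must
    match the count of the first half."""
    n = random.randint(1, max_n)
    return "a" * n + "b" * n

def make_priming_dataset(sampler, size=10_000):
    """Small synthetic corpus drawn from one formal language."""
    return [sampler() for _ in range(size)]

def train(model, dataset, steps):
    """Placeholder for a standard language-model training loop."""
    raise NotImplementedError

def three_stage_training(model, pretrain_corpus, natural_language_corpus):
    # Stage 1: priming on a small formal-language dataset.
    priming_data = make_priming_dataset(context_free_sample)
    train(model, priming_data, steps=1_000)

    # Stage 2: pretraining on a larger, more complex non-linguistic dataset.
    train(model, pretrain_corpus, steps=100_000)

    # Stage 3: finetuning on a natural language dataset.
    train(model, natural_language_corpus, steps=10_000)
    return model
```

Swapping `context_free_sample` for `regular_sample` (or other generators) is how one would vary the grammatical complexity of the priming stage across experiments.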