Grokking Machine Learning


There have also been promising results in predicting grokking before it happens. Though some approaches require knowledge of the generalizing solution or the overall data domain, others rely solely on analysis of the training loss and might also apply to larger models. Hopefully we'll be able to build tools and techniques that can tell us when a model is parroting memorized information and when it's using richer models.

Understanding the solution to modular addition wasn't trivial. Do we have any hope of understanding larger models? One route forward, like our digression into the 20-parameter model and the even simpler boolean parity problem, is to: 1) train simpler models with more inductive biases and fewer moving parts, 2) use them to explain inscrutable parts of how a larger model works, and 3) repeat as needed. We believe this could be a fruitful approach to better understanding larger models, complementary to efforts that aim to use larger models to explain smaller ones and to other work on disentangling internal representations. Moreover, this kind of mechanistic approach to interpretability may, in time, help identify patterns that themselves ease or automate the uncovering of algorithms learned by neural networks.

In this article we'll examine the training dynamics of a tiny model and reverse engineer the solution it finds, and in the process provide an illustration of the exciting emerging field of mechanistic interpretability. While it isn't yet clear how to apply these techniques to today's largest models, starting small makes it easier to develop intuitions as we progress towards answering these critical questions about large language models.

Grokking Modular Addition

Recall that our modular arithmetic problem $a + b \bmod 67$ is naturally periodic, with answers wrapping around if the sum ever passes 67. Mathematically, this can be mirrored by thinking of the sum as wrapping $a$ and $b$ around a circle (see the short sketch below). The weights of the generalizing model also had periodic patterns, indicating that the solution might use this property.

Does grokking happen in larger models trained on real-world tasks? Earlier observations reported the grokking phenomenon on algorithmic tasks in small transformers and MLPs. Grokking has subsequently been found in more complex tasks involving images, text, and tabular data within certain ranges of hyperparameters. It's also possible that the largest models, which are able to do many types of tasks, may be grokking many things at different speeds during training.
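To make the circular picture concrete, here is a minimal sketch in plain Python with NumPy. The modulus 67 comes from the text; the helper names embed and add_on_circle are illustrative choices, not anything from the original article.

```python
import numpy as np

P = 67  # modulus from the modular addition task

def embed(x):
    """Place residue x on the unit circle at angle 2*pi*x / P."""
    angle = 2 * np.pi * x / P
    return np.array([np.cos(angle), np.sin(angle)])

def add_on_circle(a, b):
    """Recover (a + b) mod P by adding the two angles and mapping
    the combined angle back to a residue."""
    angle_a = np.arctan2(embed(a)[1], embed(a)[0])
    angle_b = np.arctan2(embed(b)[1], embed(b)[0])
    return int(round((angle_a + angle_b) * P / (2 * np.pi))) % P

# Every pair wraps around exactly like a + b mod 67.
assert all(add_on_circle(a, b) == (a + b) % P
           for a in range(P) for b in range(P))
```

Adding angles on the circle is exactly addition modulo 67, which is the structure that the periodic weight patterns suggest the generalizing model exploits.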

We observed that the generalizing solution is sparse after taking the discrete Fourier transform, but the collapsed matrices have high norms. This suggests that direct weight decay on $\mathbf{W}_{\text{output}}$ and $\mathbf{W}_{\text{input}}$ doesn't provide the right inductive bias for the task.

The periodic patterns suggest the model is learning some sort of mathematical structure; the fact that they appear when the model starts to solve the test examples hints that they are related to the model generalizing. But why does the model move away from the memorizing solution? And what is the generalizing solution?

Generalizing With 1s and 0s

We can induce memorization and generalization on this somewhat contrived 1s-and-0s task, but why does it happen with modular addition? Let's first understand a little more about how a one-layer MLP can solve modular addition by constructing a generalizing solution that's interpretable (see the sketch below).

Modular Addition With Five Neurons

While we now have a solid understanding of the mechanisms a one-layer MLP uses to solve modular addition and why they emerge during training, there are still many interesting open questions about memorization and generalization.

Which Model Constraints Work Best?
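To illustrate what an interpretable generalizing solution can look like, here is a hedged NumPy sketch built from the angle-addition identity. The modulus 67 is from the text; the frequency k and all names are arbitrary choices of mine, and this is a hand-written construction using the circular idea, not necessarily the article's exact five-neuron network.

```python
import numpy as np

P = 67                        # modulus from the text
k = 5                         # arbitrary nonzero frequency; any 1..66 works since 67 is prime
omega = 2 * np.pi * k / P

# "Embeddings": residue x becomes (cos(omega*x), sin(omega*x)).
cos_x = np.cos(omega * np.arange(P))
sin_x = np.sin(omega * np.arange(P))

def logits(a, b):
    """cos(omega*(a + b - c)) for every candidate answer c, expanded with
    the angle-addition identities so it uses only products of the input
    and output embeddings."""
    cos_ab = cos_x[a] * cos_x[b] - sin_x[a] * sin_x[b]   # cos(omega*(a+b))
    sin_ab = sin_x[a] * cos_x[b] + cos_x[a] * sin_x[b]   # sin(omega*(a+b))
    return cos_ab * cos_x + sin_ab * sin_x               # peaks where c = (a+b) mod P

# The largest logit is always the correct answer.
assert all(int(np.argmax(logits(a, b))) == (a + b) % P
           for a in range(P) for b in range(P))

# This solution is also sparse in Fourier space: the DFT of the cosine
# embedding has power only at frequencies k and P - k.
power = np.abs(np.fft.fft(cos_x))
print(np.flatnonzero(power > 1e-6))   # indices k and P - k, i.e. 5 and 62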

Do more complex models also suddenly generalize after they're trained longer? Large language models can certainly seem like they have a rich understanding of the world, but they might just be regurgitating memorized bits of the enormous amount of text they've been trained on. How can we tell if they're generalizing or memorizing?

The model's architecture is similarly simple:

$$\text{ReLU}\left(\mathbf{a}_{\text{one-hot}} \mathbf{W}_{\text{input}} + \mathbf{b}_{\text{one-hot}} \mathbf{W}_{\text{input}}\right) \mathbf{W}_{\text{output}}$$

Directly training the model visualized above does not actually result in generalization on modular arithmetic, even with the addition of weight decay. At least one of the matrices has to be factored.
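For reference, here is a minimal NumPy sketch of the unfactored architecture written out above. The hidden width, initialization scale, and seed are arbitrary illustrative choices; per the text, this is the form that does not generalize when trained directly, even with weight decay.

```python
import numpy as np

P = 67            # modulus / vocabulary size, from the text
D_HIDDEN = 48     # hidden width; an arbitrary choice for illustration

rng = np.random.default_rng(0)
W_input = rng.normal(scale=0.1, size=(P, D_HIDDEN))    # shared embedding for a and b
W_output = rng.normal(scale=0.1, size=(D_HIDDEN, P))

def forward(a, b):
    """ReLU(a_one-hot @ W_input + b_one-hot @ W_input) @ W_output."""
    a_onehot = np.eye(P)[a]
    b_onehot = np.eye(P)[b]
    hidden = np.maximum(0.0, a_onehot @ W_input + b_onehot @ W_input)
    return hidden @ W_output        # logits over the P possible answers

print(forward(3, 65).shape)         # (67,) -- untrained, so the values are meaningless
# Per the text, training this unfactored form directly (even with weight decay)
# does not generalize; at least one of W_input / W_output has to be factored
# into a product of matrices.
```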

Figuring out both of these questions simultaneously is hard. Let's make an even simpler task, one where we know what the generalizing solution should look like, and try to understand why the model eventually learns it.

Credits

Thanks to Ardavan Saeedi, Crystal Qian, Emily Reif, Fernanda Viégas, Kathy Meier-Hellstern, Mahima Pushkarna, Minsuk Chang, Neel Nanda and Ryan Mullins for their help with this piece.
