Related Work: Scaling Laws and Hopfield Models in LLM Research
Table of Links
Abstract and 1 Introduction
2 Related Work
3 Model and 3.1 Data Points
3.2 Transformers
4 A New Energy Function
4.1 The Class of Patterns
5 Cross-Entropy Loss
6 Experimental Results and 6.1 Experimental Evaluation of the Radius
6.2 GPT-2 Training
6.3 Vanilla Transformers Training
7 Conclusion and Acknowledgments
Appendix A. Deferred Tables
Appendix B. Some Properties of the Energy Functions
Appendix C. Deferred Proofs from Section 5
Appendix D. Transformer Details: Using GPT-2 as an Example
References
Scaling laws. As discussed in the introduction, there is substantial empirical evidence that model performance improves as the model size and the amount of training data grow (Kaplan et al., 2020; Khandelwal et al.). Extensive experiments have also been conducted to explore neural scaling laws in various settings, including constraints on the compute budget (Hoffmann et al., 2022a). These analyses rely on a decomposition of the expected risk, which leads to the following fit:
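In the standard form used by Hoffmann et al. (2022a), the decomposition reads

L(N, D) = E + A / N^α + B / D^β,

where N is the number of model parameters, D is the number of training tokens, E is the irreducible loss of the data distribution, and the two remaining terms account for the errors due to limited model capacity and limited training data.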
For Chinchilla models, the fitted parameters are reported in Hoffmann et al. (2022a).
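The commonly quoted values of that fit are E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34 and β ≈ 0.28. As a quick illustration of how the decomposition is used, here is a minimal Python sketch that evaluates it; the function name and the example parameter/token counts are illustrative choices, not values taken from the paper discussed here:

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Expected-risk decomposition L(N, D) = E + A / N^alpha + B / D^beta.

    E is the irreducible loss of the data distribution; the other two terms
    are the errors due to finite model size N and finite data size D.
    The defaults are the fitted constants reported by Hoffmann et al. (2022a).
    """
    return E + A / n_params ** alpha + B / n_tokens ** beta


# Example: roughly Chinchilla's own setting, ~70B parameters and ~1.4T tokens.
print(chinchilla_loss(70e9, 1.4e12))  # ≈ 1.94
```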
A related line of research concerns the generalization of over-parameterized neural architectures (Belkin et al., 2019; Nakkiran et al., 2021). Recent experiments show that over-parameterized transformers nevertheless generalize well.
Hopfield models. Classical Hopfield networks (Amari, 1972; Hopfield, 1982) were introduced as a canonical model of associative memory. The update dynamics of the network define an energy function whose fixed points correspond to the stored memories. An important metric is the number of patterns the model can store, known as the storage capacity of the network. Modifications of the energy function (Krotov and Hopfield, 2016; Demircigil et al., 2017) substantially increase this capacity. The original model operates on binary variables. The modern Hopfield network (Ramsauer et al., 2020) generalizes the Hopfield model to the continuous domain, making it an attractive tool for understanding the attention mechanism in transformers, which likewise takes real-valued embeddings as inputs. Given an input (e.g., a query), a Hopfield layer retrieves a memory by converging to a local minimum of the energy landscape, and this retrieval has a close correspondence with the query-key attention mechanism. Krotov (2021) proposes a hierarchical associative memory (HAM) model that allows a neural network to be described by a single global energy function rather than by separate energy functions for individual layers.
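To make the retrieval/attention correspondence concrete, below is a minimal NumPy sketch of the continuous Hopfield update xi_new = X softmax(β Xᵀ xi) from Ramsauer et al. (2020); the dimensionality, the inverse temperature β = 8, and the randomly generated patterns are illustrative choices, not taken from any of the cited papers:

```python
import numpy as np


def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def hopfield_energy(xi, X, beta=8.0):
    """Energy of the continuous Hopfield network (Ramsauer et al., 2020):
    E(xi) = -(1/beta) * log(sum_i exp(beta * x_i^T xi)) + 0.5 * ||xi||^2 (+ const).
    The stored patterns are the columns of X; xi is the current state (query)."""
    lse = np.log(np.sum(np.exp(beta * X.T @ xi))) / beta
    return -lse + 0.5 * xi @ xi


def hopfield_update(xi, X, beta=8.0):
    """One retrieval step: xi_new = X @ softmax(beta * X^T @ xi).
    Reading X^T as keys, xi as the query, and X as values, this is exactly
    softmax attention with a single query."""
    return X @ softmax(beta * X.T @ xi)


rng = np.random.default_rng(0)
d, num_patterns = 16, 5
X = rng.standard_normal((d, num_patterns))        # stored memories as columns
query = X[:, 2] + 0.3 * rng.standard_normal(d)    # noisy version of pattern 2

state = query
for _ in range(3):                                # a few steps are usually enough
    state = hopfield_update(state, X)

print("retrieved pattern index:", np.argmax(X.T @ state))  # expected: 2
print("energy before/after:", hopfield_energy(query, X), hopfield_energy(state, X))
```

In this toy run the noisy query converges to the stored pattern it started closest to, which is the sense in which a Hopfield layer "retrieves a memory" by descending the energy landscape.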