Google researchers develop a new AI technique that doesn't waste brainpower on unimportant words
Authors:
(1) David Raposo, Google DeepMind, with equal contribution;
(2) Sam Ritter, Google DeepMind;
(3) Blake Richards, Google DeepMind, McGill University and Mila;
(4) Timothy Lillicrap, Google DeepMind;
(5) Peter Conway Humphreys, Google DeepMind;
(6) Adam Santoro, Google DeepMind, with equal contribution.
Editor’s note: This is Part 1 of 5 of a study showing how to make transformer-based language models more efficient by dynamically allocating compute resources. Read the rest below.
Table of Links
- Introduction
- Background
- Implementing Mixture-of-Depths
  - 3.1. Defining a compute budget
  - 3.2. Routing around transformer blocks
  - 3.3. Routing schemes
  - 3.4. Routing implementation
  - 3.5. Sampling and 3.6. Training methods
- Results
  - 4.1. Training, isoFLOP comparisons
  - 4.2. Auto-regressive evaluation and 4.3. Mixture-of-Depths-and-Experts (MoDE)
- Discussion and References
Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens (𝑘) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-𝑘 routing mechanism. Since 𝑘 is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the 𝑘 tokens are fluid, this method can expend FLOPs non-uniformly across the time and model-depth dimensions. Compute expenditure is therefore entirely predictable in sum total, but dynamic and context-sensitive at the token level. Models trained in this way not only learn to allocate compute dynamically, they do so efficiently. These models match baseline performance for equivalent FLOPs and wall-clock time to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
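To make the top-𝑘 routing mechanism concrete, here is a minimal sketch in PyTorch of how a layer might pick the 𝑘 tokens that receive full compute. This is our illustration under assumed names and sizes, not the authors' code:

```python
import torch
import torch.nn as nn

# Hypothetical router: a learned linear projection scores every token,
# and only the top-k tokens participate in the layer's attention and MLP.
d_model, seq_len, k = 512, 2048, 1024        # assumed sizes
router = nn.Linear(d_model, 1)

x = torch.randn(1, seq_len, d_model)         # token embeddings
scores = router(x).squeeze(-1)               # [batch, seq_len]
top = torch.topk(scores, k, dim=-1)          # k is fixed a priori, so
mask = torch.zeros_like(scores, dtype=torch.bool)
mask.scatter_(-1, top.indices, True)         # tensor shapes stay static
# Tokens where mask is True get full compute; the rest skip the block
# through the residual connection.
```

Because 𝑘 never changes, the shapes of every tensor in this computation are known before training begins, which is what distinguishes the method from conditional computation schemes that build dynamic graphs.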
1. Introduction
Not all problems require the same amount of time or effort to solve. Likewise, in language modeling, not all tokens and sequences require the same time or effort to accurately predict. And yet, transformer models expend the same amount of compute per token in a forward pass. Ideally, transformers would use smaller total compute budgets by not spending compute unnecessarily.
Conditional computation is a technique that tries to reduce total compute by expending it only when needed (Bengio et al., 2016; Bengio, 2013; Bengio et al., 2013). Various algorithms offer solutions for when, and how much, compute should be used (Ainslie et al., 2023; Bapna et al., 2020; Fedus et al., 2022). However, general formulations of this challenging problem may not work well with existing hardware constraints, since they tend to introduce dynamic computation graphs (Dehghani et al., 2018; Graves, 2016). The most promising conditional computation methods may instead be those that harmonize with our current hardware stack, which prioritizes static computation graphs and known tensor sizes that are selected to maximize hardware utilization.
Here we consider the problem of language modeling with a static compute budget that can be made smaller than that of a vanilla transformer. The network must learn how to dynamically allocate the available compute by making per-token decisions, in each layer, about where to spend compute from the available budget. In our implementation the total compute is user-defined and fixed prior to training, rather than a function of the network's on-the-fly decisions. Hardware-efficiency gains, such as a reduced memory footprint or fewer FLOPs per forward pass, can therefore be anticipated and exploited ahead of time. As we will show, these gains can be achieved without sacrificing overall performance.
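Because the budget is fixed before training, the per-block FLOP savings can be computed exactly ahead of time. The following back-of-envelope sketch uses assumed sizes and simplified FLOP formulas of our own, not figures from the paper:

```python
# Back-of-envelope FLOPs for one transformer block (assumed sizes).
d, t = 512, 2048                  # model width, sequence length
capacity = 0.5                    # assumed fraction of tokens routed through
k = int(capacity * t)             # k = 1024, fixed before training

def block_flops(n: int, d: int) -> int:
    attn = 2 * n * n * d          # attention scores + value mixing (approx.)
    mlp = 16 * n * d * d          # two linear layers with a 4x-wide hidden
    return attn + mlp

ratio = block_flops(k, d) / block_flops(t, d)
print(f"routed block costs {ratio:.0%} of a vanilla block")
```

Since the ratio depends only on quantities fixed before training, memory and latency planning can be done up front, which is the practical point of a static budget.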
We leverage an approach comparable to mixture-of-experts (MoE) transformers, in which token-level routing decisions are made across the network depth. Departing from MoE, we choose either to apply a computation to a token (as in a standard transformer) or to pass it through a residual connection (leaving it unchanged and saving compute). Also unlike MoE, we apply this routing to both the forward MLPs and the multi-head attention. Since the routing therefore also affects the keys and queries that are processed, it decides not only which tokens are updated but also which tokens are made available to attend to. We refer to this strategy as Mixture-of-Depths (MoD) to emphasize how individual tokens pass through different numbers of layers, or blocks, through the depth of the transformer (see Figure 1).
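Putting the pieces together, the following sketch shows what a single MoD block could look like in PyTorch. It is a simplified illustration under stated assumptions (layer norms omitted, and details such as the sigmoid gating are our choices), not the paper's implementation:

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Sketch of a Mixture-of-Depths block (illustrative, assumed details).

    Each token gets a scalar router score; the top-k tokens pass through
    the full block (self-attention + MLP) while the rest skip it via the
    residual stream. Layer norms are omitted for brevity.
    """

    def __init__(self, d_model: int, n_heads: int, capacity: float):
        super().__init__()
        self.router = nn.Linear(d_model, 1)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.capacity = capacity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        k = max(1, int(self.capacity * t))       # static budget, fixed a priori
        scores = self.router(x).squeeze(-1)      # [b, t] scalar per token
        idx = torch.topk(scores, k, dim=-1).indices
        idx = idx.sort(dim=-1).values            # preserve temporal order
        chosen = torch.gather(x, 1, idx.unsqueeze(-1).expand(b, k, d))

        # Only the chosen tokens form queries, keys, and values, so the
        # attention cost scales with k rather than the full length t.
        causal = torch.triu(
            torch.ones(k, k, dtype=torch.bool, device=x.device), diagonal=1
        )
        h, _ = self.attn(chosen, chosen, chosen,
                         attn_mask=causal, need_weights=False)
        h = h + self.mlp(h)                      # block output for chosen tokens

        # Gate the block output by the router score so routing decisions
        # receive gradients, then add it back into the residual stream;
        # unchosen tokens pass through unchanged.
        gate = torch.gather(torch.sigmoid(scores), 1, idx).unsqueeze(-1)
        out = x.clone()
        out.scatter_add_(1, idx.unsqueeze(-1).expand(b, k, d), gate * h)
        return out
```

Stacking blocks like this lets individual tokens traverse different numbers of layers, which is the behavior Figure 1 illustrates; gating the output by the router score keeps the routing decision differentiable.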
MoD also allows users to trade off performance against speed. On the one hand, one can train a MoD transformer that improves on a vanilla transformer by as much as 1.5% on the final log-probability training objective for an equivalent number of training FLOPs (isoFLOP), while taking an equivalent amount of wall-clock time to train. On the other hand, one can train a MoD transformer that matches the training loss of an isoFLOP-optimal vanilla transformer while using a fraction of the FLOPs (upwards of 50% fewer) per forward pass, and that is therefore faster to step. Together, these results indicate that MoD transformers learn to route intelligently (i.e., to skip unnecessary computations), since they achieve equal or better log probabilities per sequence despite a smaller FLOP footprint per forward pass.