This New AI Method Lets the Model Decide What to Think About

Authors:
(1) David Raposo, Google DeepMind and with equal contribution;
(2) Sam Ritter, Google DeepMind;
(3) Blake Richards, Google DeepMind, McGill University and Mila;
(4) Timothy Lillicrap, Google DeepMind;
(5) Peter Conway Humphreys, Google DeepMind;
(6) Adam Santoro, Google DeepMind and with equal contribution.
Editor’s note: This is Part 2 of 5 of a study that shows a way to make transformer-based language models more efficient by allocating compute dynamically. Read the rest below.
Table of Links
- Introduction
- Background
- Implementing Mixture-of-Depths
  - 3.1. Defining a compute budget
  - 3.2. Routing around transformer blocks
  - 3.3. Routing schemes
  - 3.4. Routing implementation
  - 3.5. Sampling and 3.6. Training methods
- Results
  - 4.1. Training, isoFLOP comparisons
  - 4.2. Auto-regressive evaluation and 4.3. Mixture-of-Depths-and-Experts (MoDE)
- Discussion and References
2. Background
The transformer architecture has become the workhorse of practical artificial intelligence, delivering unprecedented capabilities at the cost of expensive training and serving procedures. This has driven tremendous interest in making transformer architectures more efficient (Gupta and Agrawal, 2021; Tay et al., 2020). One of the most promising approaches is conditional computation, in which learned mechanisms determine when and how to expend compute. The term was introduced by Bengio (2013), and the concept was explored over the following years (Bengio et al., 2016, 2013; Cho and Bengio, 2014; Graves, 2016; , 2017).
A wide range of recent work has developed conditional-computation methods for transformers. Some of this work focuses on "early exiting", that is, learning to decide when to end computation for a given token, allowing the token to skip any remaining transformer layers once the exit decision is made (Elbayad et al., 2019; Liu et al., 2021). Unlike early-exit methods, in our approach a token can skip middle layers and then be updated via self-attention using tokens that did pass through all the middle layers.
Other work has developed methods for iterating transformer layers with shared weights for a number of steps (Dehghani et al., 2018; Simoulin and Crabbé, 2021). Bolya et al. (2023) introduce a way to choose which tokens to merge at inference time in a trained vision transformer, notably requiring no learning at all. Lei et al. (2023) leverage conditional computation in the fine-tuning setting, building on adapter methods (He et al., 2021) to learn to skip blocks of frozen pre-trained weights in favor of running only a small adapter.
CoLT5 (Ainslie et al., 2023) likewise routes tokens conditionally; moreover, it uses the same routing mechanism to decide whether a token will attend to all other tokens or only to a few, as in Guo et al. (2022). Like MoD, CoLT5 uses soft top-k to make routing decisions. However, CoLT5 focuses on the encoder setting, and so does not have to confront the problem of efficient sequential decoding given the non-causal nature of top-k. In contrast, our present work with MoD focuses on the decoder-only setting, so we propose a predictive router to enable efficient inference for conditional-compute transformers.
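To make the causality issue concrete, here is a minimal sketch, not the paper's implementation, written in illustrative PyTorch; the auxiliary predictor head and the 0.5 threshold are assumptions introduced only for this example. It shows why top-k selection cannot be applied token-by-token during decoding, and how a small per-token predictive classifier sidesteps the problem:

```python
# Minimal sketch (not the paper's code) of why top-k routing is non-causal,
# plus an assumed, illustrative per-token predictor as a causal alternative.
import torch

torch.manual_seed(0)
seq_len, k = 8, 3
scores = torch.randn(seq_len)  # router scores for one sequence

# Top-k membership computed on the full sequence vs. on a prefix:
full_topk = set(torch.topk(scores, k).indices.tolist())
prefix_topk = set(torch.topk(scores[:5], k).indices.tolist())
print(full_topk, prefix_topk)  # an early token's membership can change once
                               # later tokens arrive, so top-k is non-causal

# Predictive-router sketch: a per-token classifier makes the keep/skip
# decision from the token's own activation, with no look-ahead required.
hidden = torch.randn(seq_len, 16)       # toy activations
predictor = torch.nn.Linear(16, 1)      # hypothetical auxiliary head
keep = torch.sigmoid(predictor(hidden)).squeeze(-1) > 0.5
print(keep.tolist())
```

The point of the design is that the per-token decision depends only on that token's own activation, so it remains valid as the sequence grows during autoregressive decoding.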
One particularly successful formulation of conditional computation is the mixture-of-experts (MoE) layer, as introduced by Shazeer et al. (2017). Initially developed in the context of LSTMs, subsequent work demonstrated compelling empirical results for MoE with transformers (Fedus et al., 2022). Unlike other conditional-computation approaches, which try to save or spend additional compute, MoE transformers use conditional logic to route tokens to one of many expert MLPs while keeping the total compute expenditure constant. Our mixture-of-depths method can be seen as using the routing logic of MoE transformers, but instead of having many experts, MoD deploys a single expert that can be dynamically skipped.
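As a rough illustration of that last point, the sketch below is an assumption-laden toy in PyTorch, not the authors' code; the module name MoDBlockSketch, the sigmoid gating, and the fixed per-sequence capacity are all illustrative choices. It routes only the top-k scoring tokens through a block while the remaining tokens ride the residual stream unchanged:

```python
# Toy mixture-of-depths-style block (a sketch, not the paper's implementation):
# MoE-style router logic, but a single "expert" that non-selected tokens skip.
import torch
import torch.nn as nn

class MoDBlockSketch(nn.Module):
    """Route only the top-k tokens through `block`; others pass through unchanged."""
    def __init__(self, d_model: int, capacity: int):
        super().__init__()
        self.router = nn.Linear(d_model, 1)  # scalar routing score per token
        self.block = nn.Sequential(          # stand-in for an attention + MLP block
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.capacity = capacity             # k tokens processed per sequence

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [batch, seq, d_model]
        scores = self.router(x).squeeze(-1)               # [batch, seq]
        topk = torch.topk(scores, self.capacity, dim=-1).indices
        out = x.clone()                                    # skipped tokens = identity
        for b in range(x.shape[0]):
            sel = topk[b]
            # Scale the block output by the (sigmoid-gated) router score so the
            # router stays on the gradient path.
            w = torch.sigmoid(scores[b, sel]).unsqueeze(-1)
            out[b, sel] = x[b, sel] + w * self.block(x[b, sel])
        return out

x = torch.randn(2, 16, 32)
y = MoDBlockSketch(d_model=32, capacity=4)(x)
print(y.shape)  # torch.Size([2, 16, 32])
```

Scaling the block's output by the router score keeps the routing decision on the gradient path, which mirrors how MoE-style routers typically receive their learning signal.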