Related Work: vAttention in the Landscape of LLM Inference Optimization
Table of Links
Abstract and 1 Introduction
2 Background
2.1 Large Language Models
2.2 Fragmentation and PagedAttention
3 Issues with the PagedAttention Model and 3.1 Requires Re-writing the Attention Kernel
3.2 Adds Redundancy in the Serving Framework and 3.3 Performance Overhead
4 Insights into LLM Serving Systems
5 vAttention: System Design and 5.1 Design Overview
5.2 Leveraging Low-level CUDA Support
5.3 Serving LLMs with vAttention
6 vAttention: Optimizations and 6.1 Mitigating Internal Fragmentation
6.2 Hiding Memory Allocation Latency
7 Evaluation
7.1 Portability and Performance for Prefills
7.2 Portability and Performance for Decodes
7.3 Efficacy of Physical Memory Allocation
7.4 Analysis of Memory Fragmentation
8 Related Work
9 Conclusion and References
In recent work, GMLake [35] showed that using CUDA's virtual memory support can mitigate fragmentation in DNN training jobs, increasing the achievable training batch size. In particular, GMLake uses CUDA support to coalesce multiple smaller physical memory pages into a single virtually contiguous object, which can prevent out-of-memory errors for large allocations. In contrast, vAttention focuses on avoiding fragmentation in LLM inference. Unlike training, LLM inference is latency sensitive and requires allocations at a smaller granularity. We therefore proposed several LLM-inference-specific optimizations to meet these requirements.
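The coalescing idea above relies on the CUDA driver's virtual memory management APIs (available since CUDA 10.2). The following is a minimal sketch, not GMLake's actual code, showing how two separate physical allocations can be stitched into one virtually contiguous buffer; sizes are illustrative and error handling is omitted.

```cpp
// Sketch: coalesce two smaller physical allocations into one virtually
// contiguous buffer using the CUDA driver virtual memory API.
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t gran = 0;  // minimum physical allocation granularity (typically 2MB)
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);

    // Two separate physical allocations, e.g., recycled from freed blocks.
    CUmemGenericAllocationHandle h1, h2;
    cuMemCreate(&h1, gran, &prop, 0);
    cuMemCreate(&h2, gran, &prop, 0);

    // Reserve one contiguous virtual address range large enough for both,
    // then map the two physical handles back-to-back into it.
    CUdeviceptr va;
    cuMemAddressReserve(&va, 2 * gran, 0, 0, 0);
    cuMemMap(va,        gran, 0, h1, 0);
    cuMemMap(va + gran, gran, 0, h2, 0);

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(va, 2 * gran, &access, 1);

    // 'va' can now be used as a single buffer of 2*gran bytes by any kernel.
    printf("coalesced buffer at %p (%zu bytes)\n", (void*)va, 2 * gran);

    cuMemUnmap(va, 2 * gran);
    cuMemRelease(h1); cuMemRelease(h2);
    cuMemAddressFree(va, 2 * gran);
    cuCtxDestroy(ctx);
    return 0;
}
```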
Optimizing LLM inference is an active area of research. Various scheduling systems have been proposed to improve different aspects of LLM serving. For example, Orca [47] and vLLM [39] aim to improve serving throughput with efficient batching. Sarathi [26] and SplitFuse [36] split a long prefill into multiple smaller chunks and combine decode tokens with each chunk to improve GPU compute utilization (see the sketch after this paragraph). Building on similar techniques, Sarathi-Serve [25] proposes stall-free batching to minimize the impact of long prefill iterations on time-between-tokens. Splitwise [41], DistServe [49] and TetriInfer [38] disaggregate the prefill and decode phases, executing them on different replicas to avoid interference between prefill and decode requests. For offline inference on resource-constrained devices, FlexGen [43] proposed a scheduling and offloading strategy to improve throughput. FastServe [45] minimizes job completion times in LLM inference using preemptive scheduling.
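The chunked-prefill idea can be illustrated with a schematic scheduler loop. This is only a sketch of the technique, not code from Sarathi, SplitFuse, or Sarathi-Serve; the `Request` struct, token budget, and request mix are assumptions made for the example.

```cpp
// Sketch: chunked-prefill batching. Each iteration carries the ongoing
// decodes plus one chunk of a long prefill, instead of stalling decodes
// behind the entire prefill.
#include <vector>
#include <algorithm>
#include <cstdio>

struct Request {
    int prefill_left;   // prompt tokens not yet processed (0 => decoding)
};

int main() {
    const int token_budget = 512;                         // tokens per iteration
    std::vector<Request> batch = {{4096}, {0}, {0}, {0}}; // one long prefill + 3 decodes

    for (int iter = 0; ; ++iter) {
        int budget = token_budget;
        int decode_tokens = 0, prefill_tokens = 0;

        // Each ongoing decode contributes one token to the iteration.
        for (auto& r : batch)
            if (r.prefill_left == 0) { decode_tokens++; budget--; }

        // Fill the remaining budget with a chunk of the pending prefill.
        for (auto& r : batch)
            if (r.prefill_left > 0) {
                int chunk = std::min(r.prefill_left, budget);
                r.prefill_left -= chunk;
                prefill_tokens += chunk;
                budget -= chunk;
            }

        printf("iter %d: %d decode + %d prefill tokens\n",
               iter, decode_tokens, prefill_tokens);
        if (prefill_tokens == 0) break;   // prefill fully consumed
    }
    return 0;
}
```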
For all of the systems above to work effectively, efficient use of GPU physical memory is essential. Since vLLM, PagedAttention has been adopted in various serving frameworks, e.g., TensorRT-LLM [14] and LightLLM [12], and in kernel implementations, e.g., FlashAttention [9] and FlashInfer [11]. In contrast, vAttention offers an alternative approach for dynamic KV-cache memory management. We show that using system support for demand paging can easily add dynamic memory management support to existing kernel implementations.
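To make the demand-paging idea concrete, here is a minimal host-side sketch of growing a virtually contiguous KV-cache buffer on demand. It is an illustration of the general technique rather than vAttention's implementation, which differs in important ways (e.g., per-layer K/V buffers, overlapping allocation with compute, and smaller allocation granularities); the step count and buffer sizes are illustrative assumptions.

```cpp
// Sketch: reserve virtual address space for the maximum context length,
// then map physical pages only as the KV cache grows. Unmodified attention
// kernels can consume 'kv' as an ordinary contiguous tensor pointer.
#include <cuda.h>
#include <vector>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

    size_t page = 0;
    cuMemGetAllocationGranularity(&page, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);

    // Virtual reservation for the maximum KV-cache size; no physical memory yet.
    const size_t max_kv_bytes = 64 * page;     // illustrative upper bound
    CUdeviceptr kv;
    cuMemAddressReserve(&kv, max_kv_bytes, 0, 0, 0);

    std::vector<CUmemGenericAllocationHandle> pages;
    size_t mapped = 0;

    // Serving loop: before each step, map more physical pages only if the
    // KV cache has outgrown what is already mapped.
    for (int step = 0; step < 4; ++step) {
        size_t needed = (step + 1) * page;      // stand-in for tokens * bytes/token
        while (mapped < needed) {
            CUmemGenericAllocationHandle h;
            cuMemCreate(&h, page, &prop, 0);
            cuMemMap(kv + mapped, page, 0, h, 0);
            cuMemSetAccess(kv + mapped, page, &access, 1);
            pages.push_back(h);
            mapped += page;
        }
        // An existing attention kernel would be launched with 'kv' here.
        printf("step %d: %zu bytes mapped at %p\n", step, mapped, (void*)kv);
    }

    cuMemUnmap(kv, mapped);
    for (auto h : pages) cuMemRelease(h);
    cuMemAddressFree(kv, max_kv_bytes);
    cuCtxDestroy(ctx);
    return 0;
}
```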
Authors:
(1) Ramya Prabhu, Microsoft Research India;
(2) Ajay Nayak, Indian Institute of Science; contributed to this work as an intern at Microsoft Research India;
(3) Jayashree Mohan, Microsoft Research India;
(4) Ramachandran Ramjee, Microsoft Research India;
(5) Ashish Panwar, Microsoft Research India.