Training a large language model requires significant computational resources, including powerful GPUs, TPUs, and other specialized AI accelerators. These resources can be expensive to acquire and maintain. Gathering and preparing the vast amounts of data needed to train large language models is also a costly and time-consuming process, since high-quality, diverse, and representative datasets are essential for model performance.
Training large language models can take weeks or even months, depending on the model’s size and complexity, and serving them is expensive as well. Sparsity is a natural approach to reducing this cost, but existing methods either require costly retraining or fail to yield wall-clock speedups on modern hardware. The researchers instead observe that, for a given input, a small, input-dependent set of attention heads and MLP parameters yields approximately the same output as the dense model.
They hypothesize that this contextual sparsity exists and that, when accurately predicted, it can speed up LLM inference in wall-clock time without compromising the model’s quality or in-context learning ability. They propose DEJAVU, a system that uses a low-cost algorithm to predict contextual sparsity on the fly, given the input to each layer, along with an asynchronous, hardware-aware implementation that speeds up LLM inference.
Even if contextual sparsity exists, predicting it for a given input in advance is challenging. It is nontrivial even to verify that such sparsity exists, and naive verification can be prohibitively expensive; achieving end-to-end wall-clock speedup is harder still. The team verified the existence of such sparsity with a simple approach. Contextual sparsity depends not only on individual input tokens but also on their interactions, so only token embeddings carrying sufficient contextual information allow the sparsity to be predicted accurately.
The contextual sparsity in the MLP block can be identified after computing the activations. However, this only demonstrates that contextual sparsity exists and brings no efficiency benefit on its own. A fast and precise prediction is needed to exploit contextual sparsity for end-to-end efficiency.
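The observation above can be illustrated with a toy example: after a ReLU MLP activation is computed, the zeroed-out neurons contribute nothing to the output, so recomputing with only the active neurons gives the identical result. This is a minimal sketch with hypothetical dimensions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MLP block (hypothetical sizes; real LLMs are far larger).
d_model, d_ffn = 64, 256
W1 = rng.standard_normal((d_model, d_ffn))
W2 = rng.standard_normal((d_ffn, d_model))
x = rng.standard_normal(d_model)

# Dense forward pass: the activation reveals which neurons fire for this input.
h = np.maximum(x @ W1, 0.0)   # ReLU activations
active = h > 0                # contextual sparsity mask for this input
print(f"active neurons: {active.sum()}/{d_ffn}")

# Sparse forward pass: use only the active neurons' rows of W2.
y_dense = h @ W2
y_sparse = h[active] @ W2[active]
assert np.allclose(y_dense, y_sparse)  # same output, fewer FLOPs
```

Note that the mask is only known *after* computing `h`, which is exactly why the savings are not free: exploiting them requires predicting `active` before the MLP runs.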
DEJAVU uses lookahead predictors to sidestep this prediction cost. Given the input to the attention layer at block k, it asynchronously predicts the contextual sparsity for the MLP at block k and provides that information to the MLP before it runs; it then predicts the sparsity for the attention heads at the next layer. The researchers also show that contextual sparsity can be accurately predicted with lightweight learning-based algorithms.
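The lookahead idea can be sketched as a small learned classifier that scores all MLP neurons from the attention-layer input and keeps only the most likely ones. The predictor below is purely illustrative: the function name, the single linear layer, and the random placeholder weights (which would be trained offline on recorded activations) are assumptions, not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ffn = 64, 256

# Placeholder weights for a low-cost linear predictor mapping the
# attention-layer input at block k to per-neuron "active" scores for
# the MLP at block k. In practice these would be learned.
W_pred = rng.standard_normal((d_model, d_ffn)) * 0.1

def predict_active_neurons(x, top_k=64):
    """Cheaply score all d_ffn neurons and keep the top_k most likely."""
    scores = x @ W_pred
    return np.argsort(scores)[-top_k:]

x = rng.standard_normal(d_model)
idx = predict_active_neurons(x)
print(f"predicted {len(idx)} of {d_ffn} neurons to compute")
```

Because the predictor runs on the input to the *previous* sublayer, its cost can be hidden behind the attention computation, which is what makes the wall-clock speedup possible.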
The researchers find that DEJAVU achieves over a 2x reduction in token-generation latency compared to the state-of-the-art FasterTransformer, and over 6x compared to the Hugging Face implementation, with no accuracy loss. The MLP sparse predictor introduces no accuracy loss on either zero-shot tasks or language modeling, and during its training the predictor reached high validation accuracy.
All credit for this research goes to the researchers on this project.
Arshad is an intern at MarktechPost. He is currently pursuing his Int. MSc in Physics from the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn lead to advances in technology. He is passionate about understanding nature at a fundamental level with the help of tools like mathematical models, ML models, and AI.