ABSTRACT

Large neural networks spend most of their computation on floating point tensor multiplications. In this work, we find that a floating point multiplier can be approximated by one integer adder with high precision. We propose the linear-complexity multiplication (L-Mul) algorithm, which approximates floating point multiplication with integer addition operations. Compared to 8-bit floating point multiplication, the new algorithm achieves higher precision while consuming significantly less bit-level computation. Since multiplying floating point numbers requires substantially more energy than integer addition, applying the L-Mul operation in tensor processing hardware can potentially reduce the energy cost of element-wise floating point tensor multiplications by 95% and the energy cost of dot products by 80%. We calculate the theoretical error expectation of L-Mul and evaluate the algorithm on a wide range of textual, visual, and symbolic tasks, including natural language understanding, structural reasoning, mathematics, and commonsense question answering. Our numerical analysis experiments agree with the theoretical error estimation, indicating that L-Mul with a 4-bit mantissa achieves precision comparable to float8 e4m3 multiplication, and L-Mul with a 3-bit mantissa outperforms float8 e5m2. Evaluation results on popular benchmarks show that directly applying L-Mul to the attention mechanism is almost lossless. We further show that replacing all floating point multiplications in a transformer model with 3-bit-mantissa L-Mul achieves precision equivalent to using float8 e4m3 as the accumulation precision in both fine-tuning and inference.
1 INTRODUCTION
Modern artificial intelligence (AI) systems are significant energy consumers. Because of the large-scale computation needed for neural network inference, AI applications built on such models consume a considerable amount of electricity. Reportedly, the average electricity consumption of the ChatGPT service in early 2023 was 564 MWh per day, equivalent to the total daily electricity usage of 18,000 families in the United States1. It is estimated that Google’s AI services could consume as much electricity as Ireland (29.3 TWh per year) in the worst-case scenario (de Vries, 2023).

1 https://www.eia.gov/tools/faqs/faq.php?id=97
Reducing the amount of computation needed by neural networks is key to reducing both the energy consumption and the inference latency of large-scale AI models. Neural networks, especially large language models (LLMs) (Radford et al., 2019; Brown, 2020; Achiam et al., 2023; Touvron et al., 2023; Team et al., 2023), contain a large number of floating point parameters involved in element-wise and matrix multiplication computations. In transformer-based (Vaswani, 2017) LLMs, the attention mechanism is a major bottleneck that limits computation efficiency: given an input context of N tokens, the standard attention computation has O(N²) complexity and involves multiplying high-dimensional tensors. Beyond attention, there is also a large amount of element-wise multiplication and linear transformation computation.
In this work, we propose the linear-complexity multiplication (L-Mul) algorithm, which approximates floating point multiplication with integer addition operations. The algorithm can be integrated into existing models at various levels, such as replacing the multiplication in the attention mechanism or substituting all matrix and element-wise multiplications.
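To make the approximation concrete, the following is a minimal, value-level Python sketch of the idea: viewing each float as (1 + m) · 2^e, the exact product mantissa (1 + m_x)(1 + m_y) is approximated by dropping the m_x·m_y cross term and adding a small constant offset, so only additions of exponents and mantissa fractions remain. The function name l_mul_value_level, the 4-bit default mantissa width, and the fixed 2^-4 offset are illustrative assumptions, not the paper's reference formulation, which operates directly on the bit patterns with integer adders.

```python
import math

def l_mul_value_level(x: float, y: float, mantissa_bits: int = 4) -> float:
    """Value-level sketch of an L-Mul-style approximation.

    A float is viewed as (1 + m) * 2**e with mantissa fraction m in [0, 1).
    The exact product is (1 + m_x + m_y + m_x*m_y) * 2**(e_x + e_y); the
    approximation drops the m_x*m_y cross term and adds a constant offset,
    leaving only additions.  The offset value here is an assumption for
    illustration; the paper defines it as a function of the mantissa width.
    """
    if x == 0.0 or y == 0.0:
        return 0.0
    sign = math.copysign(1.0, x) * math.copysign(1.0, y)

    # Decompose |x| = (1 + m_x) * 2**e_x with m_x in [0, 1).
    fx, ex = math.frexp(abs(x))   # fx in [0.5, 1)
    fy, ey = math.frexp(abs(y))
    m_x, e_x = 2.0 * fx - 1.0, ex - 1
    m_y, e_y = 2.0 * fy - 1.0, ey - 1

    # Quantize the mantissa fractions to the chosen bit width.
    scale = 2 ** mantissa_bits
    m_x = math.floor(m_x * scale) / scale
    m_y = math.floor(m_y * scale) / scale

    offset = 2.0 ** -4  # illustrative correction term
    return sign * (1.0 + m_x + m_y + offset) * 2.0 ** (e_x + e_y)

if __name__ == "__main__":
    for a, b in [(1.5, 2.25), (0.3, -0.7), (3.14159, 2.71828)]:
        print(f"exact: {a * b:.6f}  approx: {l_mul_value_level(a, b):.6f}")
```

For (1.5, 2.25) the sketch happens to be exact, while for general inputs the error stays within the mantissa quantization range, which is the behavior the error analysis in the paper formalizes.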
The proposed L-Mul method leads to significantly reduced energy consumption for both model training and inference. In modern computing hardware, multiplication between floating point numbers consumes significantly more energy than addition (Horowitz, 2014). Specifically, multiplying two 32-bit floating point numbers (fp32) costs about four times as much energy as adding two fp32 numbers, and 37 times as much as adding two 32-bit integers (int32). The rough energy costs of various operations are shown in Table 1. In PyTorch (Paszke et al., 2019), the default precision for accumulating tensor multiplication results is fp32. When I/O and control operations are not considered, approximating fp32 multiplications with int32 additions consumes only 1/37 ≈ 2.7% of the energy. When the accumulation precision is reduced to fp16, integer addition consumes approximately 4.7% of the energy required for floating point multiplication.
[Table 1: rough energy cost of various arithmetic operations]
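Since the table values themselves were lost in extraction, the back-of-envelope Python sketch below uses only the ratios stated in the paragraph above (one fp32 multiplication ≈ 4 fp32 additions ≈ 37 int32 additions) to check the order of magnitude of the savings claimed in the abstract. The exact 95% and 80% figures in the paper come from the absolute energy values in Table 1, so this is a rough consistency check, not the authors' calculation.

```python
# Back-of-envelope energy estimate using only the ratios quoted above:
# one fp32 multiplication ~ 4x a fp32 addition ~ 37x an int32 addition.
INT32_ADD = 1.0                 # arbitrary energy unit
FP32_MUL = 37.0 * INT32_ADD
FP32_ADD = FP32_MUL / 4.0       # ~9.25 units

def dot_product_energy(n: int, use_l_mul: bool) -> float:
    """Energy of an n-element dot product: n multiplications plus
    n accumulation additions (fp32 accumulation, PyTorch's default)."""
    mul_cost = INT32_ADD if use_l_mul else FP32_MUL
    return n * (mul_cost + FP32_ADD)

n = 1024
standard = dot_product_energy(n, use_l_mul=False)
approx = dot_product_energy(n, use_l_mul=True)
# Element-wise multiply: each fp32 multiplication becomes one int32 addition.
print(f"element-wise multiply savings: {1 - INT32_ADD / FP32_MUL:.1%}")  # ~97%
print(f"dot-product savings:           {1 - approx / standard:.1%}")     # ~78%
```

The resulting ~97% and ~78% are in the same ballpark as the ~95% and ~80% figures in the abstract, with the residual gap attributable to the precise per-operation energies in Table 1.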
We evaluate the numerical precision of the L-Mul algorithm on transformer-based language models across a wide range of language and vision tasks. Experiments with full-precision model weights show that replacing standard multiplication operations with L-Mul in the attention mechanism is almost lossless for transformer-based LLMs. On natural language reasoning tasks, the average performance loss of L-Mul-based attention is 0.07% across commonsense, structured reasoning, and language understanding benchmarks. On vision tasks, L-Mul-based attention gains a 0.12% accuracy improvement on visual question answering, object hallucination, and free-form visual instruction tasks. These results are obtained by directly adapting pretrained LLMs from the standard attention implementation to the new L-Mul-based attention mechanism without any additional training.
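As an illustration of this training-free setup, here is a hypothetical PyTorch-style sketch of how scaled dot-product attention could be rerouted through an approximate matmul kernel. The name l_mul_matmul is a placeholder (it falls back to the exact matmul so the example runs end to end) and is not the authors' released kernel; the point is that the attention structure is unchanged and no retraining is involved.

```python
import torch
import torch.nn.functional as F

def l_mul_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Placeholder for a kernel whose element-wise products are approximated
    with L-Mul (integer additions on the floating point representations).
    Here it simply falls back to the exact matmul so the sketch executes."""
    return a @ b

def l_mul_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Drop-in variant of scaled dot-product attention in which both matrix
    multiplications go through the approximate kernel; the pretrained
    weights are used as-is, mirroring the training-free setting above."""
    d = q.shape[-1]
    scores = l_mul_matmul(q, k.transpose(-2, -1)) / d ** 0.5
    weights = F.softmax(scores, dim=-1)
    return l_mul_matmul(weights, v)

q = torch.randn(2, 8, 16, 64)   # (batch, heads, tokens, head_dim)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)
print(l_mul_attention(q, k, v).shape)  # torch.Size([2, 8, 16, 64])
```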
The error estimation and ablation study show that, in the training-free setting, L-Mul with a 4-bit mantissa achieves precision comparable to multiplying float8 e4m3 numbers, and L-Mul with a 3-bit mantissa outperforms float8 e5m2 multiplication. We also show that fine-tuning can close the performance gap between L-Mul and standard multiplication. Fine-tuning a model in which all multiplication operations in attention mechanisms, linear transformations, and element-wise products are replaced by 3-bit-mantissa L-Mul yields performance comparable to fine-tuning a standard model with an accumulation precision of float8 e4m3.
In the expansive landscape of AI efficiency research, our approach centers on enhancing the efficiency of tensor arithmetic algorithms—a direction that is orthogonal yet complementary to prevailing efforts in I/O and control optimization (Jouppi et al., 2017; Choquette et al., 2021; Abts et al., 2022)2. We believe that truly energy- and compute-efficient AI computation will emerge from a holistic integration of optimizations across I/O, control, and arithmetic operations.