Investigating LLM Training with Minimax Optimization

Authors

  • Hoang Duc An, Hanoi Regional Vocational, Vietnam

Keywords

LLM, Optimization, Quasi-Newton

Abstract

Large language models (LLMs) have considerably impacted natural language processing, yet training them remains challenging due to complex loss landscapes that often exhibit saddle point characteristics. In this work, we adapt a second-order saddle point approach (Xiao, Bo, and Wu 2024) to the LLM training environment. By approximating the squared Hessian matrix via iterative greedy updates and incorporating modifications such as limited-memory updates, adaptive step-size control, and efficient Hessian-vector products, our approach attains competitive convergence speed and stability in high-dimensional adversarial settings. Our experimental results on a transformer-based language model trained on an open web text corpus suggest that, while the improvements are moderate, the method offers a viable alternative to conventional optimizers such as Adam (Kingma and Ba 2015) and LAMB (You et al. 2019). We situate our work within the broader context of recent advances in optimization for deep learning (Bottou, Curtis, and Nocedal 2018; Martens 2010; Zhang et al. 2023; Vaswani et al. 2017; Goodfellow et al. 2014; LeCun, Bengio, and Hinton 2015; He et al. 2016; Pascanu, Mikolov, and Bengio 2013; Duchi, Hazan, and Singer 2011; Hinton and Salakhutdinov 2012).
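To make one ingredient of the abstract concrete, the sketch below shows how an efficient Hessian-vector product can be obtained by double backpropagation in PyTorch, the standard way to access curvature information without forming the Hessian explicitly. This is an illustrative sketch only, not the authors' implementation; the function name, the toy quadratic loss, and the probe vector are assumptions introduced here for demonstration.

import torch

def hessian_vector_product(loss, params, vec):
    """Return H @ vec for the Hessian of `loss` w.r.t. `params`.

    `params` is a list of tensors with requires_grad=True and `vec` is a
    list of tensors with matching shapes.
    """
    # First backward pass: keep the graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Dot product between the gradient and the probe vector.
    grad_dot_v = sum((g * v).sum() for g, v in zip(grads, vec))
    # Second backward pass: d(grad . v)/d(params) = H @ v.
    return torch.autograd.grad(grad_dot_v, params)

# Toy usage: for loss = sum(x_i^2) the Hessian is 2*I, so H @ v equals 2*v.
x = torch.randn(5, requires_grad=True)
loss = (x ** 2).sum()
v = [torch.ones_like(x)]
(hv,) = hessian_vector_product(loss, [x], v)
print(hv)  # tensor of 2.0s

In a quasi-Newton scheme of the kind cited above, such products would supply the curvature information that the greedy, limited-memory updates act on; the paper itself should be consulted for the exact update rule.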

References

Bottou, Léon, Frank E. Curtis, and Jorge Nocedal. 2018. “Optimization Methods for Large-Scale Machine Learning.” SIAM Review 60 (2): 223–311.

Brown, Tom et al. 2020. “Language Models Are Few-Shot Learners.” https://arxiv.org/abs/2005.14165.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 4171–86.

Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research 12: 2121–59.

Goodfellow, Ian et al. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems (NeurIPS).

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning for Image Recognition.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–78.

Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. 2012. “A Better Way to Pretrain Deep Boltzmann Machines.” In Advances in Neural Information Processing Systems (NeurIPS), 2447–55.

Kingma, Diederik P., and Jimmy Ba. 2015. “Adam: A Method for Stochastic Optimization.” In International Conference on Learning Representations (ICLR).

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521 (7553): 436–44.

Liu, Wei et al. 2021. “An Analysis of ExtraSGD for Adversarial Training.” Journal of Machine Learning Research 22: 1–29.

Madry, Aleksander et al. 2018. “Towards Deep Learning Models Resistant to Adversarial Attacks.” In International Conference on Learning Representations (ICLR).

Martens, James. 2010. “Deep Learning via Hessian-Free Optimization.” In Proceedings of the 27th International Conference on Machine Learning (ICML), 735–42.

Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. 2013. “On the Difficulty of Training Recurrent Neural Networks.” In International Conference on Machine Learning (ICML), 1310–18.

Radford, Alec et al. 2019. “Language Models Are Unsupervised Multitask Learners.” https://openai.com/blog/better-language-models/.

Vaswani, Ashish et al. 2017. “Attention Is All You Need.” In Advances in Neural Information Processing Systems (NeurIPS).

Xiao, Minheng, Shi Bo, and Zhizhong Wu. 2024. “Multiple Greedy Quasi-Newton Methods for Saddle Point Problems.” In 2024 6th International Conference on Data-Driven Optimization of Complex Systems (DOCS), 749–54. IEEE.

You, Yang et al. 2019. “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes.” In International Conference on Learning Representations (ICLR).

Zhang, Hong et al. 2023. “Quasi-Newton Methods in Deep Learning: A Review.” In Proceedings of the International Conference on Machine Learning (ICML).

Published

24-03-2025

How to Cite

[1] H. Duc An, “Investigating LLM Training with Minimax Optimization”, American J Data Sci Artif Intell Innov, vol. 5, pp. 1–8, Mar. 2025, Accessed: Apr. 17, 2025. [Online]. Available: https://ajdsai.org/index.php/publication/article/view/23