Achieving World No. 1 in LLM Performance – [Starview Vol. 9]

2023/09/07 | Written By: Jieun Song (People eXperience), Sungmin Park (Content Manager)
 

Upstage has achieved the title of 'World No. 1 on the Open LLM Leaderboard' at Hugging Face, known as the GitHub of generative AI.

If there was one topic that swept the world in the first half of the year, it would be the advent of generative AI technology represented by ChatGPT. With the growing interest in LLMs, many AI companies have been showcasing LLMs developed and trained in-house or building generative AI based on open source and launching related services, thus gradually expanding the market.

Amidst these developments, Upstage took first place on the Open LLM Leaderboard at Hugging Face, the world's largest open-source AI model platform. The leaderboard can essentially be described as the 'Billboard chart' of the AI sector: a competitive arena where over 300 AI models developed by technology companies and research institutions around the world are continuously updated and compete fiercely on performance.

Upstage's achievement of world number one on the Open LLM Leaderboard marks a significant milestone in establishing its technological leadership in the LLM market. With great anticipation for the bright future of Upstage's LLM, in the ninth edition of Starview we met Mr. Kim Sang-hoon and Mr. Song Won-ho, the key figures behind reaching the top of Hugging Face's Open LLM Leaderboard.

Q. Nice to meet you. Please introduce yourselves.

Kim Sang-hoon: Hello, I'm Kim Sang-hoon, the technical leader in the Foundation Model part of the LLM engine team.

Song Won-ho: Hello, I'm Song Won-ho, working on model development in the LLM engine team.

Q. We're curious about the formation of the LLM TF and your motivation to test your mettle on Hugging Face.

Kim Sang-hoon: The LLM TF started in June this year with the goal of verifying whether we could train a Korean LLM in-house. During this process, we decided to join the battle on Hugging Face's Open LLM Leaderboard.

The Open LLM Leaderboard, where models such as Meta's LLaMA and LLaMA-2, Alpaca, Vicuna, WizardLM, and Stability AI's models compete for ranking, is designed to track, rank, and evaluate open-source language models, providing an objective measure of the current state of the art. Hence, achieving first place there was an important indicator of our technical prowess.

Song Won-ho: LLM was a big wave, and many companies jumped on it. We prepared our TF so that we could enjoy riding this wave. We wondered how to onboard quickly and effectively. We all had experience in competitions and knew we could learn a lot quickly through them, which led us to try for number one on Hugging Face's leaderboard. We thought that if we’re going to give it a shot, we should aim for global domination!

 

Q: We're curious who your teammates were for the Hugging Face challenge.

Song Won-ho: Eight of us, including myself, CTO Lee Hwal-Seok, Min-Ji, Sang-Hoon, Yoon-Soo, Ducky, Chan-Joon, and Chloe, got together and kicked off the TF. The CTO and Min-Ji mainly handled management, while Sang-Hoon led the model development. Yoon-Soo, Ducky, and I worked in sync with Sang-Hoon on the modeling. Chan-Joon contributed significantly to our technical deliberations, and Chloe enriched our modeling by taking part in various discussions.

Q. You achieved the amazing feat of beating the big names of big tech with a lightweight 30B model. Could you elaborate on the role of parameters in a model and the significance of this achievement?

“In July, Upstage unveiled a 30 billion (30B) parameter model through Hugging Face, scoring an average of 67 points. This score surpassed Meta's ‘LLaMA 2’ 70 billion (70B) parameter model released on the same day, achieving the remarkable feat of becoming the first Korean LLM to rank number one.

Following this, Upstage introduced a model fine-tuned with even more data on top of the latest LLaMA 2 70 billion (70B) parameter model, aiming to solidify its position as the global leader. As a result, the newly revealed Upstage 70B model scored 72.3 points on the leaderboard, surpassing the ‘Stable Beluga 2’ model from U.S.-based Stability AI (71.4 points), which had held the top spot since the release of LLaMA 2, thereby reclaiming the world number one position.

Notably, Upstage's latest model achieved an unprecedented feat by surpassing the benchmark score of GPT-3.5 (71.9 points), the model underlying ChatGPT. This is the first time a domestic startup's smaller LLM has surpassed the score of GPT-3.5, known as the epitome of generative AI models, demonstrating that a small LLM developed by a local startup can compete on equal footing with the super-large models of global big tech companies."

— Upstage, surpassing ChatGPT to establish itself as the 'World's Best LLM' (23.08.01)

Kim Sang-hoon: OpenAI described 'scaling laws' in a paper released in 2020: the principle that a model's performance improves as the number of parameters and the amount of training data increase.
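As a rough illustration, the paper fits each factor to a power law. The sketch below gives the functional form, with exponent values taken approximately from the paper:

```latex
% Test loss as a power law in model size N and dataset size D
% (Kaplan et al., 2020). The exponents are small, so loss improves
% slowly but predictably as N and D grow.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},
\quad \text{with } \alpha_N \approx 0.076,\ \alpha_D \approx 0.095 .
```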

Scaling laws graph from “Scaling Laws for Neural Language Models”

The 30B model we introduced first was fine-tuned from Meta's LLaMA-1. Later, we fine-tuned the 70B model from LLaMA-2, surpassing the 175B-scale GPT-3.5. This has significant implications: it demonstrates that with high-quality data and optimized training methods, smaller models can surpass the performance of larger ones. We believe that smaller LLMs will surpass the performance of ChatGPT in the future.

The Upstage model ranked first on the Hugging Face Open LLM Leaderboard

Q. What was the driving force behind achieving number one on the Hugging Face Open LLM Leaderboard in just two months after starting to build your own model?

Kim Sang-hoon: A major driving force was that Upstage has the largest number of top-tier Kaggle competitors in Korea, as well as members who have presented papers at numerous international conferences. Our members' backgrounds and experience were key contributors to our success.

From the left, the image shows Upstage's past Kaggle awards and Upstage's paper 'DMOps', accepted at the DMLR workshop at ICML 2023, a leading venue for 'Data-Centric AI'

Song Won-ho: We created an in-house leaderboard in the style of Kaggle, which fostered a healthy competition among team members to improve their scores. This approach gave us the drive to develop models quickly. We shared datasets, hyperparameters, and model development ideas, which helped us to rapidly create high-performance models. Additionally, many people participated in sharing the latest research and sparking diverse opinions that enriched the conversation. I think the vast array of different ideas led to good results.

 

Q: During the various experiments, what was your main focus?

Kim Sang-hoon: To raise our leaderboard score, we had to consider four key benchmarks: a reasoning challenge (ARC), commonsense inference (HellaSwag), a multi-domain knowledge test (MMLU), and truthfulness of answers (TruthfulQA, which measures resistance to AI hallucination). However, before fine-tuning, the LLM had a very low TruthfulQA score, around the 40s, which gave us considerable room for improvement. That score drops sharply if the model is fit too closely to the training data, so we focused on hyperparameter tuning to prevent this.
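For context, the headline leaderboard number at the time was the simple average of those four benchmark scores. The minimal sketch below shows the calculation; the individual scores are illustrative placeholders chosen so the average works out to the 72.3 points mentioned above, not the team's actual per-benchmark results.

```python
# Minimal sketch: the Open LLM Leaderboard's headline score is the average of
# four benchmark scores. The numbers below are illustrative placeholders only.
scores = {
    "ARC": 71.1,         # reasoning challenge
    "HellaSwag": 87.9,   # commonsense inference
    "MMLU": 70.3,        # multi-domain knowledge test
    "TruthfulQA": 59.9,  # truthfulness / hallucination resistance
}

average = sum(scores.values()) / len(scores)
print(f"Leaderboard average: {average:.1f}")  # -> 72.3 with these placeholders
```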

Song Won-ho: LLMs undergo pretraining on a large corpus, learning various kinds of knowledge in the process. In my experiments, I thought about how to draw out as much of that previously learned knowledge as possible. In simpler terms, the current model knows a lot, but it might not know how to 'speak' properly. So I focused on driving development in a direction that effectively teaches it how to 'speak' without forgetting what it already knows.
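One common way to teach a pretrained model to 'speak' while limiting how much it forgets (not necessarily the approach the team used) is parameter-efficient fine-tuning, where most of the pretrained weights stay frozen. Below is a minimal sketch using the Hugging Face transformers and peft libraries; the base model name and hyperparameters are illustrative only.

```python
# Minimal sketch of parameter-efficient instruction fine-tuning with LoRA.
# Most base weights stay frozen, which helps preserve pretrained knowledge.
# The base model and hyperparameters are illustrative, not the team's setup.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Low-rank adapters on the attention projections; only these small matrices train.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction of weights are trainable

# Supervised fine-tuning on instruction data then proceeds with a standard
# training loop at a conservative learning rate.
```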

Q: Recently, Upstage launched the '1T Club' for Korean data. Can you introduce this project, which could be the first step towards Korea's independence in LLM?

Kim Sang-hoon: Looking at the performance of Meta's LLaMA, it's clear that at least 1T, and ideally 2T, tokens of training data are necessary. While 2T tokens of English data can be obtained online, securing enough Korean data poses a big challenge due to licensing and other issues. This was the primary reason for launching the 1T Club. Through this club, we aim to gather enough Korean data to train a Korean LLM with performance comparable to ChatGPT.

Song Won-ho: To train an LLM, a large corpus is first used for pretraining, requiring on the order of trillions of tokens (1T to 2T). The quality of the data is extremely important. For example, when teaching a three-year-old child to read, would it be better to read classic literature or YouTube comments? I would choose classic literature. Similarly, training an LLM on high-quality data is crucial, and in this context, the 1T Club is the first step towards creating a smarter model than any other LLM.
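For a rough sense of scale, here is a back-of-the-envelope estimate of how much raw text a 1T-token corpus represents; the bytes-per-token figure is an assumption (a common rule of thumb for English text) and varies by language and tokenizer.

```python
# Rough estimate of raw text volume for a 1-trillion-token pretraining corpus.
# BYTES_PER_TOKEN is an assumption; it varies by language and tokenizer
# (Korean text in UTF-8 typically costs more bytes per token than English).
BYTES_PER_TOKEN = 4
tokens = 1_000_000_000_000  # 1T tokens

raw_bytes = tokens * BYTES_PER_TOKEN
print(f"~{raw_bytes / 1e12:.0f} TB of raw text")  # roughly 4 TB for 1T tokens
```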

 

Q: What are your plans for the second half of this year?

Kim Sang-hoon: This year, we mainly plan to focus on developing a Korean LLM. Initially, we aim to retrain Meta's LLaMA into a Korean version. Through this process, we plan to explore the optimal dataset, hyperparameters, and preprocessing methods, and eventually train a completely new Korean LLM.

Song Won-ho: Currently, there are many LLMs proficient in English, but only a few excel in Korean. The best LLMs for Korean are arguably ChatGPT or GPT-4. Our short-term goal is to develop an LLM that performs significantly better in Korean than ChatGPT. In particular, it's important not just to excel in Korean; developing an LLM proficient in both Korean and English will be crucial. If we succeed, it will disrupt the private Korean LLM market, which is quite exciting!

The Way We Work, Upstage Way

Q: Please share the important Upstage Way and practical know-how you practice.

Kim Sang-hoon: My key Upstage Ways are 'One step more' and 'Sharing'. As a Kaggle competitor aiming to win on the leaderboard, I continually improve my models' performance by taking 'One step more.' Also, by combining ideas with team members, I aim to create more high-performance models and share my model ideas for this purpose. To effectively share, I try to record as much as possible on the leaderboard’s model cards.

Song Won-ho: To me, the most important aspect of the Upstage Way is ‘One step more’. In model development, every minor detail can contribute to a difference in model performance, and this performance ultimately determines whether the AI service is viable or not. Therefore, it’s crucial to constantly think about and discuss ways to improve, and the ‘One step more’ mentality helps me do just that.

When you conclude an experiment, don’t stop there. It’s important to consider a variety of variables and continue with additional experiments. It’s not just about taking 'One more step,' but connecting these steps into ‘a flight of steps.' Making this happen requires a strong unwavering motivation.

I try to align the growth of the company with my personal growth. Because of this effort, I know the extra steps I take make a difference for me and the company.

 

Q: Do you have any messages for your other stars at Upstage?

Kim Sang-hoon: Upstage is filled with talented individuals, and there is something to learn from all the stars. The passion and expertise of these stars have greatly contributed to my personal growth. I look forward to continuing to work together!

Song Won-ho: It's my greatest pleasure to work with such outstanding colleagues at Upstage. Fasten your seatbelts before Upstage launches. Let's aim for the stars!
