Latest News

NII develops new domestic LLMs trained on a high-quality corpus of 12 trillion tokens

2026.05.21

The Research and Development Center for Large Language Models (LLMC) at the National Institute of Informatics (NII) has developed new domestic LLMs: the "LLM-jp-4 8B model" (hereafter "4 8B model"), with approximately 8.6 billion parameters, and the "LLM-jp-4 32B-A3B model" (hereafter "4 32B-A3B model"), an MoE model with approximately 32 billion parameters. Both were released to the public under an open-source license on April 3.

Category-wise evaluation of representative LLMs using llm-jp-eval.
Provided by NII

These LLMs were trained from scratch as part of the activities of "LLM-jp," an LLM research and development community led by the center. They can process inputs and outputs of up to approximately 65,000 tokens. Development ran on "ABCI 3.0," the AI Bridging Cloud Infrastructure provided by AIST (National Institute of Advanced Industrial Science and Technology).

In terms of model architecture, the "4 8B model" uses the Llama 2 architecture, whereas the "4 32B-A3B model" uses the Qwen 3 MoE architecture.

For the training corpus, the community collected, curated, and built a high-quality corpus that is available to third parties, with consideration for the Open-Source AI Definition (OSAID). The resulting corpus is approximately six times the size of the one used for the "LLM-jp-3.1" series previously developed and released by the community.

For pre-training, a large-scale corpus consisting of publicly available internet data and government and parliamentary documents was used.

This corpus totals approximately 19.5 trillion tokens: roughly 700 billion tokens in Japanese, 17.8 trillion in English, 850 billion in other languages (Chinese and Korean), and 200 billion in program code.
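As a quick check on the breakdown above, the following minimal sketch sums the rounded sub-corpus figures reported in the article and computes each one's share. The variable names are illustrative, and because the inputs are rounded, the results are only approximate.

```python
# Approximate sub-corpus sizes in trillions of tokens, as reported in the article.
# The figures are rounded, so the sum and shares below are approximate too.
corpus_tokens_trillions = {
    "Japanese": 0.7,
    "English": 17.8,
    "Other languages (Chinese and Korean)": 0.85,
    "Program code": 0.2,
}

# Sum of the rounded figures (the article gives "approximately 19.5 trillion").
total = sum(corpus_tokens_trillions.values())

# Share of each sub-corpus in the collected corpus.
shares = {name: count / total for name, count in corpus_tokens_trillions.items()}

print(f"Sum of rounded figures: {total:.2f} trillion tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

On these rounded figures, English makes up roughly 91% of the collected corpus and Japanese under 4%, which helps explain why the weighting of each sub-corpus matters for a Japanese-focused model.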

The weighting of each sub-corpus was optimized through experiments, and a total of approximately 10.5 trillion tokens was used for pre-training.

The subsequent intermediate training used a training corpus totaling 1.2 trillion tokens, which was built by adding synthetic data generated by LLMs, including instruction pre-training data, to the pre-training corpus. Combined, the pre-training and intermediate training utilized a corpus of approximately 12 trillion tokens.
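The "approximately 12 trillion" figure follows from adding the two training stages. A minimal sketch of that arithmetic, using only the rounded numbers reported in the article (the constant names are illustrative):

```python
# Token budgets in trillions, as reported in the article (rounded values).
PRETRAIN_TOKENS_T = 10.5      # pre-training
MID_TRAIN_TOKENS_T = 1.2      # intermediate training

# Combined budget across both stages.
combined = PRETRAIN_TOKENS_T + MID_TRAIN_TOKENS_T

# Fraction of the combined budget spent on intermediate training.
mid_train_share = MID_TRAIN_TOKENS_T / combined

print(f"Combined: {combined:.1f} trillion tokens (about {round(combined)} trillion)")
print(f"Intermediate training share: {mid_train_share:.1%}")
```

The sum comes to about 11.7 trillion tokens, which rounds to the 12 trillion quoted in the headline; intermediate training accounts for roughly a tenth of that budget.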

Tuning was performed using 22 types of English and Japanese instruction-tuning data. The training data includes data released under open-source licenses, as well as data developed by the community (to be released in due course). To evaluate the developed models, an LLM-as-a-Judge evaluation with GPT-5.4 as the judge was conducted using "LLM-jp-judge," an evaluation framework developed by the community.

As a result, on "Japanese MT-Bench," which measures Japanese understanding performance, the "4 8B model" achieved a score of 7.54 and the "4 32B-A3B model" reached 7.82. These scores surpassed "GPT-4o" (7.29), "gpt-oss-20b" (7.33), and "Qwen 3-8B" (7.14).

Furthermore, on "MT-Bench," which measures English understanding performance, the "4 8B model" achieved 7.79 and the "4 32B-A3B model" reached 7.88. This performance was comparable to or higher than that of "GPT-4o" (7.69), "gpt-oss-20b" (7.85), and "Qwen 3-8B" (7.69).
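To make the comparisons on both benchmarks easier to read, the sketch below tabulates the score margins (model minus baseline) from the figures quoted in the article. The data layout and the `margins` helper are illustrative assumptions, not part of "LLM-jp-judge."

```python
# Scores as reported in the article (LLM-as-a-Judge evaluation).
JAPANESE_MT_BENCH = {
    "models": {"LLM-jp-4 8B": 7.54, "LLM-jp-4 32B-A3B": 7.82},
    "baselines": {"GPT-4o": 7.29, "gpt-oss-20b": 7.33, "Qwen 3-8B": 7.14},
}
MT_BENCH = {
    "models": {"LLM-jp-4 8B": 7.79, "LLM-jp-4 32B-A3B": 7.88},
    "baselines": {"GPT-4o": 7.69, "gpt-oss-20b": 7.85, "Qwen 3-8B": 7.69},
}


def margins(benchmark):
    """Return score differences (model minus baseline), rounded to 2 decimals."""
    return {
        model: {
            baseline: round(score - baseline_score, 2)
            for baseline, baseline_score in benchmark["baselines"].items()
        }
        for model, score in benchmark["models"].items()
    }


for name, benchmark in [("Japanese MT-Bench", JAPANESE_MT_BENCH), ("MT-Bench", MT_BENCH)]:
    print(name)
    for model, diffs in margins(benchmark).items():
        print(f"  {model}: {diffs}")
```

On these figures, both models lead every listed baseline on Japanese MT-Bench. On MT-Bench, the 32B-A3B model leads all three baselines, while the 8B model leads GPT-4o and Qwen 3-8B and trails gpt-oss-20b by 0.06.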

Additionally, evaluations were conducted using "llm-jp-eval v2.1.3," a community-developed framework for cross-sectional evaluation that draws on 42 types of evaluation data built from existing Japanese and English language resources.

As a result, it was confirmed that both developed models achieved Japanese language performance equivalent to "gpt-oss-20b" and "Qwen 3-8B." Moving forward, the center will use both developed models to advance research and development aimed at ensuring the transparency and reliability of LLMs. Development is also currently underway for models with a larger number of parameters, which are scheduled for sequential release this fiscal year.

This article has been translated by JST with permission from The Science News Ltd. (https://sci-news.co.jp/). Unauthorized reproduction of the article and photographs is prohibited.
