Latest News


PZLAST: an ultra-fast service that enables homology searches against a huge amount of publicly available metagenomic data

2021.10.07

A joint research group of the National Institute of Genetics (Hiroshi Mori, Head of the Data Analysis Division, Advanced Genomics Center), RIKEN, and PEZY Computing (President and CEO Kazuo Takahashi) announced on July 12 that it has developed "PZLAST", a publicly accessible Web service that enables extremely fast and highly accurate homology searches of amino acid sequences against amino acid sequence data predicted from a huge number of genome fragments. The service is expected to contribute to understanding the dynamics of specific genes, such as drug resistance factors, and of viruses in the environment, as well as to the discovery of new functional genes.

Microbiome research, the study of the genomes of bacterial populations in the environment, has developed rapidly. The International Nucleotide Sequence Database Collaboration operated by the National Institute of Genetics Bioinformation and DDBJ Center is accumulating genome fragment sequence (metagenome sequence) data of the microbial populations that are the main constituents of microbiota. These data contain a large amount of unknown genetic information from unidentified and uncultured bacteria, and are expected to include genes with new functions and useful factors.

Until now, however, the sheer volume of stored genetic information made it impossible to run homology searches against these data in a short time. To solve this problem, the research group developed "PZLAST", which enables fast and highly accurate homology searches of amino acid sequences against gene amino acid sequence data (about 2.5 TB) predicted from publicly available metagenomic data. The Web service runs on the MIMD-type multi-core processor "PEZY-SC2". Unlike the GPU, an image-processing processor (SIMD type) commonly used for general scientific and technical computing, this processor can execute different instructions on different data. "PZLAST" takes advantage of this feature and achieves fast, highly accurate sequence homology searches by distributing instructions to multiple "PEZY-SC2" processors and performing massively parallel computation with a large number of threads (15,872) in each processor.
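The general idea of spreading independent searches over many workers can be illustrated with a small Python sketch. This is not the PZLAST algorithm or its PEZY-SC2 implementation, neither of which is described here. The scoring function (a count of shared 3-residue substrings), the function names, and the use of a thread pool are all assumptions chosen only to keep the example short and runnable.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq, k=3):
    """Return the set of length-k substrings of an amino acid sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_hit(query, references, k=3):
    """Return (reference_id, shared_kmer_count) for the best-scoring reference.

    The shared k-mer count is a toy similarity score (an assumption for
    illustration), not a real alignment score.
    """
    q = kmers(query, k)
    scored = (
        (ref_id, len(q & kmers(ref_seq, k)))
        for ref_id, ref_seq in references.items()
    )
    return max(scored, key=lambda pair: pair[1])


def search_all(queries, references, workers=4):
    """Search every query against the references, one task per query.

    Each query is independent of the others, so the tasks can be handed
    to a pool of workers and run concurrently.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = pool.map(lambda q: best_hit(q, references), queries.values())
        return dict(zip(queries.keys(), hits))


if __name__ == "__main__":
    references = {"refA": "MKTAYIAKQR", "refB": "GSHMLEDPVV"}
    queries = {"q1": "KTAYIAK", "q2": "HMLEDP"}
    print(search_all(queries, references))
```

Because no query depends on the result of another, the same pattern scales from a handful of workers to many thousands of hardware threads, which is the property the article attributes to the massively parallel design.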

Supercomputer "Satsuki" at RIKEN used for PZLAST (photo taken by Prof. Kurokawa, a member of the research group)
Credit: National Institute of Genetics

The "PZLAST" homology search uses, as its reference sequences, amino acid sequence data obtained from "MicrobeDB.jp", the integrated microbial database developed and operated by the National Institute of Genetics. The reference data is a set of genes predicted from 4,339 samples, with a volume of approximately 2.5 TB. A user can submit up to 10,000 sequences in one search, and one search takes approximately 10 minutes. The search results summarize the top 10,000 hits by MEO class, FMA class, and geographical distribution, and are then visualized in a web browser.
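The summarization step described above, keeping the top hits and counting them per category, might look roughly like the following Python sketch. The record fields ("score", "meo_class") and the function name are hypothetical, and the actual output format of PZLAST is not specified in this article.

```python
from collections import Counter


def summarize_top_hits(hits, top_n=10_000, key="meo_class"):
    """Count the top_n highest-scoring hits per category.

    `hits` is a list of dicts; the field names used here are hypothetical
    placeholders, not the real PZLAST result schema.
    """
    top = sorted(hits, key=lambda h: h["score"], reverse=True)[:top_n]
    return Counter(h[key] for h in top)


if __name__ == "__main__":
    example_hits = [
        {"score": 90, "meo_class": "soil"},
        {"score": 80, "meo_class": "marine"},
        {"score": 70, "meo_class": "soil"},
        {"score": 10, "meo_class": "gut"},
    ]
    # Keep only the three best hits, then count them per class.
    print(summarize_top_hits(example_hits, top_n=3))
```

The same function could be called with a different `key`, for example a hypothetical "fma_class" or "country" field, to produce the other summaries mentioned in the article.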

This article has been translated by JST with permission from The Science News Ltd. (https://sci-news.co.jp/). Unauthorized reproduction of the article and photographs is prohibited.
