Flagship open-source genomic LLM. State-of-the-art on GeneTuring (78.9%), ClinVar VUS classification (82.4%), and Gene-Disease Association (84.1%). Trained on ClinVar, NCBI, OMIM, gnomAD, and 8M+ PubMed genomics papers. Supports VCF input and clinical report output.
State-of-the-art biomedical LLM achieving 91.2% on MedQA and 89.4% on USMLE Steps 1–3. Trained on 42M+ PubMed abstracts with DPO alignment using 120K physician-curated preference pairs. Surpasses GPT-4 on 7 of 9 medical benchmarks.
Compact, deployable biomedical LLM for on-premise and edge clinical systems. Achieves 82.4% on MedQA. Same training corpus as the 70B variant, distilled for speed. GGUF and AWQ quantized variants available. Runs on a single NVIDIA RTX 4090.
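A minimal sketch of the single-GPU deployment path this entry describes, using llama-cpp-python to load a GGUF file; the model filename and prompt are placeholders, not published artifacts.

```python
# Minimal sketch: running a GGUF-quantized biomedical model locally with
# llama-cpp-python. The model path below is a placeholder, not a published
# checkpoint name.
from llama_cpp import Llama

llm = Llama(
    model_path="./biomed-llm-8b.Q4_K_M.gguf",  # hypothetical quantized file
    n_ctx=4096,          # context window
    n_gpu_layers=-1,     # offload every layer to the single GPU
)

out = llm(
    "Question: Which gene is most commonly mutated in cystic fibrosis?\nAnswer:",
    max_tokens=64,
    temperature=0.0,     # deterministic output for clinical-style QA
)
print(out["choices"][0]["text"].strip())
```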
Clinical decision support model engineered for differential diagnosis generation and evidence-based treatment recommendations. Fine-tuned on de-identified EHR data, clinical pathways, SOAP notes, and clinical guidelines. 86.2% on clinical diagnosis accuracy benchmark.
Specialized for genomic sequence analysis and variant interpretation. Trained on NCBI, Ensembl, ClinVar, and OMIM datasets. 87% accuracy on clinical variant classification. Supports VCF input and automated ACMG variant classification output in structured JSON.
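As a hedged illustration of the VCF-in / JSON-out workflow this entry describes, the sketch below parses variant records with the standard library and builds a structured-output prompt; the `parse_vcf_records` and `build_prompt` helpers, the JSON schema, and the file name are assumptions for illustration, not the model's documented interface.

```python
# Sketch of a VCF -> prompt pipeline for an LLM that emits ACMG classifications
# as structured JSON. The JSON schema shown is illustrative, not the model's
# documented output format.
import json

def parse_vcf_records(path):
    """Yield (chrom, pos, ref, alt) tuples from a plain-text VCF."""
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):      # skip header and meta lines
                continue
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            yield chrom, int(pos), ref, alt

def build_prompt(chrom, pos, ref, alt):
    schema = {
        "variant": f"{chrom}-{pos}-{ref}-{alt}",
        "acmg_classification": "<Pathogenic|Likely pathogenic|VUS|Likely benign|Benign>",
        "criteria": ["<e.g. PVS1, PM2>"],
        "rationale": "<one-sentence justification>",
    }
    return (
        "Classify the following variant under ACMG/AMP guidelines and answer "
        f"with JSON matching this schema:\n{json.dumps(schema, indent=2)}\n"
        f"Variant: {chrom}:{pos} {ref}>{alt}"
    )

for record in parse_vcf_records("variants.vcf"):
    print(build_prompt(*record))   # send each prompt to the model's API
```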
ADMET property prediction, SMILES-to-property conversion, molecular optimization, and compound-target interaction modeling. Trained on 10M+ compounds from ChEMBL and PubChem. Top performance on 14 ADMET benchmarks from TDC.
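Before querying any SMILES-consuming model, it helps to validate and canonicalize inputs; below is a small sketch using RDKit, where the prompt wording is illustrative and not this model's API.

```python
# Sketch: sanity-check and canonicalize SMILES with RDKit before sending them
# to an ADMET prediction model. The prompt format is illustrative only.
from rdkit import Chem

def canonical_smiles(smiles: str) -> str | None:
    """Return RDKit-canonical SMILES, or None if the string does not parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

compounds = ["CC(=O)Oc1ccccc1C(=O)O",   # aspirin
             "not-a-smiles"]            # rejected by the parser

for s in compounds:
    canon = canonical_smiles(s)
    if canon is None:
        print(f"skipping unparseable SMILES: {s!r}")
        continue
    print(f"Predict ADMET properties (logP, hERG, CYP450 inhibition) for: {canon}")
```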
Multimodal LLM fusing pathology slide images and clinical text. Supports histopathology analysis, tumor grading, and automated radiology report generation with visual grounding. Based on LLaVA-Med architecture.
🧬 Foundation DNA Models
Large-scale genomic language models trained on DNA sequences across all domains of life.
Massive-scale foundation model trained on the OpenGenome2 dataset (8.8 trillion bases) across all domains of life: bacteria, archaea, eukarya, and phage. Achieves zero-shot prediction of the functional impact of genetic variation, including noncoding pathogenic mutations and clinically significant BRCA1 variants. Published in Nature 2026.
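A conceptual sketch of likelihood-based zero-shot variant scoring, the general recipe behind claims like this one; `log_likelihood` is a placeholder for whatever scoring API a given model exposes, and the toy scorer exists only so the snippet runs.

```python
# Conceptual sketch of zero-shot variant-effect scoring with a generative DNA
# model: compare sequence log-likelihoods with and without the variant.
# `log_likelihood` is a placeholder for the model's actual scoring API.

def delta_score(ref_seq: str, alt_seq: str, log_likelihood) -> float:
    """Negative values suggest the variant makes the sequence less 'genome-like'."""
    return log_likelihood(alt_seq) - log_likelihood(ref_seq)

# Toy stand-in scorer so the sketch runs end to end: penalizes each 'T'.
toy_ll = lambda s: -s.count("T")
print(delta_score("ACGGA", "ACTGA", toy_ll))   # -1.0
```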
Human-centric genomic foundation model optimized for million-base-pair sequences using a Mixture of Experts (MoE) architecture. Trained on 636 high-quality human de novo assemblies representing diverse global populations. Achieves single-nucleotide precision over a 1 Mb context and excels at clinical inference tasks including ClinVar pathogenicity prediction (~0.93 AUC).
Family of multi-species DNA foundation models (up to 2.5B parameters) developed by InstaDeep in collaboration with NVIDIA and TUM. The NT-v2 series introduces rotary positional embeddings, SwiGLU activations, and an extended 12 kb context. Provides high-accuracy zero-shot embeddings for regulatory element detection, chromatin accessibility and splice site prediction.
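A minimal sketch of extracting the zero-shot embeddings mentioned above via Hugging Face transformers; the checkpoint id is assumed from InstaDeep's public releases and should be verified before use.

```python
# Sketch: zero-shot sequence embeddings from a Nucleotide Transformer v2
# checkpoint via Hugging Face transformers. The repo id is assumed from
# InstaDeep's published checkpoints; verify before use.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo = "InstaDeepAI/nucleotide-transformer-v2-50m-multi-species"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)

seq = "ATGGCGTACGATCGTAGCTAGCTAGGCTA"
inputs = tok(seq, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]

# Mean-pool over tokens -> one fixed-size vector per sequence.
embedding = hidden.mean(dim=1)
print(embedding.shape)
```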
Efficient multi-species genome foundation model (ICLR 2024) that replaces k-mer tokenization with BPE and uses Attention with Linear Biases (ALiBi) for position encoding. Achieves ~56× lower compute than comparable models while outperforming them on 23 of 28 GUE benchmark tasks. Particularly strong at splice site prediction. Pairs with the GUE benchmark suite.
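To make the tokenization change concrete, the sketch below contrasts fixed overlapping k-mers with DNABERT-2's learned BPE vocabulary; the Hugging Face repo id is assumed from the authors' release.

```python
# Sketch: fixed k-mer tokenization vs. the learned BPE vocabulary used by
# DNABERT-2. The Hugging Face repo id is assumed from the authors' release.
from transformers import AutoTokenizer

def kmers(seq: str, k: int = 6):
    """Overlapping fixed-length k-mers, the scheme BPE replaces."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "ATCGGCTAGCTAGGATCGA"
print(kmers(seq)[:4])         # ['ATCGGC', 'TCGGCT', 'CGGCTA', 'GGCTAG']

tok = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M",
                                    trust_remote_code=True)
print(tok.tokenize(seq))      # variable-length BPE tokens instead
```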
Long-context genomic foundation model (Stanford / Harvard) that replaces standard attention with Hyena operators, a subquadratic drop-in replacement enabling sequences up to 1 million tokens at single-nucleotide resolution. Trains 160× faster than Flash Attention at 1M sequence length. Sets state-of-the-art results on 23 downstream tasks, including regulatory element and chromatin profile prediction.
First family of RC-equivariant bidirectional long-range DNA language models built on Mamba (SSM) architecture. Introduces BiMamba and MambaDNA blocks that model long-range genomic dependencies while maintaining computational efficiency. Caduceus-Ph uses RC data augmentation; Caduceus-PS is inherently RC-equivariant via parameter sharing. Outperforms 10× larger Transformer models on long-range variant effect prediction.
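A small, self-contained sketch of the reverse-complement operation behind RC equivariance; the equivariance statement in the comment paraphrases the property informally rather than quoting the paper's formulation.

```python
# Sketch of the reverse-complement (RC) operation underlying RC equivariance.
# An RC-equivariant model f satisfies f(rc(x)) == rc-transformed f(x), so
# predictions do not depend on which DNA strand is presented.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

seq = "ATGGCGTA"
rc = reverse_complement(seq)
print(rc)                                  # TACGCCAT
assert reverse_complement(rc) == seq       # RC is an involution
```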
🔬 RNA & Regulatory Models
Models focused on RNA biology, splicing, gene regulation, and single-cell transcriptomics.
Knowledge-informed cross-species foundation model pre-trained on 120+ million human and mouse single-cell transcriptomes. Integrates four types of biological prior knowledge (GRN, promoter sequences, gene family annotations, co-expression networks) to decipher universal gene regulatory mechanisms. Published in Cell Research 2024.
A 32-layer deep residual convolutional neural network specialized in predicting mRNA splicing directly from DNA sequences. Predicts splice donor and acceptor sites, canonical and cryptic, with single-nucleotide resolution. Trained on GENCODE annotations (hg38). Available as a pip package, with precomputed scores for all SNVs and indels in the human genome.
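A hedged sketch of driving the pip-installed command-line tool from Python; the flag names follow SpliceAI's documented interface, while all file paths are placeholders.

```python
# Sketch: invoking the SpliceAI command-line tool (installed via
# `pip install spliceai`) from Python. Flag names follow the tool's
# documented interface; file paths are placeholders.
import subprocess

subprocess.run(
    [
        "spliceai",
        "-I", "input.vcf",        # variants to score
        "-O", "scored.vcf",       # output VCF annotated with delta scores
        "-R", "hg38.fa",          # reference genome FASTA
        "-A", "grch38",           # annotation build
    ],
    check=True,
)
# Each output record gains donor/acceptor gain-and-loss delta scores per gene.
```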
First unsupervised multiple sequence alignment (MSA)-based RNA language model. Uses MSA-Transformer architecture to capture evolutionary information from homologous RNA sequences. Attention maps directly correlate with RNA secondary structure (base pairing) and 1D solvent accessibility without supervised training. Published in Nucleic Acids Research 2024.
Foundation model for single-cell biology pre-trained on 33+ million cells using a generative transformer. Excels at cell type annotation, multi-batch and multi-omic integration, perturbation response prediction, and gene network inference. Supports zero-shot and fine-tuned applications. Published in Nature Methods 2024.
Attention-based foundation model trained on ~104 million human single-cell transcriptomes (V2) for large-scale gene network analysis. Uses rank-value encoding of gene expression. Enables in-silico perturbation, dosage sensitivity prediction, and network biology tasks via zero-shot and fine-tuned inference. Several of its transcription factor predictions have been experimentally validated. Published in Nature 2023.
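A toy sketch of the rank-value encoding mentioned above; the normalization here is deliberately simplified (the published pipeline scales counts by per-gene medians over the training corpus).

```python
# Sketch of rank-value encoding: a cell's expression vector becomes a
# sequence of gene tokens ordered from highest to lowest expression.
# Normalization is simplified here; the published pipeline scales counts
# by per-gene medians computed over the pre-training corpus.
import numpy as np

genes = np.array(["GATA1", "TP53", "ACTB", "SPI1", "CD34"])
counts = np.array([5.0, 0.0, 120.0, 8.0, 40.0])

nonzero = counts > 0                       # zero-count genes are dropped
order = np.argsort(-counts[nonzero])       # descending expression
rank_tokens = genes[nonzero][order]
print(rank_tokens.tolist())                # ['ACTB', 'CD34', 'SPI1', 'GATA1']
```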
🧩 Specialized & Domain-Specific Models
Targeted models for plant genomes, clinical reasoning, and research-grade generative tasks.
BERT-based model specifically pre-trained on plant genome sequences using BPE tokenization. Designed for agricultural research applications including promoter prediction and regulatory element identification in crop species. Related to InstaDeep's AgroNT (1B), which was trained on 48 plant species for gene expression tasks.
Medical-specialized versions of Gemma 3 tuned for strong performance on medical text and image comprehension. Covers genomics-adjacent tasks including clinical QA, diagnostic reasoning, chest X-ray interpretation, and cancer genomics (TCGA data). MedGemma 1.5 4B is the latest multimodal instruction-tuned release. Requires agreement to the Health AI Developer Foundations terms.
Large-scale reasoning model widely adopted in biotech research for complex differential diagnosis and analytical tasks in genomics. Its chain-of-thought reasoning and strong scientific comprehension make it valuable for interpreting variant reports, literature synthesis, and integrating multi-omics evidence. Community fine-tunes exist for biomedical specialization.
Frontier-scale multimodal reasoning model used for analyzing visual scientific data and molecular structures in genomic research. Applied in workflows requiring interpretation of structural biology figures, omics charts, and complex multi-step genomic analyses. 235B total params with MoE architecture.
📊 Benchmarks & Additional Foundation Models
Performance leaders on NT-Bench / GenBench plus key generative and functional genomics models.
Generative genomic foundation model using a transformer decoder architecture with 98k nucleotide context, pre-trained on 386 billion nucleotides of eukaryotic DNA from RefSeq. Achieves state-of-the-art on Genomic Benchmarks and NT tasks while being significantly faster and more accessible than earlier large-scale models. GENERator-v2 released March 2026.
Family of foundational DNA language models (AIRI Institute) supporting sequences up to 36 kb. Offers multiple pre-training recipes: standard MLM, sparse BigBird attention, and pre-training on the T2T human genome assembly with SNP augmentation. All variants use BPE tokenization. Pre-trained models are available for human, multi-species, and yeast genomes. Published in Nucleic Acids Research 2025.
Genome Rules Obtained Via Extracted Representations: a DNA language model using an optimized BPE vocabulary (600 cycles) derived from the human genome for frequency-balanced tokenization. Designed specifically for identifying functional elements in the human genome. Achieves strong performance on CTCF binding, promoter, and splice site prediction. Published in Nature Machine Intelligence 2024.
Transformer-based model from Google DeepMind that predicts gene expression and chromatin states directly from 131 kb DNA sequences. Integrates long-range interactions via attention and achieves a Pearson R of 0.625 on human validation data. Widely used for regulatory variant effect prediction and as a baseline in AlphaGenome benchmarks. Published in Nature Methods 2021.
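A minimal sketch of the one-hot input encoding conventionally used by Enformer-style sequence models; the (length, 4) layout over A/C/G/T with ambiguous bases left all-zero is standard practice, stated here as an assumption rather than this model's exact preprocessing code.

```python
# Sketch: one-hot encoding DNA for Enformer-style models, which consume
# fixed-length sequence windows as (length, 4) arrays over A/C/G/T.
import numpy as np

def one_hot(seq: str) -> np.ndarray:
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in idx:             # N and other ambiguity codes stay all-zero
            out[i, idx[base]] = 1.0
    return out

print(one_hot("ACGTN"))
```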
Versatile unifying model from Google DeepMind that predicts the functional impact of DNA variations across multiple biological modalities (gene expression, splicing, chromatin accessibility, histone modifications, TF binding, and contact maps), all from 1 Mb sequences at single-base resolution. Matches or outperforms best external models on 25 of 26 variant effect prediction benchmarks. Published in Nature 2026.