Flagship open-source genomic LLM. State-of-the-art on GeneTuring (78.9%), ClinVar VUS classification (82.4%), and Gene-Disease Association (84.1%). Trained on ClinVar, NCBI, OMIM, gnomAD, and 8M+ PubMed genomics papers. Supports VCF input and clinical report output.
State-of-the-art biomedical LLM achieving 91.2% on MedQA and 89.4% on USMLE Steps 1–3. Trained on 42M+ PubMed abstracts with DPO alignment using 120K physician-curated preference pairs. Surpasses GPT-4 on 7 of 9 medical benchmarks.
Compact, deployable biomedical LLM for on-premise and edge clinical systems. Achieves 82.4% on MedQA. Same training corpus as the 70B variant, distilled for speed. GGUF and AWQ quantized variants available. Runs on a single NVIDIA RTX 4090.
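A minimal sketch of the single-GPU deployment path this entry describes, using llama-cpp-python to load a GGUF file; the model filename and prompt are placeholders, not published artifacts.

```python
# Minimal sketch: running a GGUF-quantized biomedical model locally with
# llama-cpp-python. The model path below is a placeholder, not a published
# checkpoint name.
from llama_cpp import Llama

llm = Llama(
    model_path="./biomed-llm-8b.Q4_K_M.gguf",  # hypothetical quantized file
    n_ctx=4096,          # context window
    n_gpu_layers=-1,     # offload every layer to the single GPU
)

out = llm(
    "Question: Which gene is most commonly mutated in cystic fibrosis?\nAnswer:",
    max_tokens=64,
    temperature=0.0,     # deterministic output for clinical-style QA
)
print(out["choices"][0]["text"].strip())
```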
Clinical decision support model engineered for differential diagnosis generation and evidence-based treatment recommendations. Fine-tuned on de-identified EHR data, clinical pathways, SOAP notes, and clinical guidelines. 86.2% on clinical diagnosis accuracy benchmark.
Specialized for genomic sequence analysis and variant interpretation. Trained on NCBI, Ensembl, ClinVar, and OMIM datasets. 87% accuracy on clinical variant classification. Supports VCF input and automated ACMG variant classification output in structured JSON.
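As a hedged illustration of the VCF-in / JSON-out workflow this entry describes, the sketch below parses variant records with the standard library and builds a structured-output prompt; the `parse_vcf_records` and `build_prompt` helpers, the JSON schema, and the file name are assumptions for illustration, not the model's documented interface.

```python
# Sketch of a VCF -> prompt pipeline for an LLM that emits ACMG classifications
# as structured JSON. The JSON schema shown is illustrative, not the model's
# documented output format.
import json

def parse_vcf_records(path):
    """Yield (chrom, pos, ref, alt) tuples from a plain-text VCF."""
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):      # skip header and meta lines
                continue
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            yield chrom, int(pos), ref, alt

def build_prompt(chrom, pos, ref, alt):
    schema = {
        "variant": f"{chrom}-{pos}-{ref}-{alt}",
        "acmg_classification": "<Pathogenic|Likely pathogenic|VUS|Likely benign|Benign>",
        "criteria": ["<e.g. PVS1, PM2>"],
        "rationale": "<one-sentence justification>",
    }
    return (
        "Classify the following variant under ACMG/AMP guidelines and answer "
        f"with JSON matching this schema:\n{json.dumps(schema, indent=2)}\n"
        f"Variant: {chrom}:{pos} {ref}>{alt}"
    )

for record in parse_vcf_records("variants.vcf"):
    print(build_prompt(*record))   # send each prompt to the model's API
```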
ADMET property prediction, SMILES-to-property conversion, molecular optimization, and compound-target interaction modeling. Trained on 10M+ compounds from ChEMBL and PubChem. Top performance on 14 ADMET benchmarks from TDC.
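Before querying any SMILES-consuming model, it helps to validate and canonicalize inputs; below is a small sketch using RDKit, where the prompt wording is illustrative and not this model's API.

```python
# Sketch: sanity-check and canonicalize SMILES with RDKit before sending them
# to an ADMET prediction model. The prompt format is illustrative only.
from rdkit import Chem

def canonical_smiles(smiles: str) -> str | None:
    """Return RDKit-canonical SMILES, or None if the string does not parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

compounds = ["CC(=O)Oc1ccccc1C(=O)O",   # aspirin
             "not-a-smiles"]            # rejected by the parser

for s in compounds:
    canon = canonical_smiles(s)
    if canon is None:
        print(f"skipping unparseable SMILES: {s!r}")
        continue
    print(f"Predict ADMET properties (logP, hERG, CYP450 inhibition) for: {canon}")
```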
Multimodal LLM fusing pathology slide images and clinical text. Supports histopathology analysis, tumor grading, and automated radiology report generation with visual grounding. Based on LLaVA-Med architecture.
🧬 Foundation DNA Models
Large-scale genomic language models trained on DNA sequences across all domains of life.
Massive-scale foundation model trained on the OpenGenome2 dataset (8.8 trillion bases) across all domains of life: bacteria, archaea, eukarya, and phage. Achieves zero-shot prediction of the functional impact of genetic variation, including noncoding pathogenic mutations and clinically significant BRCA1 variants. Published in Nature 2026.
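A conceptual sketch of likelihood-based zero-shot variant scoring, the general recipe behind claims like this one; `log_likelihood` is a placeholder for whatever scoring API a given model exposes, and the toy scorer exists only so the snippet runs.

```python
# Conceptual sketch of zero-shot variant-effect scoring with a generative DNA
# model: compare sequence log-likelihoods with and without the variant.
# `log_likelihood` is a placeholder for the model's actual scoring API.

def delta_score(ref_seq: str, alt_seq: str, log_likelihood) -> float:
    """Negative values suggest the variant makes the sequence less 'genome-like'."""
    return log_likelihood(alt_seq) - log_likelihood(ref_seq)

# Toy stand-in scorer so the sketch runs end to end: penalizes each 'T'.
toy_ll = lambda s: -s.count("T")
print(delta_score("ACGGA", "ACTGA", toy_ll))   # -1.0
```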
Human-centric genomic foundation model optimized for million-base-pair sequences using a Mixture of Experts (MoE) architecture. Trained on 636 high-quality human de novo assemblies representing diverse global populations. Achieves single-nucleotide precision over a 1 Mb context and excels at clinical inference tasks including ClinVar pathogenicity prediction (~0.93 AUC).
Family of multi-species DNA foundation models (up to 2.5B parameters) developed by InstaDeep in collaboration with NVIDIA and TUM. The NT-v2 series introduces rotary positional embeddings, SwiGLU activations, and an extended 12 kb context. Provides high-accuracy zero-shot embeddings for regulatory element detection, chromatin accessibility and splice site prediction.
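A minimal sketch of extracting the zero-shot embeddings mentioned above via Hugging Face transformers; the checkpoint id is assumed from InstaDeep's public releases and should be verified before use.

```python
# Sketch: zero-shot sequence embeddings from a Nucleotide Transformer v2
# checkpoint via Hugging Face transformers. The repo id is assumed from
# InstaDeep's published checkpoints; verify before use.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo = "InstaDeepAI/nucleotide-transformer-v2-50m-multi-species"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)

seq = "ATGGCGTACGATCGTAGCTAGCTAGGCTA"
inputs = tok(seq, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]

# Mean-pool over tokens -> one fixed-size vector per sequence.
embedding = hidden.mean(dim=1)
print(embedding.shape)
```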
Efficient multi-species genome foundation model (ICLR 2024) that replaces k-mer tokenization with BPE and uses Attention with Linear Biases (ALiBi) for position encoding. Achieves ~56× lower compute than comparable models while outperforming them on 23 of 28 GUE benchmark tasks. Particularly strong at splice site prediction. Pairs with the GUE benchmark suite.
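To make the tokenization change concrete, the sketch below contrasts fixed overlapping k-mers with DNABERT-2's learned BPE vocabulary; the Hugging Face repo id is assumed from the authors' release.

```python
# Sketch: fixed k-mer tokenization vs. the learned BPE vocabulary used by
# DNABERT-2. The Hugging Face repo id is assumed from the authors' release.
from transformers import AutoTokenizer

def kmers(seq: str, k: int = 6):
    """Overlapping fixed-length k-mers, the scheme BPE replaces."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "ATCGGCTAGCTAGGATCGA"
print(kmers(seq)[:4])         # ['ATCGGC', 'TCGGCT', 'CGGCTA', 'GGCTAG']

tok = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M",
                                    trust_remote_code=True)
print(tok.tokenize(seq))      # variable-length BPE tokens instead
```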
Long-context genomic foundation model (Stanford / Harvard) that replaces standard attention with Hyena operators, a subquadratic drop-in replacement enabling sequences up to 1 million tokens at single-nucleotide resolution. Trains 160× faster than Flash Attention at 1M sequence length. Sets state-of-the-art results on 23 downstream tasks, including regulatory element and chromatin profile prediction.
First family of RC-equivariant bidirectional long-range DNA language models built on Mamba (SSM) architecture. Introduces BiMamba and MambaDNA blocks that model long-range genomic dependencies while maintaining computational efficiency. Caduceus-Ph uses RC data augmentation; Caduceus-PS is inherently RC-equivariant via parameter sharing. Outperforms 10× larger Transformer models on long-range variant effect prediction.
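A small, self-contained sketch of the reverse-complement operation behind RC equivariance; the equivariance statement in the comment paraphrases the property informally rather than quoting the paper's formulation.

```python
# Sketch of the reverse-complement (RC) operation underlying RC equivariance.
# An RC-equivariant model f satisfies f(rc(x)) == rc-transformed f(x), so
# predictions do not depend on which DNA strand is presented.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

seq = "ATGGCGTA"
rc = reverse_complement(seq)
print(rc)                                  # TACGCCAT
assert reverse_complement(rc) == seq       # RC is an involution
```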
🔬 RNA & Regulatory Models
Models focused on RNA biology, splicing, gene regulation, and single-cell transcriptomics.
Knowledge-informed cross-species foundation model pre-trained on 120+ million human and mouse single-cell transcriptomes. Integrates four types of biological prior knowledge (GRN, promoter sequences, gene family annotations, co-expression networks) to decipher universal gene regulatory mechanisms. Published in Cell Research 2024.
A 32-layer deep residual convolutional neural network specialized in predicting mRNA splicing directly from DNA sequences. Predicts splice donor and acceptor sites, canonical and cryptic, with single-nucleotide resolution. Trained on GENCODE annotations (hg38). Available as a pip package, with precomputed scores for all SNVs and indels in the human genome.
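A hedged sketch of driving the pip-installed command-line tool from Python; the flag names follow SpliceAI's documented interface, while all file paths are placeholders.

```python
# Sketch: invoking the SpliceAI command-line tool (installed via
# `pip install spliceai`) from Python. Flag names follow the tool's
# documented interface; file paths are placeholders.
import subprocess

subprocess.run(
    [
        "spliceai",
        "-I", "input.vcf",        # variants to score
        "-O", "scored.vcf",       # output VCF annotated with delta scores
        "-R", "hg38.fa",          # reference genome FASTA
        "-A", "grch38",           # annotation build
    ],
    check=True,
)
# Each output record gains donor/acceptor gain-and-loss delta scores per gene.
```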
First unsupervised multiple sequence alignment (MSA)-based RNA language model. Uses MSA-Transformer architecture to capture evolutionary information from homologous RNA sequences. Attention maps directly correlate with RNA secondary structure (base pairing) and 1D solvent accessibility without supervised training. Published in Nucleic Acids Research 2024.
Foundation model for single-cell biology pre-trained on 33+ million cells using a generative transformer. Excels at cell type annotation, multi-batch and multi-omic integration, perturbation response prediction, and gene network inference. Supports zero-shot and fine-tuned applications. Published in Nature Methods 2024.
Attention-based foundation model trained on ~104 million human single-cell transcriptomes (V2) for large-scale gene network analysis. Uses rank-value encoding of gene expression. Enables in-silico perturbation, dosage sensitivity prediction, and network biology tasks via zero-shot and fine-tuned inference. Several of its transcription factor predictions have been experimentally validated. Published in Nature 2023.
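A toy sketch of the rank-value encoding mentioned above; the normalization here is deliberately simplified (the published pipeline scales counts by per-gene medians over the training corpus).

```python
# Sketch of rank-value encoding: a cell's expression vector becomes a
# sequence of gene tokens ordered from highest to lowest expression.
# Normalization is simplified here; the published pipeline scales counts
# by per-gene medians computed over the pre-training corpus.
import numpy as np

genes = np.array(["GATA1", "TP53", "ACTB", "SPI1", "CD34"])
counts = np.array([5.0, 0.0, 120.0, 8.0, 40.0])

nonzero = counts > 0                       # zero-count genes are dropped
order = np.argsort(-counts[nonzero])       # descending expression
rank_tokens = genes[nonzero][order]
print(rank_tokens.tolist())                # ['ACTB', 'CD34', 'SPI1', 'GATA1']
```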
🧩 Specialized & Domain-Specific Models
Targeted models for plant genomes, clinical reasoning, and research-grade generative tasks.
BERT-based model specifically pre-trained on plant genome sequences using BPE tokenization. Designed for agricultural research applications including promoter prediction and regulatory element identification in crop species. Related to InstaDeep's AgroNT (1B), which was trained on 48 plant species for gene expression tasks.
Medical-specialized versions of Gemma 3 tuned for strong performance on medical text and image comprehension. Covers genomics-adjacent tasks including clinical QA, diagnostic reasoning, chest X-ray interpretation, and cancer genomics (TCGA data). MedGemma 1.5 4B is the latest multimodal instruction-tuned release. Requires agreement to the Health AI Developer Foundations terms.
Large-scale reasoning model widely adopted in biotech research for complex differential diagnosis and analytical tasks in genomics. Its chain-of-thought reasoning and strong scientific comprehension make it valuable for interpreting variant reports, literature synthesis, and integrating multi-omics evidence. Community fine-tunes exist for biomedical specialization.
Frontier-scale multimodal reasoning model used for analyzing visual scientific data and molecular structures in genomic research. Applied in workflows requiring interpretation of structural biology figures, omics charts, and complex multi-step genomic analyses. 235B total params with MoE architecture.
📊 Benchmarks & Additional Foundation Models
Performance leaders on NT-Bench / GenBench plus key generative and functional genomics models.
Generative genomic foundation model using a transformer decoder architecture with 98k nucleotide context, pre-trained on 386 billion nucleotides of eukaryotic DNA from RefSeq. Achieves state-of-the-art on Genomic Benchmarks and NT tasks while being significantly faster and more accessible than earlier large-scale models. GENERator-v2 released March 2026.
Family of foundational DNA language models (AIRI Institute) supporting sequences up to 36 kb. Offers multiple pre-training recipes: standard MLM, sparse BigBird attention, and pre-training on the T2T human genome assembly with SNP augmentation. All variants use BPE tokenization. Pre-trained models are available for human, multi-species, and yeast genomes. Published in Nucleic Acids Research 2025.
Genome Rules Obtained Via Extracted Representations: a DNA language model using an optimized BPE vocabulary (600 cycles) derived from the human genome for frequency-balanced tokenization. Designed specifically for identifying functional elements in the human genome. Achieves strong performance on CTCF binding, promoter, and splice site prediction. Published in Nature Machine Intelligence 2024.
Transformer-based model from Google DeepMind that predicts gene expression and chromatin states directly from 131 kb DNA sequences. Integrates long-range interactions via attention and achieves a Pearson R of 0.625 on human validation data. Widely used for regulatory variant effect prediction and as a baseline in AlphaGenome benchmarks. Published in Nature Methods 2021.
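A minimal sketch of the one-hot input encoding conventionally used by Enformer-style sequence models; the (length, 4) layout over A/C/G/T with ambiguous bases left all-zero is standard practice, stated here as an assumption rather than this model's exact preprocessing code.

```python
# Sketch: one-hot encoding DNA for Enformer-style models, which consume
# fixed-length sequence windows as (length, 4) arrays over A/C/G/T.
import numpy as np

def one_hot(seq: str) -> np.ndarray:
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in idx:             # N and other ambiguity codes stay all-zero
            out[i, idx[base]] = 1.0
    return out

print(one_hot("ACGTN"))
```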
Versatile unifying model from Google DeepMind that predicts the functional impact of DNA variations across multiple biological modalities (gene expression, splicing, chromatin accessibility, histone modifications, TF binding, and contact maps), all from 1 Mb sequences at single-base resolution. Matches or outperforms best external models on 25 of 26 variant effect prediction benchmarks. Published in Nature 2026.