Stop Wrestling with Bioinformatics Pipelines! bioSkills Makes AI Agents Your Expert Lab Partner
What if you could describe your RNA-seq experiment in plain English and watch a PhD-level bioinformatician execute it flawlessly in real-time? No more hunting through Stack Overflow at 2 AM. No more deciphering why your STAR alignment keeps segfaulting. No more copy-pasting from three-year-old tutorials that break with every dependency update.
Here's the brutal truth: bioinformatics is drowning in complexity. The bioSkills catalog alone spans 63 categories and 474 skills, built on toolchains that make rocket science look approachable. A typical single-cell RNA-seq analysis touches a dozen different packages, each with its own quirks, version dependencies, and silent failure modes. Undergrads spend semesters just learning to install the software. Postdocs lose months to pipeline debugging. Principal investigators burn grant money on computational bottlenecks that have nothing to do with actual biology.
But what if your AI coding agent already knew all of this? What if it arrived pre-loaded with battle-tested patterns for everything from FASTQ quality control to Mendelian randomization? That's exactly what bioSkills delivers—a revolutionary open-source project that transforms Claude Code, OpenAI Codex, Google Gemini, and other AI agents into expert bioinformatics collaborators.
With 474 skills across 63 categories, bioSkills isn't just another cheat sheet. It's a comprehensive knowledge architecture that lets you describe your biological question naturally and trust your AI agent to select the right tools, apply best practices, and generate correct, idiomatic code. The project has already demonstrated measurable performance gains on the Bio-Task Bench evaluation dataset—and it's completely free under the MIT license.
Ready to stop fighting your tools and start doing science? Let's dive into how bioSkills works, why it's spreading like wildfire through computational biology labs, and exactly how to deploy it in your next project.
What is bioSkills? The Secret Weapon Behind AI-Powered Bioinformatics
bioSkills is a meticulously curated collection of SKILL.md files designed specifically for AI coding agents performing bioinformatics workflows. Created by the GPTomics team, this open-source repository addresses a critical gap in the AI-assisted coding landscape: domain expertise.
Generic AI coding assistants are impressive generalists. They can write Python, debug JavaScript, and explain algorithms. But ask them to properly normalize single-cell RNA-seq data with sctransform versus standard log-normalization, or to choose between DESeq2 and edgeR for a differential expression study with specific experimental constraints, and they often hallucinate, recommend outdated approaches, or miss critical statistical assumptions.
bioSkills solves this by encoding expert bioinformatics knowledge directly into structured skill files that AI agents can consume. Each skill contains:
- Precise "Use when..." triggers that help agents match biological questions to appropriate analytical approaches
- Version-compatibility blocks documenting reference package versions and API verification notes
- Goal/Approach structured explanations that preserve analytical intent even as underlying tool versions evolve
- Magic number documentation with biological rationale—no more unexplained thresholds
- Example prompts in natural language that demonstrate how researchers actually describe their problems
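To make the checklist above concrete, here is a minimal sketch of an automated check for those elements, together with the single-value primary_tool field that the contribution standards call for. The front-matter layout, the `name` and `description` field names, the example skill text, and both function names are assumptions for illustration; the real bioSkills files and the installers' --validate logic may differ.

```python
import re

# Hypothetical skill text; the real bioSkills file layout may differ.
EXAMPLE_SKILL = """\
---
name: example-deseq2-basics
description: Use when comparing RNA-seq counts between two conditions.
primary_tool: DESeq2
---
## Version Compatibility
Reference: DESeq2 1.42+

**Goal:** Find differentially expressed genes.
**Approach:** Fit a negative binomial model per gene.
"""


def split_front_matter(text):
    """Return (metadata dict, body) for text that starts with a '---' block."""
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", text, flags=re.DOTALL)
    if not match:
        return {}, text
    meta = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, match.group(2)


def check_skill(text):
    """Return a list of problems; an empty list means every check passed."""
    meta, body = split_front_matter(text)
    problems = []
    if not meta.get("description", "").startswith("Use when"):
        problems.append("description should start with 'Use when'")
    tool = meta.get("primary_tool", "")
    if not tool or "," in tool or tool.startswith("["):
        problems.append("primary_tool must be a single value")
    if "## Version Compatibility" not in body:
        problems.append("missing '## Version Compatibility' block")
    for marker in ("Goal", "Approach"):
        if marker not in body:
            problems.append(f"missing {marker} section")
    return problems


if __name__ == "__main__":
    print(check_skill(EXAMPLE_SKILL) or "all checks passed")
```

A check like this only confirms that the expected sections exist; whether the documented thresholds and rationale are scientifically sound still needs a human reviewer.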
The repository targets an impressively broad audience: undergraduates learning computational biology, PhD researchers processing large-scale data, clinical bioinformaticians building diagnostic pipelines, and even principal investigators who need rapid prototyping without deep computational expertise.
What makes bioSkills particularly powerful is its multi-agent compatibility. Unlike tools locked to a single platform, bioSkills supports Claude Code, OpenAI Codex CLI, Google Gemini CLI, OpenCode, and OpenClaw, with format conversion between agent skill standards. The Codex, Gemini, and OpenCode installers convert each skill's examples/ directory to scripts/ and its usage-guide.md to references/. The OpenClaw installer preserves the original structure and can optionally add dependency metadata.
The project also publishes evaluation data: its benchmark reports describe measurable improvements on real bioinformatics tasks, and the full evaluation methodology and results are available in the bioskills_eval_20260328.pdf report.
Key Features: Why bioSkills Is Unlike Anything You've Tried Before
Let's dissect what makes bioSkills genuinely transformative for computational biology workflows:
1. Unprecedented Scale and Granularity
With 474 skills spanning 63 categories, bioSkills covers virtually every major bioinformatics domain. From foundational sequence I/O (9 skills) to cutting-edge spatial transcriptomics (11 skills), from classical phylogenetics (8 skills) to emerging fields like liquid biopsy analysis (6 skills), the depth is staggering. The workflows category alone contains 41 end-to-end pipeline skills covering RNA-seq, variant calling, ChIP-seq, scRNA-seq, spatial analysis, proteomics, microbiome studies, CRISPR screens, metabolomics, and even multi-omics integration.
2. Multi-Platform Agent Ecosystem
bioSkills doesn't force you into a single AI vendor's ecosystem. The dedicated install scripts for Claude Code, Codex, Gemini, OpenCode, and OpenClaw mean you can use your preferred agent or even switch between them. The --categories flag enables surgical installation—load only single-cell and variant-calling skills if that's your current focus. The --dry-run option lets you preview installations and estimate token costs before committing.
3. Version-Aware, Future-Proof Architecture
Every code-containing skill includes a ## Version Compatibility block with reference package versions. Example scripts carry version header comments: # Reference: <package> <version>+ | Verify API if version differs. This isn't cosmetic—it's a survival strategy in a field where Bioconductor packages update twice yearly and API-breaking changes are routine.
4. Natural Language Interface to Complex Analysis
The magic happens when skills are deployed. You describe your biological question in plain English, and the agent selects appropriate tools based on context. No memorizing command-line flags. No hunting through documentation. The skill architecture handles the translation.
5. Rigorous Contribution Standards
The project enforces strict quality controls: single-value primary_tool fields, documented magic numbers, structured Goal/Approach sections, and mandatory version compatibility blocks. This isn't crowdsourced chaos—it's curated expertise.
6. Production-Ready Dependencies
The requirements specification alone demonstrates serious intent: Python 3.9+ with biopython, pysam, cyvcf2, pybedtools, pyBigWig, scikit-allel, anndata; R/Bioconductor with DESeq2, edgeR, Seurat, clusterProfiler, methylKit; and comprehensive CLI toolchains via Homebrew, APT, or Conda.
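The version header comments described above (# Reference: <package> <version>+ | Verify API if version differs) lend themselves to a simple automated check. The sketch below parses such a header and decides whether an installed version warrants a closer look. The function names, the example version numbers, and the policy itself (flag anything older than the reference or on a different major release) are assumptions for illustration, not part of bioSkills.

```python
import re

# Matches comments such as "# Reference: pysam 0.22+ | Verify API if version differs"
HEADER_RE = re.compile(
    r"#\s*Reference:\s*(?P<package>[\w.\-]+)\s+(?P<version>[\d.]+)\+"
)


def parse_reference_header(line):
    """Extract (package, version) from a reference header, or None if absent."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    return match.group("package"), match.group("version")


def version_tuple(version):
    """Turn '1.10.2' into (1, 10, 2); parsing stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        digits = re.match(r"\d+", piece)
        if digits is None:
            break
        parts.append(int(digits.group()))
    return tuple(parts)


def needs_api_check(reference, installed):
    """True if installed is older than reference or on a different major release.

    This policy is an assumption; the header only says to verify the API
    when the version differs.
    """
    ref, cur = version_tuple(reference), version_tuple(installed)
    return cur < ref or cur[:1] != ref[:1]


if __name__ == "__main__":
    header = "# Reference: pysam 0.22+ | Verify API if version differs"
    package, reference = parse_reference_header(header)
    for installed in ("0.21.0", "0.22.1", "1.0.0"):
        print(package, installed, "check API:", needs_api_check(reference, installed))
```

Comparing numeric tuples instead of raw strings avoids the classic mistake of treating 0.9 as newer than 0.22.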
Real-World Use Cases: Where bioSkills Transforms Your Workflow
Use Case 1: The Overwhelmed Graduate Student
Scenario: You've just received 10X Genomics scRNA-seq data for your thesis project. You know the analysis involves quality control, normalization, clustering, and cell type annotation—but the Seurat documentation is 200 pages, and you're not sure which normalization method is appropriate for your data structure.
With bioSkills: Simply tell your agent: "I just got my 10X scRNA-seq data—filter out low-quality cells and normalize." The agent draws from 14 single-cell skills covering Seurat, Scanpy, Pertpy, Cassiopeia, and MeboCost. It applies appropriate QC thresholds with documented rationale, selects normalization based on your data characteristics, and generates reproducible code with version-pinned dependencies.
Use Case 2: The Clinical Researcher Under Pressure
Scenario: Your collaborator identified a BRCA1 variant in a patient sample. You need rapid, accurate assessment: population frequency in gnomAD, ClinVar clinical significance, ACMG pathogenicity classification, and functional effect prediction. Normally this requires navigating five different databases with incompatible query formats.
With bioSkills: Ask naturally: "I found a BRCA1 variant in my patient—is it pathogenic according to ACMG guidelines? Which of my variants are already known to be disease-causing in ClinVar? What's the population frequency in gnomAD?" The agent orchestrates variant-calling skills (13 total, covering bcftools, GATK, DeepVariant, Manta, Delly, VEP) and clinical database skills (10 total, including myvariant, ClinVar/gnomAD integration, pharmacogenomics, and PRS tools) to deliver a comprehensive clinical report.
Use Case 3: The Multi-Omics Investigator
Scenario: You're integrating RNA-seq, ATAC-seq, and Hi-C data to understand gene regulation in your model system. Each datatype has its own analysis pipeline, and you need to ensure peak calls, differential accessibility tests, and enhancer-gene predictions are statistically sound and biologically meaningful.
With bioSkills: Deploy the full arsenal: "Run the ENCODE 4 ATAC-seq pipeline with IDR across replicates, build a consensus peakset, predict enhancer-gene connections with ABC using ATAC + H3K27ac + Hi-C, and identify TADs from my Hi-C contact matrix." The ATAC-seq category's 12 skills (covering tools such as MACS3, DiffBind, chromVAR, TOBIAS, scprinter, ArchR, Signac, SnapATAC2, Cicero, ABC, chromBPNet, BPNet, scBasset, EnFormer, WASP, GATK ASEReadCounter, and RASQUAL) and the 8 Hi-C analysis skills (built on cooler, cooltools, pairtools, and HiCExplorer) work in concert.
Use Case 4: The Method Developer Validating New Approaches
Scenario: You've developed a new peak-calling algorithm and need rigorous benchmarking against established methods with proper cross-validation, multiple test correction, and reproducible reporting.
With bioSkills: Request: "Run a complete biomarker discovery pipeline with proper cross-validation, set up a Snakemake workflow for 50 samples, and generate a MultiQC report summarizing all outputs." The machine learning skills (6 total with sklearn, shap, lifelines, scvi-tools), workflow management skills (Snakemake, Nextflow, cwltool, Cromwell), and reporting skills (RMarkdown, Quarto, Jupyter, MultiQC, matplotlib) combine for publication-ready rigor.
Step-by-Step Installation & Setup Guide
Getting bioSkills operational takes minutes, not hours. Here's the complete deployment process:
Prerequisites Installation
Python Dependencies:
# Core bioinformatics Python stack
pip install biopython pysam cyvcf2 pybedtools pyBigWig scikit-allel anndata mygene
R/Bioconductor (Required for differential expression, single-cell, pathway analysis, methylation):
# Install the BiocManager package if it is not already present
if (!requireNamespace('BiocManager', quietly = TRUE))
    install.packages('BiocManager')
# Install the core packages (Seurat is hosted on CRAN; BiocManager::install handles both sources)
BiocManager::install(c('DESeq2', 'edgeR', 'Seurat', 'clusterProfiler', 'methylKit'))
CLI Tools (Choose your platform):
# macOS via Homebrew
brew install samtools bcftools blast minimap2 bedtools
# Ubuntu/Debian via APT
sudo apt install samtools bcftools ncbi-blast+ minimap2 bedtools
# Cross-platform via Conda (comprehensive installation; Bioconda packages also need the conda-forge channel)
conda install -c conda-forge -c bioconda samtools bcftools blast minimap2 bedtools \
    fastp kraken2 metaphlan sra-tools bwa-mem2 bowtie2 star hisat2 \
    manta delly cnvkit macs3 macs2 genrich tobias rgt-hint idr picard \
    preseq deeptools chromap subread fithichip gatk4
Agent-Specific Installation
Clone the repository:
git clone git@github.com:GPTomics/bioSkills.git # SSH form, needs a GitHub SSH key; otherwise use https://github.com/GPTomics/bioSkills.git
cd bioSkills
Claude Code Installation:
./install-claude.sh # Global installation
./install-claude.sh --project /path/to/project # Project-specific install
./install-claude.sh --categories "single-cell,variant-calling" # Selective install
./install-claude.sh --list # Preview available skills
./install-claude.sh --validate # Verify all skill files
./install-claude.sh --update # Incremental update
./install-claude.sh --uninstall # Clean removal
OpenAI Codex CLI:
./install-codex.sh # Global installation
./install-codex.sh --project /path/to/project # Project-specific
./install-codex.sh --categories "single-cell,variant-calling"
./install-codex.sh --list
./install-codex.sh --validate
./install-codex.sh --update
./install-codex.sh --uninstall
Google Gemini CLI:
./install-gemini.sh # Global installation
./install-gemini.sh --project /path/to/project
./install-gemini.sh --categories "single-cell,variant-calling"
./install-gemini.sh --list
./install-gemini.sh --validate
./install-gemini.sh --update
./install-gemini.sh --uninstall
OpenCode (Auto-discovers skills from other installers):
./install-opencode.sh # Install to ~/.config/opencode/skills/
./install-opencode.sh --project /path/to/project
./install-opencode.sh --categories "single-cell,variant-calling"
./install-opencode.sh --list
./install-opencode.sh --validate
./install-opencode.sh --update
./install-opencode.sh --uninstall
OpenClaw (from ClawHub or direct install):
./install-openclaw.sh # Global install
./install-openclaw.sh --categories "single-cell,variant-calling"
./install-openclaw.sh --project /path/to/workspace
./install-openclaw.sh --tool-type-metadata # Add dependency metadata
./install-openclaw.sh --dry-run # Preview + token estimate
./install-openclaw.sh --list
./install-openclaw.sh --validate
./install-openclaw.sh --update
./install-openclaw.sh --uninstall
Critical note: OpenCode automatically discovers Agent Skills from ~/.claude/skills/ and ~/.agents/skills/, so skills already installed by install-claude.sh or install-codex.sh are available in OpenCode without running install-opencode.sh as well.
REAL Code Examples: See bioSkills in Action
The true power of bioSkills emerges when you examine how natural language prompts translate into sophisticated, multi-step analyses. Here are authentic examples from the repository demonstrating the system's capabilities:
Example 1: Complete RNA-seq Differential Expression Pipeline
This demonstrates how a single natural language request triggers a comprehensive analytical workflow:
# RNA-seq & Differential Expression - Natural Language Prompts
"I have RNA-seq counts from treated vs control samples - find the differentially expressed genes"
"Run the complete RNA-seq pipeline from my FASTQ files to a list of DE genes"
"What biological pathways are enriched in my upregulated genes?"
"Run GSEA to see if whole pathways are up or down in my treatment"
"Align my paired-end RNA-seq reads to the human genome with STAR"
"Count reads per gene from my aligned BAM files"
What's happening under the hood: The agent draws from 6 differential-expression skills (built on DESeq2, edgeR, ggplot2, and pheatmap), 4 rna-quantification skills (featureCounts, Salmon, kallisto, tximport), 4 read-alignment skills (bwa-mem2, bowtie2, STAR, HISAT2), and 6 pathway-analysis skills (clusterProfiler, ReactomePA, rWikiPathways, enrichplot). It handles the choice of statistical framework, for example DESeq2 for standard designs and edgeR for complex contrasts or when robust dispersion estimation is critical. The pathway analysis skills distinguish between Over-Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA), applying the appropriate background universe and handling prokaryotic versus eukaryotic gene identifiers.
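The point about the background universe in ORA is easier to see with a small worked example. The sketch below computes the standard upper-tail hypergeometric p-value in pure Python and restricts both gene lists to the background before counting. The function names and gene identifiers are made up for illustration; in practice a package such as clusterProfiler would do this work and also apply multiple-testing correction, which is omitted here.

```python
from math import comb


def ora_p_value(universe_size, pathway_size, selected_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap)."""
    max_overlap = min(pathway_size, selected_size)
    total = comb(universe_size, selected_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, selected_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def over_representation(de_genes, pathway_genes, background_genes):
    """ORA p-value for one pathway, counting only genes present in the background."""
    background = set(background_genes)
    selected = set(de_genes) & background
    pathway = set(pathway_genes) & background
    overlap = len(selected & pathway)
    return ora_p_value(len(background), len(pathway), len(selected), overlap)


if __name__ == "__main__":
    # Made-up identifiers: 10 expressed genes, a 4-gene pathway, 3 DE genes.
    background = [f"g{i}" for i in range(1, 11)]
    pathway = ["g1", "g2", "g3", "g4"]
    de_genes = ["g1", "g2", "g5"]
    print(over_representation(de_genes, pathway, background))
```

Using only the genes that were actually measured as the background, rather than the whole genome, keeps the test from overstating enrichment; that is the design choice the sketch is meant to highlight.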
Example 2: Single-Cell Analysis with Cell Type Annotation
# Single-Cell Analysis - Natural Language Prompts
"I just got my 10X scRNA-seq data - filter out low-quality cells and normalize"
"Cluster my single-cell data and help me figure out what cell types they are"
"Find marker genes for each cluster so I can annotate cell types"
"Reconstruct the differentiation trajectory and find branch points in my data"
"Which ligand-receptor pairs show active communication between my cell types?"
Technical depth: The 14 single-cell skills provide comprehensive coverage: Seurat for standard workflows, Scanpy for Python-native pipelines, Pertpy for perturbation analysis, Cassiopeia for lineage tracing, and MeboCost for metabolite communication. The QC skill applies adaptive thresholds based on mitochondrial percentage, feature count distributions, and doublet detection rather than arbitrary cutoffs. Normalization skills distinguish between log-normalization, SCTransform (for variance stabilization), and integration-aware normalization for multi-sample studies. Trajectory inference skills cover pseudotime ordering, RNA velocity, and branch point detection. The cell-cell communication skill implements ligand-receptor analysis with statistical significance testing.
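One common way to derive adaptive QC thresholds instead of fixed cutoffs is to flag cells that lie several median absolute deviations (MADs) from the median of each metric. The sketch below illustrates that idea in pure Python; it is an assumption about what "adaptive" could mean here, not the method bioSkills prescribes. The function names, field names, and cell values are invented, and doublet detection is out of scope.

```python
from statistics import median


def mad_bounds(values, n_mads=3.0):
    """Return (lower, upper) bounds at n_mads median absolute deviations from the median."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return center - n_mads * mad, center + n_mads * mad


def keep_flags(cells, n_mads=3.0):
    """Keep cells with enough detected genes and without excessive mitochondrial reads.

    cells is a list of dicts with 'n_genes' and 'pct_mito' keys (hypothetical names).
    n_mads=3 is a widely used convention, not a universal rule.
    """
    low_genes, _ = mad_bounds([c["n_genes"] for c in cells], n_mads)
    _, high_mito = mad_bounds([c["pct_mito"] for c in cells], n_mads)
    return [
        c["n_genes"] >= low_genes and c["pct_mito"] <= high_mito for c in cells
    ]


if __name__ == "__main__":
    # Made-up values: five typical cells and one likely damaged cell.
    cells = [
        {"n_genes": 2000, "pct_mito": 3},
        {"n_genes": 2100, "pct_mito": 4},
        {"n_genes": 1900, "pct_mito": 5},
        {"n_genes": 2050, "pct_mito": 4},
        {"n_genes": 1950, "pct_mito": 3},
        {"n_genes": 200, "pct_mito": 40},
    ]
    print(keep_flags(cells))
```

Because the bounds are computed from the data, the same code adapts to tissues with naturally high mitochondrial content, where a fixed cutoff would discard healthy cells.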
Example 3: Advanced Variant Interpretation with SpliceAI
# Variant Calling & Clinical Genomics - Advanced Prompts
"Predict if this deep-intronic variant creates a pseudoexon using SpliceAI extended-window scoring"
"Apply ClinGen SVI 2023 splicing thresholds to classify variants as PP3 supporting/moderate/strong"
"My patient has CYP2D6 variants - what's their metabolizer phenotype?"
Clinical precision: These prompts leverage 13 variant-calling skills spanning germline/somatic calling, structural variant detection (Manta, Delly), and clinical interpretation. The SpliceAI integration demonstrates bioSkills' cutting-edge coverage—extended window scoring for deep intronic variants, with explicit ClinGen SVI 2023 threshold application for ACMG PP3 evidence classification. The pharmacogenomics skill maps CYP2D6 star alleles to metabolizer phenotypes with clinical actionability annotations. This isn't generic coding assistance; it's domain-expert reasoning encoded for AI execution.
Example 4: ENCODE-Compliant ATAC-seq with Deep Learning
# Epigenomics & Chromatin - Production-Grade Prompts
"Run the ENCODE 4 ATAC-seq pipeline with IDR across replicates and pseudoreplicate self-consistency"
"Build a Corces 2018 fixed-width consensus peakset (501 bp) before differential accessibility testing"
"Run TOBIAS three-step footprinting and verify CTCF aggregate shows a clean V-shape"
"Score 100 GWAS SNPs for chromatin effects with chromBPNet pre-trained on the matched cell type"
"Predict enhancer-gene regulatory connections with the ABC model using ATAC + H3K27ac + Hi-C"
Sophisticated orchestration: The 12 ATAC-seq skills represent one of bioSkills' most impressive categories. The ENCODE 4 pipeline skill implements IDR (Irreproducible Discovery Rate) across true replicates and pseudoreplicates with spike-in normalization and sex-chromosome aware QC. The consensus peakset skill generates fixed-width intervals following Corces 2018 methodology—critical for differential accessibility testing where variable-width peaks introduce statistical artifacts. TOBIAS footprinting includes bias correction and aggregate visualization validation. The chromBPNet integration enables in silico variant effect prediction with cell-type-matched pretrained models. The ABC (Activity-By-Contact) model skill orchestrates three datatypes (ATAC accessibility, H3K27ac signal, Hi-C contact frequency) to predict enhancer-gene regulatory connections.
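The fixed-width consensus peakset idea can be sketched in a few lines. As the approach is commonly described, each summit is extended by 250 bp on both sides to give a 501 bp peak, peaks are ranked by score, and a peak is kept only if it does not overlap a stronger one that was already kept. The code below is a simplified illustration under that reading; the function name and input tuples are hypothetical, and per-sample score normalization and clipping at chromosome ends are omitted.

```python
def fixed_width_peaks(summits, half_width=250):
    """Build non-overlapping fixed-width peaks from (chrom, summit, score) tuples.

    Peaks are half-open intervals [start, end) of width 2 * half_width + 1.
    Stronger peaks win when two candidates overlap.
    """
    ranked = sorted(summits, key=lambda s: s[2], reverse=True)
    kept = []
    for chrom, summit, score in ranked:
        start = summit - half_width
        end = summit + half_width + 1
        overlaps_kept = any(
            chrom == k_chrom and start < k_end and k_start < end
            for k_chrom, k_start, k_end, _ in kept
        )
        if not overlaps_kept:
            kept.append((chrom, start, end, score))
    # Return in genomic order rather than score order.
    return sorted(kept, key=lambda p: (p[0], p[1]))


if __name__ == "__main__":
    # Made-up summits: the two nearby chr1 summits compete and the stronger one wins.
    summits = [
        ("chr1", 1000, 50.0),
        ("chr1", 1200, 80.0),
        ("chr1", 2000, 10.0),
        ("chr2", 1000, 5.0),
    ]
    for peak in fixed_width_peaks(summits):
        print(peak)
```

Every retained interval has the same width, so count-based differential accessibility tests compare like with like across peaks.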
Advanced Usage & Best Practices
Selective Installation for Performance: Don't install all 474 skills if you're focused on a specific project. Use --categories to load only relevant skills:
./install-claude.sh --categories "single-cell,differential-expression,pathway-analysis"
This reduces context window consumption and improves agent response speed.
Validate Before Critical Analyses: Always run --validate before important projects:
./install-claude.sh --validate
This checks skill file integrity, ensuring your agent won't fail mid-analysis due to malformed skill definitions.
Dry-Run for Token Budgeting: OpenClaw's --dry-run is invaluable for estimating token costs:
./install-openclaw.sh --dry-run --categories "variant-calling,clinical-databases"
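For a rough sense of what such a token estimate involves, the sketch below sums an approximate token count over the Markdown files of selected categories. It is not how the installer computes its figure: the four-characters-per-token heuristic, the root/<category>/ directory layout, and the function names are all assumptions for illustration.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic for English text, not a real tokenizer


def estimate_tokens(text):
    """Approximate token count: characters divided by CHARS_PER_TOKEN, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def estimate_skill_tokens(root, categories):
    """Return {category: estimated tokens} for Markdown files under root/<category>/."""
    totals = {}
    for category in categories:
        files = sorted(Path(root, category).rglob("*.md"))
        totals[category] = sum(
            estimate_tokens(f.read_text(encoding="utf-8")) for f in files
        )
    return totals


if __name__ == "__main__":
    # Hypothetical checkout location and category names.
    print(estimate_skill_tokens("bioSkills", ["variant-calling", "clinical-databases"]))
```

Real tokenizers differ between models, so treat a number like this as an order-of-magnitude guide when deciding which categories to install.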
Version Pinning for Reproducibility: When agents generate code, explicitly request version-pinned environments:
"Generate a requirements.txt with exact versions and a conda environment.yml for this analysis"
The skills' built-in version compatibility blocks help agents make informed recommendations.
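If you prefer to pin versions yourself rather than ask the agent, the standard library can report what is installed in the current environment. The sketch below writes pip-style name==version lines for a list of packages; the function name and the package list are illustrative assumptions, and R or command-line tools would need a separate mechanism such as a conda environment export.

```python
from importlib import metadata


def pinned_requirements(packages):
    """Return (pinned lines, missing names) for the given distribution names."""
    lines, missing = [], []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            missing.append(name)
    return lines, missing


if __name__ == "__main__":
    # Distribution names taken from the pip install command earlier in this guide.
    wanted = ["biopython", "pysam", "cyvcf2", "anndata"]
    lines, missing = pinned_requirements(wanted)
    print("\n".join(lines))
    if missing:
        print("# not installed:", ", ".join(missing))
```

Recording the exact versions next to the analysis code makes it much easier to tell, months later, whether a changed result comes from the data or from an upgraded package.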
Combine Skills for Multi-Modal Studies: bioSkills truly shines when you chain categories. For a complete multi-omics study:
"Integrate my scRNA-seq and scATAC-seq data, then link regulatory elements to target genes"
This draws from single-cell (14 skills), ATAC-seq (12 skills), and gene-regulatory-networks (5 skills) simultaneously.
Comparison with Alternatives: Why bioSkills Wins
| Feature | bioSkills | Generic AI Assistants | Bioinformatics Notebooks | Workflow Managers (Snakemake/Nextflow) |
|---|---|---|---|---|
| Domain Expertise | Deep, curated, version-aware | Surface-level, often hallucinated | Variable, depends on author | None—infrastructure only |
| Natural Language Interface | Native, optimized for biology | Generic, requires precise prompts | None—manual execution | None—code/configuration required |
| Multi-Agent Support | Claude, Codex, Gemini, OpenCode, OpenClaw | Single platform | N/A | N/A |
| Installation Complexity | One-command per agent | N/A | Manual dependency resolution | Complex environment setup |
| Coverage | 474 skills, 63 categories | None specific | Fragmented across repositories | None—build your own |
| Version Management | Built-in compatibility blocks | None | Often outdated | Container-based, rigid |
| Learning Curve | Minimal—describe your experiment | High for bioinformatics | High—must understand code | Very high |
| Reproducibility | Structured, documented | Unpredictable | Depends on author discipline | Excellent, but complex |
| Update Mechanism | --update incremental | N/A | Manual | Version-controlled |
The verdict: Generic AI assistants lack domain depth. Notebooks and workflow managers provide reproducibility but demand substantial expertise and manual effort. bioSkills uniquely combines natural language accessibility with expert-level analytical guidance—making sophisticated bioinformatics genuinely approachable without sacrificing rigor.
FAQ: Your Burning Questions Answered
Q: Do I need to be a bioinformatics expert to use bioSkills?
A: Absolutely not. The system is designed for users ranging from undergraduates to principal investigators. Natural language prompts mean you describe your biological question, not computational implementation. However, basic understanding of your experimental design helps you evaluate and interpret results.
Q: Which AI agent works best with bioSkills?
A: All supported agents (Claude Code, Codex, Gemini, OpenCode, OpenClaw) perform well. Claude Code currently offers the most mature bioinformatics ecosystem integration. Codex excels for rapid prototyping. Gemini provides strong multimodal capabilities. Choose based on your existing workflow and subscription preferences.
Q: How does bioSkills handle tool version updates?
A: Each skill includes version compatibility documentation. The --update flag performs incremental updates. When major tool versions change, the project's structured Goal/Approach format preserves analytical intent while allowing implementation updates. Always verify API compatibility when using newer package versions.
Q: Can I contribute new skills or improve existing ones?
A: Yes! The project welcomes contributions with specific requirements: "Use when..." descriptions, single-value primary_tool fields, documented magic numbers, version compatibility blocks, and Goal/Approach structuring. See the Contributing section in the repository for complete guidelines.
Q: Is bioSkills suitable for clinical/diagnostic use?
A: The repository provides analytical frameworks and educational guidance. For clinical diagnostics, you must validate all pipelines according to your institution's regulatory requirements (CLIA, CAP, etc.). bioSkills accelerates development but does not replace regulatory validation.
Q: How does performance compare to manual pipeline development?
A: Evaluation on the Bio-Task Bench dataset demonstrates strong performance. The bioskills_eval_20260328.pdf report provides detailed metrics. For many standard analyses, bioSkills matches or exceeds typical graduate student implementation quality while dramatically reducing development time.
Q: What about proprietary or restricted data?
A: The skills themselves are static files installed locally, and no data is transmitted to the skill repository. Your AI agent, however, may send prompts and file contents to its model provider, so for highly sensitive data make sure the agent's configuration complies with your institutional data governance policies.
Conclusion: The Future of Bioinformatics Is Conversational
bioSkills represents a fundamental shift in how computational biology gets done. By encoding 474 expert-curated skills into AI-accessible formats, it demolishes the traditional barriers between biological insight and analytical execution. No more wrestling with dependency hell. No more translating biological questions into brittle shell scripts. No more wondering if your analysis follows current best practices.
The repository's 63 categories cover everything from foundational sequence manipulation to cutting-edge spatial transcriptomics, from classical population genetics to emerging causal inference methods. Its multi-agent architecture ensures you're never locked into a single vendor. Its version-aware design means your analyses remain reproducible as the tool ecosystem evolves.
Whether you're an undergraduate terrified of the command line, a postdoc racing against publication deadlines, or a principal investigator seeking to democratize computational analysis in your lab, bioSkills meets you where you are—and elevates what you can achieve.
The science of biology is too important to be bottlenecked by the mechanics of computation. Install bioSkills today, describe your next experiment in plain English, and discover what happens when AI agents truly understand bioinformatics.
⭐ Star bioSkills on GitHub | 🧬 Clone it now: git clone git@github.com:GPTomics/bioSkills.git