Stop Drawing Diagrams by Hand! PaperBanana Automates Academic Figures

What if your next paper's most beautiful figure took 30 seconds instead of 3 hours?

Every AI researcher knows the soul-crushing moment. You've cracked a novel architecture, run exhaustive experiments, written elegant proofs—and now you're staring at a blank canvas in PowerPoint at 2 AM, trying to draw a transformer block that doesn't look like it was made in MS Paint. Your method section is polished. Your results are groundbreaking. But your figures? They look like they belong in a middle school science fair.

Here's the dirty secret nobody talks about at NeurIPS: bad diagrams can sink good papers. Reviewers may subconsciously downgrade work that looks amateurish. Collaborators lose confidence. Your brilliant ideas get buried under visual noise you simply don't have time to fix.

But what if you never had to touch a drawing tool again?

Enter PaperBanana—the open-source multi-agent framework that's making hand-crafted academic illustrations obsolete. Born from Google Research as PaperVizAgent and now evolving rapidly under the dwzhu-pku/PaperBanana repository, this tool doesn't just generate images. It orchestrates an entire creative team of AI agents that think, plan, style, visualize, and critique—just like a real design studio, but at machine speed.

The results? Publication-ready diagrams that would take human designers hours, generated from nothing but your method text and figure caption. And the best part: it's already live on Hugging Face Spaces, so there is nothing to install locally (you only bring your own API key).

Ready to never waste another night on figure formatting? Let's peel back how PaperBanana works.

What is PaperBanana?

PaperBanana is a reference-driven multi-agent framework specifically engineered for automated academic illustration generation. Originally developed during Dawei Zhu's internship at Google Research as PaperVizAgent, the project has been forked and significantly expanded into its current open-source incarnation, with contributions from researchers at Peking University and beyond.

The name invites an obvious reading: just as a banana peels to reveal something useful inside, PaperBanana strips away the tedious outer layers of figure creation to expose the core scientific communication underneath.

What makes PaperBanana genuinely different from generic image generation tools is its architectural commitment to scientific accuracy. While DALL-E or Midjourney might produce pretty pictures, they lack the structured reasoning required for academic contexts. PaperBanana's five-agent pipeline ensures that generated illustrations aren't just aesthetically pleasing—they're semantically faithful to the underlying research methodology.

The framework has gained substantial traction since its release, with community forks like PaperBanana-Pro offering Chinese-enhanced versions, and related projects such as AutoFigure-Edit and Paper2Any emerging in the same ecosystem. The original paper (arXiv:2601.23265) establishes the theoretical foundations, while the practical implementation continues to evolve through community contributions.

Importantly, the maintainers have committed to keeping PaperBanana fully open-source and non-commercial. As stated in their disclaimer: "Our goal is simply to benefit the community, so currently we have no plans to use it for commercial purposes." This positions PaperBanana as a genuine public good for the research community rather than a venture-backed product with hidden monetization plans.

Key Features That Make PaperBanana Insane

Five-Agent Orchestration Pipeline

PaperBanana's secret sauce is its multi-agent architecture that mimics how professional scientific illustrators actually work:

Retriever Agent: Performs generative retrieval from a curated dataset of reference diagrams. It doesn't just find similar images—it identifies the visual patterns and compositional strategies most relevant to your specific method description.
Planner Agent: Translates your raw method text into comprehensive visual descriptions using in-context learning from the retrieved references. This is where scientific understanding happens—the agent must comprehend causal relationships, data flows, and architectural components.
Stylist Agent: Automatically synthesizes and applies aesthetic guidelines specific to academic publications. No more guessing whether your arrows should be #333333 or #000000—this agent enforces domain-appropriate visual standards.
Visualizer Agent: Interfaces with state-of-the-art image generation models (configurable via OpenRouter or Google Gemini APIs) to render the styled descriptions into actual pixel outputs.
Critic Agent: The quality assurance layer that engages in multi-round iterative refinement with the Visualizer, identifying semantic inconsistencies, aesthetic flaws, and readability issues that would escape human notice at 2 AM.
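
To make the hand-offs concrete, here is a minimal sketch of how five such agents could be chained, with a bounded Critic loop at the end. Everything in it is hypothetical: the `Draft` class, the function names, and the toy critic rule are illustrations for this article, not the repository's actual classes or prompts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Draft:
    """Hypothetical state object handed from agent to agent."""
    method_text: str
    caption: str
    references: List[str] = field(default_factory=list)
    description: str = ""
    image: str = ""  # stand-in for rendered pixels
    history: List[str] = field(default_factory=list)


def retriever(d: Draft) -> Draft:
    d.references = ["reference_a", "reference_b"]  # placeholder lookup
    d.history.append("retriever")
    return d


def planner(d: Draft) -> Draft:
    d.description = f"Layout for '{d.caption}' using {len(d.references)} references"
    d.history.append("planner")
    return d


def stylist(d: Draft) -> Draft:
    d.description += " | style: academic"
    d.history.append("stylist")
    return d


def visualizer(d: Draft) -> Draft:
    d.image = f"<image of: {d.description}>"
    d.history.append("visualizer")
    return d


def critic(d: Draft) -> str:
    """Return feedback text, or an empty string when satisfied (toy rule)."""
    d.history.append("critic")
    return "" if "revised" in d.description else "labels too small"


def run_pipeline(method_text: str, caption: str, max_rounds: int = 3) -> Draft:
    d = Draft(method_text, caption)
    for stage in (retriever, planner, stylist, visualizer):
        d = stage(d)
    for _ in range(max_rounds):
        feedback = critic(d)
        if not feedback:
            break
        d.description += f" | revised: {feedback}"
        d = visualizer(d)  # re-render with the critic's feedback applied
    return d


result = run_pipeline("We propose a two-stage encoder.", "Overview of the method")
print(" -> ".join(result.history))
```

The `max_rounds` cap is the important design choice here: a critic that is never satisfied would otherwise keep spending API calls forever.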

Reference-Driven Intelligence

Unlike zero-shot generation approaches, PaperBanana leverages in-context learning from real academic figures. The PaperBananaBench dataset supplies curated examples drawn from actual publications, so the system can pick up what "good" academic illustration looks like in specific subfields.
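
In practice, in-context learning from references usually means placing the retrieved examples ahead of the new request in the prompt. The sketch below shows that general pattern only; the function name, the dictionary fields, and the prompt wording are assumptions made for illustration and are not taken from PaperBanana's prompt templates.

```python
from typing import Dict, List


def build_planner_prompt(method_text: str, caption: str,
                         references: List[Dict[str, str]]) -> str:
    """Assemble a few-shot prompt: reference examples first, new task last."""
    parts = ["You are planning an academic figure. Study the references first."]
    for i, ref in enumerate(references, start=1):
        parts.append(
            f"Example {i}\n"
            f"Caption: {ref['caption']}\n"
            f"Visual description: {ref['description']}"
        )
    parts.append(
        "Now plan a new figure.\n"
        f"Caption: {caption}\n"
        f"Method: {method_text}\n"
        "Visual description:"
    )
    return "\n\n".join(parts)


refs = [
    {"caption": "Encoder-decoder overview", "description": "Two stacked boxes, left to right"},
    {"caption": "Training loop", "description": "Circular flow with four labeled nodes"},
]
prompt = build_planner_prompt("We add a memory module.", "Overview of our method", refs)
print(prompt)
```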

Flexible Experiment Modes

The framework supports six distinct pipeline configurations:

Mode	Pipeline	Use Case
`vanilla`	Direct generation	Quick prototyping without quality guarantees
`dev_planner`	Retriever → Planner → Visualizer	Basic structured generation
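`dev_planner_stylist`	Retriever → Planner → Stylist → Visualizer	Enhanced aesthetic quality
`dev_planner_critic`	Retriever → Planner → Visualizer → Critic loop	Iterative refinement without styling
`dev_full`	Retriever → Planner → Stylist → Visualizer → Critic loop	Maximum quality for final figures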
`demo_full`	Full pipeline without evaluation	Interactive exploration
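
One way to keep the six modes straight is to think of each as a list of stages. The mapping below is inferred from the table above rather than read from the repository's source, so treat the stage lists (and the `stages_for` helper, which is hypothetical) as an approximation.

```python
from typing import Dict, List

FULL = ["retriever", "planner", "stylist", "visualizer", "critic"]

# Inferred from the mode table; the real dispatch logic may differ.
STAGES_BY_MODE: Dict[str, List[str]] = {
    "vanilla": ["visualizer"],
    "dev_planner": ["retriever", "planner", "visualizer"],
    "dev_planner_stylist": ["retriever", "planner", "stylist", "visualizer"],
    "dev_planner_critic": ["retriever", "planner", "visualizer", "critic"],
    "dev_full": list(FULL),
    "demo_full": list(FULL),  # same stages, evaluation skipped afterwards
}


def stages_for(mode: str) -> List[str]:
    """Look up the stage list for a mode, failing loudly on typos."""
    try:
        return STAGES_BY_MODE[mode]
    except KeyError:
        valid = ", ".join(sorted(STAGES_BY_MODE))
        raise ValueError(f"Unknown exp_mode {mode!r}; expected one of: {valid}") from None


print(stages_for("dev_planner_critic"))
```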

Production-Ready Infrastructure

Parallel Generation: Generate up to 20 candidate diagrams simultaneously, then cherry-pick the best
High-Resolution Refinement: Upscale selected candidates to 2K or 4K resolution
Batch Export: Download all candidates as individual PNGs or consolidated ZIP archives
Pipeline Visualization: Track exactly how each figure evolved through Planner → Stylist → Critic stages
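
The "generate many, keep the best" workflow maps naturally onto a thread pool, since each candidate is mostly waiting on network calls. This is a rough sketch under stated assumptions: `generate_candidate` is a stub standing in for one full pipeline run, and the random `score` stands in for whatever you use to choose (in the real tool, that is you looking at the grid).

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def generate_candidate(seed: int) -> Dict[str, float]:
    """Stub for one pipeline run; a real run would call the model APIs."""
    rng = random.Random(seed)
    return {"seed": seed, "score": rng.random()}  # placeholder quality score


def generate_candidates(n: int = 20, max_workers: int = 8) -> List[Dict[str, float]]:
    # pool.map preserves input order, so result i belongs to seed i.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_candidate, range(n)))


candidates = generate_candidates()
best = max(candidates, key=lambda c: c["score"])
print(f"{len(candidates)} candidates generated; best seed = {best['seed']}")
```

Keeping `max_workers` below the candidate count is a simple way to stay inside an API tier's concurrency limit.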

Real-World Use Cases Where PaperBanana Dominates

1. Conference Submission Crunches

The NeurIPS/ICML/ICLR submission timeline is brutal. You've spent months on experiments and have 48 hours to polish everything. PaperBanana transforms your method section into candidate figures in minutes, not hours. Run 20 variations in parallel, pick the strongest, refine to 4K, and submit with confidence.

2. Thesis and Dissertation Compilation

PhD students routinely need 50+ figures across hundreds of pages. The visual consistency requirements are enormous: same arrow styles, matching color palettes, coherent notation. PaperBanana's style-guided generation pushes every figure toward the same guidelines, and if you supply references from your own department, the reference-driven approach can pick up those conventions too.

3. Survey Papers and Literature Reviews

When synthesizing dozens of architectures into comparison figures, accuracy is paramount. Misrepresenting someone else's model is a career-damaging error. PaperBanana's Planner Agent explicitly reasons about component relationships, reducing the risk of semantic errors that plague manually-created survey diagrams.

4. Grant Proposals and White Papers

Program officers and reviewers scan figures first. A compelling visual narrative can secure millions in funding. PaperBanana's iterative Critic loop ensures that proposed architectures are communicated with maximum clarity—critical when reviewers spend 30 seconds per proposal.

5. Teaching Materials and Course Notes

Instructors repeatedly redraw the same concepts semester after semester. PaperBanana enables rapid generation of pedagogical diagrams from textual descriptions, with consistent styling across an entire course's materials.

Step-by-Step Installation & Setup Guide

Prerequisites

PaperBanana requires Python 3.12 and uses uv for fast, reliable package management. You'll also need API access to either Google Gemini or OpenRouter (which provides unified access to OpenAI, Anthropic, and other providers).

Step 1: Clone the Repository

git clone https://github.com/dwzhu-pku/PaperBanana.git
cd PaperBanana

This downloads the complete framework including agent implementations, prompt templates, and configuration templates.

Step 2: Configure API Keys

PaperBanana externalizes all sensitive configuration to prevent accidental credential commits:

# Copy the template configuration file
cp configs/model_config.template.yaml configs/model_config.yaml

Edit configs/model_config.yaml to specify:

defaults.main_model_name: Your chosen vision-language model (e.g., gemini-2.0-flash-exp)
defaults.image_gen_model_name: Your image generation model
At least one API key under api_keys—either google_api_key or openrouter_api_key

Critical: You do not need both keys. Either provider works independently. If both are configured, OpenRouter takes precedence for routing.
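
The "at least one key" rule is easy to check before you launch a long run. The sketch below works on a configuration that has already been loaded into a dictionary (for example with PyYAML's `yaml.safe_load`, which is assumed and not shown). The `pick_provider` helper, the placeholder check on `YOUR_`, and the precedence order (which simply mirrors the statement above) are all assumptions, not code from the repository.

```python
from typing import Any, Dict


def pick_provider(config: Dict[str, Any]) -> str:
    """Return which provider to use, or raise if no usable key is set."""
    keys = config.get("api_keys") or {}

    def usable(name: str) -> bool:
        value = str(keys.get(name) or "").strip()
        # Treat empty values and untouched template placeholders as missing.
        return bool(value) and not value.startswith("YOUR_")

    if usable("openrouter_api_key"):
        return "openrouter"
    if usable("google_api_key"):
        return "google"
    raise ValueError(
        "Set google_api_key or openrouter_api_key in configs/model_config.yaml"
    )


example = {"api_keys": {"google_api_key": "abc123", "openrouter_api_key": ""}}
print(pick_provider(example))
```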

For high-throughput generation (20 candidates in parallel), ensure your API tier supports sufficient concurrency.

Step 3: Download Reference Data

# Create data directory and download PaperBananaBench
mkdir -p data
# Download from https://huggingface.co/datasets/dwzhu/PaperBananaBench
# Extract to data/PaperBananaBench/

The framework gracefully degrades without the dataset—Retriever Agent falls back to zero-shot mode—but reference-driven generation significantly improves quality.
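
If you script your runs, you may want to make that fallback explicit instead of relying on it silently. Here is a small sketch that picks a `--retrieval_setting` value based on whether the dataset directory exists and is non-empty. The helper name and the directory check are assumptions for illustration; the framework's own detection logic may work differently.

```python
from pathlib import Path


def choose_retrieval_setting(data_root: str = "data",
                             dataset_name: str = "PaperBananaBench") -> str:
    """Return "auto" when reference data is present, otherwise "none"."""
    dataset_dir = Path(data_root) / dataset_name
    has_data = dataset_dir.is_dir() and any(dataset_dir.iterdir())
    return "auto" if has_data else "none"


setting = choose_retrieval_setting()
print(f"--retrieval_setting {setting}")
```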

Step 4: Install Environment

# Install uv if not already present
# See: https://docs.astral.sh/uv/getting-started/installation/

# Create virtual environment
uv venv  # Creates .venv/ in current directory

# Activate environment
source .venv/bin/activate  # Linux/Mac
# .venv\Scripts\activate   # Windows

# Install Python 3.12
uv python install 3.12

# Install dependencies
uv pip install -r requirements.txt

Launch Options

Option A: Zero-Setup Web Interface (Recommended)

Skip installation entirely and use the live Hugging Face demo. Enter your API key, paste your method text, and generate.

Option B: Local Gradio App

python app.py

Option C: Streamlit Interactive Demo

streamlit run demo.py

The Streamlit interface provides two tabs: Generate Candidates for parallel creation, and Refine Image for upscaling and modification.

Option D: Command-Line Interface

# Basic execution
python main.py

# Full pipeline with custom parameters
python main.py \
  --dataset_name "PaperBananaBench" \
  --task_name "diagram" \
  --split_name "test" \
  --exp_mode "dev_full" \
  --retrieval_setting "auto"

REAL Code Examples from PaperBanana

Let's walk through the commands and configuration documented for the repository, with detailed explanations of what each piece does.

Example 1: Basic CLI Execution with Full Pipeline

The command-line interface provides the most explicit control over PaperBanana's behavior. Here's the standard invocation for maximum quality generation:

python main.py \
  --dataset_name "PaperBananaBench" \
  --task_name "diagram" \
  --split_name "test" \
  --exp_mode "dev_full" \
  --retrieval_setting "auto"

Breaking this down:

--dataset_name "PaperBananaBench": Specifies the curated benchmark dataset for reference-driven retrieval. The Retriever Agent indexes this collection to find structurally similar diagrams.
--task_name "diagram": Selects between diagram (conceptual architecture figures) and plot (statistical visualizations). Each task type triggers different prompt templates and evaluation metrics.
--split_name "test": Uses the held-out test partition, ensuring no data leakage if you're benchmarking against ground truth.
--exp_mode "dev_full": Activates the complete five-agent pipeline (Retriever → Planner → Stylist → Visualizer → Critic). This is the slowest but highest-quality mode.
--retrieval_setting "auto": Enables automatic reference selection based on semantic similarity. Alternatives include manual (user-selected examples), random (baseline comparison), and none (zero-shot ablation).
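
To see how these flags constrain each other, it can help to mirror them in a small `argparse` parser. This is an illustrative stand-in, not the parser in `main.py`: the flag names and allowed values come from the descriptions above, while the defaults are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative mirror of the flags described above (defaults assumed)."""
    p = argparse.ArgumentParser(description="Sketch of the PaperBanana CLI flags")
    p.add_argument("--dataset_name", default="PaperBananaBench")
    p.add_argument("--task_name", choices=["diagram", "plot"], default="diagram")
    p.add_argument("--split_name", default="test")
    p.add_argument(
        "--exp_mode",
        choices=["vanilla", "dev_planner", "dev_planner_stylist",
                 "dev_planner_critic", "dev_full", "demo_full"],
        default="dev_full",
    )
    p.add_argument(
        "--retrieval_setting",
        choices=["auto", "manual", "random", "none"],
        default="auto",
    )
    return p


# Parse an explicit list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--task_name", "plot", "--exp_mode", "dev_planner"])
print(args)
```

Using `choices` means a typo such as `--exp_mode dev_ful` fails immediately with a readable error rather than partway through a run.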

Example 2: Virtual Environment Setup with uv

The repository uses modern Python tooling for reproducible environments:

uv venv                    # Create isolated environment in .venv/
source .venv/bin/activate  # Activate on Unix systems
uv python install 3.12     # Install a Python 3.12 interpreter
uv pip install -r requirements.txt  # Install dependencies listed in requirements.txt

Why this matters:

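uv is a Rust-based package manager whose dependency resolution and installs are typically much faster than pip's. Note that uv python install 3.12 installs an interpreter rather than pinning one, and the install is only as reproducible as the version constraints in requirements.txt. Because the environment is created before that step, it uses whichever Python uv finds first; if that is not 3.12, uv venv --python 3.12 creates the environment against the version PaperBanana requires. The .venv/ location is gitignored, keeping your environment isolated from the repository state.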

Example 3: Configuration Template Structure

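The setup instructions name the keys but do not show the full file, so the following is a reconstruction of the layout they imply. Treat the model names as placeholders and check configs/model_config.template.yaml for the exact structure: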

# configs/model_config.yaml (user-created from template)
defaults:
  main_model_name: "gemini-2.0-flash-exp"      # Vision-language model for reasoning
  image_gen_model_name: "imagen-3"              # Image generation backend

api_keys:
  google_api_key: "YOUR_GEMINI_KEY_HERE"        # Optional: direct Gemini access
  openrouter_api_key: "YOUR_OPENROUTER_KEY_HERE" # Optional: unified API gateway

Design rationale:

The template-based configuration prevents credential leakage (.gitignore excludes model_config.yaml). The dual-provider design adds resilience: if one service is rate-limited, you can fall back to the other. The defaults section centralizes model selection, making it trivial to experiment with newer releases without hunting through source code.

Example 4: Streamlit Launch for Interactive Development

For iterative figure refinement, the Streamlit demo provides superior UX:

streamlit run demo.py

This launches a local web server exposing two critical workflows:

Generate Candidates Tab:

Accepts Markdown-formatted method sections and free-text figure captions
Configures pipeline depth (vanilla through dev_full)
Sets parallelism level (up to 20 concurrent generations)
Displays results in sortable grid with evolution timeline visualization

Refine Image Tab:

Accepts uploaded candidates or external diagrams
Processes natural language modification requests
Renders 2K/4K upscaled outputs with configurable aspect ratios

The evolution timeline is particularly powerful—it visualizes how each figure transformed through Planner (textual structuring), Stylist (aesthetic refinement), and Critic (quality assurance) stages, providing transparency into the generation process.

Example 5: Pipeline Evolution Visualization

For debugging and research purposes, PaperBanana includes dedicated visualization tools:

streamlit run visualize/show_pipeline_evolution.py

This renders the intermediate states of each agent, enabling:

Retriever analysis: Which references influenced generation?
Planner verification: Did textual descriptions capture all architectural components?
Stylist inspection: How were color palettes and layouts transformed?
Critic feedback: What defects triggered iterative refinement?

The evaluation companion tool provides quantitative metrics:

streamlit run visualize/show_referenced_eval.py

Advanced Usage & Best Practices

Optimizing for Your Domain

PaperBananaBench currently emphasizes computer science papers. For other domains, consider:

Building custom reference sets: Collect 50-100 exemplary figures from your target venue, organized by subfield
Fine-tuning retrieval: The manual retrieval setting lets you pre-select references that match your conventions
Style guide synthesis: Run style_guides/generate_category_style_guide.py on your collection to extract domain-specific aesthetic rules

API Cost Management

Multi-agent pipelines with 20 parallel candidates and 5 Critic rounds can consume substantial tokens. Strategies:

Start with dev_planner mode for rapid iteration, escalate to dev_full only for final figures
Use OpenRouter's routing to select cost-optimized models for initial candidates, reserving premium models for refinement
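Cache successful reference retrievals so repeated runs on the same method text and caption skip that step (this assumes the Retriever Agent returns the same references for identical inputs, which is worth verifying for your model settings)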
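
A back-of-the-envelope call count makes the difference between modes tangible. The estimate below rests on assumptions that are not documented behavior: one model call per agent stage, and two calls per Critic round (one critique plus one re-render). Real token costs also depend on prompt and image sizes, which this ignores.

```python
def estimate_calls(candidates: int = 20, critic_rounds: int = 5,
                   base_stage_calls: int = 4, calls_per_round: int = 2) -> int:
    """Worst-case model calls: every candidate uses every Critic round."""
    return candidates * (base_stage_calls + critic_rounds * calls_per_round)


# Full pipeline: 20 * (4 + 5 * 2) = 280 calls
full = estimate_calls()
# dev_planner: three stages, no Critic loop: 20 * 3 = 60 calls
quick = estimate_calls(critic_rounds=0, base_stage_calls=3)
print(f"dev_full ~ {full} calls, dev_planner ~ {quick} calls")
```

Under these assumptions the full pipeline costs several times as many calls as dev_planner, which is the case for iterating in the cheaper mode first.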

Integration with LaTeX Workflows

For seamless publication integration:

Generate candidates in 4:3 or 16:9 aspect ratios matching your template
Export refined selections as high-resolution PNG
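Keep in mind that the output is a raster image: convert (ImageMagick) typically just embeds the bitmap in an SVG or PDF container, and pdf2svg expects a PDF as input, so neither yields true vector art. If your venue requires vector figures, trace the PNG (for example with Inkscape's Trace Bitmap) or redraw the final figure in a vector tool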
The Critic Agent's feedback often suggests caption improvements—incorporate these directly
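
Matching a template comes down to simple arithmetic: printed width times DPI gives the pixel width you need, and the aspect ratio gives the height. The helper below is a hypothetical convenience, and the 3.25 in and 6.5 in widths are example values rather than any venue's official numbers.

```python
from typing import Tuple


def pixel_size(width_in: float, dpi: int = 300,
               aspect: Tuple[int, int] = (4, 3)) -> Tuple[int, int]:
    """Pixels needed to print a figure width_in inches wide at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(width_px * aspect[1] / aspect[0])
    return width_px, height_px


single_column = pixel_size(3.25)                 # 4:3 figure, one column
full_width = pixel_size(6.5, aspect=(16, 9))     # 16:9 figure, full text width
print(single_column, full_width)
```

A full-width figure at 300 DPI needs roughly 1950 pixels across under these example widths, so compare that against the resolution you export before settling on a refinement size.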

Collaborative Refinement

The Polish Agent (polish_agent.py) supports iterative human-AI collaboration. After automated generation, provide natural language feedback ("make the attention mechanism more prominent") and re-run specific pipeline stages rather than full regeneration.

Comparison with Alternatives

Feature	PaperBanana	DALL-E 3	Midjourney	Manual (PowerPoint/Illustrator)
Scientific Accuracy	✅ Explicit reasoning	❌ Generic	❌ Generic	✅ Human-verified
Reference Learning	✅ Curated academic examples	❌ None	❌ None	❌ Manual imitation
Iterative Refinement	✅ Closed-loop Critic	⚠️ Limited	⚠️ Limited	✅ Manual iteration
Style Consistency	✅ Automatic guidelines	❌ Ad-hoc	❌ Ad-hoc	⚠️ Manual enforcement
Batch Generation	✅ 20 parallel candidates	❌ Sequential	❌ Sequential	❌ Sequential
Cost	API usage only	API usage only	Subscription	Time-intensive
Open Source	✅ Apache-2.0	❌ Proprietary	❌ Proprietary	N/A
Customizability	✅ Modular agents	❌ Black-box	❌ Black-box	✅ Full control

When to choose PaperBanana:

You need semantically accurate technical diagrams, not artistic impressions
You're generating multiple figures under consistent style constraints
You value reproducibility and version control in your workflow
You want to automate without surrendering quality oversight

When alternatives win:

Purely decorative/artistic illustrations without technical constraints
One-off figures where setup overhead exceeds time savings
Venues with extremely specific template requirements not yet captured in reference sets

FAQ

Is PaperBanana free to use?

The framework itself is Apache-2.0 licensed and completely free. You pay only for API calls to your chosen image generation and language model providers. The Hugging Face Spaces demo has no additional hosting charges.

Do I need programming experience to use PaperBanana?

Not at all. The Hugging Face Spaces demo requires zero code. For local installation, basic command-line familiarity suffices—the README provides copy-paste commands.

Which API provider should I choose—Google or OpenRouter?

OpenRouter offers flexibility: access multiple providers (OpenAI, Anthropic, Google) through one key with intelligent routing. Google Gemini provides direct access potentially at lower latency. Either works; configure whichever matches your existing API credits.

Can PaperBanana generate statistical plots, or only architecture diagrams?

Both! The --task_name parameter accepts diagram or plot. Plot generation code is actively being expanded—check the TODO list for latest capabilities. The framework structure supports arbitrary visualization types through modular agent design.

How does PaperBanana handle complex multi-panel figures?

The Planner Agent decomposes complex descriptions into sub-figures, and the Critic Agent verifies spatial relationships. For intricate layouts, manual assembly of individually-generated panels may still be necessary, though automated composition is on the roadmap.

Is commercial use permitted?

The maintainers explicitly state no commercial plans for the project itself. However, note that Google has filed patents on the core workflows developed during the original internship. While this doesn't restrict open-source research use, third-party commercial applications using similar logic may face IP considerations.

How can I contribute to PaperBanana?

The project welcomes code contributions, bug reports, and reference dataset expansions. See the all-contributors table in the README for current contributors, and open issues or pull requests on the GitHub repository.

Conclusion

PaperBanana represents a genuine inflection point in how researchers produce visual communication. By encoding the expertise of professional scientific illustrators into a reproducible, open-source multi-agent system, it democratizes access to publication-quality figures that previously required specialized training or expensive design services.

The five-agent architecture isn't marketing fluff—it's a principled decomposition of how expert humans actually create scientific illustrations: finding references, planning compositions, applying stylistic conventions, executing visuals, and critically refining. That PaperBanana automates this pipeline while maintaining semantic accuracy is remarkable.

Is it perfect? Not yet. The TODO list reveals active development in statistical plots, manual example selection, and expanded domain coverage. But the trajectory is clear: academic illustration is becoming a solved problem.

My recommendation? Stop drawing diagrams by hand today. Start with the Hugging Face demo for immediate results, then clone the GitHub repository when you need deeper customization. Your future self—staring at a camera-ready paper with gorgeous figures generated in minutes—will thank you.

The era of 2 AM PowerPoint sessions is ending. Join the researchers who've already moved on.

Star the repository, try the demo, and share your most impressive generated figures with the community. The future of academic visualization is open source—and it's already here.