Edit-Banana: Stop Redrawing Diagrams by Hand

What if you could turn any static image into a fully editable diagram in seconds? No more painstakingly rebuilding flowcharts from screenshots. No more tracing architecture diagrams pixel by pixel. The secret weapon top technical writers and developers are using is finally exposed—and it's called Edit-Banana.

Here's the brutal truth: every developer has been there. You inherit documentation with embedded diagrams that are impossible to modify. You find the perfect technical reference—but it's locked in a PNG. You spend hours in DrawIO or Figma reconstructing what someone else already built. That friction kills productivity. That friction is why teams abandon documentation updates. And that friction is exactly what Edit-Banana obliterates.

Built by BIT-DataLab and powered by SAM 3 (Segment Anything Model 3) alongside cutting-edge multimodal large language models, Edit-Banana performs high-fidelity reconstruction that preserves original diagram details, logical relationships, color schemes, and even mathematical formulas. The output? Native DrawIO XML you can drag, drop, and edit immediately. This isn't OCR with extra steps. This is structural intelligence that understands what a diagram means, not just what it looks like.

Ready to reclaim your time? Let's peel back everything this framework delivers.

What Is Edit-Banana?

Edit-Banana is an open-source Universal Content Re-Editor designed to transform static, uneditable visual content into fully manipulatable digital assets. Born from the Beijing Institute of Technology's DataLab, this framework targets a deceptively simple problem with profound engineering complexity: how do you reverse-engineer a rendered image back into its structural components?

The repository lives at https://github.com/BIT-DataLab/Edit-Banana, where it's rapidly gaining traction among developers, technical writers, and AI researchers. Its core mission—"Make the Uneditable, Editable"—addresses a genuine pain point in modern workflows where visual knowledge is trapped in static formats.

What makes Edit-Banana genuinely trending right now? Three forces converge:

The multimodal AI explosion: With SAM 3 delivering state-of-the-art image segmentation and VLMs achieving unprecedented visual reasoning, the technical prerequisites finally exist.
Documentation-as-code maturity: Teams now expect diagrams to live in version control, be diffable, and be collaboratively editable—impossible with raster images.
The hybrid work imperative: Remote teams need to iterate on shared visuals without source files. Edit-Banana bridges that gap.

Unlike simple vectorization tools that trace edges blindly, Edit-Banana employs a semantic reconstruction pipeline. It doesn't just see pixels—it identifies shapes, recognizes text and formulas, understands spatial hierarchies, and reconstructs the logic of the original diagram. The result preserves stroke styles, arrow types (dashed, thick, directional), color matching, and element grouping. Your converted flowchart doesn't just look right—it's structurally right.

Key Features That Make Edit-Banana Insane

Let's dissect what separates Edit-Banana from every "image to vector" tool you've abandoned before.

Fine-Tuned SAM 3 Segmentation

Edit-Banana doesn't use off-the-shelf segmentation. The team fine-tuned SAM 3's mask decoder specifically for diagram elements. This matters because generic segmentation models trained on natural images fail catastrophically on technical diagrams—confusing borders for shapes, merging adjacent boxes, or fragmenting connected components. The fine-tuned model understands diagram grammar: boxes contain text, arrows connect nodes, groups have hierarchical boundaries.

Fixed Multi-Round VLM Scanning

Here's where it gets clever. After segmentation, a multimodal large language model performs structured extraction through fixed multi-round scanning. Instead of one-shot prediction that hallucinates or misses details, the system iteratively probes the image—verifying element relationships, confirming text content, and resolving ambiguities. This dramatically reduces error rates on complex diagrams with overlapping elements or unconventional layouts.
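
To make the idea concrete, here is a minimal sketch of a fixed multi-round loop. Everything in it is an assumption for illustration: the round names, the `scan_in_rounds` helper, and the `ask_vlm` callable are hypothetical and are not taken from Edit-Banana's code. The point is only the shape of the technique: a fixed sequence of focused questions, each one seeing what earlier rounds established.

```python
from typing import Callable, Dict, List

# Hypothetical round names; the real pipeline's rounds are not documented here.
ROUNDS: List[str] = ["elements", "text", "relations"]


def scan_in_rounds(
    image_id: str,
    ask_vlm: Callable[[str, str, Dict[str, list]], list],
    rounds: List[str] = ROUNDS,
) -> Dict[str, list]:
    """Run a fixed sequence of focused queries, feeding earlier answers forward."""
    findings: Dict[str, list] = {}
    for name in rounds:
        # Each round receives a copy of everything found so far, so a later
        # question (such as relations) can be checked against known elements.
        findings[name] = ask_vlm(image_id, name, dict(findings))
    return findings


def fake_vlm(image_id: str, round_name: str, context: Dict[str, list]) -> list:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    if round_name == "elements":
        return ["box_a", "box_b"]
    if round_name == "text":
        return [f"label for {e}" for e in context["elements"]]
    return [(context["elements"][0], context["elements"][1])]


result = scan_in_rounds("diagram.png", fake_vlm)
print(result["relations"])  # -> [('box_a', 'box_b')]
```

Because the number of rounds is fixed, cost and latency stay predictable, which is one plausible reason to prefer this over an open-ended loop.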

Dual-Engine Text Recognition

Text handling is where most conversion tools die. Edit-Banana deploys a sophisticated dual strategy:

Local OCR via Tesseract: Fast, offline, privacy-preserving text localization and recognition. Supports multiple languages including Chinese (tesseract-ocr-chi-sim).
Pix2Text for Mathematical Formulas: Specialized engine that recognizes mathematical notation and converts to LaTeX. The Crop-Guided Strategy extracts high-resolution regions around formulas and sends only those crops to the formula engine—preserving accuracy without overwhelming the model with full-image context.
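
The cropping step of a crop-guided approach can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the `padded_crop_box` name and the default padding of 16 pixels are made up here. It grows a detected formula box slightly and clamps it to the image so the crop never reaches outside the picture.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def padded_crop_box(box: Box, image_size: Tuple[int, int], pad: int = 16) -> Box:
    """Grow a detected box by `pad` pixels per side, clamped to the image bounds."""
    left, top, right, bottom = box
    width, height = image_size
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )


print(padded_crop_box((100, 50, 300, 90), (310, 200)))  # -> (84, 34, 310, 106)
```

The returned tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` expects, so the crop taken from the full-resolution image can be handed straight to the formula engine.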

Production-Grade User System

The web deployment at editbanana.net includes enterprise features:

Credit-based access control: New users receive 10 free credits; pay-per-use prevents resource abuse
Multi-user concurrency: Global Lock mechanism ensures thread-safe GPU access when multiple users submit simultaneously
LRU Cache for embeddings: Image embeddings persist across requests, eliminating redundant SAM 3 inference and slashing latency for repeat conversions
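
The lock and the cache fit together roughly like this. Treat it as a sketch of the general pattern rather than the code behind editbanana.net: the `EmbeddingCache` class, hashing the image bytes as the cache key, and the `fake_encoder` stand-in are all assumptions.

```python
import hashlib
import threading
from collections import OrderedDict
from typing import Callable


class EmbeddingCache:
    """Tiny LRU cache keyed by image content, guarded by one shared lock."""

    def __init__(self, compute: Callable[[bytes], object], max_items: int = 8):
        self._compute = compute        # expensive call, e.g. a GPU image encoder
        self._max_items = max_items
        self._items: "OrderedDict[str, object]" = OrderedDict()
        self._lock = threading.Lock()  # serializes cache updates and GPU access

    def get(self, image_bytes: bytes) -> object:
        key = hashlib.sha256(image_bytes).hexdigest()
        with self._lock:
            if key in self._items:
                self._items.move_to_end(key)     # mark as most recently used
                return self._items[key]
            value = self._compute(image_bytes)   # one thread computes at a time
            self._items[key] = value
            if len(self._items) > self._max_items:
                self._items.popitem(last=False)  # evict the least recently used
            return value


calls = []


def fake_encoder(data: bytes) -> int:
    """Stand-in for SAM 3 inference; records each call so hits are visible."""
    calls.append(data)
    return len(data)


cache = EmbeddingCache(fake_encoder, max_items=2)
cache.get(b"a")
cache.get(b"a")   # served from the cache, no second encoder call
cache.get(b"bb")
print(len(calls))  # -> 2
```

Holding a single lock around the compute step is the simplest way to keep concurrent requests off the GPU at the same moment. The trade-off is that requests queue behind each other.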

Use Cases Where Edit-Banana Absolutely Dominates

1. Legacy Documentation Revival

Your company has 500 pages of Confluence docs with embedded PNG diagrams from a tool nobody licenses anymore. Updating a single arrow requires rebuilding the entire graphic. Edit-Banana converts these en masse to editable DrawIO format—suddenly your documentation is maintainable again.

2. Academic Paper Figure Extraction

Researchers constantly need to modify figures from prior work—adapt a methodology diagram, extend a model architecture, compare approaches side-by-side. Edit-Banana's LaTeX formula preservation means mathematical expressions remain editable, not flattened into unchangeable images.

3. Competitive Analysis & Benchmarking

Product teams screenshot competitor flows, architecture diagrams, or UI patterns. Instead of recreating them from scratch for internal analysis, Edit-Banana reconstructs the underlying structure, enabling rapid annotation, modification, and presentation without reusing the original image directly.

4. Automated Slide Deck Generation

Marketing receives design files as flattened PDFs. With Edit-Banana, extract individual diagrams as editable slides. The color matching preservation ensures brand consistency while unlocking the ability to tweak messaging, update statistics, or localize content for different markets.

5. Human-in-the-Loop Refinement

Not every conversion is perfect. Edit-Banana's pipeline produces immediately editable output, and the web interface supports manual repair, element adjustment, and local saving. The GIF demonstrations in the repository show users cutting, modifying, and saving corrections.

Step-by-Step Installation & Setup Guide

Ready to run Edit-Banana locally? The setup has three phases. Follow precisely—GPU acceleration is strongly recommended for acceptable performance.

Phase 1: Environment & Base Setup

Prerequisites: Python 3.10+, CUDA-capable GPU, Linux/macOS (Windows possible with WSL).

Install PyTorch with CUDA support (example for CUDA 11.8):

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

Clone and initialize directories:

git clone https://github.com/BIT-DataLab/Edit-Banana.git
cd Edit-Banana
mkdir -p input output sam3_output

Phase 2: Models & Core Dependencies

Install base Python requirements:

pip install -r requirements.txt

SAM3 Library & BPE Vocabulary: Run the setup script to install the SAM3 library and copy the BPE tokenizer to models/:

bash scripts/setup_sam3.sh

Verify installation:

python -c "from sam3.model_builder import build_sam3_image_model; print('OK')"

Download SAM3 Weights: Obtain sam3.pt from ModelScope or Hugging Face and place under models/sam3_ms/.

Install Tesseract OCR (Ubuntu/Debian):

sudo apt install tesseract-ocr tesseract-ocr-chi-sim

Optional Enhancements:

# Alternative OCR engine (better mixed-language support)
pip install paddlepaddle==3.2.2 paddleocr

# Mathematical formula recognition
pip install pix2text onnxruntime-gpu

# Background removal capability
pip install onnxruntime modelscope
python scripts/setup_rmbg.py

Phase 3: Configuration

Copy and customize configuration:

cp config/config.yaml.example config/config.yaml

Edit config/config.yaml to set:

sam3.checkpoint_path: path to models/sam3_ms/sam3.pt
sam3.bpe_path: path to BPE vocab in models/

Troubleshooting Checklist:

Config paths match actual file locations
SAM3 weights and BPE vocab present in models/
SAM3 library extracted via setup script
OCR engine installed (Tesseract or PaddleOCR)

Common fix for GPU errors: set sam3.device: "cpu" in config if your GPU architecture is incompatible with compiled CUDA kernels.
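
The checklist above can be turned into a small preflight script. This is a sketch, not something shipped with the repository: the `preflight` function is hypothetical, and you pass it whatever paths you set in config/config.yaml.

```python
import shutil
from pathlib import Path
from typing import List


def preflight(checkpoint: str, bpe_vocab: str, ocr_binary: str = "tesseract") -> List[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    if not Path(checkpoint).is_file():
        problems.append(f"SAM3 checkpoint not found: {checkpoint}")
    if not Path(bpe_vocab).is_file():
        problems.append(f"BPE vocab not found: {bpe_vocab}")
    if shutil.which(ocr_binary) is None:
        problems.append(f"OCR binary not on PATH: {ocr_binary}")
    return problems
```

Call it with the values of sam3.checkpoint_path and sam3.bpe_path and print anything it returns. An empty list means the file-level items on the checklist are satisfied. It does not verify that the SAM3 library imports, which the `python -c` check from Phase 2 already covers.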

REAL Code Examples from the Repository

Let's examine the actual commands and project layout from Edit-Banana's repository, with commentary on what each one accomplishes.

Example 1: Basic CLI Conversion

The simplest entry point processes a single image through the full pipeline:

python main.py -i input/test_diagram.png

This command triggers the complete workflow: SAM 3 segmentation → text extraction → XML generation. Output lands in output/<image_stem>/ containing the DrawIO XML plus intermediate processing artifacts. For batch operations, omit the -i flag—every image in input/ gets processed sequentially.
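
If you want per-image exit codes instead of one sequential run, a thin driver around the documented `python main.py -i <image>` command is enough. This is a sketch under assumptions: `build_commands` and `run_all` are hypothetical helpers, and the suffix list is a guess at how the supported formats are spelled on disk.

```python
import subprocess
import sys
from pathlib import Path
from typing import Dict, List

# Assumed spellings of the supported formats; adjust to your files.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".webp"}


def build_commands(input_dir: str) -> List[List[str]]:
    """One `python main.py -i <image>` command per image, in sorted order."""
    images = sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    return [[sys.executable, "main.py", "-i", str(p)] for p in images]


def run_all(input_dir: str = "input") -> Dict[str, int]:
    """Run each conversion in turn and record its exit code instead of stopping."""
    results = {}
    for cmd in build_commands(input_dir):
        results[cmd[-1]] = subprocess.run(cmd).returncode
    return results
```

Run `run_all()` from the repository root so that `main.py` resolves. Any image that maps to a non-zero exit code is a candidate for a retry or manual inspection.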

What's happening under the hood? The main.py entry point orchestrates modular components defined in the project structure. It loads configuration from config/config.yaml, initializes the SAM 3 model with your specified weights, runs the segmentation pipeline, spawns parallel OCR processes, and merges spatial and textual data into valid DrawIO XML.
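
The last step, merging positions and text into DrawIO XML, is easier to picture with a small example. The sketch below is not Edit-Banana's generator: `to_drawio` is a hypothetical function, and the style string is just a plain rectangle. It writes the uncompressed draw.io layout, with the two structural cells (ids 0 and 1) followed by one vertex per detected element.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Element = Tuple[str, int, int, int, int]  # (text, x, y, width, height)


def to_drawio(elements: List[Element]) -> str:
    """Emit an uncompressed draw.io document with one rectangle per element."""
    mxfile = ET.Element("mxfile")
    diagram = ET.SubElement(mxfile, "diagram", name="Page-1")
    model = ET.SubElement(diagram, "mxGraphModel")
    root = ET.SubElement(model, "root")
    ET.SubElement(root, "mxCell", id="0")              # required root cell
    ET.SubElement(root, "mxCell", id="1", parent="0")  # default layer
    for index, (text, x, y, w, h) in enumerate(elements, start=2):
        cell = ET.SubElement(
            root, "mxCell", id=str(index), value=text,
            style="rounded=0;whiteSpace=wrap;html=1;", vertex="1", parent="1",
        )
        # "as" is a Python keyword, so it is passed through a dict.
        ET.SubElement(
            cell, "mxGeometry", x=str(x), y=str(y),
            width=str(w), height=str(h), **{"as": "geometry"},
        )
    return ET.tostring(mxfile, encoding="unicode")


xml_text = to_drawio([("Start", 40, 40, 120, 60), ("End", 40, 160, 120, 60)])
```

Saving `xml_text` to a `.drawio` file should give two editable boxes in diagrams.net. The real output also has to carry arrows, colors, and formulas, which this sketch leaves out.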

Example 2: Complete Local Development Setup

The README provides this comprehensive initialization sequence:

# Clone repository and enter directory
git clone https://github.com/BIT-DataLab/Edit-Banana.git && cd Edit-Banana

# Create isolated Python environment
python3 -m venv .venv && source .venv/bin/activate
# Windows users: .venv\Scripts\activate

# Install PyTorch with CUDA 11.8 (adjust for your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# Install all Python dependencies
pip install -r requirements.txt

# Install system OCR engine with Chinese support
sudo apt install tesseract-ocr tesseract-ocr-chi-sim

After model setup, create the working directories and the configuration file:

# Create required directories
mkdir -p input output

# Copy and edit configuration
cp config/config.yaml.example config/config.yaml
# Edit config/config.yaml: set sam3.checkpoint_path and sam3.bpe_path

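Critical insight: The virtual environment isolates the project's dependencies from system Python packages. The explicit index URL pins PyTorch to a specific CUDA build (11.8 here). Without it, pip installs the platform's default wheel, which can be CPU-only (on Windows, for example) or built against a CUDA version your driver does not support, and CPU-only SAM 3 inference is unbearably slow.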

Example 3: Testing the Web API

Edit-Banana exposes a FastAPI backend for service-oriented deployments:

# Terminal 1: Start the server
python server_pa.py

# Terminal 2: Test with curl
curl -X POST http://localhost:8000/convert -F "file=@input/test.png"

Alternatively, open http://localhost:8000/docs for interactive Swagger documentation—upload files directly through the browser interface.
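
The same request can be made from Python with only the standard library. This is a sketch: the `/convert` path and the `file` field come from the curl call above, while `build_multipart` and `convert` are hypothetical helpers. The response format is not described here, so the function simply returns the raw bytes.

```python
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart(field: str, filename: str, payload: bytes) -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def convert(image_path: str, url: str = "http://localhost:8000/convert") -> bytes:
    """POST an image to the /convert endpoint and return the raw response body."""
    with open(image_path, "rb") as fh:
        body, content_type = build_multipart("file", os.path.basename(image_path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # Generous timeout because a conversion includes model inference.
    with urllib.request.urlopen(request, timeout=300) as response:
        return response.read()
```

With the server from Terminal 1 running, `convert("input/test.png")` mirrors the curl command. Check the Swagger page for the actual response schema before parsing the result.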

Architecture note: server_pa.py implements the production user system with credit tracking, concurrent request handling via Global Locks, and LRU caching of image embeddings. This isn't a toy server—it's designed for real multi-user workloads where GPU contention would otherwise cause failures.

Example 4: Project Structure Navigation

Understanding the codebase organization enables customization:

Edit-Banana/
├── config/               # Configuration files
├── flowchart_text/       # OCR & Text Extraction Module
│   ├── src/
│   └── main.py             # OCR-only entry point
├── input/                # [Manual] Input images
├── models/               # [Manual] Model weights
├── output/               # [Manual] Results
├── sam3/                 # SAM3 library (from facebookresearch)
├── sam3_service/         # SAM3 HTTP service (multi-process)
├── scripts/              # Setup utilities
│   ├── setup_sam3.sh
│   ├── setup_rmbg.py
│   └── merge_xml.py
├── main.py               # CLI entry (full pipeline)
├── server_pa.py          # FastAPI backend
└── requirements.txt

Key design decisions revealed: The separation of flowchart_text/ as a standalone module means you can run OCR extraction independently—useful for debugging text issues without re-running expensive segmentation. The sam3_service/ HTTP wrapper enables multi-process deployments where SAM 3 runs in dedicated workers, preventing GIL contention in Python's async server.

Advanced Usage & Best Practices

Configuration Tuning

The config/config.yaml exposes critical parameters for optimization:

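SAM 3 thresholds: Adjust score_threshold and nms_threshold based on your diagram density. Assuming nms_threshold is the usual IoU cutoff for non-maximum suppression, dense technical schematics need a higher value so that tightly packed neighboring elements are not discarded as duplicates. Lower it if the same element is detected more than once.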
Max iteration loops: Control segmentation refinement depth—trade accuracy for speed.
Dominant color sensitivity: Fine-tune for diagrams with subtle color variations versus high-contrast designs.
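
To see what the two SAM 3 thresholds trade off, here is a generic score filter followed by greedy non-maximum suppression. It is a textbook sketch, not Edit-Banana's code, and it assumes nms_threshold is an IoU cutoff above which the lower-scoring box is dropped. `filter_detections` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(
    boxes: Sequence[Box],
    scores: Sequence[float],
    score_threshold: float,
    nms_threshold: float,
) -> List[int]:
    """Indices kept after score filtering and greedy NMS, best score first."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[k]) <= nms_threshold for k in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 0, 30, 10), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.2]
print(filter_detections(boxes, scores, 0.3, 0.5))  # -> [0, 2]
print(filter_detections(boxes, scores, 0.3, 0.7))  # -> [0, 1, 2]
```

The first two boxes overlap with an IoU of about 0.68, so a cutoff of 0.5 treats them as duplicates while 0.7 keeps both. That is the knob to turn when tightly packed elements go missing or show up twice.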

Performance Optimization

Enable LRU caching: In production, cached embeddings reduce repeat conversion latency by 60-80% when an image that has already been processed is submitted again.
Use PaddleOCR for mixed scripts: When diagrams contain both English and Chinese, PaddleOCR's multilingual model outperforms Tesseract significantly.
GPU memory management: SAM 3 is memory-hungry. For batch processing, monitor nvidia-smi and reduce batch size if encountering OOM errors.

Integration Patterns

CI/CD documentation pipelines: Hook main.py into your docs build process—automatically convert designer-delivered PNGs to editable assets.
Webhook architecture: Deploy server_pa.py behind nginx with rate limiting, integrating with your organization's authentication system.
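
For the CI/CD pattern, the build step mostly needs to know which images still lack a conversion. A minimal sketch, assuming the output/<image_stem>/ layout described earlier and PNG inputs (`pending_images` is a hypothetical helper):

```python
from pathlib import Path
from typing import List


def pending_images(input_dir: str, output_dir: str, suffix: str = ".png") -> List[Path]:
    """Images in input_dir that have no <output_dir>/<image_stem>/ directory yet."""
    out = Path(output_dir)
    return sorted(
        p for p in Path(input_dir).glob(f"*{suffix}")
        if not (out / p.stem).is_dir()
    )
```

A docs build can loop over `pending_images("input", "output")` and call `python main.py -i <image>` for each entry, so unchanged diagrams are not reprocessed on every run.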

Comparison with Alternatives

Feature	Edit-Banana	Adobe Illustrator Image Trace	Potrace	Online OCR Tools
Semantic understanding	✅ SAM 3 + VLM reasoning	❌ Edge tracing only	❌ Path approximation	❌ Text only
Formula recognition	✅ LaTeX via Pix2Text	❌ Flattened	❌ Unsupported	⚠️ Limited
Editable output format	✅ Native DrawIO XML	⚠️ AI format	❌ SVG (no text edit)	❌ DOCX/PDF
Self-hostable	✅ Fully open source	❌ Proprietary	✅ Open source	❌ SaaS only
Arrow/style preservation	✅ 1:1 restoration	⚠️ Simplified	❌ Lost	❌ Ignored
GPU acceleration	✅ CUDA optimized	✅ Yes	❌ CPU only	❌ N/A
Cost	Free (Apache 2.0)	$20-55/month	Free	Per-page fees

The verdict: Edit-Banana occupies a unique position—it's the only open-source tool combining deep learning segmentation, multimodal reasoning, mathematical formula extraction, and native diagram editor output. Illustrator produces vectors without structure. Potrace creates paths without semantics. Online OCR captures text without layout. Edit-Banana reconstructs meaning.

FAQ

Q: Is Edit-Banana completely free to use? A: Yes! The core framework is Apache 2.0 licensed, permitting commercial use and modification. The web service at editbanana.net offers free credits with paid tiers for heavy usage.

Q: What image formats does Edit-Banana support? A: PNG, JPG, BMP, TIFF, and WebP. Output is standard DrawIO XML compatible with diagrams.net and draw.io desktop.

Q: Can I run Edit-Banana without a GPU? A: Technically yes—set sam3.device: "cpu" in config. However, SAM 3 inference becomes extremely slow. A CUDA-capable GPU is strongly recommended for practical use.

Q: How accurate is the formula recognition? A: The Crop-Guided Strategy with Pix2Text achieves high accuracy on standard mathematical notation. Complex handwritten formulas or highly stylized notation may require manual correction.

Q: Is my data private when using the web service? A: The local deployment keeps all processing on your infrastructure. For the web service, review BIT-DataLab's current privacy policy—when in doubt, self-host.

Q: Can I contribute to the project? A: Absolutely! The repository welcomes issues, discussions, and pull requests. Check the contribution guidelines for branch naming conventions.

Q: What's on the development roadmap? A: Intelligent arrow connection (in development), DrawIO template adaptation, batch export optimization, and local VLM deployment for fully offline operation.

Conclusion

Edit-Banana isn't merely a convenience tool—it's a paradigm shift in how we treat visual knowledge. Static images have trapped technical content for decades, creating friction that slows teams, buries institutional knowledge, and makes documentation decay inevitable. By combining SAM 3's pixel-perfect segmentation with multimodal AI reasoning, Edit-Banana performs something previously impossible: genuine semantic reconstruction that preserves not just appearance but editability.

The framework is production-ready today, with a clear roadmap toward even greater autonomy. Whether you're reviving legacy documentation, accelerating research workflows, or building automated content pipelines, Edit-Banana delivers capabilities that previously required expensive proprietary tools—or simply didn't exist.

My assessment? This is the most significant open-source release in document intelligence this year. The engineering is thoughtful, the use cases are genuine, and the Apache 2.0 license removes adoption barriers entirely.

Stop redrawing. Start converting. ⭐ Star the repository, clone it locally, or test instantly at https://www.editbanana.net/. Your future self—the one not manually tracing arrows at 2 AM—will thank you.

Get the code: https://github.com/BIT-DataLab/Edit-Banana
