Edit-Banana: Stop Redrawing Diagrams by Hand
What if you could turn any static image into a fully editable diagram in seconds? No more painstakingly rebuilding flowcharts from screenshots. No more tracing architecture diagrams pixel by pixel. That is the job Edit-Banana is built for.
Here's the brutal truth: every developer has been there. You inherit documentation with embedded diagrams that are impossible to modify. You find the perfect technical reference—but it's locked in a PNG. You spend hours in DrawIO or Figma reconstructing what someone else already built. That friction kills productivity. That friction is why teams abandon documentation updates. And that friction is exactly what Edit-Banana obliterates.
Built by BIT-DataLab and powered by SAM 3 (Segment Anything Model 3) alongside cutting-edge multimodal large language models, Edit-Banana performs high-fidelity reconstruction that preserves original diagram details, logical relationships, color schemes, and even mathematical formulas. The output? Native DrawIO XML you can drag, drop, and edit immediately. This isn't OCR with extra steps. This is structural intelligence that understands what a diagram means, not just what it looks like.
Ready to reclaim your time? Let's peel back everything this framework delivers.
What Is Edit-Banana?
Edit-Banana is an open-source Universal Content Re-Editor designed to transform static, uneditable visual content into fully manipulatable digital assets. Born from the Beijing Institute of Technology's DataLab, this framework targets a deceptively simple problem with profound engineering complexity: how do you reverse-engineer a rendered image back into its structural components?
The repository lives at https://github.com/BIT-DataLab/Edit-Banana, where it's rapidly gaining traction among developers, technical writers, and AI researchers. Its core mission—"Make the Uneditable, Editable"—addresses a genuine pain point in modern workflows where visual knowledge is trapped in static formats.
Why is Edit-Banana getting attention right now? Three forces converge:
- The multimodal AI explosion: With SAM 3 delivering state-of-the-art image segmentation and VLMs achieving unprecedented visual reasoning, the technical prerequisites finally exist.
- Documentation-as-code maturity: Teams now expect diagrams to live in version control, be diffable, and be collaboratively editable—impossible with raster images.
- The hybrid work imperative: Remote teams need to iterate on shared visuals without source files. Edit-Banana bridges that gap.
Unlike simple vectorization tools that trace edges blindly, Edit-Banana employs a semantic reconstruction pipeline. It doesn't just see pixels—it identifies shapes, recognizes text and formulas, understands spatial hierarchies, and reconstructs the logic of the original diagram. The result preserves stroke styles, arrow types (dashed, thick, directional), color matching, and element grouping. Your converted flowchart doesn't just look right—it's structurally right.
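To make "native DrawIO XML" concrete, here is a minimal sketch of the file format itself, built with Python's standard library. It is not Edit-Banana's generator: the `build_drawio` helper, the node ids, and the style strings are illustrative assumptions. Only the mxfile / mxGraphModel / mxCell layout reflects what draw.io reads.

```python
import xml.etree.ElementTree as ET


def build_drawio(nodes, edges):
    """Build a minimal draw.io document (illustrative helper, not project code).

    nodes: dict of cell id -> (label, x, y, width, height)
    edges: list of (source_id, target_id, dashed)
    """
    mxfile = ET.Element("mxfile")
    diagram = ET.SubElement(mxfile, "diagram", {"name": "Page-1"})
    model = ET.SubElement(diagram, "mxGraphModel")
    root = ET.SubElement(model, "root")
    # draw.io expects two structural cells: the root (0) and the default layer (1).
    ET.SubElement(root, "mxCell", {"id": "0"})
    ET.SubElement(root, "mxCell", {"id": "1", "parent": "0"})

    for node_id, (label, x, y, w, h) in nodes.items():
        cell = ET.SubElement(root, "mxCell", {
            "id": node_id, "value": label, "vertex": "1", "parent": "1",
            "style": "rounded=1;whiteSpace=wrap;html=1;",
        })
        ET.SubElement(cell, "mxGeometry", {
            "x": str(x), "y": str(y), "width": str(w), "height": str(h),
            "as": "geometry",
        })

    for index, (source, target, dashed) in enumerate(edges):
        style = "endArrow=classic;html=1;" + ("dashed=1;" if dashed else "")
        cell = ET.SubElement(root, "mxCell", {
            "id": f"e{index}", "edge": "1", "parent": "1",
            "source": source, "target": target, "style": style,
        })
        ET.SubElement(cell, "mxGeometry", {"relative": "1", "as": "geometry"})

    return ET.tostring(mxfile, encoding="unicode")
```

Because each edge stores the ids of its source and target cells, dragging a box in the editor moves its arrows with it. That is the practical difference between a structurally reconstructed diagram and a traced outline.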
Key Features That Make Edit-Banana Insane
Let's dissect what separates Edit-Banana from every "image to vector" tool you've abandoned before.
Fine-Tuned SAM 3 Segmentation
Edit-Banana doesn't use off-the-shelf segmentation. The team fine-tuned SAM 3's mask decoder specifically for diagram elements. This matters because generic segmentation models trained on natural images fail catastrophically on technical diagrams—confusing borders for shapes, merging adjacent boxes, or fragmenting connected components. The fine-tuned model understands diagram grammar: boxes contain text, arrows connect nodes, groups have hierarchical boundaries.
Fixed Multi-Round VLM Scanning
Here's where it gets clever. After segmentation, a multimodal large language model performs structured extraction through fixed multi-round scanning. Instead of one-shot prediction that hallucinates or misses details, the system iteratively probes the image—verifying element relationships, confirming text content, and resolving ambiguities. This dramatically reduces error rates on complex diagrams with overlapping elements or unconventional layouts.
Dual-Engine Text Recognition
Text handling is where most conversion tools die. Edit-Banana deploys a sophisticated dual strategy:
- Local OCR via Tesseract: Fast, offline, privacy-preserving text localization and recognition. Supports multiple languages including Chinese (tesseract-ocr-chi-sim).
- Pix2Text for Mathematical Formulas: Specialized engine that recognizes mathematical notation and converts it to LaTeX. The Crop-Guided Strategy extracts high-resolution regions around formulas and sends only those crops to the formula engine, preserving accuracy without overwhelming the model with full-image context.
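The cropping step behind a crop-guided approach can be sketched in a few lines. This is an assumption about the general technique, not code from the repository: `padded_crop_box` and its `pad_ratio` parameter are hypothetical names, and the padding amount is arbitrary.

```python
def padded_crop_box(box, image_size, pad_ratio=0.1):
    """Expand a (left, top, right, bottom) box by pad_ratio, clipped to the image.

    Hypothetical helper: a little context around a formula helps recognition,
    but the crop must never extend past the image edges.
    """
    left, top, right, bottom = box
    width, height = image_size
    pad_x = int((right - left) * pad_ratio)
    pad_y = int((bottom - top) * pad_ratio)
    return (
        max(0, left - pad_x),
        max(0, top - pad_y),
        min(width, right + pad_x),
        min(height, bottom + pad_y),
    )
```

The returned tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` expects, so each detected formula region could be cut from the full-resolution image and passed to a formula recognizer on its own.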
Production-Grade User System
The web deployment at editbanana.net includes enterprise features:
- Credit-based access control: New users receive 10 free credits; pay-per-use prevents resource abuse
- Multi-user concurrency: Global Lock mechanism ensures thread-safe GPU access when multiple users submit simultaneously
- LRU Cache for embeddings: Image embeddings persist across requests, eliminating redundant SAM 3 inference and slashing latency for repeat conversions
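The lock-plus-cache pattern described above might look roughly like this. It is a generic sketch under stated assumptions, not the code in server_pa.py: `EmbeddingCache`, `gpu_lock`, `get_embedding`, and the `compute` callback are all hypothetical names, and a string stands in for a real embedding.

```python
import threading
from collections import OrderedDict


class EmbeddingCache:
    """Tiny LRU cache: the least recently used entry is evicted first."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry


gpu_lock = threading.Lock()  # one global lock guards the GPU and the cache
cache = EmbeddingCache(capacity=2)


def get_embedding(image_key, compute):
    """Return a cached embedding, computing it under the lock on a miss."""
    with gpu_lock:
        cached = cache.get(image_key)
        if cached is None:
            cached = compute(image_key)  # the expensive model call goes here
            cache.put(image_key, cached)
        return cached
```

Holding a single lock around both the cache lookup and the model call is the simplest way to stay thread-safe. The trade-off is that requests are serialized, which is usually acceptable when one GPU is the bottleneck anyway.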
Use Cases Where Edit-Banana Absolutely Dominates
1. Legacy Documentation Revival
Your company has 500 pages of Confluence docs with embedded PNG diagrams from a tool nobody licenses anymore. Updating a single arrow requires rebuilding the entire graphic. Edit-Banana converts these en masse to editable DrawIO format—suddenly your documentation is maintainable again.
2. Academic Paper Figure Extraction
Researchers constantly need to modify figures from prior work—adapt a methodology diagram, extend a model architecture, compare approaches side-by-side. Edit-Banana's LaTeX formula preservation means mathematical expressions remain editable, not flattened into unchangeable images.
3. Competitive Analysis & Benchmarking
Product teams screenshot competitor flows, architecture diagrams, or UI patterns. Instead of recreating them from scratch for internal analysis, Edit-Banana reconstructs the underlying structure, enabling rapid annotation, modification, and presentation without pasting in the original screenshot.
4. Automated Slide Deck Generation
Marketing receives design files as flattened PDFs. Export the relevant pages as images, run them through Edit-Banana, and each diagram comes back as an editable DrawIO file ready to drop into a slide. The color matching preservation ensures brand consistency while unlocking the ability to tweak messaging, update statistics, or localize content for different markets.
5. Human-in-the-Loop Refinement
Not every conversion is perfect. Edit-Banana's pipeline produces immediately editable output—and the web interface supports manual repair, element adjustment, and local saving. The GIF demonstrations show users cutting, modifying, and persisting corrections seamlessly.
Step-by-Step Installation & Setup Guide
Ready to run Edit-Banana locally? The setup has three phases. Follow precisely—GPU acceleration is strongly recommended for acceptable performance.
Phase 1: Environment & Base Setup
Prerequisites: Python 3.10+, CUDA-capable GPU, Linux/macOS (Windows possible with WSL).
Install PyTorch with CUDA support (example for CUDA 11.8):
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
Clone and initialize directories:
git clone https://github.com/BIT-DataLab/Edit-Banana.git
cd Edit-Banana
mkdir -p input output sam3_output
Phase 2: Models & Core Dependencies
Install base Python requirements:
pip install -r requirements.txt
SAM3 Library & BPE Vocabulary: Run the setup script to install the SAM3 library and copy the BPE tokenizer to models/:
bash scripts/setup_sam3.sh
Verify installation:
python -c "from sam3.model_builder import build_sam3_image_model; print('OK')"
Download SAM3 Weights: Obtain sam3.pt from ModelScope or Hugging Face and place under models/sam3_ms/.
Install Tesseract OCR (Ubuntu/Debian):
sudo apt install tesseract-ocr tesseract-ocr-chi-sim
Optional Enhancements:
# Alternative OCR engine (better mixed-language support)
pip install paddlepaddle==3.2.2 paddleocr
# Mathematical formula recognition
pip install pix2text onnxruntime-gpu
# Background removal capability
pip install onnxruntime modelscope
python scripts/setup_rmbg.py
Phase 3: Configuration
Copy and customize configuration:
cp config/config.yaml.example config/config.yaml
Edit config/config.yaml to set:
- sam3.checkpoint_path: path to models/sam3_ms/sam3.pt
- sam3.bpe_path: path to BPE vocab in models/
Troubleshooting Checklist:
- Config paths match actual file locations
- SAM3 weights and BPE vocab present in models/
- SAM3 library extracted via setup script
- OCR engine installed (Tesseract or PaddleOCR)
Common fix for GPU errors: set sam3.device: "cpu" in config if your GPU architecture is incompatible with compiled CUDA kernels.
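Most of the checklist can be automated with a small preflight script. The sketch below is not part of the repository: `preflight` is a hypothetical helper, and it takes the two paths as arguments rather than parsing config/config.yaml, so you pass in whatever values you set there.

```python
import shutil
from pathlib import Path


def preflight(checkpoint_path, bpe_path, ocr_binary="tesseract"):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if not Path(checkpoint_path).is_file():
        problems.append(f"SAM3 weights not found: {checkpoint_path}")
    if not Path(bpe_path).is_file():
        problems.append(f"BPE vocab not found: {bpe_path}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(ocr_binary) is None:
        problems.append(f"OCR binary not on PATH: {ocr_binary}")
    return problems
```

Running this before the first conversion surfaces the most common setup mistakes as plain messages instead of a stack trace from deep inside model loading.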
REAL Code Examples from the Repository
Let's examine actual implementation patterns from Edit-Banana's codebase, with detailed commentary on what each section accomplishes.
Example 1: Basic CLI Conversion
The simplest entry point processes a single image through the full pipeline:
python main.py -i input/test_diagram.png
This command triggers the complete workflow: SAM 3 segmentation → text extraction → XML generation. Output lands in output/<image_stem>/ containing the DrawIO XML plus intermediate processing artifacts. For batch operations, omit the -i flag—every image in input/ gets processed sequentially.
What's happening under the hood? The main.py entry point orchestrates modular components defined in the project structure. It loads configuration from config/config.yaml, initializes the SAM 3 model with your specified weights, runs the segmentation pipeline, spawns parallel OCR processes, and merges spatial and textual data into valid DrawIO XML.
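If you want to convert only some of the images in input/ rather than all of them, a thin wrapper around the documented `-i` flag is one option. This is a sketch, not project code: `build_commands`, `run_all`, and the `name_filter` parameter are hypothetical, and it assumes main.py exits with a non-zero status when a conversion fails.

```python
import subprocess
import sys
from pathlib import Path

# Suffixes taken from the formats this article lists as supported.
IMAGE_SUFFIXES = {".png", ".jpg", ".bmp", ".tiff", ".webp"}


def build_commands(input_dir, name_filter=""):
    """Build one `python main.py -i <image>` command per matching image."""
    paths = sorted(
        path for path in Path(input_dir).iterdir()
        if path.suffix.lower() in IMAGE_SUFFIXES and name_filter in path.name
    )
    return [[sys.executable, "main.py", "-i", str(path)] for path in paths]


def run_all(commands):
    """Run each command in turn and return the images whose run failed."""
    failed = []
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:  # assumption: non-zero exit means failure
            failed.append(command[-1])
    return failed
```

Run it from the repository root, for example `run_all(build_commands("input", name_filter="arch"))`, so that the relative main.py path resolves.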
Example 2: Complete Local Development Setup
The README provides this comprehensive initialization sequence:
# Clone repository and enter directory
git clone https://github.com/BIT-DataLab/Edit-Banana.git && cd Edit-Banana
# Create isolated Python environment
python3 -m venv .venv && source .venv/bin/activate
# Windows users: .venv\Scripts\activate
# Install PyTorch with CUDA 11.8 (adjust for your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# Install all Python dependencies
pip install -r requirements.txt
# Install system OCR engine with Chinese support
sudo apt install tesseract-ocr tesseract-ocr-chi-sim
After model setup and configuration:
# Create required directories
mkdir -p input output
# Copy and edit configuration
cp config/config.yaml.example config/config.yaml
# Edit config/config.yaml: set sam3.checkpoint_path and sam3.bpe_path
Critical insight: The virtual environment isolation prevents dependency conflicts with system Python packages. The explicit CUDA index URL pins a PyTorch build that matches your CUDA version; a CPU-only or mismatched build would make SAM 3 inference unbearably slow.
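A quick way to confirm which PyTorch build actually landed in the environment is to ask it directly. The `torch_backend` helper below is a hypothetical convenience wrapper; the calls it makes (`torch.cuda.is_available()` and `torch.version.cuda`) are standard PyTorch.

```python
import importlib.util


def torch_backend():
    """Report the backend the installed PyTorch would use, if it is installed."""
    if importlib.util.find_spec("torch") is None:
        return "missing"
    import torch  # imported lazily so the check works without PyTorch too

    if torch.cuda.is_available():
        return f"cuda ({torch.version.cuda})"
    return "cpu"


print(torch_backend())
```

If this prints "cpu" on a machine with an NVIDIA GPU, the usual culprits are a CPU-only wheel or a driver that does not support the CUDA version the wheel was built for.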
Example 3: Testing the Web API
Edit-Banana exposes a FastAPI backend for service-oriented deployments:
# Terminal 1: Start the server
python server_pa.py
# Terminal 2: Test with curl
curl -X POST http://localhost:8000/convert -F "file=@input/test.png"
Alternatively, open http://localhost:8000/docs for interactive Swagger documentation—upload files directly through the browser interface.
Architecture note: server_pa.py implements the production user system with credit tracking, concurrent request handling via Global Locks, and LRU caching of image embeddings. This isn't a toy server—it's designed for real multi-user workloads where GPU contention would otherwise cause failures.
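The same upload can be scripted from Python without extra dependencies. This sketch mirrors the curl call above (a multipart POST with a `file` field to /convert). `build_multipart` and `convert` are hypothetical helpers, and because the response format is not documented here, the function simply returns the raw bytes.

```python
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field, filename, payload, content_type="application/octet-stream"):
    """Encode one file as multipart/form-data; returns (body, content_type_header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def convert(image_path, url="http://localhost:8000/convert"):
    """POST an image to the local server and return the raw response body."""
    path = Path(image_path)
    body, header = build_multipart("file", path.name, path.read_bytes())
    request = urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": header}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the server running, `convert("input/test.png")` sends the same request as the curl example. Check the Swagger page at /docs for the actual response schema before parsing the result.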
Example 4: Project Structure Navigation
Understanding the codebase organization enables customization:
Edit-Banana/
├── config/ # Configuration files
├── flowchart_text/ # OCR & Text Extraction Module
│ ├── src/
│ └── main.py # OCR-only entry point
├── input/ # [Manual] Input images
├── models/ # [Manual] Model weights
├── output/ # [Manual] Results
├── sam3/ # SAM3 library (from facebookresearch)
├── sam3_service/ # SAM3 HTTP service (multi-process)
├── scripts/ # Setup utilities
│ ├── setup_sam3.sh
│ ├── setup_rmbg.py
│ └── merge_xml.py
├── main.py # CLI entry (full pipeline)
├── server_pa.py # FastAPI backend
└── requirements.txt
Key design decisions revealed: The separation of flowchart_text/ as a standalone module means you can run OCR extraction independently—useful for debugging text issues without re-running expensive segmentation. The sam3_service/ HTTP wrapper enables multi-process deployments where SAM 3 runs in dedicated workers, preventing GIL contention in Python's async server.
Advanced Usage & Best Practices
Configuration Tuning
The config/config.yaml exposes critical parameters for optimization:
- SAM 3 thresholds: Adjust score_threshold and nms_threshold based on your diagram density. Dense technical schematics usually need a retuned NMS threshold so that tightly packed elements are neither suppressed as duplicates nor detected twice.
- Max iteration loops: Control segmentation refinement depth, trading accuracy for speed.
- Dominant color sensitivity: Fine-tune for diagrams with subtle color variations versus high-contrast designs.
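To build intuition for how the two thresholds interact, here is the textbook version of score filtering followed by greedy non-maximum suppression on bounding boxes. It is a generic illustration and an assumption about what these settings control: Edit-Banana may apply them differently (for example on masks rather than boxes), and `iou` and `filter_detections` are hypothetical names.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, score_threshold, nms_threshold):
    """Drop low-score detections, then greedily suppress heavy overlaps.

    detections: list of (box, score) pairs.
    A detection is kept only if its IoU with every already-kept,
    higher-scoring detection is at most nms_threshold.
    """
    candidates = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        if all(iou(box, kept_box) <= nms_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

In this formulation a lower nms_threshold suppresses more aggressively, which removes duplicates but can also swallow legitimately adjacent boxes, while a higher one keeps more overlapping detections. Try both directions on a representative diagram before settling on a value.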
Performance Optimization
- Enable LRU caching: In production, cached embeddings reduce repeat conversion latency by 60-80% when the same image is submitted again.
- Use PaddleOCR for mixed scripts: When diagrams contain both English and Chinese, PaddleOCR's multilingual model outperforms Tesseract significantly.
- GPU memory management: SAM 3 is memory-hungry. For batch processing, monitor nvidia-smi and reduce batch size if you encounter OOM errors.
Integration Patterns
- CI/CD documentation pipelines: Hook main.py into your docs build process to automatically convert designer-delivered PNGs to editable assets.
- Webhook architecture: Deploy server_pa.py behind nginx with rate limiting, integrating with your organization's authentication system.
Comparison with Alternatives
| Feature | Edit-Banana | Adobe Illustrator Image Trace | Potrace | Online OCR Tools |
|---|---|---|---|---|
| Semantic understanding | ✅ SAM 3 + VLM reasoning | ❌ Edge tracing only | ❌ Path approximation | ❌ Text only |
| Formula recognition | ✅ LaTeX via Pix2Text | ❌ Flattened | ❌ Unsupported | ⚠️ Limited |
| Editable output format | ✅ Native DrawIO XML | ⚠️ AI format | ❌ SVG (no text edit) | ❌ DOCX/PDF |
| Self-hostable | ✅ Fully open source | ❌ Proprietary | ✅ Open source | ❌ SaaS only |
| Arrow/style preservation | ✅ 1:1 restoration | ⚠️ Simplified | ❌ Lost | ❌ Ignored |
| GPU acceleration | ✅ CUDA optimized | ✅ Yes | ❌ CPU only | ❌ N/A |
| Cost | Free (Apache 2.0) | $20-55/month | Free | Per-page fees |
The verdict: Edit-Banana occupies a unique position—it's the only open-source tool combining deep learning segmentation, multimodal reasoning, mathematical formula extraction, and native diagram editor output. Illustrator produces vectors without structure. Potrace creates paths without semantics. Online OCR captures text without layout. Edit-Banana reconstructs meaning.
FAQ
Q: Is Edit-Banana completely free to use?
A: Yes! The core framework is Apache 2.0 licensed, permitting commercial use and modification. The web service at editbanana.net offers free credits with paid tiers for heavy usage.
Q: What image formats does Edit-Banana support?
A: PNG, JPG, BMP, TIFF, and WebP. Output is standard DrawIO XML compatible with diagrams.net and draw.io desktop.
Q: Can I run Edit-Banana without a GPU?
A: Technically yes—set sam3.device: "cpu" in config. However, SAM 3 inference becomes extremely slow. A CUDA-capable GPU is strongly recommended for practical use.
Q: How accurate is the formula recognition?
A: The Crop-Guided Strategy with Pix2Text achieves high accuracy on standard mathematical notation. Complex handwritten formulas or highly stylized notation may require manual correction.
Q: Is my data private when using the web service?
A: The local deployment keeps all processing on your infrastructure. For the web service, review BIT-DataLab's current privacy policy. When in doubt, self-host.
Q: Can I contribute to the project?
A: Absolutely! The repository welcomes issues, discussions, and pull requests. Check the contribution guidelines for branch naming conventions.
Q: What's on the development roadmap?
A: Intelligent arrow connection (in development), DrawIO template adaptation, batch export optimization, and local VLM deployment for fully offline operation.
Conclusion
Edit-Banana isn't merely a convenience tool—it's a paradigm shift in how we treat visual knowledge. Static images have trapped technical content for decades, creating friction that slows teams, buries institutional knowledge, and makes documentation decay inevitable. By combining SAM 3's pixel-perfect segmentation with multimodal AI reasoning, Edit-Banana performs something previously impossible: genuine semantic reconstruction that preserves not just appearance but editability.
The framework is production-ready today, with a clear roadmap toward even greater autonomy. Whether you're reviving legacy documentation, accelerating research workflows, or building automated content pipelines, Edit-Banana delivers capabilities that previously required expensive proprietary tools—or simply didn't exist.
My assessment? This is the most significant open-source release in document intelligence this year. The engineering is thoughtful, the use cases are genuine, and the Apache 2.0 license removes adoption barriers entirely.
Stop redrawing. Start converting. ⭐ Star the repository, clone it locally, or test instantly at https://www.editbanana.net/. Your future self—the one not manually tracing arrows at 2 AM—will thank you.
Get the code: https://github.com/BIT-DataLab/Edit-Banana