Stop Manual Pentesting! Raptor Turns Claude Code Into an Autonomous Hacking Machine
What if your security tools could think like an attacker—and keep getting smarter with every scan?
Picture this: It's 2 AM. Your team just spent 14 hours manually reviewing code for a critical release. The Semgrep report is 400 findings deep. CodeQL choked on false positives. And somewhere in that noise, a real remote code execution vulnerability is hiding, laughing at you. You've been there. We've all been there. The brutal truth? Traditional security scanning is broken. Static analysis tools vomit alerts. Human reviewers burn out. Exploits get written faster than patches. And your adversaries? They're not sleeping—they're automating.
But what if you could flip the script? What if you had an agent that doesn't just find vulnerabilities, but understands them, validates them, generates proof-of-concept exploits, and writes secure patches—all while you sleep?
Enter Raptor, the autonomous offensive/defensive security research framework that transforms Claude Code into a relentless, adversarial-thinking AI agent. Built by legendary security researchers including Gadi Evron, Daniel Cuthbert, and Thomas Dullien (Halvar Flake), Raptor isn't another scanner. It's a complete security operations brain that chains static analysis, binary analysis, LLM-powered validation, exploit generation, and patch writing into one devastating workflow.
Stop drowning in alert fatigue. Start hunting like the predator you were meant to be.
What is Raptor?
Raptor (Recursive Autonomous Penetration Testing and Observation Robot) is an open-source autonomous security research framework built on top of Claude Code—though crucially, it's not tied to it. You can plug in your own analysis layer if you prefer. The project lives at https://github.com/gadievron/raptor and represents a fundamental shift in how security research gets done.
The brain trust behind Raptor reads like a who's-who of cybersecurity: Gadi Evron (pioneer in cyber warfare research), Daniel Cuthbert (OWASP leader, AppSec veteran), Thomas Dullien (better known as Halvar Flake, legendary reverse engineer and Google Project Zero alum), Michael Bargury (cloud security researcher), and John Cartwright. These aren't academics theorizing from ivory towers—they're practitioners who built Raptor in their free time, "held together with enthusiasm and duct tape," because they needed a tool that actually works.
And here's what makes Raptor genuinely different: it thinks in attack chains, not isolated findings. Traditional tools find a SQL injection and stop. Raptor maps your attack surface, traces data flows, validates exploitability through a rigorous multi-stage pipeline, generates working exploits, writes patches, and then performs cross-finding analysis to uncover shared root causes and combined attack scenarios. It's the difference between hiring a script kiddie with a scanner and deploying a senior penetration testing team that never sleeps.
Raptor is trending now because it arrives at a critical inflection point. LLMs have matured enough to reason about code semantics. Claude Code provides the interactive agentic backbone. And the security industry is desperate for automation that doesn't drown practitioners in false positives. The MIT license means you can deploy it anywhere—even commercially (with the noted exception of CodeQL's own licensing restrictions).
Key Features That Make Raptor Insane
Raptor's feature set reads like a wishlist every security team has whispered into the void. Let's break down what actually matters:
Autonomous Agentic Workflow (/agentic) — The crown jewel. This isn't a script; it's a full decision-making pipeline. Raptor scans with Semgrep and CodeQL, deduplicates findings using deterministic correlation, validates each through exploitation stages, generates PoCs for confirmed vulnerabilities, writes patches, and performs cross-finding analysis. All autonomous. All traceable. All exportable.
Multi-Stage Exploitability Validation (Stages 0-F) — This is where Raptor separates signal from noise. Most tools flag a pattern and call it a day. Raptor runs a brutal gauntlet: Is this actually a vulnerability or pattern-matching noise? What does an attacker need to reach it? Does the code path exist and is externally reachable? Is this test code with unrealistic preconditions? Only findings that survive this interrogation get promoted to exploit generation.
Z3 SMT Integration — Raptor embeds a two-layer Z3 constraint solver integration. For CodeQL paths, it checks satisfiability before burning LLM tokens—provably unreachable paths get dropped instantly. For reachable paths, Z3 generates concrete candidate inputs that feed into analysis prompts. In binary exploitation, Z3 ranks one-gadgets by actual reachability against crash states, not heuristics. This is PhD-level automation running on every finding.
Multi-Model Architecture with Budget Control — Raptor decouples orchestration from analysis. Claude Code handles decisions; any supported LLM handles analysis dispatch. Configure multiple models for consensus voting, assign different models to different roles (analysis, code generation, consensus, aggregation, fallback), and cap spending with RAPTOR_MAX_COST. The fast-tier short-circuit uses cheap models to filter confident false positives, with Wilson 95% confidence bounds ensuring safety.
Offline and Air-Gapped Operation — Semgrep runs fully offline with cached registry packs (p/security-audit, p/owasp-top-ten, p/secrets, p/command-injection, p/jwt, p/default, p/xss). CodeQL needs network only for initial setup. This isn't an afterthought—it's designed for sensitive environments from day one.
Project-Based Organization — Named workspaces with merged findings, coverage tracking, run diffs, and exportable reports. No more timestamped directories scattered across your filesystem. Track your security posture evolution over time.
Nine Expert Personas — Load Mark Dowd for binary exploitation, Charlie Miller/Halvar Flake for reverse engineering, or the Patch Engineer for secure fix generation. These aren't cosmetic skins—they're structured reasoning frameworks that change how Raptor approaches problems.
Use Cases: Where Raptor Absolutely Dominates
1. Enterprise Codebase Security Assessment
Your organization has 2 million lines of legacy C++ and modern Python microservices. Traditional DAST/SAST tools generate 10,000 findings. Your team of four can triage maybe 200 per sprint. Raptor creates a project, maps the attack surface with /understand --map, runs /agentic overnight, and delivers a validated, prioritized report with working exploits and patches for confirmed vulnerabilities. The cross-finding analysis reveals that 12 apparently separate SQL injections actually share a single vulnerable ORM wrapper—one patch fixes them all.
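The shared-root-cause idea in this scenario can be sketched as a simple grouping step. This is a minimal illustration, not Raptor's cross-finding analysis: the finding fields (id, sink) and the sample data are hypothetical, and it assumes each finding already records the function where tainted data ends up.

```python
from collections import defaultdict


def group_by_root_cause(findings):
    """Group finding ids by shared sink; keep only sinks hit more than once."""
    groups = defaultdict(list)
    for finding in findings:
        groups[finding["sink"]].append(finding["id"])
    return {sink: ids for sink, ids in groups.items() if len(ids) > 1}


# Hypothetical findings: three endpoints funnel into the same ORM wrapper.
FINDINGS = [
    {"id": "SQLI-1", "file": "api/users.py", "sink": "orm.raw_query"},
    {"id": "SQLI-2", "file": "api/orders.py", "sink": "orm.raw_query"},
    {"id": "SQLI-3", "file": "api/search.py", "sink": "orm.raw_query"},
    {"id": "XSS-1", "file": "web/profile.py", "sink": "render_html"},
]

shared = group_by_root_cause(FINDINGS)
for sink, ids in shared.items():
    print(f"{sink}: one patch may cover {', '.join(ids)}")
```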
2. Binary Vulnerability Research and Exploitation
You're analyzing a closed-source network appliance firmware. Raptor's /fuzz command spins up AFL++ with intelligent corpus design. A crash triggers /crash-analysis, which performs autonomous root-cause analysis. Z3 constraint checking ranks one-gadgets by actual feasibility against the crash state. The Binary Exploitation Specialist persona guides exploit development. What took a skilled researcher two weeks now compresses to hours.
3. Open Source Supply Chain Forensics
A popular npm package just had a suspicious commit. Was it a legitimate fix or a supply chain attack? /oss-forensics investigates using GitHub API data, GH Archive immutable history via BigQuery, Wayback Machine snapshots, and local git forensics. The structured pipeline produces evidence-backed findings suitable for incident response or responsible disclosure—not speculation.
4. CI/CD Security Pipeline Integration
Raptor's Python execution layer runs headless: python3 raptor.py scan --repo /code produces structured SARIF output without Claude Code. Integrate it into GitHub Actions, GitLab CI, or Jenkins. Deterministic correlation and multi-model consensus cut false-positive noise in pull request comments, and fast-tier short-circuiting keeps costs predictable at scale.
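A CI job typically needs a small gate that reads the SARIF output and decides whether to fail the build. The sketch below relies only on the standard SARIF 2.1.0 layout (runs, results, level, locations). The blocking_findings helper and the sample document are illustrative, and the article does not say where Raptor writes its SARIF file, so the real path is left for you to supply.

```python
import json


def blocking_findings(sarif, fail_levels=("error",)):
    """Collect results whose SARIF level should fail the build."""
    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # SARIF treats a missing level as "warning".
            level = result.get("level", "warning")
            if level not in fail_levels:
                continue
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            findings.append({
                "rule": result.get("ruleId", "unknown"),
                "file": physical.get("artifactLocation", {}).get("uri", "?"),
                "line": physical.get("region", {}).get("startLine"),
            })
    return findings


# Tiny hand-written SARIF document standing in for real scan output.
SAMPLE = json.loads("""
{
  "version": "2.1.0",
  "runs": [{
    "tool": {"driver": {"name": "example-scanner"}},
    "results": [
      {"ruleId": "sql-injection", "level": "error",
       "message": {"text": "tainted query"},
       "locations": [{"physicalLocation": {
         "artifactLocation": {"uri": "app/db.py"},
         "region": {"startLine": 42}}}]},
      {"ruleId": "weak-hash", "message": {"text": "md5 in use"}}
    ]
  }]
}
""")

blocking = blocking_findings(SAMPLE)
for item in blocking:
    print(f"{item['rule']} at {item['file']}:{item['line']}")
```

In a pipeline you would load the real file with json.load and exit non-zero when the returned list is non-empty.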
5. Security Research and Bug Bounty Automation
For bug bounty hunters, Raptor is a force multiplier. Create a project per target, run /understand to map the attack surface, then /agentic for continuous monitoring. The validation pipeline ensures you only submit verified, exploitable findings—not embarrassing false positives that burn reputation. The exploit generation produces PoCs ready for responsible disclosure.
Step-by-Step Installation & Setup Guide
Ready to deploy your autonomous security agent? You have two paths: manual installation for customization, or Devcontainer for instant gratification.
Option 1: Manual Installation
# Clone the repository
git clone https://github.com/gadievron/raptor.git
cd raptor
# Install Python dependencies
pip install -r requirements.txt
# Install Claude Code (required for orchestration layer)
npm install -g @anthropic-ai/claude-code
# Install Semgrep (required for static analysis scanning)
pip install semgrep
# Optional: Install Z3 SMT solver for enhanced constraint analysis
pip install z3-solver
# Launch Claude Code with Raptor context
claude
Critical configuration step: Set up your analysis LLM. Create ~/.config/raptor/models.json:
{
"models": [
{
"provider": "anthropic",
"model": "claude-opus-4-6",
"api_key": "sk-ant-...",
"role": "analysis"
}
]
}
Or use environment variables for zero-config startup:
export ANTHROPIC_API_KEY=sk-ant-... # Anthropic Claude
export OPENAI_API_KEY=sk-... # OpenAI GPT models
export GEMINI_API_KEY=... # Google Gemini
export MISTRAL_API_KEY=... # Mistral AI
export OLLAMA_HOST=http://localhost:11434 # Local Ollama instance
Budget protection:
export RAPTOR_MAX_COST=5.00 # Hard cap at $5 per run
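To make the idea of a hard per-run cap concrete, here is a minimal sketch of how a spending guard could behave. Only the RAPTOR_MAX_COST variable name comes from the text above. The CostTracker class, the BudgetExceeded exception, and the per-call amounts are hypothetical and do not describe Raptor's actual accounting.

```python
import os


class BudgetExceeded(RuntimeError):
    """Raised when the next charge would push spend past the cap."""


class CostTracker:
    """Hypothetical per-run spending guard, for illustration only."""

    def __init__(self, max_cost=None):
        if max_cost is None:
            raw = os.environ.get("RAPTOR_MAX_COST")
            # No variable set means no cap in this sketch.
            max_cost = float(raw) if raw else float("inf")
        self.max_cost = max_cost
        self.spent = 0.0

    def charge(self, amount):
        """Record a cost, refusing any charge that would exceed the cap."""
        if self.spent + amount > self.max_cost:
            raise BudgetExceeded(
                f"charge of ${amount:.2f} would exceed cap ${self.max_cost:.2f}"
            )
        self.spent += amount


tracker = CostTracker(max_cost=5.00)
tracker.charge(3.00)
tracker.charge(1.50)
try:
    tracker.charge(1.00)  # 4.50 + 1.00 is over the 5.00 cap
except BudgetExceeded as exc:
    print(f"stopped: {exc}")
print(f"spent so far: ${tracker.spent:.2f}")
```

Checking before the call, not after, is the design choice that makes the cap hard: the run stops at the last affordable request.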
Option 2: Devcontainer (Recommended)
Everything pre-installed in a reproducible environment. The image is substantial (~6 GB) because it includes Python 3.12, static analysis tools, fuzzing infrastructure, and browser automation.
Via VS Code: Open the folder and select Dev Containers: Open Folder in Container.
Manual Docker build:
# Build the container image
docker build -f .devcontainer/Dockerfile -t raptor:latest .
# Run with privileged mode (required for rr deterministic debugger)
docker run --privileged -it raptor:latest
The --privileged flag is non-negotiable for binary analysis workflows using the rr debugger. Once inside, simply type claude or start with a command.
First-Time Project Setup
# Create a project to organize all findings
/project create myapp --target /path/to/your/code -d "Production web application"
# Activate the project
/project use myapp
# Map attack surface before scanning
/understand --map
# Run the full autonomous workflow
/agentic
# Review consolidated findings
/project findings --detailed
REAL Code Examples from Raptor
Let's examine actual implementation patterns from the Raptor repository. These aren't toy examples—they're production workflows you can run today.
Example 1: Creating and Managing Projects
The project system is Raptor's organizational backbone. Here's how you establish persistent workspaces:
# Create a named project with target code and description
/project create myapp --target /path/to/code -d "Short description"
# Set as active project for all subsequent operations
/project use myapp
# Execute security workflows within project context
/scan
/understand --map
/validate
# Review project status and history
/project status # All runs with pass/fail and timestamps
/project findings # Merged findings across all runs
/project findings --detailed # Per-finding technical detail
/project coverage --detailed # Which files received analysis
/project diff myapp run1 run2 # Compare two historical runs
/project report # Full merged report generation
/project clean --keep 3 # Retention management: keep last 3 runs
/project export myapp /tmp/myapp.zip # Portable export
/project none # Clear active project context
Why this matters: Without projects, each run generates isolated timestamped directories under out/. With projects, you get merged findings (deduplicated across runs), coverage tracking (preventing analysis gaps), and temporal diffs (seeing what changed between versions). This transforms Raptor from a point-in-time scanner into a continuous security monitoring platform.
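The temporal diff described above boils down to comparing stable fingerprints of findings between two runs. The following sketch shows that idea under stated assumptions: the fingerprint fields (rule, file, sink) and the sample runs are hypothetical, and Raptor's own deduplication key is not documented in this article.

```python
def fingerprint(finding):
    """Stable identity for a finding that survives line-number drift."""
    return (finding["rule"], finding["file"], finding["sink"])


def diff_runs(old_run, new_run):
    """Classify findings as fixed, introduced, or persisting between runs."""
    old = {fingerprint(f) for f in old_run}
    new = {fingerprint(f) for f in new_run}
    return {
        "fixed": sorted(old - new),
        "introduced": sorted(new - old),
        "persisting": sorted(old & new),
    }


# Hypothetical findings from two runs of the same project.
RUN1 = [
    {"rule": "sqli", "file": "api/users.py", "sink": "orm.raw_query"},
    {"rule": "cmdi", "file": "tools/backup.py", "sink": "os.system"},
]
RUN2 = [
    {"rule": "cmdi", "file": "tools/backup.py", "sink": "os.system"},
    {"rule": "ssrf", "file": "api/fetch.py", "sink": "requests.get"},
]

changes = diff_runs(RUN1, RUN2)
for kind, items in changes.items():
    print(kind, items)
```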
Example 2: Multi-Model Configuration for Consensus Analysis
Raptor's model configuration enables sophisticated analysis strategies. Here's the exact JSON structure:
{
"models": [
{
"provider": "anthropic",
"model": "claude-opus-4-6",
"api_key": "sk-ant-...",
"role": "analysis"
},
{
"provider": "openai",
"model": "gpt-5.4",
"api_key": "sk-...",
"role": "analysis"
},
{
"provider": "anthropic",
"model": "claude-sonnet-4-6",
"api_key": "sk-ant-...",
"role": "aggregate"
}
]
}
Role breakdown:
analysis: Validates each finding through Stages A-D (is it real? reachable? exploitable?)
code: Writes exploit PoCs and security patches
consensus: Second-opinion voting on borderline true positives
aggregate: Optional LLM-written narrative synthesis beyond deterministic correlation
fallback: Emergency substitution when primary models fail or rate-limit
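Before a long run it can help to sanity-check the configuration against the five role names listed above. The sketch below parses the same JSON shape shown in this example and groups models by role. The models_by_role helper is hypothetical and not part of Raptor, and the API keys are omitted from the sample on purpose.

```python
import json

KNOWN_ROLES = {"analysis", "code", "consensus", "aggregate", "fallback"}


def models_by_role(config_text):
    """Group 'provider/model' names by role, rejecting unknown roles."""
    config = json.loads(config_text)
    grouped = {}
    for entry in config["models"]:
        role = entry.get("role")
        if role not in KNOWN_ROLES:
            raise ValueError(f"unknown role {role!r} for {entry.get('model')}")
        name = f"{entry['provider']}/{entry['model']}"
        grouped.setdefault(role, []).append(name)
    return grouped


# Same structure as the models.json example, with keys left out.
CONFIG = """
{
  "models": [
    {"provider": "anthropic", "model": "claude-opus-4-6", "role": "analysis"},
    {"provider": "openai", "model": "gpt-5.4", "role": "analysis"},
    {"provider": "anthropic", "model": "claude-sonnet-4-6", "role": "aggregate"}
  ]
}
"""

roles = models_by_role(CONFIG)
print(roles)
if len(roles.get("analysis", [])) > 1:
    print("multiple analysis models configured")
```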
The deterministic correlation magic: When you configure multiple analysis models, Raptor automatically runs consensus voting. Only findings that multiple independent models agree on get promoted. This isn't majority voting—it's a structured correlation that dramatically reduces false positives without human intervention.
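As a rough mental model for promotion by agreement, consider the simplest possible rule: a finding advances only when at least a set number of distinct models call it real. The sketch below implements that threshold rule and nothing more. The promote function, the verdict tuples, and the min_agree parameter are hypothetical, and the structured correlation Raptor actually applies is not spelled out in this article.

```python
from collections import defaultdict


def promote(verdicts, min_agree=2):
    """Return finding ids that at least `min_agree` distinct models confirmed.

    `verdicts` is an iterable of (model, finding_id, is_real) tuples.
    """
    confirmations = defaultdict(set)
    for model, finding_id, is_real in verdicts:
        if is_real:
            confirmations[finding_id].add(model)
    return sorted(
        fid for fid, models in confirmations.items() if len(models) >= min_agree
    )


# Hypothetical verdicts from two analysis models.
VERDICTS = [
    ("claude-opus-4-6", "F-1", True),
    ("gpt-5.4", "F-1", True),
    ("claude-opus-4-6", "F-2", True),
    ("gpt-5.4", "F-2", False),
    ("claude-opus-4-6", "F-3", False),
    ("gpt-5.4", "F-3", False),
]

print(promote(VERDICTS))
```

Counting distinct models in a set means a model that reports the same finding twice cannot confirm it on its own.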
Command-line multi-model invocation:
python3 raptor.py agentic --repo /code \
--model claude-opus-4-6 \
--model gpt-5.4 \
--aggregate claude-sonnet-4-6
Example 3: The Complete Agentic Pipeline
This is Raptor's killer feature—the full autonomous workflow:
# Step 1: Create and activate project context
/project create myapp --target /path/to/code
/project use myapp
# Step 2: Build attack surface intelligence
/understand --map
# Maps entry points, trust boundaries, and sinks BEFORE scanning
# This prevents wasted compute on unreachable code paths
# Step 3: Execute full autonomous workflow
/agentic
# Internally executes:
# - Semgrep static analysis with cached offline rules
# - CodeQL deep analysis with Z3 dataflow pre-screening
# - Deduplication via deterministic multi-model correlation
# - Stage A validation: Is pattern actually vulnerable?
# - Stage B validation: Attacker prerequisites and obstacles
# - Stage C validation: Code path existence and external reachability
# - Stage D validation: Final triage (test code? unrealistic preconditions?)
# - Exploit PoC generation for confirmed findings
# - Secure patch generation for confirmed findings
# - Cross-finding analysis for shared root causes and attack chains
# Step 4: Review consolidated intelligence
/project findings
# All validated findings with exploits, patches, and attack chain analysis
The validation pipeline explained: Stages A-D aren't arbitrary—each eliminates a specific class of false positive that plagues traditional scanners. Stage A catches pattern-matching hallucinations. Stage B eliminates theoretical vulnerabilities requiring impossible attacker capabilities. Stage C removes dead code paths. Stage D filters test fixtures and contrived examples. Only findings surviving all four stages consume exploit-generation tokens.
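The staged gauntlet can be pictured as an ordered list of checks where the first failure stops the finding. In the sketch below the checks are plain boolean fields so the control flow is easy to follow. In Raptor these judgments come from LLM analysis, and the Finding fields, STAGES table, and validate function here are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    ident: str
    is_real_vulnerability: bool      # Stage A: not just a pattern match
    prerequisites_achievable: bool   # Stage B: attacker can meet preconditions
    externally_reachable: bool       # Stage C: code path exists and is exposed
    in_test_code: bool               # Stage D: test fixture or contrived case


# Ordered checks; each returns True when the finding survives that stage.
STAGES = [
    ("A", lambda f: f.is_real_vulnerability),
    ("B", lambda f: f.prerequisites_achievable),
    ("C", lambda f: f.externally_reachable),
    ("D", lambda f: not f.in_test_code),
]


def validate(finding):
    """Return the stage that rejected the finding, or None if it survived."""
    for name, check in STAGES:
        if not check(finding):
            return name
    return None


candidates = [
    Finding("F-1", True, True, True, False),
    Finding("F-2", True, False, True, False),
    Finding("F-3", True, True, True, True),
]
for finding in candidates:
    rejected_at = validate(finding)
    status = "promote to exploit generation" if rejected_at is None else f"dropped at Stage {rejected_at}"
    print(finding.ident, status)
```

Stopping at the first failed stage is what saves tokens: later, more expensive checks never run for a finding that an earlier stage already rejected.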
Example 4: Fast-Tier Short-Circuit with Scorecard Inspection
Raptor's cost optimization uses statistical confidence, not naive thresholds:
# The fast-tier automatically activates when cheap same-provider siblings exist:
# Anthropic Opus → Haiku, OpenAI 5.x → 4o-mini, Gemini Pro → Flash-Lite, Mistral Large → Small
# Cheap model pre-filters confident false positives only
# Ambiguous cases and confident true positives always run full analysis
# Inspect model performance history
/scorecard
# Or directly: libexec/raptor-llm-scorecard list
# Scorecard data persists at out/llm_scorecard.json
# Global across projects—lessons accumulate automatically
The Wilson bound safety mechanism: Raptor doesn't short-circuit based on raw accuracy. It computes the Wilson 95% upper confidence bound on miss-rate per (model, decision_class) cell. Short-circuiting only activates when this bound falls at or below 5%. This means the system conservatively errs toward full analysis until statistically confident—preventing premature optimization from missing real vulnerabilities.
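The Wilson upper bound itself is a short formula, and seeing it in code shows why the rule is conservative. This sketch assumes the two-sided 95% interval (z = 1.96), which the text does not specify. The function names are illustrative, and the per-cell counts would come from the scorecard in a real deployment.

```python
import math


def wilson_upper_bound(misses, trials, z=1.96):
    """Upper end of the Wilson score interval for a miss rate."""
    if trials == 0:
        return 1.0  # no evidence yet: assume the worst
    p_hat = misses / trials
    z2 = z * z
    centre = p_hat + z2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials))
    return (centre + margin) / (1 + z2 / trials)


def may_short_circuit(misses, trials, max_miss_rate=0.05):
    """Allow the cheap model to short-circuit only when the bound is low enough."""
    return wilson_upper_bound(misses, trials) <= max_miss_rate


for misses, trials in [(0, 20), (0, 73), (1, 100), (2, 400)]:
    bound = wilson_upper_bound(misses, trials)
    print(f"{misses}/{trials}: upper bound {bound:.4f}, short-circuit {may_short_circuit(misses, trials)}")
```

Under this assumption, a cheap model with a perfect record still needs 73 scored decisions in a cell before the bound drops to 5%, and a single miss in 100 decisions keeps the cell on full analysis.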
Advanced Usage & Best Practices
Master the Personas: Don't just use default mode. When analyzing a heap overflow, explicitly invoke "Use the Binary Exploitation Specialist." When CodeQL paths seem ambiguous, "Use the CodeQL Dataflow Analyst." These aren't gimmicks—they're structured reasoning templates that dramatically improve output quality.
Z3 is Worth the Install: The constraint solver integration isn't optional fluff. On large codebases, Z3 pre-screening eliminates 30-50% of CodeQL paths before any LLM call. At frontier model pricing, this pays for itself on the first significant codebase.
Project Hygiene: Always create projects for persistent targets. Use /project diff before releases to catch regressions. Run /project clean --keep 5 on a regular schedule to manage disk usage. Export projects before major architecture changes for historical comparison.
Model Strategy: For analysis, configure two different providers (Anthropic + OpenAI) so the consensus votes are truly independent. For code generation, stick to frontier models, because local models served through Ollama tend to produce unreliable exploit code. Use the aggregate role only when you need human-readable narrative summaries for reporting.
Air-Gapped Deployment: Pre-download CodeQL during network windows. The Semgrep rules cache is already local. Set OLLAMA_HOST for fully offline analysis if you can accept the quality tradeoff of local models.
Comparison with Alternatives
| Capability | Raptor | Traditional SAST (Semgrep/CodeQL alone) | Commercial ASPM | Manual Pentesting |
|---|---|---|---|---|
| Autonomous validation | ✅ Multi-stage LLM pipeline | ❌ Pattern matching only | ⚠️ Limited rule-based | ✅ Human expert |
| Exploit generation | ✅ Automated PoC creation | ❌ None | ❌ None | ✅ Manual |
| Patch generation | ✅ Automated secure fixes | ❌ None | ⚠️ Suggested fixes | ✅ Manual |
| False positive reduction | ✅ Deterministic multi-model consensus | ❌ High noise | ⚠️ Tuning required | ✅ Expert triage |
| Cost predictability | ✅ Budget caps + fast-tier short-circuit | ✅ Fixed licensing | ⚠️ Per-seat or per-scan | ❌ Expensive hourly |
| Air-gapped operation | ✅ Full offline capability | ✅ | ⚠️ Cloud-dependent | ✅ |
| Attack chain analysis | ✅ Cross-finding correlation | ❌ Isolated findings | ⚠️ Limited | ✅ Expert analysis |
| Binary analysis | ✅ AFL++, crash analysis, Z3 gadgets | ❌ Source only | ⚠️ Additional tools | ✅ Expert analysis |
| Scalability | ✅ Parallel LLM dispatch | ✅ Fast | ✅ | ❌ Human-limited |
| Transparency | ✅ Open source, MIT license | ✅/⚠️ Varies | ❌ Black box | ✅ Custom reporting |
The verdict: Raptor occupies a unique position. It combines the scalability of automated tools with the reasoning depth of human experts, at a fraction of commercial ASPM pricing. It's not a replacement for human judgment in critical decisions—it's a force multiplier that lets humans focus where they matter most.
FAQ: Your Burning Questions Answered
Is Raptor production-ready?
It's "works well enough that we can't stop using it" ready. The authors explicitly note it's not polished software—built with "enthusiasm and duct tape." Core commands (/agentic, /scan, /validate, /fuzz, /crash-analysis) are stable. /exploit and /patch are beta. /web is alpha/stub. For production CI, use the stable Python CLI layer.
How much does Raptor cost to run?
The framework itself is free (MIT). Costs are LLM inference only. Budget control via RAPTOR_MAX_COST caps per-run spend. Fast-tier short-circuiting reduces costs 40-60% on mature deployments. Typical small codebase analysis: $2-10. Large enterprise codebases: $20-50 with optimization.
Can I use Raptor without Claude Code?
Partially. The Python execution layer (raptor.py) runs independently for scanning and SARIF output. The full agentic workflow requires Claude Code for decision-making. The architecture is designed to allow other AI coding tools—Cursor, Windsurf, Copilot, Cline ports are listed as desired contributions.
Is Raptor safe to run on production code?
Raptor is read-only for analysis phases. /fuzz generates inputs and may crash target binaries—run in isolated environments. /exploit generates proof-of-concept code that should never execute on production. Always follow responsible disclosure practices.
What about CodeQL's commercial restrictions?
Raptor's MIT license permits commercial use. However, CodeQL (required for deep analysis) has its own license prohibiting commercial use in some contexts. Review GitHub's CodeQL terms before deploying in commercial environments. Semgrep and Z3 have permissive licenses.
How does Raptor compare to AI agents like Devin or SWE-agent?
Those are general-purpose software engineering agents. Raptor is specifically architected for security research with adversarial thinking frameworks, vulnerability validation pipelines, and exploit generation. It's a specialized tool for a specialized domain.
Can I contribute to Raptor?
Absolutely. High-priority needs include: proper web exploitation module, SSRF detection rules, YARA signature generation, ports to other AI coding tools, and enhanced firmware analysis. Join the #raptor channel on Prompt||GTFO Slack for coordination.
Conclusion: The Future of Security Research is Autonomous
Raptor represents something genuinely new: not just AI-assisted security tooling, but autonomous adversarial reasoning that operates at scale. The combination of structured validation pipelines, multi-model consensus, constraint-solving pre-screening, and expert personas creates a capability that didn't exist two years ago.
Is it perfect? No. The authors are refreshingly honest about its rough edges. But in a field where attackers automate relentlessly while defenders drown in alert fatigue, Raptor tilts the balance. It lets your team focus on creative exploitation and strategic defense while the agent handles systematic analysis.
The security teams that master tools like Raptor will operate at 10x the scale of those stuck in manual processes. The teams that ignore this shift will find themselves outpaced by both attackers and competitors.
Your move. Clone https://github.com/gadievron/raptor, spin up the Devcontainer, and run /agentic against your first project. See what an autonomous security agent finds when you let it think like an attacker. The results might just change how you approach security forever.
Get them bugs.....