PhoneAgent: The Secret AI Tool Automating iPhones Without Jailbreak

Bright Coding

Author

17 min read

What if your iPhone could literally do your work for you? Not just reminders or shortcuts—I'm talking about an AI agent that sees your screen, taps buttons, types text, and navigates between apps like a digital ghost in your pocket. No jailbreak. No complex enterprise MDM. Just pure, automated intelligence.

If you've ever spent 20 minutes manually filling out the same form, copying data between apps, or wishing you could batch-process tasks on mobile like you do with Python scripts on your laptop, you're about to feel a very specific emotion: relief mixed with mild outrage that this didn't exist sooner.

Enter PhoneAgent, the experimental open-source project that's making waves among developers who refuse to accept that mobile automation should be harder than desktop automation. While everyone's been obsessing over browser agents and desktop copilots, Rounak has quietly built something far more audacious: an AI agent that controls actual iPhone and Android devices through a clean, extensible RPC architecture.

The mobile automation space has been a desert of half-solutions. Apple's Shortcuts app? Too limited. XCTest UI tests? Buried in Xcode complexity. Android's UiAutomator? Powerful but painful to orchestrate at scale. PhoneAgent bridges all of this with a unified interface that lets AI agents like OpenAI's Codex and OpenClaw see and control your phone through simple JSON-RPC calls.

This isn't theoretical. The GitHub repository includes working demos of AI agents booking rides, changing settings, and navigating complex multi-app workflows. And the architecture is surprisingly elegant—clean enough for production experiments, hackable enough for your wildest automation dreams.

Ready to understand why mobile developers are quietly bookmarking this repo? Let's dive deep.

What Is PhoneAgent?

PhoneAgent is an experimental mobile automation framework created by developer Rounak, designed to enable AI agents to control iOS and Android devices through a standardized remote procedure call (RPC) interface. It operates in two distinct modes that together cover virtually every mobile automation scenario a developer might need.

The first mode is a self-contained iPhone agent built with SwiftUI, XCTest runner infrastructure, and OpenAI's Responses API. This is the "magic" demo you've probably seen on social media—a native iOS app where you speak or type a task, and the AI executes it directly on your device using accessibility APIs and UI testing frameworks. Your OpenAI API key lives securely in the iOS Keychain, and the agent loop handles everything from screen parsing to action execution.

The second mode is where things get really interesting for developers: an external bridge that lets AI coding agents like Codex (OpenAI's coding model) and OpenClaw control both iOS and Android devices remotely. This transforms your phone into an API endpoint that any AI system can manipulate through simple JSON-RPC commands.

The project emerged from a simple but profound observation: desktop and browser automation have matured dramatically (think Playwright, Selenium, computer-use APIs), but mobile remained a fragmented mess of platform-specific tools with no universal interface. PhoneAgent solves this by providing a shared action surface across both platforms—get_tree, tap, scroll, enter_text, open_app—that works identically whether you're targeting an iPhone simulator or a physical Android device.

What's driving attention now is timing. With OpenAI's Codex and similar coding agents becoming genuinely capable, developers are hungry for ways to extend AI automation beyond the browser. PhoneAgent arrives as a pluggable, hackable bridge between these powerful models and the mobile devices where we actually spend our digital lives. It's experimental, yes—but it's the most coherent mobile automation architecture I've seen emerge from the open-source community this year.

Key Features That Make PhoneAgent Different

PhoneAgent isn't just another UI testing wrapper. The architecture reveals serious engineering decisions that solve real problems developers face when automating mobile at scale.

Unified Cross-Platform RPC Surface. The heart of PhoneAgent is its shared JSON-RPC action API. Whether you're controlling iOS through XCTest-hosted actions or Android through adb + UiAutomator, the same eleven commands work identically: get_tree (UI hierarchy), get_screen_image (screenshot), get_context (state), set_api_key, open_app, tap, tap_element, enter_text, scroll, swipe, and stop. This means your AI agent logic stays platform-agnostic—you write once, run anywhere.
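To make the shared action surface concrete, here is a minimal sketch of how an orchestrator might build a request for one of those eleven methods. It assumes the bridge accepts a standard JSON-RPC 2.0 envelope; the article does not document the wire format, and the parameter name `bundle_identifier` is a placeholder of mine rather than something taken from the repository.

```python
import itertools
import json

# Method names exactly as listed in the article.
SHARED_METHODS = frozenset({
    "get_tree", "get_screen_image", "get_context", "set_api_key",
    "open_app", "tap", "tap_element", "enter_text", "scroll", "swipe", "stop",
})

_request_ids = itertools.count(1)


def build_request(method, params=None):
    """Return a JSON-RPC 2.0 request string for one shared action."""
    if method not in SHARED_METHODS:
        raise ValueError(f"unknown PhoneAgent method: {method}")
    payload = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        # Parameter names are assumptions; check the bridge source for the real ones.
        payload["params"] = params
    return json.dumps(payload)


request = json.loads(
    build_request("open_app", {"bundle_identifier": "com.apple.Preferences"})
)
print(request["method"], request["params"])
```

Rejecting unknown method names up front keeps a model's hallucinated action from ever reaching the device.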

Dual Operating Modes for Different Use Cases. The in-app iPhone agent mode prioritizes user experience: microphone input, wake-word activation, Keychain-secured API keys, and notification-based completion loops. The external bridge mode prioritizes developer flexibility: CLI-driven, scriptable, integrable into any automation pipeline. Most tools force you into one paradigm. PhoneAgent lets you choose based on whether you're building a consumer app or a backend automation system.

Native iOS Security Integration. API keys aren't dumped in plaintext or UserDefaults. The in-app agent uses iOS Keychain for OpenAI credential storage, and the RPC bridge is strictly localhost-bound (127.0.0.1:45678) with physical device workflows using SSH-style localhost forwarding. For a project at this experimental stage, the security model is surprisingly thoughtful.

Wireless Android Support. No USB tethering required for development. The adb pair and adb connect workflow enables full wireless debugging, and the bridge launcher auto-discovers available devices. This matters enormously for CI/CD pipelines and remote device labs where physical USB connections are impractical.

Always-On Wake Word Mode. The iOS agent can run persistently, listening for a custom wake phrase. This transforms your phone from a passive automation target into an active assistant that awaits commands without manual app launching—a crucial UX gap that most automation frameworks ignore entirely.

Rich Metadata Return Values. Screenshots include parseable metadata. UI trees expose coordinate rectangles in precise {{x, y}, {w, h}} format. The AI doesn't just "see" your screen—it receives structured, actionable data that enables reliable element targeting without brittle XPath or accessibility label dependencies.
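The {{x, y}, {w, h}} rectangle notation is easy to turn into numbers an agent can reason about. The sketch below parses such a string and computes the rectangle's center, which is usually the safest point to tap. The helper names are mine, and it assumes the string contains plain decimal numbers with no units.

```python
import re

_NUM = r"(-?\d+(?:\.\d+)?)"
_RECT = re.compile(
    r"^\{\{\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\}\s*,\s*"
    r"\{\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\}\}$"
)


def parse_rect(text):
    """Parse '{{x, y}, {w, h}}' into a tuple of four floats."""
    match = _RECT.match(text.strip())
    if match is None:
        raise ValueError(f"not a rectangle: {text!r}")
    x, y, w, h = (float(group) for group in match.groups())
    return x, y, w, h


def center(text):
    """Return the (x, y) midpoint of a rectangle string."""
    x, y, w, h = parse_rect(text)
    return x + w / 2, y + h / 2


print(center("{{100, 200}, {50, 50}}"))  # (125.0, 225.0)
```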

Real-World Use Cases Where PhoneAgent Shines

Cross-Platform App Testing at Scale. Imagine running your entire mobile test suite through a single Python script that targets both your iOS simulator and Android emulator with identical commands. PhoneAgent's unified RPC surface makes this achievable without maintaining separate Appium, Espresso, and XCUITest codebases. The get_tree and tap_element methods provide sufficient granularity for complex interaction verification.

AI-Powered Customer Support Automation. Deploy an AI agent that can walk users through troubleshooting steps on their actual device. "Open Settings, tap Privacy, scroll to Analytics"—executed automatically while the user watches. The in-app agent mode with microphone input makes this feel like natural conversation rather than robotic instruction-following.

Legacy App Migration and Data Transfer. Moving data between apps that don't expose APIs? PhoneAgent can automate the tedious manual work: open SourceApp, navigate to export, save to Files, open TargetApp, import from Files, verify completion. The open_app command with precise bundle identifiers (com.apple.Preferences, com.apple.mobileslideshow) enables reliable app switching.

Accessibility Testing and Compliance. Automated verification that your app works with VoiceOver and Switch Control by programmatically navigating with the same accessibility APIs that assistive technologies use. The XCTest foundation means you're testing through Apple's official accessibility infrastructure, not simulated hacks.

Continuous Integration for Mobile Games. Game UI is notoriously difficult to test with traditional frameworks. PhoneAgent's screenshot-and-tap approach works regardless of whether elements are standard UIKit views or SpriteKit scenes. The get_screen_image method gives your AI vision model the visual context it needs to make gameplay decisions.

Personal Productivity Automation. The wake-word mode enables genuinely hands-free mobile workflows. "PhoneAgent, send my location to Sarah, then open Maps and start navigation home"—executed across three apps without you touching the screen. This is the kind of seamless cross-app automation that Apple Shortcuts promises but rarely delivers reliably.

Step-by-Step Installation & Setup Guide

Getting PhoneAgent running requires platform-specific setup, but the repository provides clear paths for both operating modes.

Prerequisites

For macOS hosts (required for iOS, recommended for Android):

  • Xcode with iOS Simulator or physical iPhone with Developer Mode enabled
  • Python 3 installed and available in PATH
  • Android SDK Platform Tools (adb) if targeting Android

For Android-only development on Linux/Windows:

  • Python 3
  • Android SDK with adb accessible
  • USB debugging or wireless debugging enabled on target device

In-App iPhone Agent Setup

# Clone the repository
git clone https://github.com/rounak/PhoneAgent.git
cd PhoneAgent

# Open the Xcode project
open PhoneAgent.xcodeproj

In Xcode:

  1. Select the PhoneAgent scheme (not the UITest target)
  2. Choose your target device—iOS simulator or connected physical iPhone
  3. Build and run (⌘+R)
  4. On first launch, enter your OpenAI API key when prompted (stored in Keychain)
  5. Grant accessibility permissions when requested—this is required for UI control

The app presents a simple interface: text input field, microphone button, and settings gear. Toggle "Always On" in settings to enable wake-word detection.

External Bridge Setup (AI Agent Control)

iOS Bridge:

# Launch the RPC bridge server on localhost:45678
./.agents/skills/phoneagent/scripts/start_rpc_bridge_local.sh

This script starts the XCTest-hosted RPC server. For physical devices, it automatically sets up localhost port forwarding so your Mac can communicate with the phone's test runner.

Android Bridge:

# For USB-connected or wirelessly-paired device
./.agents/skills/phoneagent/scripts/start_android_rpc_bridge_local.sh

# For specific wireless device (after adb pair/connect)
./.agents/skills/phoneagent/scripts/start_android_rpc_bridge_local.sh --serial 192.168.1.100:42073

The Android launcher auto-discovers available adb devices if you don't specify a serial.

Wireless Android Configuration

For wireless development without USB cables:

# On your Android device: Settings → Developer options → Wireless debugging
# Tap "Pair device with pairing code"

# Note the IP address and pairing port shown on the device, then pair from your computer
adb pair 192.168.1.100:42000
# Enter the 6-digit pairing code when prompted

# Now connect using the main ADB port (different from pairing port)
adb connect 192.168.1.100:42073

# Verify connection
adb devices -l
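If you script the bridge launch, it helps to pick the serial from the `adb devices -l` output instead of copying it by hand. The following sketch only parses text, so it runs without a device attached. The sample output is illustrative (the model and product names are made up), and it relies on adb's usual layout of a serial followed by a state column.

```python
def ready_serials(adb_output):
    """Return serials whose state column is 'device' (connected and authorized)."""
    serials = []
    for line in adb_output.splitlines():
        parts = line.split()
        # The header line and 'offline'/'unauthorized' rows fail this check.
        if len(parts) >= 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


# Illustrative output only; real device details will differ.
SAMPLE = """List of devices attached
192.168.1.100:42073    device product:example model:Example_Phone transport_id:3
emulator-5554          offline
"""

print(ready_serials(SAMPLE))
```

The first returned serial can then be passed to the Android launcher through the `--serial` flag shown earlier.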

Verification

Test your setup with the generic RPC CLI:

# Should return JSON UI hierarchy
./.agents/skills/phoneagent/scripts/rpc.py get-tree

# Should capture screenshot to /tmp/phoneagent-artifacts/
./.agents/skills/phoneagent/scripts/rpc.py get-screen-image --print-metadata

Code Examples from the Repository

The PhoneAgent repository includes practical command-line examples that demonstrate the full automation surface. Let's examine the most important patterns with detailed explanations.

Opening Apps by Platform-Specific Identifier

The open_app command demonstrates PhoneAgent's clean cross-platform abstraction:

# iOS: Use bundle identifier (found in Info.plist or via `ideviceinstaller -l`)
./.agents/skills/phoneagent/scripts/rpc.py open-app com.apple.Preferences

# Android: Use package name (found in Play Store URL or `adb shell pm list packages`)
./.agents/skills/phoneagent/scripts/rpc.py open-app com.android.settings

What's happening here? The rpc.py CLI automatically detects which platform bridge is running (iOS on port 45678 or Android on its configured port) and translates the open-app command to the appropriate native action. On iOS, this triggers XCUIApplication(bundleIdentifier:) launch. On Android, it constructs an adb shell am start -n command targeting the package. The unified CLI means your automation scripts don't need platform branches for basic navigation.

Capturing and Analyzing Screen State

Visual context is essential for AI agents. PhoneAgent provides two complementary methods:

# Fetch structured UI hierarchy (element tree with coordinates)
./.agents/skills/phoneagent/scripts/rpc.py get-tree

# Capture screenshot with metadata for vision models
./.agents/skills/phoneagent/scripts/rpc.py get-screen-image --print-metadata

The get-tree response returns a JSON representation of the accessibility hierarchy—every visible element's type, label, value, and frame rectangle. This enables deterministic automation: target elements by property rather than visual matching.
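Targeting by property might look like the sketch below: a depth-first search for the first element with a given label. The tree schema here (keys `type`, `label`, `frame`, `children`) is an assumption for illustration; the article names the fields but not the exact JSON layout, so adjust the keys to whatever `get-tree` actually prints.

```python
def find_by_label(node, label):
    """Return the first node whose 'label' matches, searching depth-first."""
    if node.get("label") == label:
        return node
    for child in node.get("children", []):
        found = find_by_label(child, label)
        if found is not None:
            return found
    return None


# Hypothetical tree in the assumed schema.
sample_tree = {
    "type": "Application",
    "label": "Settings",
    "frame": "{{0, 0}, {390, 844}}",
    "children": [
        {"type": "Cell", "label": "Wi-Fi", "frame": "{{0, 180}, {390, 44}}"},
        {"type": "Cell", "label": "Privacy", "frame": "{{0, 224}, {390, 44}}"},
    ],
}

target = find_by_label(sample_tree, "Privacy")
print(target["frame"])
```

The returned frame string is what you would hand to a `tap-element` call.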

The get-screen-image command writes a PNG to /tmp/phoneagent-artifacts/ and, with --print-metadata, outputs JSON including dimensions, timestamp, and platform. This is designed for multimodal AI agents: send the image to GPT-4V or similar, receive back natural language descriptions, then translate to precise tap_element calls.

The Complete AI Agent Interaction Loop

Here's how a Codex or OpenClaw agent would typically operate PhoneAgent:

# Conceptual Python pattern using the RPC CLI programmatically
import json
import subprocess

RPC_SCRIPT = "./.agents/skills/phoneagent/scripts/rpc.py"

def phone_agent_action(command, *args):
    """Run one PhoneAgent RPC command and return its parsed JSON output.

    Assumes the CLI prints a single JSON document on stdout.
    """
    cmd = [RPC_SCRIPT, command, *args]
    # check=True raises CalledProcessError on a non-zero exit code,
    # so a failed bridge call is not mistaken for an empty response
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)

# Agent loop: observe → decide → act
def execute_task(task_description):
    # 1. Get current visual state
    screen = phone_agent_action("get-screen-image", "--print-metadata")
    tree = phone_agent_action("get-tree")

    # 2. Send to AI model (conceptual; a real integration calls the model API)
    # model_input = f"Task: {task_description}\nScreen: {screen}\nUI Tree: {tree}"
    # action = ai_model.decide(model_input)

    # 3. Execute decided action
    # Example: tap element at specific coordinates
    # phone_agent_action("tap-element", "{{100, 200}, {50, 50}}")

    # Example: type text into focused field
    # phone_agent_action("enter-text", "Hello, automated world!")

    # 4. Verify and continue loop until task complete

Critical implementation detail: The tap_element and enter_text commands use iOS-style rectangle notation {{x, y}, {width, height}} even on Android. This coordinate normalization means your AI agent's spatial reasoning transfers across platforms without modification.

Custom Endpoint Configuration

For non-standard deployments or multiple device labs:

# Target specific host/port (useful for Docker, remote servers, or multiple devices)
./.agents/skills/phoneagent/scripts/rpc.py --host 10.0.1.5 --port 45679 get-tree

Why this matters: The default 127.0.0.1:45678 assumes local development. The --host and --port flags let a central orchestrator address several PhoneAgent bridges, one per device, which is how you would scale toward a device lab. Because each bridge binds to localhost, reaching one that runs on another machine still means forwarding its port (for example through an SSH tunnel) rather than exposing it directly.
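A small orchestrator can keep a table of bridges and build the matching command line for each. The sketch below uses only the `--host` and `--port` flags shown above; the device names and the second endpoint are hypothetical, and it builds argument lists without executing them.

```python
RPC_SCRIPT = "./.agents/skills/phoneagent/scripts/rpc.py"


def rpc_argv(command, *args, host="127.0.0.1", port=45678):
    """Build the argument list for one rpc.py call against a given bridge."""
    return [RPC_SCRIPT, "--host", host, "--port", str(port), command, *args]


# Hypothetical device table: name -> (host, port).
BRIDGES = {
    "iphone-sim": ("127.0.0.1", 45678),
    "lab-android": ("10.0.1.5", 45679),
}

commands = {
    name: rpc_argv("get-tree", host=host, port=port)
    for name, (host, port) in BRIDGES.items()
}

for name, argv in commands.items():
    print(name, " ".join(argv))
```

Passing each list to `subprocess.run` would then fan the same command out to every bridge.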

iOS-Specific Agent Loop (In-App Mode)

The submit_prompt method is exclusive to iOS and powers the conversational agent:

// Conceptual flow from SimulatorRPCServer.swift architecture
// This runs inside the XCTest-hosted bridge, not from CLI

func handleSubmitPrompt(request: JSONRPCRequest) -> JSONRPCResponse {
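    // Reject malformed requests instead of crashing on a force cast
    guard let prompt = request.params["prompt"] as? String else {
        return JSONRPCResponse(result: ["status": "error", "reason": "missing prompt"])
    }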
    
    // 1. Capture current screen state
    let screenshot = captureScreenshot()
    let uiTree = getAccessibilityTree()
    
    // 2. Send to OpenAI Responses API with function definitions
    let response = openAIClient.complete(
        prompt: prompt,
        context: [screenshot, uiTree],
        tools: availableActions // tap, scroll, enter_text, etc.
    )
    
    // 3. Execute returned tool calls
    for toolCall in response.toolCalls {
        execute(action: toolCall.name, params: toolCall.arguments)
    }
    
    // 4. Return completion status
    return JSONRPCResponse(result: ["status": "completed", "actions_taken": response.toolCalls.count])
}

The architectural insight: The in-app agent isn't hardcoded with task logic. It's a generic executor that sends screen context to OpenAI and interprets the model's tool-use decisions. This means it improves automatically as GPT-4 and successors get smarter—no app updates required.

Advanced Usage & Best Practices

Handle Animation Timing Explicitly. The README warns that "UI tree snapshots can be noisy/stale during animations." Production automations should poll get_tree with small delays until element stability is detected, or use get_screen_image and frame-differencing to detect motion cessation.
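One way to sketch the polling idea: keep fetching the tree until several consecutive snapshots are identical. The function takes the fetcher as a parameter, so the same code works with a real `get-tree` call or a test double. The thresholds are guesses of mine, not values from the README.

```python
import time


def wait_until_stable(fetch_tree, required_matches=2, delay=0.2,
                      max_polls=20, sleep=time.sleep):
    """Poll fetch_tree() until the result stops changing, then return it."""
    previous = fetch_tree()
    matches = 0
    for _ in range(max_polls - 1):
        sleep(delay)
        current = fetch_tree()
        if current == previous:
            matches += 1
            if matches >= required_matches:
                return current
        else:
            # The UI is still moving; restart the count from this snapshot.
            matches = 0
            previous = current
    raise TimeoutError(f"UI tree still changing after {max_polls} polls")


# Simulated snapshots: one animation frame, then a settled screen.
frames = iter(["mid-animation", "settled", "settled", "settled"])
print(wait_until_stable(lambda: next(frames), sleep=lambda seconds: None))
```

Comparing whole snapshots is crude but cheap; if a screen contains a clock or spinner, compare only the subtree you care about.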

Implement Retry Logic for Text Input. "Keyboard/text reliability can vary by app and platform." Wrap enter_text calls with verification: read back the field value via get_tree, retry with alternative input methods if mismatch detected. Some apps intercept keyboard input for custom handling that standard adb shell input text or XCTest typing can't penetrate.
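The verify-and-retry pattern can be written independently of any one bridge. In this sketch, `enter_text` and `read_value` are callables you supply (for instance, thin wrappers around the `enter-text` and `get-tree` commands); both names are placeholders. It assumes each attempt replaces the field contents, so clear the field in your own `enter_text` wrapper if the app appends instead.

```python
def enter_text_verified(enter_text, read_value, text, attempts=3):
    """Type text, read the field back, and retry until the value matches.

    Returns the number of attempts used.
    """
    for attempt in range(1, attempts + 1):
        enter_text(text)
        if read_value() == text:
            return attempt
    raise RuntimeError(f"field never matched {text!r} after {attempts} attempts")


# Test double: the first attempt drops characters, the second succeeds.
field = {"value": "", "calls": 0}


def flaky_enter(text):
    field["calls"] += 1
    field["value"] = text if field["calls"] >= 2 else text[:3]


print(enter_text_verified(flaky_enter, lambda: field["value"], "hello"))  # 2
```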

Secure Your RPC Endpoints. While PhoneAgent binds to localhost by default, any process on your machine can reach port 45678. On shared development machines, consider SSH tunneling or firewall rules. The Android bridge's --serial restriction helps, but verify no other adb server processes can inject commands.

Cache App Identifiers. The common iOS identifiers (Settings: com.apple.Preferences, Camera: com.apple.camera, Photos: com.apple.mobileslideshow, Messages: com.apple.MobileSMS, Home Screen: com.apple.springboard) should be constants in your automation framework. For third-party apps, extract identifiers programmatically rather than hardcoding—Apple can change these between iOS versions.
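Keeping those identifiers in one place could be as simple as an enum. The values below are the ones listed above; the class name and helper are mine.

```python
from enum import Enum


class IOSApp(str, Enum):
    """Bundle identifiers for common iOS system apps, as listed in the article."""
    SETTINGS = "com.apple.Preferences"
    CAMERA = "com.apple.camera"
    PHOTOS = "com.apple.mobileslideshow"
    MESSAGES = "com.apple.MobileSMS"
    HOME_SCREEN = "com.apple.springboard"


def open_app_argv(app):
    """Build the rpc.py open-app command for a known app."""
    return ["./.agents/skills/phoneagent/scripts/rpc.py", "open-app", app.value]


print(open_app_argv(IOSApp.PHOTOS))
```

A typo in an identifier then fails at import time rather than halfway through an automation run.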

Design for the Agent Loop Limitation. Android doesn't yet implement submit_prompt. If you need conversational AI control on Android, build your own loop: your orchestrator calls get_screen_image, sends to your AI model, parses the response, and issues discrete RPC commands. This is slightly more code but gives you full control over prompting and error handling.
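A do-it-yourself loop for Android might be shaped like this. The three callables are stand-ins: `observe` would wrap `get-screen-image` or `get-tree`, `decide` would call your model, and `act` would issue one RPC command. None of these names come from the repository, and the step budget is an arbitrary safety limit.

```python
def run_agent_loop(task, observe, decide, act, max_steps=15):
    """Observe, decide, act until decide() returns None or the budget runs out.

    Returns the list of actions that were executed.
    """
    history = []
    for _ in range(max_steps):
        observation = observe()
        action = decide(task, observation, history)
        if action is None:  # the model signals that the task is complete
            return history
        act(action)
        history.append(action)
    raise RuntimeError(f"task not finished within {max_steps} steps")


# Scripted stand-in for a model: two actions, then done.
script = iter([("open_app", "com.android.settings"), ("tap", "Network"), None])
executed = []

result = run_agent_loop(
    "open network settings",
    observe=lambda: "screen-state",
    decide=lambda task, observation, history: next(script),
    act=executed.append,
)
print(result)
```

Passing the history back into `decide` lets the model see what it already tried, which helps it avoid repeating a failed tap.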

Comparison with Alternatives

| Feature | PhoneAgent | Appium | UI Automator (raw) | Shortcuts | Playwright Mobile |
|---|---|---|---|---|---|
| Cross-platform single API | ✅ Yes | ⚠️ Partial (different locators) | ❌ No | ❌ iOS only | ⚠️ Browser-only |
| AI agent native integration | ✅ Built for this | ❌ Manual | ❌ Manual | ❌ No | ❌ No |
| Physical device support | ✅ iOS + Android | ✅ Yes | ✅ Android only | ❌ No | ❌ No |
| No jailbreak/root required | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Wake word / voice trigger | ✅ iOS mode | ❌ No | ❌ No | ⚠️ Limited | ❌ No |
| Wireless operation | ✅ Android | ⚠️ Complex setup | ⚠️ ADB required | ✅ Yes | ✅ Yes |
| Open source | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes |
| Production maturity | ⚠️ Experimental | ✅ Mature | ✅ Mature | ✅ Mature | ✅ Mature |

When to choose PhoneAgent: You're building AI-native mobile automation, need unified iOS/Android scripting, or want voice-triggered in-app agents. The experimental status is acceptable for R&D, prototyping, and internal tools.

When to choose Appium: You need battle-tested stability, extensive community plugins, and enterprise support. Appium's ecosystem is vastly larger, but its AI integration requires significant custom engineering.

When to choose raw UI Automator/Espresso/XCUITest: You're writing traditional automated tests with explicit assertions, not AI-driven exploration. These frameworks offer finer control but no intelligence abstraction.

FAQ

Is PhoneAgent safe to use with my personal device? The RPC bridge is localhost-only and API keys use iOS Keychain. However, it's experimental software—use a dedicated test device or simulator for sensitive operations. Never run the bridge on public networks without additional authentication.

Do I need a paid OpenAI API key? The in-app iPhone agent requires an OpenAI API key for the Responses API. The external bridge mode (Codex/OpenClaw control) doesn't inherently require OpenAI—you can use any AI system that generates the appropriate JSON-RPC commands.

Can PhoneAgent automate any app, including games? It can interact with any app through accessibility APIs and screenshot analysis. Games with custom rendering engines (Unity, Unreal) may require vision-based (get_screen_image) rather than hierarchy-based (get_tree) automation, which is supported but slower.

Why does Android lack submit_prompt? The README explicitly notes this as a known limitation. The Android bridge focuses on RPC execution; you'll need to implement the agent loop in your orchestration code. Community contributions to add this would be welcome.

How does this compare to Apple's AX API or Android's AccessibilityService? PhoneAgent builds on top of these native frameworks. It doesn't replace them—it provides the unified RPC layer and AI integration that raw platform APIs lack. You're still using XCTest and UiAutomator under the hood.

Can I run this in CI/CD pipelines? Yes, particularly the external bridge mode. The headless-friendly CLI interface and wireless Android support make it suitable for device labs. However, iOS simulator launching in CI requires macOS runners (GitHub Actions, Bitrise, etc.).

What happens when iOS or Android updates break the automation? This is the reality of mobile automation. PhoneAgent's abstraction layer minimizes platform-specific code in your automations, but underlying XCTest or UiAutomator changes may require framework updates. The open-source nature means you can patch rather than wait for vendor fixes.

Conclusion

PhoneAgent represents something rare in the mobile automation space: a genuinely new architectural approach rather than incremental improvement on existing tools. By treating phones as AI-controllable endpoints through clean JSON-RPC, it bridges the gap between powerful language/vision models and the mobile platforms where billions of users actually work.

The experimental status is real—you'll encounter rough edges, platform limitations, and the occasional stale UI tree. But the foundation is solid: SwiftUI and XCTest for native iOS integration, adb and UiAutomator for Android reach, and a unified command surface that lets AI agents reason about mobile interfaces the same way they reason about web pages.

For developers building the next generation of AI-powered mobile experiences, PhoneAgent isn't just a tool. It's a proof of concept for what's possible when we stop treating mobile as a harder-to-reach desktop and start designing for AI-native device control.

Ready to automate your phone? Clone the repository, run through the quick start, and experiment with the rpc.py CLI. The demos don't do it justice—there's something genuinely startling about typing a command on your laptop and watching your iPhone respond like a puppet on digital strings. That's the future PhoneAgent is building, one JSON-RPC call at a time.

Star the repo, open an issue with your automation scenario, and join the small but growing community of developers who refuse to accept that mobile should be the platform AI can't reach.
