Research

Building device infrastructure for mobile AI agents

Revyl Team

Every major tech company is building coding agents. The frontier is moving from “write code” to “interact with the product.” But the moment your agent needs to tap a button on a real iPhone, you hit an infrastructure problem that has nothing to do with AI.

We know this because we built it before. At Uber, we created DragonCrawl — the first AI-powered mobile testing system to run in production at scale. It executed Uber’s core trip flow across 85 cities with 99%+ stability, zero maintenance, on every Android code change. The models were the easy part. The infrastructure to actually run them against real devices was where we spent most of our time.

Now we’re building Revyl, and the problem is the same. There’s been a lot of progress on getting AI to reason about what to do on a screen. But getting it a real device to act on — with low-latency streaming, clean state, and verified execution — is a different kind of hard. That’s the infrastructure iceberg.

•   •   •

The iceberg

We wanted to get to a single CLI command: tell the device what to do in natural language, and have it happen. Here’s what that actually required.

revyl device tap --target "the login button"
Apple Silicon Mac Mini fleet management with Ansible provisioning (13 roles, kernel tuning, simulator matrices)
Device boot, clean-state erasure, port allocation
H.264 WebRTC streaming at 24-30 fps
Multi-model AI grounding with self-correction
Native action execution + visual verification
Distributed orchestration with sticky device routing
OpenTelemetry observability across every layer

One command on top. Seven layers of infrastructure underneath, each with its own failure modes, each harder than you’d expect.

•   •   •

Demo

$ revyl device tap --target "the sign in button"
  Capturing screenshot…
  Grounding: "sign in button" → (142, 308)
  Executing tap at (142, 308)…
  Verified: screen changed. Step complete.
  Done in 2.8s
•   •   •

The seven layers

Here’s what we actually had to build. Each of these is its own project, with its own set of surprises.

01
Custom Hardware Fleet

We run dedicated Apple Silicon Mac Minis. Provisioning them turned out to be one of the hardest parts — there’s no Terraform for Mac hardware. We wrote 13 Ansible roles covering kernel tuning, simulator creation, and observability. A custom LaunchDaemon supervisor handles auto-restarts, version polling, and graceful shutdown. Deploys happen in seconds via GitHub webhooks and AWS SSM.

Hard: Emulators leak resources. Every iOS update can break your simulator matrix. Someone has to be on-call for hardware failures at 2am.
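
The supervisor pattern behind auto-restarts can be sketched in a few lines. This is an illustrative stand-in, not Revyl’s actual daemon — `cmd`, `max_restarts`, and `backoff` are invented parameters, and in production launchd’s `KeepAlive` typically owns this role:

```python
import subprocess
import time

def supervise(cmd: list, max_restarts: int = 5, backoff: float = 1.0) -> int:
    """Restart `cmd` on failure with exponential backoff (illustrative only)."""
    attempt = 0
    while True:
        proc = subprocess.Popen(cmd)
        proc.wait()
        if proc.returncode == 0:
            return 0  # clean exit: treat as graceful shutdown, stop supervising
        attempt += 1
        if attempt > max_restarts:
            return proc.returncode  # give up and surface the failure
        time.sleep(backoff * 2 ** (attempt - 1))  # back off before restarting
```

The exponential backoff matters on real hardware: a crash-looping simulator process restarted at full speed will pin a Mac Mini’s CPU and starve healthy sessions.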

02
Device Provisioning

Android emulators cold-boot in 1-5 minutes. iOS simulators need to be erased between runs for hermetic isolation, then booted, then connected to a companion daemon. We added pre-warming to cut first-task latency by 30-60 seconds, but getting it right meant handling all the silent failures.

Hard: Boot processes hang without error. Port conflicts across concurrent sessions produce cryptic failures. Clean-state guarantees require careful lifecycle management that’s easy to get subtly wrong.
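
A minimal hermetic-reset sequence for one simulator might look like this. It shells out to Apple’s real `xcrun simctl` subcommands (`shutdown`, `erase`, `boot`, `bootstatus`); the injectable `run` parameter and the blanket timeout are our own additions, there to bound the silent hangs mentioned above:

```python
import subprocess

def hermetic_reset(udid: str, run=subprocess.run, timeout: float = 120.0) -> list:
    """Erase and boot one simulator from clean state; returns the commands issued."""
    issued = []
    for sub in (["shutdown", udid],          # ok if already shut down (check=False)
                ["erase", udid],             # wipe to factory state for isolation
                ["boot", udid],
                ["bootstatus", udid, "-b"]): # -b blocks until boot completes
        cmd = ["xcrun", "simctl", *sub]
        issued.append(cmd)
        run(cmd, timeout=timeout, check=False)  # timeout guards against hung boots
    return issued
```

Pre-warming amounts to running this loop ahead of demand so a session never pays the boot cost on its first action.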

03
Real-Time Video Streaming

We stream H.264 video at 24-30fps from real devices via WebRTC, anywhere in the world. A local ring buffer lets the AI “watch” what just happened. The pipelines are self-healing — they auto-restart on codec failures without losing the session. Getting here required a lot of time with GStreamer.

Hard: Codec negotiation, WHIP/WHEP handshakes, resolution mapping between logical and physical pixels. One misconfigured buffer adds multi-second latency. One codec mismatch kills the stream silently.
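
The ring-buffer idea is simple enough to sketch. Assuming frames arrive as (timestamp, bytes) pairs, a bounded deque keeps a rewindable window of recent frames for the AI to “watch”; the 72-frame default (~3 s at 24 fps) is our own illustrative choice, not Revyl’s actual setting:

```python
from collections import deque

class FrameRingBuffer:
    """Keep only the most recent frames so the agent can replay what just happened."""
    def __init__(self, max_frames: int = 72):  # ~3 s at 24 fps (assumed)
        self._frames = deque(maxlen=max_frames)

    def push(self, ts: float, frame: bytes) -> None:
        self._frames.append((ts, frame))  # deque evicts the oldest frame itself

    def last(self, seconds: float) -> list:
        """Return the frames captured within the trailing window."""
        if not self._frames:
            return []
        cutoff = self._frames[-1][0] - seconds
        return [(t, f) for t, f in self._frames if t >= cutoff]
```

The bounded deque is the point: memory stays constant no matter how long the session runs, which is what lets a self-healing pipeline restart without replaying history.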

04
AI Element Grounding

“Tap the login button” needs to become precise pixel coordinates. We use a reasoning model to understand intent and a vision model to locate the element on screen. When grounding fails, the system retries with expanded search regions. No CSS selectors, no accessibility IDs — just the screenshot and the instruction.

Hard: Vision models hallucinate coordinates. Screen densities vary wildly across devices. Dynamic UIs shift elements between the moment we capture and the moment we execute. Getting to production-grade accuracy took months of iteration.
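
The retry-with-expanded-regions loop can be sketched as follows. `locate` stands in for the vision model (any callable returning coordinates or `None`); the centered starting box and 25% growth factor are our own simplifications, not the production values:

```python
def ground(instruction: str, screen_size: tuple, locate,
           max_attempts: int = 3, grow: float = 0.25):
    """Ground an instruction to (x, y); widen the search region on each failure."""
    w, h = screen_size
    cx, cy = w / 2, h / 2
    half_w, half_h = w / 4, h / 4  # start from a centered quarter-screen region
    for _ in range(max_attempts):
        region = (max(0.0, cx - half_w), max(0.0, cy - half_h),
                  min(float(w), cx + half_w), min(float(h), cy + half_h))
        hit = locate(instruction, region)  # vision-model call (assumed interface)
        if hit is not None:
            return hit
        half_w *= 1 + grow  # miss: expand toward the full screen and retry
        half_h *= 1 + grow
    return None  # caller escalates (re-screenshot, re-plan, or fail the step)
```

Returning `None` instead of a guessed coordinate is deliberate: a hallucinated tap is far more expensive to debug than an explicit grounding failure.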

05
Action Execution & Verification

The loop is: execute, wait, verify. Every action is visually confirmed before we move on. We support the full mobile vocabulary — tap, swipe, type, long press, pinch, drag, scroll — using platform-native APIs (UIAutomator2 for Android, XCTest/IDB for iOS) with tiered retry logic.

Hard: ADB commands hang without warning. XCTest runners time out silently. Keyboard state is unpredictable across OS versions. Animation timing varies per device, so “wait for the screen to settle” is harder than it sounds.
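
Stripped to its core, the execute-wait-verify loop is: snapshot the screen, act, then poll until the screen differs from the snapshot or a deadline passes. `act` and `screenshot` are injected callables standing in for the platform-native layers (UIAutomator2/XCTest), and the byte-equality check is a deliberate over-simplification of real visual diffing:

```python
import time

def execute_verified(act, screenshot, timeout: float = 5.0, poll: float = 0.25,
                     clock=time.monotonic, sleep=time.sleep) -> bool:
    """Run an action and confirm the screen actually changed (illustrative loop)."""
    before = screenshot()        # capture the pre-action state
    act()                        # native tap/swipe/type
    deadline = clock() + timeout
    while clock() < deadline:
        if screenshot() != before:
            return True          # visual change observed: step verified
        sleep(poll)              # screen hasn't settled yet, keep polling
    return False                 # no change within the deadline: flag for retry
```

The polling deadline is what turns a silently hung ADB or XCTest call into a bounded, retryable failure instead of a stuck session.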

06
Parallel Orchestration

We needed to go from 1 session to 10,000+ concurrent sessions. The Mac Mini fleet handles iOS. A workflow orchestrator runs suites in parallel with automatic capacity management. Every session gets a clean, isolated device.

Hard: Devices are stateful — you can’t just spin up a container. We had to build sticky routing, distributed concurrency control, pre-warming to avoid 3-5 minute cold boots, and queue-depth autoscaling. All of it custom.
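
Sticky routing reduces to a session-to-device map plus exclusive device ownership. The real orchestrator needs distributed locks and queue-depth signals; this single-process sketch (our own simplification) only shows the stickiness invariant:

```python
class StickyRouter:
    """Pin each session to one device; never co-schedule two sessions on a device."""
    def __init__(self, devices):
        self.free = set(devices)
        self.assigned = {}  # session_id -> device

    def route(self, session_id):
        if session_id in self.assigned:
            return self.assigned[session_id]  # sticky: same device every time
        if not self.free:
            return None                       # queue; autoscaler adds capacity
        device = self.free.pop()
        self.assigned[session_id] = device
        return device

    def release(self, session_id):
        device = self.assigned.pop(session_id, None)
        if device is not None:
            self.free.add(device)  # back to the pool after a clean-state reset
```

`release` is where provisioning and orchestration meet: a device only rejoins the pool after the hermetic erase, or the next session inherits dirty state.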

07
Observability

We instrumented everything with OpenTelemetry: grounding latency, click accuracy, action time, streaming health, provisioning phases. Every session gets a full video recording for post-mortem review.

Hard: When something fails, you’re tracing across LLM, device, streaming, and orchestrator simultaneously. Building the observability layer was as much work as the infrastructure itself.
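
The shape of per-phase instrumentation can be shown with a stdlib-only stand-in. In production these would be OpenTelemetry spans; here a context manager records phase durations so per-layer latency (grounding, action, streaming) lands in one trace-like record. `SessionTrace` and its method names are hypothetical:

```python
import time
from contextlib import contextmanager

class SessionTrace:
    """Collect named phase timings for one device session (OTel span stand-in)."""
    def __init__(self):
        self.phases = {}  # phase name -> duration in seconds

    @contextmanager
    def span(self, name: str):
        start = time.monotonic()
        try:
            yield
        finally:
            # record duration even if the phase raised, so failures stay visible
            self.phases[name] = time.monotonic() - start
```

Usage mirrors the agent loop: `with trace.span("grounding"): …`, then `with trace.span("action"): …`, so a slow step points directly at the responsible layer.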

•   •   •

The interface

Seven layers of complexity, one CLI:

# AI-grounded device actions — no selectors, no coordinates
$ revyl device tap --target "the login button"
  Verified. Step complete.
$ revyl device type --target "email field" --text user@company.com
  Verified. Step complete.
$ revyl device swipe --target "the feed" --direction up
  Verified. Step complete.
# Expose as MCP tools for any coding agent
$ revyl mcp
  MCP server running. Agent can call device actions directly.

The CLI is the primary interface. You say what to do in natural language, and we handle the screenshot, grounding, execution, and verification loop behind it.

We also expose everything as an MCP server — revyl mcp turns the device into a tool that any agent framework can call directly, whether it’s Claude, GPT, or something custom. The live video stream is always available for continuous perception-action loops.

•   •   •

Performance

We optimized for the tight loop an AI agent needs: capture the screen, ground the instruction, execute the action, verify the result. Here’s where we ended up compared to other approaches.

State Polling Latency: Revyl ~50ms · In-house ~400ms · BrowserStack ~650ms · Sauce Labs ~700ms
Time to First Device Action: Revyl ~8s · BrowserStack ~45s · Sauce Labs ~60s · In-house ~180s
Max Concurrent Device Sessions: Revyl 10,000+ · Sauce Labs ~50 · BrowserStack ~25
•   •   •

Numbers

24-30 FPS streaming
13 Ansible roles
10,000+ parallel sessions
~50ms polling latency
3 compute tiers
0 selectors required

Architecture

The full request path:

Your Agent → Revyl API → Orchestrator → Mac Mini Fleet (iOS) / K8s Cluster (Android) → Device Action → WebRTC Stream → Verified Result

Your agent talks to our API. We provision the device, stream the screen, ground the instruction, execute, verify, and return the result. You focus on the intelligence.

•   •   •

Try it

$ brew install revylai/tap/revyl
or
$ npm install -g @revyl/cli
View on GitHub →

If you’re building an agent that needs to interact with real mobile devices, we’d like to help. Reach out at anam@revyl.ai or book a demo.