v0.5.0 · open source

Computer-use agents fail silently.
farscry tells them when it happens.

Agent clicks a button. Tool returns {"success": true}. Screen doesn't change. Agent has no idea.
farscry augment detects this inline and tells the agent immediately. Zero model changes, zero retraining, three lines of MCP config.

View on GitHub → npm install -g farscry

farscry extract screen.png

$ farscry extract screen.png
=== farscry visual context ===
source: screen.png
screen_type: config
state_id: phash:8f4a2c9d1e3b7f6a
confidence: high
agent_context: "Payment settings - Save available"
---
[middle-right] button "Save Changes" enabled:true
[middle-center] input "Card number" empty:true
[middle-center] input "Expiry" value:"12/26"
[bottom] error "Value must be ≤ 10000"
affordances:
  click → "Save Changes" at (400,300) enabled:true
  type → "Card number" at (300,200) current:""

Coordinate Extraction

Coordinates, not captions

Screenshot tooling returns descriptions. Workflows guess where to click, miss the target, and fail. OSWorld benchmarks show a ~88% overall task failure rate (GPT-4V baseline, arXiv:2404.07972; Claude CUA: ~15% success).

Agents that guess coordinates fail. Agents with exact coordinates act.

farscry returns typed elements with exact pixel coordinates. The workflow knows the button is at (400,300). It clicks. It succeeds.

Screenshot, error printout, Figma export, phone photo of a screen. Offline. No GPU.

farscry output

[middle-right] button "Submit" enabled:true at (640,480)
[middle-center] input "Username" empty:true at (400,200)
[middle-center] input "Password" empty:true at (400,280)
[bottom] error "Invalid password" at (400,340)
[bottom] link "Forgot password?" at (400,380)
affordances:
  type → "Username" at (400,200)
  type → "Password" at (400,280)
  click → "Submit" at (640,480)
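The element lines above share a regular shape: a bracketed region, an element type, a quoted label, optional key:value flags, and an optional "at (x,y)" position. As a minimal sketch, assuming that shape holds for every element (it is inferred from the samples on this page, not taken from the VASP spec), a workflow could turn the text into typed records as follows. Element and parse_elements are illustrative names, not part of farscry.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Patterns inferred from the sample output on this page (an assumption).
ELEMENT_RE = re.compile(
    r'\[(?P<region>[a-z-]+)\]\s+(?P<kind>\w+)\s+"(?P<label>[^"]*)"(?P<rest>[^\[]*)'
)
ATTR_RE = re.compile(r'(\w+):(true|false|"[^"]*")')
COORD_RE = re.compile(r'at \((\d+),(\d+)\)')


@dataclass
class Element:
    region: str                      # e.g. "middle-right"
    kind: str                        # e.g. "button", "input", "error"
    label: str                       # visible text of the element
    attrs: Dict[str, object] = field(default_factory=dict)
    at: Optional[Tuple[int, int]] = None  # pixel position, if reported


def parse_elements(text: str) -> List[Element]:
    """Parse element lines; everything from 'affordances:' on is ignored."""
    body = text.split("affordances:", 1)[0]
    elements = []
    for m in ELEMENT_RE.finditer(body):
        rest = m.group("rest")
        attrs: Dict[str, object] = {}
        for key, raw in ATTR_RE.findall(rest):
            # Booleans become bool; quoted values lose their quotes.
            attrs[key] = (raw == "true") if raw in ("true", "false") else raw.strip('"')
        coord = COORD_RE.search(rest)
        at = (int(coord.group(1)), int(coord.group(2))) if coord else None
        elements.append(Element(m.group("region"), m.group("kind"), m.group("label"), attrs, at))
    return elements


SAMPLE = (
    '[middle-right] button "Submit" enabled:true at (640,480)\n'
    '[middle-center] input "Username" empty:true at (400,200)\n'
    'affordances:\n  click → "Submit" at (640,480)\n'
)

for element in parse_elements(SAMPLE):
    print(element.kind, element.label, element.at)
```

Because the parser scans the whole text with a regular expression instead of splitting on newlines, it also accepts output that has been collapsed onto a single line.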

Performance

65×

faster than Tesseract on 4K screens

38ms warm · $0 · offline · N=223, ScreenSpot-Pro (MIT)

65× faster · 4K screens vs Tesseract · 38ms warm

~9× fewer tokens · 1080p vs Claude Vision (measured)

~16× fewer tokens · 4K screens vs Claude Vision · N=223

Speed

vs Tesseract 5.5 (4K): ~2,500ms → 38ms (65×)

vs Cloud Vision API: ~2-5s → 38ms (65–130×)

Token usage per image

vs Claude Vision · 1080p: 1,568 tokens → ~175 tokens (~9×)

vs Claude Vision · 4K (N=223): 1,568 tokens → ~97 tokens (~16×)

Cost · 10,000 images / day

Claude Sonnet 4.6 · $3/MTok: $17,000/year → $0

Claude Opus 4.7 · $5/MTok (4K): $90,000/year → $0
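The Sonnet row and the token ratios can be checked with simple arithmetic. The sketch below assumes the cost counts only image input tokens at the listed price, 365 days a year, with 1,568 tokens per image as stated above; the function names are illustrative. The Opus 4K row presumably uses a different per-image token count that this page does not state, so it is not reproduced here.

```python
def per_image_cost(tokens_per_image: int, usd_per_mtok: float) -> float:
    """Cost in USD of sending one image to a cloud vision model."""
    return tokens_per_image / 1_000_000 * usd_per_mtok


def yearly_cost(images_per_day: int, tokens_per_image: int,
                usd_per_mtok: float, days: int = 365) -> float:
    """Yearly cost in USD, counting image input tokens only (assumption)."""
    return images_per_day * days * per_image_cost(tokens_per_image, usd_per_mtok)


def token_ratio(cloud_tokens: int, farscry_tokens: int) -> float:
    """How many times fewer tokens the text output uses."""
    return cloud_tokens / farscry_tokens


print(f"per image: ${per_image_cost(1_568, 3.0):.4f}")
print(f"per year:  ${yearly_cost(10_000, 1_568, 3.0):,.0f}")
print(f"1080p:     {token_ratio(1_568, 175):.1f}x fewer tokens")
print(f"4K:        {token_ratio(1_568, 97):.1f}x fewer tokens")
```

Under these assumptions the per-image figure comes to about $0.0047 and the yearly figure to roughly $17,000, matching the numbers quoted on this page.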

Methodology → github.com/teles-forge/farscry/benchmarks

Automatic Diff

38ms instead of 5 seconds, every action.

After an action, farscry diffs before → after. No re-upload. No tokens wasted on pixels that didn't change.

without

action → re-screenshot → 1,568 tokens to cloud → wait 2-5s → read result

~$0.0047 · ~3 seconds

farscry

action → farscry diff → 38ms → read result

Silent Failure Detection

Agents now know when their actions fail.

farscry augment compares visual state before and after every action. When nothing changed, it tells the agent inline, before it wastes more steps.

Without farscry augment

1. Agent calls click(450, 320)

2. Tool returns {"success": true}

✗ Screen didn't change. Agent has no idea.

… 8 more steps on the same broken state

~12,000 tokens wasted per loop

With farscry augment

1. farscry_mark_action()

2. Agent calls click(450, 320)

3. farscry_extract(screenshot)

✓ Agent receives inline warning. Pivots immediately.

detected at step 3 · zero extra tokens

farscry_extract response when action had no effect

=== farscry visual context ===
state_id: phash:8f4a2c9d1e3b7f6a
---
[middle-right] button "Save Changes" enabled:true
⚠ SILENT_FAILURE DETECTED
action had no visual effect
state_id_before: phash:8f4a2c9d1e3b7f6a
state_id_after: phash:8f4a2c9d1e3b7f6a
recommendation: try a different approach
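An agent harness can also act on this response programmatically. The sketch below reads the state_id_before and state_id_after fields from the response text and treats identical ids as "no visual effect". This is an illustration of the idea only: the field names come from the sample above, while exact string equality is an assumption (a real perceptual-hash comparison may tolerate small differences), and action_had_no_effect is a hypothetical helper name.

```python
import re
from typing import Optional

BEFORE_RE = re.compile(r"state_id_before:\s*(\S+)")
AFTER_RE = re.compile(r"state_id_after:\s*(\S+)")


def action_had_no_effect(response: str) -> Optional[bool]:
    """Return True if both state ids are present and identical,
    False if they differ, None if the response carries no before/after pair."""
    before = BEFORE_RE.search(response)
    after = AFTER_RE.search(response)
    if before is None or after is None:
        return None
    return before.group(1) == after.group(1)


RESPONSE = """=== farscry visual context ===
state_id: phash:8f4a2c9d1e3b7f6a
---
[middle-right] button "Save Changes" enabled:true
state_id_before: phash:8f4a2c9d1e3b7f6a
state_id_after: phash:8f4a2c9d1e3b7f6a
"""

if action_had_no_effect(RESPONSE):
    print("click had no visual effect; try a different approach")
```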

Three lines to enable farscry augment
// add under "mcpServers" in your MCP config; works with Claude Code, Cursor, any MCP client
"farscry": {
  "command": "farscry", "args": ["serve", "--mcp"]
}

One binary. Four ways to use it.

All four run locally. No server, no GPU, no account.

mode 1: describe Any image → typed coordinates

The output is typed UI elements with exact pixel coordinates, not a description. Pipe to any agent, MCP client, or CLI tool.

farscry extract checkout.png

$ farscry extract checkout.png
=== farscry visual context ===
screen_type: form
---
[top-center] heading "Checkout"
[middle-center] input "Name" empty:true
[middle-center] input "Card number" empty:true
[middle-center] input "Expiry" value:"12/26"
[middle-right] input "CVV" masked:true
[bottom] button "Pay $24.99" enabled:true
[bottom] button "Cancel" enabled:true
affordances:
  type → "Name" at (340,180)
  type → "Card number" at (340,240)
  click → "Pay $24.99" at (400,420)

mode 2: diff After every action, only what changed

In MCP mode, the daemon tracks state automatically, so no extra command is needed: each farscry_extract call returns what changed since the previous one. Via CLI, pass two screenshots directly.

farscry diff before.png after.png

$ farscry diff before.png after.png
=== farscry diff ===
from: phash:8f4a2c3d
to: phash:3d9b1e7a
---
appeared: error "Card declined" at (0,48)
changed: button "Submit" enabled:true → false
changed: button "Pay $24.99" loading:true
removed: spinner at (450,200)
38ms · 3 of 12 elements changed
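The appeared / changed / removed categories in the diff output can be illustrated with a small set comparison. This is a sketch of the general idea, not farscry's implementation: it assumes each element is identified by its type and label, and the Snapshot shape and diff_snapshots name are made up for the example.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]        # (element type, label), assumed to identify an element
Snapshot = Dict[Key, dict]   # attributes per element


def diff_snapshots(before: Snapshot, after: Snapshot) -> dict:
    """Classify elements as appeared, removed, or changed between two snapshots."""
    appeared = [key for key in after if key not in before]
    removed = [key for key in before if key not in after]
    changed = {}
    for key in before.keys() & after.keys():
        old, new = before[key], after[key]
        if old != new:
            # Record only the attributes whose values differ, as (old, new) pairs.
            changed[key] = {
                attr: (old.get(attr), new.get(attr))
                for attr in old.keys() | new.keys()
                if old.get(attr) != new.get(attr)
            }
    return {"appeared": appeared, "removed": removed, "changed": changed}


BEFORE = {("button", "Submit"): {"enabled": True}, ("spinner", ""): {}}
AFTER = {("button", "Submit"): {"enabled": False}, ("error", "Card declined"): {}}

result = diff_snapshots(BEFORE, AFTER)
print(result)
```

Only the changed entries need to reach the agent, which is what keeps the diff small compared with re-sending a full screenshot.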

mode 3: pipe Cmd+Shift+4, pipe, done

Capture a region with your system shortcut. Pipe straight to your agent. Zero files. Zero friction.

farscry extract --from-clipboard | claude -p "fix this"

Smart Paste

Cmd+V becomes image-aware.

After farscry setup, Cmd+V detects what's in your clipboard, image or text, and routes it. No command. No alias. Just paste.

farscry setup

$ farscry setup

Configure smart Cmd+V? [y/N]: y

✓ ~/.farscry/smart-paste.sh created

iTerm2 · Warp · Kitty · Gnome Terminal · Windows Terminal

Screenshot → Cmd+V → image detected → sent to your agent.
Text in clipboard falls back to normal paste.

Visual Debug

See exactly what farscry sees.

Same screenshot, bounding boxes drawn. Proves accuracy. Shareable. Zero guessing.

mode 5: annotate Clipboard → annotated image

Every detected element gets a colored bounding box. Affordances get a thicker border.

farscry annotate --from-clipboard -o out.png

button input error heading label

Install

Up in 30 seconds.

npm npm install -g farscry

pip pip install farscry

homebrew brew install teles-forge/tap/farscry

cargo cargo install farscry

curl curl -fsSL https://farscry.dev/install | sh

$ farscry setup

auto-detect MCP clients, wire up in 30 seconds

Models (~12MB English) download to ~/.farscry/models/ on first run. No account needed.

Protocol

The open protocol behind farscry

VASP (Visual Application State Protocol) defines how workflows receive visual context: as typed elements with pixel coordinates, not descriptions. farscry is the reference implementation.

Like MCP standardized tool connectivity, VASP standardizes visual context for workflows. One format, any framework, any workflow.

vasp-protocol.github.io/spec → Local docs →

MCP = how workflows connect to tools
VASP = how workflows understand visual state

farscry setup

$ farscry setup
farscry v0.4.0
Detected: Claude Code, Cursor
── Claude Code ──
Config file: ~/.claude/mcp.json
"mcpServers": {
  "farscry": { "command": "farscry", "args": ["serve", "--mcp"] }
}
farscry never modifies your config files automatically.

Coming soon

farscry cloud

Fleet-level AER visibility for AI Ops teams. Every agent deployment, one dashboard. Zero pixels ever leave your machine.

AER Dashboard

Deployments, not sessions

Track action effect rate per deployment, model, and environment over time. Spot regressions before users notice them.

Threshold Alerts

Your AER floor, your rules

Define the minimum AER you accept. Get notified the moment a deploy crosses the line. PagerDuty, Slack, or webhook.

Sector Baseline

Industry benchmark, opt-in

Compare your agent's AER against similar deployments in the field. Know if you're above or below the sector average.

What farscry cloud receives

metric payload — nothing else

{ "aer": 0.47, "sf_count": 18, "session_id": "abc123", "model": "claude-sonnet-4-5", "timestamp": 1716148440 }
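To make the payload concrete, here is a minimal sketch of how such a metric record could be assembled locally. The page does not define AER precisely, so the formula below (actions with a visible effect divided by total actions) is an assumption, as is the total of 34 actions chosen so that 18 silent failures yield the 0.47 shown above. The function names are illustrative.

```python
import json
import time
from typing import Optional


def action_effect_rate(total_actions: int, silent_failures: int) -> float:
    """Assumed definition: share of actions that produced a visible change."""
    if total_actions <= 0:
        raise ValueError("total_actions must be positive")
    return (total_actions - silent_failures) / total_actions


def build_payload(session_id: str, model: str, total_actions: int,
                  silent_failures: int, timestamp: Optional[int] = None) -> dict:
    """Build a metrics-only record: no screenshots, no pixels, no element text."""
    return {
        "aer": round(action_effect_rate(total_actions, silent_failures), 2),
        "sf_count": silent_failures,
        "session_id": session_id,
        "model": model,
        "timestamp": int(time.time()) if timestamp is None else timestamp,
    }


payload = build_payload("abc123", "claude-sonnet-4-5", 34, 18, timestamp=1716148440)
print(json.dumps(payload))
```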

Privacy model

No screenshots. No pixels. No user data. Everything is processed locally by farscry. Only aggregated metrics travel to the cloud. GDPR-safe and compliance-ready by design — teams in regulated industries can use it on day one.

Join early access → Free during early access · no credit card

Roadmap

What comes next.

v0.5.0 ships farscry augment: silent failure detection inline in every MCP response. farscry cloud is on the horizon.

v0.4.0 shipped

✓ Extract, diff, annotate: full VASP pipeline
✓ MCP server: 38ms warm response
✓ farscry hook: zero-friction terminal recording
✓ VASF sessions: pack + timeline
✓ Zero-copy pHash, 22MB daemon RSS (macOS)
✓ Linux: Docker + X11, 11MB VmRSS
✓ Global daemon: N terminals, one process
✓ npm, pip, Homebrew, crates.io

v0.5.0 current

✓ farscry augment: inline silent failure detection in MCP responses
✓ farscry_mark_action: explicit action marker MCP tool
✓ farscry analyze: AER and VLR metrics across sessions
✓ farscry mark-action: CLI action marker for terminal hook
✓ Headless Linux: Xvfb auto-start, works on any server
✓ farscry diff --json: structured diff output for tooling

v0.6.0 planned

VASP adapters: native Playwright and OpenAI Vision support
farscry install-lang: multilingual OCR via CDN
Per-window capture when minimized (SCContentFilter)
Screen-lock awareness in farscry serve

v0.7.0 — a11y grounding

Queries, not guesses

OS accessibility tree scraped into local SQLite. Agents query elements by role and name. Coordinates are pixel-perfect — not OCR estimates.

farscry_query("SELECT x,y FROM elements
WHERE role='button'
AND name='Save'")
→ (640, 480) in <1ms
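The query above can be tried today against an ordinary SQLite database. The sketch below uses Python's standard sqlite3 module with an in-memory table; the elements schema (role, name, x, y) is inferred from the query and is an assumption, and farscry_query itself is a planned feature that this example does not call.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def build_index(rows: Iterable[Tuple[str, str, int, int]]) -> sqlite3.Connection:
    """Create an in-memory elements table (hypothetical schema) and fill it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE elements (role TEXT, name TEXT, x INTEGER, y INTEGER)")
    conn.executemany("INSERT INTO elements VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def find_element(conn: sqlite3.Connection, role: str, name: str) -> Optional[Tuple[int, int]]:
    """Look up an element's coordinates by role and name; None if absent."""
    row = conn.execute(
        "SELECT x, y FROM elements WHERE role = ? AND name = ?",
        (role, name),  # parameters avoid quoting problems in element names
    ).fetchone()
    return row


conn = build_index([
    ("button", "Save", 640, 480),
    ("button", "Cancel", 520, 480),
    ("textbox", "Card number", 340, 240),
])
print(find_element(conn, "button", "Save"))
```

Querying by role and name gives the agent a deterministic lookup, which is the contrast with OCR-estimated positions that the section draws.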

v0.8.0 — rollback

Automatic recovery, no inference

When farscry detects a silent failure and the action is reversible, it injects a deterministic recovery before returning control to the agent. No extra model call.

SF detected → modal opened
farscry injects Escape via a11y
agent receives clean state

Changelog and full history: github.com/teles-forge/farscry/CHANGELOG.md
