Last Updated: March 15, 2026
What happens when the system you need to interact with does not have an API?
Plenty of business-critical software lives behind a GUI and nothing else. Think legacy ERP systems, internal admin dashboards, government portals with no public endpoints, or desktop applications that were never designed for automation. Your agent can reason, plan, and call tools all day, but if the only way to submit a form is to click a button on a web page, it is stuck.
This is where computer use agents come in. Instead of calling structured APIs, these agents interact with software the same way a human does: they look at the screen, decide where to click or type, execute that action, and then observe what changed. It sounds simple, but it opens up an entirely different class of automation.
Suddenly your agent can fill out forms, navigate multi-step workflows, extract data from dashboards, and test user interfaces, all without a single API integration.
The trade-off is that screen-based interaction is slower, more fragile, and harder to get right than API calls. Pixels shift, layouts change, and a button that was at coordinates (450, 320) yesterday might be at (460, 335) today. Building reliable computer use agents requires understanding the perception-action loop, the tools available, and when this approach actually makes sense versus just building a proper integration.
At the core of every computer use agent is a perception-action loop. It works like this: the agent captures what is currently on the screen, sends that image (or a structured representation of it) to a model, the model decides what action to take, the agent executes that action on the computer, and then the cycle repeats. The agent keeps looping until it determines the task is complete.
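The loop can be sketched in a few lines, with the model call and the input control abstracted as callables (all names here are illustrative):

```python
def perception_action_loop(capture_screen, decide, execute, max_steps=15):
    """Run observe -> decide -> act until the model signals completion."""
    history = []
    for _ in range(max_steps):
        observation = capture_screen()          # e.g. take a screenshot
        action = decide(observation, history)   # model picks the next action
        if action == "done":                    # model judged the task complete
            break
        execute(action)                         # click, type, scroll, ...
        history.append((observation, action))
    return history
```

The `max_steps` cap matters in practice: without it, a confused agent can loop forever on a page it cannot interpret.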
Compare this to the standard agent loop. The structure is the same (observe, think, act), but the observation is visual rather than textual, and the actions are physical UI interactions rather than function calls. This makes the loop inherently noisier. An API returns clean JSON; a screenshot returns millions of pixels that the model has to interpret.
Each iteration is also more expensive. The model needs to process an image on every cycle, which uses significantly more tokens than processing a text-based tool result. A typical screenshot at 1280x800 resolution might cost 1,000-2,000 tokens depending on the model. If your loop runs 15 iterations, that is 15 screenshots plus the growing conversation history. Keep this cost in mind when designing computer use workflows.
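A back-of-the-envelope sketch of how that cost grows when the full history, screenshots included, is resent on every turn (the per-item token counts are illustrative):

```python
def estimate_loop_tokens(iterations, tokens_per_screenshot=1500, tokens_per_turn_text=200):
    """Rough input-token cost when every iteration resends the full history."""
    total = 0
    for i in range(1, iterations + 1):
        # iteration i carries i screenshots plus i turns of text so far
        total += i * (tokens_per_screenshot + tokens_per_turn_text)
    return total
```

With these illustrative defaults, a 15-iteration run comes out to roughly 200,000 input tokens. The growth is quadratic, which is why pruning or summarizing old screenshots out of the conversation history matters.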
The reliability challenge is real too. The model might misidentify a button, click the wrong element, or misread text in the screenshot. Each mistake compounds because the agent is now in a different state than it expected.
Good computer use agents need robust error recovery: the ability to recognize when something went wrong and course-correct, rather than blindly continuing down a wrong path.
There are two fundamentally different ways for a computer use agent to "see" what is on screen. Each comes with distinct trade-offs.
Screenshot-based (pixel) approaches send a raw image of the screen to the model. The model interprets the visual layout, reads text via its vision capabilities, identifies interactive elements, and decides where to click. This is the most general approach because it works on any application: web, desktop, or even terminal UIs. If a human can see it, the model can (in theory) see it too.
DOM-based approaches work specifically with web browsers. Instead of (or in addition to) taking a screenshot, the agent extracts the page's DOM (Document Object Model), the underlying HTML tree structure. This gives the agent structured, machine-readable information: element types, labels, attributes, text content, and whether elements are visible or interactive. The agent can reference elements by their CSS selector or accessibility label rather than by pixel coordinates.
In practice, the best agents combine both. They use DOM parsing for structured information (what elements exist, what their labels are, which ones are clickable) and screenshots for visual context (what the page actually looks like, spatial layout, confirmation that the right page loaded). This hybrid approach gives the model the best of both worlds: structured data for precise element targeting and visual data for situational awareness.
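A sketch of how a hybrid observation might be assembled before sending it to the model (the content-block shape loosely mirrors what vision-capable chat APIs accept; all names here are illustrative):

```python
import base64

def build_hybrid_observation(elements, screenshot_png: bytes):
    """Combine a DOM summary with a screenshot into one model message."""
    # Structured half: numbered list of interactive elements for precise targeting
    dom_summary = "\n".join(
        f"[{i}] <{el['tag']}> {el['label']!r}" for i, el in enumerate(elements)
    )
    # Visual half: the screenshot, base64-encoded for transport
    return [
        {"type": "text", "text": f"Interactive elements:\n{dom_summary}"},
        {"type": "image", "data": base64.b64encode(screenshot_png).decode()},
    ]
```

The model can then answer with an element index (precise) while using the image to confirm it is on the right page (situational awareness).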
One important nuance: DOM parsing does not work for desktop applications. If you need to automate a native Windows app, a macOS dialog, or anything that is not a web browser, you are limited to screenshot-based approaches (or OS-level accessibility APIs, which we will cover later). This is one reason screenshot-based models like Claude Computer Use are significant. They can handle any application, not just web pages.
Anthropic's Claude Computer Use is one of the first production-ready implementations of a computer use agent. It gives Claude the ability to see a screen and control a computer through a defined set of tools. Understanding how it works gives you a practical model for building your own computer use systems.
The API works by defining three tool types that Claude can invoke during a conversation: a computer tool for screenshots, mouse, and keyboard control; a text editor tool for viewing and editing files; and a bash tool for running shell commands.
When you send a message to Claude with these tools enabled, Claude can respond with tool use requests just like regular function calling. The difference is that the tools map to physical computer actions rather than API calls.
Here is what a basic setup looks like:
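A minimal sketch of that setup, assuming the `anthropic` Python SDK. The versioned tool `type` strings, tool names, beta flag, and model name below change between releases, so verify them against the current Anthropic documentation before using them:

```python
# Tool definitions for Claude Computer Use. The `type` values are versioned
# beta identifiers; check the docs for the ones your model supports.
COMPUTER_USE_TOOLS = [
    {
        "type": "computer_20250124",   # screenshot / mouse / keyboard tool
        "name": "computer",
        "display_width_px": 1280,      # must match your real screenshot size
        "display_height_px": 800,
    },
    {"type": "text_editor_20250124", "name": "str_replace_editor"},
    {"type": "bash_20250124", "name": "bash"},
]

def run_step(client, messages):
    """One turn: send the conversation, let Claude request a computer action."""
    return client.beta.messages.create(
        model="claude-sonnet-4-5",          # any computer-use-capable model
        max_tokens=1024,
        tools=COMPUTER_USE_TOOLS,
        betas=["computer-use-2025-01-24"],  # beta flag is also version-dependent
        messages=messages,
    )
```

Your surrounding loop then inspects the response for `tool_use` blocks, performs the requested action on the machine, and appends a `tool_result` (usually a fresh screenshot) to `messages`.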
The key architectural idea here is that Claude treats computer interaction as just another set of tools. It does not need special "computer vision mode." It uses the same function calling loop you already know from Module 6, but the functions happen to control a mouse and keyboard instead of querying a database.
After every action, the agent takes a new screenshot and sends it back to Claude. This is the "observe" step. Claude looks at the updated screen, reasons about whether the action succeeded, and decides what to do next. If a click landed in the wrong place or a page did not load as expected, Claude can see that and adjust.
A few practical things to know about Claude Computer Use. The display resolution you specify in the tool definition matters because Claude uses it to map pixel coordinates: if you say the display is 1280x800 but your actual screenshots are 2560x1600 (Retina), every coordinate will be off by a factor of two. Always match the tool configuration to your actual screenshot resolution, or scale coordinates between the two spaces. Also, running this in a sandboxed virtual machine or container is strongly recommended. You are giving an AI model the ability to click and type on a computer, and a sandboxed environment limits the blast radius if something goes wrong.
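A small sketch of the coordinate conversion for that Retina case (helper names are illustrative):

```python
def to_model_space(x, y, actual=(2560, 1600), declared=(1280, 800)):
    """Downscale real screenshot coordinates to the declared tool resolution."""
    return (round(x * declared[0] / actual[0]), round(y * declared[1] / actual[1]))

def to_screen_space(x, y, actual=(2560, 1600), declared=(1280, 800)):
    """Upscale the model's click coordinates back to the physical display."""
    return (round(x * actual[0] / declared[0]), round(y * actual[1] / declared[1]))
```

Downscaling the screenshots themselves before sending them has the same effect and also cuts the per-iteration token cost.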
For web-specific automation, you often do not need full computer use. Browser automation libraries like Playwright give you programmatic control over a browser with a much more reliable interface than pixel-based clicking. When your target is a web application, Playwright-based agents are usually the better choice.
Playwright lets you launch a browser, navigate to URLs, interact with page elements using CSS selectors, extract content, and take screenshots. Combined with an LLM for decision-making, you get an agent that can navigate complex web workflows without needing to interpret raw pixels.
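A sketch of the extraction and action layer, using Playwright's synchronous API (the helper names and the action format are illustrative, not part of Playwright):

```python
def describe_elements(elements):
    """Render extracted elements as a numbered list the LLM can reference."""
    return "\n".join(
        f"[{i}] <{el['tag']}> {el['label']}" for i, el in enumerate(elements)
    )

def extract_interactive_elements(page):
    """Pull visible, interactive elements from a live Playwright page."""
    elements = []
    for handle in page.query_selector_all("a, button, input, select, textarea"):
        if not handle.is_visible():
            continue
        elements.append({
            "tag": handle.evaluate("el => el.tagName.toLowerCase()"),
            "label": handle.inner_text() or handle.get_attribute("placeholder") or "",
            "handle": handle,
        })
    return elements

def apply_action(elements, action):
    """Execute an LLM decision such as {'op': 'click', 'index': 5}."""
    el = elements[action["index"]]["handle"]
    if action["op"] == "click":
        el.click()
    elif action["op"] == "fill":
        el.fill(action["text"])
```

Each loop iteration extracts the elements, sends the numbered list (plus page text) to the LLM, and applies whichever indexed action comes back.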
This approach is DOM-based. Instead of sending a screenshot, we extract the page's text and interactive elements as structured data. The LLM sees a list of clickable buttons and links with their labels, which is far easier to reason about than a raw image. The token cost per iteration is also much lower.
Notice how the agent uses Playwright's query_selector_all to find elements and fill or click to interact with them. These are reliable, deterministic operations. When the LLM says "click element 5," you know exactly which element that is. Compare this to pixel-based clicking where the model says "click at (450, 320)" and you hope the button is actually there.
The trade-off is that this only works for web pages. And even for web pages, some sites use heavy JavaScript rendering, iframes, or shadow DOM that make extraction tricky. Frameworks like Browser Use (built on Playwright) handle many of these edge cases and add features like automatic element annotation, screenshot fallbacks, and multi-tab management.
Web browsers are not the only thing you might need to automate. Desktop applications, system dialogs, file managers, and native apps all require a different approach. Desktop automation operates at the OS level, interacting with windows, menus, and UI elements through accessibility APIs or direct input simulation.
On macOS, the primary tool is the accessibility framework (via pyobjc or AppleScript). On Windows, you have UI Automation through libraries like pywinauto. On Linux, xdotool and AT-SPI handle the job. For cross-platform work, pyautogui provides a simpler but less reliable option that works by simulating mouse and keyboard input at the OS level.
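For example, a naive pyautogui script, shown here deliberately in its simplest form (the coordinates and field positions are made up):

```python
import time

def fill_login_form(username, password):
    """Coordinate-based form fill via raw input simulation."""
    # Imported inside the function because pyautogui requires a display
    # to be available at import time.
    import pyautogui

    pyautogui.click(450, 320)                     # hope the username field is here
    pyautogui.typewrite(username, interval=0.05)  # type at a fixed rate
    time.sleep(0.5)                               # fixed delay, no verification
    pyautogui.click(450, 380)                     # hope the password field is here
    pyautogui.typewrite(password, interval=0.05)
    pyautogui.press("enter")
    time.sleep(2.0)                               # hope the next screen loaded
```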
This code is fragile. It uses fixed delays (time.sleep), types at specific intervals, and assumes the UI responds exactly as expected. In real desktop automation, you need to combine this kind of input simulation with screen observation to verify each step succeeded.
The more robust pattern looks like this:
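A minimal sketch of that pattern, with the concrete action and the screen-based verification injected as callables (names illustrative):

```python
import time

def act_and_verify(action, verify, retries=3, delay=1.0):
    """Execute an action, then confirm its effect on screen before moving on."""
    for attempt in range(retries):
        action()              # e.g. click a button or type into a field
        time.sleep(delay)     # give the UI a moment to respond
        if verify():          # e.g. screenshot + check for the expected change
            return True       # effect observed; safe to continue the workflow
    return False              # caller can escalate: re-plan, log, or abort
```

Each step becomes observable: instead of assuming the click worked, the agent checks the screen and retries before compounding the error.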
The key difference from web automation is that desktop automation lacks the equivalent of a DOM. There is no clean tree structure to query. You are working with accessibility trees (which vary wildly in quality between applications), pixel matching, or OCR. This is exactly where screenshot-based models like Claude Computer Use shine, because they can handle any visual interface without needing structured element data.
For production desktop automation, consider running agents inside virtual machines or containers with virtual displays. Tools like Xvfb on Linux give you a headless display that the agent can interact with without needing a physical monitor. This makes it possible to run multiple agents in parallel and isolates them from your real desktop environment.
Not every problem needs a computer use agent. In fact, most do not. The decision of whether to use a traditional API integration or a computer use agent is one of the most important architectural choices in this space.
Here is a decision framework:

- A documented API exists and covers your use case: use the API. It is faster, cheaper, and far more reliable.
- No API exists, or the API does not expose the workflow you need: consider a computer use agent.
- The target is a web application: prefer DOM-based automation with Playwright before falling back to pixels.
- The target is a desktop or native application: screenshot-based computer use (or OS-level accessibility APIs) is your main option.
The general rule: use an API when one exists and meets your needs. Fall back to computer use when there is no API, the API does not cover your use case, or the cost of building an API integration exceeds the cost of a less reliable screen-based approach.
A common real-world pattern is using computer use agents for legacy system integration. Many enterprises run critical processes on software built in the 1990s or 2000s that has no API and never will. Replacing the software is a multi-year project. A computer use agent that fills forms and clicks through workflows in the legacy system can bridge the gap while a proper migration happens. It is not pretty, but it works.
Other strong use cases include:

- End-to-end UI testing that exercises an application exactly the way a user would.
- Form filling and data entry on portals that expose no public endpoints.
- Data extraction from dashboards and admin interfaces that offer no export API.
- One-off or low-volume workflows where building a proper integration is not worth the cost.
The weakness of computer use agents is reliability at scale. A human can tolerate a page that loads slightly differently today. An agent that expects a specific button at a specific location might fail. Building robust agents means adding retry logic, visual verification, and fallback strategies for when the UI does not match expectations.