GUI Automation Agent

Vision-language agent that grounds goals into real keyboard/mouse actions across any app.

Why it matters: A template for operating legacy enterprise software without custom APIs. Relevant anywhere a long tail of GUI tools blocks wider automation.

Unity · Python · VLMs

What it does

An AI agent that operates software the way a person does - watching the screen, reasoning about what it sees, and driving the keyboard and mouse to finish the task. No bespoke API integration required.

Where it applies

Legacy enterprise software with no API surface, where automation has stalled for years.
Long-tail tooling in back-office, finance, and ops - the "last mile" that process-automation vendors never reach.
Any environment where procurement for a new integration is slower than just teaching an agent to click the right buttons.

How it works (high level)

A vision-language model reads the current screen at the pixel level and grounds the natural-language goal in what it sees, then decomposes that goal into primitive input actions such as clicks and keystrokes. The agent verifies each step's effect before moving on, which is what stops it from silently drifting off the plan.
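To make the act-then-verify idea concrete, here is a minimal sketch of such a loop. It is an illustration under stated assumptions, not the project's actual code: `FakeScreen`, `Action`, and `run_agent` are hypothetical names, the plan is hard-coded where a real agent would get it from the vision-language model, and the "screen" is a simulated form so the example runs without any GUI or input-automation library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One primitive input action plus the effect it should have (hypothetical schema)."""
    kind: str         # "type" or "click"
    target: str       # label of the UI element, e.g. "name_field"
    text: str = ""    # text to enter, for "type" actions
    expect: str = ""  # substring the next observation must contain


class FakeScreen:
    """Stand-in for a real GUI: one text field and one submit button."""

    def __init__(self) -> None:
        self.field = ""
        self.submitted = False
        self.drop_next_input = False  # simulates a lost click or keystroke

    def observe(self) -> str:
        # A real agent would capture a screenshot here; we return a text summary.
        return f"field={self.field!r} submitted={self.submitted}"

    def execute(self, action: Action) -> None:
        if self.drop_next_input:
            self.drop_next_input = False
            return  # the input was silently lost
        if action.kind == "type":
            self.field = action.text
        elif action.kind == "click" and action.target == "submit":
            self.submitted = True


def run_agent(screen: FakeScreen, plan: list, max_retries: int = 2) -> list:
    """Execute each action, re-observe, and only advance once the effect is visible."""
    log = []
    for action in plan:
        for attempt in range(max_retries + 1):
            screen.execute(action)
            if action.expect in screen.observe():
                log.append((action.kind, action.target, attempt + 1))
                break
        else:
            # Every attempt failed: stop rather than drift off the plan.
            raise RuntimeError(f"step did not take effect: {action}")
    return log


# Usage: the first input is dropped, so the "type" step needs a second attempt.
screen = FakeScreen()
screen.drop_next_input = True
plan = [
    Action("type", "name_field", text="Ada", expect="field='Ada'"),
    Action("click", "submit", expect="submitted=True"),
]
log = run_agent(screen, plan)
```

The point of the design is that the loop never assumes an input landed. Each step carries its own expected effect, and the agent either confirms it on the next observation, retries, or halts with an error instead of continuing on a stale view of the screen.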

Stack

Unity · Python · vision-language models · input-automation APIs.
