This project implements an autonomous agent capable of navigating and interacting with Android applications to accomplish a predefined "mission." The agent uses a combination of computer vision and UI analysis to understand the device's screen and decide on the next course of action.
The core of the agent is powered by a large language model (specifically, gpt-4o), which receives the current screen's screenshot, the UI's XML layout, the overall mission, and a history of past actions. Based on this context, the model determines the next action to take.
- Python 3: The primary programming language.
- OpenAI API (
gpt-4o): The "brain" of the agent, responsible for decision-making. - Pure Python ADB (
ppadb): For communication with the Android device via the Android Debug Bridge (ADB). - Pillow: For handling and processing screenshots.
- python-dotenv: For managing environment variables, particularly the OpenAI API key.
- fpdf2: For generating PDF reports.
The project's logic is consolidated in agent.py, which is structured into four main logical components, separated by comments:
- Vision Module: Responsible for analyzing the device's screen. It encodes the screenshot, prepares the prompt for the language model, and sends the request to the OpenAI API.
- Interaction Module: Handles all direct communication with the Android device. This includes taking screenshots, retrieving the UI XML layout, and executing actions.
- Reporting Module: Generates a PDF report of the agent's run, including the mission, steps taken, screenshots, and the agent's reasoning.
- Agent Core: The central orchestrator. It manages the main loop of the agent, which consists of perceiving the environment, thinking about the next action, and acting on the environment.
The agent has a memory mechanism to avoid getting stuck in loops. It keeps a history of the actions it has taken on each screen it has visited. When it revisits a screen, it is reminded of its previous actions on that screen, which helps it to make a better decision and avoid repeating failed attempts.
The agent can perform the following actions:
- TAP: Taps on a specific element on the screen, identified by its bounds in the UI XML.
- GO_BACK: Executes the system's "back" command, which is equivalent to pressing the back button.
- SCROLL: Scrolls the screen down or up to discover more content.
At the end of each run, the agent generates a PDF report that summarizes the mission. The report includes:
- The mission statement.
- A step-by-step log of the agent's actions.
- The screenshot for each step.
- The agent's reflection and thought process for each step.
- The final status of the mission (successful or failed).
The reports are saved in the reports directory.
To run the agent, follow these steps:
-
Prerequisites:
- Python 3.
- An Android device with USB debugging enabled.
- ADB (Android Debug Bridge) installed and running on your computer.
-
Setup:
- Clone the repository.
- Create and activate a Python virtual environment:
# For macOS/Linux python3 -m venv venv source venv/bin/activate # For Windows # python -m venv venv # .\venv\Scripts\activate
- Install the required Python packages:
pip install -r requirements.txt
- Create a
.envfile in the root of the project and add your OpenAI API key:OPENAI_API_KEY="your-api-key-here"
-
Running the Agent:
- Connect your Android device to your computer.
- Ensure the ADB server is running and your device is recognized (
adb devices). - Execute the
agent.pyscript with a mission passed as an argument:python agent.py "Your mission description here" - You can also optionally specify the maximum number of steps the agent can take:
python agent.py "Your mission description here" --max-steps 20
The agent will then start running, attempting to complete the mission provided via the command line.
- Modular Design: The code is organized into logical modules within
agent.py. Future development should maintain this separation of concerns. - Clear Naming: Functions and variables are named descriptively to enhance readability.
- Environment Variables: Sensitive information, such as API keys, should be stored in a
.envfile and not hardcoded in the source. - Error Handling: The code includes basic error handling for common issues like missing API keys or device connection failures.
- Extensibility: The agent's "mission" is configurable via a command-line argument.