Your smartphone knows everything about you, yet most AI assistants still can’t use it like you do. Siri can set a timer. Google Assistant can play music. But ask either to “open Meituan, find nearby hotpot restaurants, and add the top-rated one to my cart” and you’re back to manual tapping.
That frustration ends with Open-AutoGLM — the open-source phone agent model and framework from ZAI Org (the team behind Zhipu AI and GLM) that finally puts true device agency in everyone’s hands.
Released on December 8, 2025, Open-AutoGLM combines a powerful 9B-parameter multimodal vision-language model (AutoGLM-Phone-9B and its Multilingual variant) with a complete automation framework. You speak or type a natural-language goal in English or Chinese, the agent takes screenshots of your actual phone screen, understands the UI like a human, reasons step-by-step, and executes precise actions (tap, swipe, type, launch apps) via ADB, HDC, or WebDriverAgent.
After 32 months of development starting in April 2023, the project hit major milestones: the first stable full-operation chain on a real device (October 25, 2024), the world’s first AI-sent “red packet” via screen navigation (November 2024), and scaled reinforcement learning (MobileRL, ComputerRL, AgentRL) in 2025. Now it’s fully open: models under MIT license, code under Apache-2.0, hosted at https://github.com/zai-org/Open-AutoGLM with 23.6k+ stars and growing fast.
This isn’t another brittle script or simulated environment. It runs on your real phone, supports 50+ Android and 60+ HarmonyOS apps (plus experimental iOS), works remotely over Wi-Fi, and includes built-in safety checks plus human takeover for logins or CAPTCHAs.
In this 1800-word guide we cover everything: what Open-AutoGLM is, its standout features, technical architecture, real benefits, step-by-step getting started, practical use cases, and the exciting future of open AI phones. Whether you’re a developer, business owner, or everyday user tired of repetitive taps, Open-AutoGLM makes advanced AI phone capabilities accessible to everyone.
What is Open-AutoGLM and Why It Matters
Open-AutoGLM is a complete open-source AI phone agent framework built around the AutoGLM-Phone vision-language model family. It turns any compatible smartphone into an autonomous agent that can perceive, plan, and act on real devices using nothing more than natural language instructions.
Unlike traditional automation tools that rely on fragile UI selectors or brittle scripts, Open-AutoGLM uses multimodal AI: it literally “sees” the pixels on your screen, understands context (icons, text, layout), reasons about your goal, and generates the optimal sequence of actions. The agent loops intelligently until the task is complete or needs human help.
ZAI Org open-sourced it for three powerful reasons (straight from their December 2025 blog announcement):
- Prevent a handful of manufacturers from monopolizing “Phone Use” capabilities
- Return privacy and control to users through fully local deployment
- Accelerate the Agent Era by giving the entire ecosystem reusable tools, models, and knowledge
The result? Anyone can now run a state-of-the-art phone agent locally, customize it, integrate it into larger systems, or even train improved versions. No API bills, no data leaks, no vendor lock-in.
Key Features of Open-AutoGLM
- Multimodal Screen Understanding – The 9B AutoGLM-Phone model analyzes screenshots in real time, handling dynamic UIs that break traditional automation.
- Natural Language Control – Works in Chinese and English. Example: “打开小红书搜索美食” or “Open Xiaohongshu and search for food recommendations”.
- Cross-Platform Execution – Full Android (ADB) and HarmonyOS (HDC) support; experimental iOS via WebDriverAgent.
- Rich Action Set – Launch, Tap, Swipe, Type (via ADB Keyboard), Back, Home, Long Press, Double Tap, Wait, and human takeover.
- Safety & Human-in-the-Loop – Automatic confirmation prompts for sensitive operations; seamless handover for CAPTCHAs or logins.
- Remote & Wireless Operation – Control phones over Wi-Fi with a single `adb connect` or `hdc tconn`.
- Flexible Deployment – Run the model locally with vLLM/SGLang (GPU recommended) or use hosted APIs from Zhipu, ModelScope, Novita AI, or Parasail.
- Extensibility – Custom prompts, callback functions, integration with Midscene.js for web/mobile hybrid automation.
- Verbose Reasoning – Watch the AI think out loud with `--verbose` mode.
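To make the action set above concrete, here is a minimal sketch of how each action could map to stock `adb shell input` commands. The command templates are standard ADB syntax; the dispatcher itself (names like `ADB_ACTIONS` and `build_command`) is illustrative and not the framework's real API — in practice text input goes through ADB Keyboard rather than raw `input text`.

```python
# Sketch: mapping the agent's action set onto stock ADB shell commands.
# Templates are standard `adb shell input` syntax; the dispatcher names
# here are hypothetical, not Open-AutoGLM's actual internals.

ADB_ACTIONS = {
    "tap":        "adb shell input tap {x} {y}",
    "swipe":      "adb shell input swipe {x1} {y1} {x2} {y2} {ms}",
    "type":       "adb shell input text {text}",  # real framework uses ADB Keyboard for IME text
    "back":       "adb shell input keyevent KEYCODE_BACK",
    "home":       "adb shell input keyevent KEYCODE_HOME",
    "long_press": "adb shell input swipe {x} {y} {x} {y} {ms}",  # long press = zero-distance swipe
}

def build_command(action: str, **params) -> str:
    """Fill one action template with concrete parameters."""
    return ADB_ACTIONS[action].format(**params)

print(build_command("tap", x=540, y=1200))
# adb shell input tap 540 1200
```

The zero-distance swipe trick for long press is a common ADB idiom: holding at one coordinate for `{ms}` milliseconds registers as a long press.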
How Open-AutoGLM Works: Technical Architecture
The agent follows a clean perceive-plan-act loop:
- User Command → Passed to the model with conversation history.
- Screenshot Capture → Device streams the current screen image.
- Multimodal Reasoning → AutoGLM-Phone-9B processes image + text prompt, outputs structured reasoning and next action (often in `<think>` tags internally).
- Action Execution → Python framework translates the action into ADB/HDC commands (tap coordinates, swipe vectors, keyboard input, etc.).
- Observation & Loop → New screenshot → repeat until success or max steps reached.
- Error Recovery & Human Help → If stuck, asks for confirmation or hands control to you.
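The loop above can be sketched in a few lines of Python. The device and model calls are stubbed out; names like `capture_screen` and `query_model` are illustrative placeholders, not the framework's real API.

```python
# Minimal sketch of the perceive-plan-act loop. All device/model calls
# are stubs; the function names are hypothetical, not Open-AutoGLM's API.

def capture_screen() -> bytes:
    return b"<png bytes>"  # stub: would run e.g. `adb exec-out screencap -p`

def query_model(goal: str, screenshot: bytes, history: list) -> dict:
    # stub: would send image + prompt to AutoGLM-Phone-9B and parse the action
    return {"action": "finish", "reason": "task complete"}

def execute(step: dict) -> None:
    pass  # stub: would translate the action into ADB/HDC commands

def run_agent(goal: str, max_steps: int = 20) -> bool:
    history = []
    for _ in range(max_steps):                    # loop until success or budget
        shot = capture_screen()                   # 1. perceive
        step = query_model(goal, shot, history)   # 2. reason
        if step["action"] == "finish":
            return True                           # 3. done
        execute(step)                             # 3. act, then observe again
        history.append(step)
    return False                                  # budget exhausted: ask a human

print(run_agent("Open Chrome"))
```

The `max_steps` cap is what backs the "max steps reached" exit described above; on a `False` return, a real agent would hand control to the user rather than keep looping.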
Everything is OpenAI-compatible, so swapping the model backend is trivial. The English and Chinese system prompts are fully editable in `phone_agent/config/`.
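Because the backend is OpenAI-compatible, one agent turn (screenshot plus instruction) can be expressed as a standard chat-completions payload. The message shape below follows the common vision format (an `image_url` part carrying a base64 data URI); the model name matches the local vLLM example later in this guide and is an assumption, not fixed by the framework.

```python
# Sketch: one agent turn as an OpenAI-compatible chat-completions payload.
# Message shape is the standard vision format; the model name is assumed
# from the local-serving example, not mandated by Open-AutoGLM.
import base64

def build_payload(instruction: str, screenshot_png: bytes) -> dict:
    b64 = base64.b64encode(screenshot_png).decode()
    return {
        "model": "autoglm-phone-9b-multilingual",
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
        ],
    }

payload = build_payload("Open Chrome", b"\x89PNG...")
print(payload["model"])
```

POSTing this to any `/v1/chat/completions` endpoint (local vLLM or a hosted API) is all a backend swap requires, which is exactly why changing providers needs no framework rewrite.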
Benefits of This Open-Source AI Phone Agent Framework
- True Privacy – Run 100% locally; your screenshots and data never leave your machine unless you choose a cloud backend.
- Zero Ongoing Cost – Download models once, run forever.
- Unmatched Customizability – Tweak prompts for domain-specific apps (banking, healthcare, enterprise tools), add custom actions, or fine-tune the model.
- Community Velocity – Already 23.6k stars, active WeChat group, contribution incentives from ZAI, and easy integration paths.
- Future-Proof – As better vision-language models emerge, just swap the backend—no framework rewrite needed.
- Democratization – Students, indie developers, small businesses, and researchers can now build production-grade mobile agents without Big Tech permissions.
Compared to closed phone agents or script-based tools, Open-AutoGLM handles UI changes gracefully and scales from one-off tasks to complex multi-app workflows.
Getting Started with Open-AutoGLM (Step-by-Step Guide)
Prerequisites
- Python 3.10+
- Android device with USB debugging enabled + ADB Keyboard (or HarmonyOS with HDC)
- Optional: NVIDIA GPU with ≥24 GB VRAM for local inference
Installation (5 minutes)
```shell
git clone https://github.com/zai-org/Open-AutoGLM.git
cd Open-AutoGLM
pip install -r requirements.txt
pip install -e .
```

1. Device Setup
- Enable Developer Options & USB Debugging
- Install ADB Keyboard for Android text input
- Connect via USB or Wi-Fi (`adb connect 192.168.x.x:5555`)
2. Start the Model Service
- Cloud (easiest): Use Zhipu or ModelScope API key
- Local (private):
```shell
python -m vllm.entrypoints.openai.api_server \
  --model zai-org/AutoGLM-Phone-9B-Multilingual \
  --port 8000 \
  [additional flags for vision support]
```

3. Run Your First Task
```shell
python main.py --base-url http://localhost:8000/v1 \
  --model autoglm-phone-9b-multilingual \
  "Open Chrome and search for the latest Grok 4 updates"
```

Interactive mode, Python API, batch scripts, and custom callbacks are all documented in the README.
Real-World Applications and Use Cases
- Daily Life – Order food, book rides, manage shopping lists across apps
- Productivity – Cross-app workflows (e.g., copy invoice from email → paste into accounting app)
- Developer & QA – Automated UI testing on real devices, regression checks
- Business Automation – Customer support bots, sales data entry, attendance workflows
- Accessibility – Voice-driven navigation for users with motor challenges
- Remote Support – Family tech help without sharing screens
Supported apps span social (WeChat, TikTok, Instagram), e-commerce (Taobao, Amazon, eBay), food (Meituan, DoorDash equivalents), navigation, productivity, and more.
The Future of Open AI Phones with Open-AutoGLM
ZAI Org envisions an “AI-Native Phone” era where every user owns their personal Jarvis. With community contributions already pouring in, expect:
- Mature iOS support
- Multi-agent collaboration
- Voice + vision integration
- Fine-tuned domain models
- Deeper Midscene.js and ecosystem integrations
The project is designed for the “Decade of the Agent” — open, collaborative, and user-owned.
FAQ
Q: Is Open-AutoGLM completely free?
A: Yes — models MIT, code Apache-2.0. No usage fees.
Q: What devices does it support?
A: Android 7.0+, HarmonyOS NEXT+, iOS (experimental via WebDriverAgent).
Q: Do I need a powerful computer?
A: Local inference needs a good GPU; cloud APIs work on any laptop.
Q: How secure is it?
A: Fully auditable open-source code + explicit confirmation for sensitive actions.
Q: Can I use it commercially?
A: Yes, permissive licenses allow commercial use and modification.
Q: How do I contribute?
A: Star the repo, submit PRs, join the WeChat group, or apply for ZAI developer incentives.
Conclusion
Open-AutoGLM is more than a technical release — it’s a movement that puts the power of advanced AI phone agents directly into developers’, businesses’, and users’ hands. By open-sourcing both the model and the complete framework, ZAI Org has removed every barrier that previously kept this technology locked away.
The era of truly autonomous, private, and customizable mobile AI is here — and it’s open source.
Ready to unlock AI on your phone?
- Visit the GitHub repository and star it
- Download the models from Hugging Face
- Follow the 5-minute quick start and try your first task today
- Join the community on WeChat or follow updates on X (@Autotyper_Agent)
Whether you build the next killer mobile agent, automate your personal life, or simply contribute a new supported app — your participation shapes the future of AI phones.
The phone is no longer just a device. With Open-AutoGLM, it becomes your intelligent partner.
Start building today. The Agent Era is open.