A voice/text-controlled browser automation agent powered by GPT-4o-mini. 100% Chrome Extension based - no Playwright, no Selenium!
┌─────────────────────────────────────────────────────────────┐
│ Chrome Browser │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Chrome Extension (extension/) │ │
│ │ • Collects page context (DOM, text, elements) │ │
│ │ • Executes actions (click, type, scroll, etc.) │ │
│ │ • Opens/manages tabs │ │
│ └────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
↕ HTTP (localhost:8765)
┌─────────────────────────────────────────────────────────────┐
│ Python Backend (agent.py) │
│ • AI Brain (GPT-4o-mini) │
│ • Plans actions based on page context │
│ • Web UI for sending commands │
└─────────────────────────────────────────────────────────────┘
pip3 install -r requirements.txt
export OPENAI_API_KEY='your-key-here'
- Open Chrome and navigate to: chrome://extensions
- Enable Developer mode (toggle in top-right corner)
- Click "Load unpacked"
- Select the extension folder from this project
- You should see "Agent Bridge" extension loaded
python3 agent.py
You should see:
🎭 BROWSER AGENT (Chrome Extension)
======================================================================
📋 Setup Instructions:
1. Open Chrome and go to: chrome://extensions
2. Enable 'Developer mode' (top right)
3. Click 'Load unpacked' and select the 'extension' folder
4. The extension will handle all browser interactions!
📋 How to use:
• Web UI: http://127.0.0.1:8765
• Terminal: Type commands here
• Shortcut: Type '1' for a GUI dialog box
✅ Agent ready! Type 'exit' or press Ctrl+C to quit.
======================================================================
- Open http://127.0.0.1:8765 in your browser
- Type a command (e.g., "Go to Reddit and click the first post")
- Click "Run"
- Type commands directly in the terminal where agent.py is running
- Example: Go to YouTube and search for cats
- Type 1 in the terminal
- A dialog box will appear
- Enter your command and click OK
💬 Go to reddit.com
💬 Open a new tab and go to youtube.com
💬 Click on the first post
💬 Search for python tutorials
💬 Scroll down and click the login button
💬 Type "hello world" in the search box
- OBSERVE: Extension sends page context to Python backend
  - URL, title, text content
  - All interactive elements (buttons, links, inputs)
  - Element positions (x, y coordinates)
- DECIDE: GPT-4o-mini analyzes the context
  - Reads what's on the page
  - Plans the next action
- ACT: Backend sends action to extension
  - Extension executes the action (click, type, etc.)
  - Action happens in the actual Chrome browser
- LOOP: Repeat until task is complete
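The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the code in agent_runner.py: the function name `run_task`, the `max_steps` cap, and the action dictionary shape (`name`, `args`) are all assumptions, and the three callables are stubs standing in for the extension and the model.

```python
from typing import Callable


def run_task(
    command: str,
    observe: Callable[[], dict],
    decide: Callable[[str, dict], dict],
    act: Callable[[dict], None],
    max_steps: int = 20,  # ASSUMED safety cap, not taken from the project
) -> list:
    """Run observe/decide/act until the model reports completion."""
    history = []
    for _ in range(max_steps):
        context = observe()                # OBSERVE: latest page context
        action = decide(command, context)  # DECIDE: model picks the next action
        history.append(action)
        if action["name"] == "task_complete":
            break                          # LOOP ends once the task is done
        act(action)                        # ACT: hand the action to the extension
    return history


# Stubbed run: a scripted "model" and an in-memory "extension".
scripted = iter([
    {"name": "navigate_url", "args": {"url": "https://reddit.com"}},
    {"name": "click", "args": {"x": 120, "y": 340}},
    {"name": "task_complete", "args": {}},
])
executed = []
history = run_task(
    "Go to Reddit and click the first post",
    observe=lambda: {"url": "about:blank", "elements": []},
    decide=lambda command, context: next(scripted),
    act=executed.append,
)
```

Passing the three stages in as callables keeps the loop testable without a browser or an API key.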
manifest.json: Extension configuration
- Permissions: tabs, scripting, activeTab
- Runs on all URLs
content.js: Runs on every web page
- Collects page context every 4 seconds
- Executes actions (click, type, scroll, navigate)
- Polls backend for actions every 1.2 seconds
service_worker.js: Background script
- Relays context to Python backend
- Handles tab creation
- Polls for actions every 1 second
popup.html/js: Extension popup
- Shows connection status
- Quick link to open backend UI
agent.py: Main entry point
- Starts web UI server
- Manages command queue
- Routes commands to agent_runner
agent_runner.py: AI brain
- Sends context + command to GPT-4o-mini
- Receives tool calls (actions)
- Executes actions via extension_bridge
actions.py: Action functions
- click(x, y): Click at coordinates
- type_text(text): Type text
- scroll(direction, amount): Scroll page
- press_key(key): Press keyboard key
- navigate_url(url): Navigate current tab
- open_tab(url): Open new tab
extension_bridge.py: Communication layer
- Queue for actions (Python → Extension)
- Storage for page context (Extension → Python)
- Thread-safe
context_capture.py: Context retrieval
- Gets latest page context from extension
- Formats it for GPT-4o-mini
web_ui.py: HTTP server
- Serves web interface
- API endpoints for extension communication
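As an illustration of what extension_bridge.py is described as doing (an action queue, context storage, thread safety), here is a minimal sketch built on the standard library. The class and method names are hypothetical; the real module may be organized quite differently.

```python
import queue
import threading
from typing import Optional


class ExtensionBridge:
    """Hypothetical sketch: actions flow Python -> extension, context flows back."""

    def __init__(self) -> None:
        self._actions = queue.Queue()   # queue.Queue is already thread-safe
        self._lock = threading.Lock()   # guards the single context slot
        self._context: Optional[dict] = None

    def push_action(self, action: dict) -> None:
        """Called by the agent when the model chooses an action."""
        self._actions.put(action)

    def next_action(self) -> Optional[dict]:
        """Called when the extension polls; returns None if nothing is queued."""
        try:
            return self._actions.get_nowait()
        except queue.Empty:
            return None

    def set_context(self, context: dict) -> None:
        """Called when the extension posts fresh page context."""
        with self._lock:
            self._context = context

    def get_context(self) -> Optional[dict]:
        """Called by the agent before each decision."""
        with self._lock:
            return self._context


bridge = ExtensionBridge()
bridge.set_context({"url": "https://example.com", "title": "Example"})
bridge.push_action({"name": "click", "args": {"x": 10, "y": 20}})
first = bridge.next_action()
second = bridge.next_action()  # queue is now empty
```

A non-blocking `get_nowait()` suits a polling client: the HTTP handler can answer immediately with "no action" instead of holding the request open.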
The Python backend exposes these endpoints:
- GET / - Web UI homepage
- POST /api/run - Submit a command
- POST /api/extension/context - Extension sends page context
- GET /api/extension/context - Get latest context
- GET /api/extension/next_action - Extension polls for actions
- POST /api/extension/action_result - Extension reports results
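To show how a script might talk to these endpoints, the sketch below builds requests with the standard library. The paths and port come from the list above; the JSON body shape (a `command` field) is an assumption, so check web_ui.py for the actual schema before relying on it.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8765"


def build_run_request(command: str) -> urllib.request.Request:
    """Build a POST /api/run request. The 'command' field name is ASSUMED."""
    body = json.dumps({"command": command}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_context_request() -> urllib.request.Request:
    """Build a GET /api/extension/context request."""
    return urllib.request.Request(f"{BASE_URL}/api/extension/context", method="GET")


run_req = build_run_request("Go to YouTube and search for cats")
ctx_req = build_context_request()

# To actually send a request while agent.py is running:
# with urllib.request.urlopen(run_req, timeout=10) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```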
The AI can use these tools:
- get_screen_state() - See what's on the page
- click(x, y) - Click at coordinates
- type_text(text) - Type text
- scroll(direction, amount) - Scroll page
- press_key(key) - Press keyboard key
- navigate_url(url) - Navigate to URL
- open_tab(url) - Open new tab
- task_complete() - Mark task as done
- ask_user(question) - Ask for user input
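For readers unfamiliar with tool calling, the list above could be declared to the model roughly as follows, using the function-tool format of the OpenAI Chat Completions API. The tool names and argument names come from the list; the descriptions' wording, the JSON types, and the `tool` helper are assumptions, and the definitions in agent_runner.py may differ.

```python
def tool(name, description, properties=None, required=None):
    """Hypothetical helper that wraps one tool in the function-tool format."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties or {},
                "required": required or [],
            },
        },
    }


INT = {"type": "integer"}  # ASSUMED argument types throughout
STR = {"type": "string"}

TOOLS = [
    tool("get_screen_state", "See what's on the page"),
    tool("click", "Click at coordinates", {"x": INT, "y": INT}, ["x", "y"]),
    tool("type_text", "Type text", {"text": STR}, ["text"]),
    tool("scroll", "Scroll page", {"direction": STR, "amount": INT},
         ["direction", "amount"]),
    tool("press_key", "Press keyboard key", {"key": STR}, ["key"]),
    tool("navigate_url", "Navigate to URL", {"url": STR}, ["url"]),
    tool("open_tab", "Open new tab", {"url": STR}, ["url"]),
    tool("task_complete", "Mark task as done"),
    tool("ask_user", "Ask for user input", {"question": STR}, ["question"]),
]

tool_names = [t["function"]["name"] for t in TOOLS]
```

A list like `TOOLS` is what would be passed as the `tools` argument of a chat completion request; the model then replies with a tool name and JSON arguments rather than free text.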
- Make sure the Chrome Extension is loaded
- Check that you're on a web page (not chrome:// URLs)
- Open the extension popup to check connection status
- Make sure python3 agent.py is running
- Check that port 8765 is not in use
- Try restarting the Python backend
- Check browser console for errors (F12 → Console)
- Make sure the extension has permissions
- Try reloading the extension
- Go to chrome://extensions
- Click the refresh icon on the "Agent Bridge" extension
- Reload the web page
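One quick way to check whether something is already listening on port 8765 is a connection probe from Python. This is a generic standard-library sketch, not part of the project; the function name `port_in_use` is made up for illustration.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("port 8765 in use:", port_in_use(8765))
```

If this reports True while agent.py is stopped, another process is holding the port. If it reports False while agent.py is supposedly running, the backend did not start.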
- The extension works on any website (except chrome:// pages)
- Context is sent automatically every 4 seconds
- Actions are polled every 1-1.2 seconds
- You can have multiple tabs open - the extension works on all of them
- The AI sees up to 80 interactive elements per page
This version completely removes:
- ❌ Playwright
- ❌ Firefox automation
- ❌ PyAutoGUI for browser control
- ❌ Any Python-based browser automation
Everything is now handled by the Chrome Extension!
- Model: GPT-4o-mini
- Cost: ~$0.15 per 1M input tokens, ~$0.60 per 1M output tokens
- Speed: ~2-3 seconds per action
- Token usage: ~500-1000 tokens per observation
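The figures above translate into a per-task cost with simple arithmetic. The sketch below uses the listed prices; the 10-step task and the 50 output tokens per action are assumptions chosen only to make the example concrete.

```python
INPUT_USD_PER_MILLION = 0.15   # from the pricing listed above
OUTPUT_USD_PER_MILLION = 0.60


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for the given token counts."""
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000


# Hypothetical 10-step task: 1000 input tokens per observation (upper end of
# the range above) plus an ASSUMED 50 output tokens per action.
task_cost = estimate_cost(input_tokens=10 * 1000, output_tokens=10 * 50)
print(f"${task_cost:.4f}")
```

Under these assumptions a task costs about $0.0018, so input tokens from page observations dominate the bill.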
- Backend only listens on 127.0.0.1 (localhost)
- Extension only communicates with localhost:8765
- No external connections except OpenAI API
Use freely for personal or commercial projects!