Browser Agent - Chrome Extension Architecture

A voice- and text-controlled browser automation agent powered by GPT-4o-mini. It is 100% Chrome Extension-based: no Playwright, no Selenium!

🎯 Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Chrome Browser                          │
│  ┌────────────────────────────────────────────────────┐    │
│  │  Chrome Extension (extension/)                     │    │
│  │  • Collects page context (DOM, text, elements)     │    │
│  │  • Executes actions (click, type, scroll, etc.)    │    │
│  │  • Opens/manages tabs                              │    │
│  └────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                          ↕ HTTP (localhost:8765)
┌─────────────────────────────────────────────────────────────┐
│              Python Backend (agent.py)                      │
│  • AI Brain (GPT-4o-mini)                                   │
│  • Plans actions based on page context                      │
│  • Web UI for sending commands                              │
└─────────────────────────────────────────────────────────────┘

🚀 Setup Instructions

1. Install Python Dependencies

pip3 install -r requirements.txt

2. Set OpenAI API Key

export OPENAI_API_KEY='your-key-here'

3. Load the Chrome Extension

Open Chrome and navigate to: chrome://extensions
Enable Developer mode (toggle in top-right corner)
Click "Load unpacked"
Select the extension folder from this project
You should see the "Agent Bridge" extension listed as loaded

4. Start the Python Backend

python3 agent.py

You should see:

🎭 BROWSER AGENT (Chrome Extension)
======================================================================

📋 Setup Instructions:
  1. Open Chrome and go to: chrome://extensions
  2. Enable 'Developer mode' (top right)
  3. Click 'Load unpacked' and select the 'extension' folder
  4. The extension will handle all browser interactions!

📋 How to use:
  • Web UI: http://127.0.0.1:8765
  • Terminal: Type commands here
  • Shortcut: Type '1' for a GUI dialog box

✅ Agent ready! Type 'exit' or press Ctrl+C to quit.
======================================================================

📝 How to Use

Option 1: Web UI

Open http://127.0.0.1:8765 in your browser
Type a command (e.g., "Go to Reddit and click the first post")
Click "Run"

Option 2: Terminal

Type commands directly in the terminal where agent.py is running
Example: Go to YouTube and search for cats

Option 3: GUI Dialog

Type 1 in the terminal
A dialog box will appear
Enter your command and click OK

🎬 Example Commands

💬 Go to reddit.com
💬 Open a new tab and go to youtube.com
💬 Click on the first post
💬 Search for python tutorials
💬 Scroll down and click the login button
💬 Type "hello world" in the search box

🔧 How It Works

The Observe-Act Cycle

1. OBSERVE: Extension sends page context to Python backend
   - URL, title, text content
   - All interactive elements (buttons, links, inputs)
   - Element positions (x, y coordinates)
2. DECIDE: GPT-4o-mini analyzes the context
   - Reads what's on the page
   - Plans the next action
3. ACT: Backend sends action to extension
   - Extension executes the action (click, type, etc.)
   - Action happens in the actual Chrome browser
4. LOOP: Repeat until task is complete
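
The cycle above can be sketched as a small loop. This is an illustration only, not the project's actual agent_runner.py: `get_context`, `decide`, and `send_action` are hypothetical callables standing in for the extension bridge and the GPT-4o-mini call, and the `max_steps` cap is an assumption added so the loop always terminates.

```python
from typing import Callable


def run_task(
    command: str,
    get_context: Callable[[], dict],
    decide: Callable[[str, dict], dict],
    send_action: Callable[[dict], None],
    max_steps: int = 20,
) -> int:
    """Run observe -> decide -> act until the model signals completion.

    Returns the number of actions handed to the extension.
    """
    steps = 0
    for _ in range(max_steps):
        context = get_context()            # OBSERVE: latest page context
        action = decide(command, context)  # DECIDE: the model picks one action
        if action["name"] == "task_complete":
            break                          # LOOP ends once the task is done
        send_action(action)                # ACT: pass the action to the extension
        steps += 1
    return steps
```

Passing the three stages in as callables keeps the loop testable without a browser or an API key.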

Extension Components

manifest.json: Extension configuration

Permissions: tabs, scripting, activeTab
Runs on all URLs

content.js: Runs on every web page

Collects page context every 4 seconds
Executes actions (click, type, scroll, navigate)
Polls backend for actions every 1.2 seconds

service_worker.js: Background script

Relays context to Python backend
Handles tab creation
Polls for actions every 1 second

popup.html/js: Extension popup

Shows connection status
Quick link to open backend UI

Python Backend Components

agent.py: Main entry point

Starts web UI server
Manages command queue
Routes commands to agent_runner

agent_runner.py: AI brain

Sends context + command to GPT-4o-mini
Receives tool calls (actions)
Executes actions via extension_bridge

actions.py: Action functions

click(x, y): Click at coordinates
type_text(text): Type text
scroll(direction, amount): Scroll page
press_key(key): Press keyboard key
navigate_url(url): Navigate current tab
open_tab(url): Open new tab
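
Because the extension performs the actual browser work, each function in actions.py presumably only describes an action and hands it off. A hedged sketch of that idea follows; the payload field names (`type`, `x`, `y`, `text`, `url`) and the module-level `pending` list are assumptions for illustration, not the project's real wire format.

```python
from typing import Any

# Hypothetical stand-in for the bridge's action queue.
pending: list[dict[str, Any]] = []


def _enqueue(action_type: str, **params: Any) -> dict[str, Any]:
    """Build an action payload and append it to the pending list."""
    action = {"type": action_type, **params}
    pending.append(action)
    return action


def click(x: int, y: int) -> dict[str, Any]:
    """Describe a click at page coordinates (x, y)."""
    return _enqueue("click", x=x, y=y)


def type_text(text: str) -> dict[str, Any]:
    """Describe typing text into the focused element."""
    return _enqueue("type_text", text=text)


def navigate_url(url: str) -> dict[str, Any]:
    """Describe navigating the current tab to url."""
    return _enqueue("navigate_url", url=url)
```

For example, `click(120, 340)` returns `{"type": "click", "x": 120, "y": 340}` and leaves the same dictionary in `pending` for the extension to pick up.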

extension_bridge.py: Communication layer

Queue for actions (Python → Extension)
Storage for page context (Extension → Python)
Thread-safe
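
One way to get the three properties listed above is a `queue.Queue` for actions plus a lock-protected slot for the latest context. The class and method names below are hypothetical; the real extension_bridge.py may be organized differently.

```python
import queue
import threading
from typing import Any, Optional


class ExtensionBridge:
    """Hypothetical bridge: an action queue plus the latest page context."""

    def __init__(self) -> None:
        self._actions = queue.Queue()  # queue.Queue is already thread-safe
        self._context: Optional[dict] = None
        self._lock = threading.Lock()  # guards reads/writes of _context

    def push_action(self, action: dict[str, Any]) -> None:
        """Python -> Extension: queue an action for the next poll."""
        self._actions.put(action)

    def next_action(self) -> Optional[dict[str, Any]]:
        """Return the oldest queued action, or None if nothing is waiting."""
        try:
            return self._actions.get_nowait()
        except queue.Empty:
            return None

    def set_context(self, context: dict[str, Any]) -> None:
        """Extension -> Python: store the most recent page context."""
        with self._lock:
            self._context = context

    def get_context(self) -> Optional[dict[str, Any]]:
        """Return the most recent page context, or None if none has arrived."""
        with self._lock:
            return self._context
```

Returning `None` from `next_action` instead of blocking suits a polling client: the HTTP handler can answer immediately when there is nothing to do.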

context_capture.py: Context retrieval

Gets latest page context from extension
Formats it for GPT-4o-mini
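
As a rough idea of what "formatting for GPT-4o-mini" could look like, the sketch below turns a context dictionary into a compact text block. The dictionary keys (`url`, `title`, `elements`, `tag`, `text`, `x`, `y`) are assumed for illustration; only the 80-element cap comes from the Tips section of this README.

```python
from typing import Any

MAX_ELEMENTS = 80  # cap on interactive elements, per the Tips section


def format_context(context: dict[str, Any], max_elements: int = MAX_ELEMENTS) -> str:
    """Render page context as plain text suitable for a model prompt."""
    lines = [
        f"URL: {context.get('url', '')}",
        f"Title: {context.get('title', '')}",
        "Interactive elements:",
    ]
    # Keep only the first max_elements entries to bound prompt size.
    for i, el in enumerate(context.get("elements", [])[:max_elements]):
        lines.append(f"[{i}] <{el['tag']}> {el['text']!r} at ({el['x']}, {el['y']})")
    return "\n".join(lines)
```

Including the coordinates on each line lets the model answer with a `click(x, y)` call directly.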

web_ui.py: HTTP server

Serves web interface
API endpoints for extension communication

🌐 API Endpoints

The Python backend exposes these endpoints:

GET / - Web UI homepage
POST /api/run - Submit a command
POST /api/extension/context - Extension sends page context
GET /api/extension/context - Get latest context
GET /api/extension/next_action - Extension polls for actions
POST /api/extension/action_result - Extension reports results
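
Commands can also be submitted from a script rather than the web UI. The sketch below uses only the standard library; the JSON body shape `{"command": ...}` is an assumption, since this README does not document the request format for POST /api/run.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8765"


def build_run_request(command: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /api/run request."""
    body = json.dumps({"command": command}).encode("utf-8")  # assumed body shape
    return urllib.request.Request(
        f"{base_url}/api/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_command(command: str) -> str:
    """Send the command to a running backend and return the raw response text."""
    with urllib.request.urlopen(build_run_request(command), timeout=10) as resp:
        return resp.read().decode("utf-8")
```

With agent.py running, `submit_command("Go to reddit.com")` would send the request; check web_ui.py for the body format the backend actually expects.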

🎯 Available Actions

The AI can use these tools:

get_screen_state() - See what's on the page
click(x, y) - Click at coordinates
type_text(text) - Type text
scroll(direction, amount) - Scroll page
press_key(key) - Press keyboard key
navigate_url(url) - Navigate to URL
open_tab(url) - Open new tab
task_complete() - Mark task as done
ask_user(question) - Ask for user input
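
Tools like these are typically described to the model as JSON schemas, and each tool call comes back as a name plus a JSON-encoded argument string. The sketch below shows one such schema in the OpenAI function-calling format and a small dispatcher; the description text and the `dispatch` helper are illustrative, not taken from agent_runner.py.

```python
import json
from typing import Any, Callable

# One entry for the `tools` list of a chat completion request.
CLICK_TOOL = {
    "type": "function",
    "function": {
        "name": "click",
        "description": "Click at page coordinates.",  # illustrative wording
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "integer"},
                "y": {"type": "integer"},
            },
            "required": ["x", "y"],
        },
    },
}


def dispatch(name: str, arguments_json: str, handlers: dict[str, Callable[..., Any]]) -> Any:
    """Route one tool call (name + JSON-encoded arguments) to a Python function."""
    if name not in handlers:
        raise ValueError(f"unknown tool: {name}")
    return handlers[name](**json.loads(arguments_json))
```

Rejecting unknown tool names explicitly avoids silently ignoring a call the model invented.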

🔍 Troubleshooting

"No context received"

Make sure the Chrome Extension is loaded
Check that you're on a regular web page (the extension cannot run on chrome:// URLs)
Open the extension popup to check connection status

"Backend offline"

Make sure python3 agent.py is running
Check that port 8765 is not already in use by another process
Try restarting the Python backend
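
A quick way to test the port from Python is to attempt a TCP connection. The helper name `port_in_use` is made up for this example.

```python
import socket


def port_in_use(port: int = 8765, host: str = "127.0.0.1") -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

If this returns True while agent.py is not running, another process is holding the port. If it returns False while agent.py should be running, the backend did not start.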

"Actions not executing"

Check browser console for errors (F12 → Console)
Make sure the extension has the permissions it needs (tabs, scripting, activeTab)
Try reloading the extension

Extension not working after Chrome update

Go to chrome://extensions
Click the reload icon on the "Agent Bridge" extension card
Reload the web page

💡 Tips

The extension works on any website (except chrome:// pages)
Context is sent automatically every 4 seconds
Actions are polled every 1-1.2 seconds
You can have multiple tabs open - the extension works on all of them
The AI sees up to 80 interactive elements per page

🚫 What Was Removed

This version completely removes:

❌ Playwright
❌ Firefox automation
❌ PyAutoGUI for browser control
❌ Any Python-based browser automation

Everything is now handled by the Chrome Extension!

📊 Cost & Performance

Model: GPT-4o-mini
Cost: ~$0.15 per 1M input tokens, ~$0.60 per 1M output tokens
Speed: ~2-3 seconds per action
Token usage: ~500-1000 tokens per observation

🔐 Security

Backend only listens on 127.0.0.1 (localhost)
Extension only communicates with localhost:8765
No external connections except to the OpenAI API, which receives the page context and your command
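
Listening only on localhost comes down to the address the server binds to. The sketch below shows this with the standard library's HTTP server; the `Handler` class and `make_server` function are placeholders, not the code in web_ui.py.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    """Placeholder handler that answers every GET with a short text body."""

    def do_GET(self) -> None:
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8765) -> ThreadingHTTPServer:
    """Create a server bound to the loopback interface only."""
    # Binding to 127.0.0.1 means other machines on the network cannot connect;
    # binding to 0.0.0.0 would accept connections on every interface.
    return ThreadingHTTPServer(("127.0.0.1", port), Handler)
```

Calling `make_server().serve_forever()` would start serving; other programs on the same machine can still reach the port.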

📄 License

Use freely for personal or commercial projects!
