Generate governed agent tools and serve the right ones through a token-optimized MCP router.
Tool Forge turns integration intent into reviewed, sandbox-validated Python tool bundles, publishes them into a governed catalog, and exposes that catalog through an MCP-compatible Router that lets agents discover only the tools they need.
Tool Forge now includes a serious OSS self-hosting baseline: CI, release hygiene checks, Docker Compose, health/readiness endpoints, queued workers, sandbox validation, and governance hooks. Generated tools are still untrusted code until reviewed, sandboxed, and approved for your environment.
LLMs are good at writing code. Production agent systems need more than code: they need governed tools, runtime proof, and a way to load the right tools without flooding the model context.
Tool Forge closes the gap between "the model wrote a script" and "this is a cataloged operational tool an agent can discover, authorize, route to, and invoke."
| Without Tool Forge | With Tool Forge |
|---|---|
| Ad hoc generated scripts | Structured tool bundles |
| Manual SDK debugging | Docs-grounded API generation |
| No runtime proof | Sandbox validation and repair |
| Scattered metadata | Canonical tool_card.json |
| Hard to publish safely | Local catalog plus GitHub PR flow |
| Tool-schema token bloat | Intent-scoped MCP routing |
The intended lifecycle is simple: generate and validate tools, publish them into a catalog, then let agents load the right tools through Tool Forge Router.
Intent -> Enhance -> Generate -> Review -> Sandbox validate -> Publish -> Router index -> MCP resolve_tools -> Invoke
Tool Forge runs on Mac, Windows, and Linux. Developers can use cloud models through Gemini/OpenAI, or run Tool Forge against local/OSS small language models through an OpenAI-compatible endpoint such as llama.cpp, LM Studio, Ollama, LocalAI, vLLM, or SGLang.
Core capabilities:
- Intent-to-tool generation for Python tool bundles.
- Cloud model mode for higher-capability Gemini/OpenAI-backed generation.
- Local/OSS model mode for SLM experimentation without Gemini/OpenAI keys.
- Deterministic scaffolding, review, dependency policy, and sandbox validation.
- Versioned tool catalog, GitHub publishing flow, and optional Postgres system of record.
- Tool Forge Router: token-optimized MCP serving over validated catalog tools.
- Governed third-party MCP import into normalized Tool Forge cards.
Fresh clone to running app:
git clone https://github.com/nextmoca/tool-forge.git
cd tool-forge
cp .env.example .env
# Add GEMINI_API_KEY and OPENAI_API_KEY to .env.
./toolforge doctor
./toolforge setup
./toolforge start --clean --sync
Open the app:
http://localhost:5173
Stop it when you are done:
./toolforge stop
./toolforge start --clean --sync clears only local webapp catalog state,
syncs from the repo-level toolcatalog/, and starts the backend and frontend.
To run the SaaS-style database system of record locally, point Tool Forge at
Postgres and start in Postgres mode. Docker is optional here; this command runs
only a plain postgres:16 database container, not Tool Forge itself:
docker run --name toolforge-postgres \
-e POSTGRES_USER=toolforge \
-e POSTGRES_PASSWORD=toolforge \
-e POSTGRES_DB=toolforge \
-p 127.0.0.1:5432:5432 \
-v toolforge-postgres-data:/var/lib/postgresql/data \
-d postgres:16
Then configure and start Tool Forge against that database:
./toolforge postgres --database-url postgresql+psycopg://toolforge:toolforge@localhost:5432/toolforge
./toolforge start --mode postgres --clean --sync
Stop the app and the optional local database separately:
./toolforge stop
docker stop toolforge-postgres
Start the same Postgres container again later with:
docker start toolforge-postgres
Every generated tool is written as a full operational bundle:
tool_name/
  README.md
  cli.py
  cmd.txt
  sandbox_result.json
  technical_review.json
  technical_review_summary.md
  test_harness.py
  tests/
  tool/
    __init__.py
    tool_name.py
    tool_card.json
    tool_requirements.txt
The important pieces:
| Artifact | Purpose |
|---|---|
| tool/tool_name.py | Executable Python implementation |
| cli.py | Command-line entry point |
| tool/tool_card.json | Canonical tool schema, auth metadata, dependencies, category, invocation type |
| tool/tool_requirements.txt | Runtime Python dependencies |
| tests/ | Generated unit tests, with external calls mocked |
| test_harness.py | Local validation harness |
| sandbox_result.json | Final sandbox execution result |
| technical_review_summary.md | Human-readable review gate output |
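As a quick illustration of the layout above, the following sketch checks a bundle directory for the listed artifacts. The function names are hypothetical and not part of Tool Forge, and the nesting under tool/ is assumed from the artifact table.

```python
from pathlib import Path


def expected_bundle_paths(tool_name: str) -> list[str]:
    """Relative paths from the bundle listing; tool_name replaces the placeholder."""
    return [
        "README.md",
        "cli.py",
        "cmd.txt",
        "sandbox_result.json",
        "technical_review.json",
        "technical_review_summary.md",
        "test_harness.py",
        "tests",
        "tool/__init__.py",
        f"tool/{tool_name}.py",
        "tool/tool_card.json",
        "tool/tool_requirements.txt",
    ]


def missing_bundle_paths(bundle_dir: Path, tool_name: str) -> list[str]:
    """Return the expected paths that do not exist under bundle_dir."""
    return [
        rel
        for rel in expected_bundle_paths(tool_name)
        if not (bundle_dir / rel).exists()
    ]
```

An empty result means every listed artifact is present; it says nothing about whether the contents are valid.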
For agent runtimes, Tool Forge can also serve the catalog through Tool Forge Router, an MCP-compatible routing layer that exposes compact discovery first and resolves full tool schemas only for the intent at hand. This keeps agents from loading every tool schema into context while preserving governance, validation state, and catalog metadata.
Router artifacts and behavior include:
| Router Capability | Purpose |
|---|---|
| Compact discovery | Advertise small tool summaries instead of full schemas |
| resolve_tools | Select the best validated tools for a task intent |
| Tool sessions | Keep a short-lived resolved tool set for agent execution |
| Profiles | Scope routing by runtime, tenant, catalog, or agent use case |
| Third-party MCP import | Normalize external MCP tools into governed Tool Forge cards |
Tool Forge has two connected loops: the generation loop that creates validated tools, and the routing loop that helps agents load the right tools from the catalog.
flowchart LR
A["Intent"] --> B["Intent Enhancer"]
B --> C["Capability Card"]
C --> D["Docs-Grounded API Contract"]
D --> E["Tool Bundle Generation"]
E --> F["SDK/API Correctness Pass"]
F --> G["Technical Review"]
G --> H{"Blockers?"}
H -->|Yes| I["Scoped Repair"]
I --> G
H -->|No| J["Sandbox Validation"]
J --> K{"Runtime Passes?"}
K -->|No| L["Runtime Repair"]
L --> J
K -->|Yes| M["Review Artifacts"]
M --> N["Publish Version"]
N --> O["Open GitHub PR"]
Tool Forge combines deterministic checks, model-assisted generation, and runtime validation:
- Enhances rough intent into a structured generation brief.
- Synthesizes a capability card from the intent.
- Retrieves official API documentation when URLs are provided.
- Generates implementation, CLI, README, unit tests, harness, and metadata.
- Runs SDK/API correctness review against the requested SDKs and models.
- Runs a deterministic technical review before sandbox execution.
- Optionally runs an LLM architect review when enabled by system config.
- Uses a scoped repair prompt for review blockers when possible.
- Executes the generated tool in an isolated local environment.
- Writes final artifacts only after validation completes.
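The two repair loops in the flowchart can be sketched as plain control flow. This is a simplified illustration, not Tool Forge's implementation: the four callables and the max_rounds default are assumptions standing in for the real review, repair, and sandbox stages.

```python
from typing import Callable


def run_gates(
    review: Callable[[], list[str]],
    repair: Callable[[list[str]], None],
    sandbox: Callable[[], bool],
    runtime_repair: Callable[[], None],
    max_rounds: int = 3,  # assumed bound; the real limits are configuration
) -> str:
    """Review until no blockers remain, then sandbox until the runtime passes."""
    for _ in range(max_rounds):
        blockers = review()
        if not blockers:
            break
        repair(blockers)  # scoped repair, then review again
    else:
        # Every review round still reported blockers.
        return "review_failed"

    for _ in range(max_rounds):
        if sandbox():
            return "validated"  # only now would final artifacts be written
        runtime_repair()  # runtime repair, then sandbox again
    return "sandbox_failed"
```

Keeping the review gate ahead of the sandbox means code with known blockers is never executed.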
flowchart LR
A["Validated Tool Catalog"] --> B["Router Index"]
C["Imported Third-Party MCP Cards"] --> B
D["Agent Intent + Profile"] --> E["resolve_tools"]
B --> E
E --> F["Compact Candidate Set"]
F --> G["Resolved Tool Session"]
G --> H["MCP Client Invokes Selected Tools"]
Tool Forge Router is designed for the place where agents meet Tool Forge. A standard MCP client can connect to the Router, ask for relevant tools, and then work with a narrow session instead of carrying the entire catalog in-context. That makes large tool catalogs practical without turning every prompt into a tool-schema dump.
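To make intent-scoped loading concrete, here is a toy sketch that ranks compact tool summaries by word overlap with the intent and returns only the top matches. The scoring rule, the ToolSummary type, and the max_tools default are assumptions for illustration; the Router's actual ranking is not described here and is likely more sophisticated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSummary:
    name: str
    summary: str
    validated: bool


def resolve_tools(intent: str, catalog: list[ToolSummary], max_tools: int = 3) -> list[str]:
    """Rank validated tools by word overlap with the intent and keep the top few."""
    intent_words = set(intent.lower().split())
    scored = []
    for tool in catalog:
        if not tool.validated:
            continue  # only validated catalog tools are eligible
        overlap = len(intent_words & set(tool.summary.lower().split()))
        if overlap:
            scored.append((overlap, tool.name))
    # Highest overlap first; ties broken by name for deterministic output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:max_tools]]


catalog = [
    ToolSummary("send_slack_file", "upload a local file to a slack channel", True),
    ToolSummary("save_pdf", "convert markdown into a styled pdf", True),
    ToolSummary("draft_tool", "upload a file to slack", False),
]
print(resolve_tools("upload this file to slack", catalog))  # ['send_slack_file']
```

Only the names in the returned list would need their full schemas loaded into the agent context; the rest of the catalog stays out of the prompt.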
Tool Forge has two model modes:
| Mode | Provider | Best For |
|---|---|---|
| cloud | Gemini/OpenAI or configured cloud providers | Higher-quality full-bundle generation, API-heavy tools, richer repair loops |
| local | OpenAI-compatible local/OSS endpoint | SLM experimentation, offline/private development, lower-cost simple and medium tools |
Generation recipes decide how much of the bundle the model owns:
| Recipe | What the Model Writes | What Tool Forge Scaffolds | Architect Review |
|---|---|---|---|
| tiny | Main tool.py only | CLI, tests, README, metadata, requirements, manifests | Skipped by design |
| full_bundle | Full artifact bundle | Deterministic validation and normalization still apply | Eligible, but only runs when enabled by env/config |
| auto | Cloud resolves to full_bundle; local resolves to tiny | Depends on resolved recipe | Depends on resolved recipe and review config |
Architect review is not controlled by the recipe alone. full_bundle and
cloud auto are eligible, but TOOLFORGE_ARCHITECT_REVIEW=off skips it.
tiny skips architect review even if the environment variable is enabled.
Local mode always resolves to tiny so small local models can focus on the
highest-value task: writing the implementation while Tool Forge supplies the
repeatable operational wrapper.
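The recipe rules above reduce to a small decision function. This sketch only restates the documented behavior; the function names are hypothetical and the real resolution logic may consider more inputs.

```python
MODEL_MODES = {"cloud", "local"}
RECIPES = {"auto", "tiny", "full_bundle"}


def resolve_recipe(model_mode: str, recipe: str = "auto") -> str:
    """Resolve the effective generation recipe for a model mode."""
    if model_mode not in MODEL_MODES or recipe not in RECIPES:
        raise ValueError(f"unsupported combination: {model_mode!r}, {recipe!r}")
    if model_mode == "local":
        return "tiny"  # local mode always resolves to tiny
    if recipe == "auto":
        return "full_bundle"  # cloud auto resolves to full_bundle
    return recipe


def architect_review_runs(model_mode: str, recipe: str, architect_review: str = "off") -> bool:
    """Architect review needs both an eligible recipe and a non-off setting."""
    resolved = resolve_recipe(model_mode, recipe)
    return resolved == "full_bundle" and architect_review != "off"
```

For example, architect_review_runs("cloud", "tiny", "strict") is False because tiny skips architect review regardless of the environment variable.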
The defaults are configurable in .env:
| Stage | Default |
|---|---|
| Intent enhancer | gemini-3.1-pro-preview |
| API contract generation | gemini-3.1-pro-preview |
| Tool bundle generation | gemini-3.1-pro-preview |
| Capability synthesis | gpt-5.4 |
| SDK/API correctness pass | gpt-5.4 |
| Architect review, optional | gpt-5.4 |
| Technical review repair | gpt-5.4 |
| Runtime repair loop | gpt-5.4 |
Generated tools may use other models when the user explicitly requests them,
for example an image tool that calls gemini-3-pro-image-preview.
Describe the tool you want. Good intents include:
- Tool name.
- Inputs and CLI flags.
- Required environment variables.
- Source-of-truth API docs.
- Success JSON shape.
- Error cases.
- Live sandbox validation values.
- Unit test constraints.
Example:
Generate a Python tool named send_slack_file that uploads one local file to a
Slack channel using Slack's official Web API documentation.
Use Slack documentation as the source of truth:
https://api.slack.com/web
https://api.slack.com/methods/files.uploadV2
Inputs:
- channel_id: Slack channel ID
- file_path: path to the local file
- message: optional initial comment
The generated CLI should allow:
- --channel-id
- --file-path
- --message
For live sandbox validation only, use:
- channel_id: <YOUR_TEST_CHANNEL_ID>
- file_path: <PATH_TO_SMALL_TEST_FILE>
- message: Test file upload from generated tool
Authenticate using SLACK_BOT_TOKEN. Do not hardcode credentials.
Return JSON with status, channel_id, file_id, file_name, file_path, message,
and upload_response_metadata.
Unit tests must mock all network/API requests.
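The last constraint in the example intent, mocking all network requests, might look like the following in a generated test. This is a generic sketch using only the standard library: fetch_json and the URL are hypothetical stand-ins, not the code Tool Forge generates and not Slack's API.

```python
import json
import urllib.request
from unittest import mock


def fetch_json(url: str) -> dict:
    """Hypothetical helper that performs one real HTTP GET when not mocked."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def test_fetch_json_without_network() -> None:
    # Build a fake response object that also works as a context manager.
    fake_response = mock.MagicMock()
    fake_response.read.return_value = b'{"status": "ok"}'
    fake_response.__enter__.return_value = fake_response

    # Replace urlopen for the duration of the call so no socket is opened.
    with mock.patch("urllib.request.urlopen", return_value=fake_response) as fake_open:
        result = fetch_json("https://example.invalid/api")

    assert result == {"status": "ok"}
    fake_open.assert_called_once_with("https://example.invalid/api")


test_fetch_json_without_network()
```

Because the patch is scoped to the with block, the test stays deterministic and needs no credentials.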
Tool Forge runs the generated tool end to end:
- Installs declared dependencies.
- Runs generated tests.
- Runs the CLI help path.
- Runs live sandbox validation when credentials and sample values are available.
- Captures stdout, stderr, return code, token usage, and repair loop state.
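Capturing stdout, stderr, and the return code of one validation step can be sketched with the standard library. The record shape and the timeout default below are assumptions, not the format of sandbox_result.json.

```python
import subprocess
import sys


def run_and_capture(command: list[str], timeout_seconds: float = 60.0) -> dict:
    """Run one command and record its output streams and exit status."""
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of inheriting them
        text=True,            # decode bytes to str
        timeout=timeout_seconds,
        check=False,          # a failing tool is a result to record, not an exception
    )
    return {
        "command": command,
        "return_code": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


result = run_and_capture([sys.executable, "-c", "print('ok')"])
print(result["return_code"], result["stdout"].strip())
```

Using check=False keeps a non-zero exit as data, which is what a repair loop needs in order to react to the failure.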
Inspect the generated bundle before publishing:
- Read and edit generated files.
- Run CLI and tests from the UI.
- Review technical findings and sandbox output.
- Save artifact edits.
Publishing marks a validated version as ready in the local Tool Forge catalog. It does not merge anything into GitHub by itself.
After publishing, Tool Forge can open a repo PR:
- Fetches the latest base branch.
- Creates a temporary worktree.
- Copies the published bundle into toolcatalog/<tool_name>/.
- Commits the catalog change.
- Pushes a toolforge/publish-... branch.
- Opens a draft PR with gh pr create.
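The PR steps above map onto ordinary git and GitHub CLI commands. The sketch below only builds the argument lists and runs nothing; the exact commands, commit message, and PR title Tool Forge uses are not documented here, so treat every value as an assumption. The branch name is passed in by the caller rather than guessed.

```python
def publish_pr_commands(
    tool_name: str,
    branch: str,
    worktree: str,
    base: str = "main",
) -> list[list[str]]:
    """Build an illustrative command sequence for opening a catalog PR."""
    title = f"Publish {tool_name}"
    return [
        # Fetch the latest base branch.
        ["git", "fetch", "origin", base],
        # Create a temporary worktree on a new branch cut from the base.
        ["git", "worktree", "add", "-b", branch, worktree, f"origin/{base}"],
        # The bundle would be copied into <worktree>/toolcatalog/<tool_name>/ here.
        ["git", "-C", worktree, "add", f"toolcatalog/{tool_name}"],
        ["git", "-C", worktree, "commit", "-m", title],
        ["git", "-C", worktree, "push", "-u", "origin", branch],
        # gh would be run with the worktree as its working directory.
        ["gh", "pr", "create", "--draft", "--base", base, "--head", branch,
         "--title", title, "--body", f"Adds toolcatalog/{tool_name}/"],
    ]
```

Working in a separate worktree keeps the publish commit isolated from whatever is checked out in the main clone.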
Tool Forge has two catalog layers:
| Layer | Location | Purpose |
|---|---|---|
| Local webapp registry | webapp/backend/toolcatalog/ | Local generated runs, draft versions, published local versions |
| Repo catalog | toolcatalog/ | Versioned source of truth committed through GitHub PRs |
Use a clean sync when you want the UI to reflect GitHub:
git fetch --prune origin main
./toolforge start --clean --sync
Tools removed from the repo catalog are marked removed locally and hidden from active catalog views after sync.
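The effect of a sync on the two catalog layers can be pictured as a set comparison between tool directory names. This is a simplified sketch: the helper names and the returned plan shape are hypothetical, and the real sync also tracks versions and state.

```python
from pathlib import Path


def catalog_tool_names(catalog_dir: Path) -> set[str]:
    """List tool directory names directly under a catalog directory."""
    if not catalog_dir.is_dir():
        return set()
    return {entry.name for entry in catalog_dir.iterdir() if entry.is_dir()}


def plan_sync(repo_names: set[str], local_names: set[str]) -> dict[str, list[str]]:
    """Compare the repo catalog (source of truth) with the local registry."""
    return {
        "add": sorted(repo_names - local_names),
        "keep": sorted(repo_names & local_names),
        # Present locally but gone from the repo: mark removed and hide.
        "mark_removed": sorted(local_names - repo_names),
    }
```

For instance, plan_sync({"save_pdf"}, {"save_pdf", "old_tool"}) places old_tool under mark_removed.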
Tool Forge uses GitHub CLI to open catalog PRs.
Install and authenticate once:
brew install gh
gh auth login
gh auth status
You do not need to run gh auth login every time. Run it again only if
gh auth status says the session is missing or expired.
The root ./toolforge helper is the preferred local workflow:
| Command | Purpose |
|---|---|
| ./toolforge doctor | Check Python, Node, npm, Git, GitHub CLI auth, env files, dependencies, catalog state, and dev ports |
| ./toolforge setup | Create local config, backend virtualenv, frontend dependencies, and fetch origin/main |
| ./toolforge start --clean --sync | Start from clean local catalog state and sync from the repo catalog |
| ./toolforge start --model-mode cloud --clean --sync | Start with cloud Gemini/OpenAI providers even if .env currently contains a local LLM preset |
| ./toolforge postgres | Configure the Postgres system of record, run migrations, and import the local catalog |
| ./toolforge start --mode postgres --clean --sync | Start the app with the database-backed system of record enabled |
| ./toolforge stop | Stop backend and frontend dev servers |
| ./toolforge reset | Clear local webapp catalog state without touching repo-level toolcatalog/ |
| ./toolforge benchmark | Run deterministic offline benchmarks and print precision/recall/F1 |
| ./toolforge benchmark-e2e --suite benchmarks/e2e/l2_realistic_50.json --limit 5 | Generate and score credible local-only benchmark tools end to end |
| ./toolforge benchmark-router | Benchmark Tool Forge Router selection precision/recall/F1 and token savings |
| ./toolforge benchmark-status --run-id <run-id> | Check a background E2E benchmark run and see whether all cases passed |
| ./toolforge worker run | Run the local OSS worker queue |
| ./toolforge router index | Build the token-optimized Tool Forge Router index over validated catalog tools |
| ./toolforge router serve --transport stdio | Serve Tool Forge Router as a standard MCP stdio server |
| ./toolforge mcp import <name> --tools-file tools.json | Import third-party MCP tools into governed Tool Forge cards |
| ./toolforge sandbox --bundle-dir <bundle> -- python test_harness.py | Run a generated bundle in a Docker sandbox |
| ./toolforge models preset cloud | Persist cloud Gemini/OpenAI mode back into local env files |
| ./toolforge models preset llama_cpp | Configure local/OSS model mode for llama.cpp |
| ./toolforge models test | Test the configured local/OpenAI-compatible model provider |
| toolforge-release-gate --json | Run OSS release hygiene and secret-pattern checks |
Self-hosting baseline:
cp .env.production.example .env.production
docker compose --env-file .env.production run --rm --build backend python -m alembic upgrade head
docker compose --env-file .env.production up --build
curl http://localhost:8000/ready
See docs/self-hosting.md for auth headers, approvals, workers, health/readiness endpoints, sandbox controls, and deployment notes.
See docs/mcp-tool-router.md for Tool Forge
Router, including compact discovery, resolve_tools, tool sessions,
profiles, third-party MCP import, and governed MCP exposure.
CLI generation is also available:
tool-generator \
--intent "Generate a Python tool that converts markdown into a styled PDF" \
--tool-name save_pdf \
--output-dir ./toolcatalog
Or from a capability file:
tool-generator \
--capability-file examples/capability.save_pdf.json \
--output-dir ./toolcatalog
Lite benchmark:
./toolforge benchmark
./toolforge benchmark --markdown benchmark-report.md
./toolforge benchmark --case slack_file_upload \
--bundle-dir webapp/backend/toolcatalog/send_file_to_slack
Standard 100-intent benchmark:
./toolforge benchmark --suite benchmarks/standard/toolforge_100.json --list-cases
./toolforge benchmark --suite benchmarks/standard/toolforge_100.json
./toolforge benchmark --suite benchmarks/standard/toolforge_100.json \
--bundles-root webapp/backend/toolcatalog
Both suites are deterministic and do not call live APIs. The lite suite includes
golden fixtures for runner self-tests. The standard suite contains 100 offline
intent specs and skips cases until you point it at generated bundles with
--bundles-root.
Router benchmark:
./toolforge benchmark-router
./toolforge benchmark-router --json
./toolforge benchmark-router --catalog-size 500 --max-tools 3
./toolforge benchmark-router --suite benchmarks/router/l2_realistic_50.json
./toolforge benchmark-router --suite benchmarks/router/l3_adversarial_25.json
./toolforge benchmark-router \
--markdown benchmark-router-report.md \
--fail-under-f1 0.85 \
--fail-under-token-reduction 0.95
The Router benchmark is also offline. It builds a synthetic governed catalog, asks Tool Forge Router to resolve task intents, scores selected tools against expected tools, and reports how many estimated tokens are avoided compared with loading every full tool schema into agent context.
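The reported metrics follow the standard definitions of precision, recall, and F1, plus a token-reduction ratio. The sketch below shows that arithmetic for a single intent; the function names are hypothetical, and how the benchmark aggregates scores across intents or estimates tokens is not specified here.

```python
def selection_scores(selected: set[str], expected: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 of the selected tools against the expected tools."""
    true_positives = len(selected & expected)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def token_reduction(full_catalog_tokens: int, routed_tokens: int) -> float:
    """Fraction of tokens avoided versus loading every full tool schema."""
    if full_catalog_tokens <= 0:
        raise ValueError("full_catalog_tokens must be positive")
    return 1.0 - routed_tokens / full_catalog_tokens
```

Under these definitions, a run that loads 500 estimated tokens instead of 10,000 has a token reduction of 0.95, which would just meet the --fail-under-token-reduction 0.95 threshold shown above if the check is inclusive.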
The Router benchmark tiers are:
| Tier | Suite | Purpose |
|---|---|---|
| Lite | benchmarks/router/toolforge_router_lite.json | Fast regression and token-bloat demo with 8 intents and 250 tools. |
| L2 | benchmarks/router/l2_realistic_50.json | Realistic single-tool and multi-tool routing across 50 intents and 600 tools. |
| L3 | benchmarks/router/l3_adversarial_25.json | Confusable, negated, and read/write adversarial routing across 25 intents and 500 tools. |
End-to-end local benchmarks:
./toolforge benchmark-e2e --suite benchmarks/e2e/local_100.json --list-cases
./toolforge benchmark-e2e --suite benchmarks/e2e/local_100.json --limit 5
./toolforge benchmark-e2e --suite benchmarks/e2e/l2_realistic_50.json --limit 10
./toolforge benchmark-e2e --suite benchmarks/e2e/l3_adversarial_25.json --limit 5
./toolforge benchmark-e2e --tier-mix --l1-limit 10 --l2-limit 20 --l3-limit 5
./toolforge benchmark-e2e --suite benchmarks/e2e/local_100.json \
--limit 20 \
--output-root benchmark_runs/e2e \
--max-rounds 3
For L2/L3 runs that generate file-writing tools, provide OUTPUT_DIR once:
./toolforge benchmark-e2e \
--tier-mix \
--l1-limit 10 \
--l2-limit 20 \
--l3-limit 5 \
--output-dir ./benchmark_runs/runtime-output \
--run-id l1-10-l2-20-l3-5
Long runs can be detached and checked later:
./toolforge benchmark-e2e \
--tier-mix \
--l1-limit 10 \
--l2-limit 20 \
--l3-limit 5 \
--output-dir ./benchmark_runs/runtime-output \
--run-id l1-10-l2-20-l3-5 \
--background
./toolforge benchmark-status --run-id l1-10-l2-20-l3-5
The E2E benchmark is tiered:
| Tier | Suite | Purpose |
|---|---|---|
| L1 | benchmarks/e2e/local_100.json | Broad smoke coverage for simple local tools. |
| L2 | benchmarks/e2e/l2_realistic_50.json | Realistic local tools with file fixtures, OUTPUT_DIR, structured data, reports, archives, and filesystem behavior. |
| L3 | benchmarks/e2e/l3_adversarial_25.json | Adversarial local tools for path safety, parser safety, malformed input, deterministic output, fail-closed behavior, and unsafe-code rejection. |
The E2E runner uses Tool Forge to generate real bundles, writes them under
benchmark_runs/e2e/<run-id>/bundles, and then scores those bundles with the
offline scorer. Generated tools in these suites do not require live third-party
APIs. Tool generation still uses your configured model providers.
Mixed tier runs write one aggregate report at benchmark_runs/e2e/<run-id>/score.json
and score.md, with per-tier child runs under l1/, l2/, and l3/.
Create local configuration from the template:
cp .env.example .env
Minimum model-provider keys:
GEMINI_API_KEY=...
OPENAI_API_KEY=...
For local/OSS LLM mode, Gemini/OpenAI keys are not required. Point Tool Forge at an OpenAI-compatible local endpoint instead:
./toolforge models preset llama_cpp
./toolforge models test
or configure manually:
TOOLFORGE_MODEL_MODE=local
TOOLFORGE_LLM_PROVIDER=openai_compatible
TOOLFORGE_LLM_BASE_URL=http://127.0.0.1:8080/v1
TOOLFORGE_LLM_API_KEY=local
TOOLFORGE_LOCAL_MODEL=local
TOOLFORGE_LLM_TIMEOUT_SECONDS=600
TOOLFORGE_LLM_CONTEXT_TOKENS=32768
TOOLFORGE_LLM_RESPONSE_TOKENS=8192
TOOLFORGE_GENERATION_RECIPE=auto
Generation recipes:
- auto: cloud uses the full-bundle recipe; local/OSS mode resolves to tiny.
- tiny: the model writes only tool.py; Tool Forge scaffolds CLI, tests, README, metadata, and requirements.
- full_bundle: cloud full artifact generation; architect review is eligible only when enabled by TOOLFORGE_ARCHITECT_REVIEW.
See docs/local-llms.md for llama.cpp, LM Studio, Ollama, LocalAI, vLLM, and SGLang setup.
Common app settings:
PURE_PYTHON_GENERATION=1
USE_LLM_KEY_CLASSIFIER=0
TOOL_GENERATOR_GEMINI_BUNDLE_MODEL=gemini-3.1-pro-preview
TOOL_GENERATOR_GEMINI_MAX_RETRIES=3
TOOL_GENERATOR_GEMINI_RETRY_BASE_SECONDS=5
TOOL_GENERATOR_GEMINI_RETRY_MAX_SECONDS=30
TOOL_GENERATOR_REPAIR_MODEL=gpt-5.4
TOOL_GENERATOR_SDK_FIX_MODEL=gpt-5.4
TOOLFORGE_DEPENDENCY_POLICY_ENABLED=1
TOOLFORGE_DEPENDENCY_POLICY_PATH=
TOOLFORGE_DETERMINISTIC_REVIEW_MODE=strict
TOOLFORGE_ARCHITECT_REVIEW=off
TOOLFORGE_ARCHITECT_REVIEW_MODEL=gpt-5.4
TOOLFORGE_REVIEW_TEST_MAX_REPAIRS=1
TOOLFORGE_DEPENDENCY_POLICY_* controls deterministic dependency cleanup before
sandbox execution. The default policy rewrites known stale pins, drops test-only
packages from runtime requirements, and blocks known hallucinated runtime package
choices so generated tools fail early with actionable review findings instead of
opaque pip install errors. Set TOOLFORGE_DEPENDENCY_POLICY_PATH to a JSON policy
file to customize this for your organization.
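The three policy actions described above (rewriting stale pins, dropping test-only packages, blocking hallucinated packages) can be sketched as a single pass over a requirements list. The policy shape and every package entry below are made up for illustration; the real JSON policy format is not documented here.

```python
import re

# Hypothetical policy shape with illustrative entries only.
EXAMPLE_POLICY = {
    "rewrite_pins": {"requests==2.0.0": "requests>=2.31"},
    "test_only": {"pytest", "pytest-mock"},
    "blocked": {"example-hallucinated-sdk"},
}


def package_name(requirement: str) -> str:
    """Extract the lowercase distribution name from one requirement line."""
    return re.split(r"[<>=!~\[; ]", requirement.strip(), maxsplit=1)[0].lower()


def apply_policy(requirements: list[str], policy: dict) -> tuple[list[str], list[str]]:
    """Return cleaned runtime requirements and review findings."""
    kept: list[str] = []
    findings: list[str] = []
    for line in requirements:
        requirement = line.strip()
        if not requirement or requirement.startswith("#"):
            continue
        name = package_name(requirement)
        if name in policy["blocked"]:
            findings.append(f"blocked package: {name}")  # fail early with a finding
            continue
        if name in policy["test_only"]:
            continue  # test-only packages do not belong in runtime requirements
        kept.append(policy["rewrite_pins"].get(requirement, requirement))
    return kept, findings
```

Surfacing a blocked package as a finding, rather than letting pip fail later, is what turns an install error into an actionable review item.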
Review modes:
- TOOLFORGE_DETERMINISTIC_REVIEW_MODE=strict keeps local objective checks blocking.
- TOOLFORGE_ARCHITECT_REVIEW=off skips the probabilistic LLM reviewer.
- TOOLFORGE_ARCHITECT_REVIEW=advisory records architect findings without blocking or repairing.
- TOOLFORGE_ARCHITECT_REVIEW=strict allows architect blockers to trigger repair and stop sandbox validation.
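The three TOOLFORGE_ARCHITECT_REVIEW values can be summarized as a small truth table. This sketch restates the documented behavior only; the function and key names are hypothetical.

```python
ARCHITECT_MODES = {"off", "advisory", "strict"}


def architect_outcome(mode: str, blocker_count: int) -> dict[str, bool]:
    """Describe what an architect review setting does for a given blocker count."""
    if mode not in ARCHITECT_MODES:
        raise ValueError(f"unknown architect review mode: {mode!r}")
    ran = mode != "off"
    blocking = mode == "strict" and blocker_count > 0
    return {
        "ran": ran,                                 # off skips the LLM reviewer
        "findings_recorded": ran,                   # advisory and strict both record
        "triggers_repair": blocking,                # only strict blockers repair
        "stops_sandbox_validation": blocking,       # only strict blockers gate the sandbox
    }
```

In advisory mode the findings are still recorded, so a team can observe reviewer quality before letting it block anything.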
Generated tools may require their own runtime credentials, such as:
SLACK_BOT_TOKEN=...
FRAMEIO_API_KEY=...
GITHUB_TOKEN=...
OUTPUT_DIR=...
Tool Forge reads the root .env for model keys and app-wide settings, then
webapp/backend/.env for backend-only overrides. Real secrets stay local because
.env is gitignored.
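The layering order, root .env first and backend overrides second, can be illustrated with a minimal parser. This is a sketch under simplifying assumptions: it handles only plain KEY=VALUE lines, ignores quoting and export prefixes, and is not the loader Tool Forge uses.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse plain KEY=VALUE lines, skipping blanks and comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def layered_settings(root_env_text: str, backend_env_text: str) -> dict[str, str]:
    """Apply backend-only overrides on top of the root settings."""
    settings = parse_env_text(root_env_text)
    settings.update(parse_env_text(backend_env_text))
    return settings


root_env = "TOOLFORGE_MODEL_MODE=cloud\nTOOLFORGE_ARCHITECT_REVIEW=off\n"
backend_env = "# backend-only override\nTOOLFORGE_ARCHITECT_REVIEW=advisory\n"
print(layered_settings(root_env, backend_env))
```

Here the backend file changes only the architect review setting, while the model mode from the root file is kept.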
| Guide | Use It For |
|---|---|
| Quick Start | First successful local run |
| Tool Forge User Guide | Product concepts and user journey |
| Setup and Operations Guide | Install, run, reset, sync, troubleshoot |
| Local and OSS LLMs | Local/SLM setup with llama.cpp, LM Studio, Ollama, LocalAI, vLLM, and SGLang |
| Tool Catalog and GitHub Publishing | Catalog lifecycle and PR workflow |
| Self-Hosting Baseline | Docker Compose, workers, readiness checks, and deployment posture |
| Tool Forge Router | Token-optimized MCP serving, governed import, and intent-scoped tool loading |
| Architecture | End-to-end technical design |
| Open Source Readiness Checklist | Release hygiene and public-repo preparation |
The same docs are available inside the web app from the top-right Docs button. First-time users also get an obvious Quick Start entry point.
Generated code should be treated as untrusted until reviewed.
Tool Forge improves safety by:
- Running deterministic review before sandbox execution.
- Keeping the LLM architect review optional through system configuration.
- Rejecting hardcoded credentials.
- Checking environment-variable consistency.
- Enforcing dependency policy when PURE_PYTHON_GENERATION=1.
- Keeping generated runtime credentials out of committed files.
- Publishing through explicit user review and GitHub PR flow.
Recommendations:
- Review generated code and dependencies before production use.
- Use least-privilege API tokens for live validation.
- Keep .env files local.
- Prefer official API documentation URLs in intents.
- Require PR review before merging new catalog tools.
Install local dev dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pip install -r webapp/backend/requirements.txt
Run checks:
python -m pytest
python -m ruff check .
python -m tool_generator.release_gate --json
find tool_generator webapp/backend tests \
-path '*/.venv/*' -prune -o \
-path '*/__pycache__/*' -prune -o \
-name '*.py' -print0 | xargs -0 python -m py_compile
cd webapp/frontend
npm run build
Planned directions:
- JavaScript and TypeScript tool generation.
- JS/TS sandbox validation.
- Stronger containerized sandbox isolation.
- Multi-tenant credential boundaries.
- Larger generation benchmarks and scoring dashboards.
- MCP-native packaging.
- Hosted catalog APIs.
- Formal tool-card schema releases.
Contributions are welcome.
Start with:
For design changes, open an issue first so the architecture and catalog contract stay coherent.
Licensed under the Apache License, Version 2.0.
Next Moca builds infrastructure for agent orchestration, tool catalogs, RAG pipelines, governance, observability, and operational AI systems.
Tool Forge is the workbench for creating the executable tools those systems need.