NotHotDog is an open-source platform for testing and evaluating AI agents. Built for developers who need quality assurance for their conversational AI systems, NotHotDog provides automated testing, hallucination detection, and performance monitoring.
Perfect for individuals and small teams getting started with AI agent testing.
Includes:
- ✅ Core agent testing framework
- ✅ Test case generation & execution
- ✅ Hallucination detection
- ✅ Response time metrics
- ✅ Validation rules
- ✅ 3 pre-configured personas
- ✅ CSV/Excel test upload
- ✅ Basic performance metrics
For teams needing advanced features, collaboration, and scale.
Everything in Community, plus:
- 🏢 Multi-tenant organizations
- 👥 Team collaboration & RBAC
- 🎭 Custom persona creation
- 📊 Custom metrics & KPIs
- 📈 Advanced analytics dashboard
- 🔐 SSO & enterprise security
- 🎯 Priority support & SLAs
- 🔄 CI/CD integrations
Generate test variations automatically using AI. Define a simple example, and NotHotDog creates diverse test cases covering different phrasings and scenarios.
Test with 3 pre-configured personas:
- Friendly User: Polite, clear communication style
- Technical Expert: Detailed technical questions
- Confused User: Needs clarification and simple explanations
Advanced AI-powered detection to identify when agents provide incorrect or fabricated information - a critical feature for production AI systems.
- Response Time: Measure agent latency and performance
- Validation Pass/Fail: Ensure responses meet your requirements
- Hallucination Detection: AI-powered factual verification
- Test Success Rate: Track overall agent reliability
- Parallel Processing: Run 50+ test cases simultaneously
- Conversation Memory: Maintain context across multi-turn interactions
- Adaptive Testing: Dynamic conversation flow based on responses
- Real-Time Monitoring: Track test progress and results live
Define custom validation rules using:
- JSON path expressions for response parsing
- Conditional logic (equals, contains, regex matching)
- Criticality levels for rule prioritization
- Custom metric definitions
- Node.js 18+ and npm
- PostgreSQL 13+
- API keys for LLM providers (Anthropic, OpenAI, etc.)
# Clone the repository
git clone https://github.com/AgentEvaluation/NotHotDog.git
cd NotHotDog
# Install dependencies
npm install
# Set up the database
npx prisma generate
npx prisma db push
# Configure environment variables
cp .env.example .env.local# Database
DATABASE_URL="postgresql://user:password@localhost:5432/nothotdog"
# Authentication (Clerk)
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_test_...
CLERK_SECRET_KEY=sk_test_...
# LLM Provider Keys
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
DEEPSEEK_API_KEY=...
GOOGLE_API_KEY=...
# Application
NEXT_PUBLIC_APP_URL=http://localhost:3000# Development mode
npm run dev
# Production build
npm run build
npm startgraph TB
subgraph "Frontend"
UI[Next.js App Router]
Components[React Components]
State[Client State Management]
end
subgraph "API Layer"
Routes[API Routes]
Auth[Authentication]
Validation[Request Validation]
end
subgraph "Business Logic"
TestEngine[Test Execution Engine]
PersonaEngine[Persona Engine]
MetricsEngine[Metrics Calculator]
ValidationEngine[Validation Engine]
end
subgraph "Data Layer"
Prisma[Prisma ORM]
DB[(PostgreSQL)]
end
subgraph "External Services"
LLM[LLM Providers]
Clerk[Clerk Auth]
end
UI --> Routes
Routes --> Auth
Auth --> Validation
Validation --> TestEngine
TestEngine --> PersonaEngine
TestEngine --> MetricsEngine
TestEngine --> ValidationEngine
PersonaEngine --> LLM
MetricsEngine --> Prisma
ValidationEngine --> Prisma
Prisma --> DB
Auth --> Clerk
- Framework: Next.js 15 (App Router)
- UI Library: React 18 with TypeScript
- Styling: Tailwind CSS + Radix UI
- State Management: React Hooks + Context
- Charts: Recharts
- Runtime: Node.js
- API: Next.js API Routes
- Database: PostgreSQL with Prisma ORM
- Authentication: Clerk
- Validation: Zod
- AI Integration: LangChain
- Deployment: Vercel / AWS / Self-hosted
- Database: PostgreSQL 15+
- File Storage: Local / S3 compatible
POST /api/tools/generate-tests
Content-Type: application/json
{
"inputExample": "Sample user query",
"agentDescription": "Customer support chatbot",
"count": 50
}
Response:
{
"testCases": [
{
"input": "Generated test input",
"expectedBehavior": "Expected response pattern",
"category": "edge-case"
}
]
}POST /api/tools/test-runs
Content-Type: application/json
{
"agentConfig": {
"endpoint": "https://api.example.com/chat",
"headers": { "Authorization": "Bearer token" }
},
"testCases": ["test-id-1", "test-id-2"],
"personas": ["persona-id-1", "persona-id-2"],
"metrics": ["response-time", "validation-score"]
}
Response:
{
"runId": "run-123",
"status": "running",
"totalTests": 100,
"completed": 0
}GET /api/tools/metrics/{runId}
Response:
{
"summary": {
"totalTests": 100,
"passed": 85,
"failed": 15,
"avgResponseTime": 1.2
},
"metrics": {
"responseTime": { "min": 0.5, "max": 3.2, "avg": 1.2 },
"validationScore": { "min": 0.7, "max": 1.0, "avg": 0.92 },
"hallucinationRate": 0.05
}
}const agentConfig = {
name: "Customer Support Bot",
endpoint: "https://api.example.com/chat",
headers: {
"Authorization": "Bearer your-token",
"Content-Type": "application/json"
},
inputFormat: {
message: "string",
context: "object"
},
outputFormat: {
response: "string",
metadata: "object"
}
};const validationRules = [
{
name: "Response Format",
path: "$.response",
condition: "exists",
criticality: "high"
},
{
name: "No PII Exposure",
path: "$.response",
condition: "not_contains",
value: ["ssn", "credit card"],
criticality: "high"
}
];const personas = [
{
name: "Frustrated Customer",
traits: {
temperature: 0.9,
messageLength: "short",
communicationStyle: "casual",
emotionalState: "frustrated",
slangUsage: true
}
},
{
name: "Technical Expert",
traits: {
temperature: 0.3,
messageLength: "detailed",
techSavviness: "expert",
communicationStyle: "formal"
}
}
];const testRun = await executeTests({
agent: agentConfig,
testCases: generatedTests,
personas: selectedPersonas,
validationRules: rules,
parallel: true,
maxConcurrency: 50
});| Metric | Type | Description |
|---|---|---|
| Response Time | Quantitative | Time taken for agent to respond |
| Validation Score | Quantitative | Percentage of passed validation rules |
| Hallucination Rate | Quantitative | Frequency of factually incorrect responses |
| Format Compliance | Binary | Whether response matches expected format |
| Conversation Coherence | Quantitative | Quality of multi-turn interactions |
| Sentiment Alignment | Quantitative | Match between expected and actual tone |
Define custom metrics for your specific use cases:
const customMetric = {
name: "Domain Knowledge Score",
type: "quantitative",
criticality: "high",
calculator: async (response, expected) => {
// Custom calculation logic
return score;
}
};- Data Isolation: Multi-tenant architecture with organization-level data separation
- Authentication: Enterprise-grade auth via Clerk
- API Security: Rate limiting and request validation
- Sensitive Data: Automatic PII detection and masking options
- Audit Logging: Comprehensive activity tracking
We welcome contributions from the community! Please see our Contributing Guide for details on:
- Code of conduct
- Development setup
- Pull request process
- Coding standards
- Testing requirements
# Fork and clone the repository
git clone https://github.com/yourusername/NotHotDog.git
cd NotHotDog
# Create a feature branch
git checkout -b feature/your-feature-name
# Install dependencies
npm install
# Set up pre-commit hooks
npm run prepare
# Run tests
npm test
# Run linting
npm run lintThis project is licensed under the MIT License. See the LICENSE file for details.
Built with excellent open-source technologies:
- Next.js - The React framework
- Prisma - Next-generation ORM
- LangChain - LLM application framework
- Radix UI - Accessible component library
- Tailwind CSS - Utility-first CSS framework