An AI-powered visual assistant that uses computer vision to detect objects and describe scenes in real time. It is designed to help visually impaired users understand their surroundings through audio descriptions, and it can also serve as an educational tool for object recognition.
- Real-time Object Detection: Identifies objects in the camera feed with bounding boxes
- Scene Description: Generates natural language descriptions of what the camera sees
- Text-to-Speech: Converts descriptions to spoken audio for accessibility
- Responsive UI: Works on both desktop and mobile devices
- Eye Animation: Unique eye-themed interface with opening/closing animations
- Frontend: Next.js, React, TypeScript, Socket.IO client
- Backend: Flask, Python, Socket.IO server
- AI Services:
  - Gemini Vision API for object detection and scene description
  - ElevenLabs for high-quality text-to-speech
- Node.js (v16+)
- Python (v3.9+)
- Google Gemini API key
- ElevenLabs API key
- Clone the repository:

      git clone https://github.com/IanPTan/dataquest25.git
      cd dataquest25

- Set up the backend:

      cd backend
      pip install -r requirements.txt

- Create a .env file in the backend directory with your API keys:

      GEMINI_API_KEY=your_gemini_api_key
      ELEVENLABS_API_KEY=your_elevenlabs_api_key

- Set up the frontend:

      cd ../app
      npm install
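Before starting the server, it can help to confirm that both keys from the .env step are actually present. The following is a minimal sketch using only the standard library. `load_env_file` and `missing_keys` are hypothetical helper names, not part of this repository, and the project itself may load its configuration differently (for example with a dotenv package).

```python
from pathlib import Path

# The two key names come from the .env step above.
REQUIRED_KEYS = ("GEMINI_API_KEY", "ELEVENLABS_API_KEY")


def load_env_file(path):
    """Parse KEY=value lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything without a KEY=value shape.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values):
    """Return the required key names that are absent or empty."""
    return [name for name in REQUIRED_KEYS if not values.get(name)]
```

Calling `missing_keys(load_env_file("backend/.env"))` from the repository root returns an empty list when both keys are set, and otherwise names the ones that still need a value.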
- Start the backend server:

      cd backend
      python app.py

- Start the frontend development server:

      cd ../app
      npm run dev

- Open your browser and navigate to http://localhost:3000
- Click the "Open Eyes" button to activate the camera
- Allow camera permissions when prompted
- Point your camera at objects or scenes you want to identify
- The app will detect objects, draw bounding boxes, and speak descriptions
- Toggle object detection visualization with the "Show/Hide Objects" button
- Click "Close Eyes" to stop the camera feed
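Drawing bounding boxes over the camera feed means mapping the model's box coordinates onto the displayed frame. The sketch below assumes boxes arrive as `[ymin, xmin, ymax, xmax]` on a 0 to 1000 scale, which is the convention Gemini's documentation describes but which this README does not confirm for this project. `to_pixel_box` is a hypothetical helper name.

```python
def to_pixel_box(box, frame_width, frame_height, scale=1000):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to pixel (x, y, width, height).

    Assumption: coordinates are normalized to the range 0..scale.
    """
    ymin, xmin, ymax, xmax = box
    left = round(xmin / scale * frame_width)
    top = round(ymin / scale * frame_height)
    right = round(xmax / scale * frame_width)
    bottom = round(ymax / scale * frame_height)
    # Clamp to the frame so a slightly out-of-range box still draws.
    left, right = max(0, left), min(frame_width, right)
    top, bottom = max(0, top), min(frame_height, bottom)
    return left, top, max(0, right - left), max(0, bottom - top)
```

For a 640x480 frame, `to_pixel_box([100, 200, 500, 600], 640, 480)` gives `(128, 48, 256, 192)`, a rectangle ready to pass to a canvas or image-drawing call.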
The frontend can be deployed to Vercel:
    npm run build
    vercel deploy
The backend can be deployed to any platform that supports Python applications:
- Heroku
- Google Cloud Run
- AWS Elastic Beanstalk
- Railway
Remember to update the WebSocket connection URL in the frontend code to point to your deployed backend.
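Some of the platforms listed above (Heroku and Google Cloud Run, for example) tell the application which port to listen on through a `PORT` environment variable. The following is a minimal sketch of how the backend might pick its bind address. `resolve_bind_address` is a hypothetical helper, and the fallback port 5000 is an assumption based on Flask's default, not a value taken from this repository's `app.py`.

```python
import os


def resolve_bind_address(env=None, default_port=5000):
    """Return (host, port), preferring a platform-supplied PORT variable."""
    env = os.environ if env is None else env
    raw_port = env.get("PORT", "")
    if raw_port.isdigit():
        # A hosting platform assigned the port: listen on all interfaces.
        return "0.0.0.0", int(raw_port)
    # Local development: stay on localhost with the assumed default port.
    return "127.0.0.1", default_port
```

The resulting pair could be passed to the server's run call, and the frontend's WebSocket URL would then point at the public address the platform assigns to that port.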
├── app/ # Frontend Next.js application
│ ├── components/ # React components
│ │ ├── camera.tsx # Main camera component
│ │ └── camera.css # Camera styling
│ └── page.tsx # Main page component
├── backend/ # Flask backend
│ ├── app.py # Main server file
│ ├── requirements.txt # Python dependencies
│ ├── tts.py # Text-to-speech utilities
│ └── simple_tts.py # Simplified TTS implementation
└── vercel.json # Vercel configuration
- Requires a stable internet connection for API calls
- Camera access requires HTTPS in production environments
- Object detection accuracy depends on lighting conditions and camera quality
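Because every detection and speech request depends on a network call, a brief connection drop can interrupt the experience. One common mitigation is to retry with exponential backoff, sketched below. `call_with_retry` is a hypothetical helper, and `ConnectionError` and `TimeoutError` are stand-ins: the exceptions raised by the actual Gemini and ElevenLabs clients are not specified in this README.

```python
import time


def call_with_retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying on connection failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists so the waiting behavior can be swapped out in tests. In a real-time setting a small `attempts` value is usually preferable, since a stale description is of little use once the camera has moved.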
- Offline mode with on-device models
- Support for multiple languages
- User profiles with customized voice preferences
- Improved object tracking between frames
- Haptic feedback for mobile devices
This project is licensed under the MIT License - see the LICENSE file for details.
- Google Gemini API for vision capabilities
- ElevenLabs for natural-sounding TTS
- The open-source community for various libraries and tools