
Spokzi AI

An advanced AI-powered English learning assistant that helps users speak confidently and correct their mistakes in real time using cutting-edge Speech-to-Text and LLM technologies.

Spokzi AI represents the next evolution in language learning, moving beyond static lessons to interactive, AI-driven coaching. It acts as a personal speaking tutor available 24/7, capable of listening, understanding context, and providing granular feedback on grammar, vocabulary, and pronunciation.

Features

  • AI Roleplay: Engage in realistic scenarios (ordering coffee, job interview) with AI avatars that respond dynamically to what you say.
  • Instant Grammar Correction: The AI analyzes your speech in real time and highlights mistakes with explanations and improved alternatives.
  • Pronunciation Scoring: Phoneme-level analysis of your speech, providing visual feedback on which sounds need improvement.
  • Topic Generator: Infinite conversation starters generated based on your proficiency level and interests.
  • Usage Analytics: Tracks speaking speed, pause frequency, and vocabulary diversity over time.

Technical Deep Dive

Spokzi merges mobile engineering with heavy AI integration, requiring a seamless flow of audio data and strict latency control.

Audio Pipeline & Latency

The critical metric for an AI conversation app is "Time to Respond" (TTR). A delay of more than 2 seconds breaks immersion.

  • Voice Activity Detection (VAD): We implemented on-device VAD (Silero VAD) via React Native JSI to instantly detect when the user stops speaking, triggering the API call without manual button presses (see the sketch after this list).
  • Streaming Architecture: Audio is streamed in chunks to the backend. We use a WebSocket connection to pipe audio data directly to the Speech-to-Text engine, letting us start processing the transcript before the user even finishes the sentence.
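
To make the endpointing concrete, here is a minimal TypeScript sketch of the client-side loop. The VAD binding shape, the socket URL, the chunk timing, and the end_of_utterance control message are illustrative assumptions, not the app's actual protocol:

```typescript
// Streams every PCM chunk immediately and signals end-of-utterance after
// a short run of silence, so the backend can respond without a button press.
type PcmChunk = Int16Array;

interface Vad {
  isSpeech(chunk: PcmChunk): boolean; // assumed shape of the Silero VAD JSI binding
}

const SILENCE_CHUNKS = 10; // ~300 ms at ~30 ms per chunk (assumed chunk size)
let silentRun = 0;

const socket = new WebSocket("wss://api.example.com/stt"); // placeholder URL

function onAudioChunk(chunk: PcmChunk, vad: Vad): void {
  // Ship audio as it arrives so STT can overlap with the user's speech.
  socket.send(chunk.buffer);

  if (vad.isSpeech(chunk)) {
    silentRun = 0;
  } else if (++silentRun >= SILENCE_CHUNKS) {
    // Sustained silence: tell the backend to finalize the transcript.
    socket.send(JSON.stringify({ type: "end_of_utterance" }));
    silentRun = 0;
  }
}
```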

AI & Large Language Models (LLM)

  • Context Management: We maintain a sliding window of conversation context to keep LLM costs manageable while ensuring the AI remembers previous turns (a sketch follows this list).
  • Prompt Engineering: Extensive tuning of system prompts to ensure the AI corrects the user constructively rather than just ignoring mistakes or being too pedantic.
  • Custom TTS: We integrated premium Text-to-Speech voices that sound natural and emotive, caching common phrases to reduce latency and cost.
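
As a rough illustration of the sliding-window approach, the sketch below keeps the system prompt pinned and trims history to a fixed turn budget; the 12-turn limit is an assumption, and a production version would budget by tokens rather than turns:

```typescript
// Keeps the system prompt pinned and trims history to the newest turns so
// prompt size (and cost) stays bounded as the conversation grows.
interface Turn {
  role: "system" | "user" | "assistant";
  content: string;
}

const MAX_TURNS = 12; // assumed budget; a real one would count tokens

function buildContext(systemPrompt: Turn, history: Turn[]): Turn[] {
  return [systemPrompt, ...history.slice(-MAX_TURNS)];
}
```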
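
The phrase caching can likewise be sketched as a simple in-memory map keyed by voice and text; the synthesize parameter stands in for whichever TTS client (Azure or ElevenLabs) is being called:

```typescript
// In-memory phrase cache: only hit the paid TTS API on a miss.
const ttsCache = new Map<string, ArrayBuffer>();

async function speak(
  text: string,
  voice: string,
  synthesize: (text: string, voice: string) => Promise<ArrayBuffer>, // stands in for the TTS client
): Promise<ArrayBuffer> {
  const key = `${voice}:${text}`;
  const cached = ttsCache.get(key);
  if (cached) return cached;

  const audio = await synthesize(text, voice);
  ttsCache.set(key, audio);
  return audio;
}
```

In practice the cache would also be persisted (e.g. to disk) so common phrases survive app restarts.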

React Native Integration

  • Microphone Handling: Used react-native-audio-recorder-player with custom native modifications to support the raw PCM streaming required by the AI backend.
  • UI Responsiveness: The UI updates optimistically. As the user speaks, we visualize the waveform in real time using react-native-skia for 60fps performance on both platforms.
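
A simplified version of the waveform component might look like the following; the dimensions and color are placeholders, and a production version would drive the samples through Reanimated shared values rather than React props to hold 60fps:

```tsx
import React from "react";
import { Canvas, Path, Skia } from "@shopify/react-native-skia";

// Draws one polyline across the canvas from normalized amplitudes (0..1).
export function Waveform({ samples }: { samples: number[] }) {
  const width = 320; // placeholder dimensions
  const height = 80;
  const mid = height / 2;

  const path = Skia.Path.Make();
  path.moveTo(0, mid);
  samples.forEach((amp, i) => {
    const x = (i / Math.max(samples.length - 1, 1)) * width;
    path.lineTo(x, mid - amp * mid);
  });

  return (
    <Canvas style={{ width, height }}>
      <Path path={path} style="stroke" strokeWidth={2} color="#4F8EF7" />
    </Canvas>
  );
}
```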

Technology Stack

  • Mobile: React Native, Expo, TypeScript, MMKV
  • AI/ML: OpenAI GPT-4o, Whisper (STT), Azure TTS / ElevenLabs
  • Audio: FFmpeg (for format conversion), Silero VAD (On-device)
  • Backend: Python (FastAPI), Celery (Async tasks)
  • Infra: Google Cloud Platform, Cloud Run

Challenges & Solutions

Challenge: Hallucinations in Grammar Correction. Solution: We devised a "Verifier" step where a second, smaller model specifically trained on grammar rules validates the primary model's corrections before showing them to the user. This reduced false positives by 40%.
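
In outline, the two-stage flow looks like the sketch below. The callModel parameter abstracts the actual LLM API, and the model names and prompts are illustrative, not the production ones:

```typescript
interface Correction {
  original: string;
  corrected: string;
  explanation: string;
}

// Stage 1 proposes corrections; stage 2 lets a smaller grammar-focused
// model veto each proposal, filtering hallucinated "errors" before
// anything reaches the user.
async function correctWithVerifier(
  sentence: string,
  callModel: (model: string, prompt: string) => Promise<string>, // abstracts the LLM API
): Promise<Correction[]> {
  const proposed: Correction[] = JSON.parse(
    await callModel("primary-llm", `List grammar corrections for: "${sentence}" as JSON.`),
  );

  const verified: Correction[] = [];
  for (const c of proposed) {
    const verdict = await callModel(
      "grammar-verifier",
      `Is "${c.corrected}" a valid correction of "${c.original}"? Answer yes or no.`,
    );
    if (verdict.trim().toLowerCase().startsWith("yes")) verified.push(c);
  }
  return verified;
}
```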

Challenge: Battery Drain from continuous audio processing. Solution: We optimized the audio capture service to downsample audio to 16kHz mono (sufficient for voice) before processing, significantly reducing CPU usage and memory footprint on the device. We also aggressively manage the audio session to release hardware resources immediately when the user is idle.
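
The downsampling step itself is straightforward; the sketch below averages stereo to mono and resamples by linear interpolation, assuming 48kHz input. A real pipeline (e.g. FFmpeg, used here for format conversion) would also low-pass filter before decimating to avoid aliasing:

```typescript
// Averages stereo to mono, then resamples to 16 kHz by linear interpolation.
// Assumes 48 kHz input; a real pipeline would low-pass first to avoid aliasing.
function toMono16k(left: Float32Array, right: Float32Array, inRate = 48000): Float32Array {
  const mono = new Float32Array(left.length);
  for (let i = 0; i < left.length; i++) mono[i] = (left[i] + right[i]) / 2;

  const outRate = 16000;
  const outLen = Math.floor((mono.length * outRate) / inRate);
  const out = new Float32Array(outLen);
  for (let i = 0; i < outLen; i++) {
    const pos = (i * inRate) / outRate; // fractional index into the source
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, mono.length - 1);
    out[i] = mono[i0] + (mono[i1] - mono[i0]) * (pos - i0);
  }
  return out;
}
```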