NaviGlasses: An Offline AI Wearable Navigation Assistant

A while back, I built a wearable captioning system — a glasses attachment with a small OLED HUD that could transcribe speech in real time and identify speakers. It was a prototype, held together with a PCB and a 3D printed enclosure I designed.

Fast forward a couple of months and I was assigned a new project. I was free to choose the domain and whether the product would be software-based or hardware-based. Since I had already built a product to assist the deaf, it made sense to try building one for the visually impaired.

The idea was to build a wearable assistive navigation system that uses a camera, on-device AI inference, a ranging system, and gyroscope data to help visually impaired users understand their environment in real time. It can identify roads, sidewalks, people, vehicles, and other objects, and relay that information through audio feedback.

The Main Idea

The core concept is straightforward: put a camera on a pair of glasses, run an object detection model on a phone, and describe what's in front of the user through audio.

Quite a few products similar to NaviGlasses are already on the market. What I am trying to build is a much more affordable device that offers the same functionality, designed around three requirements:

Offline-first — everything runs on the phone, no cloud dependency
Low latency — inference needs to be fast enough to be useful in real time
Wearable — the hardware has to be small enough to actually put on your glasses without causing disturbances

The glasses stream video to a companion Android app. The app runs an AI model, fuses its output with distance data from a sensor, and speaks relevant detections to the user through audio. No screen needed. No internet needed. (An app interface was created for demo purposes.)

The left side of the glasses holds the sensor box, which contains the distance sensor and the gyroscope.

The right side of the glasses holds the camera box, which contains the Xiao ESP32-S3 camera module.

System Overview

NaviGlasses is built around two microcontroller boards, each handling a distinct responsibility, tied together by a mobile app doing all the heavy AI lifting.

OV2640 Camera (Xiao ESP32-S3 Sense)
        ↓ MJPEG over Wi-Fi
Mobile App (Android)  ←  JSON over Wi-Fi  ←  VL53L0X + MPU6050 (ESP32 Supermini)
        ↓ ONNX inference
Web Speech API (TTS)

Hardware

ESP32-S3 Sense — The Eyes

The ESP32-S3 is the visual input unit. It runs an OV2640 camera module and streams a live MJPEG feed over Wi-Fi. This feed is consumed by the mobile app, decoded frame by frame, and passed into the on-device object detection model.
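To make the "decoded frame by frame" step concrete, here is a minimal TypeScript sketch of pulling individual JPEG images out of an MJPEG byte stream. It is not the app's actual code. It assumes frames can be found by scanning for the JPEG start-of-image (FF D8) and end-of-image (FF D9) markers instead of parsing the multipart headers, and the class and function names are made up for the example.

```typescript
// Accumulates bytes from an MJPEG stream and returns complete JPEG frames.
// Assumption: frames are located by their FF D8 / FF D9 markers.
class MjpegFrameSplitter {
  private buffer: Uint8Array = new Uint8Array(0);

  // Append a network chunk and return every complete frame found so far.
  push(chunk: Uint8Array): Uint8Array[] {
    const merged = new Uint8Array(this.buffer.length + chunk.length);
    merged.set(this.buffer, 0);
    merged.set(chunk, this.buffer.length);
    this.buffer = merged;

    const frames: Uint8Array[] = [];
    while (true) {
      const start = findMarker(this.buffer, 0xd8, 0);
      if (start < 0) {
        // No frame start yet; keep the last byte in case it is a lone FF.
        this.buffer = this.buffer.slice(Math.max(0, this.buffer.length - 1));
        break;
      }
      const end = findMarker(this.buffer, 0xd9, start + 2);
      if (end < 0) {
        // Frame started but not finished; drop anything before the start.
        this.buffer = this.buffer.slice(start);
        break;
      }
      frames.push(this.buffer.slice(start, end + 2));
      this.buffer = this.buffer.slice(end + 2);
    }
    return frames;
  }
}

// Returns the index of the byte pair FF <second>, or -1 if it is absent.
function findMarker(data: Uint8Array, second: number, from: number): number {
  for (let i = from; i + 1 < data.length; i++) {
    if (data[i] === 0xff && data[i + 1] === second) return i;
  }
  return -1;
}
```

In a WebView, the chunks would typically come from the reader of a fetch() response body, and each returned frame could be wrapped in a Blob of type image/jpeg and decoded with createImageBitmap. Marker scanning can misfire on JPEGs that embed thumbnails, so a production version may prefer to parse the multipart boundaries.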

The OV2640 is a compact, low-power camera that fits the wearable form factor well. Its only real downside is that it gets extremely hot very quickly, which is why heat sinks are provided with the board.

ESP32 Supermini — The Spatial Sensor Hub

The ESP32 Supermini handles ranging and orientation data. It's connected to:

VL53L0X ToF — a Time-of-Flight distance sensor that provides accurate proximity measurements to objects directly ahead. This gives the system a sense of depth that a camera alone can't reliably provide.
MPU6050 IMU — a 6-axis gyroscope and accelerometer. This tracks head orientation and movement, feeding into the sensor fusion pipeline for fall detection and directional context.
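As an illustration, the readings from both sensors could reach the app as a small JSON payload. The sketch below validates such a payload in TypeScript. The field names (distance_mm, gyro, accel), the units, and the 2000 mm cutoff (roughly the VL53L0X's nominal maximum range) are assumptions made for the example, not the firmware's actual format.

```typescript
// Readings above this are treated as "nothing in range" (assumed cutoff).
const MAX_VALID_RANGE_MM = 2000;

type Vec3 = { x: number; y: number; z: number };

// Hypothetical shape of one reading sent by the sensor board.
interface SensorReading {
  distanceMm: number | null; // null when no valid range is available
  gyroDps: Vec3;             // angular rate, degrees per second (assumed unit)
  accelG: Vec3;              // acceleration in g (assumed unit)
}

function isFiniteNumber(v: unknown): v is number {
  return typeof v === "number" && Number.isFinite(v);
}

function parseVec3(v: unknown): Vec3 | null {
  if (typeof v !== "object" || v === null) return null;
  const { x, y, z } = v as Record<string, unknown>;
  if (!isFiniteNumber(x) || !isFiniteNumber(y) || !isFiniteNumber(z)) return null;
  return { x, y, z };
}

// Returns null for malformed payloads instead of throwing, so one bad
// packet cannot crash the feedback loop.
function parseSensorReading(json: string): SensorReading | null {
  let raw: unknown;
  try {
    raw = JSON.parse(json);
  } catch {
    return null;
  }
  if (typeof raw !== "object" || raw === null) return null;
  const obj = raw as Record<string, unknown>;

  const gyroDps = parseVec3(obj["gyro"]);
  const accelG = parseVec3(obj["accel"]);
  if (!gyroDps || !accelG) return null;

  const d = obj["distance_mm"];
  const distanceMm =
    isFiniteNumber(d) && d > 0 && d <= MAX_VALID_RANGE_MM ? d : null;
  return { distanceMm, gyroDps, accelG };
}
```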

The ESP32 Supermini also handles battery management. It's paired with a TP4056 charging module for safe charging. The board connects to the battery through a switch and exposes pogo pin contacts for charging. Pogo pins make the charging experience clean and snag-free, which matters a lot when the device is worn daily.

Components in total:

Xiao ESP32 S3
ESP32 Supermini
TP4056
VL53L0X ToF
MPU6050 IMU
Pogo Pins
Custom PCB
Battery
Switch

PCB Design

This time I made a lot of improvements to the PCB design compared with the first PCB I built for the captioning project. The PCB only houses the sensors that measure distance and head movement. The camera module, on the other hand, didn't need a dedicated PCB, just a snug 3D printed case and a battery. (I learnt last time that this module doesn't need a TP4056.)

It houses the following:

TP4056
ESP32 Supermini
Switch (3 Pin)
Pogo pins
Gyroscope
Distance sensor

3D Model Design

ESP32-S3 camera case: Thanks to ScottyDoesKnow for the 3D model for the camera module. I modified it slightly to accommodate the battery and a neodymium magnet on the side, which attaches the case to the frame.

Sensor PCB case: I tried to make it as slim as possible, and after a couple of iterations I arrived at this final design. The bump on the left houses the battery and the distance sensor. The bump on the right has the pogo pins for magnetic charging. (Yes, I reused an old cable to create a pogo pin charger.)

Final Design:

Mobile App with Capacitor

The mobile app is built with React + TypeScript + Vite, compiled to Android via Capacitor.

Why Capacitor?

Speed of iteration. React with Vite is fast to develop in.

onnxruntime-web. The ONNX Runtime has a WebAssembly build that runs in a browser or WebView environment. Running YOLOv8n inference inside a native Android app would require the Android-specific ONNX Runtime bindings, a different build pipeline, and a different deployment model. With Capacitor, the app runs in a WebView, which means onnxruntime-web works out of the box — the same code that would run in a browser runs on the phone.

Web Speech API. Text-to-speech via the Web Speech API is available in Capacitor's WebView on Android. No additional native plugin needed.

Cross-platform potential. A Capacitor app can target iOS with minimal changes. A native Android app can't.

The trade-off is that WebView environments have quirks — more on that in the challenges section.

Software and AI Pipeline

On-Device Object Detection with YOLOv8n

The core model is YOLOv8n, a lightweight variant of the YOLOv8 architecture, exported to ONNX format and running entirely on-device via onnxruntime-web (WebAssembly). I trained it on Google Colab using publicly available datasets from Roboflow.

No cloud. No round trips. Everything runs locally on the phone.

The pipeline works like this:

MJPEG frames from the ESP32-S3 are received and decoded in the app.
Each frame is resized to the model's input resolution and converted into a tensor in the channel-first format the model expects.
The tensor is fed into the YOLOv8n ONNX model.
The model outputs bounding boxes, class scores, and confidence values.
Relevant detections — roads, sidewalks, people, vehicles, obstacles — are passed to the audio feedback layer.
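The last two steps of the pipeline can be sketched in TypeScript. This is an illustration, not the app's code: it assumes the usual YOLOv8 detection export, where the output is a flat array of shape [1, 4 + numClasses, numAnchors] holding centre-x, centre-y, width, and height rows followed by one score row per class. The thresholds and function names are placeholders.

```typescript
interface Detection {
  classId: number;
  score: number;
  // Box corners in model-input pixels.
  x1: number; y1: number; x2: number; y2: number;
}

// Picks the best class per anchor and keeps anchors above the threshold.
function decodeYolov8(
  output: Float32Array,
  numClasses: number,
  numAnchors: number,
  scoreThreshold = 0.4, // assumed value
): Detection[] {
  const detections: Detection[] = [];
  for (let i = 0; i < numAnchors; i++) {
    let bestClass = -1;
    let bestScore = scoreThreshold;
    for (let c = 0; c < numClasses; c++) {
      const s = output[(4 + c) * numAnchors + i];
      if (s > bestScore) { bestScore = s; bestClass = c; }
    }
    if (bestClass < 0) continue;
    const cx = output[i];
    const cy = output[numAnchors + i];
    const w = output[2 * numAnchors + i];
    const h = output[3 * numAnchors + i];
    detections.push({
      classId: bestClass, score: bestScore,
      x1: cx - w / 2, y1: cy - h / 2, x2: cx + w / 2, y2: cy + h / 2,
    });
  }
  return detections;
}

// Intersection over union of two boxes.
function iou(a: Detection, b: Detection): number {
  const iw = Math.max(0, Math.min(a.x2, b.x2) - Math.max(a.x1, b.x1));
  const ih = Math.max(0, Math.min(a.y2, b.y2) - Math.max(a.y1, b.y1));
  const inter = iw * ih;
  const union =
    (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
  return union > 0 ? inter / union : 0;
}

// Greedy per-class non-maximum suppression: highest score wins.
function nms(detections: Detection[], iouThreshold = 0.45): Detection[] {
  const sorted = [...detections].sort((a, b) => b.score - a.score);
  const kept: Detection[] = [];
  for (const d of sorted) {
    const overlaps = kept.some(
      (k) => k.classId === d.classId && iou(k, d) > iouThreshold,
    );
    if (!overlaps) kept.push(d);
  }
  return kept;
}
```

With onnxruntime-web, the flat array would be the data of the output tensor returned by the session's run call, and the class and anchor counts can be read from that tensor's dims.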

Why YOLOv8n?

YOLOv8n is the nano variant of the YOLOv8 family — the smallest and fastest model in the lineup. For a real-time wearable application, inference speed matters more than raw accuracy. A model that's 95% accurate but takes 2 seconds per frame is useless. YOLOv8n hits a reasonable accuracy/speed balance on mobile hardware, and its ONNX export is well-supported.

Contextual Audio Feedback

Detected objects are described to the user through the Web Speech API. The system uses a priority weighting scheme — a person or vehicle directly ahead at close range will interrupt a lower-priority sidewalk detection. This ensures the most actionable information reaches the user without flooding them with constant audio.
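A priority scheme like this could be sketched as follows. The weights, the proximity bonus, and the per-label cooldown are all assumptions chosen for illustration, not the values used in the app.

```typescript
type Label = "vehicle" | "person" | "obstacle" | "road" | "sidewalk";

// Hypothetical weights: higher means more urgent.
const PRIORITY: Record<Label, number> = {
  vehicle: 5, person: 4, obstacle: 3, road: 2, sidewalk: 1,
};

interface Candidate {
  label: Label;
  confidence: number; // 0..1 from the detector
  distanceM?: number; // present only when a range reading applies
}

// Combines class priority, confidence, and a bonus for close objects.
function urgency(c: Candidate): number {
  const proximity =
    c.distanceM !== undefined ? 1 / Math.max(c.distanceM, 0.25) : 0;
  return PRIORITY[c.label] * c.confidence + proximity;
}

// Chooses at most one detection to speak per call and enforces a
// per-label cooldown so the same prompt is not repeated constantly.
class Announcer {
  private lastSpokenMs = new Map<Label, number>();
  private cooldownMs: number;

  constructor(cooldownMs = 3000) {
    this.cooldownMs = cooldownMs;
  }

  next(candidates: Candidate[], nowMs: number): Candidate | null {
    const eligible = candidates.filter((c) => {
      const last = this.lastSpokenMs.get(c.label) ?? -Infinity;
      return nowMs - last >= this.cooldownMs;
    });
    if (eligible.length === 0) return null;
    const best = eligible.reduce((a, b) => (urgency(b) > urgency(a) ? b : a));
    this.lastSpokenMs.set(best.label, nowMs);
    return best;
  }
}
```

To let an urgent prompt cut off a less important one, the caller could invoke speechSynthesis.cancel() before speaking a new SpeechSynthesisUtterance whenever the chosen candidate outranks the one currently being read.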

Sensor Fusion

Distance data from the VL53L0X and orientation data from the MPU6050 are fused with the visual detections to add spatial grounding. If the model sees a person and the ToF sensor reports an object at 1.2 metres, the audio output reflects that proximity. The gyroscope data feeds a fall detection module that monitors for sudden orientation changes outside normal movement envelopes.
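A rough TypeScript sketch of both ideas is shown here. It assumes the range sensor points at the centre of the camera frame, so a reading is attached only to boxes that cover the centre pixel. The fall heuristic (gyro magnitude above a threshold for several consecutive samples) and its numbers are placeholders, not the app's tuned values.

```typescript
interface Box {
  label: string;
  x1: number; y1: number; x2: number; y2: number;
}

interface FusedBox extends Box {
  distanceM: number | null;
}

// Attaches the range reading to detections that cover the frame centre.
// Assumption: the sensor's beam is aligned with the image centre.
function fuseDistance(
  boxes: Box[],
  frameWidth: number,
  frameHeight: number,
  distanceMm: number | null,
): FusedBox[] {
  const cx = frameWidth / 2;
  const cy = frameHeight / 2;
  return boxes.map((b) => {
    const coversCentre = b.x1 <= cx && cx <= b.x2 && b.y1 <= cy && cy <= b.y2;
    const distanceM =
      coversCentre && distanceMm !== null ? distanceMm / 1000 : null;
    return { ...b, distanceM };
  });
}

// Flags a possible fall when the angular rate stays above a threshold
// for several consecutive samples (hypothetical heuristic and values).
class FallDetector {
  private streak = 0;
  private thresholdDps: number;
  private minSamples: number;

  constructor(thresholdDps = 250, minSamples = 3) {
    this.thresholdDps = thresholdDps;
    this.minSamples = minSamples;
  }

  update(gyroDps: { x: number; y: number; z: number }): boolean {
    const magnitude = Math.hypot(gyroDps.x, gyroDps.y, gyroDps.z);
    this.streak = magnitude > this.thresholdDps ? this.streak + 1 : 0;
    return this.streak >= this.minSamples;
  }
}
```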

How Do the Glasses Know What to Say?

One of the goals of NaviGlasses was to avoid overwhelming the user with constant announcements. Instead of reading every detected object aloud, the application first filters detections by confidence and relevance before deciding whether a spoken prompt is necessary.

Now suppose you stop in the middle of the sidewalk or at the side of a road and want to know whether it's safe to walk ahead. Nodding your head up and down twice (the count is configurable) opens the voice assistant, through which you can ask questions, as shown in the image below.
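One way such a nod gesture could be recognised from gyroscope data is sketched below. It treats a nod as a fast downward pitch followed shortly by a fast upward pitch. The sign convention (negative rate means tilting down), the thresholds, and the timing windows are assumptions and would depend on how the MPU6050 is mounted.

```typescript
// Detects N nods in quick succession from the pitch-axis angular rate.
class DoubleNodDetector {
  private phase: "idle" | "down" = "idle";
  private downAtMs = 0;
  private nodTimesMs: number[] = [];
  private requiredNods: number;
  private rateDps: number;
  private maxNodMs: number;
  private windowMs: number;

  constructor(requiredNods = 2, rateDps = 80, maxNodMs = 600, windowMs = 2000) {
    this.requiredNods = requiredNods; // how many nods trigger the gesture
    this.rateDps = rateDps;           // minimum rate that counts as a nod
    this.maxNodMs = maxNodMs;         // max time between down and up motion
    this.windowMs = windowMs;         // all nods must fall in this window
  }

  // Feed one sample; returns true when the gesture completes.
  update(pitchRateDps: number, tMs: number): boolean {
    if (this.phase === "idle") {
      if (pitchRateDps < -this.rateDps) {
        this.phase = "down";
        this.downAtMs = tMs;
      }
      return false;
    }
    // phase === "down": wait for the upward swing or time out.
    if (tMs - this.downAtMs > this.maxNodMs) {
      this.phase = "idle";
      return false;
    }
    if (pitchRateDps > this.rateDps) {
      this.phase = "idle";
      this.nodTimesMs = this.nodTimesMs.filter((t) => tMs - t <= this.windowMs);
      this.nodTimesMs.push(tMs);
      if (this.nodTimesMs.length >= this.requiredNods) {
        this.nodTimesMs = [];
        return true;
      }
    }
    return false;
  }
}
```

The first constructor argument corresponds to the configurable nod count mentioned above.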

Nearby obstacles and navigation-critical objects such as vehicles, pedestrians, potholes, or blocked paths are given higher priority than objects that do not immediately affect movement.

This allows the audio feedback to remain short and useful, with responses such as:

"Person ahead."
"Vehicle approaching on the left."
"Pothole ahead."
"Sidewalk detected on the right."

By combining object detection, road segmentation, and sensor information, the application attempts to provide context-aware guidance rather than simply listing everything visible in the camera frame.
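Short prompts like the ones listed above could be generated from a detection's horizontal position in the frame. The sketch below splits the frame into thirds, which is an assumption for illustration, and its wording is simplified compared with the examples.

```typescript
type Side = "left" | "ahead" | "right";

// Maps a box's horizontal centre to a coarse direction (thirds of the frame).
function sideOf(x1: number, x2: number, frameWidth: number): Side {
  const centre = (x1 + x2) / 2 / frameWidth;
  if (centre < 1 / 3) return "left";
  if (centre > 2 / 3) return "right";
  return "ahead";
}

// Builds a short spoken prompt such as "Person ahead."
function toPrompt(label: string, x1: number, x2: number, frameWidth: number): string {
  const name = label.charAt(0).toUpperCase() + label.slice(1);
  const side = sideOf(x1, x2, frameWidth);
  return side === "ahead" ? `${name} ahead.` : `${name} on the ${side}.`;
}
```

The resulting string can be passed to a SpeechSynthesisUtterance and spoken with speechSynthesis.speak.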

What It Can Do Right Now

NaviGlasses currently detects and announces:

Roads and sidewalks
People and cyclists
Vehicles
General obstacles in the path

It does this in real time, offline, with no dependency on an internet connection.

Results:

In the image below, object detection happens first.

At the same time, segmentation runs, categorizing sidewalks, roads, etc.

The outputs of object detection and road/sidewalk segmentation are then combined into a single result from which an inference is made.

Challenges

Running AI Completely Offline

One of the goals of the project was to avoid relying on cloud APIs, so the object detection model runs entirely on the smartphone. Getting this working wasn't as easy as expected. The model and its runtime had to be packaged with the application, and every camera frame had to be converted into the exact format expected by the model before inference. Once everything was configured correctly, the app was able to perform object detection locally without requiring an internet connection or external server.
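The frame conversion is the step most likely to go wrong silently, so here is a minimal sketch of it. It assumes the frame has already been resized and read back as RGBA bytes (the layout of ImageData.data) and that the model wants RGB values scaled to the 0 to 1 range in channel-first order, as standard YOLOv8 exports do. The function name is made up for the example.

```typescript
// Converts interleaved RGBA bytes into a normalised channel-first
// Float32Array: all red values, then all green, then all blue.
function rgbaToChw(
  rgba: Uint8ClampedArray | Uint8Array,
  width: number,
  height: number,
): Float32Array {
  const pixels = width * height;
  if (rgba.length !== pixels * 4) {
    throw new Error("buffer length does not match width * height * 4");
  }
  const out = new Float32Array(3 * pixels);
  for (let i = 0; i < pixels; i++) {
    out[i] = rgba[i * 4] / 255;                  // R plane
    out[pixels + i] = rgba[i * 4 + 1] / 255;     // G plane
    out[2 * pixels + i] = rgba[i * 4 + 2] / 255; // B plane (alpha is dropped)
  }
  return out;
}
```

With onnxruntime-web, the result would be wrapped as new ort.Tensor("float32", data, [1, 3, height, width]) before being passed to the session. A 640 by 640 input is the default for YOLOv8 exports, but the actual size depends on how the model was exported.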

Android Network Configuration

The ESP32 devices serve HTTP, not HTTPS. Since Android 9 (API level 28), Android blocks cleartext HTTP traffic from apps by default, so the app needed an explicit network_security_config.xml entry permitting cleartext traffic to the ESP32 addresses.

AbortController Compatibility

The Capacitor WebView on Android has some edge cases around AbortController and AbortSignal that don't behave identically to a desktop browser. Stream cancellation logic that worked fine in Chrome needed adjustment to work reliably in the WebView environment.

Form Factor

Fitting multiple microcontrollers, a battery circuit, a ToF sensor, a gyroscope, and a camera into something wearable is genuinely hard. The current prototype is functional but not yet compact, and shrinking it is the next improvement I plan to make.

What's Next

Several major changes could improve not only the form factor but the project as a whole:

The sensor module is heavier than the camera module, which creates an imbalance when both are attached and puts stress on the nose. Making the PCB longer would distribute the weight more evenly, and a smaller battery would help further.
Build a native Android application instead of a WebView app running WebAssembly, which would allow using TensorFlow Lite for inference.
Use better datasets. I have attached images of how the model performed in some cases. It did a decent job overall, but in real scenarios, such as real roads, there were some mishaps and wrong segmentation results, which is not acceptable in a product for the visually impaired.
Add real-time navigation using maps. With map data from projects like OpenStreetMap, navigation to a destination could be added. It's a bit of a stretch, but not impossible.

Project Stack

Hardware:

Xiao ESP32-S3 Sense
ESP32 Supermini
VL53L0X
MPU6050
TP4056
Custom PCB

Software:

React
TypeScript
Vite
Capacitor
ONNX Runtime Web
YOLOv8n
Web Speech API

AI:

Object Detection
Road/Sidewalk Segmentation
Sensor Fusion
Temporary Spatial Memory

Final Notes

NaviGlasses is still a work in progress. The core pipeline works — the glasses can see, process, and speak. The hardware is functional. The software architecture is solid. What's left is refinement: a more compact form factor, better model accuracy on edge cases, and a smoother audio experience.
