INES - Edge Edge AI Running in Your Browser

/img/posts/ines-edge-ai-browser.png

🚀 Edge AI running in your browser

Think about Language Model as “cloud-only” systems: APIs, servers, tokens, latency, privacy trade-offs.

But what if none of that was necessary?

I built INES - Intelligent Neural Edge System, a full Small Language Model, SLM-powered application that runs entirely inside the browser.

  • No server.
  • No API key.
  • No data leaving your device.

🧠 What makes INES different?

INES runs a SML like Gemma model directly on-device using:

  • Google’s MediaPipe LLM Inference API
  • WebGPU acceleration
  • WASM SIMD fallback when needed

The model is loaded once as a binary file and then everything happens locally in the tab.

After that moment → the browser becomes the AI runtime.


⚙️ One model → five AI systems

A single LlmInference instance powers five specialized tools:

  • 💬 Chat → multi-turn assistant
  • ✉️ Email → rewrite, summarize, adjust tone
  • 🎙️ Meeting → speech → transcript → structured decisions
  • 🌍 Translator → offline, auto-detect, 10 languages
  • Planner → AI-generated daily task system
  • 👨‍💻 Coding → code generation + preview sandbox
  • 💬 Learning → Learning

Each tab injects its own system prompt — effectively turning one model into multiple agents.


🔒 Privacy by design (not policy)

  • Prompts never leave the device
  • Meeting audio processed locally
  • Todo lists stored in localStorage
  • No analytics, no telemetry, no accounts

This is what “local-first AI” actually looks like in practice.


⚡ Why Edge AI matters

This architecture enables use cases that cloud AI struggles with:

  • 🔐 Corporate environments with strict data policies
  • 🌐 Offline or air-gapped systems
  • 📶 Low-connectivity regions
  • 🧪 Educational and demo environments
  • 🧍 Privacy-first personal tools

Edge AI isn’t just about performance - it’s about control shifting back to the user.


🧩 The technical reality

Running an SLM in-browser is not trivial:

  • WebGPU vs WASM fallback paths
  • Model loading as ArrayBuffer (not blob URLs)
  • Single shared inference runtime across UI modules
  • Streaming token generation in real time
  • Prompt orchestration per feature (chat/email/meeting/etc.)

But the result is simple:

A complete AI workstation that runs inside a single HTML/SPA client.


🔗 Project

GitHub: https://github.com/marcomattolab/ines

Built with:

  • Angular with PWA
  • MediaPipe GenAI
  • WebGPU
  • Zero backend infrastructure

Proof of Concept

Watch the POC demonstration on video below

play-video-ines

Presentation

Watch the presentation on ppt below

play-presentation-ines


💡Final thoughts

A new phase where “AI apps” won’t necessarily mean “cloud services”

  • 🔐 software that runs locally — and thinks locally

Written By

Marco Martorana

Lead Software Engineer | Computer Science Teacher