ion7-labs
LuaJIT × llama.cpp — modular local LLM runtime.
Each module is independent and usable standalone or as part of the full stack.
Available — API reference
LuaJIT FFI → libllama.so. 84 bridge functions across 4 translation units, covering model, context, KV cache, speculative decoding, chat templates (Jinja2), sampling, LoRA, reasoning budget, and grammar constraints.
Zero malloc per generated token. KV snapshot/restore. Prefix cache. Full libcommon surface (DRY, XTC, EAGLE3, NGRAM_CACHE).
API Reference →
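A minimal sketch of the bridge pattern, assuming libllama.so is on the loader path. The two symbols below exist in recent llama.cpp builds, but the C API drifts between releases, so verify against your llama.h — this is illustrative, not the ion7-core surface.

```lua
-- Minimal FFI bridge sketch (illustrative, not ion7-core's API).
-- Assumes a recent llama.cpp build; signatures drift between releases.
local ffi = require("ffi")

ffi.cdef[[
  void        llama_backend_init(void);
  const char *llama_print_system_info(void);
]]

local lib = ffi.load("llama")   -- resolves libllama.so / .dylib / .dll
lib.llama_backend_init()
print(ffi.string(lib.llama_print_system_info()))
```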
Grammar engine for LuaJIT. Eight input formats (regex / ABNF / EBNF / JSON Schema / type DSL / enum / tool / auto-detect), all yielding the same composable Grammar_obj.
AST + LPeg-backed parsers. Per-seq Backtrack and DCCD (multi-tenant safe). GrammarContext for stateful SQL agents. Pure-Lua fuzzer. Composition algebra (union, sequence, wrap, interleave).
API Reference →
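A hedged sketch of the composition algebra. The module path and constructor/combinator names (from_regex, union, sequence) are illustrative guesses, not the documented API — only Grammar_obj and the eight input formats come from the source.

```lua
-- Hypothetical names throughout; only the composition idea is from the docs.
local G = require("ion7.grammar")        -- hypothetical module path

local digits = G.from_regex("[0-9]+")    -- regex input format
local word   = G.from_regex("[a-z]+")    -- every format yields a Grammar_obj
local either = G.union(digits, word)     -- composition algebra: union
local pair   = G.sequence(word, digits)  -- ...and sequence
```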
Chat pipeline + multi-session inference orchestration. Per-seq KV snapshots, prefix cache, slot pool, fork. Engine + Pool (~6× aggregate speedup).
Mid-generation eviction, RadixAttention exact-match prefix cache, Y-Token sink hook. Multi-channel streaming (content / thinking / tool_call_delta / tool_call_done / stop). Format-aware tool extraction (OpenAI / Qwen / Mistral / Hermes). Interleaved-thinking tool loop. Reasoning budget. Embeddings.
API Reference →
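A hedged usage sketch of the Engine + Pool pattern; every name here (Engine.new, pool, spawn, generate) is an assumption, not the documented API.

```lua
-- Illustrative only: constructor and method names are assumptions.
local Engine = require("ion7.engine")              -- hypothetical module path
local eng    = Engine.new({ model = "model.gguf" })
local pool   = eng:pool({ slots = 4 })             -- slot pool; sessions share
                                                   -- the prefix cache
local a = pool:spawn({ system = "Be terse." })
local b = pool:spawn({ system = "Be thorough." })
for chunk in a:generate("hello") do io.write(chunk.content or "") end
```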
In Development — docs coming
Sparse Autoencoder on LLM embeddings. Validates superposition hypothesis: 0.91 cosine reconstruction, 0.500 Jaccard between related concept clusters.
SAE:edit() for primitive surgery (zero / set / scale). 64 primitives, K=16 active. Sparse Adam optimizer, LuaJIT + OpenBLAS FFI. 16× embedding compression.
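A sketch of primitive surgery via SAE:edit(). The source names the operations (zero / set / scale), but the argument shape, loader, and encode/decode calls below are assumptions.

```lua
-- Only SAE:edit() and zero/set/scale come from the docs; the rest is assumed.
local sae = SAE.load("sae-64x16.bin")               -- hypothetical loader
local z   = sae:encode(embedding)                   -- 64 primitives, K=16 active
sae:edit(z, { op = "zero",  idx = 12 })             -- silence one primitive
sae:edit(z, { op = "scale", idx = 3, by = 2.0 })    -- amplify another
local patched = sae:decode(z)                       -- back to embedding space
```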
Visual node editor for ion7 pipelines. React Flow + Bun WebSocket server + LuaJIT executor. Each ion7-core function is a wireable node.
Browser ↔ Bun WS ↔ LuaJIT. Topological execution. Nodes: Model_load, Ctx_decode, Sampler_chain, Generate, Display.
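A minimal sketch of the executor's topological pass (Kahn's algorithm); the node/edge table shapes are assumptions, not the ion7 wire format.

```lua
-- Kahn's algorithm over a node graph; table shapes are illustrative.
-- nodes: map id -> node table; edges: list of { from = id, to = id }.
local function topo_order(nodes, edges)
  local indeg, succ = {}, {}
  for id in pairs(nodes) do indeg[id], succ[id] = 0, {} end
  for _, e in ipairs(edges) do
    indeg[e.to] = indeg[e.to] + 1
    succ[e.from][#succ[e.from] + 1] = e.to
  end
  local queue, order = {}, {}
  for id, d in pairs(indeg) do if d == 0 then queue[#queue + 1] = id end end
  while #queue > 0 do
    local id = table.remove(queue)
    order[#order + 1] = id
    for _, nxt in ipairs(succ[id]) do
      indeg[nxt] = indeg[nxt] - 1
      if indeg[nxt] == 0 then queue[#queue + 1] = nxt end
    end
  end
  return order  -- execution order: Model_load before Ctx_decode, etc.
end
```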
Neovim plugin for in-editor LLM generation. Subprocess-based streaming via jobstart(). Supports multi-turn via --msgs-file.
Protocol: TOKEN:-prefixed lines streamed over stdout.
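A sketch of the consuming side in Neovim. jobstart() and nvim_put() are real Neovim APIs, but the command line and anything in the protocol beyond the TOKEN: prefix are assumptions.

```lua
-- Neovim side of the streaming loop. The CLI invocation is hypothetical;
-- only the TOKEN: line prefix is taken from the docs.
vim.fn.jobstart({ "ion7", "--msgs-file", vim.fn.tempname() }, {
  stdout_buffered = false,
  on_stdout = function(_, lines, _)
    for _, line in ipairs(lines) do
      local tok = line:match("^TOKEN:(.*)")
      if tok then
        vim.api.nvim_put({ tok }, "c", true, true)  -- insert at cursor
      end
    end
  end,
})
```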
Planned
Local embeddings without llama-server. Load Qwen3-Embedding-8B directly via ion7-core FFI — no HTTP, no subprocess.
3-layer persistent memory: hot index (always in context) + topics (on-demand) + session archives (grep only).
Retrieval-Augmented Generation pipeline. SQLite + sqlite-vss vector store. Query → embed → cosine search → context injection.
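The cosine step is plain vector math; a self-contained sketch (generic, not ion7 code):

```lua
-- Cosine similarity between two embedding vectors (plain Lua tables).
local function cosine(a, b)
  local dot, na, nb = 0, 0, 0
  for i = 1, #a do
    dot = dot + a[i] * b[i]
    na  = na + a[i] * a[i]
    nb  = nb + b[i] * b[i]
  end
  return dot / (math.sqrt(na * nb) + 1e-12)  -- epsilon guards zero vectors
end
```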
Local text-to-speech via Kokoro-82M FFI. Streaming token→audio pipeline for <250ms first-sound latency in NPC AI pipelines.
Local speech-to-text via Whisper FFI. Streaming voice input with <50ms segment latency.
Fine-tuning and distillation via GGML autograd. LoRA/QLoRA on RTX 3060. Teacher→student distillation. No Python.
Stack layers: Application → High-level → Mid-level → Core