ion7-labs
Local LLM inference runtime for LuaJIT.
Direct FFI into libllama.so — microseconds, not milliseconds.
0 Python · 0 malloc / token · 0 HTTP overhead · 484 documented functions
zero malloc / token
llama_batch pre-allocated at context creation. KV cache managed, never reallocated. Every generated token is a pure compute step.
direct FFI
No subprocess, no HTTP, no JSON serialization. LuaJIT FFI calls go straight into libllama.so; call overhead is measured in microseconds.
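To make the two cards above concrete, here is a minimal raw-FFI sketch of the pattern: one llama_batch allocated up front and reused for every token. The cdef is a trimmed, hand-written subset of llama.h and may not match every llama.cpp version; ion7's own bridge code is not shown here.

local ffi = require "ffi"

-- Trimmed, illustrative subset of llama.h (check against your llama.cpp version).
ffi.cdef[[
typedef int32_t llama_token;
typedef struct llama_batch {
  int32_t       n_tokens;
  llama_token * token;
  float       * embd;
  int32_t     * pos;
  int32_t     * n_seq_id;
  int32_t    ** seq_id;
  int8_t      * logits;
} llama_batch;
llama_batch llama_batch_init(int32_t n_tokens, int32_t embd, int32_t n_seq_max);
void        llama_batch_free(llama_batch batch);
]]

local llama = ffi.load("libllama.so")

-- One batch, allocated once; per-token generation only rewrites its fields.
local batch = llama.llama_batch_init(512, 0, 1)

local function push_token(pos, tok)
  batch.n_tokens     = 1
  batch.token[0]     = tok
  batch.pos[0]       = pos
  batch.n_seq_id[0]  = 1
  batch.seq_id[0][0] = 0
  batch.logits[0]    = 1      -- request logits for this position
end

push_token(0, 42)             -- position 0, arbitrary token id; no allocation on this path
llama.llama_batch_free(batch)

The runtime pre-allocates the batch at context creation, so user code never touches llama_batch directly.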
full llama.cpp surface
84 bridge functions across 4 translation units. Chat templates (Jinja2), LoRA, speculative decoding, grammar, reasoning budget — all exposed.
grammar engine
GBNF, JSON Schema, regex, tool calling in pure Lua. CRANE-style lazy grammar activation. Constrained generation without sacrificing reasoning.
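For a sense of what the grammar layer consumes, here is a small hand-written GBNF grammar that constrains output to a yes/no JSON object; the rule names are illustrative and not taken from ion7's sources. The regex, JSON Schema, and Lua-annotation front ends compile down to this same format.

-- GBNF: each rule is name ::= expansion; literals are quoted, character classes use [...].
local gbnf = [[
root   ::= "{" ws "\"answer\"" ws ":" ws answer ws "}"
answer ::= "\"yes\"" | "\"no\""
ws     ::= [ \t\n]*
]]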
Quick start
local Model   = require "ion7.core.model"
local Sampler = require "ion7.core.sampler"

local model = Model.load("model.gguf", { n_gpu_layers = -1 })
local ctx   = model:context({ n_ctx = 4096 })
local vocab = model:vocab()
local samp  = Sampler.chain(vocab):temp(0.8):top_p(0.95):build()

local tokens, n = vocab:tokenize("Hello, world!", true)
ctx:decode(tokens, n)                  -- prefill the prompt

repeat
  local token = samp:sample(ctx, -1)   -- sample from the logits of the last position
  samp:accept(token)                   -- update sampler state (penalties, etc.)
  io.write(vocab:piece(token))         -- convert the token id to text
  ctx:decode({ token }, 1)             -- feed the sampled token back before the next step
until vocab:is_eog(token)              -- stop at end-of-generation
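Assuming libllama.so is resolvable by the dynamic loader (for example via LD_LIBRARY_PATH) and a GGUF model is on disk, a script like this runs under plain luajit, with no server process to start or connect to.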
Stack
LuaJIT FFI → llama.cpp. Zero malloc per token. 84 bridge functions, 4 translation units.
Grammar engine for LuaJIT. Compiles regex, ABNF, EBNF, JSON Schema, and Lua type annotations to GBNF. Per-seq backtrack + DCCD runtime, pure-Lua fuzzer, format auto-detect.
Chat pipeline + multi-session inference. Per-seq KV snapshots, prefix cache, three-channel streaming, schema-constrained sampling, interleaved-thinking tool loop.
Sparse Autoencoder on LLM embeddings. Built around the superposition hypothesis; 0.91 cosine reconstruction.
Visual node editor. React Flow + Bun WebSocket + LuaJIT executor.
Neovim integration. Subprocess-based streaming token generation.
Local embeddings without llama-server. Cosine similarity, pooling, batch encoding (cosine sketch below).
3-layer persistent memory: hot index + topics + session archives.
SQLite + vector search. Query → embed → retrieve pipeline.
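For reference, the cosine similarity used by the embeddings and vector-search projects above is only a few lines of Lua. This standalone sketch assumes plain Lua tables of numbers rather than whatever vector type those projects use internally.

-- Cosine similarity between two equal-length vectors (plain Lua tables).
local function cosine(a, b)
  local dot, na, nb = 0.0, 0.0, 0.0
  for i = 1, #a do
    dot = dot + a[i] * b[i]
    na  = na  + a[i] * a[i]
    nb  = nb  + b[i] * b[i]
  end
  return dot / (math.sqrt(na) * math.sqrt(nb))
end

print(cosine({1, 0, 1}, {1, 1, 0}))   --> 0.5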