ion7-core / context

Class

ion7.core.Context

_ptr cdata `llama_context*` (auto-freed via ffi.gc).
_model_ref table Parent Model — keeps the model alive as long as we are.
_mem cdata `llama_memory_t` accessor (cached).
_decode_batch cdata Pre-allocated `llama_batch` reused on every decode.
_batch_gc cdata GC sentinel that disposes of `_decode_batch`.
_n_past integer Current KV fill position (Lua-side mirror).
_n_batch integer Batch capacity (cached, immutable).
_n_ctx integer Context window size (cached, immutable).
_n_ubatch integer Micro-batch size (cached, immutable).
_n_seq_max integer Maximum concurrent sequences (cached).
_is_embed boolean? Set by `Model:embedding_context`.

Functions

Context.new

Wrap a raw `llama_context*` returned by `llama_init_from_model`. Prefer `model:context()` over calling this directly — it does the params dance and the OOM-retry cascade for you.

Context.new(model, ptr)
model ion7.core.Model Parent model (back-reference for GC ordering).
ptr cdata `llama_context*` (will be freed on GC).
→ ion7.core.Context
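
For illustration, a minimal usage sketch — the `require` path is an assumption derived from the class name, and `raw` stands in for a pointer obtained from `llama_init_from_model`:

```lua
local Context = require("ion7.core.context")  -- assumed module path

-- Preferred: the model builds the params and runs the OOM-retry
-- cascade for you.
local ctx = model:context()

-- Direct wrap, for callers already holding a raw pointer.
-- Context takes ownership; `raw` is freed when ctx is collected.
local ctx2 = Context.new(model, raw)
```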

Context:ptr

Return the raw `llama_context*` cdata pointer (used by samplers and any FFI call that needs the context handle).

Context:ptr()
→ cdata
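
For example, handing the handle to a raw llama.cpp call (`lib` here is a stand-in for an FFI namespace with the llama.cpp declarations loaded):

```lua
-- Any FFI call that expects a llama_context* takes the raw handle.
local logits = lib.llama_get_logits(ctx:ptr())
```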

Context:memory

Return the cached `llama_memory_t` accessor for this context. Re-calling `llama_get_memory` on every KV op would be wasteful — we cache it once at construction.

Context:memory()
→ cdata

Context:free

Explicitly free the context (and its batch buffers) immediately. Idempotent. Normally the GC handles this; call it manually inside tight benchmark loops to avoid accumulating dead VRAM allocations between iterations.

Context:free()
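
A sketch of the benchmark pattern described above (`run_iteration` is a hypothetical workload):

```lua
for i = 1, 100 do
  local ctx = model:context()
  run_iteration(ctx)  -- hypothetical benchmark body
  ctx:free()          -- eager free; the eventual GC pass is a no-op
end
```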

Context:n_ctx

Context:n_ctx()
→ integer Context window size in tokens.

Context:n_batch

Context:n_batch()
→ integer Batch capacity (max tokens per `llama_decode` call).

Context:n_ubatch

Context:n_ubatch()
→ integer Micro-batch size (the chunk the backend processes at once).

Context:n_seq_max

Context:n_seq_max()
→ integer Maximum concurrent sequences.

Context:n_ctx_seq

Context:n_ctx_seq()
→ integer Per-sequence context window (asks llama.cpp live).

Context:n_threads

Context:n_threads()
→ integer Current generation thread count.

Context:n_threads_batch

Context:n_threads_batch()
→ integer Current batch processing thread count.

Context:set_n_threads

Update the thread counts on a live context — no recreate required.

Context:set_n_threads(n_threads, n_threads_batch)
n_threads integer
n_threads_batch integer? Defaults to `n_threads`.
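
For example, throttling and restoring thread counts on a live context:

```lua
-- Save the current counts, throttle, then restore. No recreate needed.
local gen, batch = ctx:n_threads(), ctx:n_threads_batch()
ctx:set_n_threads(1)           -- n_threads_batch also defaults to 1 here
-- ... CPU-contended section ...
ctx:set_n_threads(gen, batch)
```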

Context:pooling_type

Symbolic pooling strategy of the context, e.g. `"mean"` for an embedding context. See `POOLING_NAMES` for the full mapping.

Context:pooling_type()
→ string
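
For example, dispatching on the strategy:

```lua
-- Embedding contexts typically pool; `"mean"` is one common strategy.
if ctx:pooling_type() == "mean" then
  -- one pooled vector per sequence rather than one per token
end
```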

Context:set_embeddings

Toggle embedding extraction mode at runtime.

Context:set_embeddings(on)
on boolean

Context:set_causal_attn

Toggle causal attention. Pass `false` to use bidirectional attention (the embedding mode used by encoder-style models).

Context:set_causal_attn(on)
on boolean
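
A sketch of flipping a context into the bidirectional embedding mode described above, then back:

```lua
-- Enter embedding mode: extract embeddings, attend bidirectionally.
ctx:set_embeddings(true)
ctx:set_causal_attn(false)
-- ... encode input and read embeddings here ...

-- Restore normal causal generation.
ctx:set_causal_attn(true)
ctx:set_embeddings(false)
```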

Context:set_warmup

Mark the context as "in warmup" so llama.cpp does not pollute its perf counters with the dummy decode used to JIT-compile shaders. See `Context:warmup()` for the high-level helper.

Context:set_warmup(on)
on boolean
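
A minimal sketch of the manual pattern (`Context:warmup()` wraps this; the dummy decode itself is elided):

```lua
ctx:set_warmup(true)
-- ... run a throwaway decode so the backend JIT-compiles its shaders
--     without the run showing up in perf counters ...
ctx:set_warmup(false)
```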

Context:synchronize

Block until every async GPU command queued so far has finished. Useful before reading logits or tensors out of a backend buffer.

Context:synchronize()
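
For instance, draining the queue before touching backend memory:

```lua
-- Block until all queued GPU work has finished, then read results.
ctx:synchronize()
-- ... logits / embedding tensors are now safe to read ...
```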

Context:set_abort_callback

Register an abort callback that llama.cpp will poll periodically during a decode. Returning `true` from the callback aborts the decode.

Context:set_abort_callback(cb, data)
cb cdata Function pointer of type `bool(*)(void* data)`.
data cdata? Opaque user data forwarded to `cb`.
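
A sketch using a LuaJIT callback (flipping the flag would happen elsewhere, e.g. from another coroutine):

```lua
local ffi = require("ffi")

local want_abort = false  -- flip to true to cancel a running decode

-- llama.cpp polls this during decode; returning true aborts it.
local cb = ffi.cast("bool (*)(void *)", function(_)
  return want_abort
end)

ctx:set_abort_callback(cb, nil)
-- ... decode ...
cb:free()  -- LuaJIT callbacks must be released explicitly
```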

Context:n_past

Context:n_past()
→ integer Current Lua-tracked KV fill position. Mirrors what we have decoded so far.

Context:set_n_past

Manually realign the Lua-side `n_past` mirror after a state restore (when llama.cpp resumes from a snapshot it knows the position but we don't). Most callers should NOT need this.

Context:set_n_past(n)
n integer
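
A sketch of the restore-then-realign flow (the snapshot restore itself is hypothetical and elided):

```lua
-- llama.cpp has resumed at `restored_len` tokens; the Lua-side
-- mirror is stale, so realign it by hand.
ctx:set_n_past(restored_len)
assert(ctx:n_past() == restored_len)
```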