module
context.logits
Functions
M.logits
Raw logits buffer for batch position `idx`. Returns a `float*` cdata that is OWNED by the context — do not free, do not retain past the next decode.
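A minimal sketch of the copy-before-reuse pattern this implies, assuming the function is exposed as `ctx:logits(idx)` and that a model handle offers an `n_vocab()` accessor (both names are illustrative, check your binding's actual surface):

```lua
-- Copy the raw logits out before the next decode invalidates the buffer.
local function copy_logits(ctx, model, idx)
  local raw = ctx:logits(idx)      -- float* cdata, OWNED by the context
  if raw == nil then return nil end
  local n = model:n_vocab()
  local out = {}
  for i = 0, n - 1 do
    out[i + 1] = raw[i]            -- cdata is 0-based; Lua tables are 1-based
  end
  return out                       -- safe to keep past the next decode
end
```

The copy loop is the price of safety: holding the raw pointer across a decode is use-after-invalidate.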
M.sampled_token
Sampled token at batch position `i`, or `-1` if not available. (Set by a sampler attached via `llama_set_sampler`; see `Context:set_sampler`.)
M.sampled_probs
Probability array for the sampled token at position `i`.
M.sampled_probs_count
Count matching `sampled_probs(i)`.
M.sampled_logits
Logit array for the sampled candidates at position `i`.
M.sampled_logits_count
Count matching `sampled_logits(i)`.
M.sampled_candidates
Candidate-token array at position `i`.
M.sampled_candidates_count
Count matching `sampled_candidates(i)`.
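A sketch tying the `sampled_*` accessors together, assuming method-style access on the context (`ctx:sampled_token(i)` etc., mirroring the names above):

```lua
-- Inspect what the attached sampler produced at batch position i.
local function inspect_sampled(ctx, i)
  local tok = ctx:sampled_token(i)
  if tok == -1 then return nil end          -- nothing sampled here
  local cands = ctx:sampled_candidates(i)   -- candidate token ids (cdata)
  local probs = ctx:sampled_probs(i)        -- matching probabilities
  local n     = ctx:sampled_probs_count(i)
  for k = 0, n - 1 do
    print(string.format("cand=%d p=%.4f", cands[k], probs[k]))
  end
  return tok
end
```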
prepare_softmax
Internal: numerically stable log-softmax preparation. Returns `(logits, n, max_l, sum)` for reuse by the public methods. When the batch has no logits at `idx`, `logits` is nil.
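The log-sum-exp trick behind it, shown as a pure-Lua illustration (the real implementation works on the `float*` cdata buffer):

```lua
-- Subtract the max logit before exponentiating so exp() never overflows.
local function prepare_softmax(logits, n)
  if logits == nil then return nil end
  local max_l = -math.huge
  for i = 0, n - 1 do
    if logits[i] > max_l then max_l = logits[i] end
  end
  local sum = 0
  for i = 0, n - 1 do
    sum = sum + math.exp(logits[i] - max_l)  -- every term is <= 1
  end
  return logits, n, max_l, sum
end
-- From these: log P(token) = logits[token] - max_l - math.log(sum)
```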
M.logprob
Log-probability of `token_id` given the logits at batch position `idx`. Returns `-math.huge` when the token is out of vocab range or the batch slot has no logits.
M.entropy
Shannon entropy (in nats) of the logit distribution at batch position `idx`. Returns `0` when no logits are available.
M.logprob_entropy
Combined `logprob` and `entropy` in ONE pass over `n_vocab`. Use when both metrics are needed — halves the work compared to two separate calls.
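A usage sketch, assuming `logprob_entropy` returns the pair `(logprob, entropy)` as two values (the return shape is an assumption; verify against the binding):

```lua
-- Score a forced token and measure distribution sharpness in one vocab pass.
local lp, H = ctx:logprob_entropy(idx, token_id)
if lp == -math.huge then
  print("token out of vocab range, or no logits at idx")
else
  print(string.format("logprob=%.3f  entropy=%.3f nats", lp, H))
end
```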
M.embedding_ptr
Raw embedding cdata pointer for a sequence (zero-copy). Falls back to the per-batch embedding when the sequence-keyed accessor returns nil. Caller must NOT free and must NOT retain past the next decode.
M.embedding
Pooled embedding as a Lua table (1-based, copies the floats out so the caller can hold it past the next decode).
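A sketch contrasting the two accessors: the zero-copy pointer for immediate math, the table copy for anything kept around. Assumes `seq_id` and a `model:n_embd()` accessor (illustrative name):

```lua
-- Zero-copy path: valid only until the next decode.
local ptr  = ctx:embedding_ptr(seq_id)     -- float* cdata, do NOT free
local n    = model:n_embd()
local norm = 0
for i = 0, n - 1 do norm = norm + ptr[i] * ptr[i] end
norm = math.sqrt(norm)

-- Copying path: a 1-based Lua table, safe to store indefinitely.
local emb = ctx:embedding(seq_id)
vectors[#vectors + 1] = emb
```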
M.set_control_vector
Apply a control vector (activation steering) to this context. Effective on the next decode. Pass either a Lua table (we'll copy to a float buffer) or a pre-built cdata `float*`.
M.clear_control_vector
Remove any previously-applied control vector.
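A steering sketch. The expected vector layout (`n_embd` floats per steered layer, as llama.cpp's control-vector API takes) is an assumption here; check the binding docs for the exact shape:

```lua
-- Build a control vector as a plain Lua table; the binding copies it
-- into a float buffer. Takes effect on the next decode.
local cv = {}
for i = 1, n_embd * n_layers do
  cv[i] = steering[i] or 0
end
ctx:set_control_vector(cv)
-- ... decode with activation steering active ...
ctx:clear_control_vector()   -- back to unsteered behavior
```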
M.set_sampler
Attach a sampler to a specific sequence. After this, `llama_decode` will sample automatically using `sampler` whenever it processes `seq_id`. EXPERIMENTAL upstream — the surface may shift.
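A sketch of the attach flow, assuming the binding exposes a sampler constructor (the `llama.Sampler.greedy()` name below is illustrative, not confirmed by this module):

```lua
-- Let llama_decode sample automatically for sequence 0.
local sampler = llama.Sampler.greedy()
ctx:set_sampler(0, sampler)        -- EXPERIMENTAL upstream; surface may shift
-- Subsequent decodes populate sampled_token(i) / sampled_probs(i)
-- for batch positions belonging to seq 0.
```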
M.attach_threadpool
Attach a CPU threadpool to this context. `tp` may be either a raw `ggml_threadpool_t` cdata or a `Threadpool` instance. The second pool slot is reserved for batch operations and defaults to NULL, which tells llama.cpp to fall back to the primary pool internally. Do not pass the same pool in both slots via two separate attach calls: that corrupts the pool's wait state.
M.detach_threadpool
Detach the current threadpool (reverts to llama.cpp's internal one).
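A sketch of the attach/detach lifecycle, assuming a `Threadpool` constructor in the binding (name illustrative):

```lua
-- Pin decode work to a dedicated CPU threadpool.
local tp = llama.Threadpool.new(8)   -- e.g. 8 worker threads
ctx:attach_threadpool(tp)            -- batch slot stays NULL: llama.cpp
                                     -- falls back to tp internally
-- ... decodes run on tp ...
ctx:detach_threadpool()              -- revert to llama.cpp's internal pool
```

Note the single attach call: passing the same pool into both slots yourself is the wait-state corruption hazard described above.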
M.perf_print
Print perf counters to stderr.
M.perf_reset
Reset every perf counter to zero.
M.perf
Read the perf counters into a Lua table.
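A sketch of measuring a region with the three perf entry points. The field names inside the returned table are not specified here, so the code just dumps whatever is present:

```lua
-- Bracket a workload with reset/read; print to stderr for comparison.
ctx:perf_reset()
-- ... run some decodes ...
local p = ctx:perf()
for k, v in pairs(p) do
  print(k, v)                -- inspect available counter names in your build
end
ctx:perf_print()             -- same counters, formatted to stderr
```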
M.warmup
Run a single dummy decode through this context to force the GPU backend to JIT-compile its shaders. After the call the KV cache is wiped and `n_past` is reset, leaving the context in the same state as right after creation, except that the next real decode no longer pays the cold-shader penalty. Call it once, right after `model:context()` and before any real work, to shave 1-3 s off the time-to-first-token of the first real request.
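The intended call site, as a sketch:

```lua
-- Pay the shader-compile cost up front, before serving traffic.
local ctx = model:context()
ctx:warmup()    -- dummy decode; KV cache wiped and n_past reset afterwards
-- The first real decode now skips the cold-shader penalty.
```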