michael giba

Grammar-Based Sampling Quick Summary

This short post gives an overview of sampling and then focuses on a specific kind of sampling that is useful for producing structured outputs from language models.

During inference, LLMs generate sequences of "tokens," the primitives that these systems string together to produce sentences. First, these transformer-based models produce raw scores, or "logits," over the set of possible tokens; then "sampling" decides which token to choose given those scores.

[Figure: 1. the language model processes the input tokens ("The cat is"); 2. it generates logits (raw scores) over possible next tokens (big, small, cute, ...); 3. sampling selects the next token ("cute").]
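As a rough sketch (not the author's code), a decoding loop has this shape, assuming a placeholder model that maps a token sequence to a vector of logits and a sample function that turns logits into the next token id:

def generate(model, prompt_tokens, sample, max_new_tokens=32, eos_id=None):
    # 1. process input tokens, 2. get logits for the next token, 3. sample, repeat.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)    # raw scores, one per vocabulary entry
        next_id = sample(logits)  # the "sampling" step this post is about
        tokens.append(next_id)
        if eos_id is not None and next_id == eos_id:  # stop at end-of-sequence
            break
    return tokens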

Greedy Sampling

The simplest possible approach is to take the token with the largest logit at each timestep, and this is a perfectly viable option. Below you can see a visualization of what this looks like for real completions of sample phrases, courtesy of phi-4.
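In the loop sketched above, greedy sampling is nothing more than an argmax over the logits:

import numpy as np

def greedy_sample(logits):
    # Always pick the single highest-scoring token.
    return int(np.argmax(logits))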

To generate all of the visualizations below, I added custom logit_processors in a script wrapping llama.cpp inference. This let me both record and manipulate the sampler inputs/outputs. You can find the code here.
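For reference, such a processor can be a plain callable that receives the token ids so far plus the current logits and returns (possibly modified) logits. The signature below follows the (input_ids, scores) convention used by llama-cpp-python-style bindings; it is only a sketch, not the author's actual script:

import numpy as np

recorded_logits = []  # raw scores captured at each timestep, for later visualization

def recording_processor(input_ids: np.ndarray, scores: np.ndarray) -> np.ndarray:
    # Record a copy of the logits, then pass them through unchanged.
    recorded_logits.append(scores.copy())
    return scores  # a manipulating processor would edit `scores` before returning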

As you can see, a token could represent more than one character to the model. Tokens are really just strings of bytes.

Softmax Sampling

Being greedy with token selection is often not what we want. Instead we might want some variability in our sampling. It is common to transform the model's logits into a probability distribution through a nice stable mechanism like the softmax function. Once we have reinterpreted the scores as probabilities, we can sample from them as expected.
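A minimal sketch of that transformation with NumPy, subtracting the maximum logit before exponentiating so the softmax stays numerically stable:

import numpy as np

def softmax_sample(logits, rng=None):
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=np.float64)
    z -= z.max()  # shift by the max logit so exp() cannot overflow
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))  # draw one token id from the distribution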

Notice that the model is sometimes not very sure about any particular token option (i.e., the single-timestep distribution has high entropy). This can result in some unlikely samplings, which lead the model onto vastly different future trajectories.
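One way to quantify that uncertainty is the Shannon entropy of the single-timestep distribution; it is high when the probability mass is spread across many tokens:

import numpy as np

def entropy(probs, eps=1e-12):
    # Shannon entropy (in nats) of a token distribution; higher means less certain.
    p = np.asarray(probs, dtype=np.float64)
    return float(-(p * np.log(p + eps)).sum())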

Constrained Sampling

But what if we want to generate outputs with a strict format or an enumerable set of options? For example, what if you wanted to only have a model respond "true" or "false" to a specific question, or what if you wanted to force the model responses to fit a specific JSON or YAML schema?

This is where grammar-constrained generation can help. If we can represent our desired output in a specific form, we can programmatically constrain the sampling to guarantee that the outputs will match those rules. At each step of the generation process we take our generation so far $t_0 \dots t_i$ and:

  1. Get the logits from the model for $t_{i+1}$
  2. Ask our rules whether $t_0 \dots t_{i+1}$ would be valid; if not, we set that token's score to $-\infty$
  3. Perform one of our conventional sampling techniques across the valid options
[Figure: grammar-constrained sampling. 1. Generate logits for $t_{i+1}$ given $t_0 \dots t_i$ over candidate tokens A-D; 2. filter the candidates with the grammar rules; 3. sample from the remaining valid tokens (token C selected).]
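As a rough sketch (not the author's implementation), one constrained decoding step might look like the following, assuming a hypothetical is_valid_prefix checker (sketched further below), a detokenize helper that turns token ids into text, and any of the sampler functions from earlier:

import numpy as np

def constrained_step(model, tokens, vocab_size, is_valid_prefix, detokenize, sample):
    logits = np.asarray(model(tokens), dtype=np.float64)  # 1. logits for t_{i+1}
    prefix = detokenize(tokens)
    for tok_id in range(vocab_size):  # 2. mask continuations the grammar rejects
        if not is_valid_prefix(prefix + detokenize([tok_id])):
            logits[tok_id] = -np.inf
    return sample(logits)  # 3. any conventional sampling technique over the valid options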
But how do we express these rules? We can use conventional context-free grammars.

The following is a grammar which defines a subset of JSON in EBNF.

?start: object

?value: object
    | array
    | string
    | "true"             -> true
    | "false"            -> false
    | "null"             -> null

array  : "[" [value ("," value)*] "]"
object : "{" pair ("," pair)* "}"
pair   : string ":" value

letter: "a" | "b" | "c" | "d" | "e" | "f"
    | "i" | "j" | "k" | "l" | "m" | "n"
    | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v"
    | "w" | "x" | "y" | "z" | "_"

string : "\"" letter+ "\""

I passed this definition to the parser generator Lark, which gave me the "checker" mentioned above for the validity of generated sequences. This could then be used at each timestep to screen out invalid token sequences, allowing us to produce the visualizations below.
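Below is a rough sketch of what such a prefix checker could look like with Lark. The exception handling is my assumption about Lark's API rather than the author's actual implementation, and JSON_SUBSET_GRAMMAR stands for the grammar text above as a Python string:

from lark import Lark
from lark.exceptions import UnexpectedCharacters, UnexpectedEOF, UnexpectedToken

parser = Lark(JSON_SUBSET_GRAMMAR, start="start")  # JSON_SUBSET_GRAMMAR: the grammar shown above

def is_valid_prefix(text: str) -> bool:
    # True if `text` parses completely, or fails only because the input ended too early.
    try:
        parser.parse(text)
        return True
    except UnexpectedEOF:
        return True  # incomplete, but nothing seen so far violates the grammar
    except UnexpectedToken as exc:
        return exc.token.type == "$END"  # also just an early end of input
    except UnexpectedCharacters:
        return False  # the lexer rejected a character outright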

Constrained Sampling (Greedy)

We can still use our regular-old sampling methods after we've constrained our tokens. The following are greedily sampled tokens from the constrained set.

Constrained Sampling (Distributional)

Or we can treat the constrained outputs as a distribution and sample from it.
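With the constrained_step sketch from above, the difference between these two variants is only which sampler gets passed in (model, tokens, vocab_size, and detokenize are the same placeholders as before):

# Greedy over the constrained set: take the highest-scoring token the grammar allows.
next_id = constrained_step(model, tokens, vocab_size, is_valid_prefix, detokenize, sample=greedy_sample)

# Distributional: softmax over the surviving logits; masked (-inf) tokens get zero probability.
next_id = constrained_step(model, tokens, vocab_size, is_valid_prefix, detokenize, sample=softmax_sample)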

And that's the gist of constrained sampling. Making this performant in practice can get fairly complicated, but it is even more interesting. I recommend reading the source of llama-grammar.cpp to see how llama.cpp handles the incremental parsing; the principle, however, is the same.

Some Notes

  1. You might be wondering: since we know our grammar and at each generation timestep we have a fixed number of options, can't we just choose from those options and look up the corresponding logit with the highest value? Unfortunately no, because of how tokenization works. Since the model thinks in tokens, the distribution it has learned internally may not align well with our character-by-character trajectories. Concretely, if we force a generation like { followed by }, the tokenization scheme might have a single token for {}. Forcing separate { and } generations can confuse the model and violate its assumptions about token sequences.

  2. Scanning the full set of possible tokens and checking for parse errors at each step can be slow. Instead, for these visualizations, we can prioritize checking the highest-logit tokens first and stop after finding a sufficient number of conformant options. This is akin to top-k sampling (a sketch of this follows these notes).

  3. This was not intended to be comprehensive by any measure. There are many more sampling techniques available including:

    • XTC
    • Mirostat
    • Penalties
    • Top-n Sigma
    • DRY
    • Logit-Bias
    • Temperature Extended

    We also didn't cover search strategies, which look ahead more than one timestep.
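As mentioned in note 2 above, here is a rough sketch of that prefiltering idea: check candidates in descending logit order and stop once enough grammar-conformant options are found (again using the hypothetical is_valid_prefix and detokenize helpers from the sketches above):

import numpy as np

def top_conformant(logits, prefix, is_valid_prefix, detokenize, k=8):
    # Candidate token ids ordered from highest to lowest logit.
    order = np.argsort(logits)[::-1]
    valid = []
    for tok_id in order:
        if is_valid_prefix(prefix + detokenize([int(tok_id)])):
            valid.append(int(tok_id))
            if len(valid) == k:  # stop early instead of scanning the whole vocabulary
                break
    return valid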

If you have any questions or comments, about this post or in general, feel free to reach out to me at michaelgiba@gmail.com.