Grammar-Based Sampling Quick Summary
April 25, 2025
This short post gives an overview of sampling and then focuses on a specific kind of sampling that is useful for producing structured outputs from models.
During inference, LLMs generate sequences of "tokens," the primitives that these systems string together to produce sentences. First, these transformer-based machines produce raw scores, or "logits," over the set of possible tokens; then "sampling" decides which token to choose given those scores.
Greedy Sampling
The simplest possible approach is just to take the token with the largest logit at each timestep, and this is a perfectly viable option. Below you can see, through a visualization, what this looks like for real completions of sample phrases, thanks to phi-4.
To generate all of the below visualizations I added custom logit_processors in a script wrapping llama.cpp inference. This let me both record and manipulate the sampler inputs/outputs. You can find the code here.
As you can see, a single token can represent more than one character to the model; tokens are really just strings of bytes.
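In code, greedy decoding is just an argmax over the logits at each step. Here is a minimal sketch, where model.logits and model.eos_token_id are hypothetical stand-ins for whatever inference API you are wrapping:

import numpy as np

def greedy_decode(model, prompt_tokens, max_new_tokens=32):
    # Repeatedly pick the single highest-scoring token.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model.logits(tokens)         # hypothetical: one score per vocab entry
        next_token = int(np.argmax(logits))   # greedy: take the largest logit
        tokens.append(next_token)
        if next_token == model.eos_token_id:  # hypothetical end-of-sequence id
            break
    return tokens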
Softmax Sampling
Being greedy with token selection is often not what we want; instead, we might want some variability in our sampling. It is common to transform the logits from the model into a distribution through a nice, stable mechanism like the softmax function. Once we have reinterpreted these scores as probabilities, we can sample from them as expected.
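Softmax turns raw logits $z$ into probabilities via $p_i = e^{z_i} / \sum_j e^{z_j}$. A minimal sketch (subtracting the max logit first is the standard numerical-stability trick):

import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

def sample_token(logits, rng=None):
    rng = rng or np.random.default_rng()
    probs = softmax(logits)
    return int(rng.choice(len(probs), p=probs))  # draw one token id from the distribution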
Notice that the model is sometimes not very sure about any particular token option (i.e., the single-timestep probability distribution has high entropy). This can result in some unlikely samplings, which lead the model down vastly different future trajectories.
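(Entropy here is the usual $H(p) = -\sum_i p_i \log p_i$: when the next-token distribution is near uniform, many different tokens are plausible draws.)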
Constrained Sampling
But what if we want to generate outputs with a strict format, or from an enumerable set of options? For example, what if you wanted a model to respond only "true" or "false" to a specific question, or wanted to force the model's responses to fit a specific JSON or YAML schema?
This is where grammar-constrained generation can help. If we can represent our desired output in a specific form, we can programmatically constrain the sampling to guarantee the outputs will match those rules (a code sketch follows the steps below). At each step of the model generation process we take our generation so far $t_0 \dots t_i$ and:
- Get the logits from the model for $t_{i+1}$
- For each candidate token $t_{i+1}$, ask our rules whether $t_0 \dots t_{i+1}$ would be valid; if not, set its score to $-\infty$
- Perform one of our conventional sampling techniques across the valid options
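A sketch of the masking step, assuming a hypothetical detokenize function (token ids to text) and an is_valid_prefix checker like the Lark-based one shown later in the post:

import numpy as np

def mask_invalid(logits, tokens_so_far, detokenize, is_valid_prefix):
    # Copy the logits and knock out every token whose addition would
    # break the grammar by setting its score to -inf.
    masked = np.array(logits, dtype=np.float64)
    for token_id in range(len(masked)):
        candidate = detokenize(tokens_so_far + [token_id])
        if not is_valid_prefix(candidate):
            masked[token_id] = -np.inf
    return masked

This brute-force scan detokenizes the whole vocabulary at every step, which is slow; practical implementations cache grammar state and precompute which token byte strings can follow it.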
The following is a grammar which defines a subset of JSON in EBNF.
?start: object

?value: object
      | array
      | string
      | "true" -> true
      | "false" -> false
      | "null" -> null

array  : "[" [value ("," value)*] "]"
object : "{" pair ("," pair)* "}"
pair   : string ":" value

letter : "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h"
       | "i" | "j" | "k" | "l" | "m" | "n"
       | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v"
       | "w" | "x" | "y" | "z" | "_"

string : "\"" letter+ "\""
I passed this definition to the parser generator Lark, which gave me the "checker" mentioned above for validating generated sequences. This could then be used at each timestep to screen out invalid token continuations, allowing us to produce the visualizations below.
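For a rough idea of what such a checker can look like: Lark's default Earley parser raises UnexpectedEOF when the input ends while a parse could still be completed, which lets us separate "valid so far" from "unsalvageable". A minimal sketch, where grammar holds the definition above:

from lark import Lark
from lark.exceptions import UnexpectedEOF, UnexpectedInput

parser = Lark(grammar)  # `grammar` is the JSON-subset definition above

def is_valid_prefix(text):
    try:
        parser.parse(text)
        return True   # a complete, valid document
    except UnexpectedEOF:
        return True   # consistent so far, just unfinished
    except UnexpectedInput:
        return False  # no continuation can rescue this prefix

One caveat: multi-character terminals like "true" make this approximate, since a generation cut mid-keyword raises UnexpectedCharacters (an UnexpectedInput) rather than UnexpectedEOF; a robust checker needs character-level terminals or extra handling for partial tokens.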
Constrained Sampling (Greedy)
We can still use our regular old sampling methods after we've constrained our tokens. The following are greedily sampled tokens from the constrained set.
Constrained Sampling (Distributional)
Alternatively, we can treat the constrained outputs as a distribution and sample from it.
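Concretely, this just reuses the earlier pieces: mask first, then softmax-and-sample, with the $-\infty$ entries contributing zero probability (this assumes at least one token is still valid):

# Reusing the hypothetical helpers from the sketches above.
masked = mask_invalid(logits, tokens, detokenize, is_valid_prefix)
next_token = sample_token(masked)        # distributional over the valid set
# next_token = int(np.argmax(masked))    # or the greedy variant from the previous section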