Introduction to Formal Methods (Part 2): From Spec to Code
In Part 1, we talked about the “Why”. Why write a spec? Because English is ambiguous, and debugging design flaws in code is expensive. We looked at Quint and modeled a simple TCP Handshake. We verified that our logic was sound (no safety violations).
But as I hinted at the end of Part 1: a spec is just a file. If I go off and write code and ignore the spec, I haven’t really gained anything. In fact, I’ve just wasted time writing a spec.
In this part, we will close the loop. We will use what is called Model-Based Testing to ensure our Python implementation behaves exactly like our verified spec.
The Strategy: Trace Replay
We can’t easily “compile” Quint to Python (yet). And we probably don’t want to, because the spec is an abstraction, not an implementation. The spec doesn’t care about hardware, dependencies or deployment! The implementation does.
Instead, we treat the spec as a test case generator.
- First, generate a trace by using quint to run a simulation and save the sequence of steps (the trace) to a file.
- Then, replay the trace in Python, but instrument it so that it looks like a test.
- For every step in the trace (e.g., SendSyn), execute the corresponding method in our Python class.
- Finally, after each step, check whether our Python object’s state matches the spec’s state.
If the test passes, we know our code handles the scenarios defined by the spec correctly.
Step 1: Generating the Trace
In Part 1, we ran quint run to see text output.
Now, we want a machine-readable format.
Quint supports a format called ITF (the Informal Trace Format).
Luckily, it is a JSON format, which makes it both easy to read and easy to integrate into code.
Running:
quint run --mbt --max-steps=10 --out-itf=trace.itf.json tcp_simple.qnt
This produces a JSON file that looks roughly like this:
{
  "vars": ["client_state", "server_state", "mbt::actionTaken"],
  "states": [
    {
      "#meta": { "index": 0 },
      "client_state": { "tag": "INIT" },
      "server_state": { "tag": "INIT" },
      "mbt::actionTaken": "init"
    },
    {
      "#meta": { "index": 1 },
      "client_state": { "tag": "SYN_SENT" },
      "server_state": { "tag": "INIT" },
      "mbt::actionTaken": "SendSyn"
    },
    ...
  ]
}
It captures the exact state of the system at every step. Note that this is just one possible execution path. In the “Scaling Up” section below, we will discuss how to test against many random traces.
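Before wiring the trace into a test, it can be worth a quick sanity check. Here is a minimal sketch (the file name inspect_trace.py is just for illustration) that prints the sequence of actions the simulator took, using only the keys shown in the JSON above:
# inspect_trace.py
import json

with open("trace.itf.json") as f:
    trace = json.load(f)

# Print each step's index and the action the simulator took.
for state in trace["states"]:
    print(state["#meta"]["index"], state["mbt::actionTaken"])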
Step 2: The Python Implementation
Now let’s write an implementation. We want to make sure that invalid states are actually unrepresentable. We will use Pydantic, Enums, Tagged Unions, and one of my favourite newish features of Python, Pattern Matching, for this.
# tcp.py
from enum import Enum
from typing import Literal, Union, Annotated

from pydantic import BaseModel, Field

class State(str, Enum):
    INIT = "INIT"
    SYN_SENT = "SYN_SENT"
    SYN_RCVD = "SYN_RCVD"
    ESTABLISHED = "ESTABLISHED"

# We define each valid "System State" as a separate Model.
class InitState(BaseModel):
    tag: Literal["Init"] = "Init"
    client_state: Literal[State.INIT] = State.INIT
    server_state: Literal[State.INIT] = State.INIT

class SynSentState(BaseModel):
    tag: Literal["SynSent"] = "SynSent"
    client_state: Literal[State.SYN_SENT] = State.SYN_SENT
    server_state: Literal[State.INIT] = State.INIT

# ... other valid state models (SynRcvdState, FullyEstablishedState, etc.) ...

TCPState = Annotated[
    Union[InitState, SynSentState, ...],  # All valid states
    Field(discriminator="tag"),
]

class TCPModel:
    def __init__(self):
        self.state: TCPState = InitState()

    def send_syn(self):
        match self.state:
            case InitState():
                self.state = SynSentState()
                return True
            case _:
                return False

    # ... receive_syn, receive_syn_ack, receive_ack, etc.
This looks simple, but notice how the logic in send_syn uses the match statement.
By using specific Pydantic models for each state, it becomes impossible to even construct an invalid state (like Server being ESTABLISHED while Client is INIT).
If we messed up the transition logic, the state wouldn’t match the spec.
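For reference, here is one plausible way to fill in the elided pieces. The exact intermediate states and method bodies are my sketch based on the handshake from Part 1, not code lifted from the repository:
# tcp.py (continued): one possible shape for the elided parts
class SynRcvdState(BaseModel):
    tag: Literal["SynRcvd"] = "SynRcvd"
    client_state: Literal[State.SYN_SENT] = State.SYN_SENT
    server_state: Literal[State.SYN_RCVD] = State.SYN_RCVD

class FullyEstablishedState(BaseModel):
    tag: Literal["FullyEstablished"] = "FullyEstablished"
    client_state: Literal[State.ESTABLISHED] = State.ESTABLISHED
    server_state: Literal[State.ESTABLISHED] = State.ESTABLISHED

# The remaining transitions on TCPModel follow the same pattern as send_syn,
# for example (continuing the class from above):
    def receive_syn(self):
        match self.state:
            case SynSentState():
                # The server has seen the SYN; the client still waits for the SYN-ACK.
                self.state = SynRcvdState()
                return True
            case _:
                return False
Each transition is only legal from exactly one system state, so a trace step that arrives out of order simply returns False and fails the test.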
Step 3: The Replay Test
The way we ensure the implementation matches the spec is by doing replay tests. We write a test that reads the JSON trace and drives the Python model.
# test_tcp.py (simplified)
import json

from tcp import TCPModel

def main():
    with open("trace.itf.json") as f:
        trace = json.load(f)

    model = TCPModel()

    # Skip index 0 as it is the initial state
    for i, state_json in enumerate(trace["states"][1:], 1):
        action = state_json["mbt::actionTaken"]

        match action:
            case "SendSyn":
                success = model.send_syn()
            case "ReceiveSyn":
                success = model.receive_syn()
            # ... handle other actions ...
            case _:
                raise Exception(f"Unknown action {action} at step {i}")

        if not success:
            raise Exception(f"Action {action} failed at step {i}")

        # Verify state matches
        assert model.state.client_state.value == state_json["client_state"]["tag"]
        assert model.state.server_state.value == state_json["server_state"]["tag"]

    print("Trace verified successfully!")

if __name__ == "__main__":
    main()
It’s a little bit finicky, since we have to match each action name from the spec to the corresponding function/method in the code. But it does what we want: it reads the action from the trace, executes it on the model, and asserts that the resulting state matches the spec. If the implementation (Python) and the spec (Quint) disagree, this test fails.
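If the match statement becomes tedious as the number of actions grows, one alternative (my sketch, not something from the post’s repository) is to derive the method name from the spec’s action name and dispatch with getattr:
# Dispatch spec actions to methods by naming convention ("SendSyn" -> send_syn).
import re

def to_snake_case(action: str) -> str:
    return re.sub(r"(?<!^)(?=[A-Z])", "_", action).lower()

def apply_action(model, action: str) -> bool:
    handler = getattr(model, to_snake_case(action), None)
    if handler is None:
        raise Exception(f"No handler for spec action {action}")
    return handler()
With a helper like this, the loop body in the test collapses to success = apply_action(model, action).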
Why is this powerful?
- It gives us fuzzing for free. Quint’s random simulation generates edge cases we might forget to test manually.
- The spec is documentation, and the tests ensure the code respects it.
- We made invalid states unrepresentable (within the margins of how strict one can be with types in Python). By using Pydantic tagged unions, we ensure that the code can only ever be in a valid state.
Scaling Up
In this simple TCP example, the logic is linear because of guards we put in the spec, so every random trace looks identical. However, for complex protocols, we typically run this process in a loop (generating lots of traces). Since Quint picks random paths, this effectively fuzzes any implementation against the spec.
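In practice, that loop can be as simple as invoking quint repeatedly and replaying each resulting trace. A rough sketch (the file name and the replay_trace helper are mine; replay_trace would wrap the loop from test_tcp.py):
# fuzz_traces.py (sketch)
import json
import subprocess
import tempfile
from pathlib import Path

N_TRACES = 50

def replay_trace(trace: dict) -> None:
    # Same logic as test_tcp.py: drive a fresh TCPModel through the trace
    # and assert that every state matches the spec.
    ...

def main():
    with tempfile.TemporaryDirectory() as tmp:
        for n in range(N_TRACES):
            out = Path(tmp) / f"trace_{n}.itf.json"
            # Same command as in Step 1; each run picks a random execution path.
            subprocess.run(
                ["quint", "run", "--mbt", "--max-steps=10",
                 f"--out-itf={out}", "tcp_simple.qnt"],
                check=True,
            )
            replay_trace(json.loads(out.read_text()))
    print(f"Replayed {N_TRACES} traces successfully!")

if __name__ == "__main__":
    main()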
However, note that this approach has its limits. More complicated specs have many different trace paths, and we cannot possibly test against all of them. But testing a sample of traces is definitely better than testing none.
What about Invariants?
You might ask: “Where are we checking the invariants (like Safety) in the Python test?”
Well, we don’t; Quint does!
During the simulation phase, if a sequence of steps leads to a violation,
Quint reports it as a Violation Error.
The job of the trace replay test is purely to ensure that the code conforms to the spec.
And if the code matches the spec, we can be reasonably confident that the code is correct.
The Caveat: We still need Unit Tests
Formal methods are great for logic and state machines, but they don’t replace unit tests entirely. Specs often abstract away details. For example, in TLS, the spec might say:
action Encrypt = {
  encrypted_data' = encrypt(data, key)
}
The spec assumes encrypt works mathematically.
It doesn’t check whether your AES-GCM implementation handles nonces and authentication tags correctly, or whether you have an off-by-one error in your buffer allocation.
For those lower-level implementation details, standard unit tests are still required.
We use formal methods to verify the orchestration and logic, and unit tests to verify the primitives.
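As a contrived illustration, the primitive-level check can be an ordinary unit test, something like the following (encrypt and decrypt are hypothetical functions, not code from this post):
# test_crypto.py (illustrative only)
from crypto import encrypt, decrypt  # hypothetical primitives

def test_round_trip():
    key = b"\x00" * 32
    assert decrypt(encrypt(b"hello", key), key) == b"hello"

def test_empty_payload():
    key = b"\x00" * 32
    assert decrypt(encrypt(b"", key), key) == b""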
I believe verification-aware languages like Dafny could possibly bridge this gap, but that is a topic for another day.
Conclusion
So, we went from a high-level requirement to a formal spec in Quint. We verified the spec via model checking in Quint. Then we actually wrote an implementation and showed a mechanism for checking that the implementation matches the spec (in a fuzzy sense). Pretty neat!
I hope that is a convincing argument to think about adopting specs to make working with Agents easier. Instead of struggling with English to produce concrete requirements, we can collaborate with our favourite agent to produce a spec for the set of components we are building. And if the tooling is in place, we can just tell the agent to build the component, and the trace tests will make sure we adhere to the spec.
Code
The code used in this post can be found here.