Communication closure

2026-05-27

Distributed protocols define how nodes should behave in a network to solve a collective problem. Sometimes, they don't need to communicate with each other. But most times, they need to send and receive messages. In my PhD, I focused on a specific class of distributed protocols called consensus protocols where nodes have to commonly agree on a value. These protocols are complex, and therefore implementing them correctly is challenging and bug prone. I developed new techniques to test the implementation and find bugs.

But prior to my work, my advisors had worked on an approach that depended on a particular property of these distributed protocols called Communication Closure. Why? it makes testing distributed protocols easier. I try to prove the property for two very popular consensus protocols, Paxos and Raft in Lean (with the help of AI).

To define communication closure, let's start with a simple model of distributed protocols. Protocols are state machines that define how each node should behave. The edges are labelled with messages received or sent. Most protocols tag messages with round numbers or counters. Rounds in Paxos and Terms in Raft. The trace of a distributed system is a sequence of states and messages. The trace can be projected onto a node, as a sequence of local states with messages sent and received by that node.

Communication closure (as defined in the testing paper) is stated as follows: For any trace where messages of different rounds intermingle (I'll explain in a bit), there is an equivalent trace where messages of the same round are grouped together. The figure below illustrates it more abstractly (borrowed from the paper with style adaptations).

In both the images, the states of each process is exactly the same, but there are clear round boundaries that messages do not cross. For this property to hold in a distributed protocol, it should essentially discard messages from rounds that are not current (locally).

I used both Claude and Codex to prove the communication closure of Paxos and Raft. All the proofs in lean are available here - zeu5/communication-closure.

First, I asked it define the Paxos and Raft protocols in Lean. For Paxos I added the original paper by Lamport as reference. For Raft, I added a TLA+ model with membership changes included. It was no suprise to me that the protocols were correctly defined in one shot. (It might've been surprising before November 2025.)

Then, I provided an informal definition of Communication Closure and asked it prove the property for both Raft and Paxos. For Paxos, it came up with the following property,

				
/--
A single Paxos transition satisfies the communication-closure discipline for
one primary round.
-/
def CommunicationClosedStep (s s' : State p) : Prop :=
∃ r : p.Round,
	CommunicationClosedRound s s' r (emptyMedium p) (emptyMedium p) ∧
	PrimaryRoundChangesOnly s s' r ∧
	ActionReadsMatchingRound s s' r
				
				

Which is rather trivial. It states that between two states in a Paxos trace, only the primary (or leader node) in Paxos can change rounds. For the remaining nodes, the action associated with the step matches the round in the state. Paxos is naturally communication closed. After I provided the formal definition from the paper, Claude/Codex was able to encode the right property and also prove it for Paxos.

However, Raft is not communication closed in the traditional sense of the definition. When a node receives a message from a higher term, it affects the state of the node. Further, the state of each node contains information from older terms in its log and affect future decisions. To prove Communication Closure of Raft, I had to try several different ways to model. But eventually, I settled on the following trick - to treat the decision index of the log as the round instead of the term number contained in the message. Claude introduced all the necessary mechanics and lemmas to prove a variant of Communication Closure for Raft:

				
/-- For any trace, there exists an equivalent trace with commit-index closure. -/
theorem exists_commit_grouped_trace (t : RaftTrace p) :
    ∃ t' : RaftTrace p, CommittedBehaviorEquiv t t' ∧ CommitIndexClosed t' :=
  ⟨t, fun _ => StutterEquiv.refl _, fun _ _ _ => trivial⟩
				
				

For every Raft trace, there exists an equivalent trace with commit-index closure. The proof is trivial as the supporting lemmas show that commit index closure is true for every trace of Raft.

That brings me to the following open question: To show communication closure of any consensus protocol, is it sufficient to show that there is an attribute in the state of each process that progresses monotonically? Is that attribute always the commit index of the log?

To conclude, AI was very helpful since I did not know Lean. This was an experience in learning how to read and prove trivial properties in Lean but also get to very complex stuff without spending all the time to skill up.

#tech