Tutorial — 30 min

Formal Methods

Understanding Software Through Mathematical Eyes

Ferhat Erata

CFG DFG PDG Slicing Alloy Hypothesis KLEE

      code → graph → analysis → understanding
    

The Problem

We write code. But do we understand it?

Humans write code in a programming language
Semantics tells us what each statement means — but not what the whole program does
We need mathematical models that make invisible structure visible
This talk: code → mathematical model → tool-supported analysis

"Take one step back. What is your mental image of software?"

Roadmap

Six lenses in 30 minutes

Time	Topic	Key Question
5 min	Control Flow Graph (CFG)	What paths can execution take?
5 min	Data Flow Graph (DFG)	Where do values come from and go to?
5 min	Program Dependence Graph (PDG)	What depends on what? → Slicing
5 min	Property-Based Testing (Hypothesis)	Does it hold for many random inputs?
5 min	Concolic Testing (KLEE)	Can we explore every path?
5 min	Formal Specification (Alloy)	What should the system do?

The Big Picture

A stack of mathematical lenses

graph TD
    SRC["Source Code<br>what humans write  "] --> CFG
    CFG["CFG<br>what paths exist?  "] --> DFG
    DFG["DFG<br>where do values flow?  "] --> PDG
    PDG["PDG<br>what depends on what?  "] --> SLC["Slicing<br>minimal relevant subset  "]

    SRC --> SPEC["Formal Spec — Alloy<br>is the design consistent?  "]
    SRC --> PROP["Property Testing — Hypothesis<br>do invariants hold?  "]
    SRC --> CONC["Concolic Testing — KLEE<br>what does every path do?  "]

    style SRC fill:#1a1a2e,stroke:#6c8cff,color:#e8e8f0
    style CFG fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style DFG fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style PDG fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style SLC fill:#1a1a2e,stroke:#ffaa44,color:#e8e8f0
    style SPEC fill:#1a1a2e,stroke:#ff6c8c,color:#e8e8f0
    style PROP fill:#1a1a2e,stroke:#ff6c8c,color:#e8e8f0
    style CONC fill:#1a1a2e,stroke:#ff6c8c,color:#e8e8f0

All Graph Analysis Alloy Hypothesis KLEE

Graph-Based Analysis

CFG → DFG → PDG → Slicing

Build the graph. Traverse the graph. Answer the question.
Each layer adds a new kind of edge.

Specification & Testing

Alloy — check designs before code exists
Hypothesis — check properties across many random inputs
KLEE — systematically explore every path

We'll go through each of these one by one — then return to this picture at the end.

The Compiler Pipeline

Where do these representations live?

graph LR
    SRC["Source Code<br>Python, C, Java, ...  "] --> LEX["Lexer /<br>Parser  "]
    LEX --> AST["AST  "]
    AST --> SEM["Semantic<br>Analysis  "]
    SEM --> IR["IR / SSA<br>LLVM IR, GCC GIMPLE  "]
    IR --> OPT["Optimizer  "]
    OPT --> CG["Code Gen  "]
    CG --> BIN["Machine Code<br>x86, ARM, WASM  "]

    style SRC fill:#1a1a2e,stroke:#6c8cff,color:#e8e8f0
    style LEX fill:#1a1a2e,stroke:#2a2a3a,color:#8888a0
    style AST fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style SEM fill:#1a1a2e,stroke:#2a2a3a,color:#8888a0
    style IR fill:#1a1a2e,stroke:#ffaa44,color:#e8e8f0
    style OPT fill:#1a1a2e,stroke:#ff6c8c,color:#e8e8f0
    style CG fill:#1a1a2e,stroke:#2a2a3a,color:#8888a0
    style BIN fill:#1a1a2e,stroke:#6c8cff,color:#e8e8f0

Front-End

Source → AST
(Abstract Syntax Tree)

Lexing, parsing, type checking.
The AST is the first structured representation of your code.

Tools: tree-sitter, clang, javac

Middle-End (IR)

AST → CFG + SSA

The IR is where analysis happens:
CFG = control flow structure
DFG = data flow (def-use chains)
PDG = dependence → slicing

This is what we'll study today.

SSA Form

Static Single Assignment

Every variable is assigned exactly once.
Multiple assignments → distinct names.

φ nodes merge values at join points:
%x = φ(v₁, v₂)

Makes def-use chains explicit — the DFG is built into the IR.

Back-End

IR → Machine Code

Register allocation, instruction selection, scheduling.
The optimizer transforms the IR using the same graphs we'll see.

Tools: LLVM, GCC, V8 TurboFan

Control Flow Graph

What paths can execution take?

def fee(age, dist): if age < 25: rate = 3 else: rate = 2 if dist > 100: surcharge = 20 else: surcharge = 0 return dist * rate + surcharge

Key Concepts

Nodes = basic blocks (straight-line code)
Edges = possible transfers of control
Path = one possible execution
4 paths = 2 binary decisions = 2² leaves

graph TD
    A["entry: age, dist  "] --> B{"age < 25 ?"}
    B -- T --> C["rate = 3  "]
    B -- F --> D["rate = 2  "]
    C --> E{"dist > 100 ?"}
    D --> E
    E -- T --> F["surcharge = 20  "]
    E -- F --> G["surcharge = 0  "]
    F --> H["return dist*rate<br>+ surcharge  "]
    G --> H

    style A fill:#1a1a2e,stroke:#6c8cff,color:#e8e8f0
    style B fill:#1a1a2e,stroke:#ffaa44,color:#e8e8f0
    style C fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style D fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style E fill:#1a1a2e,stroke:#ffaa44,color:#e8e8f0
    style F fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style G fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style H fill:#1a1a2e,stroke:#6c8cff,color:#e8e8f0

CFG with Nested Loops

Back-edges reveal loop structure

def matrix_search(matrix, target): for i in range(len(matrix)): for j in range(len(matrix[i])): if matrix[i][j] == target: return (i, j) return (-1, -1)

All Outer Loop Inner Loop

Nested Loop Structure

Outer back-edge: G → B (outer loop)
Inner back-edge: F → C (inner loop)
Outer header: B — dominates everything inside
Inner header: C — dominates D, E, F
Early exit: E breaks out of both loops

Why It Matters

Loop nesting depth drives complexity analysis
Dominance tree: B dom C dom D dom E
Loop-invariant code motion: hoist from inner
to outer, or out of both loops entirely

graph TD
    A["A: entry  "] --> B{"B: i < rows ?"}
    B -- F --> H["H: return (-1,-1)  "]
    B -- T --> C{"C: j < cols ?"}
    C -- F --> G["G: i += 1  "]
    C -- T --> D{"D: matrix[i][j]<br>== target ?"}
    D -- T --> E["E: return (i,j)  "]
    D -- F --> F["F: j += 1  "]
    F -->|"inner back-edge"| C
    G -->|"outer back-edge"| B

    style A fill:#1a1a2e,stroke:#6c8cff,color:#e8e8f0
    style B fill:#1a1a2e,stroke:#ffaa44,color:#e8e8f0
    style C fill:#1a1a2e,stroke:#ff6c8c,color:#e8e8f0
    style D fill:#1a1a2e,stroke:#ffaa44,color:#e8e8f0
    style E fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style F fill:#1a1a2e,stroke:#ff6c8c,color:#e8e8f0
    style G fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style H fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0

LLVM IR

The CFG in the compiler's representation

Source (Python)

def fee(age, dist): if age < 25: rate = 3 else: rate = 2 if dist > 100: surcharge = 20 else: surcharge = 0 return dist * rate + surcharge

Source (C)

int fee(int age, int dist) { int rate; if (age < 25) rate = 3; else rate = 2; int surcharge; if (dist > 100) surcharge = 20; else surcharge = 0; return dist * rate + surcharge; }

Different frontends, same IR → same analyses

LLVM IR

define i32 @fee(i32 %age, i32 %dist) { entry: %cmp1 = icmp slt i32 %age, 25 br i1 %cmp1, label %young, label %older young: br label %check_dist older: br label %check_dist check_dist: %rate = phi i32 [3, %young], [2, %older] %cmp2 = icmp sgt i32 %dist, 100 br i1 %cmp2, label %far, label %near far: br label %compute near: br label %compute compute: %sur = phi i32 [20, %far], [0, %near] %base = mul i32 %dist, %rate %total = add i32 %base, %sur ret i32 %total }

What to Notice

Basic Blocks = CFG Nodes

Each label: is a basic block.
Straight-line instructions, ending
with a br (branch) or ret.

SSA = Explicit Data Flow

Every %var is assigned exactly once.
phi nodes merge values from different incoming edges — the DFG is built into the IR.

Two Independent phi Nodes

%rate merges from young/older
%sur merges from far/near
Two independent data flows
that converge in %compute.

The CFG and DFG aren't abstractions you draw on a whiteboard — they're the actual compiler IR.

Data Flow Graph

Where do values come from and go to?

def fee(age, dist): if age < 25: rate = 3 else: rate = 2 if dist > 100: surcharge = 20 else: surcharge = 0 return dist * rate + surcharge

Key Concepts

DEF = variable is assigned a value
USE = variable is read
Def-use chain = which DEFs reach a USE?
Reaching defs = at return, rate could be 3 OR 2

Why it matters:
Constant propagation — replace var with known value
Dead code elimination — DEF with no USE
Uninitialized variable — USE with no DEF
Two independent def-use chains visible here!

graph LR
    subgraph age_rate ["age → rate channel"]
        DA["DEF(age)<br>@ param  "] -.->|def-use| UA["USE(age)<br>@ age < 25  "]
        DR1["DEF(rate)<br>@ rate = 3  "] -.->|def-use| UR["USE(rate)<br>@ return  "]
        DR2["DEF(rate)<br>@ rate = 2  "] -.->|def-use| UR
    end
    subgraph dist_surcharge ["dist → surcharge channel"]
        DD["DEF(dist)<br>@ param  "] -.->|def-use| UD1["USE(dist)<br>@ dist > 100  "]
        DD -.->|def-use| UD2["USE(dist)<br>@ return  "]
        DS1["DEF(surcharge)<br>@ surcharge = 20  "] -.->|def-use| US["USE(surcharge)<br>@ return  "]
        DS2["DEF(surcharge)<br>@ surcharge = 0  "] -.->|def-use| US
    end

    style DA fill:#1a1a2e,stroke:#6c8cff,color:#e8e8f0
    style DR1 fill:#1a1a2e,stroke:#6c8cff,color:#e8e8f0
    style DR2 fill:#1a1a2e,stroke:#6c8cff,color:#e8e8f0
    style DD fill:#1a1a2e,stroke:#ff6c8c,color:#e8e8f0
    style DS1 fill:#1a1a2e,stroke:#ff6c8c,color:#e8e8f0
    style DS2 fill:#1a1a2e,stroke:#ff6c8c,color:#e8e8f0
    style UA fill:#13131a,stroke:#2a2a3a,color:#8888a0
    style UR fill:#13131a,stroke:#ffaa44,color:#e8e8f0
    style UD1 fill:#13131a,stroke:#2a2a3a,color:#8888a0
    style UD2 fill:#13131a,stroke:#ffaa44,color:#e8e8f0
    style US fill:#13131a,stroke:#ffaa44,color:#e8e8f0

All age → rate dist → surcharge return (all USEs)

Program Dependence Graph & Slicing

What depends on what?

graph TD
    Entry["entry  "] ==>|ctrl| P1{"age < 25 ?"}
    Entry ==>|ctrl| P2{"dist > 100 ?"}
    Entry ==>|ctrl| RET["return<br>dist*rate+surcharge  "]
    P1 ==>|ctrl| S1["rate = 3  "]
    P1 ==>|ctrl| S2["rate = 2  "]
    P2 ==>|ctrl| S3["surcharge = 20  "]
    P2 ==>|ctrl| S4["surcharge = 0  "]
    Entry -.->|data| P1
    Entry -.->|data| P2
    Entry -.->|data| RET
    S1 -.->|data| RET
    S2 -.->|data| RET
    S3 -.->|data| RET
    S4 -.->|data| RET

    style Entry fill:#1a1a2e,stroke:#6c8cff,color:#e8e8f0
    style P1 fill:#1a1a2e,stroke:#ffaa44,color:#e8e8f0
    style P2 fill:#1a1a2e,stroke:#ffaa44,color:#e8e8f0
    style S1 fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style S2 fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style S3 fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style S4 fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style RET fill:#1a1a2e,stroke:#6c8cff,color:#e8e8f0

All Backward: rate = 3 Backward: surcharge = 20 Forward: age < 25? Forward: dist > 100?

PDG = Control deps + Data deps
Solid ⇒ = control dependence
Dashed ‐‐> = data dependence

Program Slicing

"What is the minimal subset of the program that could affect variable v at line n?"

Backward slice: trace deps backwards — what contributed to this value?
Forward slice: trace deps forward — what does this statement affect?

Click the buttons below the graph to visualize slices.

Extensions

SDG (System Dependence Graph) — PDG + interprocedural edges (call/return) for whole-program slicing

Property-Based Testing

Hypothesis: Does it hold for many inputs?

from hypothesis import given, strategies as st # Property: fee is always non-negative @given(st.integers(0, 120), st.integers(0, 1000)) def test_fee_non_negative(age, dist): assert fee(age, dist) >= 0 # Property: young drivers always pay more @given(st.integers(0, 1000)) def test_young_pay_more(dist): young = fee(20, dist) older = fee(30, dist) assert young >= older # Property: fee is linear in distance @given(st.integers(0, 120), st.integers(1, 998)) def test_fee_linear(age, dist): step = fee(age, dist + 1) - fee(age, dist) prev = fee(age, dist) - fee(age, dist - 1) assert step == prev # Hypothesis generates 100s of inputs # and SHRINKS failures to minimal cases # # Found: fee(20,101)−fee(20,100) = 23 # but fee(20,100)−fee(20, 99) = 3 # surcharge jump breaks linearity!

Key Idea

You write properties (invariants), not examples.
The framework generates inputs — including edge cases you'd never think of.
Failing inputs are shrunk to the simplest counterexample.

              Live Generation
              ● running
            

The gap it bridges:

Example tests "works on my cases"
↓
Property tests "works for many random inputs"
↓
Formal proof "works for all inputs, proven"

Concolic Testing

KLEE: Concrete + Symbolic = every path

def fee(age, dist): if age < 25: rate = 3 else: rate = 2 if dist > 100: surcharge = 20 else: surcharge = 0 return dist*rate+surcharge

graph TD
    R["fee(age, dist)  "] --> A{"age < 25 ?"}
    A -- "T: age<25" --> B1{"dist > 100 ?"}
    A -- "F: age≥25" --> B2{"dist > 100 ?"}
    B1 -- T --> P1["path 1:<br>3×dist + 20  "]
    B1 -- F --> P2["path 2:<br>3×dist  "]
    B2 -- T --> P3["path 3:<br>2×dist + 20  "]
    B2 -- F --> P4["path 4:<br>2×dist  "]

    style R fill:#1a1a2e,stroke:#6c8cff,color:#e8e8f0
    style A fill:#1a1a2e,stroke:#ffaa44,color:#e8e8f0
    style B1 fill:#1a1a2e,stroke:#ffaa44,color:#e8e8f0
    style B2 fill:#1a1a2e,stroke:#ffaa44,color:#e8e8f0
    style P1 fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style P2 fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style P3 fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style P4 fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0

Overview 1. Concrete Run 2. Collect π 3. Negate 4. New Path 5. All Explored

How Concolic Works

1. Run concretely: $age{=}30,\; dist{=}50$
   → takes path F, F

2. Collect path constraint:
   $(age \ge 25) \;\land\; (dist \le 100)$

3. Negate last conjunct:
   $(age \ge 25) \;\land\; \color{#ff6c8c}{(dist > 100)}$

4. Solve → $age{=}30,\; dist{=}150$

5. Run with new input → new path

6. Repeat until all 4 paths explored

KLEE found bugs in all UNIX coreutils
(cat, echo, ls, ...) that were there for decades.

Formal Specification (Bounded Model Checking)

Alloy: What should the system do?

-- A simple access control model
sig User {}
sig Resource {}
sig Permission {
    user: one User,
    resource: one Resource,
    level: one Level
}
enum Level { Read, Write, Admin }
-- No user can have Write without Read
fact escalation {
    all p: Permission |
        p.level = Write implies
        some q: Permission |
            q.user = p.user
            and q.resource = p.resource
            and q.level = Read
}
-- Can a user get Admin on everything?
run find_omnipotent {
    some u: User | all r: Resource |
        some p: Permission |
            p.user = u and p.resource = r
            and p.level = Admin
} for 4

Overview 1. Signatures 2. Constraints 3. Analysis 4. Counterexample

sig = Signature (a type)

Each sig declares a set of atoms. Think of it like a class, but with no methods — only structure.

one = exactly one relation. Alloy is relational: fields are binary relations, not values.

run = "Find me an example"

The Analyzer searches all possible worlds with up to 4 atoms per sig.

for 4 = scope bound. Alloy checks every combination within this bound. Small scope hypothesis: most bugs show up in small instances.

If it finds one → your spec allows it.
If no instance exists → your constraints prevent it.

This is bounded model checking via SAT solving under the hood.

Key Idea

Declarative: say what you want, not how

The Alloy Analyzer exhaustively checks all instances up to a bound.

Finds counterexamples automatically — "here's a state your spec allows but you didn't intend."

Used for: Protocol design, API modeling, architecture validation, security policy checking

The Big Picture

A stack of mathematical lenses

graph TD
    SRC["Source Code<br>what humans write  "] --> CFG
    CFG["CFG<br>what paths exist?  "] --> DFG
    DFG["DFG<br>where do values flow?  "] --> PDG
    PDG["PDG<br>what depends on what?  "] --> SLC["Slicing<br>minimal relevant subset  "]

    SRC --> SPEC["Formal Spec — Alloy<br>is the design consistent?  "]
    SRC --> PROP["Property Testing — Hypothesis<br>do invariants hold?  "]
    SRC --> CONC["Concolic Testing — KLEE<br>what does every path do?  "]

    style SRC fill:#1a1a2e,stroke:#6c8cff,color:#e8e8f0
    style CFG fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style DFG fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style PDG fill:#1a1a2e,stroke:#4cde80,color:#e8e8f0
    style SLC fill:#1a1a2e,stroke:#ffaa44,color:#e8e8f0
    style SPEC fill:#1a1a2e,stroke:#ff6c8c,color:#e8e8f0
    style PROP fill:#1a1a2e,stroke:#ff6c8c,color:#e8e8f0
    style CONC fill:#1a1a2e,stroke:#ff6c8c,color:#e8e8f0

All Graph Analysis Alloy Hypothesis KLEE

Graph-Based Analysis

CFG → DFG → PDG → Slicing

Build the graph. Traverse the graph. Answer the question.
Each layer adds a new kind of edge.

Specification & Testing

Alloy — check designs before code exists
Hypothesis — check properties across many random inputs
KLEE — systematically explore every path

Mathematical models make the invisible visible.
Tools automate what humans can't do by hand.

Why Now?

Formal Methods × Large Language Models

LLM Inference — Context Window

LLMs have a finite context window. You can't feed an entire codebase into a prompt.

Program slicing gives the principled answer: the backward slice is exactly the minimal code the model needs to reason about a bug or feature.

Slicing turns a whole-program problem into a focused, context-window-sized problem.

LLM Inference — Structured Generation

CFGs and formal specs can constrain LLM output at decode time — ensuring generated code follows valid control flow or satisfies type/API contracts.

Training — Data Quality

Property-based testing can validate training data at scale — checking that code examples satisfy invariants before they enter the training set.

Garbage in, garbage out — formal properties are a filter.

Training — Reward & Verification

Formal specifications can serve as automatic reward signals for RLHF — did the generated code satisfy the spec?

KLEE and Alloy can verify LLM-generated code without running user tests.

Formal methods aren't replaced by LLMs — they're the missing verification layer that makes LLM-generated code trustworthy.

Takeaway

We don't just write code.

We build mathematical models of code —

and let tools reason for us.

CFGs are in every compiler
Slicing is in every modern IDE — and the key to fitting code into LLM context windows
KLEE found bugs in coreutils that were there for decades
Formal specs can be reward signals and verification oracles for LLM-generated code
These aren't academic curiosities — they're already in your tools

"The mathematical models are already under the hood. Now you know what's there."