Bayesian networks tutorial

Introduction

A Bayesian network is a compact way to describe uncertainty when several variables influence one another. Instead of storing one huge probability table for every possible combination, it uses a directed acyclic graph plus local conditional probabilities to represent the same information much more efficiently.

What a Bayesian Network Represents

A Bayesian network has two parts.

A graph, where each node is a random variable.
A conditional probability table for each node.

An arrow means that one variable directly helps explain another. If Rain points to WetGrass, the model is saying that the chance of wet grass depends directly on whether it rained.

The graph must be acyclic. In practice, that means you can follow arrows forward, but you can never loop back to where you started. This matters because the joint distribution is factorized by visiting the nodes in an order where every parent comes before its children, and a cycle makes such an order impossible.
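The acyclicity requirement can be checked mechanically. The following is a minimal sketch, assuming the graph is stored as a mapping from each node to its list of parents; the helper name `is_acyclic` and the two tiny example graphs are illustrative, not part of any particular library. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later), which raises `CycleError` when no valid ordering exists.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(parents):
    """Return True if the node -> parents mapping describes a DAG."""
    try:
        # TopologicalSorter expects node -> predecessors, which is
        # exactly a parent map. Consuming static_order() forces the check.
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


# Hypothetical two-node graphs for illustration.
dag = {"Rain": [], "WetGrass": ["Rain"]}
cyclic = {"Rain": ["WetGrass"], "WetGrass": ["Rain"]}

print(is_acyclic(dag))
print(is_acyclic(cyclic))
```

The ordering produced by `static_order()` is also a valid order in which to multiply the local probabilities, since parents always appear before their children.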

A common beginner mistake is to think the arrows always represent physical causality. They often do, but not always. Formally, an edge only says that the child's probability distribution is specified conditional on the parent.

Factorization and Conditional Independence

The main value of a Bayesian network is that it turns a large joint distribution into smaller local pieces.

For a chain A -> B -> C, the joint distribution becomes:

text

P(A, B, C) = P(A) * P(B | A) * P(C | B)

That factorization says something important: once B is known, C does not need to look directly at A. This is a conditional independence statement. Bayesian networks are useful because the graph makes those statements explicit.
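The independence claim for the chain can be verified numerically. The sketch below assumes binary variables and uses made-up probabilities (`p_a`, `p_b_given_a`, and `p_c_given_b` are illustrative values, not from any dataset). It builds the joint from the factorization and shows that P(C | A, B) does not change when A changes.

```python
# Hypothetical CPTs for the chain A -> B -> C (all variables binary).
p_a = 0.3                                 # P(A=True)
p_b_given_a = {True: 0.9, False: 0.2}     # P(B=True | A)
p_c_given_b = {True: 0.7, False: 0.1}     # P(C=True | B)


def bern(p_true, value):
    """Probability of `value` when P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c):
    """P(A, B, C) = P(A) * P(B | A) * P(C | B)."""
    return bern(p_a, a) * bern(p_b_given_a[a], b) * bern(p_c_given_b[b], c)


def p_c_given_ab(a, b):
    """P(C=True | A=a, B=b), computed from the joint."""
    return joint(a, b, True) / (joint(a, b, True) + joint(a, b, False))


for b in (True, False):
    for a in (True, False):
        print(f"P(C=True | A={a}, B={b}) = {p_c_given_ab(a, b):.4f}")
```

For each fixed value of B, the two printed rows agree, which is exactly what "C is independent of A given B" means.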

Here is a more familiar example from alarm systems:

text

Burglary -> Alarm <- Earthquake
Alarm -> JohnCalls
Alarm -> MaryCalls

That network factorizes as:

text

P(Burglary, Earthquake, Alarm, JohnCalls, MaryCalls)
= P(Burglary)
* P(Earthquake)
* P(Alarm | Burglary, Earthquake)
* P(JohnCalls | Alarm)
* P(MaryCalls | Alarm)

Without the network structure, you would need to store probabilities for every full combination of values. With the network, each variable only needs a local rule based on its parents.
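To make the saving concrete, the following sketch counts how many independent numbers each representation needs for the alarm network. It assumes every variable is binary, which the text does not state explicitly; the variable names `alarm_parents`, `full_joint_params`, and `network_params` are illustrative.

```python
# Parent sets for the alarm network, assuming binary variables.
alarm_parents = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"],
    "MaryCalls": ["Alarm"],
}

# A full joint table has one entry per combination of values; the last
# entry is determined by the others because everything sums to 1.
full_joint_params = 2 ** len(alarm_parents) - 1

# Each binary node needs one number, P(node=True | parents), per
# combination of its parents' values.
network_params = sum(2 ** len(ps) for ps in alarm_parents.values())

print(full_joint_params)  # 31
print(network_params)     # 10
```

The gap widens quickly as variables are added: the full table doubles with every new variable, while the network only grows by the size of the new node's local table.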

A Small Worked Example

Consider a simple weather model:

Cloudy influences Sprinkler
Cloudy influences Rain
Sprinkler and Rain influence WetGrass

The graph is:

text

Cloudy -> Sprinkler
Cloudy -> Rain
Sprinkler -> WetGrass
Rain -> WetGrass

We can compute probabilities by multiplying local conditional probabilities. The following Python example implements exact inference by enumeration and asks for the probability of rain given that the grass is wet.

python

from itertools import product

# Variables listed so that every parent appears before its children.
variables = ["Cloudy", "Sprinkler", "Rain", "WetGrass"]
parents = {
    "Cloudy": [],
    "Sprinkler": ["Cloudy"],
    "Rain": ["Cloudy"],
    "WetGrass": ["Sprinkler", "Rain"],
}

# Each entry maps a tuple of parent values to P(variable=True | parents).
cpt = {
    "Cloudy": {(): 0.5},
    "Sprinkler": {(True,): 0.1, (False,): 0.5},
    "Rain": {(True,): 0.8, (False,): 0.2},
    "WetGrass": {
        (True, True): 0.99,
        (True, False): 0.90,
        (False, True): 0.90,
        (False, False): 0.01,
    },
}


def p_true(var, assignment):
    """Look up P(var=True | parents) for the parent values in assignment."""
    key = tuple(assignment[parent] for parent in parents[var])
    return cpt[var][key]


def p(var, value, assignment):
    """Return P(var=value | parents)."""
    prob = p_true(var, assignment)
    return prob if value else 1.0 - prob


def joint_probability(assignment):
    """Multiply the local probabilities for one complete assignment."""
    result = 1.0
    for var in variables:
        result *= p(var, assignment[var], assignment)
    return result


def posterior(query_var, evidence):
    """Return (P(query=True | evidence), P(query=False | evidence))."""
    totals = {True: 0.0, False: 0.0}
    hidden = [var for var in variables if var not in evidence and var != query_var]

    # Sum the joint over every combination of hidden-variable values.
    for query_value in [True, False]:
        for hidden_values in product([True, False], repeat=len(hidden)):
            assignment = dict(evidence)
            assignment[query_var] = query_value
            for var, value in zip(hidden, hidden_values):
                assignment[var] = value
            totals[query_value] += joint_probability(assignment)

    # Normalize so the two posterior probabilities sum to 1.
    normalizer = totals[True] + totals[False]
    return totals[True] / normalizer, totals[False] / normalizer


p_rain_given_wet = posterior("Rain", {"WetGrass": True})
print(f"P(Rain=True | WetGrass=True) = {p_rain_given_wet[0]:.4f}")
print(f"P(Rain=False | WetGrass=True) = {p_rain_given_wet[1]:.4f}")

This example is intentionally small, but it shows the core workflow:

Define a directed acyclic graph.
Attach a conditional probability table to each node.
Multiply local probabilities to form joint probabilities.
Sum over hidden variables to answer queries.

For larger networks, exact enumeration becomes expensive: the number of terms to sum grows exponentially with the number of hidden variables. That is why real systems use more efficient inference algorithms.

Inference in Practice

Inference means answering probability questions from partial evidence. Typical questions include:

What is the probability of a disease given a set of symptoms?
What is the probability a machine has failed given several sensor readings?
Which hidden cause is most plausible after we observe several effects?

Exact methods include enumeration and variable elimination. Approximate methods include sampling approaches such as Gibbs sampling and likelihood weighting.
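As an illustration of the approximate side, here is a minimal sketch of likelihood weighting on the sprinkler network from the worked example. The network definition is repeated so the block runs on its own. The function name `likelihood_weighting`, the sample count, and the fixed seed are choices made for this sketch, not a reference implementation. Non-evidence variables are sampled from their local tables in parent-first order, and each sample is weighted by the probability of the observed evidence given its sampled parents.

```python
import random

# Sprinkler network from the worked example; values are P(var=True | parents).
order = ["Cloudy", "Sprinkler", "Rain", "WetGrass"]  # parents before children
parents = {
    "Cloudy": [],
    "Sprinkler": ["Cloudy"],
    "Rain": ["Cloudy"],
    "WetGrass": ["Sprinkler", "Rain"],
}
cpt = {
    "Cloudy": {(): 0.5},
    "Sprinkler": {(True,): 0.1, (False,): 0.5},
    "Rain": {(True,): 0.8, (False,): 0.2},
    "WetGrass": {
        (True, True): 0.99,
        (True, False): 0.90,
        (False, True): 0.90,
        (False, False): 0.01,
    },
}


def likelihood_weighting(query_var, evidence, n_samples, rng):
    """Estimate P(query_var=True | evidence) from weighted samples."""
    weighted = {True: 0.0, False: 0.0}
    for _ in range(n_samples):
        sample, weight = {}, 1.0
        for var in order:
            key = tuple(sample[parent] for parent in parents[var])
            prob_true = cpt[var][key]
            if var in evidence:
                # Evidence is fixed, not sampled; it scales the weight.
                sample[var] = evidence[var]
                weight *= prob_true if evidence[var] else 1.0 - prob_true
            else:
                sample[var] = rng.random() < prob_true
        weighted[sample[query_var]] += weight
    return weighted[True] / (weighted[True] + weighted[False])


rng = random.Random(0)  # fixed seed so the run is repeatable
estimate = likelihood_weighting("Rain", {"WetGrass": True}, 20_000, rng)
print(f"Estimated P(Rain=True | WetGrass=True) = {estimate:.3f}")
```

With these tables the exact posterior from enumeration is about 0.705, and the estimate should land close to that value. The error shrinks as the number of samples grows, which is the usual trade-off between approximate and exact methods.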

When Bayesian Networks Are Useful

Bayesian networks work well when:

uncertainty is central to the problem
you understand the dependency structure reasonably well
interpretability matters

They are common in medical diagnosis, risk analysis, reliability modeling, and expert systems.

Common Pitfalls

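One common pitfall is adding too many edges. Each extra parent multiplies the size of the child's probability table, so the model needs more data to estimate, and every added edge removes an independence assumption that would have kept the model compact.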

Another issue is choosing the wrong direction for relationships without understanding the consequence. Different graph structures can encode different independence assumptions, so changing an arrow is not just cosmetic.
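The effect of arrow direction can be seen by comparing two three-node structures numerically. The sketch below uses made-up probabilities and illustrative helper names (`chain_joint`, `collider_joint`, `prob`, `independent`). It contrasts the chain A -> B -> C with the collider A -> B <- C and checks whether A and C are independent, both marginally and after observing B.

```python
from itertools import product


def bern(p_true, value):
    """Probability of `value` when P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def chain_joint(a, b, c):
    """Chain A -> B -> C with hypothetical numbers."""
    return bern(0.3, a) * bern(0.9 if a else 0.2, b) * bern(0.7 if b else 0.1, c)


def collider_joint(a, b, c):
    """Collider A -> B <- C with hypothetical numbers."""
    p_b = {(True, True): 0.95, (True, False): 0.8,
           (False, True): 0.6, (False, False): 0.05}[(a, c)]
    return bern(0.3, a) * bern(0.4, c) * bern(p_b, b)


def prob(joint, **fixed):
    """Sum the joint over all assignments consistent with `fixed`."""
    total = 0.0
    for a, b, c in product([True, False], repeat=3):
        values = {"a": a, "b": b, "c": c}
        if all(values[name] == v for name, v in fixed.items()):
            total += joint(a, b, c)
    return total


def independent(joint, given_b=None):
    """Check whether A and C are independent, optionally given B."""
    cond = {} if given_b is None else {"b": given_b}
    z = prob(joint, **cond)
    p_ac = prob(joint, a=True, c=True, **cond) / z
    p_a = prob(joint, a=True, **cond) / z
    p_c = prob(joint, c=True, **cond) / z
    # For binary variables, checking one cell of the 2x2 table suffices.
    return abs(p_ac - p_a * p_c) < 1e-9


print("chain, marginal:    ", independent(chain_joint))
print("chain, given B:     ", independent(chain_joint, given_b=True))
print("collider, marginal: ", independent(collider_joint))
print("collider, given B:  ", independent(collider_joint, given_b=True))
```

The two structures behave in opposite ways: in the chain, observing B separates A from C, while in the collider, A and C start out independent and become dependent once B is observed.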

A third problem is filling probability tables with inconsistent numbers. Every conditional distribution must sum to 1 for each parent configuration. Small bookkeeping mistakes can make the whole model invalid.
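A simple automated check catches this kind of bookkeeping slip. The sketch below assumes a table layout in which each parent configuration maps to a full distribution over the child's values; this differs from the worked example, which stores only P(True). The table `rain_cpt` and the helper `invalid_rows` are hypothetical, and one row is deliberately wrong.

```python
import math

# Hypothetical CPT: parent configuration -> distribution over child values.
rain_cpt = {
    ("cloudy",): {"none": 0.2, "light": 0.5, "heavy": 0.3},
    ("clear",): {"none": 0.9, "light": 0.08, "heavy": 0.01},  # deliberate slip
}


def invalid_rows(cpt, tol=1e-9):
    """Return (parent_values, total) for each row that is not a distribution."""
    bad = []
    for parent_values, dist in cpt.items():
        total = sum(dist.values())
        in_range = all(0.0 <= p <= 1.0 for p in dist.values())
        if not in_range or not math.isclose(total, 1.0, abs_tol=tol):
            bad.append((parent_values, total))
    return bad


for parent_values, total in invalid_rows(rain_cpt):
    print(f"Row {parent_values} sums to {total:.4f}, expected 1")
```

Running a check like this whenever tables are edited is cheap, and it reports the exact row that needs fixing.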

Finally, many people underestimate inference cost. Even if the graph looks simple on paper, exact inference can become expensive when the network has many nodes with many parents. That is where model design matters as much as raw probability theory.

Summary

A Bayesian network combines a directed acyclic graph with local conditional probability tables.
Its main strength is compactly factorizing a joint distribution.
The graph encodes conditional independence assumptions.
Inference answers probability queries from partial evidence.
Small networks are easy to reason about, but large ones require careful structure and efficient inference methods.