What is a Sankey plot (and when not to use it)?

17 Feb

A Sankey plot is a directed flow diagram in which the width of each link is proportional to the quantity it carries. It is designed to show how a total amount splits, flows, and sometimes re-splits across successive stages of a process.

Sankey plots are especially effective when you want readers to understand where things go, how much is lost, and how categories change from one stage to the next.

When should you use a Sankey plot?

Use a Sankey plot when your data represents flow through stages and when conservation or loss matters.

Typical and effective use cases include:

✔ Bioinformatics and genomics

Read fate across an RNA-seq or ATAC-seq pipeline
(raw reads → QC pass → aligned → assigned → filtered)
Variant consequences across annotation tiers
(all variants → coding → missense → pathogenic)

✔ Sample and project tracking

Sample triage
(collected → passed QC → sequenced → analysed)
Cohort filtering
(initial cohort → exclusions → final analysis set)

✔ Resources and cost breakdowns

Cost or time allocation across pipeline steps
Compute usage split by analysis stage

If the question is “where did the data go?”, a Sankey plot is often the right answer.
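Most plotting libraries expect a Sankey as a list of (source, target, value) links. As a minimal sketch, assuming you start from ordered per-stage counts, the helper below turns them into links and adds an explicit "Lost after ..." link for whatever does not reach the next stage. The names `Link` and `stage_counts_to_links` and all the numbers are illustrative assumptions, not part of any library or real dataset.

```python
from typing import NamedTuple


class Link(NamedTuple):
    source: str
    target: str
    value: int


def stage_counts_to_links(stages: list[tuple[str, int]]) -> list[Link]:
    """Turn ordered (stage name, count) pairs into Sankey links.

    Each stage sends what it retains to the next stage and what it loses
    to a separate "Lost after <stage>" node, so every unit is accounted for.
    """
    links = []
    for (name, count), (next_name, next_count) in zip(stages, stages[1:]):
        if next_count > count:
            raise ValueError(f"{next_name!r} is larger than {name!r}; not a filtering flow")
        links.append(Link(name, next_name, next_count))
        lost = count - next_count
        if lost:
            links.append(Link(name, f"Lost after {name}", lost))
    return links


# Illustrative numbers only, not real data.
read_fate = [
    ("Raw reads", 1_000_000),
    ("QC-passed reads", 950_000),
    ("Aligned reads", 900_000),
    ("Assigned reads", 700_000),
    ("Filtered reads", 650_000),
]

for link in stage_counts_to_links(read_fate):
    print(f"{link.source} -> {link.target}: {link.value:,}")
```

Drawing the losses as their own links is a design choice: it keeps the flow into and out of every stage equal, which makes the "how much is lost" part of the story visible.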

When should you not use a Sankey plot?

Sankey plots are powerful, but very easy to misuse.

Avoid them in the following situations:

✖ Time series data

If you want to show trends over time, use:

Line charts
Area charts

Sankey plots have no time axis, so they cannot show how a quantity rises or falls over time.

✖ Too many small categories

If you have dozens of very thin flows (roughly more than 40–50 links):

The diagram becomes unreadable
Important patterns disappear into visual noise

In these cases, use:

Bar charts
Faceted plots
Or group small categories into “Other”

✖ Cycles or bidirectional networks

Sankey plots assume a left-to-right, acyclic flow.

They are not suitable for:

Feedback loops
Iterative processes
Bidirectional interactions

For those, use:

Network diagrams
Chord diagrams
Graph layouts

Design best practices (the ones reviewers actually like)

1) Stages and ordering matter

Arrange stages logically from left to right
Keep the same order across related figures
Never reorder stages just to “make it look nicer”

Consistency builds trust.

2) Use colour with restraint

Assign consistent colours to the same category across all stages
Avoid rainbow palettes
Use muted backgrounds
Group related categories using colour families

If colour encodes meaning, explain it.
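One way to keep colours consistent is to build the category-to-colour mapping once and reuse it for every figure. The sketch below assumes the Okabe-Ito palette, a common colour-blind-friendly choice; the function name `assign_colours` is hypothetical, and the category names are only examples.

```python
# Okabe-Ito palette (eight colours), a common colour-blind-friendly default.
OKABE_ITO = [
    "#E69F00", "#56B4E9", "#009E73", "#F0E442",
    "#0072B2", "#D55E00", "#CC79A7", "#000000",
]


def assign_colours(categories, palette=OKABE_ITO):
    """Map each category to one fixed colour, in first-seen order."""
    unique = list(dict.fromkeys(categories))  # de-duplicate, keep order
    if len(unique) > len(palette):
        raise ValueError(
            f"{len(unique)} categories but only {len(palette)} colours; "
            "group some categories first"
        )
    return dict(zip(unique, palette))


# The same category appearing at several stages gets the same colour.
colours = assign_colours(
    ["Aligned reads", "Unaligned reads", "Aligned reads", "Assigned reads"]
)
print(colours)
```

Raising an error when the palette runs out is deliberate: silently recycling colours would give two categories the same colour, which is exactly the confusion this rule tries to avoid.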

3) Labels and tooltips

Node labels: short, precise, unambiguous
(“Aligned reads”, not “Aligned”)
Link values: show both absolute counts and percentages
e.g. “700,000 (11.2%)”

For interactive figures, use tooltips.
For static figures, annotate only the key flows.
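A label in the "count (percentage)" style can be generated rather than typed by hand, which avoids mismatches between the two numbers. This is a minimal sketch; `format_link_label` is a hypothetical helper, and the total of 6,250,000 is an assumption chosen purely for illustration.

```python
def format_link_label(count: int, total: int) -> str:
    """Format a link value as 'absolute count (share of total)'."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{count:,} ({count / total:.1%})"


print(format_link_label(700_000, 6_250_000))  # 700,000 (11.2%)
```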

4) Handle small categories carefully

Very small flows (<1–2% of the total):

Add clutter
Distract from the main story

Group them into “Other”, and explain this choice in the caption.
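The grouping step can be sketched as a small function that folds every flow below a chosen share of the total into a single "Other" entry. The 2% default mirrors the upper end of the range mentioned above; the function name `group_small_flows` and the variant counts are illustrative assumptions, not real data.

```python
def group_small_flows(flows, min_share=0.02, other_label="Other"):
    """Fold flows smaller than min_share of the total into one 'Other' flow."""
    total = sum(flows.values())
    if total <= 0:
        raise ValueError("flows must sum to a positive total")
    kept = {}
    other = 0
    for name, value in flows.items():
        if value / total < min_share:
            other += value
        else:
            kept[name] = value
    if other:
        # Add to an existing "Other" entry if the input already had one.
        kept[other_label] = kept.get(other_label, 0) + other
    return kept


# Illustrative counts only.
variant_consequences = {
    "Missense": 600,
    "Synonymous": 350,
    "Stop gained": 30,
    "Splice site": 12,
    "Start lost": 8,
}
print(group_small_flows(variant_consequences))
# {'Missense': 600, 'Synonymous': 350, 'Stop gained': 30, 'Other': 20}
```

The total is unchanged by grouping, so percentages shown elsewhere in the figure stay valid.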

5) Always state units and totals

Make it explicit what the widths represent:

Reads
Samples
Variants
Compute hours

A subtitle helps enormously, for example:

Total reads analysed: N = 6.5 million

6) Accessibility matters

Use colour-blind friendly palettes
Ensure sufficient contrast
Provide a text summary or table alongside the figure

Remember: figures should support your story, not replace it.

7) Avoid cycles and double-counting

Sankey diagrams must not loop.

If your process is iterative:

Summarise each iteration separately, or
Switch to a network-based visualisation

Never allow the same quantity to be counted twice without explicit explanation.
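Both problems can be checked before plotting. The sketch below assumes links are given as (source, target, value) tuples: `has_cycle` uses a standard topological-sort pass to detect loops, and `unbalanced_nodes` flags intermediate nodes whose inflow and outflow differ, a common symptom of double-counting. Both function names and the sample numbers are hypothetical.

```python
from collections import defaultdict


def has_cycle(links):
    """Return True if the directed graph formed by the links contains a loop."""
    indegree = defaultdict(int)
    children = defaultdict(list)
    nodes = set()
    for source, target, _ in links:
        children[source].append(target)
        indegree[target] += 1
        nodes.update((source, target))
    # Kahn's algorithm: repeatedly remove nodes with no remaining inputs.
    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return visited != len(nodes)  # leftover nodes sit on a cycle


def unbalanced_nodes(links):
    """Map each intermediate node to (inflow, outflow) where the two differ."""
    inflow = defaultdict(int)
    outflow = defaultdict(int)
    for source, target, value in links:
        outflow[source] += value
        inflow[target] += value
    both = inflow.keys() & outflow.keys()
    return {n: (inflow[n], outflow[n]) for n in both if inflow[n] != outflow[n]}


# Illustrative sample-tracking flow.
links = [
    ("Collected", "Passed QC", 90),
    ("Collected", "Failed QC", 10),
    ("Passed QC", "Sequenced", 95),  # more out than in: counted twice?
]
print(has_cycle(links))         # False
print(unbalanced_nodes(links))  # {'Passed QC': (90, 95)}
```

Note that the balance check also flags nodes where outflow is smaller than inflow, so it works best when losses are drawn as explicit links.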

Lindsey Van Haute