What is a Sankey plot (and when not to use it)?

A Sankey plot is a directed flow diagram in which the width of each link is proportional to the quantity it carries. It is designed to show how a total amount splits, flows, and sometimes re-splits across successive stages of a process.

Sankey plots are especially effective when you want readers to understand where things go, how much is lost, and how categories change over steps.

When should you use a Sankey plot?

Use a Sankey plot when your data represents flow through stages and when conservation or loss matters.

Typical and effective use cases include:

Bioinformatics and genomics

  • Read fate across an RNA-seq or ATAC-seq pipeline
    (raw reads → QC pass → aligned → assigned → filtered)

  • Variant consequences across annotation tiers
    (all variants → coding → missense → pathogenic)

Sample and project tracking

  • Sample triage
    (collected → passed QC → sequenced → analysed)

  • Cohort filtering
    (initial cohort → exclusions → final analysis set)

Resources and cost breakdowns

  • Cost or time allocation across pipeline steps

  • Compute usage split by analysis stage

If the question is “where did the data go?”, a Sankey plot is often the right answer.
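
As a rough illustration of the read-fate example above, the sketch below turns per-stage counts into Sankey links, with one "retained" link and one "lost" link per step. The counts, the "Lost before ..." node names, and both function names are assumptions made up for this example. The index-array layout mirrors what index-based plotting APIs such as plotly's go.Sankey take, but nothing here depends on a plotting library.

```python
# Hypothetical read counts at each pipeline stage (illustrative numbers only).
stages = [
    ("Raw reads", 6_500_000),
    ("QC pass", 6_000_000),
    ("Aligned reads", 5_400_000),
    ("Assigned reads", 4_700_000),
]


def stage_counts_to_links(stages):
    """Turn ordered (stage, count) pairs into (source, target, value) links.

    Each step yields one 'retained' link to the next stage and, if anything
    was dropped, one 'lost' link to a dedicated sink node.
    """
    links = []
    for (src, n_in), (dst, n_out) in zip(stages, stages[1:]):
        if n_out > n_in:
            raise ValueError(f"{dst!r} has more items than {src!r}")
        links.append((src, dst, n_out))
        lost = n_in - n_out
        if lost:
            links.append((src, f"Lost before {dst}", lost))
    return links


def to_index_arrays(links):
    """Convert named links to label/source/target/value lists, the layout
    that index-based Sankey APIs (for example plotly's go.Sankey) expect."""
    labels = []
    for src, dst, _ in links:
        for name in (src, dst):
            if name not in labels:
                labels.append(name)
    index = {name: i for i, name in enumerate(labels)}
    return {
        "label": labels,
        "source": [index[s] for s, _, _ in links],
        "target": [index[t] for _, t, _ in links],
        "value": [v for _, _, v in links],
    }


links = stage_counts_to_links(stages)
arrays = to_index_arrays(links)
```

Because every step emits an explicit "lost" link, the outflows of each stage add up to its input, which is what makes the loss visible in the figure.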

When should you not use a Sankey plot?

Sankey plots are powerful, but they are very easy to misuse.

Avoid them in the following situations:

Time series data

If you want to show trends over time, use:

  • Line charts

  • Area charts

Sankey plots do not encode time well: horizontal position shows stage order, not elapsed time, so trends are hard to read.

Too many small categories

If you have dozens of very thin flows (roughly more than 40–50 links):

  • The diagram becomes unreadable

  • Important patterns disappear into visual noise

In these cases, use:

  • Bar charts

  • Faceted plots

  • Or group small categories into “Other”

Cycles or bidirectional networks

Sankey plots assume a left-to-right, acyclic flow.

They are not suitable for:

  • Feedback loops

  • Iterative processes

  • Bidirectional interactions

For those, use:

  • Network diagrams

  • Chord diagrams

  • Graph layouts

Design best practices (the ones reviewers actually like)

1) Stages and ordering matter

  • Arrange stages logically from left to right

  • Keep the same order across related figures

  • Never reorder stages just to “make it look nicer”

Consistency builds trust.

2) Use colour with restraint

  • Assign consistent colours to the same category across all stages

  • Avoid rainbow palettes

  • Use muted backgrounds

  • Group related categories using colour families

If colour encodes meaning, explain it.
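
One way to keep colours consistent is to build a single category-to-colour map and reuse it for every stage and every related figure. The sketch below assumes the Okabe-Ito palette (a commonly used colour-blind-friendly set) and a neutral grey for "Other". The function name and the category names are hypothetical.

```python
# Okabe-Ito palette, a widely used colour-blind-friendly set of hex colours.
OKABE_ITO = [
    "#E69F00", "#56B4E9", "#009E73", "#F0E442",
    "#0072B2", "#D55E00", "#CC79A7", "#000000",
]


def assign_colours(categories, palette=OKABE_ITO, other_colour="#999999"):
    """Map each category to one fixed colour, in first-seen order.

    "Other" always gets a neutral grey. Running out of palette colours raises
    an error instead of silently reusing a colour for two categories.
    """
    colours = {}
    available = iter(palette)
    for name in categories:
        if name in colours:
            continue  # already assigned; keep the same colour
        if name == "Other":
            colours[name] = other_colour
            continue
        try:
            colours[name] = next(available)
        except StopIteration:
            raise ValueError("more categories than palette colours") from None
    return colours


# Hypothetical category names; build the map once and reuse it everywhere.
colour_map = assign_colours(
    ["Aligned reads", "Unaligned reads", "Other", "Aligned reads"]
)
```

Raising an error when the palette runs out is a deliberate choice here: it is a prompt to group categories rather than to reach for a rainbow palette.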

3) Labels and tooltips

  • Node labels: short, precise, unambiguous
    (“Aligned reads”, not “Aligned”)

  • Link values: show both absolute counts and percentages
    e.g. “700,000 (11.2%)”

For interactive figures, use tooltips.
For static figures, annotate only the key flows.
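
A label in the "count (percentage)" style can be generated with a small helper like the one below. This is a minimal sketch: the function name is hypothetical, the numbers are illustrative, and whether the percentage is taken relative to the grand total or to the previous stage is a choice you should state in the caption.

```python
def format_link_label(count, total):
    """Return a label with a thousands-separated count and a one-decimal share."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{count:,} ({count / total:.1%})"


# Illustrative numbers only: 1.3 million reads out of a 6.5 million total.
label = format_link_label(1_300_000, 6_500_000)  # "1,300,000 (20.0%)"
```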

4) Handle small categories carefully

Very small flows (<1–2% of the total):

  • Add clutter

  • Distract from the main story

Group them into “Other”, and explain this choice in the caption.
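
The grouping step might look like the sketch below, which collapses every flow under a chosen share of the total into a single "Other" entry while preserving the grand total. The 2% default, the function name, and the variant counts are assumptions for illustration.

```python
def group_small_flows(flows, min_share=0.02, other_label="Other"):
    """Collapse flows below min_share of the total into one 'Other' entry.

    flows maps category name -> value. The grand total is preserved.
    """
    total = sum(flows.values())
    if total <= 0:
        raise ValueError("flows must sum to a positive total")
    grouped = {}
    other = 0
    for name, value in flows.items():
        if name != other_label and value / total >= min_share:
            grouped[name] = value
        else:
            other += value  # small flow, or an existing "Other" bucket
    if other:
        grouped[other_label] = other
    return grouped


# Hypothetical variant counts per consequence class.
flows = {
    "Missense": 5200,
    "Synonymous": 4100,
    "Nonsense": 90,
    "Splice site": 60,
    "Frameshift": 550,
}
grouped = group_small_flows(flows)
```

With these numbers, "Nonsense" and "Splice site" (0.9% and 0.6% of the total) are merged, and the threshold used belongs in the caption.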

5) Always state units and totals

Make it explicit what the widths represent:

  • Reads

  • Samples

  • Variants

  • Compute hours

A subtitle helps enormously, for example:

Total reads analysed: N = 6.5 million

6) Accessibility matters

  • Use colour-blind friendly palettes

  • Ensure sufficient contrast

  • Provide a text summary or table alongside the figure

Remember: figures should support your story, not replace it.

7) Avoid cycles and double-counting

Sankey diagrams must not loop.

If your process is iterative:

  • Summarise each iteration separately, or

  • Switch to a network-based visualisation

Never allow the same quantity to be counted twice without explicit explanation.
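
Both problems can be checked before plotting. The sketch below assumes links are stored as (source, target, value) tuples. It uses Kahn's algorithm to detect cycles and flags intermediate nodes whose outflow exceeds their inflow, a common sign of double-counting. Both function names are hypothetical.

```python
from collections import defaultdict, deque


def has_cycle(links):
    """Return True if the (source, target, value) links contain a directed cycle.

    Kahn's algorithm: repeatedly remove nodes with no incoming links. Any
    node that can never be removed lies on, or downstream of, a cycle.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst, _ in links:
        successors[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))

    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return removed != len(nodes)


def overcounted_nodes(links):
    """Return intermediate nodes whose total outflow exceeds their inflow."""
    inflow = defaultdict(int)
    outflow = defaultdict(int)
    for src, dst, value in links:
        outflow[src] += value
        inflow[dst] += value
    return sorted(n for n in outflow if n in inflow and outflow[n] > inflow[n])
```

Source nodes (no inflow) are skipped by the second check on purpose, since they define the total rather than receive it.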
