Feature filtering for differential expression / binding analyses

Last updated: July 2, 2025

When performing differential expression (DE) or differential binding analysis, you may want to decide which genes (or other features) are included in the analysis. Pluto sets reasonable defaults for this task, but the platform also gives you control over the filtering through the advanced settings so that you can tailor your analysis to your specific data and scientific question.

This guide will help you understand what these settings mean, why they matter, and when it makes sense biologically to adjust them.

⚙ What are these settings?

Pluto allows you to set two filters for each group (experimental and control):

Minimum read count
Only genes with at least this many raw counts will be considered.
Minimum % of samples meeting the read count
This ensures the gene is detected in a consistent proportion of your samples, not just sporadically in one or two.

👉 Example: If you set a minimum of 10 reads in at least 70% of experimental samples, Pluto will only include genes that have 10+ reads in at least 70% of the samples in that group.
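
To make the two settings concrete, here is a minimal sketch of the per-group check described above. It is not Pluto's implementation: the function name `passes_group_filter`, its parameters, and the example counts are all hypothetical and used only for illustration.

```python
def passes_group_filter(counts, min_reads, min_pct):
    """Return True if at least `min_pct` percent of the samples in one group
    have `min_reads` or more raw counts for a gene.

    Hypothetical helper for illustration only, not Pluto's actual code.
    """
    if not counts:
        return False
    n_meeting = sum(1 for c in counts if c >= min_reads)
    # Integer comparison avoids floating-point rounding at the boundary.
    return n_meeting * 100 >= min_pct * len(counts)


# Made-up raw counts for one gene across 10 experimental samples.
gene_counts = [12, 15, 0, 22, 10, 9, 31, 18, 11, 14]

# 8 of the 10 samples have 10+ reads, so the gene meets a 70% requirement.
print(passes_group_filter(gene_counts, min_reads=10, min_pct=70))
```

With the same counts, raising the requirement to 90% of samples would exclude this gene, since only 8 of 10 samples reach 10 reads.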

In many cases, it makes sense to use the same parameters for the experimental and control groups. However, there can be practical or biological reasons to set them differently. See the Examples section below to explore these scenarios.

Default parameters and recommendations

By default, Pluto filters features prior to differential analyses to include only genes with at least 3 reads counted in at least 20% of samples in any group. For most standard transcriptomics and epigenomics studies, this approach is a good balance between sensitivity and reliability.
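
As a rough illustration of the default rule, the sketch below keeps a gene when any group has at least 3 reads in at least 20% of its samples. The "any group" combination follows the wording above, but the function names, data layout, and gene names are assumptions made for this example, not Pluto's internal code.

```python
DEFAULT_MIN_READS = 3   # default minimum read count stated above
DEFAULT_MIN_PCT = 20    # default minimum % of samples stated above


def group_passes(counts, min_reads=DEFAULT_MIN_READS, min_pct=DEFAULT_MIN_PCT):
    """True if enough samples in a single group reach the read threshold."""
    if not counts:
        return False
    n_meeting = sum(1 for c in counts if c >= min_reads)
    return n_meeting * 100 >= min_pct * len(counts)


def keep_gene(counts_by_group):
    """Keep the gene if ANY group passes (assumed reading of 'in any group')."""
    return any(group_passes(counts) for counts in counts_by_group.values())


# Hypothetical genes with made-up raw counts, 5 samples per group.
genes = {
    "GENE_A": {"control": [0, 0, 1, 0, 0], "experimental": [5, 0, 0, 0, 0]},
    "GENE_B": {"control": [0, 1, 2, 0, 0], "experimental": [2, 0, 1, 0, 0]},
}

kept = [name for name, groups in genes.items() if keep_gene(groups)]
print(kept)
```

Here GENE_A is kept because 1 of 5 experimental samples (20%) has at least 3 reads, while GENE_B never reaches 3 reads in any sample and is filtered out.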

It's good to filter genes with extremely low or inconsistent expression because they can:

Introduce noise into your DE results
Inflate the number of statistical tests, increasing the multiple-testing burden (more false positives, or less power after correction)
Lead to misleading fold changes due to unstable estimates

Filtering them out improves the accuracy and biological relevance of your DE results.

Examples: when it makes sense to customize feature filtering

Here are a few scenarios where you may want to consider adjusting these settings:

Focused biomarker discovery

Biomarkers should typically be robustly expressed across samples so that they can be readily measured and so that you can have more confidence in differentially expressed candidates.

👉 Example: Require a minimum of 10 reads in at least 80% of samples for both control and experimental groups.

Disease- or treatment-specific expression patterns

In some conditions, you may hypothesize that a disease or treatment activates certain genes, so that those genes are not expressed at all in controls but turn on in a subset of treated patients (e.g., responders). You can keep a higher threshold in the experimental group to ensure you're measuring treatment-induced expression consistently.

👉 Example:
Require a minimum of 3 reads in at least 10% of samples for the control group. This allows for baseline low/no expression without filtering the gene out entirely.
Require a minimum of 10 reads in at least 90% of samples for the experimental group. This ensures you're capturing the genes that were consistently induced by the treatment.

Comparisons with very uneven sample size

If you're comparing a small group (e.g. a rare cancer subtype with only n=6) to a larger cohort (e.g. other breast tumors, n=300), the same percentage translates into very different sample counts. With n=6, requiring 70% of samples to meet the threshold demands expression in 5 of the 6 samples (70% of 6 is 4.2, rounded up to 5), which might be too strict. Adjusting the thresholds to the number of samples in each group can help avoid biasing your results against genes that look inconsistently expressed only because n is small.

👉 Example:
Require a minimum of 3 reads in at least 50% of samples (3 of the 6 patients) for the experimental group.
Require a minimum of 3 reads in at least 75% of samples (225 of the 300 patients) for the control group.
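
The arithmetic behind these numbers can be checked with a short sketch. The helper name `min_samples_required` is hypothetical, and rounding up to the next whole sample is an assumption that matches the "5 of the 6 samples" figure above.

```python
def min_samples_required(n_samples, min_pct):
    """Smallest number of samples that satisfies 'at least min_pct percent'.

    Uses integer ceiling division, so 70% of 6 (4.2) rounds up to 5.
    Hypothetical helper for illustration only.
    """
    return -((-n_samples * min_pct) // 100)


# Small experimental group (n=6) versus large control cohort (n=300).
print(min_samples_required(6, 70))    # strict for a small group
print(min_samples_required(6, 50))    # relaxed setting from the example
print(min_samples_required(300, 75))  # control cohort setting from the example
```

Running a quick check like this before choosing a percentage shows how many samples each group actually needs, which matters most when one group is very small.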

Tumor heterogeneity

In heterogeneous tumors, gene expression can vary widely between samples. You might lower the minimum % of samples (e.g. to 10%) to avoid missing genes that are specific to a subgroup of responders or resistant patients.

Rare or lowly expressed genes of interest

If you're looking at transcription factors, cytokines, or lncRNAs, these are often expressed at low abundance. You can lower the read count threshold to avoid filtering them out entirely, but be aware of the potential for increased noise.

Low-input or degraded samples (e.g. FFPE tissue)

RNA quality may vary widely across samples, so you might lower the % of samples required in order to include genes that are inconsistently detected but still relevant.

👉 Example: Require a minimum of 10 reads in at least 10% of samples for both control and experimental groups.