SPAN Peak Analyzer
+----------------------------------+ |SPAN Semi-supervised Peak Analyzer| +----------------------------|/----+ , , __.-'|'-.__.-'|'-.__ ='=====|========|====='= ~_^~-^~~_~^-^~-~~^_~^~^~^
SPAN is a semi-supervised multipurpose peak caller capable of processing a broad range of ChIP-seq, ATAC-seq, and single-cell ATAC-seq datasets that robustly handles multiple replicates and noise by leveraging limited manual annotation information.
- Usage of SPAN
- Output Files
- Study Cases
- Error reporting
- Part of integrated peak calling solution
- Works with both conventional and ultra-low-input ChIP-seq data
- Works with both narrow and wide modifications
- Works with both single-end and paired-end libraries
- Fragment size prediction for single-end libraries
- Capable to process tracks with different signal-to-noise ratio
- Supports optional control track
- Supports replicates on model level
- False Discovery Rate correction
- Experimental: differential peak calling
- New Multistart during model fitting
SPAN Peak Analyzer (build 0.13.5244), released on Aug 12, 2020
|span-0.13.5244.jar||Multi-platform JAR package|
- 4 GB RAM minimum
- Download and install Java 8.
- Download the
<build>.chrom.sizeschromosome sizes of the organism you want to analyze from the UCSC website.
Here is the file used in our study.
java -Xmx4G -jar span-0.13.5244.jar [-h] [--version] analyze
-Xmx memory settings to configure memory
4 gigabytes are used in examples.
|Example of regular peak calling||
|Example of supervised peak calling||
|Example of model fitting||
To analyze a single (possibly replicated) biological condition use
-b, --bin BIN_SIZE
Peak analysis is performed on read coverage tiled into consequent bins, with size being configurable. Default value is 200bp, approximately the length of one nucleosome.
-t, --treatment TREATMENT
Required. ChIP-seq treatment file. Supported formats: BAM, BED, BED.gz or bigWig file. If multiple files are given, treated as replicates. Multiple files should be separated by commas:
. Multiple files are processed as replicates on model level.
-c, --control CONTROL
Control file. Multiple files should be separated by commas. Single control file or separate file per each treatment file required.
Follow instructions for
-t, --treatment TREATMENT.
-cs, --chrom.sizes CHROMOSOMES_SIZES
Required. Chromosome sizes file for genome build used in
Can be downloaded at
Fragment size. If provided, reads are shifted appropriately. If not provided, the shift is estimated from the data.
--fragment 0 argument is necessary for ATAC-Seq data processing.
Keep duplicates. By default SPAN filters out redundant reads, aligned at the same genomic position.
--keep-dup argument is necessary for single cell ATAC-Seq data processing.
-m, --model MODEL
This option is used to specify SPAN model path, if not provided, model name is formed by input names and other arguments.
-p, --peaks PEAKS
Resulting peaks file in ENCODE broadPeak* (BED 6+3) format. If omitted, only model fitting step is performed.
-f, --fdr FDR
Minimum FDR cutoff to call significant regions, default value is 1.0E-6.
SPAN reports p- and q- values for the
null hypothesis that a given bin is not enriched with a histone modification. Peaks are
formed from a list of truly
(in the FDR sense) enriched bins for the analyzed biological condition by thresholding the Q-value with a
FDR and merging spatially close peaks using
option to broad ones. This is
equivalent to controlling
q-values are are calculated from p-values using Benjamini-Hochberg procedure.
-g, --gap GAP
Gap size to merge spatially close peaks. Useful for wide histone modifications. Default value is 5, i.e. peaks separated by 5*
BIN distance or less are merged.
Labels BED file. Used in semi-supervised peak calling.
Print all the debug information, used for troubleshooting.
Turn off output.
-w, --workdir PATH
Path to the working directory (stores coverage and model caches).
Configures parallelism level.
SPAN utilizes both multithreading and specialized processor extensions
SSE2, AVX, etc.
Parallel computations were performed using open-source library
viktor for parallel
Kotlin programming language.
Number of multistart runs using different model initialization. Use 0 to disable (default: 5)
Number of iterations for each multistart run (default: 5)
Convergence threshold for EM algorithm, use --debug option to see detailed info (default: 0.1)
Supervised peak calling
LABELS parameter is given,
it is used to optimize peak caller parameters for markup.
SPAN workflow consists of several steps:
- Convert raw reads to tags using user-supplied
FRAGMENTparameter or maximum cross-correlation estimate.
- Compute coverage for all genome tiled into bins of
- Fit 3-state hidden Markov model that classifies bins as
ZEROstates with no coverage,
LOWstates of non-specific binding, and
HIGHstates of the specific binding.
- Compute posterior
HIGHstate probability of each bin.
- Trained model is saved into
- Peaks are computed using trained model and
LABELSare provided, optimal parameters are computed to conform with them.
Model fitting mode produces trained model file in binary format as output, which can be:
OUTPUTfile is given, it will contain predicted and FDR-controlled peaks in the ENCODE broadPeak format, i.e. BED 6+3:
<chromosome> <peak start> <peak end> <peak name> <score> . <coverage / foldchange> <-log p-value> <-log Q-value>
Same format is used by MACS2 peak caller.
- chromosome name
- start position of peak
- end position of peak
- peak name
- score of the peak, computed as log10(qvalue) * log(peak length). Useful for peak ranking with wide histone modifications.
- . (represents strand)
- summary reads coverage in peak averaged over replicates. fold-change in differential mode.
- -log10(pvalue) of null-hypothesis that given peak is in
- -log10(qvalue), calculated from p-values using Benjamini-Hochberg procedure. Median value for merged peak.
- In case of
SPANmodel fitting, it produces model file in binary format.
NOTE: after model is trained once, it will be reused automatically in other modes.
As a benchmark we applied SPAN peak calling approach to public conventional ChIP-seq datasets as well as to a ULI ChIP-seq dataset.
CD14+ classical monocytes tracks available in ENCODE database were a natural choice for a
We also used the data from Hocking et al. to evaluate
Chen C et al. presented an ultra-low-input micrococcal nuclease-based native ChIP (ULI-NChIP) and sequencing method to generate genome-wide histone mark profiles with high resolution and reproducibility from as few as one thousand cells. We used these tracks to estimate semi-supervised approach in extreme conditions.
SPAN produced high quality peak calling in all of these cases, see report.
This suggests that SPAN Peak Analyzer can be used as a general purpose peak calling solution.
Report any errors or comments in the public SPAN issue tracker.
Q: What is average running time?
SPAN is capable of processing single ChIP-Seq track in less than 1 hour on
moderate laptop (MacBook
Q: Which operating systems are supported?
SPAN is developed in modern Kotlin
programming language and
can be executed on any platform supported by
Q: Is differential peak calling supported?
A: This is experimental feature, see for details:
java -Xmx4G -jar span.jar compare -h
Q: Where is
SPAN source code?
A: Source code is available on GitHub
Q: Where did you get this lovely span picture?
A: From ascii.co.uk, it seems the original author goes by the name
Modified Wed Aug 12 17:19:12 2020 UTC