SPAN Peak Analyzer
+----------------------------------+
|SPAN Semi-supervised Peak Analyzer|
+----------------------------|/----+
           ,        ,
      __.-'|'-.__.-'|'-.__
    ='=====|========|====='=
    ~_^~-^~~_~^-^~-~~^_~^~^~^
SPAN is a semi-supervised multipurpose peak caller capable of processing a broad range of ChIP-seq, ATAC-seq, and single-cell ATAC-seq datasets that robustly handles multiple replicates and noise by leveraging limited manual annotation information.
Features
- Part of an integrated peak calling solution
- Works with both conventional and ultra-low-input ChIP-seq data
- Works with both narrow and wide modifications
- Works with both single-end and paired-end libraries
- Fragment size prediction for single-end libraries
- Capable of processing tracks with different signal-to-noise ratios
- Supports an optional control track
- Supports replicates at the model level
- False Discovery Rate correction
- Experimental: differential peak calling
- New: multistart during model fitting
Installation
SPAN Peak Analyzer (build 0.13.5244), released on Aug 12, 2020
Download | Description |
---|---|
span-0.13.5244.jar | Multi-platform JAR package |
Requirements:
- 4 GB RAM minimum
- Download and install Java 8.
- Download the <build>.chrom.sizes chromosome sizes file for the organism you want to analyze from the UCSC website.
Here is the file used in our study.
Usage of SPAN
java -Xmx4G -jar span-0.13.5244.jar [-h] [--version] analyze
Use the java -Xmx memory setting to configure memory usage. The examples use 4 gigabytes.
Example of regular peak calling | java -Xmx4G -jar span.jar analyze -t ChIP.bam -c Control.bam --cs Chrom.sizes -p Results.peak |
---|---|
Example of supervised peak calling | java -Xmx4G -jar span.jar analyze -t ChIP.bam -c Control.bam --cs Chrom.sizes -l Labels.bed -p Results.peak |
Example of model fitting | java -Xmx4G -jar span.jar analyze -t ChIP.bam -c Control.bam --cs Chrom.sizes |
Peak calling
To analyze a single (possibly replicated) biological condition, use the analyze command.
-b, --bin BIN_SIZE
Peak analysis is performed on read coverage tiled into consecutive bins of configurable size. The default value is 200 bp, approximately the length of one nucleosome.
-t, --treatment TREATMENT
Required. ChIP-seq treatment file. Supported formats: BAM, BED, BED.gz, or bigWig. Multiple files should be separated by commas: -t A,B,C. Multiple files are treated as replicates and processed at the model level.
-c, --control CONTROL
Control file. Multiple files should be separated by commas. Provide either a single control file or a separate control file for each treatment file. Follow the instructions for -t, --treatment TREATMENT.
-cs, --chrom.sizes CHROMOSOMES_SIZES
Required. Chromosome sizes file for the genome build used in the TREATMENT and CONTROL files.
Can be downloaded at https://hgdownload.cse.ucsc.edu/goldenPath/...
--fragment FRAGMENT
Fragment size. If provided, reads are shifted appropriately. If not provided, the shift is estimated from the data. The --fragment 0 argument is necessary for ATAC-seq data processing.
-k, --keep-dup
Keep duplicates. By default, SPAN filters out redundant reads aligned at the same genomic position. The --keep-dup argument is necessary for single-cell ATAC-seq data processing.
-m, --model MODEL
Path to the SPAN model file. If not provided, the model name is formed from the input names and other arguments.
-p, --peaks PEAKS
Resulting peaks file in ENCODE broadPeak (BED 6+3) format.
If omitted, only the model fitting step is performed.
-f, --fdr FDR
Minimum FDR cutoff to call significant regions; the default value is 1.0E-6. SPAN reports p- and q-values for the null hypothesis that a given bin is not enriched with a histone modification. Peaks are formed from the list of truly (in the FDR sense) enriched bins for the analyzed biological condition by thresholding the q-value with the FDR cutoff and merging spatially close peaks into broad ones using the GAP option. This is equivalent to controlling FDR.
q-values are calculated from p-values using the Benjamini-Hochberg procedure.
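The q-value thresholding described for this option can be illustrated with a minimal Python sketch of the Benjamini-Hochberg procedure. This is not SPAN's code: the function names are hypothetical, and the use of a non-strict comparison (q <= FDR) is an assumption.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q-values in the original order of pvalues."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that q-values
    # are monotone: q(rank) = min over ranks >= rank of p * n / rank.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


def enriched_bins(pvalues, fdr=1e-6):
    """Indices of bins whose q-value passes the cutoff (assumed: q <= fdr)."""
    qvalues = benjamini_hochberg(pvalues)
    return [i for i, q in enumerate(qvalues) if q <= fdr]
```

For example, enriched_bins(per_bin_pvalues, fdr=1e-6) would return the bin indices that are then merged into peaks; per_bin_pvalues is a placeholder for p-values produced elsewhere.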
-g, --gap GAP
Gap size to merge spatially close peaks. Useful for wide histone modifications. The default value is 5, i.e. peaks separated by a distance of 5*BIN or less are merged.
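A rough Python sketch of the merging rule follows. It assumes enriched bins are given as integer indices on a single chromosome and that the distance between peaks is measured from the end of one to the start of the next; the function name and these conventions are illustrative, not taken from SPAN.

```python
def merge_bins(enriched, bin_size=200, gap=5):
    """Merge enriched bin indices into (start, end) peaks in base pairs.

    Neighbouring peaks separated by gap * bin_size bp or less are joined.
    """
    peaks = []
    for b in sorted(enriched):
        start, end = b * bin_size, (b + 1) * bin_size
        if peaks and start - peaks[-1][1] <= gap * bin_size:
            # Close enough to the previous peak: extend it.
            peaks[-1] = (peaks[-1][0], end)
        else:
            peaks.append((start, end))
    return peaks
```

With the defaults, bins 0, 1 and 4 end up in one peak spanning 0 to 1000 bp, while a bin at index 20 starts a new peak because it lies 3000 bp away.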
--labels LABELS
Labels BED file. Used in semi-supervised peak calling.
-d, --debug
Print all the debug information, used for troubleshooting.
-q, --quiet
Turn off output.
-w, --workdir PATH
Path to the working directory (stores coverage and model caches).
--threads THREADS
Configures the parallelism level. SPAN utilizes both multithreading and specialized processor extensions such as SSE2, AVX, etc.
Parallel computations are performed using the open-source viktor library for parallel matrix computations in the Kotlin programming language.
--ms, --multistarts
Number of multistart runs using different model initialization. Use 0 to disable (default: 5)
--ms-iterations, --msi
Number of iterations for each multistart run (default: 5)
--threshold, --tr
Convergence threshold for EM algorithm, use --debug option to see detailed info (default: 0.1)
Supervised peak calling
When the LABELS parameter is given, the labels are used to optimize the peak caller parameters to match the markup.
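One way to picture this optimization is a small grid search that picks the FDR and GAP values whose peaks disagree with the fewest labels. The Python sketch below is only an illustration under stated assumptions: it assumes two label kinds, "peaks" (at least one peak must overlap the region) and "noPeaks" (no peak may overlap it), and a caller-supplied call_peaks(fdr, gap) function. None of these names come from SPAN, and its actual search strategy may differ.

```python
from itertools import product


def overlaps(peaks, start, end):
    """True if any (peak_start, peak_end) interval overlaps [start, end)."""
    return any(s < end and start < e for s, e in peaks)


def label_errors(peaks, labels):
    """Count labels (start, end, kind) that the peaks contradict."""
    errors = 0
    for start, end, kind in labels:
        hit = overlaps(peaks, start, end)
        if (kind == "peaks" and not hit) or (kind == "noPeaks" and hit):
            errors += 1
    return errors


def tune(call_peaks, labels, fdrs, gaps):
    """Return the (fdr, gap) pair with the fewest label errors."""
    return min(product(fdrs, gaps),
               key=lambda params: label_errors(call_peaks(*params), labels))
```

Here call_peaks stands in for whatever produces peaks from an already fitted model, so only the cheap peak-forming step would be repeated for each parameter pair.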
Model fitting
The SPAN workflow consists of several steps:
- Convert raw reads to tags using the user-supplied FRAGMENT parameter or the maximum cross-correlation estimate.
- Compute coverage for the whole genome tiled into bins of BIN base pairs.
- Fit a 3-state hidden Markov model that classifies bins as ZERO states with no coverage, LOW states of non-specific binding, and HIGH states of specific binding.
- Compute the posterior HIGH state probability of each bin.
- The trained model is saved in the .span binary format.
- Peaks are computed using the trained model and the FDR and GAP parameters.
- If LABELS are provided, optimal parameters are computed to conform with them.
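The first two steps of the workflow can be sketched in Python as follows. This is a simplified illustration, not SPAN's implementation: it assumes half-open BED-style read coordinates and a common convention of shifting each read's 5' end by half the fragment size toward the fragment center. The function names and the exact shift are assumptions.

```python
def read_to_tag(start, end, strand, fragment):
    """Map a read [start, end) to a single tag position.

    Assumed convention: take the 5' end and shift it by fragment // 2
    toward the fragment center; fragment=0 leaves the 5' end unshifted.
    """
    if strand == "+":
        return start + fragment // 2
    return end - 1 - fragment // 2


def bin_coverage(tags, chrom_size, bin_size=200):
    """Count tags in consecutive bins of bin_size bp along one chromosome."""
    n_bins = (chrom_size + bin_size - 1) // bin_size  # ceiling division
    coverage = [0] * n_bins
    for pos in tags:
        if 0 <= pos < chrom_size:  # ignore tags shifted off the chromosome
            coverage[pos // bin_size] += 1
    return coverage
```

The resulting per-bin counts are the kind of observations a hidden Markov model with ZERO, LOW, and HIGH states would then be fitted to.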
Model fitting mode produces trained model file in binary format as output, which can be:
- visualized directly in JBR Genome Browser
- used in integrated peak calling pipeline
Output files
- If the PEAKS file is given, it will contain predicted and FDR-controlled peaks in the ENCODE broadPeak format, i.e. BED 6+3:
<chromosome> <peak start> <peak end> <peak name> <score> . <coverage / foldchange> <-log p-value> <-log Q-value>
The same format is used by the MACS2 peak caller.
  - chromosome name
  - start position of the peak
  - end position of the peak
  - peak name
  - score of the peak, computed as log10(qvalue) * log(peak length). Useful for peak ranking with wide histone modifications.
  - . (represents strand)
  - summary read coverage in the peak, averaged over replicates; fold change in differential mode.
  - -log10(pvalue) of the null hypothesis that the given peak is in the ZERO or LOW state.
  - -log10(qvalue), calculated from p-values using the Benjamini-Hochberg procedure. Median value for a merged peak.
- In the case of SPAN model fitting, a model file in binary format is produced.
NOTE: once a model is trained, it is reused automatically in other modes.
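For downstream processing, the nine broadPeak columns described above can be read with a few lines of Python. This is a minimal sketch that assumes tab-separated fields in the listed order; the Peak type and function names are illustrative and not part of SPAN.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    score: float
    strand: str
    value: float        # coverage, or fold change in differential mode
    neg_log10_p: float
    neg_log10_q: float


def parse_peak(line):
    """Parse one tab-separated broadPeak (BED 6+3) line into a Peak."""
    f = line.rstrip("\n").split("\t")
    return Peak(f[0], int(f[1]), int(f[2]), f[3], float(f[4]), f[5],
                float(f[6]), float(f[7]), float(f[8]))


def most_significant_first(lines):
    """Parse non-empty lines and sort peaks by -log10(q), largest first."""
    peaks = [parse_peak(line) for line in lines if line.strip()]
    return sorted(peaks, key=lambda p: p.neg_log10_q, reverse=True)
```

A typical use would be to open the peaks file and pass the file object to most_significant_first, for example to inspect the strongest peaks first.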
Study Cases
As a benchmark, we applied the SPAN peak calling approach to public conventional ChIP-seq datasets as well as to a ULI ChIP-seq dataset.
CD14+ classical monocytes tracks available in the ENCODE database were a natural choice for a conventional ChIP-seq dataset.
We also used the data from Hocking et al. to evaluate SPAN.
Chen C et al. presented an ultra-low-input micrococcal nuclease-based native ChIP (ULI-NChIP) and sequencing method to generate genome-wide histone mark profiles with high resolution and reproducibility from as few as one thousand cells. We used these tracks to evaluate the semi-supervised approach under extreme conditions.
SPAN produced high-quality peak calling in all of these cases, see report.
This suggests that SPAN Peak Analyzer can be used as a general-purpose peak calling solution.
Galaxy
SPAN is available as a tool in the official ToolShed for Galaxy. You can ask your Galaxy administrator to install it.
Error reporting
Report any errors or comments in the public SPAN issue tracker.
FAQ
Q: What is the average running time?
A: SPAN is capable of processing a single ChIP-seq track in less than 1 hour on a moderate laptop (MacBook Pro 2015).
Q: Which operating systems are supported?
A: SPAN is developed in the modern Kotlin programming language and can be executed on any platform supported by Java.
Q: Is differential peak calling supported?
A: This is an experimental feature. For details, see:
java -Xmx4G -jar span.jar compare -h
Q: Where is the SPAN source code?
A: The source code is available on GitHub.
Q: Where did you get this lovely span picture?
A: From ascii.co.uk. It seems the original author goes by the name jgs.
Modified Wed Aug 12 17:19:12 2020 UTC