ERDS

ERDS is open-source software, free to academia and non-profit organizations, designed for inferring copy number variants (CNVs) in high-coverage human genomes using next-generation sequencing (NGS) data. When a CNV is present in a test genome, multiple signatures, weak or strong, appear in the alignment data. ERDS starts from read depth (RD) information and integrates other information, including paired-end mapping (PEM) and soft-clip signatures, to call CNVs sensitively and accurately.

Download
The most recent version is 1.1.

Installation and usage
The hmm packages are written in C and must be compiled. In each of the hmm and phmm directories, type:

    make

Using ERDS is straightforward as long as you set the correct parameters. The input parameters can be given in any order.

    perl $ERDS_dir/erds_pipeline.pl -b $bam_file -v $variant_file -o $output_dir -r $ref_file OPTIONS

You must define the following parameters (order is not important; you can remember them as “brov”).

    -b <string>: specify the bam_file
    -v <string>: specify the predetermined SNV calls (in .vcf format).
    -o <string>: specify the output_dir
    -r <string>: specify the reference file
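
When ERDS is launched for many samples, the required flags above can be assembled programmatically. The following is a minimal sketch, not part of ERDS: build_erds_command is a hypothetical helper, all paths are placeholders, and only the four required flags documented above are set.

```python
import shlex
from pathlib import Path


def build_erds_command(erds_dir, bam_file, variant_file, output_dir, ref_file,
                       extra_args=()):
    """Return the argument list for one erds_pipeline.pl run.

    Hypothetical helper: only the four required flags (-b, -v, -o, -r)
    are set here; any extra arguments are passed through unchanged.
    """
    script = Path(erds_dir) / "erds_pipeline.pl"
    command = [
        "perl", str(script),
        "-b", str(bam_file),
        "-v", str(variant_file),
        "-o", str(output_dir),
        "-r", str(ref_file),
    ]
    command.extend(str(arg) for arg in extra_args)
    return command


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    cmd = build_erds_command(
        "/opt/erds", "sample.bam", "sample.vcf", "erds_out", "ref.fa"
    )
    # Print a shell-quoted command line; to actually launch ERDS, pass
    # `cmd` to subprocess.run(cmd, check=True) instead.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when file names contain spaces.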

You can optionally define the following parameters. Flags that act as switches take no argument.

    -c: if indicated, run on a cluster. By default, a single CPU is used.
    --samtools <string>: specify the samtools. By default, version 1.12 as attached.
    --sd <b36|b37|empty>: the build of the SD version, b36 or b37. By default empty.
    --exp_pct or -p: the expected percentage of deletions or duplications. By default 0.0025. (-p value)
    --name: the sample name. (--name sample_name)

You may also change the default settings of most parameters, either by specifying a new parameter file or by giving individual parameters on the command line. For help, type:

    perl $ERDS_dir/erds_pipeline.pl

Format of output files
ERDS writes output files in .events and .vcf formats. Each row in the events file corresponds to one detected CNV and has the format

    chromosome start end length CN_type summed_score precise_boundary reference_cn inferred_cn

  • Scores are calculated using a Poisson model. Generally, the higher the score, the more reliable the CNV, but length is also an important factor. Scores are missing for some small deletions. Those deletions were found by scanning low-coverage windows, and their scores would be less reliable even if they were calculated.
  • Precise_boundary is 1 if ERDS is confident that it inferred both the left and right boundaries at base-pair resolution, and 0 otherwise. It is always 0 for duplications.
  • Reference_cn is the number of copies of the region's sequence that ERDS estimates are present in the reference genome. In some regions this number differs between male and female genomes, even on autosomes.
  • Inferred_cn is the number of copies of the region's sequence that ERDS estimates are present in the sequenced genome. For a deletion it may equal reference_cn because of repeated sequences or inaccurate boundaries.
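
The nine columns described above can be read into a small record type for downstream filtering. This is a minimal sketch under stated assumptions, not an official parser: it assumes whitespace-separated fields in the order listed, and that a missing score appears as a non-numeric placeholder such as "NA". ErdsEvent and parse_event_line are hypothetical names, and the example row is invented.

```python
from typing import NamedTuple, Optional


class ErdsEvent(NamedTuple):
    chromosome: str
    start: int
    end: int
    length: int
    cn_type: str
    summed_score: Optional[float]  # None when the score is missing
    precise_boundary: bool
    reference_cn: str  # kept as text; the numeric format is not assumed
    inferred_cn: str


def parse_event_line(line):
    """Parse one row of an .events file into an ErdsEvent.

    Assumptions: whitespace-separated fields in the documented order;
    a missing score is written as a non-numeric token (e.g. "NA").
    """
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
    try:
        score = float(fields[5])
    except ValueError:
        score = None  # score not available for this call
    return ErdsEvent(
        chromosome=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        length=int(fields[3]),
        cn_type=fields[4],
        summed_score=score,
        precise_boundary=(fields[6] == "1"),
        reference_cn=fields[7],
        inferred_cn=fields[8],
    )


if __name__ == "__main__":
    # Invented example row, for illustration only.
    event = parse_event_line("1 10000 12000 2000 DEL 35.2 1 2 1")
    print(event)
```

Keeping summed_score optional makes it explicit that calls without a score need separate handling rather than being silently treated as zero.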

FAQs
Q1: What can ERDS do?
A1: ERDS can call deletions and duplications in whole-genome sequencing data from human genomes sequenced at high coverage (above 20X recommended) from a single library.

Q2: What can ERDS not do?
A2: ERDS is not functional in the following situations:
(a) Data that are not whole-genome sequencing data, merged whole-genome and whole-exome data, or data from multiple libraries.
(b) Sequencing coverage lower than 10X.
(c) Types of structural variation other than deletions and duplications.
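
To check a sample against the coverage thresholds in A1 and A2 before running ERDS, a rough estimate can be made from the read count. The sketch below is an illustration only: the function names are hypothetical, the default genome size of about 3.1 Gb is an assumption, and the estimate ignores unmapped, duplicate and clipped reads.

```python
def estimate_mean_coverage(n_reads, read_length_bp, genome_size_bp=3_100_000_000):
    """Rough mean coverage: total sequenced bases divided by genome size.

    The default genome size (~3.1 Gb) is an assumed value for a human
    genome, not a number taken from ERDS.
    """
    return n_reads * read_length_bp / genome_size_bp


def coverage_status(coverage):
    """Classify coverage against the thresholds stated in A1 and A2."""
    if coverage < 10:
        return "too low: ERDS is not functional below 10X"
    if coverage > 20:
        return "ok: above the recommended 20X"
    return "borderline: between 10X and the recommended 20X"


if __name__ == "__main__":
    # Hypothetical sample: 1.24 billion reads of 100 bp each.
    coverage = estimate_mean_coverage(1_240_000_000, 100)
    print(f"{coverage:.1f}X -> {coverage_status(coverage)}")
```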

Q3: What is the size range of ERDS calls?
A3: ERDS can call deletions as small as 200 bp and duplications as small as 1 kb, but this depends on the library size, the distribution of the library size and the parameter settings.

Q4: How fast/slow is ERDS?
A4: The main factors affecting the running time are the machine conditions, the network connectivity to the drives and the total number of reads in the alignment file. For a sample sequenced at 40X with paired-end reads, read length 100 bp, library size 300 bp and data stored on a local drive, ERDS (version 1.1) took about 18 hours to call CNVs on a PC with 8 GB of memory. With parallel computing on more than 20 nodes, it took about 2 hours.

Q5: I would like to specify more parameters for the input.
A5: You can manually change parameters in the parameter.txt file under the software directory. This parameter.txt file serves as the blueprint; the program uses the parameters in the file parameter-$samplename.txt under $output_dir. Alternatively, please contact me. Some common requests will be addressed in later versions.

Q6: I received error messages when running ERDS.
A6: Please contact me with a copy of the error messages.

Q7: The error message says “mv: cannot stat `Your/PATH/wierd_contig_name.sd’: No such file or directory”.
A7: This is a common error caused by missing sd files for contigs. The current sd files were prepared using the reference genome from the 1000 Genomes Project. If you are using an alternative reference, please do not specify --sd (it is empty by default), or contact me for solutions.

Q8: Will this tool be maintained regularly?
A8: In the foreseeable future, yes.

Q9: I was pissed off by ERDS. What are the other tools for detection of CNVs in whole genome sequencing data?
A9: Some well-known ones include, but are not limited to, Genome STRiP, CNVnator and BreakDancer.

Q10: I understand ERDS cannot be applied to whole exome data, but I want to call CNVs in such data. Do you have recommendations?
A10: Yes, please check XHMM or CoNIFER.

Q11: I want to annotate CNV calls. Do you have recommendations?
A11: Please check ANNOVAR.

Citing ERDS
Zhu M, Need AC, Han Y, Ge D, Maia JM, Zhu Q, Heinzen EL, Cirulli ET, Pelak K, He M, Ruzzo EK, Gumbs CE, Singh A, Feng S, Shianna KV and Goldstein DB. Using ERDS to Infer Copy-Number Variants in High-Coverage Genomes. The American Journal of Human Genetics. Volume 91, Issue 3, 7 September 2012, Pages 408-421.