DISCOVAR de novo

Haibao recommended this assembler.

http://www.broadinstitute.org/software/discovar/blog/?page_id=98

DISCOVAR (variant caller): suited to calling variants and assembling small genomes.

DISCOVAR de novo: suited to assembling large genomes.

Download:

ftp://ftp.broadinstitute.org/pub/crd/DiscovarDeNovo/latest_source_code/LATEST_VERSION.tar.gz

Installation:

General Instructions for Building Our Software

System Requirements

Software released from the CRD group at the Broad Institute is built and tested on a modern version of Linux for the x86_64 architecture. Our software does not run on 32-bit machines: you must have a 64-bit Linux system. Our users have successfully built and executed our software using a variety of Linux distributions including Ubuntu, RedHat, and SUSE. We expect that any flavor of x86_64 Linux will work fine, as long as it provides the necessary software prerequisites, as described below.

Basic Compiler and Library Requirements

We rely on reasonably up-to-date versions of these software packages:

  • GCC, with its associated g++ compiler for the C++ language, version 4.7 or later. We're now using C++11 features, and require a modern GCC.
  • For ARACHNE only, the Xerces-C library. The source can be downloaded from the Xerces-C download page. Supply the argument --with-xerces=/path/to/xercesc/installation when you run ./configure.

The Build Procedure

  • Move the package that you downloaded from our FTP site to a location on your system where you'd like to build the software.
  • Extract the contents by typing: tar xzf package.tgz
  • You'll have a new subdirectory in the current directory named after the package and revision. Enter it by typing: cd package (Some older releases spill all the source into the current directory rather than creating a new subdirectory. If that's the case, skip this step.)
  • Execute the configuration script by typing: ./configure (This assumes that you can copy executables to /usr/local/bin.) If you cannot, you should instead execute: ./configure --prefix=/path/to/install/directory Note: some older packages lack a configuration script. In that case, consider returning to the FTP site for a more recent version, or just skip this step.
  • Build the software by typing: make all
  • Install the software by typing: make install
  • The executables will be in /usr/local/bin or in /path/to/install/directory/bin.  Make sure this is on your path.

If this goes well, you're ready to go. Consult the manual for the package to learn how to set up your data and what programs to execute.
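
The build steps above could be scripted roughly as follows. This is only a sketch: the function name build_from_tarball is made up for illustration, and it assumes the release unpacks into its own subdirectory and ships a configure script (which, as noted above, some older releases do not).

```shell
#!/bin/sh
# Sketch: unpack, configure, build and install a source tarball.
# build_from_tarball is a hypothetical helper, not part of the package.
build_from_tarball() {
    tarball=$1                  # e.g. LATEST_VERSION.tar.gz
    prefix=${2:-/usr/local}     # install prefix; defaults to /usr/local

    # Top-level directory inside the archive (assumes the release
    # unpacks into its own subdirectory).
    srcdir=$(tar tzf "$tarball" | head -n 1 | cut -d/ -f1)
    [ -n "$srcdir" ] || return 1

    tar xzf "$tarball" || return 1
    (
        cd "$srcdir" || exit 1
        ./configure --prefix="$prefix" && make all && make install
    ) || return 1

    echo "Executables installed under $prefix/bin"
}

# Usage (not run here):
#   build_from_tarball LATEST_VERSION.tar.gz "$HOME/discovar"
#   export PATH="$HOME/discovar/bin:$PATH"
```

Passing an explicit prefix avoids needing write access to /usr/local/bin, and the subshell keeps the caller's working directory unchanged.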

Data requirements:

Sequencing data requirements summary
● Illumina MiSeq or HiSeq 2500 genome sequencers
● PCR-free library preparation
● 250 base paired end reads (or longer)
● ~450 base pair fragment size
● ~60x coverage
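
To get a feel for what ~60x means in read counts, here is a back-of-the-envelope sketch. It assumes the usual definition coverage = total sequenced bases / genome size; the function name pairs_needed and the 50 Mb example genome are illustrative only.

```shell
#!/bin/sh
# Sketch: read pairs needed to reach a target coverage.
# pairs = coverage * genome_size / (2 * read_length), rounded up.
pairs_needed() {
    genome_size=$1   # genome size in bases
    read_len=$2      # length of each read in a pair, in bases
    coverage=$3      # target fold coverage
    echo $(( (coverage * genome_size + 2 * read_len - 1) / (2 * read_len) ))
}

# 50 Mb genome, 2 x 250 bp reads, 60x coverage
pairs_needed 50000000 250 60    # prints 6000000
```

The estimate ignores reads lost to filtering or duplication, so treat the result as a lower bound.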

Input files
DISCOVAR requires a BAM file containing the raw reads from the sequencer. For variant calling it also
requires a matching reference FASTA file.

Variant-calling command:

DISCOVAR can currently generate variants for small regions, and not the entire genome at once. To
generate variants for a 100 kb region for example, use:
Discovar \
READS=reads.bam \
OUT_HEAD=assembly \
REGIONS=1:50000-150000 \
REFERENCE=genome.fasta
The complete set of variant calls for this region is given in the text file:
assembly.final.variant
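
Since variants are generated one small region at a time, a longer sequence has to be split into windows. The sketch below only prints candidate REGIONS values in the chrom:start-end form used above. The function name tile_regions, the 0-based window starts, and the 250 kb sequence length are assumptions for illustration and are not taken from the DISCOVAR manual.

```shell
#!/bin/sh
# Sketch: split a sequence into fixed-size windows and print one
# REGIONS value per window. tile_regions is a hypothetical helper.
tile_regions() {
    chrom=$1     # sequence name as it appears in the BAM/FASTA
    length=$2    # sequence length in bases
    window=$3    # window size in bases, e.g. 100000
    start=0
    while [ "$start" -lt "$length" ]; do
        end=$(( start + window ))
        # Clip the last window to the end of the sequence
        [ "$end" -gt "$length" ] && end=$length
        echo "$chrom:$start-$end"
        start=$end
    done
}

# Print 100 kb windows over a 250 kb sequence named "1"
tile_regions 1 250000 100000
```

Each printed value could then be passed as REGIONS= in a separate Discovar run, with a distinct OUT_HEAD per window so the outputs do not overwrite each other.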

Input files
BAM files
The reads to assemble must be in a BAM file or files. The name of the BAM file is specified with the
required argument READS:
READS=filename
Multiple BAM files are specified using a comma-separated list:
READS=filename1,filename2,...
Alternatively, the BAM files can be listed in a separate file containing the BAM filenames, one per
line:
READS=@listfilename
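
A list file of this kind can be produced with a few lines of shell. This is a sketch: make_bam_list and the file name bams.txt are hypothetical, and whether DISCOVAR expects relative or absolute paths in the list is not stated in the excerpt.

```shell
#!/bin/sh
# Sketch: write the BAM files found in a directory to a list file,
# one filename per line. make_bam_list is a hypothetical helper.
make_bam_list() {
    dir=$1     # directory containing the BAM files
    list=$2    # output list file
    : > "$list"                       # start with an empty list
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue     # skip if the glob matched nothing
        echo "$bam" >> "$list"
    done
}

# Usage (not run here):
#   make_bam_list ./bams bams.txt
#   Discovar READS=@bams.txt OUT_HEAD=assembly REGIONS=all
# The same list as a comma-separated value for READS=:
#   paste -sd, bams.txt
```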

DISCOVAR calls SAMtools internally to extract reads from the BAM.

Reference file (optional)
This is only required if you are using DISCOVAR as a variant caller. The reference information is used
only for variant calling and not in the assembly process. Specifying a valid FASTA reference file is all
that is required to cause DISCOVAR to generate variants.
To specify a reference FASTA file use the optional argument REFERENCE:
REFERENCE=filename

It should be the same file that was used to generate the alignments in the input BAM file(s), or at least
should share the same coordinate system. The FASTA record names should match those in the BAM
file. Ns are allowed. In addition to the reference FASTA file, DISCOVAR also requires the associated index file (.fai).
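
The .fai index is the format written by samtools faidx, so one plausible way to create it is sketched below. The function name ensure_fai is hypothetical, and the assumption that DISCOVAR looks for the index next to the FASTA as <fasta>.fai comes from the SAMtools convention, not from the excerpt above.

```shell
#!/bin/sh
# Sketch: make sure a FASTA file has a .fai index next to it.
# ensure_fai is a hypothetical helper; samtools must be on PATH
# if the index still has to be built.
ensure_fai() {
    fasta=$1
    if [ ! -s "$fasta.fai" ]; then
        # samtools faidx writes the index to "$fasta.fai"
        samtools faidx "$fasta" || return 1
    fi
    echo "$fasta.fai"
}

# Usage (not run here):
#   ensure_fai genome.fasta
#   Discovar READS=reads.bam OUT_HEAD=assembly \
#       REGIONS=1:50000-150000 REFERENCE=genome.fasta
```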

DISCOVAR can currently de novo assemble small genomes (up to 50 Mb), with larger genome support
to come soon.

The syntax for DISCOVAR de novo assembly is:
Discovar READS=bamfilenames \
OUT_HEAD=outputfilename \
REGIONS=all

For example:
Discovar READS=reads.bam \
OUT_HEAD=assembly \
REGIONS=all

This will take as input all the reads in the BAM file reads.bam, generate an assembly, then write the
output to a set of files prefixed with assembly.

Date: 2024-10-09 22:00:28
