Why you should QC your reads AND your assembly?

Carp genome: http://www.ntv.cn/a/20140923/52953.shtml

The data quality control behind the carp genome sequencing has been called into question.

Graham Etherington

http://grahametherington.blogspot.co.uk/2014/09/why-you-should-qc-your-reads-and-your.html

The genome sequence of the Common Carp (Cyprinus carpio) was published in Nature last week. By coincidence, I was doing some QC on some domesticated Ferret (Mustela putorius furo) reads, which had thrown some kmer warnings in the FastQC tool. I blasted the kmers at NCBI and was quite perplexed by the number of hits that I found in the carp genome: nearly all of the first 150 hits were from it. Anyway, I looked a bit further into my odd kmers and it turns out that they were the ends of some Illumina adapter sequences, which had presumably been incorporated into the paired reads from the shorter end of the insert-size distribution. This then took me back to the Carp Genome: what had crept into that?
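To illustrate how the start of an adapter ends up at the 3' end of a read when the insert is shorter than the read length, here is a minimal Python sketch. It is not what FastQC does internally; the function name `find_adapter_start`, the `min_overlap` parameter and the toy sequences are all assumptions made for the example, and only exact matches are considered.

```python
def find_adapter_start(read, adapter, min_overlap=5):
    """Return the index in `read` where `adapter` (or a prefix of it that
    runs to the end of the read) begins, or None if there is no such match.

    Hypothetical helper: exact matching only, no mismatches or indels.
    """
    for start in range(len(read) - min_overlap + 1):
        # Slice is at most len(adapter) long; near the read end it is shorter.
        tail = read[start:start + len(adapter)]
        if adapter.startswith(tail):
            return start
    return None


# Toy example: a 10 bp insert sequenced with 18 bp reads runs into the adapter.
adapter = "AGATCGGAAGAGC"          # illustrative adapter string
insert = "ACGTACGTAC"              # made-up insert
read = (insert + adapter)[:18]     # the read covers the insert plus 8 adapter bases

cut = find_adapter_start(read, adapter)
trimmed = read if cut is None else read[:cut]
print(cut, trimmed)
```

In this toy case the read is cut at position 10, which recovers the insert. Real trimming tools also tolerate sequencing errors, which this sketch deliberately does not.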

In the paper, the authors state that they used 454, Illumina and SOLiD sequencing and also used some previously published BAC-end sequences. The BAC-end and 454 sequences were assembled with the Celera assembler, and the Illumina, SOLiD and 454 8kb mate-pair sequences were mapped to the assembly to construct the scaffolds. Finally, they used the pairing information from the short paired-end reads to fill the gaps between the scaffolds. The final assembly consists of 9377 scaffolds.

The only quality control they speak of is "We then filtered out low-quality and short reads to obtain a set of usable reads".

So I thought I'd look at what was actually in their assembly. I downloaded the Carp genome assembly (9377 scaffolds) and created a blast database from it. I then created a fasta file of Illumina adapter sequences (found here) and used them as query sequences to blast against the Carp genome. There is some redundancy in the Illumina adapter sequences, so I collapsed them, retaining only unique sequences, and then removed any adapter sequences that were sub-sequences of a longer adapter (the final file consisted of 81 sequences). The blast resulted in 3750 hits (E-value < 8.00E-06), of which 1009 were of 100% identity.
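The collapsing step described above (keep unique sequences, then drop any that are sub-sequences of a longer adapter) could be sketched in Python roughly as follows. This is not the script actually used for the post; the function name `collapse_adapters` and the example sequences are assumptions for illustration, and reverse complements are not considered.

```python
def collapse_adapters(seqs):
    """Return unique sequences that are not contained in a longer sequence.

    Hypothetical helper: comparison is case-insensitive and exact, and
    reverse complements are treated as different sequences.
    """
    # De-duplicate, then visit the longest sequences first so that every
    # shorter sequence is checked against all longer ones already kept.
    unique = sorted({s.strip().upper() for s in seqs if s.strip()},
                    key=lambda s: (-len(s), s))
    kept = []
    for seq in unique:
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


# Illustrative input: one duplicate (differing only in case) and one
# sequence that is a prefix of a longer one.
adapters = [
    "AGATCGGAAGAGC",
    "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
    "agatcggaagagc",
    "CTGTCTCTTATACACATCT",
]
for seq in collapse_adapters(adapters):
    print(seq)
```

Sorting by decreasing length means each sequence only has to be compared with the ones already kept, which is plenty fast for a file of a few hundred adapters.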

This gave me a final tally of at least 20 Illumina adapter sequences incorporated into the final Common Carp genome assembly. Out of the 9377 scaffolds, 277 appear to have Illumina adapter sequences in them. I've included the counts of the different Illumina adapter sequences (non-redundant) for the scaffolds at the bottom of the page.
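Tallies like these can be pulled out of a blast result with a few lines of Python. The sketch below assumes the hits were saved in BLAST's 12-column tabular format (`-outfmt 6`), which the post does not state; the function name `tally_hits`, the sample rows and the choice to count scaffolds over all hits passing the E-value cutoff are likewise assumptions.

```python
import csv
from collections import Counter
from io import StringIO


def tally_hits(handle, max_evalue=8e-6):
    """Summarise adapter-vs-assembly hits from BLAST tabular output.

    Assumed columns (outfmt 6): qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    """
    hits = 0
    perfect = 0
    scaffolds = set()
    per_adapter = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        adapter, scaffold = row[0], row[1]
        pident, evalue = float(row[2]), float(row[10])
        if evalue >= max_evalue:
            continue  # keep only hits below the E-value cutoff
        hits += 1
        if pident == 100.0:
            perfect += 1
        scaffolds.add(scaffold)
        per_adapter[adapter] += 1
    return {"hits": hits, "perfect": perfect,
            "scaffolds": len(scaffolds), "per_adapter": per_adapter}


# Made-up rows standing in for a real result file.
sample = StringIO(
    "adapterA\tscaffold1\t100.000\t33\t0\t0\t1\t33\t500\t532\t2e-10\t62.1\n"
    "adapterA\tscaffold2\t96.970\t33\t1\t0\t1\t33\t90\t122\t1e-07\t54.0\n"
    "adapterB\tscaffold1\t100.000\t19\t0\t0\t1\t19\t10\t28\t0.002\t30.0\n"
)
summary = tally_hits(sample)
print(summary)
```

With a real file, the same function can be called on `open("hits.tsv")`, where the file name is a placeholder.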

I've not looked for adapter sequences used in SOLiD or 454 sequencing yet. It would be interesting to see what that throws up.

So, there is a lesson to be learned here: QC your assembly, especially if you're not overly stringent with your read QC.

Here's the data:
Common Carp genome scaffolds
Illumina adapter sequences
Illumina adapter sequences collapsed
Illumina adapters v Carp genome blast

Time: 2024-08-06 12:45:49
