Quality assessment and quality control of NGS data

http://www.molecularevolution.org/resources/activities/QC_of_NGS_data_activity_new

table of contents

expected learning outcomes

The objective of this activity is to understand some relevant properties of an ensemble of next generation sequencing reads such as length, quality scores and base distribution in order to assess the quality of the data and to discard low quality reads.

Next-generation sequencing technologies are becoming widely used, and although a massive number of sequences can be generated in a single experiment, it is still important to characterize the reads. Characterizing reads may prevent undesirable outcomes in the assembly or mapping processes. Discarding low-quality reads, controlling for contamination, and trimming adaptor sequences are examples of preliminary quality control procedures that can be applied to raw reads before further analysis.

We will use basic UNIX commands, the FASTX-Toolkit, and FastQC applied to two small 454 and Illumina datasets.

getting started

  1. Make sure you have the following programs installed (if you used the customized USB flash drive for software installation, you already have them):

    1. FASTX-Toolkit
    2. FastQC
    3. Perl and Bioperl
    4. R
    5. Gnuplot
  2. The data you are going to use have been copied from your USB flash drive during the software installation to the~/wcg_oct2011/activities/QC_of_NGS_data folder. Go to that folder and you should find:
    1. bartonella_illumina.fastq: a small subset of a Illumina run (v.1.5) in a FASTQ file
    2. bartonella_454.fasta and bartonella_454.qual: sequences and qualities for small subset of a 454 Titanium run
    3. plotLengthDistribution.R: a simple R script to plot the distribution of read lengths, taking as input a two column tab file where the second column is the length of the sequences (and the first column is read ID, although this information is not used)
    4. fastx_nucleotide_distribution.R: an R script to plot the distribution of nucleotides along the read position (similar to the FASTX-Toolkit tool fastx_nucleotide_distribution.sh, uses the same input file)
    5. fastx_quality_boxplot.R: an R script to plot the quality of nucleotides along the read position (similar to the FASTX-Toolkit tool fastx_quality_boxplot.sh, uses the same input file)
  3. The Perl scripts that are going to be used in this activity have been installed, if you are interested to look at them:
    1. fastaNamesSizes.pl: takes a sequence file as input (e.g. fastq). The output is a list of sequence IDs followed by the length of each sequence; requires BioPerl
    2. fastaQual2fastq.pl: takes a FASTA and a quality (.qual) file and outputs a FASTQ file; requires BioPerl
  4. Have a look at the documentation for FASTX-Toolkit. Get an idea of the different tools and what they do. In general, you can get help on each program by typing program_name -h

FIX FOR MISSING SCRIPTS

Execute the following commands:

cd ~/wcg_oct2011/local/scripts
wget http://www.molecularevolution.org/molevolfiles/exercises/QC_of_NGS/scripts.tar.gz
tar xvfz scripts.tar.gz

exercise 1: checking Illumina data with the FASTX-Toolkit

The goal of this exercise is to inspect the sequence data of an Illumina run. The sequence data belongs to the intracellular bacterium Bartonella bacilliformis which is zoonotic pathogen transmitted by insect vectors. Although FASTQ is the main file received from a sequence provider, some users want to perform the base calling step themselves, using a different package than the proprietary Illumina software. This is not covererd in this exercise, and we start directly from the FASTQ file.

    1. Inspect the data contained in the sequence file, bartonella_illumina.fastq.
    2. Using the Perl script fastaNamesSizes.pl, calculate the length of each sequence in the fastq file:

      fastaNamesSizes.pl -f fastq-illumina bartonella_illumina.fastq > bartonella_illumina.ns

      -f fastq-illumina = specifies what format the input sequence file is in. Format options are specified in Bioperl (see link for more information about the fastq format specified)
      > bartonella_illumina.ns = specifies the output file

      Have a look at the numbers output on the terminal. How many sequences do we have?

      This information can found in the standard error:
      # 10000 sequences, total length 380000.
      # Minimum len: 38. Max: 38. Average: 38.

      Inspect the output.
      [show answer]

      Answer:
      The output will be in tabular format, with each line containing the sequence ID and length.

    3. Examine the Perl script in a text editor and inspect it to see how the script works. Note how the script (less `which fastaNamesSizes.pl`).

(Note: this is the same as using which to find the file location, and then less to open the file at that path. The backticks (not single quotations) specify that the which command should be performed before less).

    1. Convert the FASTQ file into FASTA format using the Fastx Toolkit script fastq_to_fasta:

      fastq_to_fasta -n -i bartonella_illumina.fastq > bartonella_illumina.fasta

-n = keeps sequences with unknown (N) nucleotides
-i bartonella_illumina.fastq = specifies the input file

    1. Compute quality statistics for the bartonella_illumina dataset, using fastx_quality_stats

      fastx_quality_stats -i bartonella_illumina.fastq > bartonella_illumina.qualstats

    2. View the statistics file with a text editor. What do the columns represent?

[show answer]

Answer:
Run the program with -h to show help documentation which contains this information. The format of this output is described as the "old" style.

    1. (optional): Produce a more extensive output by running the same command with the -N flag. View it. What are the extra columns?

[show answer]

Answer:
The ‘new‘ output contains the original information followed by the analysis for each nucleotide at each cycle (explained in the help documentation).

    1. Draw a box-plot of the reads quality with fastq_quality_boxplot_graph.sh:

      fastq_quality_boxplot_graph.sh -i bartonella_illumina.qualstats -o bartonella_illumina.boxplot.png
      (or, alternatively with R: Rscript fastx_quality_boxplot.R bartonella_illumina.qualstats bartonella_illumina.boxplot2.png)

    2. Open the resulting png file (with open or Preview on a Mac, with Firefox or eog on Ubuntu from the terminal). What do x and y axis represent? What do the boxes represent? What can you say about the quality in general? What is the average probability of error? How is the quality varying along the read?

[show answer]

Answer: 
x axis represents the read length, y axis indicates the Phred quality score. From the top each box represents upper whisker, Q1, median, Q3 and lower whisker.
The sequence reads have a high quality. The probability of error is about 0.0001. The quality drops towards the end of the reads.

    1. Draw the distribution of nucleotides with fastx_nucleotide_distribution_graph.sh:

      fastx_nucleotide_distribution_graph.sh -i bartonella_illumina.qualstats -o bartonella_illumina.nucdistr.png
      (or, alternatively with R: Rscript fastx_nucleotide_distribution.R bartonella_illumina.qualstats bartonella_illumina.nucdistr2.png)

    2. Examine the image from the resulting png file. What can you say about the distribution of nucleotides? Any estimation of the G+C content?

[show answer]

Answer: 
The sequence reads are A-T rich. The estimate of G+C content is about 30-40%.

exercise 2: checking 454 data with the FASTX-Toolkit

The goal of this exercise is to do the same kind of quality checks as in exercise 1, but on 454 data this time. The primary data from 454 is stored in a sff file, but in general, FASTA and qual files are also provided. It is possible to parse the sff files with different parameters using the proprietary software provided by 454 (sfffile/sffinfo) or a free, open-source tool,sff_extract. Sff extraction is not covered in this exercise, and we start from the bartonella_454.fasta andbartonella_454.qual files. How would you perform the following tasks?

    1. Examine the FASTA and qual files.
    2. Some functions of FASTX-Toolkit do not work with FASTA-formatted sequences on multiple lines, thus it is sometimes necessary to transform the file so that fasta_formatter each sequence is on a single line.

[show answer]

Answer: 
fasta_formatter < bartonella_454.fasta > bartonella_454_1.fasta

    1. Some FASTX-Toolkit programs can only take FASTQ as an input, thus necessitating the conversion of FASTA and qual files to a FASTQ file, which can be done using a short Bioperl script:

      fastaQual2fastq.pl bartonella_454.fasta bartonella_454.qual > bartonella_454.fastq

    2. Inspect the resulting FASTQ file
    3. Now calculate the length of each sequence using the fastaNamesSizes.pl script.

[show answer]

Answer: 
fastaNamesSizes.pl bartonella_454.fasta > bartonella_454.ns

    1. Explore the output with a text editor. What would you say about the statistics on length and number of sequences?

Standard error:
# 10000 sequences, total length 3666752.
# Minimum len: 40. Max: 633. Average: 367

Although the average read length is 367 bp, very few sequences are ~200-300 bp. The read length of the majority of the sequences in the file fits in two major groups: less than 100bp and more than 300bp.

    1. Plot the distribution of read lengths using a short R script. Open the resulting png file (with open on a Mac or eog on Ubuntu from the terminal):

      Rscript plotLengthDistribution.R bartonella_454.ns bartonella_454.readDistr.png

    2. (optional) Examine the script.
    3. Can you say something about the distribution? What is the mean length? the median?

[show answer]

Answer: 
The read length distribution is bimodal. Mean read length=367 bp, Median=449 bp.

    1. (optional) Modify the R script to display vertical bars of different colors at the mean and median read length. Hint: useabline(v=...) command in the R script.

[show answer]

Answer: 
The following code can be inserted after the hist line and before the dev.off()line:
abline(v=mean(data$V2),col="red")
abline(v=median(data$V2),col="blue")

    1. Now compute the quality statistics for this subset and look at them in a text editor. Use fastx_quality_stats and view the results in a text editor. Use also the -N switch to produce extensive output.

[show answer]

Answer: 
fastx_quality_stats -i bartonella_454.fastq > bartonella_454.qualstats
fastx_quality_stats -i bartonella_454.fastq -N > bartonella_454.extqualstats

    1. Now plot the quality distribution along the sequence with fastq_quality_boxplot_graph.sh and view the resulting png plot (alternatively, use the R script). What do you see?

[show answer]

Answer: 
fastq_quality_boxplot_graph.sh -i bartonella_454.qualstats -o bartonella_454.qualstats.png
The quality of the sequence reads is high but it drops after ~400bp.

    1. Draw the distribution of nucleotides with fastx_nucleotide_distribution_graph.sh (or with the Rscript) and view the result. What do you see?

[show answer]

Answer: 
fastx_nucleotide_distribution_graph.sh -i bartonella_454.qualstats -o bartonella_454.nucdistr.png
The sequence reads are G-C rich. The distribution of nucleotides is not uniform at the 5‘ end of the reads.

    1. We need to trim that down. Use head to take only the first lines of the stats file and output it to another file, and redraw the figure. What do you see now? What is your conclusion about this dataset? What should be done about it?

[show answer]

Answer: 
head -n 51 bartonella_454.qualstats > bartonella_454.50f.qualstats

fastx_nucleotide_distribution_graph.sh -i bartonella_454.50f.qualstats -o bartonella_454.50f.nucdistr.png

The distribution of the nucleotides is not uniform in the first 10 nucleotide position of the reads. Adaptor sequences at the 5‘ end are included in the sequenced reads. The sequence reads need to be trimmed.

exercise 3: fixing extra adaptor sequence with the FASTX-Toolkit

Most of the sequences have some adaptor sequences at the 5‘ end. We need to remove these adaptor sequences to avoid problems with the assembly or mapping. We will use fastx_trimmer to do this.

    1. fastx_trimmer trims a given number of nucleotides, either from the beginning or from the end of the sequence.

[show answer]

Answer: 
fastx_trimmer -f 11 -m 50 -i bartonella_454.fastq -o bartonella_454.trim.fastq
-f 11 = the position on the read (from the 5‘ end) of first nucleotide to keep. The first 10 bp will be removed from the 5‘ end.
-m 50 = the minimum final read length. Reads shorter than this length will be discarded.

    1. Calculate the nucleotide distribution statistics for the trimmed reads and redraw the graph. How does it look now?

[show answer]

Answer: 
fastx_quality_stats -i bartonella_454.trim.fastq | head -n 100 > bartonella_454.trim.qualstats

fastx_nucleotide_distribution_graph.sh -i bartonella_454.trim.qualstats -o bartonella_454.trim.nucdistr.png

The distribution of the nucleotides is now similar over the length of the reads.

    1. (optional): Why is it important to trim x bp from the start of all of the reads rather than searching for the adapter sequence and removing it?

[show answer]

Answer:
If the adapter was removed by searching for a specific sequence then reads with sequencing or PCR errors in the adapter would not being trimmed. These untrimmed reads would cause problems with assembly.

exercise 4: checking Illumina data with FastQC

Another tool to assess sequence quality is FastQC. FastQC can operate in two different modes - as a stand-alone interactive application or in a non-interactive mode which is more suitable for larger pipelines. For the purposes of this activity we will be using the interactive application. To start FastQC:

If you are running the WCG Ubuntu Linux distribution, first run the following commands:
rm ~/wcg_oct2011/local/bin/fastqc
ln -s ~/wcg_oct2011/software/FastQC/fastqc ~/wcg_oct2011/local/bin/fastqc

Then, you should be able to run fastqc from anywhere like so:

fastqc

Or if you are running on Mac OS X, navigate to the ~/wcg_oct2011/software/fastqc_v0.9.3 folder (already copied from your USB during the software installation) and double-click the FastQC icon.

FastQC may take a moment to open and should display a blank window once it has.

Load the bartonella_illumina.fastq Illumina file (File->Open) from the ~/wcg_oct2011/activities/QC_of_NGS_datadirectory. Once loaded, FastQC will process it and generate the report. This report can be exported (File->Save report...) as a zipped HTML folder which makes for easy sharing. You can view the results either within the FastQC application or theexported report. The FastQC documentation provides a detailed explanation of the tooland an explanation of each module report.

One nice feature of FastQC is that it easily handles and detects the many FASTQ variants. Within the Basic Statisticsreport, it shows that our data was Illumina 1.5 encoded.

The per base sequence quality plot is similar to the plot we produced in Exercise 1 with thefastq_quality_boxplot_graph.sh FastX-Toolkit command. Are there any differences between the two plots?

Explore the rest of the plots. What new information can be gathered about this Illumina dataset?

exercise 5: checking 454 data with FastQC

Load the bartonella_454.fastq file (File->Open) that we created from the 454 bartonella_454.fasta andbartonella_454.qual files in Exercise 2, step 3. Note that FastQC handles the following file types:

  • FASTQ (all variants)
  • Colorspace FASTQ (SOLiD)
  • GZip compressed FASTQ
  • SAM
  • BAM
  • SAM/BAM Mapped only (normally used for colorspace data)

You can view the results either within the FastQC application or the exported report.

You will notice for this dataset it shows that our data was Illumina 1.5 encoded. If you remember in exercise 2, we used thefastaQual2fastq.pl script to convert the .fasta and .qual files to FASTQ and it encoded them as Illumina 1.5.

You should notice in this case that 5 of the modules are marked with the red cross as "very unusual." Why was this the case for the 454 data? Try loading the bartonella_454.trim.fastq file to see the impact of our quality control efforts.

Another nice feature of FastQC is that many of the plots bin the nucleotide position on the x-axis which is useful for long reads such as 454. It is especially useful for the per base sequence content plot.

FastQC also provides example datasets on their website for a good Illumina datasetpoor Illumina dataset, a 454 dataset, and PacBio dataset.

see also

Other tools to verify quality of second-generation sequencing results are available:

    • Galaxy, a web-based genomics pipeline, in which FASTX-Toolkit and FastQC are integrated.
    • PRINSEQ, either as a standalone package or through a web-interface can generate summary statistics of sequence and quality data, which can subsequently be used to filter, reformat and trim next-generation sequence data.
    • Perl and Bioperl, to write small scripts. There already exist a very large number of packages devoted to genomics in Bioperl.
    • R and Bioconductor, are other solution to import and verify data. There are many contributed packages and modules available.
时间: 2024-10-11 17:50:00

Quality assessment and quality control of NGS data的相关文章

Paper | Blind Quality Assessment Based on Pseudo-Reference Image

目录 1. 技术细节 1.1 失真识别 1.2 得到对应的PRI并评估质量 块效应 模糊和噪声 1.3 扩展为通用的质量评价指标--BPRI 归一化3种质量评分 判断失真类型 加权求和 2. 总结 这一篇应该是继<BLIND QUALITY ASSESSMENT OF COMPRESSED IMAGES VIA PSEUDO STRUCTURAL SIMILARITY>(2016 ICME)之后的拓展工作.后者是将压缩图像再压缩,比较二者伪结构(压缩块角)的相似度:而本文就是将方法一般化,产生

Paper | Quality assessment of deblocked images

目录 1. 故事 2. 失真变化 3. 方法(PSNR-B) 4. 实验 这篇文章提出了一个PSNR-B指标,旨在衡量 压缩图像的块效应强度 或 去块效应后的残留块效应强度(比较去块效应算法的优劣). 1. 故事 现有的PSNR虽然形式简单.物理意义清晰,但与主观质量关系不大:SSIM(同时考虑亮度相似度.对比度相似度和结构相似度)和主观质量更贴近,但无法反映块效应强度. 2. 失真变化 首先,我们设无损图像为\(x\),编解码后为压缩图像\(y\),去压缩失真后的图像为\(\tilde{y}\

关于VNX中的control station 和data mover

control station is the main thing where the total system operations are monitored , maintained , and datamover is another component datamover is for file access , file management ... VNX have disks right , there data will be stored , and when a NAS u

Code Complete阅读笔记(二)

2015-03-06   328   Unusual Data Types    ——You can carry this technique to extremes,putting all the variables in your program into one big,juicy variable and then passingit everywhere.Careful programmers avoid bundling data any more than is logically

让Quality Center走下神坛--测试管理工具大PK(转)

让Quality Center走下神坛--测试管理工具QC/ALM 和 RQM.Jira.TP.SCTM大PK 在写完了<让QTP走下神坛>之后,现在来谈谈测试管理工具,献给所有正在或打算做测试管理工作的同行. 当然,话题离不了Quality Center——但又不只是谈QC,我会结合对比各种主流的企业级测试管理工具,包括标题提到的:HP QC/ALM.IBM RQM.51Testing TP.Micro Focus SCTM.Atlassian Jira.但是不会提及Bugzilla.Bug

Question: Should I use reads with good quality but failed-vendor flag?--biostart for vendor quality

https://www.biostars.org/p/198405/ Quick question is: I have some mapped reads in bam file which have good read quality, but they have sam flag 0x200 which means they didn't pass the vendor check. Should I include them or not in downstream analysis?

Accessing and Updating Data in ASP.NET: Retrieving XML Data with XmlDataSource Control

XmlDataSource Basics The XmlDataSource control exists merely as a proxy for retrieving XML data, which can then be programmatically accessed or bound to a data Web control. To access XML data from an ASP.NET page using the XmlDataSource control, star

vs2005中microsoft ado data control 6.0控件问题

在vs2005中是没有这个控件的,需要注册,步骤如下: 1. 先到C:\WINDOWS\system32目录下看看你的系统里是否已经有了MSADODC.ocx和MSDATGRD.ocx这两个文件(多半是没有的),没有就去下载: 2.在VS 2005中注册MSADODC.ocx和MSDATGRD.ocx这两个控件.在VS 2005中点击"Tools" --> "Visual Studio 2005 Command Prompt",输入如下命令即可: regsvr

My journey introducing the data build tool (dbt) in project’s analytical stacks

转自:https://www.lantrns.co/my-journey-introducing-the-data-build-tool-dbt-in-projects-analytical-stacks/ Not sure I remember how, but I had the good luck a few weeks ago to stumble upon posts from Tristan Handy where he mentioned a tool his team built