Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads – BMC Bioinformatics

Features

The algorithm and statistical scoring schemes are implemented as a Linux program, Skewer, using C++. A comparison of the main features of Skewer with those of existing mainstream adapter trimmers is presented in Table 1.

Table 1
Main features of various adapter trimmers

Experiment environment

The server used for the experiments had 4 × 8-core Intel® 2.67 GHz CPUs, 1 TB of memory, and RAID storage with bandwidths of 266 MB/s and 262 MB/s for reading and writing, respectively. The operating system (OS) was Red Hat Enterprise Linux Server release 6.3.

Experiments on simulated data

General information

We simulated 10 million 100 bp + 100 bp PE Solexa reads from the Arabidopsis thaliana genome using ART, an NGS read simulator [15], with some revisions to the generator code for simulating adapter-contaminated reads (http://sourceforge.net/projects/skewer/files/Simulator/). The error profile was taken from real sequencing data of A. thaliana in which approximately 36% of the reads were contaminated with adapters. We compared Skewer with mainstream adapter trimmers that can handle PE reads as well as four example adapter trimmers that can handle only SE reads.
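To make the notion of an adapter-contaminated read concrete, the following minimal sketch shows how contamination arises when the sequenced fragment is shorter than the read length: the read runs past the fragment end into the adapter. This is not the modified ART generator described above. The function name `simulate_read`, the random padding, and the adapter string used in the example are illustrative assumptions.

```python
import random


def simulate_read(fragment, adapter, read_len, rng):
    """Return (read, is_contaminated) for one simulated read.

    Hypothetical helper: if the fragment is shorter than the read length,
    the read continues into the adapter, which is what a trimmer must remove.
    """
    if len(fragment) >= read_len:
        # The fragment covers the whole read, so no adapter is sequenced.
        return fragment[:read_len], False
    tail = fragment + adapter
    # Pad with random bases if fragment + adapter is still shorter than the read.
    while len(tail) < read_len:
        tail += rng.choice("ACGT")
    return tail[:read_len], True


rng = random.Random(0)
adapter = "AGATCGGAAGAGC"  # illustrative adapter prefix, chosen for the example
print(simulate_read("ACGTACGT", adapter, 12, rng))
print(simulate_read("ACGTACGTACGTAAAA", adapter, 12, rng))
```

A real simulator would also apply a per-base error profile to the read; that step is omitted here.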

To assess trimming quality, we defined the following metrics: TP (true positive) as the number of correctly trimmed reads; FP (false positive) as the number of reads that were over-trimmed, either by trimming non-contaminant reads (false trimming), noted as $FP_{ft}$, or by over-trimming contaminant reads, noted as $FP_{ot}$; FN (false negative) as the number of reads that were under-trimmed, either by not trimming contaminant reads (false retaining), noted as $FN_{fr}$, or by under-trimming contaminant reads, noted as $FN_{ut}$; and TN (true negative) as the number of untrimmed non-contaminant reads. From these numbers, we defined the positive predictive value (PPV) as the ratio of the number of correctly trimmed reads to the number of trimmed reads; sensitivity (Sen) as the ratio of the number of correctly trimmed reads to the number of contaminant reads; and specificity (Spec) as the ratio of the number of untrimmed non-contaminant reads to the number of non-contaminant reads, as follows:

$$\mathrm{PPV} = \frac{TP}{TP + FP_{ft} + FP_{ot} + FN_{ut}} \tag{1}$$

$$\mathrm{Sen} = \frac{TP}{TP + FN_{fr} + FN_{ut} + FP_{ot}} \tag{2}$$

$$\mathrm{Spec} = \frac{TN}{TN + FP_{ft}} \tag{3}$$

Finally, we defined the Matthews correlation coefficient (mCC), which is a measure of choice for pattern recognition, as:

$$\mathrm{mCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4}$$

where $FP = FP_{ft} + FP_{ot}$ and $FN = FN_{fr} + FN_{ut}$.
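As a worked illustration of equations (1) to (4), the sketch below computes the four metrics from the six read counts. The function name `trimming_metrics` and the counts in the example call are hypothetical and are not taken from the paper's tables; the sketch assumes the denominators of PPV, Sen, and Spec are non-zero.

```python
import math


def trimming_metrics(tp, tn, fp_ft, fp_ot, fn_fr, fn_ut):
    """Compute PPV, Sen, Spec and mCC from read counts (hypothetical helper)."""
    fp = fp_ft + fp_ot  # all over-trimmed reads
    fn = fn_fr + fn_ut  # all under-trimmed reads
    ppv = tp / (tp + fp_ft + fp_ot + fn_ut)   # Eq. (1): correct / trimmed
    sen = tp / (tp + fn_fr + fn_ut + fp_ot)   # Eq. (2): correct / contaminated
    spec = tn / (tn + fp_ft)                  # Eq. (3): untrimmed / clean
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Eq. (4); returning 0.0 for a zero denominator is a common convention,
    # adopted here as an assumption.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"PPV": ppv, "Sen": sen, "Spec": spec, "mCC": mcc}


# Made-up counts, for illustration only.
metrics = trimming_metrics(tp=90, tn=100, fp_ft=2, fp_ot=3, fn_fr=4, fn_ut=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that under-trimmed contaminant reads ($FN_{ut}$) count against PPV because they were trimmed, only not enough, while over-trimmed contaminant reads ($FP_{ot}$) count against sensitivity because they are contaminant reads that were not trimmed correctly.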

Primary result

Each method was run with its default parameters, except that the minimum output fragment length and the thread number (if applicable) were set to 1. The results obtained from these runs are listed in Table 2 and details are available in Table S1 of Additional file 1. FastX, an earlier and widely adopted NGS adapter trimmer, had a relatively low mCC (0.6683) and a low processing speed (0.92 Mbp/s); SeqTrim had a similar overall performance to FastX (0.6618), but it had the slowest processing speed (0.03 Mbp/s) of all the trimmers tested despite its extensive logging utility; TagCleaner was clearly the most conservative of the trimmers (FP = 0), but it had the lowest sensitivity (45.50%) and was notably slower than FastX (58.7% of the speed); EA-Tools had the highest sensitivity (99.72%) for processing SE reads and was orders of magnitude faster (13× to 400×) than the slow trimmers; Cutadapt, the most widely accepted adapter trimmer, exhibited a good compromise between sensitivity and specificity (96.27% vs. 96.93%), and had the highest mCC (0.9286) among the existing tools for processing SE reads; TrimGalore, a wrapper for Cutadapt, had a performance that was equivalent to EA-Tools with default settings, but it was considerably slower than EA-Tools (28.2% to 31.6% of the speed); SeqPrep, a dedicated PE reads adapter trimmer and merger, had the highest mCC (0.9975) among the existing tools for processing PE reads, but it was slow (0.64 Mbp/s); Btrim had the highest speed (23.63 Mbp/s) for adapter trimming, but it had low sensitivity (53.44%); Scythe had an mCC similar to that of Cutadapt for SE reads adapter trimming, but was more conservative; Flexbar had slightly lower metrics and about 20% lower processing speed than TrimGalore; Trimmomatic was among the most conservative ones, but it had an acceptable sensitivity (72.31%) and a relatively high speed (16.73 Mbp/s); AlienTrimmer had similar metrics to Btrim, but was much slower (1.64 Mbp/s); and AdapterRemoval had a similar overall performance to SeqPrep for PE reads processing, but unlike SeqPrep it can also handle SE reads.

Table 2
Performance of adapter trimmers on 2Gbp simulated data

The results of the runs listed in Table 2 show that Skewer outperformed all the mainstream tools in terms of mCC for both SE and PE trimming (0.9291 and 0.9989, respectively), although Skewer was only marginally better than Cutadapt and Scythe in SE trimming. Furthermore, Skewer was substantially faster (one times faster for SE and more than 12 times faster for PE trimming) than the tools that had comparable performances. Trimmomatic and AlienTrimmer both used more than 2 GB of peak memory. Most of the other trimmers used less than 50 MB of memory, except SeqTrim, which used 115.7 MB. Although Skewer did not have the lowest memory usage, its memory consumption (less than 35 MB, see Table S2 of Additional file 1 for details) was far from a bottleneck on a 64-bit computer. In fact, Skewer uses extra memory to facilitate the processing of IUPAC (International Union of Pure and Applied Chemistry) characters and for parallel computing.

Scalability for parallel computing

We ran several adapter trimmers that support multi-threading to compare their scalability in a parallel computing environment. SeqTrim was excluded because it is too slow to reach a comparable speed even with tens of threads. In addition, Cutadapt, AdapterRemoval, and other tools were not included because they currently lack a multi-threading function. In the eight-thread case, for both the decompressed and compressed inputs, Skewer achieved the highest speedup among the adapter trimmers tested (7.87 for decompressed input and 4.54 for compressed input) (see Table S2 of Additional file 1 for details).
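The speedup figures can be read more easily as parallel efficiency, that is, speedup divided by the number of threads. The short sketch below assumes the usual definition of speedup as single-thread time over multi-thread time; the helper names are hypothetical, and only the two speedup values and the thread count come from the text.

```python
def speedup(t_single, t_parallel):
    """Speedup of a parallel run relative to the single-thread run."""
    return t_single / t_parallel


def parallel_efficiency(speedup_value, n_threads):
    """Fraction of ideal linear scaling that was achieved."""
    return speedup_value / n_threads


# Speedups reported for Skewer with eight threads.
for label, s in (("decompressed input", 7.87), ("compressed input", 4.54)):
    print(f"{label}: speedup {s}, efficiency {parallel_efficiency(s, 8):.3f}")
```

An efficiency close to 1 indicates near-linear scaling. The lower value for compressed input suggests that some part of that workflow does not scale with the thread count, although the text does not say which part.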

Receiver operating characteristic (ROC) curves

ROC curves for various adapter trimmers under different stringencies were plotted and are shown in Figure 1 and Figure 2 (see Table S3 of Additional file 1 for details).

Figure 1
ROC curves of various adapter trimmers for processing single-end reads of simulated data. ROC: receiver operating characteristic.

Figure 2
ROC curves of various adapter trimmers for processing paired-end reads of simulated data. ROC: receiver operating characteristic.

In the high-stringency (left) regions, both Trimmomatic and Cutadapt performed well in that they had low FPRs (false positive rates) and high TPRs (true positive rates); however, as the stringency decreased, the performance of both degraded gradually (Figure 1). A similar trend was seen for TrimGalore, where the ROC curve shifted to the upper-right area. This implies that these trimmers greedily picked up the first candidate that met the stringency rather than selecting the optimal one. The ROC curve for AlienTrimmer was similar to the above ones, but with a worse performance. FastX may adopt some optimization technique; however, its performance was worse than those of Trimmomatic and Cutadapt over the whole stringency range. Other adapter trimmers showed advantages on a specific metric; e.g., AdapterRemoval, Flexbar, and EA-Tools were the most sensitive, while TagCleaner, Btrim, and Scythe were the most conservative. SeqTrim appeared only as a dot in the ROC plot because it does not provide a stringency threshold. Skewer outperformed the other adapter trimmers in that it had the lowest FPR for a given TPR when TPR > 95%.

ROC curves for the adapter trimmers that are aware of PE information were plotted and are shown in Figure 2. The other trimmers that can process PE reads have worse ROC curves than their corresponding ROC curves for processing SE reads, since the second reads normally have lower sequencing quality. From Figure 2, we can see that Skewer had a nearly perfect ROC curve, close to the upper-left corner. For example, it achieved a TPR of 99.951% with an FPR of 0.001%.
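The comparison criterion used above (the lowest FPR among operating points whose TPR exceeds 95%) can be sketched as follows. The sketch assumes the standard relations TPR = Sen and FPR = 1 − Spec, which the text does not state explicitly; the function names and the example operating points are hypothetical.

```python
def roc_point(sen, spec):
    """Map sensitivity and specificity to an (FPR, TPR) point (standard relation, assumed)."""
    return 1.0 - spec, sen


def lowest_fpr_at_tpr(points, min_tpr=0.95):
    """Pick the operating point with the lowest FPR among those with TPR > min_tpr.

    points: iterable of (stringency, tpr, fpr) tuples, one per trimmer run.
    Returns None when no run reaches the requested TPR.
    """
    eligible = [p for p in points if p[1] > min_tpr]
    if not eligible:
        return None
    return min(eligible, key=lambda p: p[2])


# Made-up operating points for one trimmer at three stringency settings.
runs = [(0.1, 0.90, 0.001), (0.2, 0.96, 0.004), (0.3, 0.99, 0.020)]
print(lowest_fpr_at_tpr(runs))
```

Applying the same selection to each trimmer's runs gives one FPR per tool at a fixed TPR floor, which is the comparison the paragraph describes.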

Experiments on real data

sRNA sequencing data for Caenorhabditis elegans

A recently published real sRNA data set (Short Read Archive [SRA: SRR014966]) [16], which includes 14,251,981 reads of small non-coding RNA (ncRNA) from C. elegans, was used to evaluate the adapter trimmers. Because it is hard to recover all the underlying sRNA fragments for sequencing, we aligned the trimmed reads to the reference genome and used the delta of the number of uniquely aligned reads relative to the number of uniquely aligned raw reads, noted as TT (true trimming), as a surrogate for true positive. We also used the delta of the number of non-uniquely aligned reads relative to the number of non-uniquely aligned raw reads, noted as FT (false trimming), as a surrogate for false positive. The rationale was that correct trimming tends to change unaligned fragments to uniquely aligned fragments (true positive), while over-trimming tends to change uniquely aligned fragments to non-uniquely aligned fragments (false positive). Note that these metrics tolerate tiny mistakes that can be rescued by the alignment software and are useful for practical evaluation.

To evaluate the performances under various trimming stringencies, all the tools were used to trim adapter sequences from the C. elegans data set using various trimming stringencies. Next, the processed reads were aligned to the C. elegans genome [17] (version 10) using Bowtie2 [18] (version 2.1.0). We then used the above metrics for the final plot, with higher FT representing lower stringency. The results are presented in Table S4 and Table S5 of Additional file 2, and illustrated in Figure 3. AdapterRemoval and Flexbar exhibited similar performance curves, while AdapterRemoval was slightly better than Flexbar over the whole tested stringency range; and TrimGalore and Cutadapt had similar curves, while TrimGalore was slightly better than Cutadapt at all the stringencies. Under high stringency, EA-Tools, Skewer, and TrimGalore shared the first tier in terms of low FT and high TT; Trimmomatic, AdapterRemoval, and Scythe were ranked second under middle stringency, middle-low stringency, and low stringency, respectively; and Skewer ranked first at all the stringencies.

Figure 3
Performance of various adapter trimmers on real small RNA data [SRA:SRR014966].
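The two surrogate metrics are simple differences of alignment counts before and after trimming. A minimal sketch, with a hypothetical function name and made-up counts that are not taken from the supplementary tables:

```python
def surrogate_metrics(raw_unique, raw_multi, trimmed_unique, trimmed_multi):
    """Return (TT, FT) as defined in the text.

    TT: gain in uniquely aligned reads after trimming (surrogate for true positive).
    FT: gain in non-uniquely aligned reads after trimming (surrogate for false positive).
    """
    tt = trimmed_unique - raw_unique
    ft = trimmed_multi - raw_multi
    return tt, ft


# Made-up alignment counts for the raw reads and for one trimmed output.
tt, ft = surrogate_metrics(raw_unique=1000, raw_multi=200,
                           trimmed_unique=5000, trimmed_multi=260)
print(f"TT = {tt}, FT = {ft}")
```

Repeating this for each stringency setting of a tool yields the (FT, TT) points plotted in Figure 3.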

Paired-end RNA sequencing data for Drosophila simulans

A real RNA-Seq data set with 27,005,344 pairs of 101 bp reads (Short Read Archive [SRA: SRR330569]) from the gonads and carcasses of D. simulans was used to compare the performances of the adapter trimmers in trimming artificial contaminants from PE reads. For the evaluation, we first used each of the tools that can deal with PE reads, with default settings, to trim adapters from the reads, with the exception that the minimum output fragment length was set to 20 and quality trimming was disabled. We then used TopHat [19, 20] (version 2.0.10) to align the processed reads to the reference genome of D. simulans [21] (dsim revision 1.4). Finally, the number of uniquely and concordantly aligned pairs was used as the performance metric. The results are presented in Table S6 of Additional file 3 and illustrated in Figure 4. Skewer outperformed the other adapter trimmers in terms of the number of uniquely and concordantly aligned pairs of the trimmed PE reads. Trimmomatic and AdapterRemoval, both of which performed well in processing the sRNA data, performed poorly in processing the long PE data. This finding implies that these tools may be tuned specifically for trimming adapters from sRNA data. Similarly, Btrim also performed less well with the PE data in this experiment. After investigating the processed data, we found that Btrim could recognize only the occurrence of the whole adapter sequence, with a limited tolerance for insertions and deletions. It should be noted that all quality trimming was disabled in these experiments in order to compare the adapter trimming performance alone. However, in real applications, quality trimming, which is outside the scope of this paper, has been reported to improve the mapping rate and help downstream data analysis [22].

Figure 4
Performance of various adapter trimmers on real paired-end data [SRA:SRR330569].
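One way to count uniquely and concordantly aligned pairs from aligner output is sketched below. It is not the authors' procedure. It assumes the alignments are available as SAM text, that concordance is indicated by the proper-pair FLAG bit (0x2), and that uniqueness is indicated by the optional tag `NH:i:1`; the function name is hypothetical. Only the first mate of each pair is inspected so that each pair is counted once.

```python
def count_unique_concordant_pairs(sam_lines):
    """Count pairs whose first mate is properly paired and reported once (NH:i:1)."""
    pairs = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        paired = flag & 0x1
        proper = flag & 0x2
        unmapped = flag & 0x4
        first_mate = flag & 0x40
        secondary = flag & 0x100
        if not (paired and proper and first_mate) or unmapped or secondary:
            continue
        # Optional tags start at column 12; NH is the number of reported hits.
        if "NH:i:1" in fields[11:]:
            pairs += 1
    return pairs


def _record(name, flag, nh):
    # Build a minimal SAM record for the demonstration below.
    return "\t".join([name, str(flag), "chr", "100", "50", "20M", "=",
                      "300", "220", "A" * 20, "I" * 20, f"NH:i:{nh}"])


demo = ["@HD\tVN:1.0", _record("a", 99, 1), _record("a", 147, 1),
        _record("b", 99, 2), _record("c", 65, 1)]
print(count_unique_concordant_pairs(demo))
```

A stricter variant would also require the second mate to carry `NH:i:1`; whether that matters depends on how the aligner assigns the tag to mates.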

Nextera long mate pair (LMP) data for Arabidopsis thaliana

A 5-kb insert size Nextera LMP library of A. thaliana Col-0 with 6,602,426 pairs of 251 bp reads (European Nucleotide Archive [ENA: ERA264981]) and a 400-bp insert size Illumina HiSeq PE library of the same species with 17,341,797 pairs of 100 bp reads [ENA: SRR519624] were sequenced previously and used to demonstrate the utility of NextClip [14], a dedicated tool for trimming adapters from Nextera LMP libraries.

To compare Skewer with NextClip, we followed a validation procedure similar to the one described for NextClip [14]. Briefly, the LMP library was first trimmed using the adapter trimmer. Then the trimmed LMP reads and the PE reads were de novo assembled using ABySS [23]. The time needed for the adapter trimming and the N50 lengths of the scaffolds were used as the metrics for the evaluation. The results are listed in Table 3 (see Additional file 4 for the relevant commands in detail), from which we can see that Skewer marginally outperforms NextClip in terms of assembly statistics (N50 length etc.) of the trimmed reads. In addition, Skewer is about 49% faster than NextClip in single-thread mode.

Table 3
Comparison of NextClip and Skewer in processing Nextera long mate-pair (LMP) reads (ERA264981)
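For readers unfamiliar with the N50 statistic used in this comparison, the sketch below computes it under the common definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. The function name and the example lengths are hypothetical, and assemblers may apply additional filters (for example, a minimum scaffold length) that are not modeled here.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (0 for an empty input)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


# Made-up scaffold lengths: total 290, so half is 145; 80 + 70 = 150 reaches it.
print(n50([80, 70, 50, 40, 30, 20]))
```

A higher N50 indicates that more of the assembly is contained in long scaffolds, which is why it serves as a proxy for how well the trimmed mate pairs support scaffolding.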
