Pipeline

Graphical overview of the quicksand and quicksand-build pipeline

This page provides an overview of the workflows implemented in the quicksand pipeline, as illustrated in the figure above.

While the quicksand pipeline is dedicated to processing DNA sequencing data, the quicksand-build pipeline is used for customizing the underlying databases. Outputs from the quicksand-build pipeline, like the kraken-database, the reference genomes and the bed files for masking non-informative positions are prerequisites for quicksand.

For a detailed overview of the quicksand-build pipeline, see the quicksand_build-page page.

Workflow

Quicksand is a nextflow [1] pipeline, developed for the analysis of target-enriched sedimentary ancient DNA.

The basic workflow of the quicksand pipline comprises multiple subworkflows: filtering, metagenomic classification, mapping/deduplication, and basic DNA damage analysis of the alignments. quicksand generates taxonomic profiles for each of the analyzed input files, along with the binned bam files for each step.

The pipeline requires as input demultiplexed, adapter-trimmed and overlap-merged DNA sequences.

Filtering

In the filtering stage, quicksand takes measures to ensure a smooth execution of the pipeline. It also removes sequences from the files that are non-mappable because of their sequence length, as specified by the --bamfilter_length_cutoff threshold (default: 35bp).

If the input comprises fastq files, they undergo conversion to unpaired and unmapped bam files, a necessary step for subsequent processes. Input bam files are filtered by default to eliminate paired-end reads.

Metagenomics classification

Sequences are classified by the metagenomic classifier KrakenUniq [2]. which operates as a kmer-based classifier. In simple terms, each sequence is broken into smaller chunks of length k (kmers). These informative kmers are then matched against a database to identify the taxa associated with each kmer.

Example for how the KrakenUniq classifier works.

The example above illustrates how kmer-based classification works. Imagine a sentence on a page and the task to find the book it's from. Each word corresponds to a kmer, so all words are checked in a database and linked to books where the word appears in. Common words that appear in all books are removed, leaving only informative words for the classification. Relating the found references in a tree, allows for the assignment of sentences to a node in the given tree.

For the analysis of mitochondrial DNA sequences quicksand uses KrakenUniq with a database created from the non-redundant mtDNA NCBI RefSeq release with a kmer size of 22.

KrakenUniq is fast and allows for a sorting of all input sequences into taxonomic families. To avoid false-positive assignments, families are removed from the results if they fall below the minimum number of reads per family (--krakenuniq_min_reads) and the minimum number of family-kmers (--krakenuniq_min_kmers). For each family reported by KrakenUniq, the node with the highest number of assigned unique kmers is picked as the reference-node for that family.

Example for picking a reference node from the kraken-report.

the example shows how the report is parsed. For the Hyaenidae family, the Hyaena node is picked as reference even though a species level assignment is present for another genus.

For the mapping step, all sequences assigned to a particular family (or order) are gathered into a new bam file. One can choose whether this gathering happens on the family or order level by using the --taxlvl flag.

Mapping and Deduplication

The sequences extracted for each family or order are mapped against all the reference genomes of species belonging to the reference-node found with KrakenUniq for the given family or order. For instance, in the example above, they are mapped against all Hyaena species in RefSeq.

For mapping quicksand utilizes the network-aware fork of BWA [3] with ancient parameters (n 0.01 -o 2 -l 16500).

For the downstream processes, unmapped sequences and sequences with a mean mapping quality of less than the specified --bamfilter_quality_cutoff (default: 25) are removed from the alignment.

In each alignment, exact sequence duplicates are collapsed into unique sequences using bam-rmdup. This process relies on shared start and end coordinates in the alignment. For cases where the reference-node is rank genus or higher, the most representative species is chosen by comparing all the alignments. The species with the highest numbers of basepairs covered in the alignment is picked as best species for the next steps.

The deduped alignments are then depleted of reads that overlap sites marked as non-informative by dustmasker [4]. That step is skipped for families with a fixed reference genome (see --fixed flag)

DNA Deamination Stats

In the last step of the pipeline, the deduplicated sequences are checked for DNA damage patters, by counting base substitutions in the analyzed sequences compared to the reference genomes the sequences are mapped against.

Ancient DNA exhibits C to T substitutions at both the 3’ and 5’ ends of the molecule, a degradation pattern used to identify ancient DNA. By default quicksand counts C to T substitutions on both ends of the DNA sequences, as obtained by data that went through a single stranded library preparation. Use the --doublestranded flag to count the G to A substitutions at the 5' end of the DNA sequences instead, as observed in libraries prepared with a double-stranded protocol.

Taxonomic families whose sequences show more than 10% of terminal base substitutions are flagged as ancient (++) in the final report.

Fixed References

quicksand operates on the family-level and reduces the reads assigned by KrakenUniq to one family to the sequences mapped to a single best reference. The fixed references are used to specify one or multiple reference genomes instead.

Throughout the run quicksand distinguishes between families with 'best' and 'fixed' references. fixed references are genomes specified with the --fixed flag, which assigns a specific reference genome for certain families. This overwrites the KrakenUniq reference-node. Families with a fixed reference undergo a special treatment, skipping the bedtools intersect filter and going through additional analyses described below.

A comparison between the 'best' and the 'fixed' workflows.

The image above shows the differences in the workflows. For the Hominidae reads found by KrakenUniq a reference is either picked by the pipeline (best) based on the number of unique kmers as described above, or provided with a file and the --fixed flag. Fixed references are not reduced to a single species per family.

Extract Ancient Sequences

For families with a fixed reference genome deaminated sequences are extracted. Then the quality score of the first and last 3 base pairs is masked by setting the quality score to 0, and mpileup files are created.

Rerun

The --rerun flag provides an alternative entry-point into the the pipeline and starts the workflow after the KrakenUniq step. The --rerun flag works only together with the --fixed flag. In this mode, quicksand takes the extracted reads and reanalyzes them with the genomes specified.

A comparison between the 'best' and the 'fixed' workflows.

During a rerun, quicksand imports the final report and overwrites it at the end. Only families are reanalyzed for which an extracted-reads file exists.

References