Examples

This page provides examples for the three main ways to execute quicksand: a regular run, a run with fixed references, and a rerun with fixed references within an existing run folder.

Please see the Quickstart section to download a test dataset (split) and the required data structure (refseq).

Regular run

The regular run gives an initial overview of the taxonomic composition of the samples. quicksand reports the detected families and the number of ancient sequences found.

Execute quicksand like this:

nextflow run mpieva/quicksand -r v2.1 \
    -profile   singularity \
    --split    split/ \
    --db       refseq/kraken/Mito_db_kmer22/ \
    --genomes  refseq/genomes/ \
    --bedfiles refseq/masked/

The output files are grouped by family in the out/ directory. Family sequences extracted after the KrakenUniq run are stored in out/{family}/1-extracted/, while mapped, deduped and filtered sequences are saved to the out/{family}/best/{step}/ directory after the respective processing step:

quicksand_v2.1
├── out
│    └── {family}
│         ├── 1-extracted
│         │    └── {RG}_extractedReads-{family}.bam
│         └── best
│              ├── 2-aligned
│              │     └── {RG}.{family}.{species}.bam
│              ├── 3-deduped
│              │     └── {RG}.{family}.{species}_deduped.bam
│              └── 4-bedfiltered
│                    └── {RG}.{family}.{species}_deduped_bedfiltered.bam
...
└── final_report.tsv

See the final_report.tsv for a summary of the quicksand run.

Fixed references

quicksand is designed to work with target-enriched DNA sequences and to account for expected families in the data. For families of interest, provide an input file with the --fixed flag. This file specifies the reference genomes to use for the sequences that KrakenUniq assigns to the given family. Tags are used in the file names and should be unique:

file: fixed-references.tsv

Taxon       Tag             Genome
Hominidae   Homo_sapiens    /path/to/reference.fasta
Hominidae   Another_human   /path/to/reference.fasta

and start the execution with:

nextflow run mpieva/quicksand -r v2.1 \
    -profile   singularity \
    --split    split/ \
    --db       refseq/kraken/Mito_db_kmer22/ \
    --genomes  refseq/genomes/ \
    --bedfiles refseq/masked/ \
    --fixed    fixed-references.tsv
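
Because Tags end up in the output file names, it can be worth checking the fixed-references file for duplicates before launching the run. The following is a minimal sketch, not part of quicksand: the function name check_unique_tags is hypothetical, and it assumes a header line followed by whitespace-separated columns in the order Taxon, Tag, Genome.

```shell
# Hypothetical helper (not part of quicksand): report duplicated Tag values
# in a fixed-references file.
# Assumption: one header line, then columns Taxon, Tag, Genome separated
# by tabs or spaces.
check_unique_tags() {
    awk 'NR > 1 && NF >= 2 {
             # Count each Tag; any repeat is reported once per occurrence
             if (seen[$2]++) { print "duplicate tag: " $2; dup = 1 }
         }
         END { exit (dup ? 1 : 0) }' "$1"
}

# Usage:
#   check_unique_tags fixed-references.tsv && echo "all tags are unique"
```

The function exits with a non-zero status when a Tag occurs more than once, so it can be chained in front of the nextflow command with &&.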

The output file structure remains the same as before. For families specified in the fixed-references.tsv file, output files appear in the out/{family}/fixed/{step}/ directory, together with additional files that are useful for downstream analyses, such as the extracted deaminated reads:

quicksand_v2.1
├── out
│    └── {family}
│         ├── 1-extracted
│         │    └── {RG}_extractedReads-{family}.bam
│         ├── best // (family not in fixed)
│         │
│         └── fixed // (family in fixed)
│              ├── 2-aligned
│              │     └── {RG}.{family}.{Tag}.bam
│              ├── 3-deduped
│              │     └── {RG}.{family}.{Tag}_deduped.bam
│              ├── 5-deaminated
│              │     ├── {RG}.{family}.{Tag}_deduped_deaminated_1term.bam
│              │     └── {RG}.{family}.{Tag}_deduped_deaminated_3term.bam
│              └── 6-mpileups
│                    ├── {RG}.{family}.{Tag}_term1_mpiled.tsv
│                    ├── {RG}.{family}.{Tag}_term3_mpiled.tsv
│                    └── {RG}.{family}.{Tag}_all_mpiled.tsv
...
└── final_report.tsv
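
For downstream analyses, the deaminated BAM files of one family can be collected from the directory layout shown above. This is a minimal sketch under that layout; list_deaminated is a hypothetical helper name, not a quicksand command.

```shell
# Hypothetical helper (not part of quicksand): list the deaminated BAM files
# of one family, following the out/{family}/fixed/5-deaminated/ layout.
#   $1 = path to the out/ directory
#   $2 = family name, e.g. Hominidae
list_deaminated() {
    find "$1/$2/fixed/5-deaminated" -type f -name '*_deaminated_*term.bam' | sort
}

# Usage:
#   list_deaminated out Hominidae
```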

Rerun

This mode is used to repeat a run with a different set of fixed references. Imagine being interested in the evolution of the Suidae family after having already analyzed all samples with quicksand.

Some lines in the final report of that analysis look like this:

Family    Species                   Reference     ReadsMapped    ProportionMapped    ReadsDeduped
Suidae    Sus_scrofa_taivanus       best          1208           0.9028              1000

The assigned species is based on the KrakenUniq results and probably doesn't correspond to the "real" species, as RefSeq contains only a limited number of reference genomes. For any analyses that go beyond the family level, a reanalysis with a suitable reference genome is required.

After collecting the reference genome(s) for the Suidae family, prepare a fresh fixed-references file:

Taxon       Tag                 Genome
Suidae      super_cool_pig      /path/to/reference.fasta

and rerun the pipeline with:

nextflow run mpieva/quicksand -r v2.1 \
    -profile   singularity \
    --rerun    \
    --fixed    fixed-references.tsv
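
Before starting a rerun, it may help to confirm that every genome path listed in the fixed-references file actually exists. The sketch below is an illustration only: missing_genomes is a hypothetical helper, and it assumes a header line, whitespace-separated columns, and genome paths without spaces.

```shell
# Hypothetical helper (not part of quicksand): print every Genome path from a
# fixed-references file that does not point to an existing file.
# Assumption: header line first, Genome in the third column, no spaces in paths.
missing_genomes() {
    awk 'NR > 1 && NF >= 3 { print $3 }' "$1" | while IFS= read -r genome; do
        # Print only the paths that cannot be found on disk
        [ -f "$genome" ] || echo "missing genome: $genome"
    done
}

# Usage:
#   missing_genomes fixed-references.tsv
```

An empty output means that all listed reference files were found.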

The (additional) output files are the ones created by the --fixed flag:

quicksand_v2.1
├── out
│    └── {family}
│         ├── 1-extracted
│         │    └── {RG}_extractedReads-{family}.bam
│         └── fixed // (family in fixed)
│              ├── 2-aligned
│              │     └── {RG}.{family}.{Tag}.bam
│              ├── 3-deduped
│              │     └── {RG}.{family}.{Tag}_deduped.bam
│              ├── 5-deaminated
│              │     ├── {RG}.{family}.{Tag}_deduped_deaminated_1term.bam
│              │     └── {RG}.{family}.{Tag}_deduped_deaminated_3term.bam
│              └── 6-mpileups
│                    ├── {RG}.{family}.{Tag}_term1_mpiled.tsv
│                    ├── {RG}.{family}.{Tag}_term3_mpiled.tsv
│                    └── {RG}.{family}.{Tag}_all_mpiled.tsv
...
└── final_report.tsv

The report now contains additional lines for the Suidae family, marked as 'fixed' in the Reference column:

Family    Species                   Reference     ReadsMapped    ProportionMapped    ReadsDeduped
Suidae    Sus_scrofa_taivanus       best          1208           0.9028              1000
Suidae    super_cool_pig            fixed         1052           0.8024              976

The final report contains a mix of best (old run) and fixed (rerun) reference entries.
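
To separate the two kinds of entries, the report can be filtered by family and reference type. The following is a minimal sketch, assuming that final_report.tsv is tab-separated and that its header contains the Family and Reference columns shown above; report_rows is a hypothetical helper name.

```shell
# Hypothetical helper (not part of quicksand): print the header of a report
# plus all rows matching one family and one reference type.
#   $1 = path to final_report.tsv (assumed to be tab-separated)
#   $2 = family name, e.g. Suidae
#   $3 = reference type, "best" or "fixed"
report_rows() {
    awk -F '\t' -v fam="$2" -v ref="$3" '
        NR == 1 {
            # Look up column positions by header name
            for (i = 1; i <= NF; i++) col[$i] = i
            print
            next
        }
        $(col["Family"]) == fam && $(col["Reference"]) == ref
    ' "$1"
}

# Usage:
#   report_rows final_report.tsv Suidae fixed
```

Looking up the columns by name rather than by position keeps the sketch usable if the report contains more columns than the excerpt shown here.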