Examples
This page shows examples of the three main ways to execute quicksand: a regular run, a run with fixed references, and a rerun with fixed references within an existing run folder.
Please see the Quickstart section to download a test dataset (split)
and the required data structure (refseq).
Regular run
The regular run is used to get an initial overview of the taxonomic composition of the samples. quicksand reports the detected families and the number of ancient sequences found.
Execute quicksand like this:
nextflow run mpieva/quicksand -r v2.5 \
-profile singularity \
--split split/ \
--db refseq/kraken/Mito_db_kmer22/ \
--genomes refseq/genomes/ \
--bedfiles refseq/masked/
The output files are grouped by family in the out/ directory. Sequences assigned to a family
by KrakenUniq are stored in out/{family}/1-extracted/, while the mapped, deduplicated and bedfiltered sequences are saved in the
out/{family}/best/{step}/ directories after the respective processing step:
quicksand_v2.5
├── out
│ └── {family}
│ ├── 1-extracted
│ │ └── {RG}_extractedReads-{family}.bam
│ └── best
│ ├── 2-aligned
│ │ └── {RG}.{family}.{species}.bam
│ ├── 3-deduped
│ │ └── {RG}.{family}.{species}_deduped.bam
│ └── 4-bedfiltered
│ └── {RG}.{family}.{species}_deduped_bedfiltered.bam
...
└── final_report.tsv
See the final_report.tsv for a summary of the quicksand run.
Filter the final_report
The default quicksand output (final_report.tsv) is unfiltered, because the best
filtering thresholds might differ between sites (and projects). However, we also provide a filtered version of the report, filtered_report_05p_05b.tsv,
with the default filter thresholds applied: the FamPercentage column (>=0.5%)
and the ProportionExpectedBreadth column (>=0.5).
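If other thresholds suit a site better, the unfiltered report can be filtered by hand. The following is a minimal sketch, not part of quicksand itself: it assumes that final_report.tsv is tab-separated with a header line and that the two columns named above hold plain numbers. The function name `filter_report` and the example rows are hypothetical.

```python
import csv
import io


def filter_report(lines, min_percentage=0.5, min_breadth=0.5):
    """Return (header, rows), keeping only rows that pass both thresholds."""
    reader = csv.DictReader(lines, delimiter="\t")
    kept = []
    for row in reader:
        try:
            percentage = float(row["FamPercentage"])
            breadth = float(row["ProportionExpectedBreadth"])
        except (TypeError, ValueError):
            # Assumption: rows with missing or non-numeric values are dropped.
            continue
        if percentage >= min_percentage and breadth >= min_breadth:
            kept.append(row)
    return reader.fieldnames, kept


# Made-up rows for illustration only; a real report has more columns.
example = (
    "Family\tFamPercentage\tProportionExpectedBreadth\n"
    "Hominidae\t12.5\t0.81\n"
    "Suidae\t0.2\t0.90\n"
    "Bovidae\t3.0\t-\n"
)
header, rows = filter_report(io.StringIO(example), min_percentage=1.0, min_breadth=0.6)
print([row["Family"] for row in rows])  # prints ['Hominidae']
```

For a real run, pass an open file handle instead of the in-memory example, e.g. `open("final_report.tsv", newline="")`.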
Fixed references
quicksand is designed to work with target-enriched data. To account for
expected taxa in the sequences, users can provide a TSV-file with the --fixed flag. This file specifies for each family the reference-genome(s)
that quicksand uses for mapping sequences assigned by KrakenUniq to the given family.
The 'Tags' are used in the same way as the 'Species' (e.g. in the file names) and should be unique:
file: fixed-references.tsv
Taxon Tag Genome
Hominidae Homo_sapiens /path/to/reference_1.fasta
Hominidae Another_human /path/to/reference_2.fasta
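Because the Tags end up in file names, a quick check of the file before starting a run can save time. The sketch below is a hypothetical helper, not a quicksand feature. It assumes the file is tab-separated and starts with exactly the header line shown above; it only checks for empty fields and duplicate Tags and does not test whether the genome paths exist.

```python
import csv
import io
from collections import Counter


def check_fixed_references(lines):
    """Return a list of problems found in a fixed-references table."""
    expected = ["Taxon", "Tag", "Genome"]
    reader = csv.DictReader(lines, delimiter="\t")
    if reader.fieldnames != expected:
        return [f"header should be {expected}, found {reader.fieldnames}"]
    rows = list(reader)
    problems = []
    # Data rows start on line 2, after the header.
    for number, row in enumerate(rows, start=2):
        if not all(row.get(column) for column in expected):
            problems.append(f"line {number}: empty or missing field")
    # Every Tag should appear only once.
    counts = Counter(row["Tag"] for row in rows)
    for tag, count in counts.items():
        if tag and count > 1:
            problems.append(f"Tag {tag!r} is used {count} times")
    return problems


# Illustrative table with a deliberately duplicated Tag.
table = (
    "Taxon\tTag\tGenome\n"
    "Hominidae\tHomo_sapiens\t/path/to/reference_1.fasta\n"
    "Hominidae\tHomo_sapiens\t/path/to/reference_2.fasta\n"
)
print(check_fixed_references(io.StringIO(table)))
```

An empty list means that no problems were found.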
Run quicksand with:
nextflow run mpieva/quicksand -r v2.5 \
-profile singularity \
--split split/ \
--db refseq/kraken/Mito_db_kmer22/ \
--genomes refseq/genomes/ \
--bedfiles refseq/masked/ \
--fixed fixed-references.tsv
The output file structure remains mostly the same. For families listed in the fixed-references.tsv file, the output files
appear in the out/{family}/fixed/{step}/ directory, together with additional files
that might be useful for downstream analyses, such as the extracted deaminated reads:
quicksand_v2.5
├── out
│ └── {family}
│ ├── 1-extracted
│ │ └── {RG}_extractedReads-{family}.bam
│ ├── best # (family not in fixed)
│ │
│ └── fixed # (family in fixed)
│ ├── 2-aligned
│ │ └── {RG}.{family}.{Tag}.bam
│ ├── 3-deduped
│ │ └── {RG}.{family}.{Tag}_deduped.bam
│ ├── 4-bedfiltered #(only if --fixed_bedfiltering)
│ │ └── {RG}.{family}.{Tag}_deduped_bedfiltered.bam
│ ├── 5-deaminated
│ │ ├── {RG}.{family}.{Tag}_deduped_deaminated_1term.bam
│ │ └── {RG}.{family}.{Tag}_deduped_deaminated_3term.bam
│ └── 6-mpileups
│ ├── {RG}.{family}.{Tag}_term1_mpiled.tsv
│ ├── {RG}.{family}.{Tag}_term3_mpiled.tsv
│ └── {RG}.{family}.{Tag}_all_mpiled.tsv
...
└── final_report.tsv
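For downstream analyses it can be handy to collect the deaminated BAM files across all families. The following sketch relies only on the directory layout shown above. The function name `find_deaminated_bams` is hypothetical, and the run directory name is taken from the tree above and may differ in your setup.

```python
from pathlib import Path


def find_deaminated_bams(run_dir, family="*"):
    """List the 3term deaminated BAM files below out/{family}/fixed/."""
    pattern = f"out/{family}/fixed/5-deaminated/*_deduped_deaminated_3term.bam"
    return sorted(Path(run_dir).glob(pattern))


# Assumption: the run folder is called quicksand_v2.5, as in the tree above.
for bam in find_deaminated_bams("quicksand_v2.5"):
    print(bam)
```

Passing a family name, e.g. `find_deaminated_bams("quicksand_v2.5", family="Hominidae")`, restricts the search to that family.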
Rerun
This mode is used to repeat a run with a different set of fixed references. For example, suppose the final report of the analysis looks like this:
Family Species Reference ReadsMapped ProportionMapped ReadsDeduped
Suidae Sus_scrofa_taivanus best 1208 0.9028 1000
The assigned ('best') species is based on the KrakenUniq results and might not reflect the "real" species, as RefSeq contains only a limited number of reference genomes. For any analyses that go beyond the family level, a reanalysis with a suitable reference genome might be required.
So after collecting additional reference genomes for the Suidae family, prepare a new fixed-references file:
Taxon Tag Genome
Suidae super_cool_pig /path/to/reference.fasta
Suidae super_cool_pig2 /path/to/reference2.fasta
Suidae super_cool_pig3 /path/to/reference3.fasta
and rerun the pipeline from within the existing run folder with:
nextflow run mpieva/quicksand -r v2.5 \
-profile singularity \
--rerun \
--fixed fixed-references.tsv
The (additional) output files are then the ones created by the --fixed flag:
quicksand_v2.5
├── out
│ └── Suidae
│ ├── 1-extracted
│ │ └── {RG}_extractedReads-Suidae.bam
│ └── fixed
│ ├── 2-aligned
│ │ └── {RG}.Suidae.{Tag}.bam
│ ├── 3-deduped
│ │ └── {RG}.Suidae.{Tag}_deduped.bam
│ ├── 5-deaminated
│ │ ├── {RG}.Suidae.{Tag}_deduped_deaminated_1term.bam
│ │ └── {RG}.Suidae.{Tag}_deduped_deaminated_3term.bam
│ └── 6-mpileups
│ ├── {RG}.Suidae.{Tag}_term1_mpiled.tsv
│ ├── {RG}.Suidae.{Tag}_term3_mpiled.tsv
│ └── {RG}.Suidae.{Tag}_all_mpiled.tsv
...
└── final_report.tsv
The report now contains additional lines for the Suidae family, marked as 'fixed' in the Reference column:
Family Species Reference ReadsMapped ProportionMapped ReadsDeduped
Suidae Sus_scrofa_taivanus best 1208 0.9028 1000
Suidae super_cool_pig fixed 1052 0.8024 976
Suidae super_cool_pig2 fixed 1000 0.9001 800
Suidae super_cool_pig3 fixed 860 0.7551 550
The final report contains a mix of best (old run) and fixed (rerun) reference entries.
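To view the two kinds of entries side by side, the report rows can be grouped by the Reference column. This is a minimal sketch rather than a quicksand feature: it assumes a tab-separated report with the column names shown in the excerpt above, and the function name `entries_by_family` is hypothetical. The example rows reuse values from the table above.

```python
import csv
import io
from collections import defaultdict


def entries_by_family(lines):
    """Group report rows as {family: {reference_type: [species, ...]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(lines, delimiter="\t"):
        grouped[row["Family"]][row["Reference"]].append(row["Species"])
    # Convert the nested defaultdicts to plain dicts for easier printing.
    return {family: dict(refs) for family, refs in grouped.items()}


# Reduced excerpt of the report shown above (only four columns).
report = (
    "Family\tSpecies\tReference\tReadsMapped\n"
    "Suidae\tSus_scrofa_taivanus\tbest\t1208\n"
    "Suidae\tsuper_cool_pig\tfixed\t1052\n"
    "Suidae\tsuper_cool_pig2\tfixed\t1000\n"
)
print(entries_by_family(io.StringIO(report)))
```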