Input and Output¶
Input¶
quicksand requires demultiplexed, merged, and adapter-trimmed sequencing libraries in .fastq or .bam format as input.
Use the --split flag to point to the directory containing these files. quicksand refers to the file names as readgroups:
splitdir/
readgroup1.fastq
readgroup2.fastq
readgroup3.bam
Note
quicksand processes all .fastq, .fq, .fastq.gz, .fq.gz, and .bam files within the --split directory and ignores all other files.
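As a quick check before a run, the following sketch lists the files in a split directory that match the extensions named in the note above. It is an illustration only and not part of quicksand: the helper name `list_readgroup_files` is hypothetical, and it assumes that the readgroup name is the file name with its extension removed.

```python
from pathlib import Path

# Extensions taken from the note above; every other file is skipped.
ACCEPTED_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz", ".bam")


def list_readgroup_files(splitdir):
    """Return {readgroup: path} for accepted files in splitdir (hypothetical helper)."""
    found = {}
    for path in sorted(Path(splitdir).iterdir()):
        if not path.is_file():
            continue
        for suffix in ACCEPTED_SUFFIXES:
            if path.name.endswith(suffix):
                # Assumption: readgroup = file name without the extension
                found[path.name[: -len(suffix)]] = path
                break
    return found
```

Calling `list_readgroup_files("splitdir")` on the example layout above would return entries for readgroup1, readgroup2, and readgroup3.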
Output¶
quicksand writes all output files to the quicksand_v2.1 directory. Within this directory, the files are laid out as follows:
quicksand_v2.1
├── out
│ └── {taxon}
│ ├── 1-extracted
│ │ └── {RG}_extractedReads-{taxon}.bam
│ ├── best // (for families not in --fixed)
│ │ ├── 2-aligned
│ │ │ └── {RG}.{family}.{species}.bam
│ │ ├── 3-deduped
│ │ │ └── {RG}.{family}.{species}_deduped.bam
│ │ └── 4-bedfiltered
│ │ └── {RG}.{family}.{species}_deduped_bedfiltered.bam
│ └── fixed // (for families in --fixed)
│ ├── 2-aligned
│ │ └── {RG}.{family}.{species}.bam
│ ├── 3-deduped
│ │ └── {RG}.{family}.{species}_deduped.bam
│ ├── 5-deaminated
│ │ ├── {RG}.{family}.{species}_deduped_deaminated_1term.bam
│ │ └── {RG}.{family}.{species}_deduped_deaminated_3term.bam
│ └── 6-mpileups
│ ├── {RG}.{family}.{species}_term1_mpiled.tsv
│ ├── {RG}.{family}.{species}_term3_mpiled.tsv
│ └── {RG}.{family}.{species}_all_mpiled.tsv
├── stats
│ ├── splitcounts.tsv
│ ├── {RG}.kraken.report
│ ├── {RG}.kraken.translate
│ ├── {RG}_00_extracted.tsv
│ ├── {RG}_01_mapped.tsv
│ ├── {RG}_02_deduped.tsv
│ ├── {RG}_03_bedfiltered.tsv
│ └── {RG}_04_deamination.tsv
├── nextflow
│ ├── {DATE}.commands
│ └── {DATE}.config
├── work
│ └── ...
├── cc_estimates.tsv
├── filtered_report_{N}p_{N}b.tsv
└── final_report.tsv
Files¶
Directory: out/TAXON¶
1-extracted/${RG}.extractedReads-${taxon}.bam
BAM FILE. Contains the DNA sequences of one readgroup assigned by KrakenUniq to one taxon [family or order].
2-aligned/${RG}.${family}.${species}.bam
BAM FILE. Contains the aligned sequences after mapping the extractedReads to the reference species.
3-deduped/${RG}.${family}.${species}_deduped.bam
BAM FILE. The same alignment, but depleted of PCR duplicates.
4-bedfiltered/${RG}.${family}.${species}_deduped_bedfiltered.bam
5-deaminated/${RG}.${family}.${species}_deduped_deaminated_1term.bam
5-deaminated/${RG}.${family}.${species}_deduped_deaminated_3term.bam
6-mpileups/${RG}.${family}.${species}_all_mpiled.tsv
6-mpileups/${RG}.${family}.${species}_1term_mpiled.tsv
6-mpileups/${RG}.${family}.${species}_3term_mpiled.tsv
Directory: stats¶
${RG}.report
The standard KrakenUniq report
${RG}.translate
The human-readable Kraken report in MPA format
stats/splitcounts.tsv
TSV FILE. Contains for each readgroup the number of reads before (raw) and after the initial filter step:
RG ReadsRaw ReadsFiltered ReadsLengthfiltered
test1 235 235 230
test2 235 235 230
test3 235 235 230
${RG}_00_extracted.tsv
TSV FILE. Contains the number of sequences assigned to a taxon based on the KrakenUniq classification:
Taxon ReadsExtracted
Hominidae 235
${RG}_01_mapped.tsv
TSV FILE. Contains for each readgroup and family the number of sequences mapped to the reference genome. The column 'Reference' shows whether the reference genome was fixed. ProportionMapped is the ratio of mapped to extracted reads:
Order Family Species Reference ReadsMapped ProportionMapped
Primates Hominidae Homo_sapiens fixed 235 0.913
${RG}_02_deduped.tsv
TSV FILE. Contains for each readgroup and family the number of unique reads mapped to the reference genome, the duplication rate, and information from the samtools coverage command:
Order: The taxonomic order
Family: The taxonomic family
Species: The taxonomic species used as reference for mapping
Reference: The reference type: either 'best' or 'fixed'
ReadsDeduped: The number of unique reads
DuplicationRate: The duplication rate of the unique reads
CoveredBP: 'covbases' of the samtools coverage command: The number of covered bases in the reference genome
Coverage: 'meandepth' of the samtools coverage command: The mean depth of coverage
Breadth: 'coverage' of the samtools coverage command (divided by 100): the proportion of covered bases in the reference genome
ExpectedBreadth: Expected breadth based on the inStrain formula: expected_breadth = 1-e^(-0.833*coverage). See https://instrain.readthedocs.io/en/latest/important_concepts.html
ProportionExpectedBreadth: The ratio of Breadth to ExpectedBreadth
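The two breadth columns can be reproduced by hand. The sketch below uses the formula exactly as quoted above, with Coverage as the mean depth and Breadth as a proportion between 0 and 1. The function names are hypothetical, and returning None for zero coverage is an assumption of this sketch, not documented quicksand behavior.

```python
import math


def expected_breadth(coverage):
    """Expected breadth for a given mean depth, using the formula quoted above."""
    return 1.0 - math.exp(-0.833 * coverage)


def proportion_expected_breadth(breadth, coverage):
    """Ratio of observed to expected breadth."""
    expected = expected_breadth(coverage)
    if expected == 0:
        return None  # assumption: undefined when nothing is covered
    return breadth / expected
```

For a mean depth of 1.0, `expected_breadth` gives roughly 0.565, so an observed breadth of 0.5 would correspond to a ProportionExpectedBreadth of about 0.88.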
${RG}_03_bedfiltered.tsv
TSV FILE. Contains for each readgroup and family the number of sequences remaining in the bam-file after bedfiltering and the number of covered basepairs in the reference genome after removal of low-complexity sequences:
Order Family Species Reference ReadsBedfiltered PostBedCoveredBP
Primates Hominidae Homo_sapiens fixed 97 4177
${RG}_04_deamination.tsv
TSV FILE. Contains for each readgroup the deamination stats for the BAM file after bedfiltering:
Ancientness: ++ = more than 9.5% of the reads that show a terminal C in both the 5' and 3' position in the reference genome carry a T
+ = more than 9.5% of the reads that show a terminal C in either the 5' or 3' position in the reference genome carry a T
- = no signs of DNA deamination patterns
ReadsDeam(1term): The number of reads (after deduplication and bedfiltering) that show a deamination in the terminal base positions
ReadsDeam(3term): The number of reads (after deduplication and bedfiltering) that show a deamination in the three terminal base positions
Deam5(95ci): For the terminal 5' end, the percentage of C to T substitutions (and the 95% confidence interval)
Deam3(95ci): For the terminal 3' end, the percentage of C to T substitutions (and the 95% confidence interval)
Deam5Cond(95ci): Considering only 3' deaminated sequences, the percentage of C to T substitutions (and the 95% confidence interval) at the 5' terminal base
Deam3Cond(95ci): Considering only 5' deaminated sequences, the percentage of C to T substitutions (and the 95% confidence interval) at the 3' terminal base
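The Ancientness flag can be read as a simple threshold rule. A minimal sketch, assuming the 9.5% threshold is applied to the point estimates of the terminal C to T percentages at each end; whether quicksand compares the estimate or a confidence bound is not stated here. The function name `classify_ancientness` is hypothetical.

```python
THRESHOLD = 9.5  # percent, as given in the Ancientness description


def classify_ancientness(deam5, deam3, threshold=THRESHOLD):
    """Map 5' and 3' terminal C-to-T percentages to '++', '+', or '-'."""
    above5 = deam5 > threshold
    above3 = deam3 > threshold
    if above5 and above3:
        return "++"  # both ends exceed the threshold
    if above5 or above3:
        return "+"  # only one end exceeds the threshold
    return "-"  # no sign of deamination
```

With this rule, `classify_ancientness(12.0, 3.0)` yields "+", since only the 5' end exceeds the threshold.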
final_report.tsv¶
The final report contains all the columns presented above. In addition, it contains a column FamPercentage, which provides the relative proportion of final reads (after deduplication or bedfiltering) of the assigned family in the readgroup. If there are several lines for one family and readgroup (e.g. after a rerun or with multiple fixed references), the highest number of final reads is used as the baseline for the other entries of the same family.
filtered_report.tsv¶
The filtered report contains all the columns from the final_report. However, the report is filtered by the two values FamPercentage and ProportionExpectedBreadth, as provided by the flags --reportfilter_percentage and --reportfilter_breadth (both default to 0.5).
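To illustrate the filtering, here is a minimal sketch that keeps the rows of a report whose two values reach the thresholds. It assumes a tab-separated file with a header line, that a row passes when both values are greater than or equal to their threshold, and that rows with missing or non-numeric values are dropped; quicksand's actual comparison may differ. The function name and the sample rows are made up for the example.

```python
import csv
import io


def filter_report(handle, min_percentage=0.5, min_breadth=0.5):
    """Keep rows where both filter columns reach their thresholds (assumed rule)."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            percentage = float(row["FamPercentage"])
            breadth = float(row["ProportionExpectedBreadth"])
        except (KeyError, ValueError, TypeError):
            continue  # assumption: rows without usable values are dropped
        if percentage >= min_percentage and breadth >= min_breadth:
            kept.append(row)
    return kept


# Made-up rows, reduced to the columns needed for the filter
sample = (
    "RG\tFamily\tFamPercentage\tProportionExpectedBreadth\n"
    "test1\tHominidae\t80.0\t0.9\n"
    "test1\tBovidae\t0.2\t0.7\n"
)
print([row["Family"] for row in filter_report(io.StringIO(sample))])
```

In this sample only the Hominidae row is kept, because the Bovidae row falls below the percentage threshold.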
cc_estimates.tsv
This file contains information about index hopping and cross-contamination.
nextflow
This directory contains information about the run, such as the command line used and the config files provided.
work
This directory can be deleted after the run. It contains Nextflow-specific intermediate files.