.. _output-page: Input and Output ================ .. _input: Input ----- quicksand requires as input demultiplexed, merged, and adapter trimmed sequencing libraries in :file:`.fastq` or :file:`.bam` format. Use the :code:`--split` flag to point to the directory containing these files. quicksand refers to the name of the files as readgroups:: splitdir/ readgroup1.fastq readgroup2.fastq readgroup3.bam .. note:: Quicksand will process all the :file:`.fastq,fq,fastq.gz,fq.gz` and :file:`.bam`-files within the :code:`--split` directory and ignore all other files. Output ------ quicksand writes all output files to the **quicksand_v2.6** directory. Within this directory the files are layed out as follows:: quicksand_v2.6 ├── out │ └── {taxon} │ ├── 1-extracted │ │ └── {RG}_extractedReads-{taxon}.bam │ ├── best // (for families not in --fixed) │ │ ├── 2-aligned │ │ │ └── {RG}.{family}.{species}.bam │ │ ├── 3-deduped │ │ │ └── {RG}.{family}.{species}_deduped.bam │ │ └── 4-bedfiltered │ │ └── {RG}.{family}.{species}_deduped_bedfiltered.bam │ └── fixed // (for families in --fixed) │ ├── 2-aligned │ │ └── {RG}.{family}.{species}.bam │ ├── 3-deduped │ │ └── {RG}.{family}.{species}_deduped.bam │ ├── 3-bedfiltered #(if --fixed_bedfiltering) │ │ └── {RG}.{family}.{species}_deduped_bedfiltered.bam │ ├── 5-deaminated │ │ ├── {RG}.{family}.{species}_deduped_deaminated_1term.bam │ │ └── {RG}.{family}.{species}_deduped_deaminated_3term.bam │ └── 6-mpileups │ ├── {RG}.{family}.{species}_term1_mpiled.tsv │ ├── {RG}.{family}.{species}_term3_mpiled.tsv │ └── {RG}.{family}.{species}_all_mpiled.tsv ├── stats │ ├── splitcounts.tsv │ ├── {RG}.kraken.report │ ├── {RG}.kraken.translate │ ├── {RG}_00_extracted.tsv │ ├── {RG}_01_mapped.tsv │ ├── {RG}_02_deduped.tsv │ ├── {RG}_03_bedfiltered.tsv │ ├── {RG}_04_deamination_positions/ │ │ ├── {family}.{species}_substitutions.tsv │ │ └── {family}.{species}_95CI-intervals.tsv │ └── {RG}_04_deamination.tsv ├── nextflow │ ├── {DATE}.commands │ └── {DATE}.config ├── cc_estimates.tsv ├── filtered_report_{N}p_{N}b.tsv # final_report.tsv with applied filters ├── R_final_report.tsv # final_report.tsv, but with R-friendly headers └── final_report.tsv .. _files: Files """"" Directory: out/TAXON ~~~~~~~~~~~~~~~~~~~~ .. rst-class:: file *1-extracted/$\{RG\}.extractedReads-$\{taxon\}.bam* .. rst-class:: description The ‘ExtractedReads’ are BAM files containing the binned sequences per [family or order], based on the KrakenUniq classification. They are generated for each assignment that passes the KrakenUniq filter thresholds. The sequences in these files are unmapped. .. rst-class:: file *2-aligned/$\{RG\}.$\{family\}.$\{species\}.bam* .. rst-class:: description The ‘2-aligned’ directory contains all alignments generated by mapping the ExtractedReads of each family to all reference genomes under its “best” reference node, as identified from the KrakenUniq report. The alignment files include only mapped sequences and are not filtered for low-quality scores. .. rst-class:: file *3-deduped/$\{RG\}.$\{family\}.$\{species\}_deduped.bam* .. rst-class:: description The ‘3-deduped/’ directory contains the alignments from ‘2-aligned/’, with PCR duplicates removed. Sequences not passing the mapping-quality threshold are excluded. These BAM-files are the main input-files for additional downstream analyses. .. rst-class:: file *4-bedfiltered/$\{RG\}.$\{family\}.$\{species\}_deduped_bedfiltered.bam* .. rst-class:: description The ‘4-bedfiltered/’ directory contains a single BAM file per family - the mapped and deduplicated sequences of the “final” best reference per family. Sequences overlapping low-complexity regions are removed based on the BED files provided with the quicksand database. By default, this directory is not generated for “fixed” references, as the fixed reference genomes lack the pre-generated BED files. The creation of a BED file and the bedfiltering-step can be enforced for ‘fixed’ references with the --fixed_bedfiltering flag. .. rst-class:: file *5-deaminated/$\{RG\}.$\{family\}.$\{species\}_deduped_deaminated_1term.bam* .. rst-class:: description | BAM FILE. The deduped alignment, filtered for reads that show a C to T | substitution at one of the terminal positions in respect to the reference genome .. rst-class:: file *5-deaminated/$\{RG\}.$\{family\}.$\{species\}_deduped_deaminated_3term.bam* .. rst-class:: description | BAM FILE. The deduped alignment, filtered for reads that show a C to T | substitution at one of the terminal `three` positions in respect to the reference genome .. rst-class:: file *6-mpileups/$\{RG\}.$\{family\}.$\{species\}_all_mpiled.tsv* .. rst-class:: description | TSV FILE. The deduped alignment, but in mpileup format. | The first three positions of each sequence are masked by setting the mapping quality to 0 .. rst-class:: file *6-mpileups/$\{RG\}.$\{family\}.$\{species\}_1term_mpiled.tsv* .. rst-class:: description | TSV FILE. Mpileup format. The first three positions of each sequence are masked by setting the mapping quality to 0. | The pileup contains only reads showing a C to T substitution at one of the terminal positions in respect to the reference genome .. rst-class:: file *6-mpileups/$\{RG\}.$\{family\}.$\{species\}_3term_mpiled.tsv* .. rst-class:: description | TSV FILE. Mpileup format. The first three positions of each sequence are masked by setting the mapping quality to 0. | The pileup contains only reads showing a C to T substitution at one of the terminal `three` positions in respect to the reference genome Directory: stats ~~~~~~~~~~~~~~~~ .. rst-class:: file *$\{RG\}.report* .. rst-class:: description The standard krakenuniq report .. rst-class:: file *$\{RG\}.translate* .. rst-class:: description The human readable kraken report in mpa-format .. rst-class:: file *stats/splitcounts.tsv* .. rst-class:: description TSV FILE. Contains for each readgroup the number of reads before (raw) and after the initial filter step:: RG ReadsRaw ReadsFiltered ReadsLengthfiltered test1 235 235 230 test2 235 235 230 test3 235 235 230 .. rst-class:: file *$\{RG\}_00_extracted.tsv* .. rst-class:: description TSV FILE. Contains the number of sequences assigned to a taxon based on the KrakenUniq classification:: Taxon ReadsExtracted Hominidae 235 .. rst-class:: file *$\{RG\}_01_mapped.tsv* .. rst-class:: description TSV FILE. Contains for each readgroup and family the number of sequences mapped to all reference genomes from the "best" node. The column 'Reference' shows if the reference genome was fixed. The proportion mapped is the proportion of mapped to extracted reads:: Order Family Species Reference ReadsMapped ProportionMapped Primates Hominidae Homo_sapiens fixed 235 0.913 .. rst-class:: file *$\{RG\}_02_deduped.tsv* .. rst-class:: description TSV FILE. Contains for each readgroup and family the number of unique reads mapped to the reference genome, the duplication rate and information from the :code:`samtools coverage` command:: Order: The taxonomic order Family: The taxonomic family Species: The taxonomic species used as reference for mapping Reference: The reference type: either 'best' or 'fixed' ReadsDeduped: The number of unique reads DuplicationRate: The duplication rate of the unique reads CoveredBP: 'covbases' of the samtools coverage command: The number of covered bases in the reference genome Coverage: 'meandepth' of the samtools coverage command: The mean depth of coverage Breadth: 'coverage' of the samtools coverage command (by 100): the proportion of covered bases in the reference genome ExpectedBreadth: Expected breadth based on the inStrain formula: expected_breadth = 1-e^(-0.833*coverage). See https://instrain.readthedocs.io/en/latest/important_concepts.html ProportionExpectedBreadth: The proportion of Breadth / ExpectedBreadth .. rst-class:: file *$\{RG\}_03_bedfiltered.tsv* .. rst-class:: description TSV FILE. Contains for each readgroup and family the number of sequences remaining in the bam-file after bedfiltering and the number of covered basepairs in the reference genome after removal of low-complexity sequences:: Order Family Species Reference ReadsBedfiltered PostBedCoveredBP Primates Hominidae Homo_sapiens fixed 97 4177 .. rst-class:: file *$\{RG\}_04_deamination.tsv* .. rst-class:: description TSV FILE. Contains for each readgroup the deamination stats for the BAM file after bedfiltering:: Ancientness: ++ = more than 9.5% of the reads that show a terminal C in both the 5' and 3' position in the reference genome, carry a T + = more than 9.5% of the reads that show a terminal C in either the 5' or 3' position in the reference genome, carry a T - = no signs for DNA deamination patterns ReadsDeam(1term): The number of reads (after deduplication and bedfiltering) that show a deamination in the terminal base positions ReadsDeam(3term): The number of reads (after deduplication and bedfiltering) that show a deamination in the three terminal base positions Deam5(95ci): For the terminal 5' end, the percentage of C to T substitutions (and the 95% confidence interval) Deam3(95ci): For the terminal 3' end, the percentage of C to T substitutions (and the 95% confidence interval) Deam5Cond(95ci): Taken only 3' deaminated sequences, report the percentage of C to T substitutions (and the 95% confidence interval) at the 5' terminal base Deam3Cond(95ic): Taken only 5' deaminated sequences, report the percentage of C to T substitutions (and the 95% confidence interval) at the 3' terminal base .. rst-class:: file *$\{RG\}_04_deamination_positions/* .. rst-class:: description A directory that contains for each final alignment (RG and Family) the DNA substitution-patterns per position. Two files report the substitution patterns for the first and last 10 positions as well as the 95% confidence intervals. We report C-to-T and G-to-A substitutions as well as all other substitution from C or G to one of the other three bases. final_report.tsv ~~~~~~~~~~~~~~~~ The final report contains all the columns presented above. In Addition, the final report contains a column :code:`FamPercentage` which provides the relative proportion of *final reads* (after deduplication or bedfiltering) of the assigned family in the readgroup. If there are several lines for one family and readgroup (e.g. after a rerun or multiple fixed references) the highest number of final reads is used as the baseline for the other entries of the same family filtered_report.tsv ~~~~~~~~~~~~~~~~ The filtered report contains all the columns from the final_report. However, the report is filtered by the two values :code:`FamPercentage` and :code:`ProportionExpectedBreadth` as provided by the flags :code:`--reportfilter_percentage` and :code:`--reportfilter_breadth` (both default to 0.5). | The :file:`cc_estimates.tsv` files contains information about index-hopping and cross contamintaion | The :file:`nextflow` directory contains information about the run, like the commandline used and the config-files provided | the :file:`work` directory can be deleted after the run - it contains nextflow specific intermediate files