Usage

Running an analysis

The basic command to run the analysis, after editing the settings in settingsfile.txt:

bash START.sh settingsfile.txt

The previous command runs Guppy and the first clustering per sample. To run the reclustering and Taxonomic identification:

bash START_recluster.sh settingsfile.txt

Modify the following parameters in settingsfile.txt to run the pipeline:

Main parameters

Variable	Description
RunName	Run name of the analysis, this will be the name of the output folder.
mail-user	Email address to receive updates about the slurm job of the analysis.
Targets	Markers to analyze separated with ‘,’ , these marker names must correspond to the names in the primer file. For example: ‘18SV4,COI’
THREADS	Total CPU threads per sample.
GPU	GPU usage (only for Guppy basecalling): ‘1’ for using GPU or ‘cpu’ to run Guppy on CPU.
timepersample	Total minutes per sample reserved by slurm.
RunModules	Modules to run. Options: all (default) , Guppy , Clustering , Consensus , Blast , Taxonomy Another module called ‘oldmode’ can also be used, which runs the tax identification per sample (instead of per dataset). Needs to be used in combination with ‘all’ or other modules (e.g. “oldmode,clustering,Consensus,Blast,taxonomy”)
modus	“all” to analyze all datasets within OutDir or “one” to analyze dataset with same RunName and OutDir.

File location parameters

Variable	Description
FAST5Folder	Folder containing the nanopore fast5/pod5 files
OutDir	Path where the output of the DNA metabarcoding analysis is placed
workdir	Location of the Pimenta scripts on the HPC server, example: ~/pimenta
SampleDescription	tsv file containing sample names, see id5
PrimerFile	file containing marker primers, see id4
NT_dmp	contains nodes.dmp and names.dmp from NCBI Installation
DATABASE	BLAST database, for example: /lustre/shared/wfsr-databases/BLASTdb/nt
ExSeqids	file containing seqids that should be excluded from BLAST results, default=$workdir/Excluded.NCBI.identifications.tsv
ExTaxids	file containing taxids that should be excluded from BLAST results, default=$PWD/Excluded.NCBI.taxids

Tool parameters

Variable	Tool	Description
MinionKit	Guppy	MinION kit used, for example: SQK-PBK004
MinionFlowCell	Guppy	Flow cell used for sequencing, for example: FLO-FLG001
ExpansionKit	Guppy	Barcode expansion kit, set it to “none” if not used. example: EXP-NBD104
MinMaxLength	Prinseq	Filtering on nt length with Prinseq, preferably around the length of your primers, for example: 100-500
TrimLeft	Prinseq	Total nt to be trimmed of on the left side of sequence, default=5
TrimRight	Prinseq	Total nt to be trimmed of on the right side of sequence, default=5
MinQualMean	Prinseq	Minimum quality score of reads that can pass the filtering, default=12
Ident1	CD-HIT	Minimum percentage identity for the first clustering with CD-HIT, default=0.93
Ident2	CD-HIT	Minimum percentage identity for the reclustering with CD-HIT, default=1
MinClustSize	CD-HIT	Minimum cluster size for first clustering
MinClustSize2	CD-HIT	Minimum cluster size for clusters in reclustering
Error	Cutadapt	Maximum error rate that Cutadapt allows, default=0.15
Evalue	BLAST	Maximum E-value that BLAST allows, default=0.001
Pident	BLAST	Minimum percentage identity for filtered BLAST results, default=90
Qcov	BLAST	Minimum Query coverage for filtered BLAST results, default=90
MaxTargetSeqs	BLAST	Maximum amount of hits BLAST outputs per consensus sequence, default=100
filtertax	BLAST	Keep all results in $filtertax, ex. “Metazoa”, leave empty “” to keep everything

The primer file is a csv file including primer sequences in the following format:

Target;sequence;forward
Target;sequence;reverse

Example:

18SV9;GTACACACCGCCCGTC;forward
18SV9;TGATCCTTCTGCAGGTTCACCTAC;reverse
18SV4;AGGGCAAKYCTGGTGCCAG;forward
18SV4;GRCGGTATCTRATCGYCTT;reverse

It is important that the target name is the same name given in the settingsfile.

If you want to give names to the different samples, you could create a sample description txt file. It has the following format:

##BarcodeID;SampleName
barcode01;Sample_1
barcode02;Sample_2

If you want to use oldmode, then this sampledescription is a requirement and also includes the markers used. example:

##BarcodeID;SampleName;DNAbarcodes
barcode03;Mix_1;miniCOI,18SV9,18SV4
barcode04;Mix_2;miniCOI,18SV9,18SV4

While running PIMENTA, a lot of files are created in OutDir / RunName Beneath is an overview with a short explanation per output.

file/folder	file/folder	file/folder	explanation
barcode*.`SampleName`			Folder containing individual sample data
barcode*.`SampleName`	barcode*.`SampleName`.fastq		“raw” reads basecalled and demultiplexed by Guppy
barcode*.`SampleName`	barcode.`SampleName`.QC.fastq, barcode.`SampleName`.QC.fasta, barcode*.`SampleName`.QC.gd		QC filtered reads in fastq and fasta format
barcode*.`SampleName`	ClustCons	multi-seq	folders containing clustering data and cluster fasta files
RunName.ClusterContent.tsv			tsv file containing an overview of the reclustered clusters with more information of the size, taxonomy, etc. of each cluster
RunName.PS.fasta			fasta file containing consensus sequences from all samples from the first clustering
RunName.settings.all.txt			file containing the settings used during the analysis
RunName.stats.txt			file containing an overview of the cluster distribution over the different markers
Taxonomy_per_sample_`Ident2`	Target		folder containining tsv files with taxonomy overview with readcounts per sample
Taxonomy_per_sample_`Ident2`	krona		folder containining krona plots per target and sample
job_scheduler	LOGS		this folder contains all slurm log files (except Guppy logs)