Usage

Running an analysis

After editing the settings in settingsfile.txt, start the analysis with:

bash START.sh settingsfile.txt

This command runs Guppy and the first clustering per sample. To run the reclustering and taxonomic identification:

bash START_recluster.sh settingsfile.txt

Parameters

Modify the following parameters in settingsfile.txt to run the pipeline:

Main parameters

| Variable | Description |
| --- | --- |
| RunName | Name of the analysis run; this is also the name of the output folder. |
| mail-user | Email address that receives updates about the Slurm job of the analysis. |
| Targets | Markers to analyze, separated by ','. The marker names must match the names in the primer file. For example: '18SV4,COI' |
| THREADS | Total CPU threads per sample. |
| GPU | GPU usage (only for Guppy basecalling): '1' to use the GPU or 'cpu' to run Guppy on CPU. |
| timepersample | Total minutes per sample reserved by Slurm. |
| RunModules | Modules to run. Options: all (default), Guppy, Clustering, Consensus, Blast, Taxonomy. An additional module, 'oldmode', runs the taxonomic identification per sample instead of per dataset. It must be combined with 'all' or other modules (e.g. "oldmode,clustering,Consensus,Blast,taxonomy"). |
| modus | "all" to analyze all datasets within OutDir, or "one" to analyze only the dataset with the same RunName and OutDir. |
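The RunModules rules above can be checked before submitting a job. The following is a minimal sketch, not part of Pimenta: the function name `check_run_modules` is hypothetical, and it assumes module names are compared case-insensitively (the example value above mixes "clustering" and "Consensus"), which Pimenta itself may or may not do.

```shell
# Check a RunModules value against the module names listed above.
# Assumption: names are compared case-insensitively; Pimenta may be stricter.
check_run_modules() (
    value=$1
    others=0
    has_oldmode=0
    [ -n "$value" ] || { echo "RunModules is empty" >&2; exit 1; }
    # Split on commas; the function body is a subshell, so the caller's IFS
    # and globbing settings are left untouched.
    IFS=','
    set -f
    for module in $value; do
        lower=$(printf '%s' "$module" | tr '[:upper:]' '[:lower:]')
        case "$lower" in
            all|guppy|clustering|consensus|blast|taxonomy) others=$((others + 1)) ;;
            oldmode) has_oldmode=1 ;;
            *) echo "Unknown module: $module" >&2; exit 1 ;;
        esac
    done
    # oldmode is only valid together with 'all' or at least one other module.
    if [ "$has_oldmode" -eq 1 ] && [ "$others" -eq 0 ]; then
        echo "oldmode must be combined with 'all' or other modules" >&2
        exit 1
    fi
    exit 0
)
```

For example, `check_run_modules "oldmode,clustering,Consensus,Blast,taxonomy"` succeeds, while `check_run_modules "oldmode"` fails with a message on stderr.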

File location parameters

| Variable | Description |
| --- | --- |
| FAST5Folder | Folder containing the nanopore fast5/pod5 files |
| OutDir | Path where the output of the DNA metabarcoding analysis is placed |
| workdir | Location of the Pimenta scripts on the HPC server, for example: ~/pimenta |
| SampleDescription | Semicolon-separated text file containing sample names, see Sample description |
| PrimerFile | File containing marker primers, see Primer file |
| NT_dmp | Contains nodes.dmp and names.dmp from NCBI, see Installation |
| DATABASE | BLAST database, for example: /lustre/shared/wfsr-databases/BLASTdb/nt |
| ExSeqids | File containing seqids that should be excluded from BLAST results, default=$workdir/Excluded.NCBI.identifications.tsv |
| ExTaxids | File containing taxids that should be excluded from BLAST results, default=$PWD/Excluded.NCBI.taxids |
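A missing taxonomy dump is easier to catch before a run than in a Slurm log. The sketch below assumes that NT_dmp points to a directory holding the two files named above; `check_nt_dmp` is a hypothetical helper, not a Pimenta script.

```shell
# Verify that an NT_dmp directory holds the two NCBI taxonomy dump files.
# Assumption: NT_dmp is a directory path, as the description above suggests.
check_nt_dmp() (
    dir=$1
    status=0
    for name in nodes.dmp names.dmp; do
        # -s is true only for files that exist and are not empty.
        if [ ! -s "$dir/$name" ]; then
            echo "Missing or empty: $dir/$name" >&2
            status=1
        fi
    done
    exit "$status"
)
```

Calling `check_nt_dmp /path/to/NT_dmp` returns a non-zero status and names each file that is absent or empty.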

Tool parameters

| Variable | Tool | Description |
| --- | --- | --- |
| MinionKit | Guppy | MinION kit used, for example: SQK-PBK004 |
| MinionFlowCell | Guppy | Flow cell used for sequencing, for example: FLO-FLG001 |
| ExpansionKit | Guppy | Barcode expansion kit, for example: EXP-NBD104. Set it to "none" if not used. |
| MinMaxLength | Prinseq | Length filter in nt applied by Prinseq, preferably around the length of your primers, for example: 100-500 |
| TrimLeft | Prinseq | Number of nt to be trimmed off the left side of each sequence, default=5 |
| TrimRight | Prinseq | Number of nt to be trimmed off the right side of each sequence, default=5 |
| MinQualMean | Prinseq | Minimum mean quality score a read needs to pass the filtering, default=12 |
| Ident1 | CD-HIT | Minimum percentage identity for the first clustering with CD-HIT, default=0.93 |
| Ident2 | CD-HIT | Minimum percentage identity for the reclustering with CD-HIT, default=1 |
| MinClustSize | CD-HIT | Minimum cluster size for the first clustering |
| MinClustSize2 | CD-HIT | Minimum cluster size for clusters in the reclustering |
| Error | Cutadapt | Maximum error rate that Cutadapt allows, default=0.15 |
| Evalue | BLAST | Maximum E-value that BLAST allows, default=0.001 |
| Pident | BLAST | Minimum percentage identity for filtered BLAST results, default=90 |
| Qcov | BLAST | Minimum query coverage for filtered BLAST results, default=90 |
| MaxTargetSeqs | BLAST | Maximum number of hits BLAST outputs per consensus sequence, default=100 |
| filtertax | BLAST | Keep only results within the given taxon, e.g. "Metazoa"; leave empty ("") to keep everything |
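Two of the values above have a shape that is easy to mistype: MinMaxLength is a `min-max` range, and the Ident1/Ident2 defaults (0.93 and 1) suggest a fraction rather than a number out of 100. The sketch below checks both. The function names are hypothetical, and the accepted range for the identity thresholds (greater than 0, at most 1) is an assumption inferred from those defaults, not something Pimenta documents.

```shell
# Check a MinMaxLength value such as 100-500: two integers, minimum below maximum.
check_min_max_length() (
    value=$1
    case "$value" in
        ''|*[!0-9-]*|-*|*-|*-*-*) exit 1 ;;   # empty, stray characters, misplaced hyphens
        *-*) ;;                               # exactly one hyphen between digits
        *) exit 1 ;;                          # no hyphen at all
    esac
    min=${value%-*}
    max=${value#*-}
    [ "$min" -lt "$max" ]
)

# Check an identity threshold (Ident1, Ident2).
# Assumption: valid values are fractions greater than 0 and at most 1.
check_identity() (
    value=$1
    case "$value" in
        ''|*[!0-9.]*|*.*.*) exit 1 ;;         # empty, non-numeric, or two decimal points
        *[0-9]*) ;;                           # at least one digit
        *) exit 1 ;;
    esac
    awk -v v="$value" 'BEGIN { exit !(v + 0 > 0 && v + 0 <= 1) }'
)
```

With these definitions, `check_min_max_length 100-500` and `check_identity 0.93` succeed, while `check_min_max_length 500-100` and `check_identity 93` fail.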

Primer file

The primer file is a semicolon-separated csv file that lists one primer per line, in the following format:

Target;sequence;forward
Target;sequence;reverse

Example:

18SV9;GTACACACCGCCCGTC;forward
18SV9;TGATCCTTCTGCAGGTTCACCTAC;reverse
18SV4;AGGGCAAKYCTGGTGCCAG;forward
18SV4;GRCGGTATCTRATCGYCTT;reverse

The target name must be identical to the name given in the Targets parameter of the settings file.
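The format and the naming rule above can be checked with standard tools. The following is a minimal sketch with hypothetical function names. The requirement that every target has both a forward and a reverse primer is an assumption based on the example, not a documented Pimenta rule.

```shell
# Validate a primer file: three semicolon-separated fields per line, an
# orientation of forward or reverse, and (assumed) both orientations per target.
check_primer_file() {
    awk -F';' '
        NF == 0 { next }   # skip blank lines
        NF != 3 || $1 == "" || $2 == "" {
            printf "line %d: expected Target;sequence;orientation\n", NR
            bad = 1; next
        }
        $3 != "forward" && $3 != "reverse" {
            printf "line %d: orientation must be forward or reverse\n", NR
            bad = 1; next
        }
        { seen[$1] = 1; has[$1, $3] = 1 }
        END {
            for (t in seen)
                if (!((t, "forward") in has) || !((t, "reverse") in has)) {
                    printf "%s: needs a forward and a reverse primer\n", t
                    bad = 1
                }
            exit bad
        }
    ' "$1" >&2
}

# Report every name in a Targets value (e.g. 18SV4,COI) that the primer file lacks.
check_targets() (
    targets=$1
    primer_file=$2
    status=0
    IFS=','
    set -f
    for target in $targets; do
        # Compare against the first column only, as a whole-line fixed string.
        if ! cut -d';' -f1 "$primer_file" | grep -qxF -e "$target"; then
            echo "Target not in primer file: $target" >&2
            status=1
        fi
    done
    exit "$status"
)
```

Run against the example above, `check_targets "18SV4,18SV9" primers.csv` succeeds, and `check_targets "18SV4,COI" primers.csv` reports COI as missing. The file name `primers.csv` is only a placeholder.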

Sample description

To give names to the different samples, create a sample description txt file in the following format:

##BarcodeID;SampleName
barcode01;Sample_1
barcode02;Sample_2

If you use oldmode, the sample description file is required and must also list the markers used for each sample. Example:

##BarcodeID;SampleName;DNAbarcodes
barcode03;Mix_1;miniCOI,18SV9,18SV4
barcode04;Mix_2;miniCOI,18SV9,18SV4
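For runs with many samples, the two-column file can be generated instead of typed. The sketch below is a hypothetical helper that assumes the samples use consecutive barcodes starting at barcode01. It does not write the third DNAbarcodes column needed for oldmode.

```shell
# Print a sample description that assigns barcode01, barcode02, ... to the
# sample names given as arguments, in order.
# Assumption: barcodes are consecutive and start at 01.
make_sample_description() (
    printf '##BarcodeID;SampleName\n'
    number=1
    for name in "$@"; do
        printf 'barcode%02d;%s\n' "$number" "$name"
        number=$((number + 1))
    done
)
```

For instance, `make_sample_description Sample_1 Sample_2 > sampledescription.txt` reproduces the first example above. The output file name is a placeholder.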

Output files

While running PIMENTA, many files are created in OutDir/RunName. Below is an overview with a short explanation of each output.

| file/folder | file/folder | file/folder | explanation |
| --- | --- | --- | --- |
| barcode*.`SampleName` | | | Folder containing individual sample data |
| barcode*.`SampleName` | barcode*.`SampleName`.fastq | | "raw" reads basecalled and demultiplexed by Guppy |
| barcode*.`SampleName` | barcode*.`SampleName`.QC.fastq, barcode*.`SampleName`.QC.fasta, barcode*.`SampleName`.QC.gd | | QC filtered reads in fastq and fasta format |
| barcode*.`SampleName` | ClustCons | multi-seq | Folders containing clustering data and cluster fasta files |
| RunName.ClusterContent.tsv | | | tsv file containing an overview of the reclustered clusters, with the size, taxonomy, etc. of each cluster |
| RunName.PS.fasta | | | fasta file containing consensus sequences from all samples from the first clustering |
| RunName.settings.all.txt | | | File containing the settings used during the analysis |
| RunName.stats.txt | | | File containing an overview of the cluster distribution over the different markers |
| Taxonomy_per_sample_`Ident2` | Target | | Folder containing tsv files with a taxonomy overview and read counts per sample |
| Taxonomy_per_sample_`Ident2` | krona | | Folder containing krona plots per target and sample |
| job_scheduler | LOGS | | Folder containing all Slurm log files (except Guppy logs) |
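A quick way to see whether every sample produced output is to compare the sample description with the per-sample folders. The sketch below is a hypothetical helper. It assumes each folder is named `<BarcodeID>.<SampleName>` directly under OutDir/RunName, which is how the barcode*.`SampleName` pattern in the overview reads.

```shell
# List per-sample folders named in a sample description that are missing
# from a run directory (OutDir/RunName).
# Assumption: folders are named <BarcodeID>.<SampleName>.
missing_sample_folders() (
    description=$1
    run_dir=$2
    # Skip the ## header and blank lines; any third column (DNAbarcodes) is ignored.
    grep -v -e '^#' -e '^$' "$description" |
    while IFS=';' read -r barcode sample rest; do
        [ -d "$run_dir/$barcode.$sample" ] || printf '%s\n' "$barcode.$sample"
    done
)
```

The function prints one line per missing folder and nothing when all folders are present.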