# Meta-Omics Software for Community Analysis (MOSCA)

**MOSCA** (Portuguese for "fly") is a pipeline for integrated metagenomics (MG) and metatranscriptomics (MT) data analysis, in a mostly local and fully automated workflow.

## Features
* **Preprocessing**, where low-quality regions of the data are trimmed and reads of less interest are removed. FastQC reports are used to automatically set the parameters for the other tools. It includes:
    * initial quality check with **FastQC**
    * removal of Illumina artificial sequences with **Trimmomatic**: based on the **FastQC** reports, MOSCA selects the adapters file most appropriate for the data
    * rRNA removal with **SortMeRNA**: uses the Rfam and SILVA databases as reference
    * quality trimming with **Trimmomatic**:
        * another **FastQC** report is generated after rRNA removal and is used to set the parameters for **Trimmomatic**'s hard trimmers (CROP and HEADCROP), so that FastQC reports the resulting data as high quality
        * reads with an average quality below 20 or shorter than 100 nucleotides are also removed
    * final quality check with **FastQC**
* **Assembly**, where trimmed MG reads are assembled to partially reconstruct the original genomes in the samples. It includes:
    * assembly with one of two assemblers, **MetaSPAdes** or **Megahit**, both used in a multi-kmer approach
    * quality control of the contigs, with **MetaQUAST** reporting several classical metrics (such as N50 and L50), and alignment of reads with **Bowtie2** to estimate the percentage of reads used in the assembly
* **Annotation**, where the proteins encoded in the contigs are identified. It includes:
    * gene calling with **FragGeneScan**
    * annotation of the identified ORFs with **DIAMOND**, using the **UniProt database** as reference; MOSCA reports only the first annotation
    * retrieval of diverse biological information with [**UPIMAPI**](https://anaconda.org/bioconda/UPIMAPI)
    * functional annotation with [**reCOGnizer**](https://anaconda.org/bioconda/reCOGnizer), using the **COG database** as reference
        * MOSCA automatically **generates new databases according to the number of threads specified**, allowing multithreaded annotation with **RPSBLAST**
    * quantification of each protein in the MG data, by aligning MG reads to the contigs with **Bowtie2** and assigning reads to proteins with **HTSeq-count**
* **Binning**, where the contigs are clustered into taxonomic units, to validate (or not) the annotation and possibly help reconstruct genomes from the samples. It includes:
    * **MaxBin2** bins the contigs by tetranucleotide composition, relative abundance, and marker gene analysis
    * the completeness of the final bins is reported, that is, how many of the marker genes are present in each bin
* **Metatranscriptomics analysis**, where the expression of each identified gene is quantified. It includes:
    * alignment of MT reads to the MG contigs with **Bowtie2**, and assignment of reads to proteins with **HTSeq-count**
    * differential gene expression and multisample comparison with **DESeq2**
* **Normalization** of protein quantification for the final reports using **edgeR**
* **Pathway representation** with [**KEGGCharter**](https://anaconda.org/bioconda/KEGGCharter), representing both the metabolic networks of the most abundant taxa and the expression levels of metabolic functions
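
The read filter described under Preprocessing (reads with an average quality below 20 or shorter than 100 nucleotides are removed) can be illustrated with a few lines of Python. This is only a sketch of the rule, not MOSCA's code: in the pipeline the filtering is done by Trimmomatic, where the corresponding steps are `AVGQUAL` and `MINLEN`. The function name, its parameters, and the Phred+33 quality encoding are assumptions made for the example.

```python
def passes_filter(quality_string, min_avg_quality=20, min_length=100, phred_offset=33):
    """Return True if a read would be kept under the two thresholds.

    quality_string is the FASTQ quality line of one read; Phred+33
    encoding is assumed (hypothetical helper, not part of MOSCA).
    """
    # An empty read can never pass, and would break the average below.
    if not quality_string:
        return False
    # Length filter: drop reads shorter than min_length.
    if len(quality_string) < min_length:
        return False
    # Decode each quality character into its Phred score.
    scores = [ord(ch) - phred_offset for ch in quality_string]
    # Average-quality filter: keep reads whose mean score reaches the threshold.
    return sum(scores) / len(scores) >= min_avg_quality
```

For example, a 100-nucleotide read whose bases all have quality `I` (Phred 40) is kept, while the same read with one base fewer is dropped for its length.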
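
The N50 and L50 metrics mentioned under Assembly have short definitions: with contigs sorted from longest to shortest, N50 is the length of the contig at which the running total first reaches half of the assembly size, and L50 is the number of contigs needed to get there. MetaQUAST computes these in the pipeline; the hypothetical helper below is only a sketch to make the definitions concrete.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_assembly = sum(lengths) / 2
    running_total = 0
    # Walk from the longest contig down until half the assembly is covered.
    for contig_count, length in enumerate(lengths, start=1):
        running_total += length
        if running_total >= half_assembly:
            return length, contig_count


# Contigs of 80, 70, 50, 40, 30, 20 and 10 bp sum to 300 bp;
# the two longest (80 + 70 = 150) already cover half of it.
print(n50_l50([80, 70, 50, 40, 30, 20, 10]))  # (70, 2)
```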
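
Reporting only the first annotation for each ORF, as described under Annotation, amounts to keeping the first hit listed for every query. The sketch below assumes DIAMOND's tab-separated output, with the query ID in the first column and the subject ID in the second, and assumes that "first" means the first row for that query in the file. The function name and the IDs in the example are made up, and this is not taken from MOSCA's source.

```python
def first_hits(tabular_lines):
    """Map each query ID to the subject ID of its first listed hit."""
    hits = {}
    for line in tabular_lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        # setdefault keeps the first subject seen and ignores later hits.
        hits.setdefault(query_id, subject_id)
    return hits


# Made-up rows, truncated to three columns for illustration.
rows = [
    "orf_1\tprotA\t98.5",
    "orf_1\tprotB\t91.0",
    "orf_2\tprotC\t87.2",
]
print(first_hits(rows))  # {'orf_1': 'protA', 'orf_2': 'protC'}
```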