Commit 24e01da5 authored by iquasere

Updated documentation

Added meta.yaml as example for that format
Added download-uniprot argument for annotation
Removed uniprot download from build script
Started working on the MP script
parent 86311648
@@ -81,3 +81,28 @@ which can be used for MOSCA as follows:
```
python mosca.py --configfile config.json
```
The config file allows users to customize MOSCA's workflow but, for convenience, many typical decisions in the MG and
MT workflows are already automated. Customization is therefore limited to steps that are not yet well established
in the field of MG (e.g., assembling reads into contigs is still a controversial step that may lose information).
The following options are available in the config file, with their accepted values; a consolidated example follows the table:
| Parameter | Options | Required | Description |
|:------------------------------:|:------------------------------------------------------------:|:--------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| output | String | Yes | Name of folder where MOSCA's results will be stored (if it doesn't exist, it will be created) |
| threads                        | Int                                                           | Yes      | Maximum number of threads for MOSCA to use                                                                                                                                                                                       |
| experiments | String | Yes | Name of TSV file with information on samples/files/conditions |
| trimmomatic_adapters_directory | String                                                        | Yes      | Name of folder containing adapter files for Trimmomatic's adapter removal preprocessing step                                                                                                                                    |
| rrna_databases_directory | String | Yes | Name of folder containing rRNA databases to use as reference for rRNA removal with SortMeRNA |
| assembler | metaspades, megahit | Yes | Name of assembler to use for iterative co-assembly of MG data |
| markerset                      | 40, 107                                                       | Yes      | Marker set to use for completeness/contamination estimation with CheckM on the bins obtained with MaxBin2                                                                                                                        |
| error_model                    | sanger_5, sanger_10, 454_10, 454_30, illumina_5, illumina_10  | No       | Error model for gene calling with FragGeneScan: pick a sanger, 454 or illumina model depending on whether Sanger, pyro- or Illumina sequencing reads are the input to gene calling. Leave empty if assembly was performed.       |
| diamond_database               | String                                                        | Yes      | Name of FASTA or DMND (DIAMOND formatted database) file to use as input for annotation with DIAMOND                                                                                                                              |
| download_uniprot               | TRUE, FALSE                                                   | Yes      | Whether UniProtKB (SwissProt + TrEMBL) should be downloaded. If TRUE, it will be downloaded to the folder indicated in diamond_database                                                                                          |
| diamond_max_targets_seqs       | Int                                                           | Yes      | Number of matches to report for each protein from annotation with DIAMOND                                                                                                                                                        |
| recognizer_databases_directory | String | Yes | Name of folder containing the resources for reCOGnizer annotation. If those are not present in the folder, they will be downloaded |
| normalization_method | TMM, RLE | Yes | Method to use for normalization |
| keggcharter_maps               | Comma-separated list of KEGG maps' IDs                        | No       | If empty, KEGGCharter will use the default prokaryotic maps. These metabolic maps will represent MG information, plus gene expression if MT data is available                                                                    |
| keggcharter_taxa_level         | SPECIES, GENUS, FAMILY, ORDER, CLASS, PHYLUM, SUPERKINGDOM    | Yes      | The taxonomic level to represent with KEGGCharter. If above SPECIES, KEGGCharter will aggregate information by taxonomic group and represent it as such                                                                          |
| keggcharter_number_of_taxa | Int, ideally under 11 | Yes | How many of the most abundant taxa should be represented with KEGGCharter |
| reporter_lists_directory | String | Yes | Name of folder containing lists for reporter module of MOSCA |
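Putting the table together: the sketch below writes a complete, minimal config.json programmatically (all values are illustrative examples, not MOSCA defaults):
```
import json

# Illustrative values only -- adjust paths and names to your setup
config = {
    "output": "mosca_results",
    "threads": 8,
    "experiments": "experiments.tsv",
    "trimmomatic_adapters_directory": "resources/illumina_adapters",
    "rrna_databases_directory": "resources/rRNA_databases",
    "assembler": "megahit",
    "markerset": "40",
    "error_model": "illumina_5",
    "diamond_database": "resources/uniprot.dmnd",
    "download_uniprot": "TRUE",
    "diamond_max_targets_seqs": 1,
    "recognizer_databases_directory": "resources/reCOGnizer_resources",
    "normalization_method": "TMM",
    "keggcharter_maps": "",
    "keggcharter_taxa_level": "GENUS",
    "keggcharter_number_of_taxa": 10,
    "reporter_lists_directory": "resources",
}

with open("config.json", "w") as handle:
    json.dump(config, handle, indent=2)
```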
@@ -8,7 +8,8 @@
"markerset": "40",
"error_model": "illumina_5",
"diamond_database": "resources/uniprot.dmnd",
"diamond_max_targets_seqs": "1",
"download_uniprot": "TRUE",
"diamond_max_targets_seqs": 1,
"recognizer_databases_directory": "resources/reCOGnizer_resources",
"normalization_method": "TMM",
"keggcharter_maps": "",
...
---
output: snakemake_learn
threads: 14
experiments: experiments.tsv
trimmomatic_adapters_directory: resources/illumina_adapters
rrna_databases_directory: resources/rRNA_databases
assembler: metaspades
markerset: '40'
error_model: illumina_5
diamond_database: resources/uniprot.dmnd
download_uniprot: 'TRUE'
diamond_max_targets_seqs: 1
recognizer_databases_directory: resources/reCOGnizer_resources
normalization_method: TMM
keggcharter_maps: ''
keggcharter_taxa_level: GENUS
keggcharter_number_of_taxa: 10
reporter_lists_directory: resources
\ No newline at end of file
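The meta.yaml above encodes the same information as config.json, just in YAML. As a convenience, one format can be converted to the other; a minimal sketch, assuming PyYAML is installed and config.json exists:
```
import json
import yaml  # PyYAML, assumed to be installed

# JSON -> YAML: same keys and values, only the serialization changes
with open('config.json') as handle:
    config = json.load(handle)
with open('meta.yaml', 'w') as handle:
    yaml.safe_dump(config, handle, default_flow_style=False)
```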
@@ -28,7 +28,6 @@ def join_reads_input(wildcards):
return ['{}/Preprocess/Trimmomatic/quality_trimmed_{}{}.fq'.format(config["output"], name, fr) for name in names
for files in df for fr in (['_forward_paired', '_reverse_paired'] if ',' in files else [''])]
rule all:
input:
expand("{output}/Binning/{sample}/checkm.tsv", output = config["output"], sample = set(experiments['Sample'])),
@@ -61,7 +60,6 @@ rule join_reads:
fr = (['_forward', '_reverse'] if experiments["Files"].str.contains(',').tolist() else ''))
run:
for file in input:
print(file)
if 'forward' in file:
shell("touch {output}/Assembly/{wildcards.sample}_forward.fastq; cat {file} >> {output}/Assembly/{wildcards.sample}_forward.fastq", output = config["output"])
elif 'reverse' in file:
@@ -105,15 +103,18 @@ rule annotation:
threads:
config["threads"]
run:
shell("python annotation.py -i {input} -t {threads} -a -o {output}/Annotation/{sample} -em {error_model} -db {diamond_database} -mts {diamond_max_targets_seqs}",
shell("python annotation.py -i {input} -t {threads} -a -o {output}/Annotation/{sample} -em {error_model} -db {diamond_database} -mts {diamond_max_targets_seqs}{download_uniprot}",
output = config["output"], sample = set(experiments['Sample']), error_model = config["error_model"],
diamond_database = config["diamond_database"], diamond_max_targets_seqs = config["diamond_max_targets_seqs"])
diamond_database = config["diamond_database"], diamond_max_targets_seqs = config["diamond_max_targets_seqs"],
download_uniprot = ' --download-uniprot' if config["download_uniprot"].upper() == 'TRUE' else '')
rule upimapi:
input:
expand("{output}/Annotation/{sample}/aligned.blast", output = config["output"], sample = set(experiments["Sample"])) # TODO - will I have to create a mock file to force this to run for all aligned.blast?
output:
expand("{output}/Annotation/uniprotinfo.tsv", output = config["output"])
threads:
1
run:
shell("python UPIMAPI/upimapi.py -i {input} -o {output}/Annotation/uniprotinfo --blast --full-id",
output = config["output"]) # TODO - allow to set these columns and databases - allow them to be set with files
@@ -161,7 +162,7 @@ rule join_information:
expand("{output}/MOSCA_Entry_Report.xlsx", output = config["output"]),
expand("{output}/Metatranscriptomics/expression_matrix.tsv", output = config["output"])
threads:
config["threads"]
config["threads"] - 2
run:
shell("python join_information.py -e {experiments} -t {threads} -o {output}",
experiments = config["experiments"], output = config["output"])
@@ -172,6 +173,8 @@ rule differential_expression:
output:
expand("{output}/Metatranscriptomics/gene_expression.jpeg", output = config["output"]),
expand("{output}/Metatranscriptomics/sample_distances.jpeg", output = config["output"])
threads:
1
run:
conditions = ",".join(map(str, mt_experiments['Condition'].tolist()))
shell("Rscript de_analysis.R --readcounts {input} --conditions {conditions} --output {output}/Metatranscriptomics", # problems with libreadline.so.6 might be solved with cd /lib/x86_64-linux-gnu/; sudo ln -s libreadline.so.7.0 libreadline.so.6
@@ -182,11 +185,13 @@ rule kegg_charter:
expand("{output}/MOSCA_Entry_Report.xlsx", output = config["output"])
output:
expand("{output}/KEGGCharter_results.xlsx", output = config["output"])
threads:
1
run:
shell("kegg_charter.py -f {input} -o {output} -mm {metabolic_maps} -mgc {mg_cols} -mtc {exp_cols} -tc 'Taxonomic lineage ({taxa_level})' -not {number_of_taxa} -keggc 'Cross-refernce (KEGG)'",
output = config["output"], metabolic_maps = config["keggcharter_maps"], mg_cols = mg_experiments['Name'].tolist(),
exp_cols = mt_experiments['Name'].tolist(), taxa_level = config["keggcharter_taxa_level"],
number_of_taxa = config["keggcharter_number_of_taxa"])
shell("kegg_charter.py -f {input} -o {output}{metabolic_maps} -mgc {mg_cols} -mtc {exp_cols} -tc 'Taxonomic lineage ({taxa_level})' -not {number_of_taxa} -keggc 'Cross-refernce (KEGG)'",
output = config["output"], metabolic_maps = " -mm".format(config["keggcharter_maps"]),
mg_cols = mg_experiments['Name'].tolist(), exp_cols = mt_experiments['Name'].tolist(),
taxa_level = config["keggcharter_taxa_level"], number_of_taxa = config["keggcharter_number_of_taxa"])
shutil.copyfile('{}/KEGGCharter_results.xlsx'.format(config["output"]),
'{}/MOSCA_Entry_Report.xlsx'.format(config["output"]))
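The same optional-flag pattern now appears twice in the Snakefile (for --download-uniprot in the annotation rule and for -mm here): the flag string is built only when the config value is set, and an empty string is interpolated otherwise. A minimal illustration, with a hypothetical helper that is not part of MOSCA:
```
# Hypothetical helper showing the optional-flag pattern used in the shell() calls
def optional_flag(flag, value):
    return ' {} {}'.format(flag, value) if value else ''

assert optional_flag('-mm', '') == ''
assert optional_flag('-mm', 'map00010,map00020') == ' -mm map00010,map00020'
```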
@@ -197,6 +202,8 @@ rule report:
output:
expand("{output}/technical_report.txt", output = config["output"]),
expand("{output}/MOSCA_General_Report.xlsx", output = config["output"])
threads:
1
run:
shell("python report.py -e {experiments} -o {output} -ldir {reporter_lists}", experiments = config["experiments"],
output = config["output"], reporter_lists = config["reporter_lists_directory"])
\ No newline at end of file
dir="${PREFIX}/share/MOSCA"
mkdir -p "${dir}/scripts" "${PREFIX}/bin"
cp workflow/scripts/* "${dir}/scripts
cp workflow/Snakefile workflow/mosca.py "${dir}/scripts
cp workflow/scripts/* "${dir}/scripts"
cp workflow/Snakefile workflow/mosca.py "${dir}/scripts"
cp -r resources "${dir}/resources"
chmod +x "${dir}/scripts/mosca.py"
ln -s "${dir}/scripts/mosca.py" "${PREFIX}/bin/"
@@ -9,10 +9,6 @@ ln -s "${dir}/scripts/mosca.py" "${PREFIX}/bin/"
# Databases download
resources_dir="${PREFIX}/share/MOSCA/resources"
svn export https://github.com/timflutre/trimmomatic/trunk/adapters "${resources_dir}/illumina_adapters"
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
zcat uniprot_trembl.fasta.gz uniprot_sprot.fasta.gz > "${resources_dir}/uniprot.fasta"
rm uniprot_trembl.fasta.gz uniprot_sprot.fasta.gz
svn export https://github.com/biocore/sortmerna/trunk/rRNA_databases "${resources_dir}/rRNA_databases"
find "${resources_dir}/rRNA_databases/*" | grep -v ".fasta" | xargs rm -fr
...
@@ -41,6 +41,8 @@ class Annotater:
'illumina_5', 'illumina_10'])
parser.add_argument("-db", "--database", type = str, help = "Database for annotation")
parser.add_argument("-mts", "--max-target-seqs", type=str, help="Number of identifications for each protein")
parser.add_argument("--download-uniprot", action = "store_true", default = False,
help="Download uniprot database if FASTA DB doesn't exist")
args = parser.parse_args()
@@ -68,10 +70,16 @@ class Annotater:
'{} -out={} -complete=1 -train=./complete'.format(file, output) if assembled
else '{} -out={} -complete=0 -train=./{}'.format(file.replace(
'fastq', 'fasta'), output, error_model)))
def download_uniprot(self, out_dir):
for db in ['trembl', 'sprot']:
run_command('wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_{}.fasta.gz'.format(db))
run_command('zcat uniprot_trembl.fasta.gz uniprot_sprot.fasta.gz', file='{}/uniprot.fasta'.format(out_dir))
run_command('rm uniprot_trembl.fasta.gz uniprot_sprot.fasta.gz')
def generate_diamond_database(self, fasta, dmnd):
run_command('diamond makedb --in {} -d {}'.format(fasta, dmnd))
def run_diamond(self, query, aligned, unaligned, database, threads = '12',
max_target_seqs = '50'):
if database[-6:] == '.fasta' or database[-4:] == '.faa':
@@ -93,13 +101,16 @@ class Annotater:
def run(self):
args = self.get_arguments()
self.gene_calling(args.input, '{}/fgs'.format(args.output), threads = args.threads,
assembled = args.assembled, error_model = args.error_model)
if args.download_uniprot:
self.download_uniprot('/'.join(args.database.split('/')[:-1]))
self.run_diamond('{}/fgs.faa'.format(args.output), '{}/aligned.blast'.format(args.output),
'{}/unaligned.fasta'.format(args.output), args.database,
threads=args.threads, max_target_seqs=args.max_target_seqs) # TODO - annotation has to be refined to retrieved better than hypothetical proteins
threads=args.threads, max_target_seqs=args.max_target_seqs) # TODO - annotation has to be refined to retrieve better than hypothetical proteins
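Pieced together from the Snakefile's annotation rule, a standalone invocation of this script would look like the following (paths and values are illustrative):
```
python annotation.py -i sample_contigs.fasta -t 8 -a -o output/Annotation/sample -em illumina_5 -db resources/uniprot.dmnd -mts 1 --download-uniprot
```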
'''
Input:
...
@@ -14,6 +14,7 @@ import shlex
import shutil
import subprocess
import pandas as pd
import argparse
from annotation import Annotater
from lxml import etree
from mosca_tools import parse_fasta, run_command, sort_alphanumeric, parse_blast
@@ -26,6 +27,25 @@ class MetaproteomicsAnalyser:
self.__dict__ = kwargs
self.searchgui_exe = glob.glob('SearchGUI-*.*.*/SearchGUI-*.*.*.jar')[-1]
self.peptide_shaker_exe = glob.glob('PeptideShaker-*.*.*/PeptideShaker-*.*.*.jar')[-1]
def get_arguments(self):
parser = argparse.ArgumentParser(description="MOSCA's metaproteomics analysis")
parser.add_argument("-sf", "--spectra-folder", type=str, required=True,
help="Folder with spectra to be analysed")
parser.add_argument("-t", "--threads", type=str,
default=str(multiprocessing.cpu_count() - 2),
help="Number of threads to use. Default is number of CPUs available minus 2.")
parser.add_argument("-o", "--output", type=str, help="Output directory")
parser.add_argument("-w", "--workflow", type=str, help="Workflow to use", choices=['maxquant','compomics'])
parser.add_argument("-o", "--output", type=str, help="Output directory")
parser.add_argument("-rdb", "--reference-database", type=str,
help="Database file (FASTA format) with reference sequences for protein identification")
parser.add_argument("-cdb", "--contaminants-database", type=str,
help="Database file (FASTA format) with contaminant sequences")
args = parser.parse_args()
args.output = args.output.rstrip('/')
return args
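Assuming the MP script is saved as metaproteomics_analyser.py (the filename is not shown in this commit), these arguments support an invocation such as:
```
python metaproteomics_analyser.py -sf spectra/ -t 8 -o output/Metaproteomics -w compomics -rdb output/Annotation/database.fasta -cdb resources/crap.fasta
```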
'''
input:
@@ -48,40 +68,21 @@ class MetaproteomicsAnalyser:
a FASTA file named after output will be created, containing the sequences from
metagenomics, from the cRAP database, and from the protease
'''
def database_generation(self, output, crap_database, protease = 'trypsin',
how = 'raw', faa = None, blast = None):
def database_generation(self, database, output, crap_database, protease='trypsin', how='raw', blast=None):
temp = '/'.join(output.split('/')[:-1]) + '/temporary.fasta'
print('Generating new database to ' + output)
if how == 'uniprot_sequences':
annotater = Annotater()
annotater.recursive_uniprot_fasta(output, blast = blast)
elif how == 'raw':
database = list()
faa = parse_fasta(faa)
print('Removing sequences with unknown bases (*).')
pbar = ProgressBar()
for k,v in pbar(faa.items()):
if '*' not in v:
database.append([k,v])
elif how == 'uniprot_ids':
faa = parse_fasta(faa)
faa = pd.DataFrame.from_dict(faa, orient='index')
faa.columns = ['sequence']
faa = pd.merge(blast, faa, left_on = 'qseqid', right_index = True)
with open(temp, 'w') as file:
print('Writing database with UniProt IDs to ' + temp)
pbar = ProgressBar()
for i in pbar(range(len(faa))):
file.write('>' + faa.iloc[i]['sseqid'] + '\n' + faa.iloc[i]['sequence'] + '\n')
print('Generating new database in {}'.format(output))
print('Removing asterisks (*)')
run_command("sed 's/*//g' {} > {}".format(database, output))
if protease == 'trypsin': # TODO - pig trypsin is not the only protease used, will have to include more - see from searchgui
if not os.path.isfile(output + '/P00761.fasta'):
if not os.path.isfile('P00761.fasta'):
print('Trypsin file not found. Will be downloaded from UniProt.')
run_command('wget https://www.uniprot.org/uniprot/P00761.fasta' +
' -P ' + '/'.join(output.split('/')[:-1]))
protease = output + '/P00761.fasta'
run_command('cat ' + ' '.join([crap_database, protease]), file = output, mode = 'a')
run_command('wget https://www.uniprot.org/uniprot/P00761.fasta -P {}'.format(
'/'.join(output.split('/')[:-1])))
protease = 'P00761.fasta'
run_command('cat {}'.format(' '.join([crap_database, protease])), file=output, mode='a')
'''
input:
@@ -309,44 +310,52 @@ class MetaproteomicsAnalyser:
shutil.rmtree(directory, ignore_errors=True)
run_command('maxquant ' + mqpar) # TODO - get the shell messages from MaxQuant to appear
os.rename(spectra_folder + '/combined', output_folder + '/maxquant_results')
def compomics_workflow(self):
self.verify_crap_db(self.crap_database)
self.create_decoy_database(self.database, self.searchgui_exe)
try: # try/except - https://github.com/compomics/searchgui/issues/217
self.generate_parameters_file(self.output + '/params.par',
self.database.replace('.fasta', '_concatenated_target_decoy.fasta'),
self.searchgui_exe)
except:
print('An illegal reflective access operation has occurred. But MOSCA can handle it.')
self.peptide_spectrum_matching(self.spectra_folder, self.output,
self.output + '/params.par',
self.searchgui_exe, threads=self.threads)
self.browse_identification_results(self.spectra_folder, self.output + '/params.par',
self.output + '/searchgui_out.zip', self.output + '/ps_output.cpsx',
self.peptide_shaker_exe)
try: # try/except - if no identifications are present, will throw an error
self.generate_reports(self.output + '/ps_output.cpsx', self.output + '/reports',
self.peptide_shaker_exe)
except:
print('No identifications?')
self.spectra_counting((self.output + '/reports/' + self.experiment_name +
'_' + self.sample_name + '_' + self.replicate_number +
'_Default_Protein_Report.txt'), self.blast,
self.output + '/Spectra_counting.tsv')
def maxquant_workflow(self):
self.create_mqpar(self.output + '/mqpar.xml')
self.edit_maxquant_mqpar(self.output + '/mqpar.xml', self.output + '/database.fasta',
self.spectra_folder, self.experiment_names,
threads=self.threads)
self.run_maxquant(self.output + '/mqpar.xml', self.spectra_folder, self.output)
def run(self):
self.verify_crap_db(self.crap_database)
self.database_generation(self.database, self.output + '/database_uniprot_sequences.fasta',
self.crap_database, self.protease,
how = 'raw', blast = self.blast)
if self.workflow == 'maxquant':
self.create_mqpar(self.output + '/mqpar.xml')
self.edit_maxquant_mqpar(self.output + '/mqpar.xml', self.output + '/database.fasta',
self.spectra_folder, self.experiment_names,
threads = self.threads)
self.run_maxquant(self.output + '/mqpar.xml', self.spectra_folder, self.output)
self.maxquant_workflow()
elif self.workflow == 'compomics':
self.create_decoy_database(self.database, self.searchgui_exe)
try: # try/except - https://github.com/compomics/searchgui/issues/217
self.generate_parameters_file(self.output + '/params.par',
self.database.replace('.fasta', '_concatenated_target_decoy.fasta'),
self.searchgui_exe)
except:
print('An illegal reflective access operation has occurred. But MOSCA can handle it.')
self.peptide_spectrum_matching(self.spectra_folder, self.output,
self.output + '/params.par',
self.searchgui_exe, threads = self.threads)
self.browse_identification_results(self.spectra_folder, self.output + '/params.par',
self.output + '/searchgui_out.zip', self.output + '/ps_output.cpsx',
self.peptide_shaker_exe)
try: # try/except - if no identifications are present, will throw an error
self.generate_reports(self.output + '/ps_output.cpsx', self.output + '/reports',
self.peptide_shaker_exe)
except:
print('No identifications?')
self.spectra_counting((self.output + '/reports/' + self.experiment_name +
'_' + self.sample_name + '_' + self.replicate_number +
'_Default_Protein_Report.txt'), self.blast,
self.output + '/Spectra_counting.tsv')
self.compomics_workflow()
def background_inputation(self, df):
return df.fillna(value = df.min().min())
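background_inputation fills every missing intensity with the global minimum observed in the matrix, a common floor heuristic for values below the detection limit. A quick illustration with pandas:
```
import numpy as np
import pandas as pd

df = pd.DataFrame({'s1': [5.0, np.nan], 's2': [np.nan, 2.0]})
# The global minimum is 2.0, so both NaNs become 2.0
print(df.fillna(value=df.min().min()))
```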
@@ -368,4 +377,7 @@ def censored_inputation(df, replicates):
return df
def background_inputation(df):
return df.fillna(value = df.min())
\ No newline at end of file
return df.fillna(value = df.min())
if __name__ == '__main__':
MetaproteomicsAnalyser().run()
\ No newline at end of file