# Blueprint Repository

## How to use

#### HTTP Request through Neo4j Browser

Blueprint Repository is implemented on a Neo4j database. Neo4j includes the Neo4j Browser tool, which lets
users explore a database's contents through a web browser without any programming knowledge. After connecting to Universidade
do Minho's VPN, open the following [link](http://palsson.di.uminho.pt:7475/browser/) to access Neo4j Browser. On the main page, enter the following
credentials:

* Connect URL - bolt://palsson.di.uminho.pt:7688
* Username - neo4j
* Password - admin

Click the database logo on the left to see the database's contents, including its nodes, relationships and properties. Click any of
these entities to automatically see a set of examples. Custom queries may be entered in the prompt bar at the top. Neo4j Browser
does not give access to Blueprint's source code, data and tools, but it lets you explore the database's contents without installing additional dependencies.

#### SSH access to the Palsson server and a Docker container with the source code via Ubuntu command line

Open an Ubuntu command line and enter:  
`ssh dlima@palsson.di.uminho.pt`  

Password - blueprint2019  
To access Blueprint's source code enter:  
`cd blueprint_docker/code`  

To try Blueprint's Request system enter the following commands after accessing the Palsson server:  
`cd blueprint_docker`  
`docker run -v /home/dlima/blueprint_docker//:/home/ -it blueprint_code`  

The menu interface allows access to the database and its contents. Enter 0 for a short explanation regarding each option. 

#### Download source code, data and tools 

To explore all the source code, downloaded/generated data and tools, go to the following OneDrive [link](https://uminho365-my.sharepoint.com/personal/pg32938_uminho_pt/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fpg32938_uminho_pt%2FDocuments%2FBlueprint_repository%2FBlueprint_TRN_Repository.rar&parent=%2Fpersonal%2Fpg32938_uminho_pt%2FDocuments%2FBlueprint_repository&cid=a4e2f5a8-708e-4131-ac9b-f82d236deb35)
and download the rar file. To run the downloaded code, install the dependencies listed in the next section and connect to
Universidade do Minho's VPN (needed to reach Blueprint's database, hosted on the Palsson server).

## Prerequisites

To access the database via a web browser, you only need access to Universidade do Minho's VPN and the instructions above.
To access Blueprint's source code and Request system over SSH on the Palsson server, you need access to Universidade do Minho's VPN
and to an [Ubuntu](https://www.microsoft.com/en-us/p/ubuntu/9nblggh4msv6?activetab=pivot:overviewtab) command line. To explore all of Blueprint's content
after downloading the rar file from OneDrive, a few more prerequisites must be met.  

All code was written in Python 3.6, so [Python](https://www.python.org/downloads/) must be installed. Any Python 3 version should work,
but Python 3.6 or higher is recommended. The latest Python version at the time of writing is 3.7.3, and
it is fully compatible with Blueprint Repository.  

Installing Neo4j is only necessary if you want to create and manipulate a local instance of the Blueprint Repository database.
To install Neo4j, use the following [link](https://neo4j.com/download-thanks/?edition=community&release=3.4.10&flavour=winzip&_ga=2.66976166.909941613.1542910944-847881713.1541509905).   

Open the command line and type the following commands (Windows):  
`<NEO4J_HOME>\bin\neo4j install-service`  
`<NEO4J_HOME>\bin\neo4j start`

Visit [http://localhost:7474](http://localhost:7474) in your web browser of choice, to visualize the database.

## Required Python Packages

* [neo4j-driver](https://pypi.org/project/neo4j-driver/) - Python driver to interact with the database (USE THIS VERSION) `pip install neo4j-driver==1.3.1`
* [tqdm](https://pypi.org/project/tqdm/) - Progress meter `pip install tqdm`
* [numpy](https://pypi.org/project/numpy/) - Data processing `pip install numpy`
* [pandas](https://pypi.org/project/pandas/) - Data processing `pip install pandas`
* [scipy](https://pypi.org/project/scipy/) - Data processing and statistical analysis `pip install scipy`
* [BioPython](https://pypi.org/project/biopython/) - Accessing NCBI for taxonomy data `pip install biopython`
* [Matplotlib](https://pypi.org/project/matplotlib/) - Publication quality graphics `pip install matplotlib`
* [requests](https://pypi.org/project/requests/) - HTTP Requests `pip install requests`
* [openpyxl](https://pypi.org/project/openpyxl/) - Read and write Excel files `pip install openpyxl`
* [seaborn](https://pypi.org/project/seaborn/) - Another option for publication quality graphics `pip install seaborn`
* [bioservices](https://pypi.org/project/bioservices/) - Biological Web Services used for Uniprot requests `pip install bioservices`
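
As a rough illustration of how the pinned `neo4j-driver` package can be used against the database, the sketch below connects over Bolt and counts nodes per label. The connection values are the ones given in the "How to use" section. The helper names `connect` and `count_nodes_by_label` are hypothetical and not part of Blueprint's code, and the import path assumes the 1.x series of the driver.

```python
# Connection values taken from the "How to use" section of this README.
BOLT_URI = "bolt://palsson.di.uminho.pt:7688"


def connect(uri=BOLT_URI, user="neo4j", password="admin"):
    """Open a driver for the database (hypothetical helper).

    The import is done lazily so this file can be loaded without the driver;
    `neo4j.v1` is the import path used by the 1.x driver series.
    """
    from neo4j.v1 import GraphDatabase, basic_auth
    return GraphDatabase.driver(uri, auth=basic_auth(user, password))


def count_nodes_by_label(session):
    """Return {labels: node count} for every label combination in the database."""
    query = "MATCH (n) RETURN labels(n) AS labels, count(*) AS total"
    return {tuple(record["labels"]): record["total"] for record in session.run(query)}


# Usage (requires the VPN connection and the driver installed):
# driver = connect()
# with driver.session() as session:
#     print(count_nodes_by_label(session))
```

The query only reads data, so it is a safe first check that the VPN, credentials and driver version are working together.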

## Blueprint Repository source code and tools - what has been done and how to use it 

The downloadable Blueprint Repository rar file is divided into four directories:

* data - all data downloaded/generated throughout the construction of the database
* neo4j_database - a local backup of the most recent version of the database
* topology - a document detailing the database's structure, its topology and the meaning behind the data and its relationships
* code - contains all code developed for the five repository systems: Data Retrieval, Knowledge Expansion, Integration, Report and Request. Additional tools, such as the Automatic Integration of TRNs, are also in this directory

#### Code - Structure

The database construction process included the integration of 10 external sources:
* [RegulonDB](http://regulondb.ccg.unam.mx/)
* [Collectf](http://www.collectf.org/browse/home/)
* [RegPrecise](http://regprecise.lbl.gov/RegPrecise/)
* [DBTBS](http://dbtbs.hgc.jp/)
* [Faria JP 2016](https://www.frontiersin.org/articles/10.3389/fmicb.2016.00275/full)
* [Palsson 2017](https://www.pnas.org/content/114/38/10286.short)
* [Ortiz 2015](https://www.embopress.org/doi/10.15252/msb.20156236)
* [Turkarslan 2015](https://www.nature.com/articles/sdata201510)
* [Vasquez 2011](https://microbialinformaticsj.biomedcentral.com/articles/10.1186/2042-5783-1-3)
* [CoryneRegNet](https://coryneregnet.compbio.sdu.dk/v6/index.html)

For each external source, we wrote several Python scripts to Extract, Expand (Knowledge), Transform and Load data. Each source has its own
parser, due to data heterogeneity. The database is divided into two large sections: the Staging Area and the TRN Universal Graph unit. Data is
always loaded into the Staging Area before being integrated into the TRN Universal Graph unit.
The main difference between the two units is the lack of processing and duplicate removal in the former. The Staging Area is merely a compilation
of retrieved data, transformed from the source data type to graph structures. In contrast, the TRN Universal Graph unit only holds data that went
through extensive processing and filtering to create Transcriptional Regulatory Networks that are as detailed as possible, without redundancies or duplicate information.  
The following sections further detail the developed code.
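
The difference between the two database units can be pictured with a small in-memory sketch. This is only an analogy under simplifying assumptions: plain Python containers stand in for the graph, an interaction is reduced to a hypothetical `(regulator, target)` pair, and `load` is not a function from Blueprint's code.

```python
def load(records):
    """Load (regulator, target) pairs into two toy stores.

    staging   - keeps every record as retrieved, duplicates included
    universal - keeps each distinct interaction only once
    """
    staging = []
    universal = set()
    for regulator, target in records:
        staging.append((regulator, target))
        universal.add((regulator, target))
    return staging, universal


# The same interaction reported by two sources appears twice in the staging
# store but only once in the deduplicated store.
staging, universal = load([("tfA", "geneX"), ("tfA", "geneX"), ("tfB", "geneY")])
```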

#### launcher.py
Run this script to create the database from the retrieved external data. The "writer.py" module is integrated into it and creates a
log file with a short quantitative summary of the database's contents during construction. By default, the functions assume that the data
has already been preprocessed and create the database automatically. Each function has a "preprocessing" boolean flag that can be set to repeat
preprocessing for a given external source. To update an already integrated external source, run the launcher
functions associated with that source (data will be duplicated in the Staging Area, as intended, but not in the TRN Universal Graph unit).
If you have a file with the same structure as the file used to integrate the external source's data into the TRN Universal Graph unit,
there is no need to repeat preprocessing. If you only have a newer version of the raw data retrieved from the external source, the
preprocessing stages must be repeated (preprocessing=True).
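
The role of the "preprocessing" flag can be sketched as follows. Only the flag name comes from this README. The function names and the toy cleaning step are assumptions made for illustration and do not reflect the real launcher functions.

```python
def preprocess(raw_rows):
    """Toy preprocessing step (assumption): trim whitespace and drop empty rows."""
    return [row.strip() for row in raw_rows if row.strip()]


def integrate_source(rows, preprocessing=False):
    """Hypothetical launcher-style function for one external source.

    With preprocessing=False the rows are assumed to be already in the
    expected structure; with preprocessing=True the raw rows are cleaned first.
    The returned list stands in for the data that would be loaded.
    """
    if preprocessing:
        rows = preprocess(rows)
    return list(rows)


already_clean = integrate_source(["tfA\tgeneX"])
from_raw_data = integrate_source(["  tfA\tgeneX ", ""], preprocessing=True)
```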

#### Neo4j_injection.py
This module is called by "launcher.py" and loads all processed data into the TRN Universal Graph unit of the database. The code mixes pure Python with
Cypher (Neo4j's query language) queries, combined through the Python Neo4j Driver package. As part of the Integration system, this script
loads data into the database without adding duplicate information or overriding previously loaded data. Each external source has its own set of functions, thus
the loading process was semi-curated. 
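
One common way to load data from Python without creating duplicates is Cypher's `MERGE` clause, which matches an existing pattern or creates it if absent. The sketch below shows that general technique only. The label, property and relationship names are hypothetical, and this is not necessarily how "Neo4j_injection.py" is written.

```python
# Hypothetical label, property and relationship names, for illustration only.
MERGE_INTERACTION = (
    "MERGE (tf:Gene {name: $regulator}) "
    "MERGE (tg:Gene {name: $target}) "
    "MERGE (tf)-[:REGULATES]->(tg)"
)


def load_interactions(session, interactions):
    """Send one MERGE statement per (regulator, target) pair.

    `session` is any object with a run(statement, parameters) method, such as
    a session opened with the Neo4j Python driver.
    """
    for regulator, target in interactions:
        session.run(MERGE_INTERACTION, {"regulator": regulator, "target": target})
```

Because `MERGE` is idempotent for an identical pattern, running the same load twice leaves a single pair of nodes and a single relationship in the graph.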

#### Report.py

The Report system aims to summarize the database's contents. Every node and relationship in the database is analyzed, and the output may be JSON files with the requested data
or graphics for quick visualization of the most relevant data.

#### Report_Graphics.py

This module complements "Report.py" and has all code regarding graphics plotting. 

#### Request.py

The Request system is an interface connecting users and the database. This module allows extracting data from the database, such as complete Transcriptional Regulatory Networks
for any organism in CSV or XLSX formats. 
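
As an illustration of what a CSV export of a network could look like, the sketch below writes interaction rows with Python's standard `csv` module. The column names and the `write_trn_csv` helper are assumptions, not the actual output format or API of "Request.py".

```python
import csv
import io


def write_trn_csv(interactions, handle):
    """Write (regulator, target, effect) rows to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["regulator", "target", "effect"])  # hypothetical columns
    for row in interactions:
        writer.writerow(row)


# An in-memory buffer is used here; a real export would pass a file opened
# with open(path, "w", newline="").
buffer = io.StringIO()
write_trn_csv([("tfA", "geneX", "activation")], buffer)
exported_lines = buffer.getvalue().splitlines()
```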

#### Request_CMD.py

This script uses the same algorithms as "Request.py", but integrates them into a command line menu, enabling quick and interactive access to the
Request system. To use it, open the command line and run the following command:  

`<Blueprint_TRN_Repository/code/ python Request_CMD.py`

#### Automatic_TRN_Integrator.py

This module automatically integrates Transcriptional Regulatory Networks into the TRN Universal Graph unit of the database.
Input files must be in JSON format and follow a specific structure, detailed in the script. The "ATRNI_Example" folder contains example files in the correct
format and structure for use with the Automatic TRN Integrator. 

#### debugger.py

This module contains useful functions to ensure the database's content integrity. 

#### Gene_Expression_Tools.py

This module contains simple generic functions to work with data distributions. During the database's development, these tools were used to group
gene expression datasets into different degrees of relative gene expression. 
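
One simple way to group values into degrees of relative expression is rank-based binning, sketched below. The README does not state which method the module uses, so this is an assumed technique. The function name, the three labels and the sample values are all hypothetical.

```python
def group_by_relative_expression(expression, labels=("low", "medium", "high")):
    """Assign each gene to an equally sized rank-based group.

    `expression` maps gene name -> expression value. Genes are ranked by
    value and the ranking is split into len(labels) consecutive groups.
    """
    ranked = sorted(expression, key=expression.get)
    total = len(ranked)
    return {
        gene: labels[rank * len(labels) // total]
        for rank, gene in enumerate(ranked)
    }


groups = group_by_relative_expression(
    {"g1": 0.2, "g2": 1.5, "g3": 3.1, "g4": 4.8, "g5": 7.0, "g6": 9.9}
)
```

Rank-based groups always have near-equal sizes. Thresholds on the raw values would behave differently on skewed distributions.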

#### ncbi_taxonomy.py

Interface to access NCBI Taxonomy and retrieve relevant data.

#### timeout.py

Simple decorator module used on several occasions throughout the database's development. Its purpose is to set a time limit on the execution of a given function. 
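
A time-limit decorator can be written in several ways. The sketch below uses a worker thread from the standard library, which also works on Windows. It is an assumed implementation for illustration and may differ from the one in "timeout.py".

```python
import concurrent.futures
import functools
import time


def timeout(seconds):
    """Raise TimeoutError if the decorated function takes longer than `seconds`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            future = executor.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=seconds)
            except concurrent.futures.TimeoutError:
                raise TimeoutError(
                    "{} exceeded {} s".format(func.__name__, seconds)
                )
            finally:
                # Do not block on the worker; note that the thread itself
                # keeps running in the background until func returns.
                executor.shutdown(wait=False)
        return wrapper
    return decorator


@timeout(0.05)
def slow_call():
    time.sleep(0.3)
    return "done"


@timeout(1)
def fast_call():
    return 42
```

The caller regains control when the limit is reached, but the underlying call is not interrupted. For slow network requests, a timeout on the request itself is a useful complement.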

#### uniprot_fetcher.py

Simple module with functions to access Uniprot's API and to read JSON files. 

#### writer.py

This module is integrated in the "launcher.py" script, and creates/updates a database log with a short descriptive summary of the database's contents. 

## Authors

* **Diogo Lima**
* **Oscar Dias**
* **Fernando Cruz**