
Commit 2667a5ac authored by Diogo Batista Lima

Upload New File

parent 665cff0a
Pipeline #24 canceled with stages
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This script loads and processes comma-separated CSV files that contain regulatory network information. Each column corresponds to a gene feature and each row corresponds to a single gene. The first CSV file lists the genes together with information about their regulators. The second CSV file contains data only for the regulator genes (which interact with the genes in the first CSV file)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df=pd.read_csv(\"Bs_reg.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal of these functions is to ensure that no row of the dataset holds multiple regulators (or multiple values of any other feature). We aim for a structure in which each row matches a single gene with a single regulator. This means that a gene may appear in several rows (with different feature values), but no row is repeated. This parsing step makes the dataset easier to handle."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>% BSU Number</th>\n",
" <th>% Gene_Name</th>\n",
" <th>% Operon</th>\n",
" <th>Sigma factor</th>\n",
" <th>Sigma factor number</th>\n",
" <th>Regulator(s) name</th>\n",
" <th>Regulator number</th>\n",
" <th>Regulation sign</th>\n",
" <th>Involved Metabolite(s)</th>\n",
" <th>Metabolite(s) number</th>\n",
" <th>Metabolite(s) sign</th>\n",
" <th>Regulatory mecanisms</th>\n",
" <th>Conditioned rules</th>\n",
" <th>Annotation</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>75</th>\n",
" <td>BSU00930</td>\n",
" <td>cysE</td>\n",
" <td>gltX-cysES-yazC-yacOP</td>\n",
" <td>sigA</td>\n",
" <td>78</td>\n",
" <td>T-box-CYS</td>\n",
" <td>49</td>\n",
" <td>1</td>\n",
" <td>tRNAcys</td>\n",
" <td>88</td>\n",
" <td>-1</td>\n",
" <td>RNA switch</td>\n",
" <td>tRNA</td>\n",
" <td>serine acetyltransferase</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" % BSU Number % Gene_Name % Operon Sigma factor \\\n",
"75 BSU00930 cysE gltX-cysES-yazC-yacOP sigA \n",
"\n",
" Sigma factor number Regulator(s) name Regulator number Regulation sign \\\n",
"75 78 T-box-CYS 49 1 \n",
"\n",
" Involved Metabolite(s) Metabolite(s) number Metabolite(s) sign \\\n",
"75 tRNAcys 88 -1 \n",
"\n",
" Regulatory mecanisms Conditioned rules Annotation \n",
"75 RNA switch tRNA serine acetyltransferase "
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df=df.rename(index=str,columns={\"Unnamed: 0\":\"% BSU Number \"})\n",
"df_reg=df.loc[df['Regulation sign'].notnull(),:]\n",
"df_reg.index=range(len(df_reg))\n",
"\n",
"\n",
"def find_number_of_new_lines(line_as_list):\n",
"    # The input is a list which contains the feature values of a line belonging to the dataset\n",
"    number_of_new_lines=0\n",
"    for i in range(len(line_as_list)):\n",
"        n=len(list(filter(None,str(line_as_list[i]).split(\"|\"))))\n",
"        if n>number_of_new_lines:\n",
"            number_of_new_lines=n\n",
"    return number_of_new_lines\n",
"# returns the number of lines into which the original line will be \"unfolded\"; the minimum is 1\n",
"# this means that a dataset row that doesn't contain multiple values for one gene will remain the same\n",
"\n",
"def unfold_dataset(dataset): # the input is a pandas dataset\n",
"    reg_list=dataset.values.tolist()\n",
"    new_list=[]\n",
"    for line in reg_list:\n",
"        number_of_new_lines=find_number_of_new_lines(line)\n",
"        for n in range(number_of_new_lines):\n",
"            new_line=line.copy()\n",
"            for c in range(3,len(dataset.columns)):\n",
"                new_val=list(filter(None,str(line[c]).split(\"|\")))\n",
"                if len(new_val)==number_of_new_lines:\n",
"                    new_line[c]=new_val[n]\n",
"                elif len(new_val)==1:\n",
"                    new_line[c]=new_val[0]\n",
"            new_list.append(new_line.copy())\n",
"    refined_df_reg=pd.DataFrame(new_list,columns=dataset.columns)\n",
"    refined_df_reg.index=range(len(refined_df_reg))\n",
"    return refined_df_reg\n",
"# the output is the same dataset without multiple info on a single row\n",
"# THERE IS NO LOSS OF INFORMATION\n",
"\n",
"refined_df_reg=unfold_dataset(df_reg)\n",
"refined_df_reg.to_csv(\"Bs_reg_refined.csv\")\n",
"refined_df_reg.loc[refined_df_reg[\"% BSU Number \"]==\"BSU00930\",:]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"regulators=pd.read_csv(\"Regulators2.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
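"# drop the first six rows (index labels 0 to 5), which precede the header row\n",
"regulators=regulators.drop(range(0,6),axis=0)\n",
"# use the first remaining row (index label 6) as the column names, then remove it from the data\n",
"regulators.columns=regulators.iloc[0,0:10]\n",
"regulators=regulators.drop(6,axis=0)\n",
"# drop the column labelled nan and keep the columns from Regulator name through Comment\n",
"regulators=regulators.drop(\"nan\",axis=1)\n",
"regulators=regulators.loc[:,\"Regulator name\":\"Comment\"]\n",
"# renumber the rows from 0 and clear the name left on the column index\n",
"regulators.index=range(len(regulators))\n",
"regulators.columns.name=\"\"\n",
"regulators.to_csv(\"regulators_processed.csv\")"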
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Regulator name</th>\n",
" <th>BSU</th>\n",
" <th>Number</th>\n",
" <th>Mechanism</th>\n",
" <th>conditioned_rules</th>\n",
" <th>metabolite</th>\n",
" <th>metabolite_number</th>\n",
" <th>metabolite_sign</th>\n",
" <th>Comment</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Zur</td>\n",
" <td>BSU25100</td>\n",
" <td>1</td>\n",
" <td>TF+M</td>\n",
" <td>Zur</td>\n",
" <td>Zn</td>\n",
" <td>1040</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>YxdJ</td>\n",
" <td>BSU39660</td>\n",
" <td>2</td>\n",
" <td>TF-TC</td>\n",
" <td>YxdJ+YxdK</td>\n",
" <td>stress:toxic-peptide</td>\n",
" <td>2000</td>\n",
" <td>1</td>\n",
" <td>cationic antimicrobial peptide LL37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>YwcC</td>\n",
" <td>BSU38220</td>\n",
" <td>3</td>\n",
" <td>TF+M</td>\n",
" <td>Ywcc</td>\n",
" <td>unk</td>\n",
" <td>unk</td>\n",
" <td>-1</td>\n",
" <td>metabolite related to galactose utilisation</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>YvrI-YvrHa</td>\n",
" <td>BSU33230</td>\n",
" <td>4</td>\n",
" <td>sigma like factor</td>\n",
" <td>YvrI+YvrHa</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>YvrHb</td>\n",
" <td>BSU33221</td>\n",
" <td>5</td>\n",
" <td>TF-TC</td>\n",
" <td>YrrHb+YvrG</td>\n",
" <td>unk</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" <td>cell wall maintenance</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>YtrA</td>\n",
" <td>BSU30460</td>\n",
" <td>6</td>\n",
" <td>TF+M</td>\n",
" <td>YtrA</td>\n",
" <td>stress:ramoplanin antibiotic</td>\n",
" <td>1039</td>\n",
" <td>-1</td>\n",
" <td>antibiotic resistance but through an unknwon m...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>YtlI</td>\n",
" <td>BSU29400</td>\n",
" <td>7</td>\n",
" <td>TF+M</td>\n",
" <td>YtlI</td>\n",
" <td>unk</td>\n",
" <td>-1</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>YrkP</td>\n",
" <td>BSU26430</td>\n",
" <td>8</td>\n",
" <td>TF-TC</td>\n",
" <td>YrkP+YrkQ</td>\n",
" <td>unk</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>YqfL</td>\n",
" <td>BSU25240</td>\n",
" <td>9</td>\n",
" <td>unk</td>\n",
" <td>YqfL</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>YofA</td>\n",
" <td>BSU18420</td>\n",
" <td>10</td>\n",
" <td>TF</td>\n",
" <td>Yofa</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>YodB</td>\n",
" <td>BSU19540</td>\n",
" <td>11</td>\n",
" <td>TF+S</td>\n",
" <td>YlbO</td>\n",
" <td>stress:diamide/quinone</td>\n",
" <td>1007</td>\n",
" <td>-1</td>\n",
" <td>formation of intersubunit disulfides</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>YlaC</td>\n",
" <td>BSU14730</td>\n",
" <td>12</td>\n",
" <td>sigma factor</td>\n",
" <td>YlaC</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>ykrK</td>\n",
" <td>BSU13480</td>\n",
" <td>13</td>\n",
" <td>TF+unk</td>\n",
" <td>ykrK</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>regulation of membrane protein (quality control)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>YfmP</td>\n",
" <td>BSU07390</td>\n",
" <td>14</td>\n",
" <td>TF+M</td>\n",
" <td>YfmP+unk</td>\n",
" <td>unk</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>YfhP</td>\n",
" <td>BSU08620</td>\n",
" <td>15</td>\n",
" <td>unk</td>\n",
" <td>yfhP</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>YetL</td>\n",
" <td>BSU07220</td>\n",
" <td>16</td>\n",
" <td>TF+M</td>\n",
" <td>YetL</td>\n",
" <td>stress:lavonoids of kaempferol/ apigenin/ lute...</td>\n",
" <td>1038</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>YesS</td>\n",
" <td>BSU07010</td>\n",
" <td>17</td>\n",
" <td>TF+P+unk</td>\n",
" <td>Yess + Hpr</td>\n",
" <td>unk</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>involved in pectin/rhamnogalacturonan metabolism</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>YdfI</td>\n",
" <td>BSU05420</td>\n",
" <td>18</td>\n",
" <td>TF-TC</td>\n",
" <td>YdfH + YdfI</td>\n",
" <td>unk</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>YdaO riboswitch</td>\n",
" <td>@</td>\n",
" <td>19</td>\n",
" <td>riboswitch</td>\n",
" <td>YdaO</td>\n",
" <td>ATP</td>\n",
" <td>12</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>YcnK</td>\n",
" <td>BSU03960</td>\n",
" <td>20</td>\n",
" <td>TF+M</td>\n",
" <td>ycnK</td>\n",
" <td>Cu(I)</td>\n",
" <td>1008</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>YclJ</td>\n",
" <td>BSU03750</td>\n",
" <td>21</td>\n",
" <td>TF-TC</td>\n",
" <td>yclj+YclK</td>\n",
" <td>unk</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>possibly related to anerobiose</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>YcbG</td>\n",
" <td>BSU02500</td>\n",
" <td>22</td>\n",
" <td>TF+M</td>\n",
" <td>ycbG</td>\n",
" <td>D-glucarate/galactarate</td>\n",
" <td>103/104</td>\n",
" <td>1</td>\n",
" <td>gutR (alternative name)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>ybga</td>\n",
" <td>BSU02370</td>\n",
" <td>23</td>\n",
" <td>TF+M</td>\n",
" <td>ybga</td>\n",
" <td>D-glucosamine-6-phosphate</td>\n",
" <td>66</td>\n",
" <td>-1</td>\n",
" <td>putative effector /gamR (alternative name)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>YabJ</td>\n",
" <td>BSU00480</td>\n",
" <td>24</td>\n",
" <td>TF+M</td>\n",
" <td>YabJ</td>\n",
" <td>PRPP</td>\n",
" <td>42</td>\n",
" <td>-1</td>\n",
" <td>PRPP is putative</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>XylR</td>\n",
" <td>BSU17590</td>\n",
" <td>25</td>\n",
" <td>TF+M</td>\n",
" <td>XylR</td>\n",
" <td>xylose</td>\n",
" <td>50</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>Xre</td>\n",
" <td>BSU12510</td>\n",
" <td>26</td>\n",
" <td>TF</td>\n",
" <td>xre</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>Xpf</td>\n",
" <td>BSU12560</td>\n",
" <td>27</td>\n",
" <td>sigma factor</td>\n",
" <td>xpf</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>WalR</td>\n",
" <td>BSU40410</td>\n",
" <td>28</td>\n",
" <td>TF-TC</td>\n",
" <td>WalR+WalK</td>\n",
" <td>unk</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>TreR</td>\n",
" <td>BSU07820</td>\n",
" <td>29</td>\n",
" <td>TF+M</td>\n",
" <td>TreR</td>\n",
" <td>D-trehalose-6-phosphate</td>\n",
" <td>49</td>\n",
" <td>-1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>TRAP</td>\n",
" <td>BSU22770</td>\n",
" <td>30</td>\n",
" <td>RNA-BAP</td>\n",
" <td>MtrB</td>\n",
" <td>tryptophan</td>\n",
" <td>55</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>245</th>\n",
" <td>*YhgD*</td>\n",
" <td>BSU10150</td>\n",
" <td>248</td>\n",
" <td>silico-TF+unk</td>\n",
" <td>YhgD</td>\n",
" <td>unk</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Infered by RegPrecise</td>\n",
" </tr>\n",
" <tr>\n",
" <th>246</th>\n",
" <td>*YhdI/YdeL*</td>\n",
" <td>BSU05240/BSU09480</td>\n",
" <td>249</td>\n",
" <td>silico-TF+unk</td>\n",
" <td>yhdl</td>\n",
" <td>unk</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Infered by RegPrecise/ metabolite transport</td>\n",
" </tr>\n",
" <tr>\n",
" <th>247</th>\n",
" <td>*YhcF*</td>\n",
" <td>BSU09060</td>\n",
" <td>250</td>\n",
" <td>silico-TF+unk</td>\n",
" <td>YhcF</td>\n",
" <td>unk</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Infered by RegPrecise/ multidrug resistance</td>\n",
" </tr>\n",
" <tr>\n",
" <th>248</th>\n",
" <td>*YdfL*</td>\n",
" <td>BSU05460</td>\n",
" <td>251</td>\n",
" <td>silico-TF+unk</td>\n",
" <td>YdfL</td>\n",
" <td>unk</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Infered by RegPrecise/ multidrug resistance</td>\n",
" </tr>\n",
" <tr>\n",
" <th>249</th>\n",
" <td>*YdfF*</td>\n",
" <td>BSU05390</td>\n",
" <td>252</td>\n",
" <td>silico-TF+unk</td>\n",
" <td>YdfF</td>\n",
" <td>unk</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",