# BioTMPy - Biomedical Text Mining with Python

## Description
This package eases the creation of document classification pipelines based on machine learning models, with a special focus on deep learning. With BioTMPy, you can retrieve the documents that are most relevant to a given topic.

## Installation
```bash
git clone https://gitlab.bio.di.uminho.pt/biotextminingpy/biotmpy.git
```
## Requirements
To use BioTMPy, first install either [Anaconda](https://www.anaconda.com/products/individual) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html), and then create a conda environment from the provided environment file ("conda_environment_win.yml" on Windows, "conda_environment_lin.yml" on Linux). This installs the packages required to use all the available features of this tool (note: the TensorFlow version used in this environment supports computation on one or more GPUs).
```bash
cd biotmpy
#Windows
conda env create -f conda_environment_win.yml
#Linux
conda env create -f conda_environment_lin.yml

conda activate biotmpygpu
```
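
To confirm that the activated environment actually sees a GPU, you can run a quick check. This is a minimal sketch assuming the environment ships a TensorFlow 2.x build (if it is an older 1.x build, `tf.test.is_gpu_available()` can be used instead); an empty list simply means computations will fall back to the CPU.

```python
# Lists the GPUs visible to TensorFlow inside the "biotmpygpu" environment.
# An empty list means TensorFlow will run on the CPU only.
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))
```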

## Detailed Description
To cover the steps required to build a complete document-relevance pipeline, BioTMPy is divided into six main modules, each of which can also be used separately (a toy sketch of the end-to-end task is shown below the structure diagram):
- **Wrappers** - convert data from different formats (BioC .xml, .csv, dictionaries) into a pandas DataFrame built on the data structures below

- **Data Structures** (data_structures) - data structures for documents, sentences, tokens and the relevance label associated with each document

- **Preprocessing** - methods for preprocessing data for deep learning (DL) models, generating features (used by traditional machine learning) and analysing data. It also contains configuration structures to select preprocessing steps such as stop-word removal and stemming, along with attributes to save the models and the results obtained throughout the pipeline.

- **Machine Learning** (mlearning) - provides different DL models and methods to train, predict with and evaluate traditional machine learning models from [scikit-learn](https://scikit-learn.org/stable/#).

- **Pipelines** - examples of complete pipelines to train/evaluate DL models, perform hyperparameter optimization and cross-validation.

- **Web** - provides a simple web service implementation for deploying the developed model. With the "pubmed_reader.py" file, it is also possible to retrieve documents from the PubMed database, using a search term or PubMed IDs, and convert them into document objects (see the retrieval sketch just below this list).
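
As a rough point of reference for what the "pubmed_reader.py" retrieval step does, the sketch below pulls PubMed abstracts directly with Biopython's Entrez utilities. It is not BioTMPy's own reader: the query term and contact e-mail are placeholders, and the output is plain text rather than BioTMPy document objects.

```python
# Standalone illustration of retrieving PubMed records by search term,
# using Biopython's Entrez module (not BioTMPy's pubmed_reader.py).
from Bio import Entrez

Entrez.email = "your.email@example.com"  # NCBI requires a contact address

# Search PubMed for a term and collect the matching PubMed IDs
search = Entrez.esearch(db="pubmed", term="biomedical text mining", retmax=3)
id_list = Entrez.read(search)["IdList"]
search.close()

# Fetch the corresponding abstracts as plain text
fetch = Entrez.efetch(db="pubmed", id=",".join(id_list), rettype="abstract", retmode="text")
print(fetch.read())
fetch.close()
```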

![BioTMPy Structure](structure.png)
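
The modules above compose into relevance-classification pipelines like the examples in the Pipelines module. Purely as a toy illustration of the underlying task, the sketch below trains a relevance classifier on a handful of made-up titles using plain scikit-learn; it does not use BioTMPy's API, and all data in it is invented for the example.

```python
# Toy document-relevance classifier built directly with scikit-learn,
# illustrating the task BioTMPy builds full pipelines for.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

titles = [
    "Protein kinase inhibitors in cancer therapy",   # relevant to the topic
    "Stock market trends for the third quarter",     # not relevant
    "Gene expression profiling of tumour samples",   # relevant
    "Restaurant reviews from downtown Porto",        # not relevant
]
labels = [1, 0, 1, 0]  # 1 = relevant, 0 = not relevant

clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("model", LogisticRegression()),
])
clf.fit(titles, labels)

# Score a new, unseen title for relevance
print(clf.predict(["CRISPR screening of kinase targets"]))
```

In BioTMPy the same flow is handled by the wrappers, preprocessing and mlearning modules, with deep learning models taking the place of the logistic regression used here.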


## Contacts
If you have any questions, feel free to send an email to n4lv3s@gmail.com.