# BioTMPy - Biomedical Text Mining with Python

## Description
BioTMPy eases the creation of Document Classification pipelines based on machine learning models, with a special focus on deep learning. Using it, you can retrieve the most relevant documents for a given topic.

## Installation
```bash
git clone https://gitlab.bio.di.uminho.pt/biotextminingpy/biotmpy.git
```
## Requirements
To use BioTMPy you first need to install either [Anaconda](https://www.anaconda.com/products/individual) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html), and then create a conda environment from the environment file for your operating system ("conda_environment_win.yml" or "conda_environment_lin.yml"). This installs all the packages required to use the available features of this tool (note: the TensorFlow version used in this environment supports computations on one or more GPUs).
```bash
cd biotmpy
# Windows
conda env create -f conda_environment_win.yml
# Linux
conda env create -f conda_environment_lin.yml

conda activate biotmpygpu
```
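
After activating the environment, a quick sanity check (a generic TensorFlow 2.x snippet, not part of BioTMPy itself) confirms that TensorFlow is installed and can see the available GPUs:

```python
# Generic check, assuming the environment installs TensorFlow 2.x
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
# An empty list means TensorFlow will run on the CPU only
print("GPUs detected:", tf.config.list_physical_devices("GPU"))
```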

## Detailed Description
To cover the steps required to develop a complete document relevance pipeline, BioTMPy is divided into 6 main modules that can also be used separately:
- **Wrappers** - convert data from distinct formats (BioC XML, CSV, dictionary) into a pandas dataframe built on the package's data structures

- **Data Structures** (data_structures) - data structures for documents, sentences, tokens and the relevance associated with a given document

- **Preprocessing** - methods to perform preprocessing for Deep Learning (DL) models, feature generation (used for traditional Machine Learning) and data analysis. It also contains config structures to select preprocessing steps such as stop-word removal and stemming, along with attributes to save the models and the results obtained throughout the pipeline.

- **Machine Learning** (mlearning) - provides different DL models, as well as methods to train, predict with and evaluate traditional Machine Learning models from [scikit-learn](https://scikit-learn.org/stable/#); a minimal sketch of this workflow follows the structure figure below.

- **Pipelines** - examples of complete pipelines to train/evaluate DL models, perform hyperparameter optimization and cross-validation.

- **Web** - provides an easy way to deploy the developed model as a web service. In addition, the "pubmed_reader.py" file makes it possible to retrieve documents from the PubMed database, by a search term or by PubMed IDs, and to convert them into document objects, as illustrated in the sketch right below.
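
The following is a generic retrieval sketch based on Biopython's Entrez interface. It only illustrates the idea behind "pubmed_reader.py"; it is not BioTMPy's own API, and the search term and e-mail address are placeholder examples.

```python
# Generic PubMed retrieval sketch using Biopython's Entrez module.
# This is NOT BioTMPy's pubmed_reader.py API; the term and e-mail are placeholders.
from Bio import Entrez

Entrez.email = "your.email@example.com"  # NCBI requires a contact address

# Search PubMed for a term and keep the first 10 PubMed IDs
handle = Entrez.esearch(db="pubmed", term="glucokinase", retmax=10)
pmids = Entrez.read(handle)["IdList"]
handle.close()

# Fetch the titles and abstracts of those documents as plain text
handle = Entrez.efetch(db="pubmed", id=",".join(pmids), rettype="abstract", retmode="text")
print(handle.read())
handle.close()
```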

![BioTMPy Structure](structure.png)
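
As a rough illustration of the dataframe-centric workflow (wrappers → preprocessing → mlearning) on the traditional Machine Learning side, the sketch below uses plain pandas and scikit-learn. The toy corpus and the "text"/"relevance" column names are assumptions made for this example, not necessarily the schema produced by BioTMPy's wrappers.

```python
# Minimal document-relevance sketch with pandas + scikit-learn.
# The dataframe layout ("text", "relevance") is assumed for illustration only.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for a dataframe produced by a wrapper
df = pd.DataFrame({
    "text": [
        "Protein kinase inhibitors in cancer therapy",
        "Annual report of the hospital cafeteria",
        "Gene expression profiling of tumour samples",
        "Parking regulations on the university campus",
    ],
    "relevance": [1, 0, 1, 0],
})

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["relevance"], test_size=0.5, random_state=42, stratify=df["relevance"]
)

# TF-IDF features feeding a linear classifier (traditional ML path)
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```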


## Contacts
If you have any questions, feel free to send an email to n4lv3s@gmail.com.


