The official bioinformatics tool list for the Journal of Integrative Bioinformatics (JIB).
All bioinformatics tools published in JIB are automatically added to JIB.tools with first authors having the ability to edit their entries and directly import the tools information to bio.tools.
To better understand the dynamic behavior of metabolic networks in a wide variety of conditions, the field of Systems Biology has increased its interest in the use of kinetic models. The different databases, available these days, do not contain enough data regarding this topic. Given that a significant part of the relevant information for the development of such models is still wide spread in the literature, it becomes essential to develop specific and powerful text mining tools to collect these data. In this context, this work has as main objective the development of a text mining tool to extract, from scientific literature, kinetic parameters, their respective values and their relations with enzymes and metabolites. The approach proposed integrates the development of a novel plug-in over the text mining framework @Note2. In the end, the pipeline developed was validated with a case study on Kluyveromyces lactis, spanning the analysis and results of 20 full text documents.
JIB Publications
Desktop application
Sequence analysis
Maximum-likelihood methods based on models of codon substitution have been widely used to infer positively selected amino acid sites that are responsible for adaptive changes. Nevertheless, in order to use such an approach, software applications are required to align protein and DNA sequences, infer a phylogenetic tree and run the maximum-likelihood models. Therefore, a significant effort is made in order to prepare input files for the different software applications and in the analysis of the output of every analysis. In this paper we present the ADOPS (Automatic Detection Of Positively Selected Sites) software. It was developed with the goal of providing an automatic and flexible tool for detecting positively selected sites given a set of unaligned nucleotide sequence data. An example of the usefulness of such a pipeline is given by showing, under different conditions, positively selected amino acid sites in a set of 54 Coffea putative S-RNase sequences. ADOPS software is freely available and can be downloaded from http://sing.ei.uvigo.es/ADOPS.
In this demo paper, we sketch B-Fabric, an all-in-one solution for management of life sciences data. B-Fabric has two major purposes. First, it is a system for the integrated management of experimental data and scientific annotations. Second, it is a system infrastructure supporting on-the fly coupling of user applications, and thus serving as extensible platform for fast-paced, cutting-edge, collaborative research.
JIB PublicationsBacillOndex is an extension of the Ondex data integration system, providing a semantically annotated, integrated knowledge base for the model Gram-positive bacterium Bacillus subtilis. This application allows a user to mine a variety of B. subtilis data sources, and analyse the resulting integrated dataset, which contains data about genes, gene products and their interactions. The data can be analysed either manually, by browsing using Ondex, or computationally via a Web services interface. We describe the process of creating a BacillOndex instance, and describe the use of the system for the analysis of single nucleotide polymorphisms in B. subtilis Marburg. The Marburg strain is the progenitor of the widely-used laboratory strain B. subtilis 168. We identified 27 SNPs with predictable phenotypic effects, including genetic traits for known phenotypes. We conclude that BacillOndex is a valuable tool for the systems-level investigation of, and hypothesis generation about, this important biotechnology workhorse. Such understanding contributes to our ability to construct synthetic genetic circuits in this organism.
JIB PublicationsAs high-throughput technologies become cheaper and easier to use, raw sequence data and corresponding annotations for many organisms are becoming available. However, sequence data alone is not sufficient to explain the biological behaviour of organisms, which arises largely from complex molecular interactions. There is a need to develop new platform technologies that can be applied to the investigation of whole-genome datasets in an efficient and cost-effective manner. One such approach is the transfer of existing knowledge from well-studied organisms to closely-related organisms. In this paper, we describe a system, BacillusRegNet, for the use of a model organism, Bacillus subtilis, to infer genome-wide regulatory networks in less well-studied close relatives. The putative transcription factors, their binding sequences and predicted promoter sequences along with annotations are available from the associated BacillusRegNet website (http://bacillus.ncl.ac.uk).
JIB Publications
This paper presents a novel bioinformatics data warehouse software kit that integrates biological information from multiple public life science data sources into a local database management system. It stands out from other approaches by providing up-to-date integrated knowledge, platform and database independence as well as high usability and customization. This open source software can be used as a general infrastructure for integrative bioinformatics research and development. The advantages of the approach are realized by using a Java-based system architecture and object-relational mapping (ORM) technology. Finally, a practical application of the system is presented within the emerging area of medical bioinformatics to show the usefulness of the approach. The BioDWH data warehouse software is available for the scientific community at http://sourceforge.net/projects/biodwh/.
JIB Publications
Command-line tool Workflow
Data integration and warehousing
As research projects require multiple data sources, mapping between these sources becomes necessary. Utilized workflow systems and integration tools therefore need to process large amounts of heterogeneous data formats, check for data source updates, and find suitable mapping methods to cross-reference entities from different databases. BioDWH2 is an open-source, graph-based data warehouse and mapping tool, capable of helping researchers with these issues. A workspace centered approach allows project-specific data source selections and Neo4j or GraphQL server tools enable quick access to the database for analysis. The BioDWH2 tools are available to the scientific community at https://github.com/BioDWH2.
The study of microorganism consortia, also known as biofilms, is associated to a number of applications in biotechnology, ecotechnology and clinical domains. Nowadays, biofilm studies are heterogeneous and data-intensive, encompassing different levels of analysis. Computational modelling of biofilm studies has become thus a requirement to make sense of these vast and ever-expanding biofilm data volumes. The rationale of the present work is a machine-readable format for representing biofilm studies and supporting biofilm data interchange and data integration. This format is supported by the Biofilm Science Ontology (BSO), the first ontology on biofilms information. The ontology is decomposed into a number of areas of interest, namely: the Experimental Procedure Ontology (EPO) which describes biofilm experimental procedures; the Colony Morphology Ontology (CMO) which characterises morphologically microorganism colonies; and other modules concerning biofilm phenotype, antimicrobial susceptibility and virulence traits. The overall objective behind BSO is to develop semantic resources to capture, represent and share data on biofilms and related experiments in a regularized fashion manner. Furthermore, the present work also introduces a framework in assistance of biofilm data interchange and analysis - BiofOmics (http://biofomics.org) - and a public repository on colony morphology signatures - MorphoCol (http://stardust.deb.uminho.pt/morphocol).
JIB Publications
The speed and accuracy of new scientific discoveries - be it by humans or artificial intelligence - depends on the quality of the underlying data and on the technology to connect, search and share the data efficiently. In recent years, we have seen the rise of graph databases and semi-formal data models such as knowledge graphs to facilitate software approaches to scientific discovery. These approaches extend work based on formalised models, such as the Semantic Web. In this paper, we present our developments to connect, search and share data about genome-scale knowledge networks (GSKN). We have developed a simple application ontology based on OWL/RDF with mappings to standard schemas. We are employing the ontology to power data access services like resolvable URIs, SPARQL endpoints, JSON-LD web APIs and Neo4j-based knowledge graphs. We demonstrate how the proposed ontology and graph databases considerably improve search and access to interoperable and reusable biological knowledge (i.e. the FAIRness data principles).
JIB PublicationsGiven the great potential impact of the growing number of complete genome-scale metabolic network reconstructions of microorganisms, bioinformatics tools are needed to simplify and accelerate the course of knowledge in this field. One essential component of a genome-scale metabolic model is its biomass equation, whose maximization is one of the most common objective functions used in Flux Balance Analysis formulations. Some components of biomass, such as amino acids and nucleotides, can be estimated from genome information, providing reliable data without the need of performing lab experiments. In this work a java tool is proposed that estimates microbial biomass composition in amino acids and nucleotides, from genome and transcriptomic information, using as input files sequences in FASTA format and files with transcriptomic data in the csv format. This application allows to obtain the results rapidly and is also a user-friendly tool for users with any or little background in informatics (http://darwin.di.uminho.pt/biomass/). The results obtained using this tool are fairly close to experimental data, showing that the estimation of amino acid and nucleotide compositions from genome information and from transcriptomic data is a good alternative when no experimental data is available.
JIB PublicationsWhile high-throughput technology, advanced techniques in biochemistry and molecular biology have become increasingly powerful, the coherent interpretation of experimental results in an integrative context is still a challenge. BioModelKit (BMK) approaches this challenge by offering an integrative and versatile framework for biomodel-engineering based on a modular modelling concept with the purpose: (i) to represent knowledge about molecular mechanisms by consistent executable sub-models (modules) given as Petri nets equipped with defined interfaces facilitating their reuse and recombination; (ii) to compose complex and integrative models from an ad hoc chosen set of modules including different omic and abstraction levels with the option to integrate spatial aspects; (iii) to promote the construction of alternative models by either the exchange of competing module versions or the algorithmic mutation of the composed model; and (iv) to offer concepts for (omic) data integration and integration of existing resources, and thus facilitate their reuse. BMK is accessible through a public web interface (www.biomodelkit.org), where users can interact with the modules stored in a database, and make use of the model composition features. BMK facilitates and encourages multi-scale model-driven predictions and hypotheses supporting experimental research in a multilateral exchange.
JIB PublicationsThe visualization of biological data gained increasing importance in the last years. There is a large number of methods and software tools available that visualize biological data including the combination of measured experimental data and biological networks. With growing size of networks their handling and exploration becomes a challenging task for the user. In addition, scientists also have an interest in not just investigating a single kind of network, but on the combination of different types of networks, such as metabolic, gene regulatory and protein interaction networks. Therefore, fast access, abstract and dynamic views, and intuitive exploratory methods should be provided to search and extract information from the networks. This paper will introduce a conceptual framework for handling and combining multiple network sources that enables abstract viewing and exploration of large data sets including additional experimental data. It will introduce a three-tier structure that links network data to multiple network views, discuss a proof of concept implementation, and shows a specific visualization method for combining metabolic and gene regulatory networks in an example.
JIB PublicationsBIOchemical PathwaY DataBase is developed as a manually curated, readily updatable, dynamic resource of human cell specific pathway information along with integrated computational platform to perform various pathway analyses. Presently, it comprises of 46 pathways, 3189 molecules, 5742 reactions and 6897 different types of diseases linked with pathway proteins, which are referred by 520 literatures and 17 other pathway databases. With its repertoire of biochemical pathway data, and computational tools for performing Topological, Logical and Dynamic analyses, BIOPYDB offers both the experimental and computational biologists to acquire a comprehensive understanding of signaling cascades in the cells. Automated pathway image reconstruction, cross referencing of pathway molecules and interactions with other databases and literature sources, complex search operations to extract information from other similar resources, integrated platform for pathway data sharing and computation, etc. are the novel and useful features included in this database to make it more acceptable and attractive to the users of pathway research communities. The RESTful API service is also made available to the advanced users and developers for accessing this database more conveniently through their own computer programmes.
JIB PublicationsMetagenomics provides quantitative measurements for microbial species over time. To obtain a global overview of an experiment and to explore the full potential of a given dataset, intuitive and interactive visualization tools are needed. Therefore, we established BioSankey to visualize microbial species in microbiome studies over time as a Sankey diagram. These diagrams are embedded into a project-specific webpage which depends only on JavaScript and Google API to allow searches of interesting species without requiring a web server or connection to a database. BioSankey is a valuable tool to visualize different data elements from single or dual RNA-seq datasets and additionally enables a straightforward exchange of results among collaboration partners.
JIB Publications
Desktop application
ChIP-seq Mapping Molecular interactions, pathways and networks Data architecture, analysis and design
The mapping of DNA-protein interactions is crucial for a full understanding of transcriptional regulation. Chromatin-immunoprecipitation followed by massively parallel sequencing (ChIP-seq) has become the standard technique for analyzing these interactions on a genome-wide scale. We have developed a software system called CASSys (ChIP-seq data Analysis Software System) spanning all steps of ChIP-seq data analysis. It supersedes the laborious application of several single command line tools. CASSys provides functionality ranging from quality assessment and -control of short reads, over the mapping of reads against a reference genome (readmapping) and the detection of enriched regions (peakdetection) to various follow-up analyses. The latter are accessible via a state-of-the-art web interface and can be performed interactively by the user. The follow-up analyses allow for flexible user defined association of putative interaction sites with genes, visualization of their genomic context with an integrated genome browser, the detection of putative binding motifs, the identification of over-represented Gene Ontology-terms, pathway analysis and the visualization of interaction networks. The system is client-server based, accessible via a web browser and does not require any software installation on the client side. To demonstrate CASSys's functionality we used the system for the complete data analysis of a publicly available Chip-seq study that investigated the role of the transcription factor estrogen receptor-α in breast cancer cells.
Command-line tool Desktop application
Computational biology
Using the lac operon as a paradigmatic example for a gene regulatory system in prokaryotes, we demonstrate how qualitative knowledge can be initially captured using simple discrete (Boolean) models and then stepwise refined to multivalued logical models and finally to continuous (ODE) models. At all stages, signal transduction and transcriptional regulation is integrated in the model description. We first show the potential benefit of a discrete binary approach and discuss then problems and limitations due to indeterminacy arising in cyclic networks. These limitations can be partially circumvented by using multilevel logic as generalization of the Boolean framework enabling one to formulate a more realistic model of the lac operon. Ultimately a dynamic description is needed to fully appreciate the potential dynamic behavior that can be induced by regulatory feedback loops. As a very promising method we show how the use of multivariate polynomial interpolation allows transformation of the logical network into a system of ordinary differential equations (ODEs), which then enables the analysis of key features of the dynamic behavior.
With the advent of modern day high-throughput technologies, the bottleneck in biological discovery has shifted from the cost of doing experiments to that of analyzing results. clubber is our automated cluster-load balancing system developed for optimizing these "big data" analyses. Its plug-and-play framework encourages re-use of existing solutions for bioinformatics problems. clubber's goals are to reduce computation times and to facilitate use of cluster computing. The first goal is achieved by automating the balance of parallel submissions across available high performance computing (HPC) resources. Notably, the latter can be added on demand, including cloud-based resources, and/or featuring heterogeneous environments. The second goal of making HPCs user-friendly is facilitated by an interactive web interface and a RESTful API, allowing for job monitoring and result retrieval. We used clubber to speed up our pipeline for annotating molecular functionality of metagenomes. Here, we analyzed the Deepwater Horizon oil-spill study data to quantitatively show that the beach sands have not yet entirely recovered. Further, our analysis of the CAMI-challenge data revealed that microbiome taxonomic shifts do not necessarily correlate with functional shifts. These examples (21 metagenomes processed in 172 min) clearly illustrate the importance of clubber in the everyday computational biology environment.
JIB Publications
Desktop application
Molecular interactions, pathways and networks Bioinformatics Cell biology Computer science Structural biology
Detailed investigation of socially important diseases with modern experimental methods has resulted in the generation of large volume of valuable data. However, analysis and interpretation of this data needs application of efficient computational techniques and systems biology approaches. In particular, the techniques allowing the reconstruction of associative networks of various biological objects and events can be useful. In this publication, the combination of different techniques to create such a network associated with an abstract cell environment is discussed in order to gain insights into the functional as well as spatial interrelationships. It is shown that experimentally gained knowledge enriched with data warehouse content and text mining data can be used for the reconstruction and localization of a cardiovascular disease developing network beginning with MUPP1/MPDZ (multi-PDZ domain protein).
The CELLmicrocosmos 4.2 PathwayIntegration (CmPI) is a tool which provides hybrid-dimensional visualization and analysis of intracellular protein and gene localizations in the context of a virtual 3D environment. This tool is developed based on Java/Java3D/JOGL and provides a standalone application compatible to all relevant operating systems. However, it requires Java and the local installation of the software. Here we present the prototype of an alternative web-based visualization approach, using Three.js and D3.js. In this way it is possible to visualize and explore CmPI-generated localization scenarios including networks mapped to 3D cell components by just providing a URL to a collaboration partner. This publication describes the integration of the different technologies – Three.js, D3.js and PHP – as well as an application case: a localization scenario of the citrate cycle. The CmPI web viewer is available at: http://CmPIweb.CELLmicrocosmos.org.
Comparative analysis of biological networks is a major problem in computational integrative systems biology. By computing the maximum common edge subgraph between a set of networks, one is able to detect conserved substructures between them and quantify their topological similarity. To aid such analyses we have developed CytoMCS, a Cytoscape app for computing inexact solutions to the maximum common edge subgraph problem for two or more graphs. Our algorithm uses an iterative local search heuristic for computing conserved subgraphs, optimizing a squared edge conservation score that is able to detect not only fully conserved edges but also partially conserved edges. It can be applied to any set of directed or undirected, simple graphs loaded as networks into Cytoscape, e.g. protein-protein interaction networks or gene regulatory networks. CytoMCS is available as a Cytoscape app at http://apps.cytoscape.org/apps/cytomcs.
JIB PublicationsThis work presents DaTo, a semi-automatically generated world atlas of biological databases and tools. It extracts raw information from all PubMed articles which contain exact URLs in their abstract section, followed by a manual curation of the abstract and the URL accessibility. DaTo features a user-friendly query interface, providing extensible URL-related annotations, such as the status, the location and the country of the URL. A graphical interaction network browser has also been integrated into the DaTo web interface to facilitate exploration of the relationship between different tools and databases with respect to their ontology-based semantic similarity. Using DaTo, the geographical locations, the health statuses, as well as the journal associations were evaluated with respect to the historical development of bioinformatics tools and databases over the last 20 years. We hope it will inspire the biological community to gain a systematic insight into bioinformatics resources. DaTo is accessible via http://bis.zju.edu.cn/DaTo/.
JIB Publications
Web service
Sequence analysis
During the last years several new tools applicable to protein analysis have made available on the IBIVU web site. Recently, a number of tools, ranging from multiple sequence alignment construction to domain prediction, have been updated and/or extended with services for programmatic access using SOAP. We provide an overview of these tools and their application.
Proteins and their interactions are essential for the functioning of all organisms and for understanding biological processes. Alternative splicing is an important molecular mechanism for increasing the protein diversity in eukaryotic cells. Splicing events that alter the protein structure and the domain composition can be responsible for the regulation of protein interactions and the functional diversity of different tissues. Discovering the occurrence of splicing events and studying protein isoforms have become feasible using Affymetrix Exon Arrays. Therefore, we have developed the versatile Cytoscape plugin DomainGraph that allows for the visual analysis of protein domain interaction networks and their integration with exon expression data. Protein domains affected by alternative splicing are highlighted and splicing patterns can be compared.
JIB PublicationsIn this paper we present two case studies of Proteomics applications development using the AIBench framework, a Java desktop application framework mainly focused in scientific software development. The applications presented in this work are Decision Peptide-Driven, for rapid and accurate protein quantification, and Bacterial Identification, for Tuberculosis biomarker search and diagnosis. Both tools work with mass spectrometry data, specifically with MALDI-TOF spectra, minimizing the time required to process and analyze the experimental data.
JIB PublicationsExpression efficiency is one of the major characteristics describing genes in various modern investigations. Expression efficiency of genes is regulated at various stages: transcription, translation, posttranslational protein modification and others. In this study, a special EloE (Elongation Efficiency) web application is described. The EloE sorts the organism's genes in a descend order on their theoretical rate of the elongation stage of translation based on the analysis of their nucleotide sequences. Obtained theoretical data have a significant correlation with available experimental data of gene expression in various organisms. In addition, the program identifies preferential codons in organism's genes and defines distribution of potential secondary structures energy in 5´ and 3´ regions of mRNA. The EloE can be useful in preliminary estimation of translation elongation efficiency for genes for which experimental data are not available yet. Some results can be used, for instance, in other programs modeling artificial genetic structures in genetically engineered experiments.
JIB PublicationsThe prevalence of comorbid diseases poses a major health issue for millions of people worldwide and an enormous socio-economic burden for society. The molecular mechanisms for the development of comorbidities need to be investigated. For this purpose, a workflow system was developed to aggregate data on biomedical entities from heterogeneous data sources. The process of integrating and merging all data sources of the workflow system was implemented as a semi-automatic pipeline that provides the import, fusion, and analysis of the highly connected biomedical data in a Neo4j database GenCoNet. As a starting point, data on the common comorbid diseases essential hypertension and bronchial asthma was integrated. GenCoNet (https://genconet.kalis-amts.de) is a curated database that provides a better understanding of hereditary bases of comorbidities.
JIB Publications
Script Library
Bioinformatics DNA Gene regulation
The interconversion of sequences that constitute the genome and the proteome is becoming increasingly important due to the generation of large amounts of DNA sequence data. Following mapping of DNA segments to the genome, one fundamentally important task is to find the amino acid sequences which are coded within a list of genomic sections. Conversely, given a series of protein segments, an important task is to find the genomic loci which code for a list of protein regions. To perform these tasks on a region by region basis is extremely laborious when a large number of regions are being studied. We have therefore implemented an R package geno2proteo which performs the two mapping tasks and subsequent sequence retrieval in a batch fashion. In order to make the tool more accessible to users, we have created a web interface of the R package which allows the users to perform the mapping tasks by going to the web page http://sharrocksresources.manchester.ac.uk/tofigaps and using the web service.
The need to process large quantities of data generated from genomic sequencing has resulted in a difficult task for life scientists who are not familiar with the use of command-line operations or developments in high performance computing and parallelization. This knowledge gap, along with unfamiliarity with necessary processes, can hinder the execution of data processing tasks. Furthermore, many of the commonly used bioinformatics tools for the scientific community are presented as isolated, unrelated entities that do not provide an integrated, guided, and assisted interaction with the scheduling facilities of computational resources or distribution, processing and mapping with runtime analysis. This paper presents the first approximation of a Web Services platform-based architecture (GITIRBio) that acts as a distributed front-end system for autonomous and assisted processing of parallel bioinformatics pipelines that has been validated using multiple sequences. Additionally, this platform allows integration with semantic repositories of genes for search annotations. GITIRBio is available at: http://c-head.ucaldas.edu.co:8080/gitirbio.
JIB Publications
Web application
Mapping Ontology and terminology Proteins Sequence analysis Sequencing
The functional annotation of genomic data has become a major task for the ever-growing number of sequencing projects. In order to address this challenge, we recently developed GOblet, a free web service for the annotation of anonymous sequences with Gene Ontology (GO) terms. However, to overcome limitations of the GO terminology, and to aid in understanding not only single components but as well systemic interactions between the individual components, we have now extended the GOblet web service to integrate also pathway annotations. Furthermore, we extended and upgraded the data analysis pipeline with improved summaries, and added term enrichment and clustering algorithms. Finally, we are now making GOblet available as a stand-alone application for high-throughput processing on local machines. The advantages of this frequently requested feature is that a) the user can avoid restrictions of our web service for uploading and processing large amounts of data, and that b) confidential data can be analysed without insecure transfer to a public web server. The stand-alone version of the web service has been implemented using platform independent Tcl-scripts, which can be run with just a single runtime file utilizing the Starkit technology. The GOblet web service and the stand-alone application are freely available at http://goblet.molgen.mpg.de.
Despite the large number of software tools developed to address different areas of microarray data analysis, very few offer an all-in-one solution with little learning curve. For microarray core labs, there are even fewer software packages available to help with their routine but critical tasks, such as data quality control (QC) and inventory management. We have developed a simple-to-use web portal to allow bench biologists to analyze and query complicated microarray data and related biological pathways without prior training. Both experiment-based and gene-based analysis can be easily performed, even for the first-time user, through the intuitive multi-layer design and interactive graphic links. While being friendly to inexperienced users, most parameters in Goober can be easily adjusted via drop-down menus to allow advanced users to tailor their needs and perform more complicated analysis. Moreover, we have integrated graphic pathway analysis into the website to help users examine microarray data within the relevant biological content. Goober also contains features that cover most of the common tasks in microarray core labs, such as real time array QC, data loading, array usage and inventory tracking. Overall, Goober is a complete microarray solution to help biologists instantly discover valuable information from a microarray experiment and enhance the quality and productivity of microarray core labs. The whole package is freely available at http://sourceforge.net/projects/goober. A demo web server is available at http://www.goober-array.org.
JIB PublicationsDetecting sources of bias in transcriptomic data is essential to determine signals of Biological significance. We outline a novel method to detect sequence specific bias in short read Next Generation Sequencing data. This is based on determining intra-exon correlations between specific motifs. This requires a mild assumption that short reads sampled from specific regions from the same exon will be correlated with each other. This has been implemented on Apache Spark and used to analyse two D. melanogaster eye-antennal disc data sets generated at the same laboratory. The wild type data set in drosophila indicates a variation due to motif GC content that is more significant than that found due to exon GC content. The software is available online and could be applied for cross-experiment transcriptome data analysis in eukaryotes.
JIB Publications
Database portal
Proteomics Sequence analysis Human biology Pathology Small molecules
Amino acid repeats are found to play important roles in both structures and functions of the proteins. These are commonly found in all kingdoms of life, especially in eukaryotes and a larger fraction of human proteins composed of repeats. Further, the abnormal expansions of shorter repeats cause various diseases to humans. Therefore, the analysis of repeats of the entire human proteome along with functional, mutational and disease information would help to better understand their roles in proteins. To fulfill this need, we developed a web database HPREP (http://bioinfo.bdu.ac.in/hprep) for human proteome repeats using Perl and HTML programming. We identified different categories of well-characterized repeats and domain repeats that are present in the human proteome of UniProtKB/Swiss-Prot by using in-house Perl programming and novel repeats by using the repeat detection T-REKS tool as well as XSTREAM web server. Further, these proteins are annotated with functional, mutational and disease information and grouped according to specific repeat types. The developed database enables the users to search by specific repeat type in order to understand their involvement in proteins. Thus, the HPREP database is expected to be a useful resource to gain better insight regarding the different repeats in human proteome and their biological roles.
This work presents a sophisticated information system, the Integrated Analysis Platform (IAP), an approach supporting large-scale image analysis for different species and imaging systems. In its current form, IAP supports the investigation of Maize, Barley and Arabidopsis plants based on images obtained in different spectra. Several components of the IAP system, which are described in this work, cover the complete end-to-end pipeline, starting with the image transfer from the imaging infrastructure, (grid distributed) image analysis, data management for raw data and analysis results, to the automated generation of experiment reports.
JIB Publications
Knowledge found in biomedical databases, in particular in Web information systems, is a major bioinformatics resource. In general, this biological knowledge is worldwide represented in a network of databases. These data is spread among thousands of databases, which overlap in content, but differ substantially with respect to content detail, interface, formats and data structure. To support a functional annotation of lab data, such as protein sequences, metabolites or DNA sequences as well as a semi-automated data exploration in information retrieval environments, an integrated view to databases is essential. Search engines have the potential of assisting in data retrieval from these structured sources, but fall short of providing a comprehensive knowledge except out of the interlinked databases. A prerequisite of supporting the concept of an integrated data view is to acquire insights into cross-references among database entities. This issue is being hampered by the fact, that only a fraction of all possible cross-references are explicitely tagged in the particular biomedical informations systems. In this work, we investigate to what extend an automated construction of an integrated data network is possible. We propose a method that predicts and extracts cross-references from multiple life science databases and possible referenced data targets. We study the retrieval quality of our method and report on first, promising results. The method is implemented as the tool IDPredictor, which is published under the DOI 10.5447/IPK/2012/4 and is freely available using the URL: http://dx.doi.org/10.5447/IPK/2012/4.
JIB Publications
Workbench
Metabolomics Data mining Data management
Over the last decade the evaluation of odors and vapors in human breath has gained more and more attention, particularly in the diagnostics of pulmonary diseases. Ion mobility spectrometry coupled with multi-capillary columns (MCC/IMS), is a well known technology for detecting volatile organic compounds (VOCs) in air. It is a comparatively inexpensive, non-invasive, high-throughput method, which is able to handle the moisture that comes with human exhaled air, and allows for characterizing of VOCs in very low concentrations. To identify discriminating compounds as biomarkers, it is necessary to have a clear understanding of the detailed composition of human breath. Therefore, in addition to the clinical studies, there is a need for a flexible and comprehensive centralized data repository, which is capable of gathering all kinds of related information. Moreover, there is a demand for automated data integration and semi-automated data analysis, in particular with regard to the rapid data accumulation, emerging from the high-throughput nature of the MCC/IMS technology. Here, we present a comprehensive database application and analysis platform, which combines metabolic maps with heterogeneous biomedical data in a well-structured manner. The design of the database is based on a hybrid of the entity-attribute-value (EAV) model and the EAV-CR, which incorporates the concepts of classes and relationships. Additionally it offers an intuitive user interface that provides easy and quick access to the platform’s functionality: automated data integration and integrity validation, versioning and roll-back strategy, data retrieval as well as semi-automatic data mining and machine learning capabilities. The platform will support MCC/IMS-based biomarker identification and validation. The software, schemata, data sets and further information is publicly available at http://imsdb.mpi-inf.mpg.de.
At the present, coding sequence (CDS) has been discovered and larger CDS is being revealed frequently. Approaches and related tools have also been developed and upgraded concurrently, especially for phylogenetic tree analysis. This paper proposes an integrated automatic Taverna workflow for the phylogenetic tree inferring analysis using public access web services at European Bioinformatics Institute (EMBL-EBI) and Swiss Institute of Bioinformatics (SIB), and our own deployed local web services. The workflow input is a set of CDS in the Fasta format. The workflow supports 1,000 to 20,000 numbers in bootstrapping replication. The workflow performs the tree inferring such as Parsimony (PARS), Distance Matrix - Neighbor Joining (DIST-NJ), and Maximum Likelihood (ML) algorithms of EMBOSS PHYLIPNEW package based on our proposed Multiple Sequence Alignment (MSA) similarity score. The local web services are implemented and deployed into two types using the Soaplab2 and Apache Axis2 deployment. There are SOAP and Java Web Service (JWS) providing WSDL endpoints to Taverna Workbench, a workflow manager. The workflow has been validated, the performance has been measured, and its results have been verified. Our workflow's execution time is less than ten minutes for inferring a tree with 10,000 replicates of the bootstrapping numbers. This paper proposes a new integrated automatic workflow which will be beneficial to the bioinformaticians with an intermediate level of knowledge and experiences. All local services have been deployed at our portal http://bioservices.sci.psu.ac.th.
JIB PublicationsAstraZeneca’s Oncology in vivo data integration platform brings multidimensional data from animal model efficacy, pharmacokinetic and pharmacodynamic data to animal model profiling data and public in vivo studies. Using this platform, scientists can cluster model efficacy and model profiling data together, quickly identify responder profiles and correlate molecular characteristics to pharmacological response. Through meta-analysis, scientists can compare pharmacology between single and combination treatments, between different drug scheduling and administration routes.
JIB Publications
Web API
Genetic variance within the genotype of population and its mapping to phenotype variance in a systematic and high throughput manner is of interest for biodiversity and breeding research. Beside the established and efficient high throughput genotype technologies, phenotype capabilities got increased focus in the last decade. This results in an increasing amount of phenotype data from well scaling, automated sensor platform. Thus, data stewardship is a central component to make experimental data from multiple domains interoperable and re-usable. To ensure a standard and comprehensive sharing of scientific and experimental data among domain experts, FAIR data principles are utilized for machine read-ability and scale-ability. In this context, BrAPI consortium, provides a comprehensive and commonly agreed FAIRed guidelines to offer a BrAPI layered scientific data in a RESTful manner. This paper presents the concepts, best practices and implementations to meet these challenges. As one of the worlds leading plant research institutes it is of vital interest for the IPK-Gatersleben to transform legacy data infrastructures into a bio-digital resource center for plant genetics resources (PGR). This paper also demonstrates the benefits of integrated database back-ends, established data stewardship processes, and FAIR data exposition in a machine-readable, highly scalable programmatic interfaces.
JIB.tools 2.0 is a new approach to more closely embed the curation process in the publication process. This website hosts the tools, software applications, databases and workflow systems published in the Journal of Integrative Bioinformatics (JIB). As soon as a new tool-related publication is published in JIB, the tool is posted to JIB.tools and can afterwards be easily transferred to bio.tools, a large information repository of software tools, databases and services for bioinformatics and the life sciences. In this way, an easily-accessible list of tools is provided which were published in JIB a well as status information regarding the underlying service. With newer registries like bio.tools providing these information on a bigger scale, JIB.tools 2.0 closes the gap between journal publications and registry publication. (Reference: https://jib.tools).
JIB PublicationsMeasuring differential methylation of the DNA is the nowadays most common approach to linking epigenetic modifications to diseases (called epigenome-wide association studies, EWAS). For its low cost, its efficiency and easy handling, the Illumina HumanMethylation450 BeadChip and its successor, the Infinium MethylationEPIC BeadChip, is the by far most popular techniques for conduction EWAS in large patient cohorts. Despite the popularity of this chip technology, raw data processing and statistical analysis of the array data remains far from trivial and still lacks dedicated software libraries enabling high quality and statistically sound downstream analyses. As of yet, only R-based solutions are freely available for low-level processing of the Illumina chip data. However, the lack of alternative libraries poses a hurdle for the development of new bioinformatic tools, in particular when it comes to web services or applications where run time and memory consumption matter, or EWAS data analysis is an integrative part of a bigger framework or data analysis pipeline. We have therefore developed and implemented Jllumina, an open-source Java library for raw data manipulation of Illumina Infinium HumanMethylation450 and Infinium MethylationEPIC BeadChip data, supporting the developer with Java functions covering reading and preprocessing the raw data, down to statistical assessment, permutation tests, and identification of differentially methylated loci. Jllumina is fully parallelizable and publicly available at http://dimmer.compbio.sdu.dk/download.html.
JIB Publications
Web application
Plant biology Genomics
Search engines and retrieval systems are popular tools at a life science desktop. The manual inspection of hundreds of database entries, that reflect a life science concept or fact, is a time intensive daily work. Hereby, not the number of query results matters, but the relevance does. In this paper, we present the LAILAPS search engine for life science databases. The concept is to combine a novel feature model for relevance ranking, a machine learning approach to model user relevance profiles, ranking improvement by user feedback tracking and an intuitive and slim web user interface, that estimates relevance rank by tracking user interactions. Queries are formulated as simple keyword lists and will be expanded by synonyms. Supporting a flexible text index and a simple data import format, LAILAPS can easily be used both as search engine for comprehensive integrated life science databases and for small in-house project databases. With a set of features, extracted from each database hit in combination with user relevance preferences, a neural network predicts user specific relevance scores. Using expert knowledge as training data for a predefined neural network or using users own relevance training sets, a reliable relevance ranking of database hits has been implemented. In this paper, we present the LAILAPS system, the concepts, benchmarks and use cases. LAILAPS is public available for SWISSPROT data at http://lailaps.ipk-gatersleben.de.
Distinct bacteria are able to cope with highly diverse lifestyles; for instance, they can be free living or host-associated. Thus, these organisms must possess a large and varied genomic arsenal to withstand different environmental conditions. To facilitate the identification of genomic features that might influence bacterial adaptation to a specific niche, we introduce LifeStyle-Specific-Islands (LiSSI). LiSSI combines evolutionary sequence analysis with statistical learning (Random Forest with feature selection, model tuning and robustness analysis). In summary, our strategy aims to identify conserved consecutive homology sequences (islands) in genomes and to identify the most discriminant islands for each lifestyle.
JIB Publications
The rapid increase of ~omics datasets generated by microarray, mass spectrometry and next generation sequencing technologies requires an integrated platform that can combine results from different ~omics datasets to provide novel insights in the understanding of biological systems. MADMAX is designed to provide a solution for storage and analysis of complex ~omics datasets. In addition, analysis results (such as lists of genes) will be merged to reveal candidate genes supported by all datasets. The system constitutes an ISA-Tab compliant LIMS part which is independent of different analysis pipelines. A pilot study of different type of ~omics data in Brassica rapa demonstrates the possible use of MADMAX. The web-based user interface provides easy access to data and analysis tools on top of the database.
JIB Publications
Web application
We describe a web-based tool, MakeSBML (https://sys-bio.github.io/makesbml/), that provides an installation-free application for creating, editing, and searching the Biomodels repository for SBML-based models. MakeSBML is a client-based web application that translates models expressed in human-readable Antimony to the System Biology Markup Language (SBML) and vice-versa. Since MakeSBML is a web-based application it requires no installation on the user's part. Currently, MakeSBML is hosted on a GitHub page where the client-based design makes it trivial to move to other hosts. This model for software deployment also reduces maintenance costs since an active server is not required. The SBML modeling language is often used in systems biology research to describe complex biochemical networks and makes reproducing models much easier. However, SBML is designed to be computer-readable, not human-readable. We therefore employ the human-readable Antimony language to make it easy to create and edit SBML models.
In recent years the amount of biological data has exploded to the point where much useful information can only be extracted by complex computational analyses. Such analyses are greatly facilitated by metadata standards, both in terms of the ability to compare data originating from different sources, and in terms of exchanging data in standard forms, e.g. when running processes on a distributed computing infrastructure. However, standards thrive on stability whereas science tends to constantly move, with new methods being developed and old ones modified. Therefore maintaining both metadata standards, and all the code that is required to make them useful, is a non-trivial problem. Memops is a framework that uses an abstract definition of the metadata (described in UML) to generate internal data structures and subroutine libraries for data access (application programming interfaces--APIs--currently in Python, C and Java) and data storage (in XML files or databases). For the individual project these libraries obviate the need for writing code for input parsing, validity checking or output. Memops also ensures that the code is always internally consistent, massively reducing the need for code reorganisation. Across a scientific domain a Memops-supported data model makes it easier to support complex standards that can capture all the data produced in a scientific area, share them among all programs in a complex software pipeline, and carry them forward to deposition in an archive. The principles behind the Memops generation code will be presented, along with example applications in Nuclear Magnetic Resonance (NMR) spectroscopy and structural biology.
JIB PublicationsHelicobacter pylori is a pathogenic bacterium that colonizes the human epithelia, causing duodenal and gastric ulcers, and gastric cancer. The genome of H. pylori 26695 has been previously sequenced and annotated. In addition, two genome-scale metabolic models have been developed. In order to maintain accurate and relevant information on coding sequences (CDS) and to retrieve new information, the assignment of new functions to Helicobacter pylori 26695s genes was performed in this work. The use of software tools, on-line databases and an annotation pipeline for inspecting each gene allowed the attribution of validated EC numbers and TC numbers to metabolic genes encoding enzymes and transport proteins, respectively. 1212 genes encoding proteins were identified in this annotation, being 712 metabolic genes and 500 non-metabolic, while 191 new functions were assignment to the CDS of this bacterium. This information provides relevant biological information for the scientific community dealing with this organism and can be used as the basis for a new metabolic model reconstruction.
JIB Publications
Database portal
Endocrinology and metabolism Plant biology Molecular interactions, pathways and networks Enzymes
Crop plants play a major role in human and animal nutrition and increasingly contribute to chemical or pharmaceutical industry and renewable resources. In order to achieve important goals, such as the improvement of growth or yield, it is indispensable to understand biological processes on a detailed level. Therefore, the well-structured management of fine-grained information about metabolic pathways is of high interest. Thus, we developed the MetaCrop information system, a manually curated repository of high quality information concerning the metabolism of crop plants. However, the data access to and flexible export of information of MetaCrop in standard exchange formats had to be improved. To automate and accelerate the data access we designed a set of web services to be integrated into external software. These web services have already been used by an add-on for the visualisation toolkit VANTED. Furthermore, we developed an export feature for the MetaCrop web interface, thus enabling the user to compose individual metabolic models using SBML.
Proteomic and transcriptomic technologies resulted in massive biological datasets, their interpretation requiring sophisticated computational strategies. Efficient and intuitive real-time analysis remains challenging. We use proteomic data on 1417 proteins of the green microalga Chlamydomonas reinhardtii to investigate physicochemical parameters governing selectivity of three cysteine-based redox post translational modifications (PTM): glutathionylation (SSG), nitrosylation (SNO) and disulphide bonds (SS) reduced by thioredoxins. We aim to understand underlying molecular mechanisms and structural determinants through integration of redox proteome data from gene- to structural level. Our interactive visual analytics approach on an 8.3 m2 display wall of 25 MPixel resolution features stereoscopic three dimensions (3D) representation performed by UnityMol WebGL. Virtual reality headsets complement the range of usage configurations for fully immersive tasks. Our experiments confirm that fast access to a rich cross-linked database is necessary for immersive analysis of structural data. We emphasize the possibility to display complex data structures and relationships in 3D, intrinsic to molecular structure visualization, but less common for omics-network analysis. Our setup is powered by MinOmics, an integrated analysis pipeline and visualization framework dedicated to multi-omics analysis. MinOmics integrates data from various sources into a materialized physical repository. We evaluate its performance, a design criterion for the framework.
JIB Publications
MicroRNAs (miRNAs/miRs) are important cellular components that regulate gene expression at posttranscriptional level. Various upstream components regulate miR expression and any deregulation causes disease conditions. Therefore, understanding of miR regulatory network both at upstream and downstream level is crucial and a resource on this aspect will be helpful. Currently available miR databases are mostly related to downstream targets, sequences, or diseases. But as of now, no database is available that provides a complete picture of miR regulation in a specific condition. Our miR regulation web resource (miReg) is a manually curated one that represents validated upstream regulators (transcription factor, drug, physical, and chemical) along with downstream targets, associated biological process, experimental condition or disease state, up or down regulation of the miR in that condition, and corresponding PubMed references in a graphical and user friendly manner, browseable through 5 browsing options. We have presented exact facts that have been described in the corresponding literature in relation to a given miR, whether it's a feed-back/feed-forward loop or inhibition/activation. Moreover we have given various links to integrate data and to get a complete picture on any miR listed. Current version (Version 1.0) of miReg contains 47 important human miRs with 295 relations using 190 absolute references. We have also provided an example on usefulness of miReg to establish signalling pathways involved in cardiomyopathy. We believe that miReg will be an essential miRNA knowledge base to research community, with its continuous upgrade and data enrichment. This HTML based miReg can be accessed from: www.iioab-mireg.webs.com or www.iioab.webs.com/mireg.htm.
JIB PublicationsIdentification of microRNA (miRNA) precursors has seen increased efforts in recent years. The difficulty in experimental detection of pre-miRNAs increased the usage of computational approaches. Most of these approaches rely on machine learning especially classification. In order to achieve successful classification, many parameters need to be considered such as data quality, choice of classifier settings, and feature selection. For the latter one, we developed a distributed genetic algorithm on HTCondor to perform feature selection. Moreover, we employed two widely used classification algorithms libSVM and random forest with different settings to analyze the influence on the overall classification performance. In this study we analyzed 5 human retro virus genomes; Human endogenous retrovirus K113, Hepatitis B virus (strain ayw), Human T lymphotropic virus 1, Human T lymphotropic virus 2, Human immunodeficiency virus 2, and Human immunodeficiency virus 1. We then predicted pre-miRNAs by using the information from known virus and human pre-miRNAs. Our results indicate that these viruses produce novel unknown miRNA precursors which warrant further experimental validation.
JIB PublicationsSmall non-coding RNAs, in particular microRNAs, are critical for normal physiology and are candidate biomarkers, regulators, and therapeutic targets for a wide variety of diseases. There is an ever-growing interest in the comprehensive and accurate annotation of microRNAs across diverse cell types, conditions, species, and disease states. Highthroughput sequencing technology has emerged as the method of choice for profiling microRNAs. Specialized bioinformatic strategies are required to mine as much meaningful information as possible from the sequencing data to provide a comprehensive view of the microRNA landscape. Here we present miRquant 2.0, an expanded bioinformatics tool for accurate annotation and quantification of microRNAs and their isoforms (termed isomiRs) from small RNA-sequencing data. We anticipate that miRquant 2.0 will be useful for researchers interested not only in quantifying known microRNAs but also mining the rich well of additional information embedded in small RNA-sequencing data.
JIB Publications
Web application
Sequence analysis Proteins Molecular interactions, pathways and networks Sequencing Protein interactions
During the last years several new tools applicable to protein analysis have made available on the IBIVU web site. Recently, a number of tools, ranging from multiple sequence alignment construction to domain prediction, have been updated and/or extended with services for programmatic access using SOAP. We provide an overview of these tools and their application.
Command-line tool
DNA Mobile genetic elements Sequence analysis
Background Miniature inverted repeat transposable element (MITE) is a short transposable element, carrying no protein-coding regions. However, its high proliferation rate and sequence-specific insertion preference renders it as a good genetic tool for both natural evolution and experimental insertion mutagenesis. Recently active MITE copies are those with clear signals of Terminal Inverted Repeats (TIRs) and Direct Repeats (DRs), and are recently translocated into their current sites. Their proliferation ability renders them good candidates for the investigation of genomic evolution. Results This study optimizes the C++ code and running pipeline of the MITE Uncovering SysTem (MUST) by assuming no prior knowledge of MITEs required from the users, and the current version, MUSTv2, shows significantly increased detection accuracy for recently active MITEs, compared with similar programs. The running speed is also significantly increased compared with MUSTv1. We prepared a benchmark dataset, the simulated genome with 150 MITE copies for researchers who may be of interest. Conclusions MUSTv2 represents an accurate detection program of recently active MITE copies, which is complementary to the existing template-based MITE mapping programs. We believe that the release of MUSTv2 will greatly facilitate the genome annotation and structural analysis of the bioOMIC big data researchers.
Database portal Web application
Biodiversity Data integration and warehousing
Fungi have crucial roles in ecosystems, and are important associates for many organisms. They are adapted to a wide variety of habitats, however their global distribution and diversity remains poorly documented. The exponential growth of DNA barcode information retrieved from the environment is assisting considerably the traditional ways for unraveling fungal diversity and detection. The raw DNA data in association to environmental descriptors of metabarcoding studies are made available in public sequence read archives. While this is potentially a valuable source of information for the investigation of Fungi across diverse environmental conditions, the annotation used to describe environment is heterogenous. Moreover, a uniform processing pipeline still needs to be applied to the available raw DNA data. Hence, a comprehensive framework to analyses these data in a large context is still lacking. We introduce the MycoDiversity DataBase, a database which includes public fungal metabarcoding data of environmental samples for the study of biodiversity patterns of Fungi. The framework we propose will contribute to our understanding of fungal biodiversity and aims to become a valuable source for large-scale analyses of patterns in space and time, in addition to assisting evolutionary and ecological research on Fungi.
Biological networks can be large and complex, often consisting of different sub-networks or parts. Separation of networks into parts, network partitioning and layouts of overview and sub-graphs are of importance for understandable visualisations of those networks. This article presents NetPartVis to visualise non-overlapping clusters or partitions of graphs in the Vanted framework based on a method for laying out overview graph and several sub-graphs (partitions) in a coordinated, mental-map preserving way.
JIB PublicationsOrganisms try to maintain homeostasis by balanced uptake of nutrients from their environment. From an atomic perspective this means that, for example, carbon:nitrogen:sulfur ratios are kept within given limits. Upon limitation of, for example, sulfur, its acquisition is triggered. For yeast it was shown that transporters and enzymes involved in sulfur uptake are encoded as paralogous genes that express different isoforms. Sulfur deprivation leads to up-regulation of isoforms that are poor in sulfur-containing amino acids, that is, methinone and cysteine. Accordingly, sulfur-rich isoforms are down-regulated. We developed a web-based software, doped Nutrilyzer, that extracts paralogous protein coding sequences from an annotated genome sequence and evaluates their atomic composition. When fed with gene-expression data for nutrient limited and normal conditions, Nutrilyzer provides a list of genes that are significantly differently expressed and simultaneously contain significantly different amounts of the limited nutrient in their atomic composition. Its intended use is in the field of ecological stoichiometry. Nutrilyzer is available at http://nutrilyzer.hs-mittweida.de. Here we describe the work flow and results with an example from a whole-genome Arabidopsis thaliana gene-expression analysis upon oxygen deprivation. 43 paralogs distributed over 37 homology clusters were found to be significantly differently expressed while containing significantly different amounts of oxygen.
JIB PublicationsWe present Omics Fusion, a new web-based platform for integrative analysis of omics data. Omics Fusion provides a collection of new and established tools and visualization methods to support researchers in exploring omics data, validating results or understanding how to adjust experiments in order to make new discoveries. It is easily extendible and new visualization methods are added continuously. It is available for free under: https://fusion.cebitec.uni-bielefeld.de/.
JIB PublicationsHigh throughput genomic studies can identify large numbers of potential candidate genes, which must be interpreted and filtered by investigators to select the best ones for further analysis. Prioritization is generally based on evidence that supports the role of a gene product in the biological process being investigated. The two most important bodies of information providing such evidence are bioinformatics databases and the scientific literature. In this paper we present an extension to the Ondex data integration framework that uses text mining techniques over Medline abstracts as a method for accessing both these bodies of evidence in a consistent way. In an example use case, we apply our method to create a knowledge base of Arabidopsis proteins implicated in plant stress response and use various scoring metrics to identify key protein-stress associations. In conclusion, we show that the additional text mining features are able to highlight proteins using the scientific literature that would not have been seen using data integration alone. Ondex is an open-source software project and can be downloaded, together with the text mining features described here, from www.ondex.org.
JIB PublicationsThe automated annotation of data from high throughput sequencing and genomics experiments is a significant challenge for bioinformatics. Most current approaches rely on sequential pipelines of gene finding and gene function prediction methods that annotate a gene with information from different reference data sources. Each function prediction method contributes evidence supporting a functional assignment. Such approaches generally ignore the links between the information in the reference datasets. These links, however, are valuable for assessing the plausibility of a function assignment and can be used to evaluate the confidence in a prediction. We are working towards a novel annotation system that uses the network of information supporting the function assignment to enrich the annotation process for use by expert curators and predicting the function of previously unannotated genes. In this paper we describe our success in the first stages of this development. We present the data integration steps that are needed to create the core database of integrated reference databases (UniProt, PFAM, PDB, GO and the pathway database Ara-Cyc) which has been established in the ONDEX data integration system. We also present a comparison between different methods for integration of GO terms as part of the function assignment pipeline and discuss the consequences of this analysis for improving the accuracy of gene function annotation. The methods and algorithms presented in this publication are an integral part of the ONDEX system which is freely available from http://ondex.sf.net/.
JIB PublicationsElectronic laboratory notebooks (ELNs) are more accessible and reliable than their paper based alternatives and thus find widespread adoption. While a large number of commercial products is available, small- to mid-sized laboratories can often not afford the costs or are concerned about the longevity of the providers. Turning towards free alternatives, however, raises questions about data protection, which are not sufficiently addressed by available solutions. To serve as legal documents, ELNs must prevent scientific fraud through technical means such as digital signatures. It would also be advantageous if an ELN was integrated with a laboratory information management system to allow for a comprehensive documentation of experimental work including the location of samples that were used in a particular experiment. Here, we present OpenLabNotes, which adds state-of-the-art ELN capabilities to OpenLabFramework, a powerful and flexible laboratory information management system. In contrast to comparable solutions, it allows to protect the intellectual property of its users by offering data protection with digital signatures. OpenLabNotes effectively closes the gap between research documentation and sample management, thus making Open-LabFramework more attractive for laboratories that seek to increase productivity through electronic data management.
JIB Publications
Command-line tool Library
Microarray experiment Gene expression
Correlation analysis assuming coexpression of the genes is a widely used method for gene expression analysis in molecular biology. Yet growing extent, quality and dimensionality of the molecular biological data permits emerging, more sophisticated approaches like Boolean implications. We present an approach which is a combination of the SOM (self organizing maps) machine learning method and Boolean implication analysis to identify relations between genes, metagenes and similarly behaving metagene groups (spots). Our method provides a way to assign Boolean states to genes/metagenes/spots and offers a functional view over significantly variant elements of gene expression data on these three different levels. While being able to cover relations between weakly correlated entities Boolean implication method also decomposes these relations into six implication classes. Our method allows one to validate or identify potential relationships between genes and functional modules of interest and to assess their switching behaviour. Furthermore the output of the method renders it possible to construct and study the network of genes. By providing logical implications as updating rules for the network it can also serve to aid modelling approaches.
Web application Database portal
Outbreaks of COVID-19 caused by the novel coronavirus SARS-CoV-2 is still a threat to global human health. In order to understand the biology of SARS-CoV-2 and developing drug against COVID-19, a vast amount of genomic, proteomic, interatomic, and clinical data is being generated, and the bioinformatics researchers produced databases, webservers and tools to gather those publicly available data and provide an opportunity of analyzing such data. However, these bioinformatics resources are scattered and researchers need to find them from different resources discretely. To facilitate researchers in finding the resources in one frame, we have developed an integrated web portal called OverCOVID (http://bis.zju.edu.cn/overcovid/). The publicly available webservers, databases and tools associated with SARS-CoV-2 have been incorporated in the resource page. In addition, a network view of the resources is provided to display the scope of the research. Other information like SARS-CoV-2 strains is visualized and various layers of interaction resources is listed in distinct pages of the web portal. As an integrative web portal, the OverCOVID will help the scientist to search the resources and accelerate the clinical research of SARS-CoV-2.
Web API
Molecular interactions, pathways and networks Data management
Biological pathways are crucial to much of the scientific research today including the study of specific biological processes related with human diseases. PathJam is a new comprehensive and freely accessible web-server application integrating scattered human pathway annotation from several public sources. The tool has been designed for both (i) being intuitive for wet-lab users providing statistical enrichment analysis of pathway annotations and (ii) giving support to the development of new integrative pathway applications. PathJam’s unique features and advantages include interactive graphs linking pathways and genes of interest, downloadable results in fully compatible formats, GSEA compatible output files and a standardized RESTful API.
Desktop application
Systems biology Molecular interactions, pathways and networks
Our understanding of complex biological processes can be enhanced by combining different kinds of high-throughput experimental data, but the use of incompatible identifiers makes data integration a challenge. We aimed to improve methods for integrating and visualizing different types of omics data. To validate these methods, we applied them to two previous studies on starvation in mice, one using proteomics and the other using transcriptomics technology. We extended the PathVisio software with new plugins to link proteins, transcripts and pathways. A low overall correlation between proteome and transcriptome data was detected (Spearman rank correlation: 0.21). At the level of individual genes, correlation was highly variable. Many mRNA/protein pairs, such as fructose biphosphate aldolase B and ATP Synthase, show good correlation. For other pairs, such as ferritin and elongation factor 2, an interesting effect is observed, where mRNA and protein levels change in opposite directions, suggesting they are not primarily regulated at the transcriptional level. We used pathway diagrams to visualize the integrated datasets and found it encouraging that transcriptomics and proteomics data supported each other at the pathway level. Visualization of the integrated dataset on pathways led to new observations on gene-regulation in the response of the gut to starvation. Our methods are generic and can be applied to any multi-omics study. The PathVisio software can be obtained at http://www.pathvisio.org. Supplemental data are available at http://www.bigcat.unimaas.nl/data/jib-supplemental/ , including instructions on reproducing the pathway visualizations of this manuscript.
MicroRNAs (miRs) are known to interfere with mRNA expression, and much work has been put into predicting and inferring miR-mRNA interactions. Both sequence-based interaction predictions as well as interaction inference based on expression data have been proven somewhat successful; furthermore, models that combine the two methods have had even more success. In this paper, I further refine and enrich the methods of miRmRNA interaction discovery by integrating a Bayesian clustering algorithm into a model of prediction-enhanced miR-mRNA target inference, creating an algorithm called PEACOAT, which is written in the R language. I show that PEACOAT improves the inference of miR-mRNA target interactions using both simulated data and a data set of microarrays from samples of multiple myeloma patients. In simulated networks of 25 miRs and mRNAs, our methods using clustering can improve inference in roughly two-thirds of cases, and in the multiple myeloma data set, KEGG pathway enrichment was found to be more significant with clustering than without. Our findings are consistent with previous work in clustering of non-miR genetic networks and indicate that there could be a significant advantage to clustering of miR and mRNA expression data as a part of interaction inference.
JIB PublicationsSystems biology plays a central role for biological network analysis in the post-genomic era. Cytoscape is the standard bioinformatics tool offering the community an extensible platform for computational analysis of the emerging cellular network together with experimental omics data sets. However, only few apps/plugins/tools are available for simulating network dynamics in Cytoscape 3. Many approaches of varying complexity exist but none of them have been integrated into Cytoscape as app/plugin yet. Here, we introduce PetriScape, the first Petri net simulator for Cytoscape. Although discrete Petri nets are quite simplistic models, they are capable of modeling global network properties and simulating their behaviour. In addition, they are easily understood and well visualizable. PetriScape comes with the following main functionalities: (1) import of biological networks in SBML format, (2) conversion into a Petri net, (3) visualization as Petri net, and (4) simulation and visualization of the token flow in Cytoscape. PetriScape is the first Cytoscape plugin for Petri nets. It allows a straightforward Petri net model creation, simulation and visualization with Cytoscape, providing clues about the activity of key components in biological networks.
JIB PublicationsImprovements in genome sequencing technology increased the availability of full genomes and transcriptomes of many organisms. However, the major benefit of massive parallel sequencing is to better understand the organization and function of genes which then lead to understanding of phenotypes. In order to interpret genomic data with automated gene annotation studies, several tools are currently available. Even though the accuracy of computational gene annotation is increasing, a combination of multiple lines of experimental evidences should be gathered. Mass spectrometry allows the identification and sequencing of proteins as major gene products; and it is only these proteins that conclusively show whether a part of a genome is a coding region or not to result in phenotypes. Therefore, in the field of proteogenomics, the validation of computational methods is done by exploiting mass spectrometric data. As a result, identification of novel protein coding regions, validation of current gene models, and determination of upstream and downstream regions of genes can be achieved. In this paper, we present new functionality for our proteogenomic tool, PGMiner which performs all proteogenomic steps like acquisition of mass spectrometric data, peptide identification against preprocessed sequence databases, assignment of statistical confidence to identified peptides, mapping confident peptides to gene models, and result visualization. The extensions cover determining proteotypic peptides and thus unambiguous protein identification. Furthermore, peptides conflicting with gene models can now automatically assessed within the context of predicted alternative open reading frames.
JIB Publications
Web application
Structure prediction Protein secondary structure Sequence analysis Protein folds and structural domains Nucleic acid structure analysis
During the last years several new tools applicable to protein analysis have made available on the IBIVU web site. Recently, a number of tools, ranging from multiple sequence alignment construction to domain prediction, have been updated and/or extended with services for programmatic access using SOAP. We provide an overview of these tools and their application.
MicroRNAs are short non-coding RNA transcripts that act as master cellular egulators with roles in orchestrating virtually all biological functions. The recent affordability and widespread use of high-throughput microRNA profiling technologies has grown along with the advancement of bioinformatics tools available for analysis of the mounting data flow. While there are many computational resources available for the management of data from genome sequenced animals, researchers are often faced with the challenge of identifying the biological implications of the daunting amount of data generated from these high-throughput technologies. In this article, we review the current state of highthroughput microRNA expression profiling platforms, data analysis processes, and computational tools in the context of comparative molecular physiology. We also present RBioMIR and RBioFS, our R package implementations for differential expression analysis and random forest-based gene selection. Detailed installation guides are available at kenstoreylab.com.
JIB PublicationsMicroRNAs are short non-coding RNA transcripts that act as master cellular egulators with roles in orchestrating virtually all biological functions. The recent affordability and widespread use of high-throughput microRNA profiling technologies has grown along with the advancement of bioinformatics tools available for analysis of the mounting data flow. While there are many computational resources available for the management of data from genome sequenced animals, researchers are often faced with the challenge of identifying the biological implications of the daunting amount of data generated from these high-throughput technologies. In this article, we review the current state of highthroughput microRNA expression profiling platforms, data analysis processes, and computational tools in the context of comparative molecular physiology. We also present RBioMIR and RBioFS, our R package implementations for differential expression analysis and random forest-based gene selection. Detailed installation guides are available at kenstoreylab.com.
JIB Publications
Desktop application
Systems biology Biochemistry Chemical biology Simulation experiment
Reaction-diffusion systems are mathematical models that describe how the concentrations of substances distributed in space change under the influence of local chemical reactions, and diffusion which causes the substances to spread out in space. The classical representation of a reaction-diffusion system is given by semi-linear parabolic partial differential equations, whose solution predicts how diffusion causes the concentration field to change with time. This change is proportional to the diffusion coefficient. If the solute moves in a homogeneous system in thermal equilibrium, the diffusion coefficients are constants that do not depend on the local concentration of solvent and solute. However, in nonhomogeneous and structured media the assumption of constant intracellular diffusion coefficient is not necessarily valid, and, consequently, the diffusion coefficient is a function of the local concentration of solvent and solutes. In this paper we propose a stochastic model of reaction-diffusion systems, in which the diffusion coefficients are function of the local concentration, viscosity and frictional forces. We then describe the software tool Redi (REaction-DIffusion simulator) which we have developed in order to implement this model into a Gillespie-like stochastic simulation algorithm. Finally, we show the ability of our model implemented in the Redi tool to reproduce the observed gradient of the bicoid protein in the Drosophila Melanogaster embryo. With Redi, we were able to simulate with an accuracy of 1% the experimental spatio-temporal dynamics of the bicoid protein, as recorded in time-lapse experiments obtained by direct measurements of transgenic bicoidenhanced green fluorescent protein.
Understanding how metabolic reactions translate the genome of an organism into its phenotype is a grand challenge in biology. Genome-wide association studies (GWAS) statistically connect genotypes to phenotypes, without any recourse to known molecular interactions, whereas a molecular mechanistic description ties gene function to phenotype through gene regulatory networks (GRNs), protein-protein interactions (PPIs) and molecular pathways. Integration of different regulatory information levels of an organism is expected to provide a good way for mapping genotypes to phenotypes. However, the lack of curated metabolic model of rice is blocking the exploration of genome-scale multi-level network reconstruction. Here, we have merged GRNs, PPIs and genome-scale metabolic networks (GSMNs) approaches into a single framework for rice via omics’ regulatory information reconstruction and integration. Firstly, we reconstructed a genome-scale metabolic model, containing 4,462 function genes, 2,986 metabolites involved in 3,316 reactions, and compartmentalized into ten subcellular locations. Furthermore, 90,358 pairs of protein-protein interactions, 662,936 pairs of gene regulations and 1,763 microRNA-target interactions were integrated into the metabolic model. Eventually, a database was developped for systematically storing and retrieving the genome-scale multi-level network of rice. This provides a reference for understanding genotype-phenotype relationship of rice, and for analysis of its molecular regulatory network.
JIB Publications
Web application
Sequence analysis
The differentiation of regions with coding potential from non-coding regions remains a key task in computational biology. Methods such as RNAcode that exploit patterns of sequence conservation for this task have a substantial advantage in classification accuracy in particular for short coding sequences, compared to methods that rely on a single input sequence. However, they require sequence alignments as input. Frequently, suitable multiple sequence alignments are not readily available and are tedious, and sometimes difficult to construct. We therefore introduce here a new web service that provides access to the well-known coding sequence detector RNAcode with minimal user overhead. It requires as input only a single target nucleotide sequence. The service automates the collection, selection, and preparation of homologous sequences from the NCBI database, as well as the construction of the multiple sequence alignment that are needed as input for RNAcode. The service automatizes the entire pre- and postprocessing and thus makes the investigation of specific genomic regions for previously unannotated coding regions, such as small peptides or additional introns, a simple task that is easily accessible to non-expert users. RNAcode_Web is accessible online at rnacode.bioinf.uni-leipzig.de.
SAD_BaSe is a blood bank data analysis software, created to assist in the management of blood donations and the blood production chain in blood establishments. In particular, the system keeps track of several collection and production indicators, enables the definition of collection and production strategies, and the measurement of quality indicators required by the Quality Management System regulating the general operation of blood establishments. This paper describes the general scenario of blood establishments and its main requirements in terms of data management and analysis. It presents the architecture of SAD_BaSe and identifies its main contributions. Specifically, it brings forward the generation of customized reports driven by decision making needs and the use of data mining techniques in the analysis of donor suspensions and donation discards.
JIB PublicationsAdvances in bioinformatics have contributed towards a significant increase in available information. Information analysis requires the use of distributed computing systems to best engage the process of data analysis. This study proposes a multiagent system that incorporates grid technology to facilitate distributed data analysis by dynamically incorporating the roles associated to each specific case study. The system was applied to genetic sequencing data to extract relevant information about insertions, deletions or polymorphisms.
JIB Publications
Library
Julia is a general purpose programming language that was designed for simplifying and accelerating numerical analysis and computational science. In particular the Scientific Machine Learning (SciML) ecosystem of Julia packages includes frameworks for high-performance symbolic-numeric computations. It allows users to automatically enhance high-level descriptions of their models with symbolic preprocessing and automatic sparsification and parallelization of computations. This enables performant solution of differential equations, efficient parameter estimation and methodologies for automated model discovery with neural differential equations and sparse identification of nonlinear dynamics. To give the systems biology community easy access to SciML, we developed SBMLToolkit.jl. SBMLToolkit.jl imports dynamic SBML models into the SciML ecosystem to accelerate model simulation and fitting of kinetic parameters. By providing computational systems biologists with easy access to the open-source Julia ecosystevnm, we hope to catalyze the development of further Julia tools in this domain and the growth of the Julia bioscience community. SBMLToolkit.jl is freely available under the MIT license. The source code is available at https://github.com/SciML/SBMLToolkit.jl.
During the last years several new tools applicable to protein analysis have made available on the IBIVU web site. Recently, a number of tools, ranging from multiple sequence alignment construction to domain prediction, have been updated and/or extended with services for programmatic access using SOAP. We provide an overview of these tools and their application.
JIB PublicationsThe identification of genes and SNPs involved in human diseases remains a challenge. Many public resources, databases and applications, collect biological data and perform annotations, increasing the global biological knowledge. The need of SNPs prioritization is emerging with the development of new high-throughput genotyping technologies, which allow to develop customized disease-oriented chips. Therefore, given a list of genes related to a specific biological process or disease as input, a crucial issue is finding the most relevant SNPs to analyse. The selection of these SNPs may rely on the relevant a-priori knowledge of biomolecular features characterising all the annotated SNPs and genes of the provided list. The bioinformatics approach described here allows to retrieve a ranked list of significant SNPs from a set of input genes, such as candidate genes associated with a specific disease. The system enriches the genes set by including other genes, associated to the original ones by ontological similarity evaluation. The proposed method relies on the integration of data from public resources in a vertical perspective (from genomics to systems biology data), the evaluation of features from biomolecular knowledge, the computation of partial scores for SNPs and finally their ranking, relying on their global score. Our approach has been implemented into a web based tool called SNPRanker, which is accessible through at the URL http://www.itb.cnr.it/snpranker . An interesting application of the presented system is the prioritisation of SNPs related to genes involved in specific pathologies, in order to produce custom arrays.
JIB Publications
Library
Many important aspects of biological knowledge at the molecular level can be represented by . Through their analysis, we gain mechanistic insights and interpret lists of interesting genes from experiments (usually omics and functional genomic experiments). As a result, pathways play a central role in the development of bioinformatics methods and tools for computing predictions from known molecular-level mechanisms. Qualitative as well as quantitative knowledge about pathways can be effectively represented through linking the and the compounds (, proteins) occurring in the considered pathways. So, repositories providing biochemical networks for known pathways play a central role in bioinformatics and in . Here we focus on Reactome, a free, comprehensive, and widely used repository for biochemical networks and pathways. In this paper, we: (1) introduce a tool StARGate-X ( Reactome nEtworkX) to carry out an automated analysis of the connectivity properties of Reactome biochemical reaction network and of its biological hierarchy (, cell compartments, namely, the closed parts within the cytosol, usually surrounded by a membrane); the code is freely available at https://github.com/marinoandrea/stargate-x; (2) show the effectiveness of our tool by providing an analysis of the Reactome network, in terms of centrality measures, with respect to in- and out-degree. As an example of usage of StARGate-X, we provide a detailed automated analysis of the Reactome network, in terms of centrality measures. We focus both on the subgraphs induced by single compartments and on the graph whose nodes are the strongly connected components. To the best of our knowledge, this is the first freely available tool that enables automatic analysis of the large biochemical network within Reactome through easy-to-use APIs ().
Command-line tool
Systems biology Molecular interactions, pathways and networks Genomics
The generation and use of metabolic network reconstructions has increased over recent years. The development of such reconstructions has typically involved a time-consuming, manual process. Recent work has shown that steps undertaken in reconstructing such metabolic networks are amenable to automation. The SuBliMinaL Toolbox (http://www.mcisb.org/subliminal/) facilitates the reconstruction process by providing a number of independent modules to perform common tasks, such as generating draft reconstructions, determining metabolite protonation state, mass and charge balancing reactions, suggesting intracellular compartmentalisation, adding transport reactions and a biomass function, and formatting the reconstruction to be used in third-party analysis packages. The individual modules manipulate reconstructions encoded in Systems Biology Markup Language (SBML), and can be chained to generate a reconstruction pipeline, or used individually during a manual curation process. This work describes the individual modules themselves, and a study in which the modules were used to develop a metabolic reconstruction of Saccharomyces cerevisiae from the existing data resources KEGG and MetaCyc. The automatically generated reconstruction is analysed for blocked reactions, and suggestions for future improvements to the toolbox are discussed.
Visualization is pivotal for gaining insight in systems biology data. As the size and complexity of datasets and supplemental information increases, an efficient, integrated framework for general and specialized views is necessary. MAYDAY is an application for analysis and visualization of general 'omics' data. It follows a trifold approach for data visualization, consisting of flexible data preprocessing, highly customizable data perspective plots for general purpose visualization and systems based plots. Here, we introduce two new systems biology visualization tools for MAYDAY. Efficiently implemented genomic viewers allow the display of variables associated with genomic locations. Multiple variables can be viewed using our new track-based ChromeTracks tool. A functional perspective is provided by visualizing metabolic pathways either in KEGG or BioPax format. Multiple options of displaying pathway components are available, including Systems Biology Graphical Notation (SBGN) glyphs. Furthermore, pathways can be viewed together with gene expression data either as heatmaps or profiles. We apply our tools to two 'omics' datasets of Pseudomonas aeruginosa. The general analysis and visualization tools of MAYDAY as well as our ChromeTracks viewer are applied to a transcriptome dataset. We furthermore integrate this dataset with a metabolome dataset and compare the activity of amino acid degradation pathways between these two datasets, by visually enhancing the pathway diagrams produced by MAYDAY.
JIB Publications
Workflow
Phylogenetics Genomics Sequence analysis Protein structure analysis
Unipro UGENE is an open-source bioinformatics toolkit that integrates popular tools along with original instruments for molecular biologists within a unified user interface. Nowadays, most bioinformatics desktop applications, including UGENE, make use of a local data model while processing different types of data. Such an approach causes an inconvenience for scientists working cooperatively and relying on the same data. This refers to the need of making multiple copies of certain files for every workplace and maintaining synchronization between them in case of modifications. Therefore, we focused on delivering a collaborative work into the UGENE user experience. Currently, several UGENE installations can be connected to a designated shared database and users can interact with it simultaneously. Such databases can be created by UGENE users and be used at their discretion. Objects of each data type, supported by UGENE such as sequences, annotations, multiple alignments, etc., can now be easily imported from or exported to a remote storage. One of the main advantages of this system, compared to existing ones, is the almost simultaneous access of client applications to shared data regardless of their volume. Moreover, the system is capable of storing millions of objects. The storage itself is a regular database server so even an inexpert user is able to deploy it. Thus, UGENE may provide access to shared data for users located, for example, in the same laboratory or institution. UGENE is available at: http://ugene.net/download.html.
Desktop application
Systems biology Molecular interactions, pathways and networks Data visualisation Biomedical science
VANESA is a modeling software for the automatic reconstruction and analysis of biological networks based on life-science database information. Using VANESA, scientists are able to model any kind of biological processes and systems as biological networks. It is now possible for scientists to automatically reconstruct important molecular systems with information from the databases KEGG, MINT, IntAct, HPRD, and BRENDA. Additionally, experimental results can be expanded with database information to better analyze the investigated elements and processes in an overall context. Users also have the possibility to use graph theoretical approaches in VANESA to identify regulatory structures and significant actors within the modeled systems. These structures can then be further investigated in the Petri net environment of VANESA. It is platform-independent, free-of-charge, and available at http://vanesa.sf.net.
Web application
Sequence sites, features and motifs Sequence analysis Database management Gene and protein families Molecular modelling
During the last years several new tools applicable to protein analysis have made available on the IBIVU web site. Recently, a number of tools, ranging from multiple sequence alignment construction to domain prediction, have been updated and/or extended with services for programmatic access using SOAP. We provide an overview of these tools and their application.
Structure, is a widely used software tool to investigate population genetic structure with multi-locus genotyping data. The software uses an iterative algorithm to group individuals into "K" clusters, representing possibly K genetically distinct subpopulations. The serial implementation of this programme is processor-intensive even with small datasets. We describe an implementation of the program within a parallel framework. Speedup was achieved by running different replicates and values of K on each node of the cluster. A web-based user-oriented GUI has been implemented in PHP, through which the user can specify input parameters for the programme. The number of processors to be used can be specified in the background command. A web-based visualization tool "Visualstruct", written in PHP (HTML and Java script embedded), allows for the graphical display of population clusters output from Structure, where each individual may be visualized as a line segment with K colors defining its possible genomic composition with respect to the K genetic sub-populations. The advantage over available programs is in the increased number of individuals that can be visualized. The analyses of real datasets indicate a speedup of up to four, when comparing the speed of execution on clusters of eight processors with the speed of execution on one desktop. The software package is freely available to interested users upon request.
JIB Publications