STonKGs: A Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs

The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models (KGEMs). However, representations based on a single modality are inherently limited. To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs. This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler (INDRA) consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against two baseline models trained on either one of the modalities (i.e., text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the more challenging tasks with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.083. Additionally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications. Finally, the source code and pre-trained STonKGs models are available at https://github.com/stonkgs/stonkgs and https://huggingface.co/stonkgs/stonkgs-150k.
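
As an illustration of the "combined input sequences" mentioned in the abstract, the following minimal Python sketch joins the tokens of an evidence sentence with the elements of a KG triple into one sequence, using a segment mask to mark the two modalities. It is not the authors' implementation: the function name `build_combined_sequence`, the special tokens, the `max_text_len` parameter, and the example text-triple pair are all hypothetical, and the actual input encoding used by STonKGs is documented in the linked repository.

```python
from typing import List, Tuple

# Hypothetical special tokens, chosen here for illustration only.
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def build_combined_sequence(
    text_tokens: List[str],
    triple: Tuple[str, str, str],
    max_text_len: int = 8,
) -> Tuple[List[str], List[int]]:
    """Join a tokenized sentence and a KG triple into one input sequence.

    Returns the combined token sequence and a parallel list of segment ids
    (0 = text modality, 1 = knowledge graph modality).
    """
    head, relation, tail = triple

    # Truncate or pad the text part to a fixed length.
    text_part = text_tokens[:max_text_len]
    text_part = text_part + [PAD] * (max_text_len - len(text_part))

    # Layout (an assumption): [CLS] text ... [SEP] head relation tail [SEP]
    sequence = [CLS] + text_part + [SEP] + [head, relation, tail] + [SEP]

    # [CLS], the text tokens and the first [SEP] belong to segment 0;
    # the triple and the final [SEP] belong to segment 1.
    segment_ids = [0] * (max_text_len + 2) + [1] * 4
    return sequence, segment_ids


# Made-up text-triple pair, used only to show the call.
tokens = ["TP53", "inhibits", "MDM2", "in", "tumor", "cells"]
sequence, segment_ids = build_combined_sequence(tokens, ("TP53", "decreases", "MDM2"))
print(sequence)
print(segment_ids)
```

A model consuming such a sequence would map each position to an embedding from the matching modality; the sketch stops at the sequence layout and makes no claim about how STonKGs embeds or masks its inputs.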

Document Type: Preprint
Author: Helena Balabin, Charles Tapley Hoyt, Colin Birkenbihl, Benjamin M. Gyori, John Bachman, Alpha Tom Kodamullil, Paul G. Plöger, Martin Hofmann-Apitius, Daniel Domingo-Fernández
Parent Title (English): bioRxiv
Number of pages: 16
Publisher: Cold Spring Harbor Laboratory
Date of first publication: 2021/08/18
Publication status: published in Bioinformatics, Volume 38, Issue 6, 15 March 2022, Pages 1648–1656, doi:10.1093/bioinformatics/btac001
Copyright: The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Keyword: Bioinformatics; Knowledge Graphs; Machine Learning; Natural Language Processing; Transformers
Departments, institutes and facilities: Department of Computer Science (Fachbereich Informatik)
Dewey Decimal Classification (DDC): 0 Computer science, information and general works / 00 Computer science, knowledge and systems / 006 Special computer methods
Entry in this database: 2021/08/31
Licence (German): Creative Commons - CC BY - Attribution 4.0 International