OntoClue, a framework to compare vector-based approaches for document relatedness using the RELISH corpus
The continuous growth in biomedical scholarly publications makes it challenging to build document recommendation algorithms that help researchers navigate the literature and keep up with relevant publications. Understanding the semantic relatedness and similarity between two documents could improve document recommendations. The objective of this study is to perform a comparative analysis of vector-based approaches for assessing document similarity in the RELISH corpus. Here we present our approach to comparing five different techniques for generating vectors that represent the text of the documents. These techniques combine various Natural Language Processing frameworks, such as Word2Vec, Doc2Vec, and dictionary-based Named Entity Recognition, as well as state-of-the-art models based on BERT.
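As a rough illustration of what a vector-based comparison of two documents can look like, the following minimal sketch averages word vectors into one vector per document and scores document pairs with cosine similarity. The abstract does not state which pooling strategy or similarity measure the authors use, so mean pooling and cosine similarity are assumptions here. The `TOY_VECTORS` table and the function names are hypothetical stand-ins for embeddings that would normally come from a trained model such as Word2Vec.

```python
import math

# Hypothetical toy word vectors; a real run would load trained embeddings.
TOY_VECTORS = {
    "protein": [1.0, 0.0, 0.0],
    "gene": [0.8, 0.2, 0.0],
    "kinase": [0.9, 0.1, 0.0],
    "galaxy": [0.0, 0.0, 1.0],
    "telescope": [0.0, 0.1, 0.9],
}


def document_vector(tokens, vectors=TOY_VECTORS):
    """Build a document vector by averaging the vectors of known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two documents on a shared topic and one on an unrelated topic.
doc_a = document_vector(["protein", "kinase"])
doc_b = document_vector(["gene", "protein"])
doc_c = document_vector(["galaxy", "telescope"])

sim_related = cosine_similarity(doc_a, doc_b)
sim_unrelated = cosine_similarity(doc_a, doc_c)
print(sim_related > sim_unrelated)
```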
Document Type: | Conference Object |
---|---|
Language: | English |
Author: | Rohitha Ravinder, Tim Fellerhoff, Vishnu Dadi, Lukas Geist, Guillermo Rocamora, Muhammad Talha, Dietrich Rebholz-Schuhmann, Leyla Jael Castro |
Parent Title (English): | Proceedings Semantic Web Applications and Tools for Healthcare and Life Sciences, February 13–16, 2023, Basel, Switzerland |
Number of pages: | 2 |
First Page: | 159 |
Last Page: | 160 |
ISSN: | 1613-0073 |
URN: | urn:nbn:de:hbz:1044-opus-73854 |
URL: | https://ceur-ws.org/Vol-3415/#paper-38 |
URL: | https://nbn-resolving.org/urn:nbn:de:0074-3415-0 |
Publisher: | RWTH Aachen |
Place of publication: | Aachen, Germany |
Publishing Institution: | Hochschule Bonn-Rhein-Sieg |
Date of first publication: | 2023/06/22 |
Copyright: | © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). |
Funding: | This work was partially supported by the STELLA project funded by DFG (project no. 407518790), the NFDI4DataScience project funded by GWK and DFG (no. NFDI 34/1), and the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A) |
Keyword: | Named Entity Recognition; Word embeddings; document similarity |
Dewey Decimal Classification (DDC): | 0 Computer science, information science, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science |
Entry in this database: | 2023/07/04 |
Licence (German): | Creative Commons - CC BY - Namensnennung 4.0 International |