A Comparison of Vector-based Approaches for Document Similarity Using the RELISH Corpus
The continuously increasing number of biomedical scholarly publications makes it challenging to construct document recommendation algorithms that can efficiently navigate through literature. Such algorithms would help researchers in finding similar, relevant, and related publications that align with their research interests. Natural Language Processing offers various alternatives to compare publications, ranging from entity recognition to document embeddings. In this paper, we present the results of a comparative analysis of vector-based approaches to assess document similarity in the RELISH corpus. We aim to determine the best approach that resembles relevance without the need for further training. Specifically, we employ five different techniques to generate vectors representing the text in the documents. These techniques employ a combination of various Natural Language Processing frameworks such as Word2Vec, Doc2Vec, dictionary-based Named Entity Recognition, and state-of-the-art models based on BERT. To evaluate the document similarity obtained by these approaches, we utilize different evaluation metrics that account for relevance judgment, relevance search, and re-ranking of the relevance search. Our results demonstrate that the most promising approach is an in-house version of document embeddings, starting with word embeddings and using centroids to aggregate them by document.
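The centroid idea mentioned at the end of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `centroid` and `cosine_similarity`, the two-dimensional toy vectors, and the choice of cosine similarity as the comparison measure are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = List[float]


def centroid(tokens: Sequence[str], word_vectors: Dict[str, Vector]) -> Optional[Vector]:
    """Average the vectors of all in-vocabulary tokens of one document.

    Returns None when no token has a vector (hypothetical convention).
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy word embeddings (assumed values, not taken from any trained model).
toy_vectors: Dict[str, Vector] = {
    "gene": [1.0, 0.0],
    "protein": [0.8, 0.2],
    "market": [0.0, 1.0],
}

doc_a = centroid(["gene", "protein"], toy_vectors)
doc_b = centroid(["protein", "gene", "unknown"], toy_vectors)  # "unknown" is skipped
doc_c = centroid(["market"], toy_vectors)

sim_ab = cosine_similarity(doc_a, doc_b)  # documents sharing vocabulary
sim_ac = cosine_similarity(doc_a, doc_c)  # documents with unrelated vocabulary
```

In a realistic setting the dictionary would presumably be replaced by vectors from a trained word embedding model such as Word2Vec, and the documents would be tokenized titles and abstracts; those details are outside what the abstract states.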
Document Type: | Conference Object |
---|---|
Language: | English |
Author: | Rohitha Ravinder, Tim Fellerhof, Vishnu Dadi, Lukas Geist, Guillermo Rocamora, Muhammad Talha, Dietrich Rebholz-Schuhmann, Leyla Jael Castro |
Parent Title (English): | Proceedings SeWebMeDa-2023: 6th International Workshop on Semantic Web solutions for large-scale biomedical data analytics, May 29, 2023, Hersonissos, Greece |
Number of pages: | 10 |
URN: | urn:nbn:de:hbz:1044-opus-75198 |
URL: | https://ceur-ws.org/Vol-3466#paper5 |
Publisher: | RWTH Aachen |
Place of publication: | Aachen, Germany |
Publishing Institution: | Hochschule Bonn-Rhein-Sieg |
Date of first publication: | 2023/08/29 |
Copyright: | © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). |
Funding: | This work was partially supported by the STELLA project funded by the Deutsche Forschungsgemeinschaft DFG (project no. 407518790), the NFDI4DataScience project also funded by DFG (project no. 460234259), and the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A). |
Keywords: | Document relevance; Document similarity; Named Entity Recognition; Recommendation systems; Word embeddings |
Dewey Decimal Classification (DDC): | 0 Computer science, information science, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science |
Entry in this database: | 2023/09/15 |
Licence: | Creative Commons - CC BY - Attribution 4.0 International |