A Comparison of Vector-based Approaches for Document Similarity Using the RELISH Corpus

Ravinder, Rohitha; Fellerhof, Tim; Dadi, Vishnu; Geist, Lukas; Rocamora, Guillermo; Talha, Muhammad; Rebholz-Schuhmann, Dietrich; Castro, Leyla Jael

The continuously increasing number of biomedical scholarly publications makes it challenging to construct document recommendation algorithms that can efficiently navigate the literature. Such algorithms would help researchers find similar, relevant, and related publications that align with their research interests. Natural Language Processing offers various alternatives for comparing publications, ranging from entity recognition to document embeddings. In this paper, we present the results of a comparative analysis of vector-based approaches to assessing document similarity in the RELISH corpus. We aim to determine which approach best approximates relevance without the need for further training. Specifically, we employ five different techniques to generate vectors representing the text of the documents. These techniques combine various Natural Language Processing frameworks such as Word2Vec, Doc2Vec, dictionary-based Named Entity Recognition, and state-of-the-art models based on BERT. To evaluate the document similarity obtained by these approaches, we use different evaluation metrics that account for relevance judgment, relevance search, and re-ranking of the relevance search. Our results show that the most promising approach is an in-house version of document embeddings, which starts with word embeddings and uses centroids to aggregate them by document.
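
The centroid idea mentioned at the end of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy word vectors, the function names, and the use of cosine similarity as the comparison measure are all assumptions made for illustration. In practice the word vectors would come from a trained model such as Word2Vec.

```python
import math

# Toy word vectors with made-up values (an assumption for illustration only;
# a real setup would load vectors from a trained word embedding model).
TOY_WORD_VECTORS = {
    "protein": [1.0, 0.0, 0.0],
    "gene": [0.8, 0.2, 0.0],
    "cell": [0.6, 0.4, 0.0],
    "galaxy": [0.0, 0.0, 1.0],
    "star": [0.0, 0.2, 0.8],
}


def centroid(tokens, word_vectors):
    """Average the vectors of all in-vocabulary tokens.

    Returns None when no token of the document is in the vocabulary.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def document_similarity(tokens_a, tokens_b, word_vectors):
    """Compare two tokenized documents through their centroid vectors."""
    centroid_a = centroid(tokens_a, word_vectors)
    centroid_b = centroid(tokens_b, word_vectors)
    if centroid_a is None or centroid_b is None:
        return 0.0
    return cosine_similarity(centroid_a, centroid_b)


if __name__ == "__main__":
    doc_a = ["protein", "gene"]
    doc_b = ["gene", "cell"]
    doc_c = ["galaxy", "star"]
    print(document_similarity(doc_a, doc_b, TOY_WORD_VECTORS))
    print(document_similarity(doc_a, doc_c, TOY_WORD_VECTORS))
```

With these toy vectors, the two biomedical-looking documents score higher against each other than against the astronomy-looking one. Averaging needs no further training once word vectors exist, which fits the abstract's stated goal of avoiding additional training.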

Metadata
Document Type:	Conference Object
Language:	English
Author:	Rohitha Ravinder, Tim Fellerhof, Vishnu Dadi, Lukas Geist, Guillermo Rocamora, Muhammad Talha, Dietrich Rebholz-Schuhmann, Leyla Jael Castro
Parent Title (English):	Proceedings SeWebMeDa-2023: 6th International Workshop on Semantic Web solutions for large-scale biomedical data analytics, May 29, 2023, Hersonissos, Greece
Number of pages:	10
URN:	urn:nbn:de:hbz:1044-opus-75198
URL:	https://ceur-ws.org/Vol-3466#paper5
Publisher:	RWTH Aachen
Place of publication:	Aachen, Germany
Publishing Institution:	Hochschule Bonn-Rhein-Sieg
Date of first publication:	2023/08/29
Copyright:	© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Funding:	This work was partially supported by the STELLA project funded by the Deutsche Forschungsgemeinschaft DFG (project no. 407518790), the NFDI4DataScience project also funded by DFG (project no. 460234259), and the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A).
Keyword:	Document relevance; Document similarity; Named Entity Recognition; Recommendation systems; Word embeddings
Dewey Decimal Classification (DDC):	0 Computer science, information, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Entry in this database:	2023/09/15
Licence:	Creative Commons - CC BY - Attribution 4.0 International

Open Access
