H-BRS Bibliography
A company's financial documents organize their data in tables as well as text, containing key performance indicators (KPIs), such as profit and loss, and the financial quantities linked to them. The quantity linked to a KPI in a table may not match the quantity of a similarly described KPI in the text. Auditors spend substantial time finding these inconsistencies manually, a process called consistency checking. In contrast to existing work, this paper attempts to automate this task with transformer-based models. For consistency checking, it is essential that the embeddings of the table's KPIs encode both the semantic knowledge of the KPIs and the structural knowledge of the table. This paper therefore proposes a pipeline that uses a tabular model to generate the table KPIs' embeddings. The pipeline takes table and text KPIs as input, generates their embeddings, and then checks whether the two KPIs are identical. The pipeline is evaluated on financial documents in German, and a comparative analysis of the quality of the cell embeddings produced by three tabular models is also presented. In the evaluation, the experiment that used English-translated text and table KPIs together with the Tabbie model achieved an accuracy of 72.81% on the consistency-checking task, outperforming the benchmark and the other tabular models.
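As a rough illustration of the pipeline described in the abstract, the sketch below embeds a table KPI and a text KPI and compares them with cosine similarity against a threshold. The encoder functions, the similarity measure, and the threshold value are assumptions for illustration; the abstract does not specify the actual models' APIs or the decision step.

```python
# Minimal sketch of a table-vs-text KPI consistency check, under assumed
# components: dummy encoders stand in for the tabular model (e.g. Tabbie)
# and the text encoder, and the cosine-similarity threshold is invented.
import numpy as np


def embed_table_kpi(table: list[list[str]], row: int, col: int) -> np.ndarray:
    """Placeholder for the tabular model: embed one table cell.

    A real implementation would feed the whole table to the model so the
    embedding captures table structure; here we return a dummy vector
    seeded by the cell text so identical cells embed identically.
    """
    rng = np.random.default_rng(abs(hash(table[row][col])) % 2**32)
    return rng.standard_normal(128)


def embed_text_kpi(kpi_mention: str) -> np.ndarray:
    """Placeholder for the text encoder: embed a KPI mention from the text."""
    rng = np.random.default_rng(abs(hash(kpi_mention)) % 2**32)
    return rng.standard_normal(128)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def kpis_identical(table, row, col, kpi_mention, threshold=0.8) -> bool:
    """Decide whether a table KPI and a text KPI describe the same quantity."""
    return cosine_similarity(embed_table_kpi(table, row, col),
                             embed_text_kpi(kpi_mention)) >= threshold
```

With the dummy encoders, identical strings map to identical vectors, so the sketch runs end to end; swapping in a real tabular model would mean encoding the full table so that structural context flows into each cell embedding.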
Grading student answers and providing feedback are essential yet time-consuming tasks for educators. Recent advancements in Large Language Models (LLMs), including ChatGPT, Llama, and Mistral, have paved the way for automated support in this domain. This paper investigates the efficacy of instruction-following LLMs in adhering to predefined rubrics for evaluating student answers and delivering meaningful feedback. Leveraging the Mohler dataset and a custom German dataset, we evaluate various models, from commercial ones like ChatGPT to smaller open-source options like Llama, Mistral, and Command R. Additionally, we explore the impact of temperature parameters and techniques such as few-shot prompting. Surprisingly, while few-shot prompting brings grading accuracy closer to the ground truth, it introduces model inconsistency. Furthermore, some models exhibit non-deterministic behavior even at near-zero temperature settings. Our findings highlight the importance of rubrics in enhancing the interpretability of model outputs and fostering consistency in grading practices.
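A minimal sketch of rubric-based grading with an instruction-following LLM, using the OpenAI chat completions client as one possible backend. The model name, rubric wording, and prompt structure are illustrative assumptions, not the paper's exact setup; the near-zero temperature mirrors the determinism question the abstract raises.

```python
# Illustrative rubric-based grading with an instruction-following LLM.
# Assumes the OpenAI Python client (>=1.0) and OPENAI_API_KEY in the
# environment; the rubric and prompts are invented for this sketch.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Grade the student answer against each criterion with yes/no:
1. Defines a random variable correctly. (1 point)
2. Gives a correct example. (1 point)"""


def grade(question: str, student_answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # any instruction-following chat model
        temperature=0.0,       # near-zero temperature, as studied in the paper
        messages=[
            {"role": "system",
             "content": "You are a strict grader. Apply the rubric and "
                        "report yes/no per criterion plus a total score."},
            {"role": "user",
             "content": f"Rubric:\n{RUBRIC}\n\n"
                        f"Question: {question}\n"
                        f"Student answer: {student_answer}"},
        ],
    )
    return response.choices[0].message.content
```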
Integration of Multi-modal Cues in Synthetic Attention Processes to Drive Virtual Agent Behavior
(2017)
This dataset contains questions and answers from an introductory computer science bachelor course on statistics and probability theory at Hochschule Bonn-Rhein-Sieg. The dataset includes three questions and a total of 90 answers, each evaluated using binary (yes/no) rubric criteria, with a specific score attached to each criterion.
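A hypothetical illustration of how such binary rubric judgments could be aggregated into a grade; the field names and scores below are invented for illustration, since the dataset's exact schema is not shown here.

```python
# Invented rubric items: each criterion carries a score and a yes/no
# judgment; the grade is the sum of scores for fulfilled criteria.
rubric_items = [
    {"criterion": "States the correct formula", "score": 2.0, "fulfilled": True},
    {"criterion": "Computes the correct value", "score": 2.0, "fulfilled": False},
    {"criterion": "Interprets the result",      "score": 1.0, "fulfilled": True},
]

grade = sum(item["score"] for item in rubric_items if item["fulfilled"])
max_grade = sum(item["score"] for item in rubric_items)
print(f"{grade}/{max_grade}")  # -> 3.0/5.0
```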