Classifying Unstructured Data into Natural Language Text and Technical Information

Software repository data, for example in issue tracking systems, contain both natural language text and technical information, the latter ranging from log files and code snippets to stack traces. However, data mining is often interested in only one of the two types, e.g., in natural language text in the case of text mining. Regardless of which type is being investigated, any technique used has to deal with noise caused by fragments of the other type, i.e., methods interested in natural language have to deal with technical fragments and vice versa. This paper proposes an approach to classify unstructured data, e.g., development documents, into natural language text and technical information using a mixture of text heuristics and agglomerative hierarchical clustering. The approach was evaluated using 225 manually annotated text passages from developer emails and issue tracker data. Using white space tokenization as a basis, the overall precision of the approach is 0.84 and the recall is 0.85.
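
The general idea of combining a text heuristic with agglomerative clustering can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regular expression, the per-line granularity, the centroid-linkage merge rule, and all function names (technical_score, agglomerate, classify_lines, precision_recall) are assumptions made purely for illustration. Each line is split on white space, scored by the share of tokens that look technical, and the scores are then merged bottom-up into two clusters.

```python
import re

# Hypothetical heuristic: a token "looks technical" if it contains typical
# code punctuation, a run of two or more digits, or a dotted identifier.
TECH_PATTERN = re.compile(r"[{}();=<>\[\]_/\\]|\d{2,}|\w\.\w")


def technical_score(line):
    """Share of white-space-separated tokens that match TECH_PATTERN."""
    tokens = line.split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if TECH_PATTERN.search(token))
    return hits / len(tokens)


def agglomerate(values, k=2):
    """Bottom-up clustering of 1-D values: repeatedly merge the two clusters
    whose means are closest (centroid linkage) until k clusters remain."""
    clusters = [[i] for i in range(len(values))]

    def mean(cluster):
        return sum(values[i] for i in cluster) / len(cluster)

    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = abs(mean(clusters[a]) - mean(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def classify_lines(lines):
    """Label each line 'tech' or 'text'; assumes at least two lines."""
    scores = [technical_score(line) for line in lines]
    clusters = agglomerate(scores, k=2)
    # The cluster with the higher mean score is taken to be technical.
    tech = max(clusters, key=lambda c: sum(scores[i] for i in c) / len(c))
    return ["tech" if i in tech else "text" for i in range(len(lines))]


def precision_recall(predicted, gold, positive="tech"):
    """Precision and recall of the `positive` label over paired labels."""
    tp = sum(p == positive and g == positive for p, g in zip(predicted, gold))
    fp = sum(p == positive and g != positive for p, g in zip(predicted, gold))
    fn = sum(p != positive and g == positive for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example passage; not taken from the paper's evaluation data.
lines = [
    "The application crashes when I open the settings dialog.",
    "java.lang.NullPointerException at com.example.Main.run(Main.java:42)",
    "Could you please check whether this also happens on Linux?",
    "int x = foo(bar[3]); return x;",
]
gold = ["text", "tech", "text", "tech"]
labels = classify_lines(lines)
precision, recall = precision_recall(labels, gold)
```

The sketch evaluates per line for brevity; the reported precision of 0.84 and recall of 0.85 are based on white space tokenization, so a closer approximation would count matching tokens rather than matching lines.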

Metadata
Document Type: Conference Object
Language: English
Author: Thorsten Merten, Bastian Mager, Simone Bürsner, Barbara Paech
Parent Title (English): MSR 2014. Proceedings of the 11th Working Conference on Mining Software Repositories. May 31 - June 1, 2014, Hyderabad, India
First Page: 300
Last Page: 303
ISBN: 978-1-4503-2863-0
DOI: https://doi.org/10.1145/2597073.2597112
Publisher: Association for Computing Machinery
Place of publication: New York, NY, United States
Date of first publication: 2014/05/31
Copyright: Copyright is held by the author/owner(s). Publication rights licensed to ACM. Abstracting with credit is permitted.
Funding: This work is partly funded by the Bonn-Rhein-Sieg University of Applied Sciences Graduate Institute.
Keyword: Mining software repositories; heuristics; hierarchical clustering; preprocessing; unstructured data
Departments, institutes and facilities: Fachbereich Informatik (Department of Computer Science)
Dewey Decimal Classification (DDC): 0 Computer science, information science, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Entry in this database: 2015/08/21