The introduction of gestures as a supplementary input modality is of increasing interest in human-computer interaction design, especially for 3D computer environments. This thesis describes the concepts and development of a gesture recognition system based on the machine learning technique of Hidden Markov Models. Well known from the field of speech recognition, this statistical method is employed here to represent and recognize predefined gestures. Within this work, gestures are defined as symbols, such as simple geometric shapes or Roman letters. They are extracted from a stream of three-dimensional optical tracking data, which is resampled, reduced to 2D and quantized to serve as input to discrete Hidden Markov Models. A set of prerecorded training data is used to learn the parameters of the models, and recognition is achieved by evaluating the trained models. The devised system was used to augment an existing virtual reality prototype application which serves as a demonstration and development platform for the VRGeo consortium. The consortium's goal is to investigate and utilize the benefits of virtual reality technology for the oil and gas industry. An isolated test of the system with seven gestures showed accuracies of up to 98.57%, and the review from experts in the fields of virtual reality and geophysics was predominantly positive.
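The quantize-then-evaluate step described in this abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not say how the 2D data is quantized, so the 8-direction code used here is an assumption, and the function names (`quantize_directions`, `forward_log_likelihood`, `classify`) and the hand-written model parameters are hypothetical. Training (for example with Baum-Welch) is omitted; only the evaluation of already-trained discrete models with the forward algorithm is shown.

```python
import math

def quantize_directions(points, n_bins=8):
    """Turn a 2D trajectory into discrete symbols 0..n_bins-1.

    Assumption: each movement step is coded by its direction angle.
    """
    symbols = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if (x0, y0) == (x1, y1):
            continue  # skip zero-length steps, which have no direction
        angle = math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)
        symbols.append(int(round(angle / (2 * math.pi / n_bins))) % n_bins)
    return symbols

def forward_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    if not obs:
        return 0.0
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_prob

def classify(obs, models):
    """Pick the gesture whose model assigns the highest likelihood."""
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))
```

Each entry of `models` is assumed to be a `(start, trans, emit)` tuple of plain lists, with one such model per predefined gesture.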
The recent explosion of available audio-visual media poses a new challenge for information retrieval research. Automatic speech recognition (ASR) systems translate spoken content into the text domain. There is a need for searching and indexing this data, which possesses no logical structure. One possible way to structure it at a high level of abstraction is by finding topic boundaries. Two unsupervised topic segmentation methods were evaluated with real-world data in the course of this work. The first one, TSF, models topic shifts as fluctuations in the similarity function of the transcript. The second one, LCSeg, treats topic changes as the places with the fewest overlapping lexical chains. Only LCSeg came close to the results reported for a similar real-world corpus; other reported results could not be outperformed. Topic analysis based on models of repeated word usage renders topic changes more ambiguous than expected. This issue has more impact on the segmentation quality than the state-of-the-art ASR word error rate. It is concluded that topic segmentation algorithms should be developed with real-world data to avoid potential biases toward artificial data. Unlike the evaluated approaches based on word usage analysis, methods operating with local contexts can be expected to perform better through emulation of semantic dependencies.
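The general idea of locating topic boundaries at dips in a lexical similarity function can be sketched as follows. This is a generic illustration in the spirit of similarity-based segmentation, not an implementation of TSF or LCSeg; the window size, the threshold, and all function names are assumptions chosen for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def gap_similarities(sentences, window=2):
    """Similarity between the word windows to the left and right of each gap."""
    bags = [Counter(s.lower().split()) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        sims.append(cosine(left, right))
    return sims

def boundaries(sims, threshold=0.1):
    """Indices of sentences that start a new segment.

    A gap counts as a boundary if its similarity is a local minimum
    and lies below the (assumed) threshold.
    """
    result = []
    for i, s in enumerate(sims):
        left_ok = i == 0 or s <= sims[i - 1]
        right_ok = i == len(sims) - 1 or s <= sims[i + 1]
        if s < threshold and left_ok and right_ok:
            result.append(i + 1)
    return result
```

Because the sketch relies purely on repeated word usage, it shares the limitation the abstract points out: when vocabulary overlaps across topics, the dips become shallow and the boundaries ambiguous.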
Semantic Image Segmentation Combining Visible and Near-Infrared Channels with Depth Information
(2015)
Image understanding is a vital task in computer vision that has many applications in areas such as robotics, surveillance and the automobile industry. An important precondition for image understanding is semantic image segmentation, i.e. the correct labeling of every image pixel with its corresponding object name or class. This thesis proposes a machine learning approach for semantic image segmentation that uses images from a multi-modal camera rig. It demonstrates that semantic segmentation can be improved by combining different image types as inputs to a convolutional neural network (CNN), when compared to a single-image approach. In this work a multi-channel near-infrared (NIR) image, an RGB image and a depth map are used. The detection of people is further improved by using a skin image that indicates the presence of human skin in the scene and is computed based on NIR information. It is also shown that segmentation accuracy can be enhanced by using a class voting method based on a superpixel pre-segmentation. Models are trained for 10-class, 3-class and binary classification tasks using an original dataset. Compared to the NIR-only approach, average class accuracy is increased by 7% for 10-class, and by 22% for 3-class classification, reaching a total of 48% and 70% accuracy, respectively. The binary classification task, which focuses on the detection of people, achieves a classification accuracy of 95% and true positive rate of 66%. The report at hand describes the proposed approach and the encountered challenges and shows that a CNN can successfully learn and combine features from multi-modal image sets and use them to predict scene labeling.
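The superpixel-based class voting mentioned in this abstract can be illustrated with a small sketch. The abstract does not specify the voting rule, so a simple majority vote is assumed here, and the function name `superpixel_vote` is hypothetical. The inputs are a grid of per-pixel class predictions (as a CNN might produce) and a grid of superpixel IDs from some pre-segmentation; every pixel is then relabeled with the most frequent class within its superpixel.

```python
from collections import Counter, defaultdict

def superpixel_vote(pixel_labels, superpixel_ids):
    """Relabel each pixel with the majority class of its superpixel.

    pixel_labels and superpixel_ids are equally sized 2D lists (rows of pixels).
    Ties are resolved in favor of the class encountered first in scan order.
    """
    votes = defaultdict(Counter)
    for label_row, sp_row in zip(pixel_labels, superpixel_ids):
        for label, sp in zip(label_row, sp_row):
            votes[sp][label] += 1
    winner = {sp: counts.most_common(1)[0][0] for sp, counts in votes.items()}
    return [[winner[sp] for sp in sp_row] for sp_row in superpixel_ids]
```

The effect is a smoothing step: isolated misclassified pixels inside an otherwise consistent superpixel are overruled by their neighbors.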