Refine
Departments, institutes and facilities
Document Type
- Master's Thesis (36) (remove)
Year of publication
Language
- English (36) (remove)
Keywords
- Active Learning (2)
- Computer Vision (2)
- Emergency support system (2)
- Mobile sensors (2)
- Object Detection (2)
- deep learning (2)
- object detection (2)
- 3D-Scanner (1)
- ASAG (1)
- Adaptive mesh refinement (1)
In order to help journalists investigate inside large audiovisual archives, as maintained by news broadcast agencies, the multimedia data must be indexed by text-based search engies. By automatically creating a transcript through automatic speech recognition (ASR), the spoken word becomes accessible to text search, and queries for keywords are made possible. But stil, important contextual information like the identity of the speaker is not captured. Especially when gathering original footage in the political domain, the identity of the speaker can be the most important query constraint, although this name may not be prominent in the words spoken. It is thus desireable to have this information provided explicitely to the search engine. To provide this information, the archive must be an alyzed by automatic Speaker Identification (SID). While this research topic has seen substantial gains in accuracy and robustness over last years, it has not yet established itself as a helpful, large-scale tool outside the research community. This thesis sets out to establish a workflow to provide automatic speaker identification. Its application is to help journalists searching on speeches given in the German parliament (Bundestag). This is a contribution to the News-Stream 3.0 project, a BMBF funded research project that addresses accessibility of various data sources for journalists.
Modern engineering relies heavily on utilizing computer technologies. This is especially true for thermoplastic manufacturing, such as blow molding. A crucial milestone for digitalization is the continuous integration of data in unified or interoperable systems. While new simulation technologies are constantly developed, data management standards such as STEP fail at integrating them. On the other hand, industrial standards such as ”VMAP” manage to improve interoperability for Small and Medium-sized Enterprises. However, they do not provide Simulation Process and Data Management (SPDM) technologies. For SPDM integration of VMAP data, Ontology-Based Data Access is used to allow continuing the digital thread in custom semantic-based open-source solutions. An ontology of the database format (VMAP) was generated alongside an expandable knowledge graph of data access methods. A Python-based software architecture was developed, automatically using the semantic representations of database format and data access to query data and metadata within the VMAP file. The result is a software architecture template that can be adapted for other data standards and integrated into semantic data management systems. It allows semantic queries on simulation data down to element-wise resolution without integrating the whole model information. The architecture can instantiate a file in a knowledge graph, query a file’s metadatum and, in case it is not yet available, find a semantically represented process that allows the creation and instantiation of the required metadatum. See Figure 1. The results of this thesis can be expected to form a basis for semantic SPDM tools.
Semantic Image Segmentation Combining Visible and Near-Infrared Channels with Depth Information
(2015)
Image understanding is a vital task in computer vision that has many applications in areas such as robotics, surveillance and the automobile industry. An important precondition for image understanding is semantic image segmentation, i.e. the correct labeling of every image pixel with its corresponding object name or class. This thesis proposes a machine learning approach for semantic image segmentation that uses images from a multi-modal camera rig. It demonstrates that semantic segmentation can be improved by combining different image types as inputs to a convolutional neural network (CNN), when compared to a single-image approach. In this work a multi-channel near-infrared (NIR) image, an RGB image and a depth map are used. The detection of people is further improved by using a skin image that indicates the presence of human skin in the scene and is computed based on NIR information. It is also shown that segmentation accuracy can be enhanced by using a class voting method based on a superpixel pre-segmentation. Models are trained for 10-class, 3-class and binary classification tasks using an original dataset. Compared to the NIR-only approach, average class accuracy is increased by 7% for 10-class, and by 22% for 3-class classification, reaching a total of 48% and 70% accuracy, respectively. The binary classification task, which focuses on the detection of people, achieves a classification accuracy of 95% and true positive rate of 66%. The report at hand describes the proposed approach and the encountered challenges and shows that a CNN can successfully learn and combine features from multi-modal image sets and use them to predict scene labeling.
The Java Virtual Machine (JVM) executes the compiled bytecode version of a Java program and acts as a layer between the program and the operating system. The JVM provides additional features such as Process, Thread, and Memory Management to manage the execution of these programs. The Garbage Collection (GC) is part of the memory management and has an impact on the overall runtime performance because it is responsible for removing dead objects from the heap. Currently, the execution of a program needs to be halted during every GC run. The problem of this stop-the-world approach is that all threads in the JVM need to be suspended. It would be desirable to have a thread-local GC that only blocks the current thread and does not affect any other threads. In particular, this would improve the execution of multi-threaded Java programs. An object that is accessible by more than one thread is called escaped. It is not possible to thread-locally determine if escaped objects are still alive so that they cannot be handled in a thread-local GC. To gain significant performance improvements with a thread-local GC, it is therefore necessary to determine if it is possible to reliably predict if a given object will escape. Experimental results show that the escaping of objects can be predicted with high accuracy based on the line of code the object was allocated from. A thread-local GC was developed to minimize the number of stop-the-world GCs. The prototype implementation delivers a proof-of-concept that shows that this goal can be achieved in certain scenarios.
An analysis of sharing string objects with the Java Virtual Machine was conducted; they are the most used objects in Java programs and they are immutable - thus they are read-only and easily identified. While the results are promising, it is clear that sharing more objects would result in better performance. Automatic object selection for sharing is non-trivial, because in the current state only read-only objects can be shared. This attribute can not be easily determined during runtime by an algorithm; the developer on the other hand can. This thesis presents the development of an Application Programmer Interface (API) that allows programmers to use the Java Virtual Machine (JVM) internal sharing functionality. Furthermore, we present the usage of the sharing API. Open-source software was used as real-world test cases. Afterwards the evaluation shows that the ratio between memory savings and start-up time overhead is reasonable.
RNA is one of the most important molecules in living organisms. One of its main functions is to regulate gene expression. This involves binding to and forming a joint structure with a messenger RNA. An RNAs functions is determined by its sequence and the structure it folds into. Accordingly, the prediction of individual as well as joint structures is an important area of research. In this thesis a method for the prediction of RNA-RNA joint structure using their minimum free energy (mfe) structures was developed. It is able to extensively explore the joint structural landscape of two interacting RNAs by taking advantage of the locality of changes in the RNAs structures as well as natural and energetic constraints. The method predicts the mfe joint structure as well as alternative stable joint structures while also computing non-optimal folding pathways from the unbound individual mfe structures to the predicted joint structures. It is shown how an enumeration approach is used which is able to deal with the enormous search space as well as to avoid any cyclic behaviour. The method is evaluated using two standard datasets of known interacting RNAs and shows good results.
In the field of autonomous robotics, sensors have played a major role in defining the scope of technology and to a great extent, limitations of it as well. This cycle of constant updates and hence technological advancement has made given birth to some serious industries which were once inconceivable. Industries like autonomous driving which has a serious impact on safety and security of people, also has an equally harsh implication on the dynamics and economics of the market. With sensors like LiDAR and RADAR delivering 3D measurements as point clouds, there is a necessity to process the raw measurements directly and many research groups are working on the same. A sizable research has gone in solving the task of object detection on 2D images. In this thesis we aim to develop a LiDAR based 3D object detection scheme. We combine the ideas of PointPillars and feature pyramid networks from 2D vision to propose Pillar-FPN. The proposed method directly takes 3D point clouds as input and outputs a 3D bounding box. Our pipeline consists of multiple variations of proposed Pillar-FPN at the feature fusion level that are described in the results section. We have trained our model on the KITTI train dataset and evaluated it on KITTI validation dataset.
This project focuses on object detection in dense volume data. There are several types of dense volume data, namely Computed Tomography (CT) scan, Positron Emission Tomography (PET), Magnetic Resonance Imaging (MRI). This work focuses on CT scans. CT scans are not limited to the medical domain; they are also used in industries. CT scans are used in airport baggage screening, assembly lines, and the object detection systems in these places should be able to detect objects fast. One of the ways to address the issue of computational complexity and make the object detection systems fast is to use low-resolution images. Low-resolution CT scanning is fast. The entire process of scanning and detection can be made faster by using low-resolution images. Even in the medical domain, to reduce the rad iation dose, the exposure time of the patient should be reduced. The exposure time of patients could be reduced by allowing low-resolution CT scans. Hence it is essential to find out which object detection model has better accuracy as well as speed at low-resolution CT scans. However, the existing approaches did not provide details about how the model would perform when the resolution of CT scans is varied. Hence in this project, the goal is to analyze the impact of varying resolution of CT scans on both the speed and accuracy of the model. Three object detection models, namely RetinaNet, YOLOv3, and YOLOv5, were trained at various resolutions. Among the three models, it was found that YOLOv5 has the best mAP and f1 score at multiple resolutions on the DeepLesion dataset. RetinaNet model h as the least inference time on the DeepLesion dataset. From the experiments, it could be asserted that sacrificing mean average precision (mAP) to improve inference time by reducing resolution is feasible.
This thesis proposes a multi-label classification approach using the Multimodal Transformer (MulT) [80] to perform multi-modal emotion categorization on a dataset of oral histories archived at the Haus der Geschichte (HdG). Prior uni-modal emotion classification experiments conducted on the novel HdG dataset provided less than satisfactory results. They uncovered issues such as class imbalance, ambiguities in emotion perception between annotators, and lack of representative training data to perform transfer learning [28]. Hence, the objectives of this thesis were to achieve better results by performing a multi-modal fusion and resolving the problems arising from class imbalance and annotator-induced bias in emotion perception. A further objective was to assess the quality of the novel HdG dataset and benchmark the results using SOTA techniques. Through a literature survey on the challenges, models, and datasets related to multi-modal emotion recognition, we created a methodology utilizing the MulT along with a multi-label classification approach. This approach produced a considerable improvement in the overall emotion recognition by obtaining an average AUC of 0.74 and Balanced-accuracy of 0.70 on the HdG dataset, which is comparable to state-of-the-art (SOTA) results on other datasets. In this manner, we were also able to benchmark the novel HdG dataset as well as introduce a novel multi-annotator learning approach to understand each annotator’s relative strengths and weaknesses for emotion perception. Our evaluation results highlight the potential benefits of the novel multi-annotator learning approach in improving overall performance by resolving the problems arising from annotator-induced bias and variation in the perception of emotions. Complementing these results, we performed a further qualitative analysis of the HdG annotations with a psychologist to study the ambiguities found in the annotations. We conclude that the ambiguities in annotations may have resulted from a combination of several socio-psychological factors and systemic issues associated with the process of creating these annotations. As these problems are also present in most multi-modal emotion recognition datasets, we conclude that the domain could benefit from a set of annotation guidelines to create standardized datasets.
Today publications are digitally available which enables researchers to search the text and often also the content of tables. On the contrary, images cannot be searched which is not a problem for most fields, but in chemistry most of the information are contained in images, especially structure diagrams. Next to the "normal" chemical structures, which represent exactly one molecule, there also exist generic structures, so called Markush structures. These contain variable parts and additional textual information which enable them to represent several molecules at once. This can vary between just a few and up to thousands or even millions. This ability lead to a spread of Markush structures in patents, because it enables patents to protect entire families of molecules at once. Next to the prevention of an enumeration of all structures it also has the advantage that, if a Markush structure is used in a patent, it is much harder to determine whether a specific structure is protected by it or not. To solve the question about the protection of a structure, it is necessary to search the patents. Appropriate databases for this task already do exist, but are filled manually. An automatic processing does not yet exist. In this project a Markush structure reconstruction prototype is developed which is able to reconstruct bitmaps including Markush structures (meaning a depiction of the structure and a text part describing the generic parts) into a digital format and save them in the newly developed context-free grammar based file format extSMILES. This format is searchable due to its context-free grammar based design. To be able to develop a Markush structure reconstruction prototype, an in depth analysis of the concept of Markush structures and their requirements for a reconstruction process was performed. Thereby it is stated, that the common connection table concept of the existing file formats is not able to store Markush structures. Especially challenging are conditions for most of the formats. Thus, a context-free grammar based file format is developed, which extends the SMILES format. This extSMILES called format assures the searchability of the results by its context-free grammar based concept, and is able to store all information contained in Markush structures. In addition it is generic, extendable and easily understandable. The developed prototype for the Markush structure reconstruction uses extSMILES as output format and is based on the chemical structure recognition tool chemoCR and the Unstructured Information Management Architecture UIMA. For chemoCR modules are developed which enable it to recognize and assemble Markush structures as well as to return the reconstruction result in extSMILES. For UIMA on the other hand, a pipeline is developed, which is able to analyse and translate the input text files to extSMILES. The results of both tools then are combined and presented in chemoCR. An evaluation of the prototype is performed on a representative set of twelve structures of interest and low image quality which contain all typical Markush elements. Trivial structures containing only one R-group are not evaluated. Due to the challenging nature of the images, no Markush structure could be correctly reconstructed. But by regarding the assumption, that R-group definitions which are described by natural language are excluded from the task, and under the condition that the core structure reconstruction is improved, the rate of success can be increased to 58.4%.
Statins are a group of hypolipidemic drugs that act by competitive inhibition of the HMGR enzyme. They are generally considered effective and safe but claimed to have side effects on skeletal muscles. A molecular side effect of statins is the block of terpene biosynthesis and hence of dolichol involved in N-glycosylation and O-mannosylation of proteins. Defects in O-mannosylation lead to α-dystroglycan (α-DG) hypoglycosylation and a series of hereditary dystroglycanopathies. The current project aims to get insight into molecular pathomechanisms induced by statins in mammalian muscle cells and to unravel a potential link between these effects and statin-induced decreases of α-DG O-mannosylation. The study was based on mass spectrometric proteomics supported by western blot analysis to reveal Rosuvastatin effects on cellular pathways under high (micromolar) or low (nanomolar) conditions. Differential proteomics revealed higher statin effects on muscle cell function in micromolar than nanomolar concentration, which is reached in the patient’s plasma. We demonstrated distinct and partially overlapping patterns of fold-changed proteins under high and low statin conditions. Gene ontology term enrichment (GOTE) analyses of fold-changed proteins revealed cellular pathways related to muscle function and development are affected, even under low statin conditions, typically reached in the patient’s plasma during prophylactic medication.
The recent explosion of available audio-visual media is the new challenge for information retrieval research. Audio speech recognition systems translate spoken content to the text domain. There is a need for searching and indexing this data which possesses no logical structure. One possible way to structure it on a high level of abstraction is by finding topic boundaries. Two unsupervised topic segmentation methods were evaluated with real-world data in the course of this work. The first one, TSF, models topic shifts as fluctuations in the similarity function of the transcript. The second one, LCSeg, approaches topic changes as places with the least overlapping lexical chains. Only LCSeg performed close to a similar real-world corpus. Other reported results could not be outperformed. Topic analysis based on the repeated word usage models renders topic changes more ambiguous than expected. This issue has more impact on the segmentation quality than the state-of-the-art ASR word error rate. It could be concluded that it is advisable to develop topic segmentation algorithms with real-world data to avoid potential biases to artificial data. Unlike evaluated approaches based on word usage analysis, methods operating with local contexts can be expected to perform better through emulation of semantic dependencies.
Interactive Object Detection
(2019)
The success of state-of-the-art object detection methods depend heavily on the availability of a large amount of annotated image data. The raw image data available from various sources are abundant but non-annotated. Annotating image data is often costly, time-consuming or needs expert help. In this work, a new paradigm of learning called Active Learning is explored which uses user interaction to obtain annotations for a subset of the dataset. The goal of active learning is to achieve superior object detection performance with images that are annotated on demand. To realize active learning method, the trade-off between the effort to annotate (annotation cost) unlabeled data and the performance of object detection model is minimised.
Random Forests based method called Hough Forest is chosen as the object detection model and the annotation cost is calculated as the predicted false positive and false negative rate. The framework is successfully evaluated on two Computer Vision benchmark and two Carl Zeiss custom datasets. Also, an evaluation of RGB, HoG and Deep features for the task is presented.
Experimental results show that using Deep features with Hough Forest achieves the maximum performance. By employing Active Learning, it is demonstrated that performance comparable to the fully supervised setting can be achieved by annotating just 2.5% of the images. To this end, an annotation tool is developed for user interaction during Active Learning.
This master thesis describes a supervised approach to the detection and the identification of humans in TV-style video sequences. In still images and video sequences, humans appear in different poses and views, fully visible and partly occluded, with varying distances to the camera, at different places, under different illumination conditions, etc. This diversity in appearance makes the task of human detection and identification to a particularly challenging problem. A possible solution of this problem is interesting for a wide range of applications such as video surveillance and content-based image and video processing. In order to detect humans in views ranging from full to close-up view and in the presence of clutter and occlusion, they are modeled by an assembly of several upper body parts. For each body part, a detector is trained based on a Support Vector Machine and on densely sampled, SIFT-like feature points in a detection window. For a more robust human detection, localized body parts are assembled using a learned model for geometric relations based on Gaussians. For a flexible human identification, the outward appearance of humans is captured and learned using the Bag-of-Features approach and non-linear Support Vector Machines. Probabilistic votes for each body part are combined to improve classification results. The combined votes yield an identification accuracy of about 80% in our experiments on episodes of the TV series "Buffy the Vampire Slayer". The Bag-of-Features approach has been used in previous work mainly for object classification tasks. Our results show that this approach can also be applied to the identification of humans in video sequences. Despite the difficulty of the given problem, the overall results are good and encourage future work in this direction.
Neural network based object detectors are able to automatize many difficult, tedious tasks. However, they are usually slow and/or require powerful hardware. One main reason is called Batch Normalization (BN) [1], which is an important method for building these detectors. Recent studies present a potential replacement called Self-normalizing Neural Network (SNN) [2], which at its core is a special activation function named Scaled Exponential Linear Unit (SELU). This replacement seems to have most of BNs benefits while requiring less computational power. Nonetheless, it is uncertain that SELU and neural network based detectors are compatible with one another. An evaluation of SELU incorporated networks would help clarify that uncertainty. Such evaluation is performed through series of tests on different neural networks. After the evaluation, it is concluded that, while indeed faster, SELU is still not as good as BN for building complex object detector networks.
Machine learning-based solutions are frequently adapted in several applications that require big data in operations. The performance of a model that is deployed into operations is subject to degradation due to unanticipated changes in the flow of input data. Hence, monitoring data drift becomes essential to maintain the model’s desired performance. Based on the conducted review of the literature on drift detection, statistical hypothesis testing enables to investigate whether incoming data is drifting from training data. Because Maximum Mean Discrepancy (MMD) and Kolmogorov-Smirnov (KS) have shown to be reliable distance measures between multivariate distributions in the literature review, both were selected from several existing techniques for experimentation. For the scope of this work, the image classification use case was experimented with using the Stream-51 dataset. Based on the results from different drift experiments, both MMD and KS showed high Area Under Curve values. However, KS exhibited faster performance than MMD with fewer false positives. Furthermore, the results showed that using the pre-trained ResNet-18 for feature extraction maintained the high performance of the experimented drift detectors. Furthermore, the results showed that the performance of the drift detectors highly depends on the sample sizes of the reference (training) data and the test data that flow into the pipeline’s monitor. Finally, the results also showed that if the test data is a mixture of drifting and non-drifting data, the performance of the drift detectors does not depend on how the drifting data are scattered with the non-drifting ones, but rather their amount in the test set
In (dynamic) adaptive mesh refinement (AMR) an input mesh is refined or coarsened to the need of the numerical application. This refinement happens with no respect to the originally meshed domain and is therefore limited to the geometrical accuracy of the original input mesh. We presented a novel approach to equip this input mesh with additional geometry information, to allow refinement and high-order cells based on the geometry of the original domain. We already showed a limited implementation of this algorithm. Now we evaluate this prototype with a numerical application and we prove its influence on the accuracy of certain numerical results. To be as practical as possible, we implement the ability to import meshes generated by Gmsh and equip them with the needed geometry information. Furthermore, we improve the mapping algorithm, which maps the geometry information of the boundary of a cell into the cell's volume. With these preliminary steps done, we use out new approach in a simulation of the advection of a concentration along the boundary of a sphere shell and past the boundary of a rotating cylinder. We evaluate the accuracy of our approach in comparison to the conventional refinement of cells to answer our research question: How does the performance and accuracy of the hexahedral curved domain AMR algorithm compare to linear AMR when solving the advection equation with the linear finite volume method? To answer this question, we show the influence of curved AMR on our simulation results and see, that it is even able to outperform far finer linear meshes in terms of accuracy. We also see that the current implementation of this approach is too slow for practical usage. We can therefore prove the benefits of curved AMR in certain, geometry-related application scenarios and show possible improvements to make it more feasible and practical in the future.
Estimation of Prediction Uncertainty for Semantic Scene Labeling Using Bayesian Approximation
(2018)
With the advancement in technology, autonomous and assisted driving are close to being reality. A key component of such systems is the understanding of the surrounding environment. This understanding about the environment can be attained by performing semantic labeling of the driving scenes. Existing deep learning based models have been developed over the years that outperform classical image processing algorithms for the task of semantic labeling. However, the existing models only produce semantic predictions and do not provide a measure of uncertainty about the predictions. Hence, this work focuses on developing a deep learning based semantic labeling model that can produce semantic predictions and their corresponding uncertainties. Autonomous driving needs a real-time operating model, however the Full Resolution Residual Network (FRRN) [4] architecture, which is found as the best performing architecture during literature search, is not able to satisfy this condition. Hence, a small network, similar to FRRN, has been developed and used in this work. Based on the work of [13], the developed network is then extended by adding dropout layers and the dropouts are used during testing to perform approximate Bayesian inference. The existing works on uncertainties, do not have quantitative metrics to evaluate the quality of uncertainties estimated by a model. Hence, the area under curve (AUC) of the receiver operating characteristic (ROC) curves is proposed and used as an evaluation metric in this work. Further, a comparative analysis about the influence of dropout layer position, drop probability and the number of samples, on the quality of uncertainty estimation is performed. Finally, based on the insights gained from the analysis, a model with optimal configuration of dropout is developed. It is then evaluated on the Cityscape dataset and shown to be outperforming the baseline model with an AUC-ROC of about 90%, while the latter having AUC-ROC of about 80%.
The work done in this thesis enhances the MMD algorithm in multi-core environments. The MMD algorithm, a transformation based algorithm for reversible logic synthesis, is based on the works introduced by Maslov, Miller and Dueck and their original, sequential implementation. It synthesises a formal function specification, provided by a truth table, into a reversible network and is able to perform several optimization steps after the synthesis. This work concentrates on one of these optimization steps, the template matching. This approach is used to reduce the size of the reversible circuit by replacing a number of gates that match a template which implements the same function and uses less gates. Smaller circuits have several benefits since they need less area and are not as costly. The template matching approach introduced in the original works is computationally expensive since it tries to match a library of templates against the given circuit. For each template at each position in the circuit, a number of different combinations have to be calculated during runtime resulting in high execution times, especially for large circuits. In order to make the template matching approach more efficient and usable, it has been reimplemented in order to take advantage of modern multi-core architectures such as the Cell Broadband Engine or a Graphics Processing Unit. For this work, two algorithmically different approaches that try to consider each multi-core architecture’s strengths, have been analyzed and improved. For the analysis these approaches have been cross-implemented on the two target hardware architectures and compared to the original parallel versions. Important metrics for this analysis are the execution time of the algorithm and the result of the minimization with the template matching approach. It could be shown that the algorithmically different approaches produce the same minimization results, independent of the used hardware architecture. However, both cross-implementations also show a significantly higher execution time which makes them practically irrelevant. The results of the first analysis and comparison lead to the decision to enhance only the original parallel approaches. Using the same metrics for successful enhancements as mentioned above, it could be shown that improving the algorithmic concepts and exploiting the capabilities of the hardware lead to better results for the execution time and the minimization results compared to their original implementations.