CLUSTERING ALGORITHM FOR MASS SPECTROMETRY DATA USING GENERAL-PURPOSE COMPUTING ON GRAPHICS PROCESSING UNITS
Modern mass spectrometers can produce mass spectra data at a very high rate. Usually, this data contains a significant percentage of redundant spectra that increase the database lookup time when searching for peptides. Therefore, there is a need for data-mining techniques (e.g., clustering) to reduce the complexity of these mass spectra datasets before database search. Multi-core architectures, specifically Graphics Processing Units (GPUs), have evolved tremendously in recent years and are an ideal option for clustering these large mass spectra datasets. In this thesis, we present an efficient and scalable parallel algorithm for clustering mass spectra using the well-known 'F-set' similarity metric. We describe the algorithmic framework and the various optimizations that serve to vastly improve the algorithm's performance and accuracy. We test the algorithm on a variety of real as well as self-generated mass spectra datasets and show that the algorithm achieves highly accurate clustering with a performance gain of around 50 to 100 times compared to serial implementations in the literature. Thus, by clustering together the mass spectra corresponding to each unique peptide, the algorithm allows faster identification of peptides in a subsequent database search.