Conception And Analysis Of A Raspberry Pi Cluster With Apache Spark

Kuhaupt, Nicolas (2017) Conception And Analysis Of A Raspberry Pi Cluster With Apache Spark. Masters thesis, Ulm University.

Abstract

Recent developments in the Internet of Things have increased the amount of data that is generated and collected. Business and science applications are interested in finding patterns in, and correlations between, the recorded data sets. Generating immediate insights requires fast algorithms that take advantage of distributed computation. The performance growth of single computers is stagnating, so there is more potential in tackling big data problems by combining computers to scale computing power. Computers are therefore connected to form clusters. Cluster management, which is responsible for dividing the work among the individual nodes, is handled by newer tools such as Apache Spark. Spark set a record for big data sorting in 2014 and is widely used. It offers in-memory computing for faster calculations and an easy, high-level machine learning API, and it fits well into the Hadoop ecosystem for big data.
We evaluate the performance of such a cluster. The test setup consists of a set of Raspberry Pi mini computers with Hadoop and Spark installed. We examine the scaling behavior of selected algorithms: Wordcount, the Kolmogorov-Smirnov test, Frequent Pattern Growth, Support Vector Machines, Linear Regression, and K-Means. The test parameters are the dataset size and the number of computation nodes. The results give an indication of the number of nodes required for a given problem. Furthermore, we analyze these algorithms and the data structures they use to explain their performance, as represented by scaling patterns. Finally, the implementation and abstractions of Apache Spark are examined for potential bottlenecks.

Item Type:	Thesis (Masters)
Subjects:	DBIS Research > Master and Phd-Thesis
Divisions:	Faculty of Engineering, Electronics and Computer Science > Institute of Databases and Information Systems > DBIS Research and Teaching > DBIS Research > Master and Phd-Thesis
Depositing User:	Herr Burkhard Hoppenstedt
Date Deposited:	16 May 2017 11:06
Last Modified:	16 May 2017 11:06
URI:	http://dbis.eprints.uni-ulm.de/id/eprint/1490
