Apache Spark is an open source cluster computing framework whose version 1.0 was released in 2014. It was originally developed in response to limitations of the Hadoop MapReduce cluster computing paradigm, and therefore serves as an alternative to part of Apache Hadoop (read more on Hadoop). In that sense Spark complements Hadoop rather than replacing it outright: Spark takes the place of Hadoop MapReduce but, when used at scale, still requires the other Hadoop elements, i.e. a file system (e.g. the Hadoop Distributed File System) and a cluster manager (e.g. Hadoop YARN).
- Spark Core provides the basic functionality of the Spark engine, including filters, selections, joins and mathematical operations. Its interface abstracts the back end for us and runs operations in parallel to return results as quickly as possible. Spark Core is exposed through programming interfaces for Java, Python, Scala, and R.
- Spark SQL lets you run SQL on the Spark engine. Spark SQL also introduced DataFrames, which can be seen as structured data tables.
- Spark Streaming lets you use Spark for streaming workloads. Simply put, it ingests data in mini-batches and performs transformations on those mini-batches.
- Spark MLlib lets you run pre-configured machine learning algorithms on Spark, including recommenders, regressions, clusterings and segmentations. With MLlib, the Spark engine can be used to speed up the building and running of such algorithms.
- GraphX lets you run graph operations on the Spark engine.
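
To make the mini-batch idea behind Spark Streaming more concrete, here is a minimal sketch in plain Python. It deliberately does not use Spark's API: the helpers `mini_batches`, `process_stream` and `sum_of_evens` are hypothetical names invented for this illustration, and batches are cut by a fixed element count for simplicity, whereas Spark Streaming groups incoming data by a time interval.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def mini_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group an incoming stream into mini-batches of at most `batch_size` items.

    Hypothetical helper: a stand-in for the batching a streaming engine does.
    """
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


def process_stream(
    stream: Iterable[int],
    batch_size: int,
    transform: Callable[[List[int]], int],
) -> List[int]:
    """Apply `transform` to every mini-batch and collect one result per batch."""
    return [transform(batch) for batch in mini_batches(stream, batch_size)]


def sum_of_evens(batch: List[int]) -> int:
    # Filter, then aggregate: a typical per-batch transformation.
    return sum(x for x in batch if x % 2 == 0)


# The numbers 1..10 arrive as a stream and are processed four at a time.
results = process_stream(range(1, 11), batch_size=4, transform=sum_of_evens)
print(results)  # [6, 14, 10]
```

Each batch is transformed independently of the others, which is what allows an engine such as Spark to distribute the per-batch work across a cluster.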