CHAPTER 6
Spark is, at the moment, a star of the big data processing world, and one could say it’s getting more popular by the day. It’s the most active open source big data processing framework. A lot of this popularity comes from a revolutionary idea at the heart of Spark: that nodes in a cluster can use their memory, not just the disk, to process tasks.
One of Spark’s selling points is that it can process data up to one hundred times faster than other open source big data technologies available today. This is truly incredible and important progress. Throughout human history, every now and then a technology comes along that changes the world. Some technologies turn it upside down; others make a small contribution that proves very important in the long run. To be honest, I don’t know how to classify Spark, but it has definitely changed a lot in the field of big data processing. Although Spark has been around for almost six years now, it became incredibly popular after setting a record in the 2014 Daytona GraySort challenge, beating Hadoop’s previous record while running on one tenth of the instances.
Until that challenge, I had only heard about Spark every now and then and didn’t know much about it. When I heard about the record, I started to look into it. On top of that, most of the cool people I follow on Twitter started talking about Spark and combining it with Cassandra (those two technologies go very well together). My feeling is that at the moment there are not that many Spark developers out there, so this book is mostly an introduction to Spark, along with a few more advanced topics. I felt I could contribute the most by orienting the book towards Spark newcomers.
Recently I watched a talk by Brian Clapper on Spark. One of his statements was that the battle of NoSQL storage technologies is pretty much over, and that the market in that segment is becoming well defined. In essence, the past few years have yielded solutions that let big systems store all the data coming in. What I found most important in his talk was his point that the processing wars are just starting. After years of Hadoop dominating the world of data processing, new and exciting times are coming again. I hope this book has provided you with solid foundations and opened the door to this exciting time.