According to a recent article by JavaWorld, the launch of Spark 1.6 is very important for big data, as Spark’s latest release has given big data a great push at the start of 2016. Spark’s improved streaming and memory management will not only make life easier for developers and operators, but will also increase big data’s popularity. Among Spark 1.6’s new features are automatic memory management, streaming improvements, ML pipeline persistence, and Datasets, which pave the road to Spark 2.0.
Because many users had complained about Spark’s memory management, the Spark team has been addressing it through Project Tungsten over the last few releases. Spark 1.6 offers a new memory manager that adjusts memory usage automatically, although this addition will not solve every memory-usage problem on its own.
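For readers who want to see what the automatic behavior replaces, the new unified memory manager in Spark 1.6 is controlled by a small set of configuration properties. The values below are the 1.6 defaults shown as an illustrative sketch, not tuning recommendations:

```
# spark-defaults.conf — sketch of Spark 1.6 unified memory settings
spark.memory.fraction         0.75   # share of the heap managed by Spark (execution + storage)
spark.memory.storageFraction  0.5    # portion of that share protected for cached blocks
spark.memory.useLegacyMode    false  # set true to fall back to the pre-1.6 static split
```

In most cases these can be left alone, since execution and storage now borrow memory from each other as needed.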
The latest release also improves streaming. “We've seen some large, aggregation-streaming pipelines that simply can't cope with big state changes -- and Spark's updateStateByKey API has to hold the entire working set in memory. Spark 1.6 adds a new API, mapWithState, that works with deltas instead of the entire working set, immediately offering speed and memory improvements over the previous approach,” according to JavaWorld.
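The advantage of working with deltas can be illustrated outside Spark entirely. The plain-Python sketch below (hypothetical names, not the Spark API) shows keyed state being updated only for the keys that appear in each micro-batch, which is the saving mapWithState exploits over rescanning the whole working set:

```python
# Conceptual sketch: per-batch delta updates to keyed state.
def update_with_deltas(state, batch):
    """Apply only the keys seen in this micro-batch to the running state.

    state: dict mapping key -> running count (kept across batches)
    batch: list of (key, count) deltas from the current micro-batch
    """
    for key, count in batch:
        state[key] = state.get(key, 0) + count
    return state

state = {}
update_with_deltas(state, [("a", 1), ("b", 2)])
update_with_deltas(state, [("a", 3)])
# Only "a" was touched in the second batch; "b" was never revisited.
print(state)  # {'a': 4, 'b': 2}
```

With updateStateByKey, by contrast, the update function runs over every key in the state on every batch, whether or not new data arrived for it.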
Just like big data, machine learning remains a hot topic, and Spark 1.6 accordingly extends MLlib, Spark’s library of machine learning tools and algorithms.
“The best new feature added to MLLib in this release is Pipeline persistence. While you've been able to persist models in previous versions, Spark 1.6 allows you to import and export the workflow of Estimators, Transformers, Models, and Pipelines from and to external storage,” JavaWorld adds.
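The workflow the quote describes can be sketched in Scala. The stages, the path, and `trainingDf` below are assumptions for illustration, and in Spark 1.6 not every estimator supported persistence yet:

```scala
// Hypothetical sketch of Pipeline persistence on Spark 1.6.
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.classification.LogisticRegression

val pipeline = new Pipeline()
  .setStages(Array(new HashingTF(), new LogisticRegression()))

val model = pipeline.fit(trainingDf)        // trainingDf is assumed to exist

// Export the entire fitted workflow, then restore it elsewhere.
model.save("hdfs:///models/spam-v1")
val restored = PipelineModel.load("hdfs:///models/spam-v1")
```

Because the whole fitted pipeline round-trips through storage, a model trained in one job can be reloaded and applied in another without re-running the training code.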
The pipeline persistence feature will not only let coders write less code but also encourage them to experiment more and stay flexible. Datasets are considered by many to be one of the most important features of Spark 1.6. JavaWorld explains that Spark’s execution plan, created by the Catalyst query planner on top of a DataFrame, allows the new version to produce a far more optimized plan than equivalent RDD code.
“The type information presented in the formation of a dataset is used to create an Encoder for serialization purposes. Again, if you've used Spark in production, you've likely come across memory issues due to the overhead of Java serialization to send data across the cluster (and you've likely switched to Kryo serialization).”
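The Encoder mechanism the quote describes can be sketched in Scala. The `sqlContext`, the `Trade` case class, and the sample data below are assumptions for illustration:

```scala
// Hypothetical sketch: typed Datasets on Spark 1.6.
case class Trade(symbol: String, price: Double)

import sqlContext.implicits._   // brings in Encoders for case classes

// toDS() serializes rows through a generated Encoder for Trade,
// avoiding generic Java serialization of each object.
val trades = Seq(Trade("AA", 10.5), Trade("BB", 7.2)).toDS()
val expensive = trades.filter(_.price > 10.0)
```

Because the Encoder knows the exact fields and types of `Trade`, data can be kept in a compact binary format and many operations can run on it without fully deserializing each object.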
Spark’s plan to launch version 2.0 in the near future could make it one of the hottest data processing platforms.
Image Source: www.informationweek.com