Title Beginning Apache Spark 2

Author Hien Luu
Category Specialist literature
Subcategory Computers / IT


Spark is a general-purpose distributed data processing engine built for speed, ease of use, and flexibility. The combination of these three properties is what makes Spark so popular and widely adopted in the industry. The Apache Spark website claims that it can run certain data processing jobs up to 100 times faster than Hadoop MapReduce. In 2014, Spark won the Daytona GraySort contest, an industry benchmark for sorting 100 TB of data (one trillion records). According to the Databricks submission, Spark sorted the 100 TB three times faster, and with ten times fewer resources, than the previous world record set by Hadoop MapReduce.

Ease of use has been one of the main focuses of Spark's creators since the project's inception. Spark offers more than 80 high-level, commonly needed data processing operators that make it easy for developers, data scientists, and data analysts to build all kinds of interesting data applications. These operators are available in multiple languages, namely Scala, Java, Python, and R, so software engineers, data scientists, and data analysts can choose their preferred language for solving large-scale data processing problems with Spark.