Foundations for Scaling ML in Apache Spark
published: Oct. 12, 2016, recorded: August 2016, views: 1186
Report a problem or upload filesIf you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
Apache Spark has become the most active open source Big Data project, and its Machine Learning library MLlib has seen rapid growth in usage. A critical aspect of MLlib and Spark is the ability to scale: the same code used on a laptop can scale to 100’s or 1000’s of machines. This talk will describe ongoing and future efforts to make MLlib even faster and more scalable by integrating with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management, cache-awareness, and code generation. This talk will discuss the goals, the challenges, and the benefits for MLlib users and developers. More generally, we will reflect on the importance of integrating ML with the many other aspects of big data analysis.
About MLlib: MLlib is a general Machine Learning library providing many ML algorithms, feature transformers, and tools for model tuning and building workflows. The library benefits from integration with the rest of Apache Spark (SQL, streaming, Graph, core), which facilitates ETL, streaming, and deployment. It is used in both ad hoc analysis and production deployments throughout academia and industry.
Link this pageWould you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !