High-efficiency AI and ML distributed systems at Big-Learning scales

Project Description

The position is part of a project whose goal is to make distributed AI and ML more efficient. Enabled by the rise of big data, today’s AI and ML solutions are achieving remarkable success in many fields thanks to their ability to learn complex models with millions to billions of parameters. However, these solutions are expensive: running AI and ML algorithms at large scale requires clusters of tens or hundreds of machines to meet their high computation and communication costs.

Many systems exist for performing AI and ML tasks in a distributed environment. Yet performance requirements and input data sizes are steadily growing. A further level of efficiency is required to address key challenges such as network communication bottlenecks and uneven cluster performance.

Moreover, the fidelity of ML models is very sensitive to many hyperparameters. Producing accurate models depends on tuning these hyperparameters well, which requires exploring a large space of possible configurations efficiently.

The internship work will generally address the challenges the project currently faces, whether that is investigating the trade-offs between reduced communication and model precision or identifying bottlenecks and developing new algorithms to overcome them. Ongoing directions are (1) exploring the use of new networking hardware and architectures to make network-based communication more efficient and (2) designing new search algorithms that make better use of resources and determine optimized hyperparameters more efficiently.

Candidates should be motivated to work on research-oriented problems in a fast-paced and tight-knit team. They should have a strong computing or engineering background with good knowledge of algorithms, machine learning, distributed systems, and networking. Ideally, they would have experience in building and working with large software systems and tools, and proven knowledge of C++/Java.
Program - Computer Science
Division - Computer, Electrical and Mathematical Sciences and Engineering
Field of Study - Computer Science

About the Researcher

Marco Canini

Associate Professor, Computer Science

Professor Canini's research interests are in the principled construction and operation of large-scale networked computer systems, specifically in distributed systems, large-scale computing, and computer networking, with emphasis on cloud computing and programmable networks. His current work focuses on improving the design, implementation, and operation of networked systems along several vital properties such as reliability, performance, security, and energy efficiency.

Desired Project Deliverables

The students are expected to study existing solutions and, with the assistance of the supervisor, devise theoretically sound approaches to improve their performance. The students will also be able to collaborate with other team members and to evaluate the mechanisms on real-world datasets on a state-of-the-art testbed. These results, if achieved, are expected to be novel and may lead to a publication (with the agreement of the supervisor).