Many Models at the Edge: Scaling Deep Inference via Model-Level Caching
Deep learning (DL) models are growing rapidly in popularity, in large part due to fast-paced innovation in model accuracy as well as companies' enthusiasm for integrating deep learning into existing application logic. This trend will inevitably lead to a deployment scenario, akin to a content delivery network for web objects, where many deep learning models — each with different popularity — run on shared edge infrastructure with limited resources. In this paper, we set out to answer the key question of how to manage many deep learning models at the edge effectively. Via an empirical study based on profiling more than twenty deep learning models and extrapolating from an open-source Microsoft Azure workload trace, we pinpoint a promising avenue of leveraging cheaper CPUs, rather than the commonly promoted accelerators, for edge-based deep inference serving.
Based on our empirical insights, we formulate the DL model management problem as a classical caching problem, which we refer to as model-level caching. As an initial step towards realizing model-level caching, we propose a simple cache eviction policy, called CremeBrulee, that adapts Belady's MIN algorithm to explicitly consider DL model-specific factors when calculating each in-cache object's utility. Using a small-scale testbed, we demonstrate that CremeBrulee can reduce memory usage by 50% while keeping model load latency below 92% of execution latency and incurring less than 36% of the penalty of a random eviction policy. Further, when scaling to more models and requests in simulation, we demonstrate that CremeBrulee reduces model load delay by up to 16.6% relative to eviction policies that consider only workload characteristics.
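The core idea of a DL-aware utility score can be illustrated with a minimal sketch. The factor weighting below (load latency times request rate, normalized by memory footprint) is purely illustrative and is not the paper's actual CremeBrulee formula; all names and numbers here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    size_mb: float   # memory footprint of the loaded model
    load_ms: float   # time to load the model into memory
    freq: float      # observed request rate (requests/sec)

def utility(m: Model) -> float:
    # Illustrative utility: models that are expensive to reload and
    # frequently requested are worth keeping; larger models consume
    # more cache space, so utility is normalized by size.
    return (m.load_ms * m.freq) / m.size_mb

def evict_until_fits(cache: list[Model], budget_mb: float) -> list[Model]:
    """Evict the lowest-utility models until total size fits the budget."""
    cache = sorted(cache, key=utility, reverse=True)
    while sum(m.size_mb for m in cache) > budget_mb:
        cache.pop()  # drop the model with the lowest utility
    return cache
```

Unlike a purely workload-driven policy (e.g., LRU), such a score lets an expensive-to-load but moderately popular model outrank a cheap-to-load, slightly more popular one.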