In Linux, the current mainstream solution is CFS (Completely Fair Scheduler). Its goal is to assign running processes to time slices of the CPU in a “fair” way.
Maintaining performance isolation between these different applications is critical to ensuring a good experience for internal and external customers.
What the OS task scheduler is doing is essentially solving a resource allocation problem: I have X threads to run but only Y CPUs available, so how do I allocate the threads to the CPUs to give the illusion of concurrency?
Resource allocation problems can be solved efficiently with a branch of mathematics called combinatorial optimization, which is used, for example, in airline scheduling and logistics.
- avoid spreading a container across multiple NUMA sockets (to avoid potentially slow cross-socket memory accesses or page migrations)
- don’t use hyper-threads unless you need to (to reduce L1/L2 thrashing)
- try to even out pressure on the L3 caches (based on potential measurements of the container’s hardware usage)
- don’t shuffle things too much between placement decisions
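
To make these heuristics concrete, here is a minimal sketch that turns each bullet into a penalty term and picks the cheapest placement for a new container by exhaustive search. Everything in it is an assumption made for illustration: the 2-socket, 8-thread topology, the helper names (`socket_of`, `core_of`, `cost`, `place_new`) and the weights are hypothetical, and brute force stands in for a real combinatorial solver.

```python
from itertools import combinations

# Hypothetical topology: 2 sockets x 2 cores x 2 hyper-threads = 8 hardware threads.
NUM_THREADS = 8
NUM_SOCKETS = 2


def socket_of(thread):
    return thread // 4  # threads 0-3 on socket 0, threads 4-7 on socket 1


def core_of(thread):
    return thread // 2  # threads 2k and 2k+1 are hyper-thread siblings


def cost(placement, previous=None, w_numa=10.0, w_ht=3.0, w_l3=1.0, w_move=0.5):
    """Score a placement {container: tuple of thread ids}; lower is better."""
    used = [t for threads in placement.values() for t in threads]
    # 1. Extra NUMA sockets spanned by each container.
    numa = sum(len({socket_of(t) for t in threads}) - 1
               for threads in placement.values() if threads)
    # 2. Physical cores where both hyper-threads are busy.
    cores = [core_of(t) for t in used]
    ht = sum(1 for c in set(cores) if cores.count(c) > 1)
    # 3. Imbalance between sockets, a crude proxy for L3 pressure.
    per_socket = [sum(1 for t in used if socket_of(t) == s)
                  for s in range(NUM_SOCKETS)]
    l3 = max(per_socket) - min(per_socket)
    # 4. Threads that changed for containers that were already placed.
    moves = 0
    if previous:
        for name, threads in placement.items():
            if name in previous:
                moves += len(set(threads) - set(previous[name]))
    return w_numa * numa + w_ht * ht + w_l3 * l3 + w_move * moves


def place_new(previous, name, n_threads):
    """Try every free-thread combination for a new container, keep the cheapest."""
    busy = {t for threads in previous.values() for t in threads}
    free = [t for t in range(NUM_THREADS) if t not in busy]
    best = None
    for combo in combinations(free, n_threads):
        candidate = dict(previous)
        candidate[name] = combo
        c = cost(candidate, previous)
        if best is None or c < best[0]:
            best = (c, candidate)
    if best is None:
        raise ValueError("not enough free threads")
    return best[1]


current = {"a": (0, 2)}              # container "a" holds two cores on socket 0
updated = place_new(current, "b", 2)
```

With container `a` occupying one thread on each core of socket 0, the search sends `b` to two separate cores on socket 1: no socket is straddled, no hyper-thread siblings are shared, and the sockets stay balanced. Exhaustive search only works at toy sizes, which is why larger instances call for a proper optimization solver.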
We define three events that trigger a placement optimization:
- add: A new container was allocated by the Titus scheduler to this instance and needs to be run
- remove: A running container just finished
- rebalance: CPU usage may have changed in the containers so we should reevaluate our placement decisions
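
As a rough illustration of how the three events could be modeled, the sketch below maps each one to the set of containers that the next optimization pass has to place. The names `PlacementEvent`, `Trigger` and `containers_to_place` are hypothetical and are not taken from titus-isolate.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PlacementEvent(Enum):
    ADD = "add"              # a new container was assigned to this instance
    REMOVE = "remove"        # a running container just finished
    REBALANCE = "rebalance"  # CPU usage may have changed, so re-evaluate


@dataclass(frozen=True)
class Trigger:
    event: PlacementEvent
    container_id: Optional[str] = None  # unused for REBALANCE


def containers_to_place(running, trigger):
    """Return the containers the optimizer must place after this trigger."""
    if trigger.event is PlacementEvent.ADD:
        return set(running) | {trigger.container_id}
    if trigger.event is PlacementEvent.REMOVE:
        return set(running) - {trigger.container_id}
    return set(running)  # REBALANCE: same containers, fresh decision


running = {"c1", "c2"}
after_add = containers_to_place(running, Trigger(PlacementEvent.ADD, "c3"))
after_remove = containers_to_place(after_add, Trigger(PlacementEvent.REMOVE, "c1"))
after_rebalance = containers_to_place(after_remove, Trigger(PlacementEvent.REBALANCE))
```

All three events end the same way, with a full re-solve over the resulting set of containers; only the input set differs.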
Every time a placement event is triggered, titus-isolate queries a remote optimization service (running as a Titus service, hence also isolating itself… turtles all the way down) which solves the container-to-threads placement problem.
This service then queries a local GBRT (gradient boosted regression trees) model, retrained every couple of hours on weeks of data collected from the whole Titus platform, which predicts the P95 CPU usage of each container over the next 10 minutes (conditional quantile regression).
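
Quantile regression is commonly trained by minimizing the pinball (quantile) loss, which penalizes under-prediction far more than over-prediction when the target quantile is high. The sketch below implements that loss in plain Python and shows that the constant minimizing it is an empirical quantile. It assumes nothing about how the actual model is trained; `pinball_loss` and `best_constant` are illustrative names.

```python
def pinball_loss(y_true, y_pred, q=0.95):
    """Mean pinball loss for quantile q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff >= 0) is weighted by q, over-prediction by 1 - q.
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


def best_constant(y_true, q=0.95):
    """Observed value that minimizes the pinball loss when used as a constant prediction."""
    candidates = sorted(set(y_true))
    return min(candidates,
               key=lambda c: pinball_loss(y_true, [c] * len(y_true), q))


usage = list(range(1, 31))            # 30 made-up CPU usage readings
p95_estimate = best_constant(usage)   # lands on a high quantile of the readings
under = pinball_loss([10.0], [9.0])   # predicted too low by 1
over = pinball_loss([10.0], [11.0])   # predicted too high by 1
```

At q = 0.95, predicting one unit too low costs 19 times as much as predicting one unit too high. That asymmetry is what pushes a model trained on this loss toward the 95th percentile of usage rather than the mean.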
The model uses both contextual features (metadata associated with the container: who launched it, image, memory and network configuration, app name…) and time-series features extracted from the last hour of the container's historical CPU usage, which the host collects regularly from the kernel CPU accounting controller.
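
A feature row of this kind might be assembled as in the sketch below. It is an assumption-laden illustration: the summary statistics, the metadata keys (`app_name`, `image`, `memory_mb`) and the function names are hypothetical, since the exact features the model consumes are not listed here.

```python
import math
from statistics import mean, pstdev


def cpu_usage_features(samples):
    """Summarize CPU usage samples from the last hour (oldest first)."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # nearest-rank percentile
    return {
        "mean": mean(samples),
        "std": pstdev(samples),
        "max": ordered[-1],
        "p95": ordered[rank - 1],
        "last": samples[-1],
        "trend": samples[-1] - samples[0],  # crude slope over the window
    }


def build_feature_row(metadata, samples):
    """Merge contextual metadata with time-series summaries into one flat row."""
    row = dict(metadata)
    for name, value in cpu_usage_features(samples).items():
        row["cpu_" + name] = value
    return row


row = build_feature_row(
    {"app_name": "example-app", "image": "example:latest", "memory_mb": 4096},
    [1.0, 2.0, 3.0, 4.0],
)
```

Categorical fields such as the app name would still need encoding before a tree model could use them; that step is left out to keep the sketch short.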