User:Tsu2/vCPU Steal Time/

vCPU Steal Time

I haven't personally experienced this problem, but after coming across the article referenced at the end of this page I found the scenario plausible. I am archiving the reference here and adding my own observations (which, again, are not based on actually having experienced the problem).

The Essential Concept

The article names "over-subscription" as a possible cause, which is likely the same thing I call "over-provisioning". The idea is that because any machine, physical or virtual, almost never uses all of the resources available to it, it is typically considered good practice to deploy multiple virtual machines (multi-tenant) on the same physical server whose combined allocations theoretically exceed the physical resources available. Under normal circumstances such over-provisioning works fine, but should the virtual machines together come under high enough load to exceed the available resources, contention and starvation can occur. Steal time is the time during which a vCPU was ready to run but had to wait because the hypervisor was servicing something else.
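As a rough illustration of the arithmetic behind over-provisioning, here is a minimal Python sketch. The function names and the example numbers are my own assumptions and are not taken from the referenced article. It only shows that allocating more vCPUs than physical cores is harmless until the combined demand actually exceeds the cores available.

```python
def overcommit_ratio(vcpus_per_vm, physical_cores):
    """Ratio of allocated vCPUs to physical cores (above 1.0 means over-provisioned)."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vcpus_per_vm) / physical_cores


def demand_exceeds_capacity(busy_cores_per_vm, physical_cores):
    """True when the VMs together want more CPU than the host has.

    busy_cores_per_vm is a hypothetical measure: how many cores' worth of
    work each VM is currently trying to run.
    """
    return sum(busy_cores_per_vm) > physical_cores


# Hypothetical host: 16 physical cores, four VMs with 4, 4, 8 and 8 vCPUs.
allocation = [4, 4, 8, 8]
ratio = overcommit_ratio(allocation, 16)          # 24 vCPUs on 16 cores

# Light load: over-provisioned, but nobody has to wait.
quiet = demand_exceeds_capacity([0.5, 0.5, 1.0, 2.0], 16)

# Heavy load: every VM saturates its vCPUs, so some vCPUs must wait.
busy = demand_exceeds_capacity([4, 4, 8, 8], 16)
```

In the heavy-load case the waiting vCPUs are where steal time would be expected to show up inside the Guests.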

There is a second possible scenario: if the virtualization technology is poorly written, a Guest process might conflict with an existing (long-running?) process in the HostOS. I suppose this scenario could be extended to Guest<>Guest process contention, but that seems very unlikely since it is something virtualization code certainly addresses.

Diagnosis

The referenced article doesn't actually suggest a definitive way to diagnose the problem. It describes using top to display the processes spawned by a stress test and observing whether the Guest CPU% (i.e. utilization) is grossly lower than expected. I don't think this diagnosis is very practical. The stress test is easy to identify in a controlled test, but on a real-world VM running a variety of processes, identifying a specific process as abnormally low would not be so easy. It might be next to impossible, especially if you aren't working from a highly descriptive complaint (which would already mean the problem has been occurring for a very long time).
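A possibly more practical check, which is my own suggestion rather than something taken from the referenced article, is to read the steal counter that a Linux Guest exposes directly. The aggregate "cpu" line of /proc/stat lists cumulative tick counts, and the eighth value is steal (the same figure top reports as "st"). The following is a minimal Python sketch under that assumption. The function names are hypothetical, and it has to run inside the Guest, not on the HostOS.

```python
import time

# Order of the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from /proc/stat content as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Percentage of elapsed CPU ticks between two samples that were stolen."""
    deltas = {k: after.get(k, 0) - before.get(k, 0) for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0
    return 100.0 * deltas["steal"] / total


def sample_steal(interval=1.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return the steal percentage."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return steal_percent(before, after)
```

Calling sample_steal() periodically and logging the result would show whether steal rises while the Guest is busy, without having to single out one process. What percentage counts as a problem is a judgment call that I can't offer a tested threshold for.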

Solution

The referenced article doesn't suggest much other than common sense:
Migrate high-load VMs to a different physical server.
Modify the QoS settings of certain VMs (i.e. throttle them).

Reference Article:

https://opensource.com/article/20/1/cpu-steal-time