Between 9:31pm and 9:53pm CST, outbound calls failed for users connected to our GCP us-south region. Inbound calls were not affected during this time.
This issue was caused by the Kubernetes HPA (Horizontal Pod Autoscaler) reporting errors because it could not obtain CPU metrics from our media servers in this region. The HPA is responsible for keeping a certain number of media servers online at any given time and for scaling resources up and down to meet demand throughout the day.
When the HPA failed to obtain the CPU metrics, it was unable to decide when to adjust the replica count. The errors stemmed from a hardware issue in GCP: the host machine running the virtual machine was not responding to "kube-state-metrics" calls.
After the system had been stuck in this state for some time, Kubernetes created a new virtual machine to replace the faulty one, and services were then restored.
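
To illustrate why missing CPU metrics stall scaling, here is a minimal Python sketch of the replica calculation described in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric). It is a simplified illustration, not our production configuration or the actual controller code. The function name, parameters, and the choice to return `None` when no metric is available are assumptions made for this example. The real controller also handles per-pod missing metrics, readiness, and stabilization windows.

```python
import math
from typing import Optional


def desired_replicas(
    current_replicas: int,
    current_cpu: Optional[float],  # average CPU utilization (%), None if unavailable
    target_cpu: float,             # target average CPU utilization (%)
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,        # 0.1 is the documented default tolerance
) -> Optional[int]:
    """Return the replica count to scale to, or None if no decision can be made.

    Hypothetical helper for illustration only.
    """
    # Without a CPU reading there is nothing to compare against the target,
    # so no scaling decision is possible (the situation in this incident).
    if current_cpu is None:
        return None

    ratio = current_cpu / target_cpu

    # Within tolerance of the target: keep the current count.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Always stay within the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Load above target: scale up.
    print(desired_replicas(4, 90.0, 60.0, min_replicas=2, max_replicas=10))
    # Metrics unavailable: no decision.
    print(desired_replicas(4, None, 60.0, min_replicas=2, max_replicas=10))
```

In this sketch, a `None` result models the stuck state: with no CPU reading, the replica count is left unchanged no matter how demand shifts.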