Debugging Trove gate failures

Of late, I’ve spent a fair amount of time debugging Trove’s gate failures. This isn’t the first time; it generally happens around release time, and each time I relearn the same things. So this time, I’ll make a note of what I’ve done recently. Hopefully, it’ll ease the process next time.

Can you get Trove to work locally?

This is as simple as getting Trove installed from the tip of master (or whatever branch you are debugging) and launching a guest. If you can, move on to the next step. If you cannot, figure out why.

This is an example of a failure that can be reproduced locally.

Push a dummy commit and see what the CI does to it.

The hard failures are the ones in jobs that launch an actual guest instance. If the jobs fail, look at the logs.

This is from a commit that only changed some documentation. Clearly the Trove functional tests should pass.

Look at the CI log files

One common failure mode is a guest instance that fails to launch. In the CI output, which is in console.html, that will look something like this.

Look to see whether you have anything in the conductor logs

Did the guest even manage to get as far as responding to the conductor? If you see anything indicating that the guest managed to connect to the conductor, you are looking at a failure in the guest that will likely happen in your local environment as well.

You should get some benefit by piping the guest logs back to the host. A sample commit that shows how to do this is here. Get that into your test commit and you should see messages in the conductor log indicating what fails on the guest and you’ll be on your way to figuring out what happened.

Of course, if nothing appears in the conductor log, proceed to the next step.

What is getting installed onto the guest?

One of the most common failures I’ve seen is where something changes in requirements and messes up the guest.

You can see what gets installed on the guest by looking at the diagnostic pip output produced during guest image creation.

Look in logs/devstack-gate-post_test_hook.txt.gz. Search for the string

diagnostic pip freeze output follows
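
If you would rather not scroll through the log by hand, something along these lines could pull the package list out of the gzipped file. This is only a sketch: the marker string is the one above, but the function names are mine, and I am assuming that each package shows up as name==version at the end of a log line (possibly after a timestamp or other prefix) and that the list ends at the first line that does not look like that.

```python
import gzip
import re

MARKER = "diagnostic pip freeze output follows"

# Assumption: each package is logged as name==version at the end of a line,
# possibly preceded by a timestamp or other prefix.
PIN = re.compile(r"([A-Za-z0-9][A-Za-z0-9._-]*)==(\S+)\s*$")


def extract_freeze(lines):
    """Return {package: version} for the pinned lines after the first marker."""
    packages = {}
    seen_marker = False
    for line in lines:
        if not seen_marker:
            # Skip everything up to and including the marker line.
            seen_marker = MARKER in line
            continue
        match = PIN.search(line)
        if match:
            packages[match.group(1).lower()] = match.group(2)
        elif packages:
            # Assumption: the first non-matching line after the list has
            # started means the list is over.
            break
    return packages


def extract_freeze_from_log(path):
    """Read a gzipped log file and extract the pip freeze section."""
    with gzip.open(path, "rt", errors="replace") as handle:
        return extract_freeze(handle)
```

Calling extract_freeze_from_log("logs/devstack-gate-post_test_hook.txt.gz") on a downloaded copy of the log should then give you a dictionary of package names to versions, provided the log really is laid out the way I assumed.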

Compare the output from a failing build to one that passes and you will likely have found the change that caused the failure.
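
A plain diff of the two outputs works, but a small script makes the version changes easier to spot. The sketch below assumes you have saved the pip freeze output from a passing build and from a failing build as text, with one name==version entry per line; the function names are my own and not part of Trove or the gate tooling.

```python
def parse_freeze(text):
    """Parse pip freeze style text into {package: version}."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not pinned with ==.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.strip().lower()] = version.strip()
    return packages


def diff_freeze(passing, failing):
    """Return (name, passing_version, failing_version) for every difference.

    A version of None means the package is absent from that build.
    """
    changes = []
    for name in sorted(set(passing) | set(failing)):
        old, new = passing.get(name), failing.get(name)
        if old != new:
            changes.append((name, old, new))
    return changes


if __name__ == "__main__":
    # Made-up package names and versions, purely for illustration.
    good = parse_freeze("alpha==1.0\nbeta==2.0\ngamma==3.0")
    bad = parse_freeze("alpha==1.0\nbeta==2.1\ndelta==4.0")
    for name, old, new in diff_freeze(good, bad):
        print(name, old, "->", new)
```

Anything that shows up in the result is a candidate for the change that broke the guest; packages that did not change between the two builds are left out.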

Hopefully this will give you enough of an indication of what happened to cause the failure and you can debug it further.

When all else fails, contact infra

They can get you access to the machine running the gate job and you can watch and debug the failure as it happens.

This is an example of a fix that only occurred to me after I was able to debug the failure live on the CI infrastructure. It was something that had been fixed in master and needed to be backported to Mitaka.