Addressing a common misconception regarding OpenStack Trove security

Since my first OpenStack Summit in Atlanta (mid 2014), I have been to a number of OpenStack-related events, meetups, and summits. And at every one of these events, as well as numerous customer and prospect meetings, I’ve been asked some variant of the question:

Isn’t Trove insecure because the guestagent has RabbitMQ credentials?

A bug was entered in 2015 with the ominous (and factually inaccurate) description that reads “Guestagent config leaks rabbit password”.

And while I’ve tried to explain to people that this is not at all the case, this misconception has persisted.

At the Summit in Barcelona, I was asked yet again about this, and I realized that whatever we in the Trove team had been doing to communicate the reality was clearly insufficient. So, in preparation for the upcoming Summit in Boston, I’m writing this post as a handy resource.

What is the problem?

Shown here is a simplified representation of a Trove system with a single guest database instance. The control plane components (Trove API, Trove Task Manager, and Trove Conductor) and the Guest Agent communicate via oslo.messaging which is typically implemented with some messaging transport like RabbitMQ.

[Figure: rpc-security-1]

To connect to the underlying transport, each of these four components needs to store credentials; for RabbitMQ this is a username and password.
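
Concretely, each component points at the broker through something like a transport_url of the form rabbit://username:password@host:5672/ in its configuration file. As a minimal sketch (illustrative only, not Trove's actual startup code; the config file path is an assumption), this is roughly how a component ends up holding those credentials:

# Illustrative sketch: how a Trove component might load the messaging
# transport from its configuration file.
from oslo_config import cfg
import oslo_messaging as messaging

conf = cfg.CONF
conf(["--config-file", "/etc/trove/trove-guestagent.conf"], project="trove")

# get_transport() reads transport_url (or the older rabbit_* options),
# i.e. the RabbitMQ username and password, from the loaded configuration.
transport = messaging.get_transport(conf)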

The contention is that if a guest instance is somehow compromised (and there are many ways to do this) and a bad actor gains access to the RabbitMQ credentials, then the OpenStack deployment is compromised.

Why is this not really a problem?

Here are some reasons this is not really an issue on a properly configured production system.

  1. Nothing requires that Trove use the same RabbitMQ servers as the rest of OpenStack. So at the very least, the compromise can be limited to the RabbitMQ servers used by Trove.
  2. The guest instance is not intended to be a general-purpose instance that a user has access to; in the intended deployment, the only connectivity to the guest instance would be to the database ports for queries. These ports are configurable for each database (datastore) and enforced by Neutron. Shell access (port 22, ssh) is a no-no, and no deployer would use images and configurations that allowed this kind of access. (A sketch of such a security group follows this list.)
  3. On the guest instance, other database-specific best practices are used to prevent shell escapes and other exploits that would give a user access to the RabbitMQ credentials.
  4. Guest instances can be spawned by Trove using service credentials, or credentials for a shadow tenant, to prevent an end user from directly accessing the underlying Nova instance. Similarly, Cinder volumes can be provisioned under a different tenant to prevent an end user from directly accessing the underlying volume.
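
To make item 2 concrete, here is a minimal sketch (using openstacksdk; the names, network CIDR, and MySQL port are assumptions for illustration) of a security group that exposes only the database port to the guest instance:

# Illustrative sketch: a security group that allows only MySQL traffic (3306)
# to reach the guest instance; there is no rule for port 22, so ssh is
# simply unreachable from outside.
import openstack

conn = openstack.connect(cloud="mycloud")  # "mycloud" is a placeholder clouds.yaml entry

sg = conn.network.create_security_group(
    name="trove-guest-mysql",
    description="Database port only; no shell access")

conn.network.create_security_group_rule(
    security_group_id=sg.id,
    direction="ingress",
    ethertype="IPv4",
    protocol="tcp",
    port_range_min=3306,
    port_range_max=3306,
    remote_ip_prefix="10.0.0.0/24")  # assumed tenant network CIDR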

All of this notwithstanding, the urban legend persisted that Trove was a security risk. The reasoning invariably involved a system configured by devstack: a single RabbitMQ server shared with the rest of OpenStack, open access to port 22 on the guest, and the guest running in the same tenant as the requester of the database.

Yet, one can safely say that no one in their right mind would operate OpenStack in production as configured by devstack. And certainly, with Trove, one would not use the development guest images, built from the elements in the source tree, in a production deployment.

Proposed security-related improvements in Ocata

In the Ocata release, an additional set of changes is proposed to further secure the system. All RPC calls on the oslo.messaging bus are completely encrypted, and different conversations are encrypted using unique encryption keys.

[Figure: rpc-security-2]

The messaging traffic on oslo.messaging is solely for oslo_messaging.rpc, the OpenStack Remote Procedure Call mechanism. The API service makes calls into the Task Manager, the Task Manager makes calls into the Guest Agent, and the Guest Agent makes calls into the Conductor.

The picture above shows these different conversations, and the encryption keys used on each. When the API service makes an RPC call to the Task Manager, all parameters to the call are encrypted using K1 which is stored securely on the control plane.

Unique encryption keys are created for each guest instance, and these keys are used for all communication with that instance. When the Task Manager wishes to make a call to Guest Agent 1, it uses the instance-specific key K2; when it wants to make a call to Guest Agent 2, it uses the instance-specific key K3. When the guest agents make calls to the Conductor, the traffic is encrypted using their instance-specific keys, and the Conductor decrypts the parameters using those same keys.
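
As a rough illustration of the idea (this is not the actual Trove implementation, which hooks into the oslo.messaging serialization path and manages keys itself; here a Fernet key simply stands in for an instance-specific key):

# A minimal sketch of per-instance message encryption. K2 is generated when
# the guest instance is created and is shared only between the control plane
# and that one guest agent.
from cryptography.fernet import Fernet

instance_key_k2 = Fernet.generate_key()

def encrypt_rpc_args(args_blob, key):
    # Encrypt the serialized RPC parameters before they go onto the bus.
    return Fernet(key).encrypt(args_blob)

def decrypt_rpc_args(token, key):
    # Decrypt the RPC parameters on the receiving side.
    return Fernet(key).decrypt(token)

# Task Manager -> Guest Agent 1: encrypted with K2. A message captured by a
# compromised Guest Agent 2 (which holds only K3) cannot be decrypted.
payload = encrypt_rpc_args(b'{"method": "prepare", "flavor": "m1.small"}', instance_key_k2)
assert decrypt_rpc_args(payload, instance_key_k2) == b'{"method": "prepare", "flavor": "m1.small"}'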

In a well-configured production deployment, one that takes steps to secure the system, if a bad actor were to compromise a guest instance (say Guest Agent 1) and get access to K2 and the RabbitMQ credentials, the attacker could access RabbitMQ but would not be able to do anything to impact either another guest instance (not having K3) or the Task Manager (not having K1).

Code that implements this capability is currently in upstream review.


This blog post resulted in a brief Twitter exchange with Adam Young (@admiyoung).

Unfortunately, a single user (in RabbitMQ) for Trove isn’t the answer. Should a guest get compromised, then those credentials are sufficient to post messages to RabbitMQ and cause some amount of damage.

One would need per-guest-instance credentials to avoid this, or one of the many other solutions (like shadow tenants).

Enabling hacking extensions: The right way

Of late, I wake up every morning raring to go and work on the next cool thing in Trove, only to see that overnight some well-meaning person has contributed a change that looks something like this:

String interpolation should be delayed to be handled by the logging code, rather than being done at the point of the logging call.
Ref: http://docs.openstack.org/developer/oslo.i18n/guidelines.html#log-translation

For example:
# WRONG
LOG.info(_LI('some message: variable=%s') % variable)
# RIGHT
LOG.info(_LI('some message: variable=%s'), variable)

And the code submitted fixes a small number (let's say 5) of the places where strings sent to logging are rendered at the call site.

As I said at the TC-Board meeting in Barcelona, these well-meaning people are actually submitting what on the face of it appear to be valid corrections to the code. Yet, I submit to you that these changes represent a scourge that we should stamp out.

I know for a fact that in Trove there are (currently) 751 occurrences of this particular style error. This is the hacking extension H904, and when enabled in Trove, I get this:

$ tox -e pep8 | grep H904 | wc -l
751

That’s the catch: Trove does not enable this hacking extension. A quick look indicates that only Neutron does.

Why are these well meaning changes a scourge? Here’s why …

  • Fixing a small fraction of these errors without preventing them from recurring does not materially improve the project
  • Each of these changes consumes considerable CI resources to verify and approve
  • Each of these changes takes time for someone to review, time that could be better spent if we were to fix these problems properly.

So, I submit to you that if you want to submit a patch to fix one of these hacking issues, here is the right way. Of course, I’m opinionated, I’m going to reference one of my own changes as an example!

  1. If your project does not have hacking extensions, this commit shows you what you have to do to enable them. You may have to bump test-requirements.txt and update the version of hacking that you use in order to use the 'enable-extensions' option.
  2. Enable the hacking rule or extension for the particular style issue at hand; let’s illustrate with H203. H203 ensures that we use assertIsNone() and not assertEqual(None, …).
  3. Run the pep8 test against the project and find and correct all places where the failure occurs. Typically this is accomplished by just running 'tox -e pep8'.
  4. Test that the code does in fact work as expected; correcting style issues can introduce functional errors, so make sure that the unit tests pass too. Typically this is accomplished by running 'tox -e py27 -e py34'.
  5. Actually exercise the system: launch a system with devstack and the project enabled, and put it through its paces. In the case of Trove, actually build a guest and launch a database or two.
  6. Then submit your change for review, including the change to tox.ini that enables the hacking rule (a sketch of that tox.ini change follows this list).
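
For reference, the tox.ini change in step 6 typically amounts to something like this (the exact list of checks is project-specific; H203 and H904 are simply the examples used above):

[flake8]
# off-by-default hacking checks that this project has now cleaned up;
# future violations will fail the pep8 job
enable-extensions = H203,H904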

Well, that’s a lot of work! Sure, you really have to work for your Stackalytics credit, right? I’m sure the load on the CI system will show that this is worthwhile.

It is better to do things this way in the long run. With the hacking rule enabled, future commits will also comply with the rule (they will fail pep8 if they don’t). And that will put an end to the cottage industry that has sprung up around finding these kinds of errors and fixing them one at a time.

In conclusion I urge reviewers in all projects to summarily reject style changes that don’t also enable a hacking rule. Approving them is the wrong thing to do. Require the contributor to enable the hacking rule, and fix the problem the right way. That’s what a good code review is about.

My adventures with pylint

I have long believed that it is ok to make any given mistake once. But to make it again is, I believe, unforgivable.

This should, I believe, apply to all things that we do as software developers, and to that end, I feel that code changes to fix issues should be backed up in some way by testing that prevents recurrence of problems.

In my day job, I work on the OpenStack Trove project, and when I review changes proposed by others, I tend to apply this yardstick. When someone fixes a line of code to correct some issue, it is common to expect that there will be a test to verify this operation in the future.

Recently, I reviewed and approved a change that pretty immediately resulted in a regression. Here’s the diff of the code in question:

     if not manager:
-        msg = ("Manager class not registered for datastore manager %s" %
+        msg = (_LE("Manager class not registered for datastore manager %s") %
                CONF.datastore_manager)
         raise RuntimeError(msg)

Since Python is not a compiled language, and since no tests exist for the case where manager is None, this code path was never exercised and the fact that _LE was not defined went unnoticed. Sure enough, a couple of days later the regression was reported.

This got me wondering how the problem could have been avoided. Surely Python must have some tools to catch this kind of thing. How did this escape the development process (the obvious explanation of sloppy code review aside)?

It turns out that there is a mechanism to catch these kinds of things: pylint. It also turns out that we don’t use pylint very much in OpenStack.
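
To see why pylint helps here, consider a stripped-down reproduction of the bug (the function and variable names are illustrative, not the actual Trove code):

# bad.py: _LE is used but never imported, so this branch raises NameError
# at runtime instead of the intended RuntimeError.
def check_manager(manager, datastore_manager):
    if not manager:
        msg = (_LE("Manager class not registered for datastore manager %s") %
               datastore_manager)
        raise RuntimeError(msg)
    return manager

Running pylint on this file (for example with --disable=all --enable=undefined-variable) flags the _LE reference as an undefined variable without ever executing the branch, which is exactly the class of error the unit tests missed.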

Well, a short while later I was able to run my little pylint-based wrapper on Trove and fix some egregious bugs.

pylint doesn’t natively give you a way to provide a specific set of issues that must be ignored (something which bandit does). So I modeled this wrapper on the way bandit does things and allowed for a set of ignored exceptions.
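
The wrapper is roughly along these lines (a sketch only; the ignore-list file name and format are assumptions for illustration, and the actual tool differs in detail):

# Sketch of a bandit-style pylint wrapper: run pylint, drop any findings
# that appear in a known-issues list, and fail only on new findings.
import json
import subprocess
import sys

IGNORE_FILE = "tools/pylint-ignore.json"  # hypothetical known-issues list

def load_ignores():
    with open(IGNORE_FILE) as f:
        # each entry: {"path": ..., "symbol": ..., "message": ...}
        return {(i["path"], i["symbol"], i["message"]) for i in json.load(f)}

def run_pylint(target):
    proc = subprocess.run(["pylint", "--output-format=json", target],
                          capture_output=True, text=True)
    return json.loads(proc.stdout or "[]")

def main():
    ignores = load_ignores()
    new_issues = [m for m in run_pylint("trove")
                  if (m["path"], m["symbol"], m["message"]) not in ignores]
    for m in new_issues:
        print("%s:%s [%s] %s" % (m["path"], m["line"], m["symbol"], m["message"]))
    sys.exit(1 if new_issues else 0)

if __name__ == "__main__":
    main()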

I’ll make this a job in the Trove gate soon and that will help stamp out these kinds of issues more quickly.

Debugging Trove gate failures

Of late, I’ve spent a fair amount of time debugging Trove’s gate failures. This isn’t the first time; it generally happens around release time. And each time, I relearn the same things. So this time, I’ll make a note of what I’ve done recently. Hopefully, it’ll ease the process next time.


Utilizing OpenStack Trove DBaaS for deployment management

Ron Bradford posted this interesting article on his blog after a recent trip I made to New York City.

Trove is used for self-service provisioning and lifecycle management for relational and non-relational databases in an OpenStack cloud. Trove provides a RESTful API interface that is the same regardless of the type of database. CLI tools and a web UI via Horizon are also provided, wrapping Trove API requests.

Source: Utilizing OpenStack Trove DBaaS for deployment management
