Skip to main content

OpenStack

Database creation fails

  • Problem:

    TASK [keystone : Creating keystone database] ***********************************
    fatal: [testbed-node-0]: FAILED! => changed=false
    action: mysql_db
    msg: 'unable to find /var/lib/ansible/.my.cnf. Exception message: (2003, "Can''t connect to MySQL server on ''api-int.local'' ([Errno 111] Connection refused)")'
  • Solution:

    Restart the kolla_toolbox container. in this case on the node testbed-node-0.

    $ osism console testbed-node-0/
    testbed-node-0>>> restart kolla_toolbox
    kolla_toolbox
    testbed-node-0>>>

Ceph connections not working

  • Problem: auth: error parsing file or auth: failed to load

    $ docker exec -ti nova_compute ceph -k /etc/ceph/ceph.client.nova.keyring -n client.nova -s
    2024-06-28T06:43:05.660+0000 7d5df526b640 -1 auth: error parsing file /etc/ceph/ceph.client.nova.keyring: cannot parse buffer: Malformed input
    2024-06-28T06:43:05.660+0000 7d5df526b640 -1 auth: failed to load /etc/ceph/ceph.client.nova.keyring: (5) Input/output error
    2024-06-28T06:43:05.664+0000 7d5df526b640 -1 auth: error parsing file /etc/ceph/ceph.client.nova.keyring: cannot parse buffer: Malformed input
    2024-06-28T06:43:05.664+0000 7d5df526b640 -1 auth: failed to load /etc/ceph/ceph.client.nova.keyring: (5) Input/output error
    2024-06-28T06:43:05.664+0000 7d5df526b640 -1 auth: error parsing file /etc/ceph/ceph.client.nova.keyring: cannot parse buffer: Malformed input
    2024-06-28T06:43:05.664+0000 7d5df526b640 -1 auth: failed to load /etc/ceph/ceph.client.nova.keyring: (5) Input/output error
    2024-06-28T06:43:05.664+0000 7d5df526b640 -1 auth: error parsing file /etc/ceph/ceph.client.nova.keyring: cannot parse buffer: Malformed input
    2024-06-28T06:43:05.664+0000 7d5df526b640 -1 auth: failed to load /etc/ceph/ceph.client.nova.keyring: (5) Input/output error
    2024-06-28T06:43:05.664+0000 7d5df526b640 -1 monclient: keyring not found
    [errno 5] RADOS I/O error (error connecting to the cluster)
  • Solution:

    Check your Ceph keyfiles. Probably a missing newline at the EOF.

Cinder volume create failure

  • Problem: Volume creation is stuck after creation of the database object with no host assigned.

  • Solution:

    Database objects are created by the API service for valid request while the host is assigned by the scheduler.

    • Check the scheduler logs for errors
    • If there is nothing wrong with the scheduler itself, check the communication between the services via oslo.messaging Usually this is done via rabbitmq:
      • Check cluster status on every node for status, alarms and network partitions
        docker exec rabbitmq rabbitmqctl cluster_status
      • Check rabbitmq logs for errors
      • Check rabbitmq queues for errors or accumulated messages
        docker exec rabbitmq rabbitmqctl list_queues name type state consumers messages | grep -E '^cinder'
      • If everything seems fine check network connectivity to rule out network issues
        osism validate kolla-connectivity
      • If networking is fine then as a last resort a reset of rabbitmq may be considered Beware that this will destroy rabbitmq state which may result in inconsistent resource states
        osism apply rabbitmq-reset-state

Redeploying compute node results in nova-compute service startup error

  • Problem: The nova-compute services is refusing to start because of not our first startup on this host

    nova.exception.InvalidConfiguration: No local node identity found, but this is not our first startup on this host. Refusing to start after potentially having lost that state!
  • Solution:

    • Get the ID of the hypervisor

      $ openstack --os-cloud admin hypervisor show -f value -c id testbed-node-0
      a78b460d-2a38-4d50-b904-7eddbe6cfccb
    • Add this ID to /var/lib/nova/compute_id (in the case you use local storage)

      $ docker exec -it nova_compute bash
      (nova-compute)[nova@testbed-node-0/]$
      # echo -n "a78b460d-2a38-4d50-b904-7eddbe6cfccb" > /var/lib/nova/compute_id

Loadbalancer are stuck in an immutable state

  • Problem: Newly created/updated loadbalancers using the amphora provider are stuck in provisioning state PENDING_CREATE/PENDING_UPDATE

  • Solution:

    • One possible cause is a communication failure between octavia workers and the amphora
      • Check for expired octavia certificates
        osism apply octavia-certificates -- -e octavia_certs_check_expiry=true -e octavia_certs_expiry_limit=0
      • Recreate certificates if they have expired
        Unfortunately the certificates will not be recreated automatically once expired and have to be deleted first
        • To recreate all certificates including server and client CA clean up by running
          docker exec kolla-ansible bash -c "rm -rf /share/{server,client}_ca"
          on the manager
        • If only the client certificate is expired execute the following steps on the manager to clean up the old certificate
          • Create a backup of the Client CA
            docker exec kolla-ansible cp -a /share/client_ca /share/client_ca-$(date -I)
          • Remove the expired client certificate
            docker exec -it kolla-ansible bash -c "rm /share/client_ca/{client.cert-and-key.pem,client.cert.pem,client.csr.pem,index.txt}"
        • To actually create and copy the existing and created octavia certificates execute the following commands on the manager
          osism apply octavia-certificates
          osism apply copy-octavia-certificates
        • Deploy octavia with newly created certificates
          osism apply octavia
      • If certificates are fine check the octavia loadbalancer management network as another possible cause
    • Once the root cause has been resolved fix loadbalancers stuck in state PENDING_*.
      Unfortunately loadbalancers cannot be moved out of PENDING states using the API, so they are set to ERROR state in the DB
      Connect to the octavia database on a control node using the octavia DB credentials
      docker exec -it mariadb mysql -uoctavia -p octavia
      For every ID of a loadbalancer stuck in a pending state set the provisioning_status to ERROR
      MariaDB [octavia]> update load_balancer set provisioning_status='ERROR' where id='7f46b3f1-a405-4bbd-b0d0-5bf33a8cc04f';
    • Failover loadbalancers just set to ERROR state
      openstack --os-cloud admin loadbalancer failover 7f46b3f1-a405-4bbd-b0d0-5bf33a8cc04f
      If the client or server CA certificates have also been changed then all amphora based loadbalancers will need to be failed over to reestablish communication.