February 16, 2012

Recovering from ovs-agent db corruption

I recently had a clustered OVM 2.2 system that suffered a catastrophic power failure across all nodes. This ended up corrupting several of the Berkeley DB databases that ovs-agent keeps in /etc/ovs-agent/db and /OVS/.ovs-agent/db. On starting ovs-agent, I was getting errors like:

"2012-02-15 21:56:24" INFO=> ha_set_shutdown_mode: inform master agent, leave shutdown mode.
"2012-02-15 21:56:54" ERROR=> ha_set_shutdown_mode: failed. =>errcode=00001, errmsg=CDS accquire lock /etc/ovs-agent/db/ataskaux.lock timeout. locker process is 16953.

StackTrace:
File "/opt/ovs-agent-2.3/OVSXHA.py", line 488, in ha_set_shutdown_mode
rs = sp.set_shutdown_mode(",".join(my_ips), False)
File "/opt/ovs-agent-2.3/OVSServerProxy.py", line 65, in __getattr__
if not OVSAsyncTaskAux.in_asyncenv():
File "/opt/ovs-agent-2.3/OVSAsyncTaskAux.py", line 143, in in_asyncenv
taskid = get_asynctaskid()
File "/opt/ovs-agent-2.3/OVSAsyncTaskAux.py", line 151, in get_asynctaskid
cds = CDS('ataskaux')
File "/opt/ovs-agent-2.3/OVSCDS.py", line 119, in __init__
raise CDSLockTimeout(ERR_CDS_LOCK_TIMOUT, {
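The lock timeout message names the process it thinks is holding the lock (16953 in the log above). After a power failure that process is unlikely to still exist, and a quick way to check is sketched below. locker_alive is a helper name I have made up for illustration; it is not part of ovs-agent, and it only tells you whether the PID exists, not whether the database behind the lock is healthy.

```shell
# Sketch only: report whether the PID named in a lock-timeout message still exists.
# locker_alive is a hypothetical helper, not something ovs-agent ships.
locker_alive() {
    # kill -0 delivers no signal; it only checks that the process exists
    # and that the caller is allowed to signal it (so run this as root).
    kill -0 "$1" 2>/dev/null
}

# Usage:
#   if locker_alive 16953; then echo "still running"; else echo "gone"; fi
```

If the PID is gone, the lock is stale, which fits with the databases having been left in a bad state by the outage.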

Also, when trying to start virtual machines, I would get the following errors:

"2012-02-15 21:49:35" ERROR=> ha_unregister_lock: failed. lock_name('OVM_EL5U4_X86_OVM_MANAGER_PVM') =>

StackTrace:
File "/opt/ovs-agent-2.3/OVSXHA.py", line 358, in ha_unregister_lock
rs = sp.unregister_dlm_lock(lock_name)
File "/opt/ovs-agent-2.3/OVSServerProxy.py", line 65, in __getattr__
if not OVSAsyncTaskAux.in_asyncenv():
File "/opt/ovs-agent-2.3/OVSAsyncTaskAux.py", line 143, in in_asyncenv
taskid = get_asynctaskid()
File "/opt/ovs-agent-2.3/OVSAsyncTaskAux.py", line 151, in get_asynctaskid
cds = CDS('ataskaux')
File "/opt/ovs-agent-2.3/OVSCDS.py", line 123, in __init__
dbs = dbshelve.open(db_file)
File "/usr/lib/python2.4/bsddb/dbshelve.py", line 73, in open
d.open(filename, dbname, filetype, flags, mode)

"2012-02-15 21:49:35" ERROR=> ha_release_dlm_lock:failed. lock('9f0ba407-8a67-1ceb-d3fc-e0df3ab618fe') name('OVM_EL5U4_X86_OVM_MANAGER_PVM') => <Exception: failed:

StackTrace:
File "/opt/ovs-agent-2.3/OVSXHA.py", line 358, in ha_unregister_lock
rs = sp.unregister_dlm_lock(lock_name)
File "/opt/ovs-agent-2.3/OVSServerProxy.py", line 65, in __getattr__
if not OVSAsyncTaskAux.in_asyncenv():
File "/opt/ovs-agent-2.3/OVSAsyncTaskAux.py", line 143, in in_asyncenv
taskid = get_asynctaskid()
File "/opt/ovs-agent-2.3/OVSAsyncTaskAux.py", line 151, in get_asynctaskid
cds = CDS('ataskaux')
File "/opt/ovs-agent-2.3/OVSCDS.py", line 123, in __init__
dbs = dbshelve.open(db_file)
File "/usr/lib/python2.4/bsddb/dbshelve.py", line 73, in open
d.open(filename, dbname, filetype, flags, mode)
>

StackTrace:
File "/opt/ovs-agent-2.3/OVSXHA.py", line 438, in ha_release_dlm_lock
raise Exception(rs)

Have no fear, there is a fix.

Take note of which repo is root, and get a listing of all your repos:

/opt/ovs-agent-latest/utils/repos.py -l; mount | grep "/var/ovs/mount"

Stop the OVS agent:

/sbin/service ovs-agent stop --disable-nowayout

Delete all the databases (have no fear, honest; or mv them aside, it's up to you):

rm -rf /etc/ovs-agent/db /OVS/.ovs-agent/db

Recreate each of the repos, making one root, and then initialise. Do this on each node.

/opt/ovs-agent-latest/utils/repos.py -n /dev/xyz
/opt/ovs-agent-latest/utils/repos.py -n /dev/abc
/opt/ovs-agent-latest/utils/repos.py -l
/opt/ovs-agent-latest/utils/repos.py -r <uuid>
/opt/ovs-agent-latest/utils/repos.py -i
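For the delete step above, if you would rather keep the corrupt databases around than rm -rf them, something like the following sketch moves a directory aside under a timestamped name. backup_db_dir and the .corrupt- suffix are names I have made up for illustration; nothing here comes from ovs-agent itself.

```shell
# Sketch only: move a database directory aside instead of deleting it.
# backup_db_dir is a hypothetical helper, not part of ovs-agent.
backup_db_dir() {
    dir="${1%/}"    # drop a trailing slash, if any
    if [ -d "$dir" ]; then
        dest="$dir.corrupt-$(date +%Y%m%d-%H%M%S)"
        # Rename in place and print the new location on success.
        mv -- "$dir" "$dest" && echo "$dest"
    else
        echo "skip: $dir does not exist" >&2
    fi
}

# Usage (as root, with ovs-agent already stopped):
#   backup_db_dir /etc/ovs-agent/db
#   backup_db_dir /OVS/.ovs-agent/db
```

The renamed directories stay on the same filesystem, so the move is quick, and you can remove them once the pool has been restored and everything is working again.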

Now start your VM Manager:

xm create -c /var/ovs/mount/<uuid>/OVM_EL5U4_X86_OVM_MANAGER_PVM/vm.cfg

Log in to VM Manager. Click on server pools; your pool should be inactive. Click on Restore. Accept the fact that it is about to destroy all the Berkeley DBs you just created and replace them with accurate data.

Coffee (tea, beer, insert beverage as you see fit). This bit does seem to take a while.

Rejoice, party, jump up and down with joy. Everything should be back to working.

© Greg Cockburn
