Last week one of the nodes crashed and has been stuck in recovering state since. I have little experience as a Vertica database administrator, so I'm not sure what to try. Here's what I've tried so far:
- Forced the database to restart via admintools. Three of the four nodes came up successfully, but the fourth remains stuck in recovering state.
- Reduced the load on Vertica. I've greatly increased the time between ETL runs to about 6 hours (from 15 minutes) to give the recovery process more time to complete without dealing with any new table writes. This was suggested in a post I ran across, but so far it has had no effect.
- Checked vertica.log. The log file is mostly filled with messages like these, which repeat continuously:
2014-04-08 04:56:06.000 Timer Service:0x743d550 [Txn] <INFO> Begin Txn: c000000044f3c9 'ProjUtil::getLocalNodeLGE'
2014-04-08 04:56:06.005 Timer Service:0x743d550 [Txn] <INFO> Rollback Txn: c000000044f3c9 'ProjUtil::getLocalNodeLGE'
2014-04-08 04:56:06.005 Timer Service:0x743d550 [Recover] <INFO> My local node LGE = 0x1a4ccc and current epoch = 0x1a616c
2014-04-08 04:56:06.180 DistCall Dispatch:0x7f8e3cb19bc0 [Txn] <INFO> Rollback Txn: a00000009ec804 'sendGetClusterLGE'
2014-04-08 04:56:06.331 DistCall Dispatch:0x7f8e3cb19bc0 [Txn] <INFO> Rollback Txn: a00000009ec805 'sendCheckMissingLibraries'
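
To see whether the node is making any progress, it may help to track the gap between the local node's LGE (Last Good Epoch) and the current epoch reported in the [Recover] lines. Below is a minimal sketch, assuming the log line format is exactly as in the excerpt above; the function name `epoch_gap` and the idea of comparing the gap across successive log lines are my own assumptions, not anything provided by Vertica.

```python
import re

# Matches the [Recover] line format shown in the log excerpt above.
# Assumption: both values are always logged as 0x-prefixed hex.
LGE_PATTERN = re.compile(
    r"My local node LGE = (0x[0-9a-fA-F]+) and current epoch = (0x[0-9a-fA-F]+)"
)


def epoch_gap(line):
    """Return (lge, current_epoch, gap) as ints, or None if the line does not match."""
    match = LGE_PATTERN.search(line)
    if match is None:
        return None
    lge = int(match.group(1), 16)
    current = int(match.group(2), 16)
    return lge, current, current - lge


sample = (
    "2014-04-08 04:56:06.005 Timer Service:0x743d550 [Recover] <INFO> "
    "My local node LGE = 0x1a4ccc and current epoch = 0x1a616c"
)
print(epoch_gap(sample))  # (1723596, 1728876, 5280)
```

For the line above, the node is 5280 epochs behind. If that number stays flat or grows across later log lines, the node is presumably not catching up.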
I would greatly appreciate any suggestions for how to troubleshoot and steps to try.