Vertica Nodes Randomly Fail

Moderator: NorbertKrupa

becky
Intermediate
Posts: 118
Joined: Sat Apr 28, 2012 11:37 am

Post by becky » Tue Jul 16, 2013 4:16 pm

Hey guys,

I have a three-node cluster and the nodes randomly fail.

Here is the part of the vertica.log file that starts where I think a node failed, but I can't figure out why. Can someone take a look and let me know if they've experienced any issues with Vertica nodes failing... constantly? Thanks in advance. This is with Vertica 6.1.2.

2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw membership message 8192 on V:verticadb
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw membership message 8192 on Vertica:all
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw membership message 8192 on Vertica:join
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw membership message 6144 on V:verticadb
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> NETWORK change with 2 VS sets
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> DB Group changed
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> Got current member #r416-15#NXXXXXXXXX187, v_verticadb_node0004 is UP
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [VMPI] <INFO> DistCall: Set current group members called with 1 members
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> nodeSetNotifier: node v_verticadb_node0001 left the cluster
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Recover] <INFO> Checking Deps:Down bits: 001 Deps:
111 - cnt: 38
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 <LOG> @v_verticadb_node0004: 00000/3298: Event Posted: Event Code:3 Event Id:0 Event Severity: Critical [2] PostedTimestamp: 2013-07-16 15:01:00.449481 ExpirationTimestamp: 2081-08-03 18:15:07.449481 EventCodeDescription: Current Fault Tolerance at Critical Level ProblemDescription: Loss of node v_verticadb_node0004 will cause shutdown to occur. K=1 total number of nodes=3 DatabaseName: verticadb Hostname: vertica04
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> nodeSetNotifier: node v_verticadb_node0003 left the cluster
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 [Recover] <INFO> Setting node v_verticadb_node0004 to UNSAFE
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 <LOG> @v_verticadb_node0004: 00000/3298: Event Posted: Event Code:6 Event Id:5 Event Severity: Informational [6] PostedTimestamp: 2013-07-16 15:01:00.45005 ExpirationTimestamp: 2081-08-03 18:15:07.45005 EventCodeDescription: Node State Change ProblemDescription: Changing node v_verticadb_node0004 startup state to UNSAFE DatabaseName: verticadb Hostname: vertica04
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 <LOG> @v_verticadb_node0004: 00000/3293: Event Cleared: Event Code:6 Event Id:6 Event Severity: Informational [6] PostedTimestamp: 2013-07-16 15:01:00.450118 ExpirationTimestamp: 2013-07-16 15:01:00.450118 EventCodeDescription: Node State Change ProblemDescription: Changing node v_verticadb_node0004 leaving startup state UP DatabaseName: verticadb Hostname: vertica04
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 [Recover] <INFO> Changing node v_verticadb_node0004 startup state from UP to UNSAFE
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 <LOG> @v_verticadb_node0004: 00000/3298: Event Posted: Event Code:2 Event Id:0 Event Severity: Emergency [0] PostedTimestamp: 2013-07-16 15:01:00.450501 ExpirationTimestamp: 2013-07-16 15:11:00.450501 EventCodeDescription: Loss Of K Safety ProblemDescription: System is not K-safe: K=1 total number of nodes=3 DatabaseName: verticadb Hostname: vertica04
2013-07-16 15:01:00.547 Spread Client:0x7ea89d0 [Comms] <INFO> stop: disconnecting #r416-15#NXXXXXXXXX187 from spread daemon
2013-07-16 15:01:00.547 Spread Client:0x7ea89d0 [Comms] <INFO> connected: false
2013-07-16 15:01:00.547 Spread Client:0x7ea89d0 [Comms] <INFO> DB Group changed
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [VMPI] <INFO> DistCall: Set current group members called with 0 members
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> nodeSetNotifier: node v_verticadb_node0004 left the cluster
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Lost membership of the DB group
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Removing #r3645-15#NXXXXXXXXX181->v_verticadb_node0003 from processToNode and other maps due to departure from Vertica:all
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> nodeToState map:
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Removing #r416-15#NXXXXXXXXX187->v_verticadb_node0004 from processToNode and other maps due to departure from Vertica:all
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> nodeToState map:
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Removing #r5705-15#NXXXXXXXXX180->v_verticadb_node0001 from processToNode and other maps due to departure from Vertica:all
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> nodeToState map:
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Lost membership of V:All
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> spread thread exiting
2013-07-16 15:01:00.554 SafetyShutdown:0x7f7afc0071b0 [Shutdown] <INFO> Shutting down this node
Last edited by becky on Wed Jul 17, 2013 12:23 pm, edited 2 times in total.
THANKS - BECKSTER

scutter
Master
Posts: 302
Joined: Tue Aug 07, 2012 2:15 am

Re: Vertica Nodes Randomly Fail

Post by scutter » Tue Jul 16, 2013 5:57 pm

Hi Becky,

What do you see in the log files for the nodes that this node sees as going down? Do they actually go down? Is there anything in those logs other than just "nodennnn left the cluster"? If there's nothing else in there, then it's a network issue.

- Are these nodes in a hosted environment?
- Are the nodes using a private network for vertica's data and the spread traffic?
- Check dmesg and/or /var/log/messages to see if the network interfaces are going down
- Does the spread process remain up on all nodes?
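
One way to work through the checks above is to pull the departure events out of each node's vertica.log and line their timestamps up against dmesg or /var/log/messages. The following is only a minimal sketch: the regular expression is modeled on the "nodeSetNotifier" lines in the excerpt posted above, and the function name find_departures is made up for illustration.

```python
import re

# Pattern modeled on the "nodeSetNotifier" lines in the posted log excerpt.
# Assumption: other Vertica versions may word these lines differently.
LEFT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"nodeSetNotifier: node (?P<node>\S+) left the cluster"
)


def find_departures(lines):
    """Return (timestamp, node_name) for every 'left the cluster' line."""
    events = []
    for line in lines:
        match = LEFT_RE.match(line)
        if match:
            events.append((match.group("ts"), match.group("node")))
    return events


# Three lines copied from the excerpt in the first post.
sample = [
    "2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> "
    "nodeSetNotifier: node v_verticadb_node0001 left the cluster",
    "2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Recover] <INFO> "
    "Node left cluster, reassessing k-safety...",
    "2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> "
    "nodeSetNotifier: node v_verticadb_node0003 left the cluster",
]

for ts, node in find_departures(sample):
    print(ts, node)
```

Running the same function over the log from every node shows whether the "down" nodes logged anything at the same moment, or whether only the network view changed.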

On a separate topic, the log fragment that you posted:

Deps:
111 - cnt: 38

This tells me that you have 38 projections that have segments on all nodes (unsegmented all nodes). Are you intentionally defining them that way, versus segmenting them across all nodes?
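
To make that reading of the Deps line concrete, here is a small sketch that parses a line such as "111 - cnt: 38" and reports how many node positions are set. It assumes, following the interpretation above, that each character in the bit string stands for one node; the helper name parse_deps is hypothetical and not part of any Vertica tool.

```python
import re

# Format taken from the "111 - cnt: 38" line in the posted log fragment.
DEPS_RE = re.compile(r"^(?P<bits>[01]+) - cnt: (?P<count>\d+)$")


def parse_deps(line):
    """Split a Deps line into (bit_string, projection_count), or None."""
    match = DEPS_RE.match(line.strip())
    if match is None:
        return None
    return match.group("bits"), int(match.group("count"))


def spans_all_nodes(bits, total_nodes):
    """True when every node position is set (assumes one bit per node)."""
    return len(bits) == total_nodes and bits.count("1") == total_nodes


bits, count = parse_deps("111 - cnt: 38")
print(bits, count, spans_all_nodes(bits, total_nodes=3))
```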

--Sharon
Sharon Cutter
Vertica Consultant, Zazz Technologies LLC

becky
Intermediate
Posts: 118
Joined: Sat Apr 28, 2012 11:37 am

Re: Vertica Nodes Randomly Fail

Post by becky » Tue Jul 16, 2013 7:47 pm

Hi Sharon,

Thanks for getting back to me on my issue! Yes, the servers are in a hosted environment. They are VMs. I agree that it's a network issue. When a node goes down, the spread process continues to run. Also, the vertica.pid file doesn't get deleted, so to restart Vertica on the failed host I have to delete the pid file manually.

There is nothing else in the Vertica log files to say why it failed. Is there a spread log file that I can look at? The /var/log/spreadd.log isn't very helpful.

For the segmented nodes issue, yes, I created them manually to test something else. I was going to drop those tables...

Thanks!
THANKS - BECKSTER

scutter
Master
Posts: 302
Joined: Tue Aug 07, 2012 2:15 am

Re: Vertica Nodes Randomly Fail

Post by scutter » Tue Jul 16, 2013 8:11 pm

When you installed Vertica, did you use the default -U for the spread communications? If yes, then rerun install_vertica using -T -S default, which is recommended both for hosted environments and for VMs.

--Sharon
Sharon Cutter
Vertica Consultant, Zazz Technologies LLC

becky
Intermediate
Posts: 118
Joined: Sat Apr 28, 2012 11:37 am

Re: Vertica Nodes Randomly Fail

Post by becky » Wed Jul 17, 2013 12:22 am

Hi Scutter,

Thanks! I re-ran the install like this:

/opt/vertica/sbin/install_vertica -T -s v01,v02,v03 -r vertica-6.1.2-0.x86_64.RHEL5.rpm

It ran okay, and I restarted the DB. I'll let you know how it goes!!!
THANKS - BECKSTER

becky
Intermediate
Posts: 118
Joined: Sat Apr 28, 2012 11:37 am

Re: Vertica Nodes Randomly Fail

Post by becky » Wed Jul 17, 2013 12:32 am

Oh, I just noticed one weird message at the end of running the install script (see below):

    ...
    Updating spread configuration...
    Verifying spread configuration on whole cluster.
    Error Monitor 0 errors 4 warnings
    Installation completed with warnings.
    Exception vertica.utils.pexpect.ExceptionPexpect: ExceptionPexpect() in <bound method spawn.__del__ of <vertica.utils.pexpect.spawn object at 0x23141d0>> ignored
    Installation complete.

    To create a database:
    1. Logout and login as dbadmin.**
    2. Run /opt/vertica/bin/adminTools as dbadmin
    3. Select Create Database from the Configuration Menu

    ** The installation modified the group privileges for dbadmin.
    If you used sudo to install vertica as dbadmin, you will
    need to logout and login again before the privileges are applied.

Do you think that exception is an issue?
THANKS - BECKSTER

scutter
Master
Posts: 302
Joined: Tue Aug 07, 2012 2:15 am

Re: Vertica Nodes Randomly Fail

Post by scutter » Wed Jul 17, 2013 2:03 am

If the database is running again and the spread reconfig is correct, then it's probably not an issue. Verify that /opt/vertica/config/vspread.conf has multiple Spread_Segments in it.

But it's probably worth checking the installation logs to see if you can track down what the script was doing when that error/warning occurred.
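
The vspread.conf check mentioned above can be scripted as a quick count of Spread_Segment entries. This is a rough sketch under stated assumptions: the sample configuration text, its addresses, and its node names are invented for illustration, and the function name count_spread_segments is hypothetical.

```python
def count_spread_segments(config_text):
    """Count the lines that open a Spread_Segment block."""
    return sum(
        1
        for line in config_text.splitlines()
        if line.strip().startswith("Spread_Segment")
    )


# Illustrative content only: these addresses and names are made up and
# are not taken from the cluster discussed in this thread.
sample_conf = """\
Spread_Segment 10.0.0.1:4803 {
    N010000000001 10.0.0.1
}
Spread_Segment 10.0.0.2:4803 {
    N010000000002 10.0.0.2
}
Spread_Segment 10.0.0.3:4803 {
    N010000000003 10.0.0.3
}
"""

segments = count_spread_segments(sample_conf)
print("Spread_Segment entries:", segments)
print("multiple segments:", segments > 1)
```

To run it against a real node, read the contents of /opt/vertica/config/vspread.conf into a string and pass that string to the function in place of sample_conf.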

--Sharon
Sharon Cutter
Vertica Consultant, Zazz Technologies LLC
