
Connection reset by peer

Posted: Fri Jun 22, 2012 2:50 pm
by janice
Hi all,

Has anyone seen an issue where Vertica sessions are lost and vertica.log reports messages like "could not receive data from client: Connection reset by peer" and "unexpected EOF on client connection"?

Thanks!

Re: Connection reset by peer

Posted: Sat Jun 23, 2012 1:22 pm
by JimKnicely
Hi Janice,

This issue could be related to whatever client software is connecting to your Vertica database. The connections are basically going away unexpectedly.

Or maybe you are experiencing an issue that we were having.

We had a long-running Cognos cube build that would query the database and then go off and do some computing on its own. Sometimes this processing would take a long time (> 1 hour), and when it came back to query the database again the cube build would fail with Cognos reporting the error "RPT-DBL-3501 The following error was detected attempting to validate the report: UDA-SQL-0532 Data Source is not accessible: "invalid data source".".

When checking the Vertica log files we would see these errors:

2012-01-21 14:19:09.757 Init Session:0x2aaaad130e00 <LOG> @v_marketing_node0001: 08006: could not receive data from client: Connection reset by peer
2012-01-21 14:19:09.757 Init Session:0x2aaaad130e00 <LOG> @v_marketing_node0001: 08006: unexpected EOF on client connection


With a little digging around in the Vertica log, we found that the "Connection reset by peer" errors always appeared 1 hour after the last completed database query of the Cognos connection.
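
If you want to pull those disconnect messages out of your own log, something like the sketch below should do it. The function name is just for illustration, and you pass in wherever your vertica.log actually lives; the two message strings are the ones from the log lines above.

```shell
#!/bin/sh
# find_disconnects LOG_FILE
# Prints the client-disconnect messages found in a Vertica log.
# The function name is illustrative; LOG_FILE is the path to your vertica.log.
find_disconnects() {
    grep -E 'Connection reset by peer|unexpected EOF on client connection' "$1"
}

# Example: find_disconnects /path/to/vertica.log
```

Comparing the timestamps of these lines with the last completed query for the same session is how the fixed 1 hour gap shows up.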

So we figured we'd increase the "Inactivity Timeout" setting in Cognos to be greater than 1 hour (we tried 4 hours). We re-ran the cube build and to our surprise we received the same error in Cognos and saw the same errors in the Vertica log file exactly 1 hour after the last query finished in Vertica for that session! The Vertica session that Cognos was trying to run another query in went away again!

After some more digging, we found that there is a 1 hour inactivity timeout on our network infrastructure (in the routers). This was killing the inactive connection between Cognos and Vertica!

Our connections were being made using the Vertica ODBC drivers, which unfortunately do not include any type of "Auto reconnect" feature or "Timeout" settings.

The fix for us was to lower a Linux kernel parameter named tcp_keepalive_time on each of our Vertica nodes. The tcp_keepalive_time parameter tells the TCP/IP stack how long a connection may sit idle before it starts sending TCP keepalive packets to keep that connection alive. By default, it is set to 7200 seconds (2 hours), so our routers were dropping the idle connection after 1 hour, before the first keepalive ever went out. We changed it to 1800 seconds (30 minutes) and voilà, our disconnections stopped happening!
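
To see what a node is currently using, you can read the value straight out of /proc. This is only a sketch: the helper name is made up, and the idle timeout you pass in is whatever your own network enforces (3600 seconds in our case). It just reports whether the first keepalive would go out before the network drops the idle connection.

```shell
#!/bin/sh
# keepalive_ok IDLE_TIMEOUT [PROC_DIR]
# Succeeds if tcp_keepalive_time is shorter than the network's idle timeout.
# PROC_DIR defaults to the standard Linux location; the helper name is illustrative.
keepalive_ok() {
    idle_timeout=$1
    proc_dir=${2:-/proc/sys/net/ipv4}
    keepalive_time=$(cat "$proc_dir/tcp_keepalive_time") || return 2
    echo "tcp_keepalive_time=$keepalive_time idle_timeout=$idle_timeout"
    [ "$keepalive_time" -lt "$idle_timeout" ]
}

# Example: keepalive_ok 3600 || echo "keepalive fires too late"
```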

By the way, here is how to change the value of the variable:

Code:

echo 1800 > /proc/sys/net/ipv4/tcp_keepalive_time
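
One caveat: a value written to /proc like this only lasts until the next reboot. Here is a minimal sketch of making it stick, assuming your distribution reads /etc/sysctl.conf at boot (most do, but check yours). The helper name is just for illustration.

```shell
#!/bin/sh
# persist_keepalive SECONDS [CONF_FILE]
# Records net.ipv4.tcp_keepalive_time in a sysctl config file, replacing
# any existing entry for that key. CONF_FILE defaults to /etc/sysctl.conf.
persist_keepalive() {
    seconds=$1
    conf=${2:-/etc/sysctl.conf}
    tmp=$(mktemp) || return 1
    # Keep every line except an existing tcp_keepalive_time setting.
    grep -v '^[[:space:]]*net\.ipv4\.tcp_keepalive_time[[:space:]]*=' "$conf" > "$tmp" 2>/dev/null || true
    echo "net.ipv4.tcp_keepalive_time = $seconds" >> "$tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}

# As root: persist_keepalive 1800 && sysctl -p
```

Running sysctl -p afterwards reloads /etc/sysctl.conf, so the setting takes effect without a reboot.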
I hope this helps you out!

Re: Connection reset by peer

Posted: Tue Jun 26, 2012 2:17 am
by janice
Thanks, knicely87! Wow, great explanation. That was the issue. I changed the variable and the connections were no longer dropped! Is that all I need to do? Should I make the change on all three of our nodes?

Re: Connection reset by peer

Posted: Wed Jun 27, 2012 3:35 am
by JimKnicely
janice wrote: "Should I make the change on all three of our nodes?"
Yes, you will need to make the change on all the nodes!
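
If you would rather not log in to each node by hand, a loop over ssh can do it. This is a rough sketch that assumes you have root ssh access to every node; the function name and the host names in the example are made up, so substitute your own.

```shell
#!/bin/sh
# set_keepalive_on_nodes SECONDS HOST...
# Runs the same sysctl change on each host over ssh (assumes root ssh access).
# Set RUN=echo to print the commands instead of running them.
set_keepalive_on_nodes() {
    seconds=$1
    shift
    for host in "$@"; do
        ${RUN:-} ssh "root@$host" "sysctl -w net.ipv4.tcp_keepalive_time=$seconds" || return 1
    done
}

# Dry run with made-up host names:
# RUN=echo set_keepalive_on_nodes 1800 node01 node02 node03
```

The dry run is worth doing first so you can see exactly what would be sent to each node.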

Re: Connection reset by peer

Posted: Sun Jul 01, 2012 12:05 pm
by nnani
Jim,

Amazing description of the issue and its cause. Thanks for sharing this information with us.

Re: Connection reset by peer

Posted: Sun Jul 01, 2012 2:44 pm
by JimKnicely
Thanks, nnani!

That issue plagued us for weeks! I kept blaming Cognos for the dropped connections. I kind of feel bad now for blaming that product. Then again, they kept saying it was a Vertica issue... We were both wrong in the end!