Hi Janice,
This issue could be related to whatever client software is connecting to your Vertica database. The connections are basically going away unexpectedly.
Or maybe you are experiencing an issue that we were having.
We had a long-running Cognos cube build that would query the database and then go off and do some computing on its own. Sometimes this processing would take a long time (> 1 hour), and when it came back to query the database again the cube build would fail with Cognos reporting this error:
RPT-DBL-3501 The following error was detected attempting to validate the report: UDA-SQL-0532 Data Source is not accessible: "invalid data source".
When checking the Vertica log files we would see these errors:
2012-01-21 14:19:09.757 Init Session:0x2aaaad130e00 <LOG> @v_marketing_node0001: 08006: could not receive data from client: Connection reset by peer
2012-01-21 14:19:09.757 Init Session:0x2aaaad130e00 <LOG> @v_marketing_node0001: 08006: unexpected EOF on client connection
With a little digging around in the Vertica log, we found that the "Connection reset by peer" errors always appeared 1 hour after the last completed database query on the Cognos connection.
So we figured we'd increase the "Inactivity Timeout" setting in Cognos to be greater than 1 hour (we tried 4 hours). We re-ran the cube build and to our surprise we received the same error in Cognos and saw the same errors in the Vertica log file exactly 1 hour after the last query finished in Vertica for that session! The Vertica session that Cognos was trying to run another query in went away again!
After some more digging, we found that there is a 1 hour inactivity timeout on our network infrastructure (in the routers). This was killing the inactive connection between Cognos and Vertica!
Our connections were being made using the Vertica ODBC drivers which unfortunately do not include any type of "Auto reconnect" feature or "Timeout" settings.
The fix for us was to lower a Linux kernel setting named tcp_keepalive_time on each of our Vertica nodes. The tcp_keepalive_time variable tells the TCP/IP stack how long a connection can sit idle before it starts sending TCP keepalive packets to keep the connection alive. By default, the variable is set to 7200 seconds (2 hours), which is longer than the 1 hour timeout in our routers, so the connection was dropped before the first keepalive ever went out. We changed it to 1800 seconds (30 minutes) and voilà, our disconnections stopped happening!
By the way, here is how to change the value of the variable:
echo 1800 > /proc/sys/net/ipv4/tcp_keepalive_time
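If you want to check whether a node's current value is low enough for your network, something like the rough sketch below might help. The helper name keepalive_ok is made up for illustration, and the 3600 second idle timeout is just the value from our routers, so substitute whatever your network gear actually enforces.

```shell
#!/bin/sh
# keepalive_ok KEEPALIVE_SECS IDLE_TIMEOUT_SECS
# Hypothetical helper: succeeds (exit status 0) when the first keepalive
# probe would be sent before the network idle timeout expires.
keepalive_ok() {
    [ "$1" -lt "$2" ]
}

# Assumed idle timeout of the network gear, in seconds (1 hour in our case).
IDLE_TIMEOUT=3600

# Read the live kernel value when it is available (Linux only).
if [ -r /proc/sys/net/ipv4/tcp_keepalive_time ]; then
    current=$(cat /proc/sys/net/ipv4/tcp_keepalive_time)
    if keepalive_ok "$current" "$IDLE_TIMEOUT"; then
        echo "tcp_keepalive_time=$current is below the idle timeout"
    else
        echo "tcp_keepalive_time=$current is too high; idle connections may be dropped"
    fi
fi
```

Keep in mind that a value written with echo into /proc only lasts until the next reboot. To make it stick, the usual approach is to add the line net.ipv4.tcp_keepalive_time = 1800 to /etc/sysctl.conf and load it with sysctl -p (as root).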
I hope this helps you out!