Stale checkpoint and too many ROS

Post by **Pmachinaud** » Thu Nov 06, 2014 1:41 pm

Hey guys !

I'm pretty new to Vertica, and for the past 2 weeks my new job has consisted of maintaining the Vertica database.

We run in "standalone mode" (only 1 node).

The database has a heavy workload: data is loaded/updated/deleted every 5-15 minutes across multiple schemas.
It is also heavily read during business hours.

Our Vertica server has 32 GB RAM and 2 x Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz with hyper-threading.

Issue with ROS:

We frequently hit this error during heavy workload:
Attempted to Create Too Many ROS Containers.

As a manual action we ran: SELECT DO_TM_TASK('mergeout', 't1');
But the issue happened again the day after.
As a workaround, we switched to the COPY LOCAL command.
But the issue happened again the day after.
As a workaround, following https://my.vertica.com/docs/4.1/HTML/Master/14402.htm, we reduced MergeOutInterval and MoveOutInterval to 300 and 150 respectively.
We increased the TM threads from 3 to 5.
Memory size: 200MB before, 400MB after.
But the issue happened again the day after.
As a workaround, again following https://my.vertica.com/docs/4.1/HTML/Master/14402.htm, we reduced MergeOutInterval and MoveOutInterval further, to 150 and 70 respectively.

We are planning to double the RAM on this server.

Is there a real workaround to avoid this issue ?

Issue with Stale Checkpoint:

As I plan to set up monitoring on Vertica, I ran SELECT * FROM v_monitor.monitoring_events LIMIT 100;
I found this kind of warning:

0 Warning 2014-11-05 09:26:45 2014-11-05 09:36:45 Stale Checkpoint Node v_akabi_node0001 has data 1032429 seconds old which has not yet made it to disk. (LGE: 668639 Last: 694778)
0 Warning 2014-11-05 09:21:45 2014-11-05 09:31:45 Stale Checkpoint Node v_akabi_node0001 has data 1032129 seconds old which has not yet made it to disk. (LGE: 668639 Last: 694757)

All I can find about this issue is that we have to monitor it.
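For the monitoring side, here is a minimal Python sketch that pulls the numbers out of the warning text and converts the age to days. The regular expression is only inferred from the two rows pasted above, and the function name `parse_stale_checkpoint` is made up for illustration; other Vertica versions may word the event differently.

```python
import re

# Pattern inferred from the two sample rows in this post (assumption);
# it is not taken from any Vertica documentation.
STALE_RE = re.compile(
    r"Node (?P<node>\S+) has data (?P<age>\d+) seconds old "
    r"which has not yet made it to disk\. "
    r"\(LGE: (?P<lge>\d+) Last: (?P<last>\d+)\)"
)


def parse_stale_checkpoint(description):
    """Return node, age and epoch gap from a Stale Checkpoint text, or None."""
    m = STALE_RE.search(description)
    if m is None:
        return None
    age = int(m.group("age"))
    lge = int(m.group("lge"))
    last = int(m.group("last"))
    return {
        "node": m.group("node"),
        "age_seconds": age,
        "age_days": age / 86400,   # 86400 seconds per day
        "lge": lge,
        "last": last,
        "epoch_gap": last - lge,   # epochs between LGE and "Last"
    }


sample = ("Stale Checkpoint Node v_akabi_node0001 has data 1032429 seconds old "
          "which has not yet made it to disk. (LGE: 668639 Last: 694778)")
info = parse_stale_checkpoint(sample)
print(info["node"], round(info["age_days"], 1), info["epoch_gap"])
```

For the first row this gives an age of roughly 12 days and a gap of 26139 epochs between the LGE and the "Last" value, which would be easy to turn into an alert threshold.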

I tried a global moveout to flush every projection out of the WOS, but it didn't fix the issue.

So what should we do when this happens?

thanks in advance,

Regards,

Philippe

Post by **NorbertKrupa** » Thu Nov 06, 2014 2:17 pm

What hardware are you running & what version of Vertica?

Post by **Pmachinaud** » Thu Nov 06, 2014 2:23 pm

We are on 6.1.2-0 (planning to upgrade once the stale checkpoint issue is cleared).

On the hardware side:
Physical Dell hardware
32 GB RAM, 2 x Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz with hyper-threading.
RAID 10, 1.1 TB available

Post by **Pmachinaud** » Thu Nov 06, 2014 5:36 pm

Replying to myself, but I'm still looking for a solution to these issues.

Output of
select * from system;

current_epoch, ahm_epoch, last_good_epoch
697590, 668639, 668639

It seems that the AHM is not keeping up with the current epoch.
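To put a number on that gap, a quick Python sketch using only the three values from the system table above (the helper name `epoch_lag` is just for illustration):

```python
# Values copied from the "select * from system;" output in this post.
current_epoch = 697590
ahm_epoch = 668639
last_good_epoch = 668639


def epoch_lag(current, reference):
    """Number of epochs by which `reference` trails `current`."""
    return current - reference


ahm_lag = epoch_lag(current_epoch, ahm_epoch)
lge_lag = epoch_lag(current_epoch, last_good_epoch)
print(ahm_lag, lge_lag, ahm_epoch == last_good_epoch)
```

Both the AHM and the last good epoch sit at the same epoch, 28951 epochs behind the current one.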

I've executed select make_ahm_now();

I will then purge the old data (from previous epochs) with PURGE_TABLE(), hoping that it helps.

I hope someone can help me fix this issue.

Regards,

Philippe

Post by **JimKnicely** » Thu Nov 06, 2014 5:51 pm

Hi Pmachinaud,

How far behind is the AHM epoch? Can you run this?

select now(), ahm_epoch, epoch_close_time from system join epochs on epoch_number = ahm_epoch;

What is the value of your "HistoryRetentionTime" parameter?

select description, current_value, default_value from configuration_parameters where parameter_name = 'HistoryRetentionTime';

Thanks

Post by **Pmachinaud** » Fri Nov 07, 2014 9:37 am

Hi knicely87,

When I run
select now(), ahm_epoch, epoch_close_time from system join epochs on epoch_number = ahm_epoch;

The results :
2014-11-07 09:23:24 668639 2014-10-24 11:32:30
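To answer the "how far behind" question in wall-clock terms, a small Python sketch that subtracts the two timestamps from this result row. The timestamp format string is an assumption based on how the values are printed here.

```python
from datetime import datetime

# Assumed format, matching the timestamps as printed in the result row.
FMT = "%Y-%m-%d %H:%M:%S"

# Values copied from the query result in this post.
now = datetime.strptime("2014-11-07 09:23:24", FMT)
ahm_close = datetime.strptime("2014-10-24 11:32:30", FMT)

# timedelta between the query time and the AHM epoch's close time
lag = now - ahm_close
print(lag.days, lag.seconds, int(lag.total_seconds()))
```

So the AHM epoch closed 13 days and about 22 hours before the query was run (1201854 seconds).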

The value for HistoryRetentionTime is the default : 0
select description, current_value, default_value from configuration_parameters where parameter_name = 'HistoryRetentionTime';

Number of seconds of epochs kept in the epoch map (seconds) 0 0

And my select make_ahm_now(); is still running.

Thanks,

Post by **JimKnicely** » Fri Nov 07, 2014 2:31 pm

What is the size of the largest ROS container?

select node_name, anchor_table_schema, anchor_table_name, ros_used_bytes/1024^3 ros_used_gb from projection_storage order by ros_used_bytes desc limit 1;

Some advice I got from a friend:

You may have a "skewed" configuration.

For example, if a table is partitioned by month and the month "JAN" is too big (>2GB on a single node), "mergeout" will take far too long.

I think the solution is to limit the ROS container size
(I once had a db of 20PT in total, growing by 0.5PT each day, where a single ROS container took more than 10GB, so "moveout" and "mergeout" never finished).
