Hey guys!
I'm pretty new to Vertica; for the past two weeks my new job has consisted of maintaining our Vertica database.
We run in standalone mode (only one node).
The database is heavily loaded/updated/deleted every 5-15 minutes across multiple schemas, and heavily read during business hours.
Our Vertica server has 32 GB RAM and 2x Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz with hyperthreading.
Issue with ROS:
We frequently hit this error during heavy workloads:
Attempted to Create Too Many ROS Containers.
As a manual action we run: SELECT DO_TM_TASK('mergeout', 't1');
But the issue came back the next day.
As a workaround, we switched to the COPY LOCAL command, but the issue came back the next day.
As a further workaround, following https://my.vertica.com/docs/4.1/HTML/Master/14402.htm, we reduced MergeOutInterval and MoveOutInterval to 300 and 150 respectively,
increased the Tuple Mover threads from 3 to 5,
and raised the Tuple Mover memory size from 200 MB to 400 MB.
But the issue came back the next day.
So we reduced MergeOutInterval and MoveOutInterval further, to 150 and 70.
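For reference, the interval changes were applied roughly like this; the values are simply the ones we ended up trying, not recommended settings:

```sql
-- Lower the Tuple Mover intervals (seconds). These are the values we
-- tried as a workaround, not defaults or recommendations.
SELECT SET_CONFIG_PARAMETER('MoveOutInterval', 70);
SELECT SET_CONFIG_PARAMETER('MergeOutInterval', 150);

-- Verify the new values.
SELECT parameter_name, current_value
FROM   configuration_parameters
WHERE  parameter_name IN ('MoveOutInterval', 'MergeOutInterval');
```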
We are planning to double the RAM in this server.
Is there a real fix to avoid this issue?
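For anyone hitting the same error, a query along these lines (assuming the v_monitor.storage_containers layout in 6.x) should show which projections are accumulating the most containers:

```sql
-- Count ROS containers per projection to find the worst offenders.
SELECT node_name,
       schema_name,
       projection_name,
       COUNT(*) AS ros_container_count
FROM   v_monitor.storage_containers
WHERE  storage_type = 'ROS'
GROUP  BY node_name, schema_name, projection_name
ORDER  BY ros_container_count DESC
LIMIT  10;
```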
Issue with Stale Checkpoint:
While setting up monitoring on Vertica, I ran SELECT * FROM v_monitor.monitoring_events LIMIT 100;
and found this kind of error:
0 Warning 2014-11-05 09:26:45 2014-11-05 09:36:45 Stale Checkpoint Node v_akabi_node0001 has data 1032429 seconds old which has not yet made it to disk. (LGE: 668639 Last: 694778)
0 Warning 2014-11-05 09:21:45 2014-11-05 09:31:45 Stale Checkpoint Node v_akabi_node0001 has data 1032129 seconds old which has not yet made it to disk. (LGE: 668639 Last: 694757)
All I can find about this issue is that we should monitor it.
I tried a global moveout to flush every projection out of the WOS, but it didn't fix the issue.
What should we do when this happens?
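For completeness, the global moveout I tried looks like this; calling DO_TM_TASK without a table argument should apply to all projections, and the second query (assuming projection_storage exposes wos_used_bytes) shows what is still sitting in the WOS:

```sql
-- Move all WOS data to ROS (no table argument = every projection).
SELECT DO_TM_TASK('moveout');

-- Check what, if anything, is still in the WOS afterwards.
SELECT node_name, projection_name, wos_used_bytes
FROM   v_monitor.projection_storage
WHERE  wos_used_bytes > 0
ORDER  BY wos_used_bytes DESC;
```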
thanks in advance,
Regards,
Philippe
Stale checkpoint and too many ROS
Re: Stale checkpoint and too many ROS
What hardware are you running & what version of Vertica?
Check out vertica.tips for more Vertica resources.
Re: Stale checkpoint and too many ROS
We are on 6.1.2-0 (planning to upgrade once the stale checkpoint issue is resolved).
On the hardware side:
Physical Dell hardware
32 GB RAM, 2x Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz with hyperthreading
RAID 10, 1.1 TB available
Re: Stale checkpoint and too many ROS
I'm replying to myself, but I'm still looking for a solution to these issues.
Regarding
select * from system;
current_epoch: 697590, ahm_epoch: 668639, last_good_epoch: 668639
It seems the AHM is not in sync with the current epoch.
I've executed select make_ahm_now();
and will purge the old data (from previous epochs) with PURGE_TABLE(), hoping that it will help.
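The sequence I'm trying looks like this (the table name below is just a placeholder, not one of our actual tables):

```sql
-- Advance the Ancient History Mark to the current epoch...
SELECT MAKE_AHM_NOW();

-- ...then purge deleted rows older than the AHM from a table.
-- 'myschema.mytable' is a placeholder for our heavy-write tables.
SELECT PURGE_TABLE('myschema.mytable');
```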
I hope someone can help me fix this issue.
Regards,
Philippe
- JimKnicely
Re: Stale checkpoint and too many ROS
Hi Pmachinaud,
How far behind is the AHM epoch? Can you run this?
select now(), ahm_epoch, epoch_close_time from system join epochs on epoch_number = ahm_epoch;
What is the value of your "HistoryRetentionTime" parameter?
select description, current_value, default_value from configuration_parameters where parameter_name = 'HistoryRetentionTime';
Thanks
Jim Knicely
Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.
Re: Stale checkpoint and too many ROS
Hi knicely87,
When I run
select now(), ahm_epoch, epoch_close_time from system join epochs on epoch_number = ahm_epoch;
the result is:
2014-11-07 09:23:24 | 668639 | 2014-10-24 11:32:30
The value for HistoryRetentionTime is the default, 0:
select description, current_value, default_value from configuration_parameters where parameter_name = 'HistoryRetentionTime';
Number of seconds of epochs kept in the epoch map (seconds) | 0 | 0
And my select make_ahm_now(); is still running.
Thanks,
- JimKnicely
Re: Stale checkpoint and too many ROS
What is the size of the largest ROS container?
select node_name, anchor_table_schema, anchor_table_name, ros_used_bytes/1024^3 ros_used_gb from projection_storage order by ros_used_bytes desc limit 1;
Some advice I got from a friend:
You may have a skewed partition configuration.
For example, if a table is partitioned by month and the "JAN" partition is too big (>2 GB on a single node), mergeout will take far too long.
I think the solution is to limit the ROS container size
(I once had a 20 PB database ingesting 0.5 PB per day where a single ROS container exceeded 10 GB, so moveout and mergeout never finished).
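A quick way to check for that kind of skew (assuming the v_monitor.partitions system table exposes ros_size_bytes per partition key; the table name filter below is a placeholder) would be something like:

```sql
-- Total ROS bytes per partition key for one table's projections.
-- 'my_table' is a placeholder for the suspect table.
SELECT partition_key,
       SUM(ros_size_bytes)/1024^3 AS partition_gb
FROM   v_monitor.partitions
WHERE  projection_name ILIKE 'my_table%'
GROUP  BY partition_key
ORDER  BY partition_gb DESC;
```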
Jim Knicely
Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.