Stale checkpoint and too many ROS

Moderator: NorbertKrupa

Pmachinaud
Newbie
Posts: 8
Joined: Wed Nov 05, 2014 10:23 am

Re: Stale checkpoint and too many ROS

Post by Pmachinaud » Fri Nov 07, 2014 2:35 pm

Here is the result.

Node             | anchor_table_schema | anchor_table_name            | ros_used [GB]
v_akabi_node0001 | row                 | tbl_games_connection_history | 28.1562486328185

JimKnicely
Site Admin
Posts: 1825
Joined: Sat Jan 21, 2012 4:58 am

Re: Stale checkpoint and too many ROS

Post by JimKnicely » Fri Nov 07, 2014 2:40 pm

Is that table partitioned? You can query the v_catalog.partitions system table or run the following:

select export_tables('','row.tbl_games_connection_history');

Note: Not sure I got the schema name correct. Did you create a schema called "row"?
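If you prefer the system tables, here is a quick check as well (just a sketch, assuming the partitions system table exposes these columns in your version); no rows back means the table is not partitioned:

Code:

-- Hypothetical check: list partition keys and ROS sizes for the table's projections.
select partition_key, projection_name, ros_size_bytes
from partitions
where table_schema = 'row'
  and projection_name ilike 'tbl_games_connection_history%';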
Jim Knicely


Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.

Pmachinaud
Newbie
Posts: 8
Joined: Wed Nov 05, 2014 10:23 am

Re: Stale checkpoint and too many ROS

Post by Pmachinaud » Fri Nov 07, 2014 2:46 pm

Yeah, the DBA before me called this schema "row".

And this table is not partitioned.


CREATE TABLE "row".tbl_games_connection_history
(
connection_account_id int NOT NULL,
connection_game_id int NOT NULL,
connection_server_id int NOT NULL,
connection_date date NOT NULL,
connection_character_id int NOT NULL,
connection_ip int NOT NULL,
connection_subscribed boolean,
connection_count int NOT NULL,
connection_duration int NOT NULL
);

ALTER TABLE "row".tbl_games_connection_history ADD CONSTRAINT C_PRIMARY PRIMARY KEY (connection_account_id, connection_game_id, connection_server_id, connection_date, connection_ip);

I think we should partition by server id, since there are 62 ids and the rows are quite evenly distributed (only 5 ids have a low row count).

If I add partitioning, will it improve mergeout performance?

Have you experienced very long locks during partition maintenance?

We have 6 tables > 5 GB, including 3 > 10GB

JimKnicely
Site Admin
Posts: 1825
Joined: Sat Jan 21, 2012 4:58 am

Re: Stale checkpoint and too many ROS

Post by JimKnicely » Fri Nov 07, 2014 4:12 pm

Adding a partition does not create an exclusive lock on the table.

Example:

Code:

dbadmin=> drop table public.jim;
DROP TABLE
dbadmin=> create table public.jim (c1 int not null, c2 int);
CREATE TABLE
dbadmin=> alter table public.jim partition by c1;
ALTER TABLE
dbadmin=> select distinct object_name, mode, scope from dc_lock_attempts where session_id = current_session() and object_name ilike '%public.jim%' order by 1;
   object_name    | mode |    scope
------------------+------+-------------
 Table:public.jim | O    | TRANSACTION
 Table:public.jim | S    | TRANSACTION
(2 rows)
You won't be able to perform any DML on the table while the partitioning is occurring.

You could create a new table with the partitions defined, insert the data from the current table into the new table, drop the old table, and finally rename the new table to the old table name...
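For example, a sketch of that copy-and-swap approach for this table (the month-based partition expression is only an illustrative assumption; pick whatever expression fits your data):

Code:

-- Hypothetical new table with a partition clause added; column list copied
-- from the DDL posted above.
create table "row".tbl_games_connection_history_new
(
    connection_account_id int NOT NULL,
    connection_game_id int NOT NULL,
    connection_server_id int NOT NULL,
    connection_date date NOT NULL,
    connection_character_id int NOT NULL,
    connection_ip int NOT NULL,
    connection_subscribed boolean,
    connection_count int NOT NULL,
    connection_duration int NOT NULL
)
partition by extract(year from connection_date) * 100 + extract(month from connection_date);

-- (Re-create the primary key constraint on the new table as well.)
-- Copy the data straight to ROS, then swap the tables.
insert /*+ direct */ into "row".tbl_games_connection_history_new
select * from "row".tbl_games_connection_history;
commit;

drop table "row".tbl_games_connection_history;
alter table "row".tbl_games_connection_history_new rename to tbl_games_connection_history;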

Partitioning will help with the performance of the Tuple Mover, as you can specify how many partitions are active.

Also, take a look at the MaxMrgOutROSSizeMB parameter.

Code:

dbadmin=> select current_value, default_value, description from configuration_parameters where parameter_name = 'MaxMrgOutROSSizeMB';
 current_value | default_value |                    description
---------------+---------------+----------------------------------------------------
 2000          | -1            | The max ROS size in MB a merge out job can produce
(1 row)
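If you do decide to change it, the parameter can be set database-wide; a sketch (the 1000 MB value is just an example, not a recommendation):

Code:

-- Hypothetical example of adjusting the mergeout output ceiling.
select set_config_parameter('MaxMrgOutROSSizeMB', 1000);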
From the 6.1.3 Admin Guide:
The Tuple Mover has two different policies for managing these different partition workloads:
- Active partitions are loaded or modified frequently. The Tuple Mover uses a STRATA mergeout policy that keeps a collection of ROS container sizes to minimize the number of times any individual tuple is subjected to mergeout. The ActivePartitionCount parameter identifies how many partitions are being actively loaded.
- Inactive partitions are very infrequently loaded or modified. The Tuple Mover consolidates the ROS containers to a minimal set while avoiding merging containers whose size exceeds MaxMrgOutROSSizeMB.

Partitions are not explicitly marked by the user as active or inactive; instead, the Tuple Mover uses the following algorithm to order the partitions from oldest to newest:
- If one partition was created before the other partition, it is older.
- If two partitions were created at the same time, but one partition was last updated earlier than the other partition, it is older.
- If two partitions were created and last updated at the same time, the partition with the smaller key is considered older.

If you perform a manual mergeout using the DO_TM_TASK function, all partitions are consolidated into the smallest possible number of containers, regardless of the value of the ActivePartitionCount parameter.
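For reference, a manual mergeout of a single table can be kicked off with the DO_TM_TASK function mentioned above; a sketch against the table from this thread (quoting the "row" schema inside the string is an assumption, since row is a reserved word):

Code:

-- Force a mergeout of one table; per the note above, this consolidates its ROS
-- containers regardless of ActivePartitionCount.
select do_tm_task('mergeout', '"row".tbl_games_connection_history');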
Jim Knicely


Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.

scutter
Master
Posts: 302
Joined: Tue Aug 07, 2012 2:15 am

Re: Stale checkpoint and too many ROS

Post by scutter » Sun Nov 09, 2014 1:12 am

Hi Philippe,

I saw the thread over in the vertica community as well. I wanted to add a few comments and clarifications. I want to be sure that you understand the factors that are contributing to the issue.

- Stale checkpoints are bad. All production environments should monitor for them (a sample monitoring query appears after this list). A “stale checkpoint” means that data is “stuck” in the WOS and is not getting successfully moved out to ROS by the Tuple Mover. The reason it’s bad: if it happens on multiple nodes in a “bad” combination (in your case, the single node is enough) and the cluster then shuts down abnormally without having moved out the data, the cluster will roll back to the point when that data was loaded into the WOS. That could be weeks of data loss if the stale checkpoint goes unresolved.

- The long-running mergeout in Replay Delete suggests that the underlying projection is either not optimized at all or is poorly optimized. A long Replay Delete means that there isn’t a meaningful set of RLE columns in the projection’s ORDER BY, so “replaying” deleted records into the new ROS container being created by mergeout is very inefficient. **Fix the projection definition**.

- Purging in this state is not necessarily helpful since AHM is stuck back, and AHM controls which records can be purged.

- Sometimes you can, as Support had suggested, kill a running mergeout so that moveout can run, and then AHM can advance, and then the next mergeout execution would also purge records since the deleted records would have been deleted prior to the current AHM.

- Is your long-running moveout also stuck in replay delete? If it is, then you may be best off just creating a new copy of this table and dropping the current one to get past the issue.

- The table isn’t partitioned, so the issue is either due to too many loads since the mergeouts got stuck, or too many “delete vectors” that haven’t been merged/purged.

- Partitioning the table as Jim suggested is definitely a good idea. Partitioning should be done on a time-based expression typically. Partitioning has a benefit of resulting in smaller units of work for the Tuple Mover since mergeouts will only be as large as a partition's size.

- You should revert the TM configuration changes that you made back to the defaults, since the underlying issue is the stuck mergeout, not the TM being unable to keep up.

- 7.x includes an alternative replay delete algorithm that avoids issues like this.
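Here is the sample monitoring query mentioned in the stale-checkpoint point above, as a minimal sketch (assuming your version’s v_monitor.system table exposes these columns): a last_good_epoch that stays far behind current_epoch, or a wos_used_bytes figure that never drains, is a sign that moveout is stuck.

Code:

-- Epoch and WOS overview for the cluster; watch the gap between current_epoch
-- and last_good_epoch over time.
select current_epoch, ahm_epoch, last_good_epoch, wos_used_bytes, ros_used_bytes
from v_monitor.system;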

—Sharon
Sharon Cutter
Vertica Consultant, Zazz Technologies LLC

Pmachinaud
Newbie
Posts: 8
Joined: Wed Nov 05, 2014 10:23 am

Re: Stale checkpoint and too many ROS

Post by Pmachinaud » Mon Nov 10, 2014 10:40 am

Thank you guys for all the feedback, I really appreciate it.

- Regarding MaxMrgOutROSSizeMB, it's at the default value.
We are now running with 32 GB of RAM, and when I run:
select max((ros_used_bytes/ros_count)/1024^2) from v_monitor.projection_storage;
I get a maximum of 933.859687154943 MB used in one ROS.
Should I decrease MaxMrgOutROSSizeMB to 400 MB?

- Stale checkpoints: indeed, we are in a bad situation because we are running on a single node only :/

- The long-running mergeout in Replay Delete: yes, the database has never been tuned (I started this new job 3 weeks ago).
So I think I'll have to create projections in order to optimize updates/deletes.

- Purging: yes, I came to the same conclusion as you; only by advancing the AHM to now will purging be effective.

- Long-running operations:
when I run select now(), * from vs_tuple_mover_operations; I sometimes see operations older than 30 hours, but that's the maximum I find. I think it's more "very slow" than "stuck"; do you think it's stuck? (I put a duration query sketch at the end of this post.)

- Mergeout: as said before, a mergeout can wait more than 30 hours before being applied (this morning I can still see 2 mergeouts in that state).
eg:
Table: tbl_tableau_source_cam_full_query
Projection: tbl_tableau_source_cam_full_query_super
earliest_container_start_epoch: 672504
latest_container_start_epoch: 700871
ros_count: 32
total_ros_used_bytes: 7978312080
plan_type: Replay Delete
session_id: ankmdb29-27798:0xf9eeb

- Moveout: I think there is a stuck moveout, but I need to be sure.

What I find :

operation_start_timestamp node_name operation_name operation_status table_schema table_name projection_name projection_id column_id earliest_container_start_epoch latest_container_end_epoch ros_count total_ros_used_bytes plan_type session_id is_executing runtime_priority

2014-11-08 08:39:35 v_akabi_node0001 Moveout Change plan type to Replay Delete tableau tbl_cam_retention tbl_cam_retention_super 45 035 996 361 100 542 0 0 700 882 3 224 051 200 Replay Delete ankmdb29-27798:0xedacb false MEDIUM

2014-11-08 08:39:34 v_akabi_node0001 Moveout Start tableau tbl_cam_retention tbl_cam_retention_super 45 035 996 361 100 542 0 0 700 882 3 224 051 200 Moveout ankmdb29-27798:0xedacb false MEDIUM

2014-11-08 08:38:19 v_akabi_node0001 Moveout Abort tableau tbl_cam_retention tbl_cam_retention_super 45 035 996 361 100 542 0 0 700 311 3 224 051 200 Replay Delete ankmdb29-27798:0xe8869 false MEDIUM
---
Moveout Change plan type to Replay Delete
Moveout Start
---
I never see a "Complete" state for some moveouts, such as the one on "tbl_cam_retention"; I think they are stuck.

- Partitioning:
Why should the partitioning be on a time-based expression? I thought a partition key with a well-balanced number of rows per partition was a good option.
For it to be effective, the developers' queries must include a WHERE clause on the partition key, am I right?

- As suggested, I reverted the MoveOutInterval and MergeOutInterval values to their defaults; should I also set the number of TM threads back to the default (it was increased to 5)?

- We plan to migrate to 7.x once the database is consistent and all the data is on disk, so that we don't lose any data.
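Here is the duration query sketch I mentioned above (I'm querying vs_tuple_mover_operations as before; the column names are the ones shown in my output, so adjust if your version differs):

Code:

-- How long each Tuple Mover operation has been running, longest first
-- (now() is the reference time here).
select now() - operation_start_timestamp as running_for,
       node_name, operation_name, operation_status,
       table_schema, table_name, is_executing
from vs_tuple_mover_operations
order by running_for desc;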

Regards,

NorbertKrupa
GURU
Posts: 527
Joined: Tue Oct 22, 2013 9:36 pm
Location: Chicago, IL

Re: Stale checkpoint and too many ROS

Post by NorbertKrupa » Mon Nov 10, 2014 3:24 pm

Pmachinaud wrote: - Long-running operations: when I run select now(), * from vs_tuple_mover_operations; I sometimes see operations older than 30 hours, but that's the maximum I find. I think it's more "very slow" than "stuck"; do you think it's stuck?
Be aware that you're returning the timestamp from the start of the current transaction with NOW(), not the actual current time.
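A quick way to see the difference (a minimal sketch; clock_timestamp() returns the actual wall-clock time, so inside a transaction it keeps advancing while now() stays fixed at the transaction start):

Code:

-- Compare the frozen transaction timestamp with the live clock.
select now() as txn_start_time, clock_timestamp() as wall_clock_time;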
Check out vertica.tips for more Vertica resources.
