Stale checkpoint and too many ROS

Moderator: NorbertKrupa

scutter
Master
Posts: 302
Joined: Tue Aug 07, 2012 2:15 am

Re: Stale checkpoint and too many ROS

Post by scutter » Tue Nov 11, 2014 4:06 pm

I don't see a need to change MaxMrgOutROSSizeMB.

You're correct - not really "stuck" but instead "too slow to be practical" :-)

Partitioning is partly for data retention policies so that old data can be easily dropped, partly for "partition pruning" for faster query execution, and partly so that the Tuple Mover has smaller units of work when it's doing mergeouts. The TM has a concept of "current partitions" and uses a different mergeout algorithm for the most recent partition versus "past partitions". If you were using a non-time-based partition expression, you'd always be loading into all partitions, which would cause lots of extra mergeout work.

Also don't confuse partitioning with segmentation. You want your segmentation expression to provide a balanced distribution of data across the nodes. Even distribution isn't important for partitioning.
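To illustrate the distinction, here is a minimal sketch of a table that uses a time-based partition expression (for retention and pruning) together with a hash segmentation clause (for even distribution across nodes). The table and column names are hypothetical, not from this thread:

```sql
-- Illustrative table: segmented by hash of the account id so data
-- spreads evenly across nodes; partitioned by month so old data can
-- be dropped cheaply and the Tuple Mover works on small units.
CREATE TABLE tbl_account_activity (
    account_id  INT NOT NULL,
    activity_ts TIMESTAMP NOT NULL,
    detail      VARCHAR(200)
)
SEGMENTED BY HASH(account_id) ALL NODES
PARTITION BY EXTRACT(YEAR FROM activity_ts) * 100
           + EXTRACT(MONTH FROM activity_ts);

-- An old month can then be removed as a single cheap operation:
SELECT DROP_PARTITION('tbl_account_activity', 201310);
```

Note how the partition expression is time-based, so loads land mostly in the current partition, while the segmentation expression hashes a high-cardinality key for balance.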

I still suggest that you create a new copy of this problematic table, interrupt all of the TM operations on the current table, and drop the existing table so that you can get past this issue. Do you have enough disk space to do that?
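A sketch of that copy-and-swap approach, using illustrative table names (note that CREATE TABLE ... AS SELECT copies only the data, so any extra projections or constraints on the original would need to be recreated):

```sql
-- Copy the data into a fresh table (names are illustrative).
CREATE TABLE tbl_account_new AS SELECT * FROM tbl_account;

-- After verifying the copy, drop the problematic original
-- and rename the copy into its place.
DROP TABLE tbl_account CASCADE;
ALTER TABLE tbl_account_new RENAME TO tbl_account;
```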

--Sharon
Sharon Cutter
Vertica Consultant, Zazz Technologies LLC

Pmachinaud
Newbie
Posts: 8
Joined: Wed Nov 05, 2014 10:23 am

Re: Stale checkpoint and too many ROS

Post by Pmachinaud » Wed Nov 12, 2014 9:50 am

Thanks Sharon for the help :)

I've been in a meeting with the data team and the developer team.
They told me that the data can be restored, since the "raw" data still exists in production.
Since there are a lot of tables with the same problem, making new copies of all the problematic tables could take too much time.

So I think we can afford to lose this data and rebuild it from the "raw" data in production (I will double-check with my boss).

I think I need to run maintenance on this server in 3 steps:

1/
prereq :
All analytics + ETL jobs should be down during the maintenance.
---
a. Back up the database.
b. Shut down Vertica.
c. Upgrade the server OS (Debian).
d. Upgrade Vertica to the latest 7.x.
e. Install the Management Console on another server (can I install it in a VM?), because there isn't a Management Console in production yet.
f. Launch a big PURGE_TABLE('').
g. Launch a big ANALYZE_STATISTICS('').
h. Use the Database Designer with the top 1000 longest-running SQL queries, in order to create the appropriate projections.
i. Improve the monitoring.

2/
Analyze the partitioning that could be set on the top 3 heaviest tables.
Launch the partitioning phase.

3/
Install in 3-node mode in order to have availability + scalability.
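For reference, steps f and g above are SQL functions. A minimal sketch, assuming the standard Vertica calls (the table name is illustrative; note that ANALYZE_STATISTICS('') covers all tables, while PURGE_TABLE() needs a table name and PURGE() with no argument purges everything):

```sql
-- Purge deleted rows across the whole database...
SELECT PURGE();

-- ...or for one table at a time:
SELECT PURGE_TABLE('public.tbl_account');

-- Gather statistics for every table ('' means all objects).
SELECT ANALYZE_STATISTICS('');
```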

-----

Regarding the partitioning, I think it won't be an easy part:

The top 3 heaviest tables don't work in "archive mode": every row can be updated.
E.g.: the table with all the activities of an account (tbl_account). A very old account can log in to our production system, so we update the last-login-time field in production, and the ETL tool then updates the corresponding row in the Vertica database.

What do you think about this plan?
Regarding the partitioning, and the way we update old rows frequently, is there a solution that can reduce the workload?

Thanks,

Philippe.

scutter
Master
Posts: 302
Joined: Tue Aug 07, 2012 2:15 am

Re: Stale checkpoint and too many ROS

Post by scutter » Wed Nov 12, 2014 6:56 pm

If you are going to recreate all of the tables, why go through the PURGE_TABLE()? Can you just drop the current tables and recreate them? Rather than recreating them from the raw data you also have the option of creating copies of the tables within vertica and then dropping the original ones, if you have the disk space.

When you run Database Designer, I would limit the query set to a smaller, more focused set of sample queries than 1000 queries. You could also do a comprehensive design without a query set for starters; some projection optimization is better than none.

Partitioning and updating/deleting older records. That’s not a problem. If you do a lot of deletes or updates, then you should have a cron job set up to execute purge_table() or purge_partition() so that deleted records don’t impact performance.
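A minimal sketch of the periodic purge described above, to be scheduled from cron via vsql (the table name and partition key are assumptions, not from this thread):

```sql
-- Reclaim space from deleted/updated rows in one table...
SELECT PURGE_TABLE('public.tbl_account');

-- ...or target a single partition, e.g. the month key 201410:
SELECT PURGE_PARTITION('public.tbl_account', 201410);
```

Running one of these nightly keeps the delete-vector overhead from accumulating on frequently updated tables.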

—Sharon
Sharon Cutter
Vertica Consultant, Zazz Technologies LLC

Post Reply

Return to “New to Vertica Database Administration”