Factors impacting performance of Vertica database

Moderator: NorbertKrupa

mbalajiwale
Newbie
Posts: 14
Joined: Mon Aug 06, 2012 7:56 am

Factors impacting performance of Vertica database

Post by mbalajiwale » Mon Aug 06, 2012 8:32 am

Hello,

I am new to Vertica. Our requirement is to develop MicroStrategy reports and dashboards that point to Vertica (a single node in the dev environment, then multiple nodes in QA). Our goal is to build reports and dashboards with very good performance, so we need to identify all the Vertica-related factors that could result in slow performance. A few things we found by googling are below:

1. Projections: How should projections be defined? Are there any best practices for arriving at a well-optimized model? Any golden rules, tips, or tricks?

2. Is there any guidance on what should be preferred to improve performance, e.g. joins versus sub-queries? Similarly for the other clauses used in a query.

3. Anything related to hardware configuration?

These are just the very high-level points we have found so far. We would like someone to guide us on all the possible factors that hamper the performance of Vertica and, in turn, the performance of the MicroStrategy reports and dashboards that point to it.

Thanks in advance!

id10t
GURU
Posts: 732
Joined: Mon Apr 16, 2012 2:44 pm

Re: Factors impacting performance of Vertica database

Post by id10t » Mon Aug 06, 2012 10:04 am

Hi!

There are no "golden rules/bullets".

1. It depends on the data.

2. No "golden rule", but try to avoid constraints (PK/FK).

3. Try to denormalize tables (fewer dimension tables = fewer joins). MSTR can build a hierarchy from a flat table.
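As a sketch of the denormalization idea (all table and column names here are hypothetical, not from the original posts), a flat fact table folds the small dimension attributes in directly, so MSTR can build its hierarchy without extra joins:

```sql
-- Hypothetical flat fact table: dimension attributes (region, product)
-- are stored inline instead of in separate dimension tables.
CREATE TABLE sales_flat (
    sale_ts   TIMESTAMP NOT NULL,
    region    VARCHAR(32),    -- would otherwise live in a region dim table
    product   VARCHAR(64),    -- would otherwise live in a product dim table
    quantity  INTEGER,
    amount    NUMERIC(12,2)
);
```

MSTR can then be pointed at this single table and derive its attribute hierarchy from the columns, with no join overhead at query time.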

mbalajiwale
Newbie
Posts: 14
Joined: Mon Aug 06, 2012 7:56 am

Re: Factors impacting performance of Vertica database

Post by mbalajiwale » Mon Aug 06, 2012 10:42 am

Thanks, SKwa, for your inputs.
1. When we say "it depends on the data", do we mean the volume of data and the distribution of data across each node and each storage location? Can you please shed some light on this?

2. Also, does aggregating data in Vertica hamper it in any way? Should we keep the atomic data (data at the lowest possible level), or should we keep it at a higher grain (the level at which reports display the data)?

E.g. the fact table has data at, say, timestamp level, but the report should display data at any of these levels, depending on the selection made by the user:
minute
hour
day
week
month
quarter
year

So should we let MicroStrategy handle all these aggregations, or should they be done in Vertica and pulled directly into the report, to get the fastest possible performance?

3. As far as I know, DELETE is not advisable in Vertica: it does not remove the data from physical storage but merely marks it as logically deleted. Hence it is recommended to get things right the first time in Vertica, so the need to revert changes never arises. But if such an action still has to be taken, how does it impact performance? Also, are there any other functions/actions that should not be performed?
Please correct me if this is not the case and my understanding is incorrect.

Thanks & Regards
mbalajiwale

Rick
Newbie
Posts: 23
Joined: Thu Jul 26, 2012 2:46 pm

Re: Factors impacting performance of Vertica database

Post by Rick » Mon Aug 06, 2012 11:11 am

It would be easier to explain if you could provide some sample data.

Also, using DELETE is fine; Vertica recommends that you purge the data once the deleted data exceeds 10% of the database storage size.
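For reference, the purge can be triggered manually. A sketch, assuming the standard Vertica purge functions and system tables (the table name is hypothetical):

```sql
-- How many rows are currently only marked as deleted (delete vectors)?
SELECT COUNT(*) FROM v_monitor.delete_vectors;

-- Physically remove the deleted rows from all projections of one table:
SELECT PURGE_TABLE('public.fact_sales');
```

Purging rewrites the affected storage containers, so it is typically scheduled during low-load windows rather than run after every delete.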

mbalajiwale
Newbie
Posts: 14
Joined: Mon Aug 06, 2012 7:56 am

Re: Factors impacting performance of Vertica database

Post by mbalajiwale » Mon Aug 06, 2012 12:24 pm

Thanks Rick!
I will try to get some sample data for your reference; for that I will first have to talk to the DB team lead here. Meanwhile, can you please tell me which parameters you would like to see in the data? If I have the info, I can provide it.

Thanks
mbalajiwale

Rick
Newbie
Posts: 23
Joined: Thu Jul 26, 2012 2:46 pm

Re: Factors impacting performance of Vertica database

Post by Rick » Mon Aug 06, 2012 12:42 pm

Just a few sample records. For example, some sample data we might take in is:
z_timestamp|datetime|moid|counter1|counter2|counter3
2012-01-01 12:00:00|2012-01-01 11:00:00|RNC-A|300|500|123
2012-01-01 12:00:00|2012-01-01 11:00:00|RNC-B|200|400|456
2012-01-01 12:00:00|2012-01-01 11:00:00|RNC-C|100|300|789
2012-01-01 13:00:00|2012-01-01 12:00:00|RNC-A|400|600|234
2012-01-01 13:00:00|2012-01-01 12:00:00|RNC-B|500|700|345

I think that if you provided sample data it would be much easier to give you advice based on your particular data. It would especially help with how to define projections, although Vertica generally does a decent job if you can give it some sample queries.

id10t
GURU
Posts: 732
Joined: Mon Apr 16, 2012 2:44 pm

Re: Factors impacting performance of Vertica database

Post by id10t » Mon Aug 06, 2012 7:34 pm

Hi!

1. "Depends on data" means it depends on the data's properties: cardinality, distribution of values, data types and, of course, volume. In general, a projection is how you keep/store your data: the order of the columns and their sorting. And one more important thing: QUERIES/ANALYSIS! If you do not query your data much, or the analysis is very simple, or the volume of data is small, you will prefer to optimize for storage or load. If you run a lot of queries, or the analysis is not simple, you will prefer performance optimization over storage/load. Sometimes load speed itself is what matters. So, as you see, there are no golden rules and there cannot be; best practice depends on the data plus the BI analysis.

It's not a rule, but in general: cardinality impacts the order of columns; queries impact the sorting of columns; volume impacts partitioning and segmentation; distribution impacts encoding and compression.
BTW: zvika wrote a wonderful article about Vertica Trace. Read it and I promise you will better understand what "best practice" means.
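To make those factors concrete, here is a sketch of a projection definition, assuming standard Vertica CREATE PROJECTION syntax (the table, column names, and choices are hypothetical, only illustrating the rules of thumb above):

```sql
-- Low-cardinality column first in the sort order, RLE-encoded;
-- sort order chosen to match the common query predicates;
-- segmentation spreads the volume evenly across nodes.
CREATE PROJECTION fact_sales_p1
(
    region  ENCODING RLE,   -- low cardinality: RLE compresses runs well
    sale_ts,
    amount
)
AS SELECT region, sale_ts, amount
   FROM fact_sales
   ORDER BY region, sale_ts         -- queries filter on region, then time
SEGMENTED BY HASH(sale_ts) ALL NODES;
```

The Database Designer can propose such projections automatically if you feed it sample queries, which is usually the safer starting point.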

2. This also depends on the project. MSTR is a very powerful BI tool, and it is optimized for Vertica; you will likely prefer to do aggregation in MSTR rather than in Vertica, because it provides a GUI plus drilling, slicing, dicing, and selectors in a NoSQL-like manner. But that costs performance, and you may need to keep some "temporary" aggregations in Vertica for performance.

For a hierarchy, each attribute should be a separate column; i.e., it is not enough to hold only the timestamp. You will need columns such as year, month, and day if that is your hierarchy.
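The hierarchy columns can be derived from the timestamp at load or query time; a sketch using standard Vertica date functions (table and column names hypothetical):

```sql
-- Expose each hierarchy level as its own column for MSTR attributes.
SELECT sale_ts,
       EXTRACT(YEAR  FROM sale_ts) AS sale_year,
       EXTRACT(MONTH FROM sale_ts) AS sale_month,
       EXTRACT(DAY   FROM sale_ts) AS sale_day,
       DATE_TRUNC('hour', sale_ts) AS sale_hour
FROM fact_sales;
```

Materializing these as real columns at load time (rather than computing them per query) also lets them participate in projection sort orders.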

3. Try to avoid UPDATEs/DELETEs! Their performance is very poor. The one piece of advice I can give you: partition, and partition correctly (it depends on your delete conditions). Also, the 10% Rick mentioned is not a hard rule; I suggest you find the maximum workable number of rows for deletion by testing. Once I tried to delete 200M rows from 3TB of data (over 70 billion rows), which is much less than 10% (2×10^8 / 7×10^10 < 1%). It didn't finish within an hour!
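The partitioning pattern referred to above can be sketched as follows, assuming Vertica's PARTITION BY clause and DROP_PARTITION function (table name and partition scheme are hypothetical):

```sql
-- Partition the fact table by year-month so old data can be dropped
-- as whole partitions instead of via row-level DELETE.
CREATE TABLE fact_sales (
    sale_ts  TIMESTAMP NOT NULL,
    region   VARCHAR(32),
    amount   NUMERIC(12,2)
)
PARTITION BY EXTRACT(YEAR FROM sale_ts) * 100 + EXTRACT(MONTH FROM sale_ts);

-- Dropping January 2012 removes the storage containers directly,
-- with none of the delete-vector overhead of DELETE:
SELECT DROP_PARTITION('public.fact_sales', 201201);
```

This only helps when the delete condition aligns with the partition key, which is why the partitioning scheme has to be chosen around how data is retired.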


PS: Rick is right: if you provide us with a data sample and queries, we may be able to help; otherwise it's all abstraction. "Best" means the "only", and the "only" means already implemented inside ;-)


Return to “Vertica Performance Tuning”