Page 1 of 1

How to segment a projection

Posted: Fri Apr 11, 2014 7:55 am
by sarah
Hi all,

I've found a lot of info in the admin guide that explains how to set up the ordering of columns in a projection, but not too much info about what columns to segment on. It it best just to segment by the primary key or the column with the highest cardinality? I think that is correct.... But if do segment by PK I get a lot of rows in the query_events table that have a description of "Many rows were resegmented during plan execution." I assume hat's bad...

:?

Re: How to segment a projection

Posted: Fri Apr 11, 2014 12:54 pm
by NorbertKrupa
Technically, the primary key should have the highest cardinality because it has the most unique values. Vertica suggests to use the PK in the documentation. Otherwise, you find the highest cardinality column(s) and use those.

Re: How to segment a projection

Posted: Fri Apr 11, 2014 1:43 pm
by sarah
#norbertk,

Thanks for your response! That's what I figured :)

Re: How to segment a projection

Posted: Fri Apr 11, 2014 2:02 pm
by scutter
Sarah,

It’s a really good thing to understand how projections are optimized - what’s a good sort order, segmentation, etc. But in the end, using Database Designer to design projections is preferred at least as a first step. Have you tried using Designer?

—Sharon

Re: How to segment a projection

Posted: Sat Apr 12, 2014 6:42 pm
by cbennett
Not a super-easy thing to answer, since there's a lot that goes into this, but in general, the above is all accurate. But even segmenting on a very cardinal PK might not be optimum, depending on what you're trying to accomplish. For example, even having perfect distribution of data through a PK segmentation, you could still get a lot of resegmentation if your query is trying to GROUP BY some logical value, like country_code. If GROUP BY performance is what you're after, you could try creating a projection that's segmented on the logical column instead. That creates local group bys, but beware - it could also create a lot of data skew in your cluster. Tread carefully.