Migrating semi-structured data
Posted: Thu Nov 20, 2014 2:22 am
Hi all,
I'm trying to import data into Vertica in a format that's not ideally suited for the task, and I wanted to get opinions from experts about how I could do it better. Please let me know if this is not the right forum to post to or if my question is too vague.
Simplifying a bit for the purposes of this question, let's say the data comes from a log file and each line looks something like this:
timestamp,user_id,event_name,unstructured_data
where unstructured_data is key-value pairs, separated by pipes (|).
For example, a row may be:
1416445888,12345,someEvent,key1=val1|key2=val2|key3=val3
Let's also say that after the data is imported, I'll want to run queries that involve some of the keys from the unstructured data.
My first attempt was to import the data as-is, with a column for each comma-separated field and all the unstructured data in a single column. Then I wrote a UDF to retrieve a value from that column given a key. This works - as in, it produces correct results - but it can be really slow when I'm parsing a lot of large unstructured_data values. I found that if I extract those key-value pairs into their own columns, my queries complete orders of magnitude faster.
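For reference, the lookup my UDF performs boils down to the following (a plain-Python sketch of the logic only, not the actual Vertica UDF code - the function name is just a placeholder):

```python
def lookup(unstructured_data, key):
    """Return the value for `key` in a pipe-delimited key=value string,
    or None if the key is absent. Note it rescans the whole string on
    every call, for every row."""
    for pair in unstructured_data.split("|"):
        k, sep, v = pair.partition("=")
        if sep and k == key:
            return v
    return None

# Pulling key2 out of the sample row's last field:
print(lookup("key1=val1|key2=val2|key3=val3", "key2"))  # val2
```

My suspicion is that this per-row, per-key rescan is exactly what makes the query slow at scale, independent of how the UDF itself is implemented.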
Now my questions are:
1) Is my assumption correct that parsing the key-value pairs out of a large string in a query that involves a lot of rows (say, a few hundred million) will always be slow, or should I be profiling my UDF some more?
2) Is there a recommended way of loading data in a format like this? I'm thinking of a scheme where I'd have my main table with a lot of extra columns (say, C0 through CN) and a schema table that maps those extra columns to keys in the unstructured data. It seems like I may have to preprocess the data into a format that's easier to import. Are there any facilities in Vertica that may help with this? Flex tables looked relevant, but it may be better if I do the parsing myself to keep the schemas in sync, since the keys for a given event could change over time.
3) Is there anything I'm missing? I'm brand new to Vertica, but I'm extremely impressed by what I've seen so far.
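To make question 2 concrete, here's a rough Python sketch of the preprocessing step I have in mind; the column layout and the cap on extra columns are just assumptions for illustration, not anything Vertica-specific:

```python
def flatten(lines, max_extra_cols=8):
    """Turn raw log lines into wide rows (timestamp, user_id, event_name,
    C0..CN) plus a mapping that records which key each extra column holds.
    The mapping is what I'd store in the separate schema table."""
    key_to_col = {}  # e.g. {"key1": 0} means key1 lives in column C0
    rows = []
    for line in lines:
        timestamp, user_id, event_name, unstructured = line.rstrip("\n").split(",", 3)
        extras = [None] * max_extra_cols
        for pair in unstructured.split("|"):
            key, sep, val = pair.partition("=")
            if not sep:
                continue  # skip malformed fragments without an '='
            if key not in key_to_col:
                if len(key_to_col) >= max_extra_cols:
                    raise ValueError("ran out of extra columns")
                key_to_col[key] = len(key_to_col)  # assign the next free column
            extras[key_to_col[key]] = val
        rows.append([timestamp, user_id, event_name] + extras)
    return rows, key_to_col

rows, schema = flatten(["1416445888,12345,someEvent,key1=val1|key2=val2|key3=val3"])
print(schema)  # {'key1': 0, 'key2': 1, 'key3': 2}
```

The part I'm unsure about is what happens when the set of keys drifts over time, which is why a managed facility like flex tables is tempting despite the schema-sync concern above.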
Thanks in advance for the help / advice!
Alex