We have the case that Json is pushed from outer system to our Vertica database into tables with longvarchar columns. We do not have any possibility to change this process currently.
In general the Jsons are simple linear ones, so we could in theory just parse those with MAPLOOKUP/MapJSONExtractor like that:
Code: Select all
SELECT
public.MAPLOOKUP(public.MapJSONExtractor(xJSON), 'id') AS id
, public.MAPLOOKUP(public.MapJSONExtractor(xJSON), 'value') AS value
FROM /*EXAMPLE_DATA*/
(
SELECT
'{"id":1,"value":1000}' AS xJSON
UNION ALL
SELECT
'{"id":2,"value":2000}' AS xJSON ) EXAMPLE_DATA
WHERE
1 = 1
/*
id value
1 1000
2 2000
*/
The throughput we get:
1. Select the data out to our ETL tool (Pentaho Data Integration) and parse the JSON there and write back to database 10million records in 4 minutes.
2. Taking some risks (casting it to varchar so string functions can be applied, and then parsing the JSON with string functions like SPLIT_PART etc) then we get 10 million rows parsed in 2 minutes.
3. With MapJSONExtractor the 10 million rows take 6 minutes to be parsed.
So with billions of rows the difference is significant.
We do not like to take the risks of handling the JSON as string and parsing with string functions, but as it is so much faster then we really do not see an better option at the moment than to go with option 2. string parsing. The same example than above with SPLIT_PART:
Code: Select all
SELECT
SPLIT_PART(SPLIT_PART(xJSON, '"id":',2),',',1) AS id
, SPLIT_PART(SPLIT_PART(SPLIT_PART(xJSON, '"value":',2),',',1),'}',1) AS value
FROM /*EXAMPLE_DATA*/
(
SELECT
'{"id":1,"value":1000}' AS xJSON
UNION ALL
SELECT
'{"id":2,"value":2000}' AS xJSON ) EXAMPLE_DATA
WHERE
1 = 1
/*
id value
1 1000
2 2000
*/
Br,
pj