Hi,
Dealing with IDENTITY columns is a pain.
IDENTITY columns start at 1 by default. You are probably seeing such high numbers for the IDENTITY because each node caches 25,000 values by default.
One solution for your question could be to alter the sequence that the IDENTITY is using and make it restart at 1. But, as the following example shows, that wont work!
Example:
dbadmin=> CREATE TABLE pepsi (c1 IDENTITY, c2 VARCHAR(10));
CREATE TABLE
dbadmin=> \ds *pepsi*
List of Sequences
Schema | Sequence | CurrentValue | IncrementBy | Minimum | Maximum | AllowCycle | Comment
--------+--------------+--------------+-------------+---------+---------------------+------------+---------
public | pepsi_c1_seq | 0 | 1 | 1 | 9223372036854775807 | f |
(1 row)
dbadmin=> INSERT INTO pepsi (c2) VALUES ('TEST1');
OUTPUT
--------
1
(1 row)
dbadmin=> SELECT * FROM pepsi;
c1 | c2
----+-------
1 | TEST1
(1 row)
dbadmin=> ALTER SEQUENCE pepsi_c1_seq RESTART 1;
ALTER SEQUENCE
dbadmin=> INSERT INTO pepsi (c2) VALUES ('TEST2);
OUTPUT
--------
1
(1 row)
dbadmin=> SELECT * FROM pepsi;
c1 | c2
----+-------
1 | TEST1
2 | TEST2
(2 rows)
dbadmin=> TRUNCATE TABLE pepsi;
TRUNCATE TABLE
dbadmin=> ALTER SEQUENCE pepsi_c1_seq RESTART 1;
ALTER SEQUENCE
dbadmin=> INSERT INTO pepsi (c2) VALUES ('TEST1');
OUTPUT
--------
1
(1 row)
dbadmin=> SELECT * FROM pepsi;
c1 | c2
----+-------
3 | TEST1
(1 row)
If the new tables in Vertica require IDENTITY columns, below is how I handle keeping source IDENTITY values in Vertica, including gaps...
Note that this code works for preserving identity values, but only on a single node. The reason is because in a multi-node cluster each node is providing a cached IDENTITY value so the data can appear out of order when inserted on multiple nodes, even if each node only caches one value.
Here is an example on a single node:
dbadmin=> SELECT COUNT(*) FROM nodes;
COUNT
-------
1
(1 row)
dbadmin=> CREATE TABLE t1_temp (c1 INT, c2 VARCHAR); -- Load this table with client data
CREATE TABLE
dbadmin=> CREATE TABLE t1 (c1 IDENTITY, c2 VARCHAR); -- This is the new Vertica table
CREATE TABLE
dbadmin=> -- Fake data
dbadmin=> INSERT INTO t1_temp select 1, 'val1';
OUTPUT
--------
1
(1 row)
dbadmin=> INSERT INTO t1_temp select 2, 'val2';
OUTPUT
--------
1
(1 row)
dbadmin=> INSERT INTO t1_temp select 3, 'val3';
OUTPUT
--------
1
(1 row)
dbadmin=> INSERT INTO t1_temp select 5, 'val5';
OUTPUT
--------
1
(1 row)
dbadmin=> COMMIT;
COMMIT
dbadmin=> SELECT * FROM t1_temp ORDER BY 1;
c1 | c2
----+------
1 | val1
2 | val2
3 | val3
5 | val5
(4 rows)
dbadmin=> INSERT INTO t1 (c2)
dbadmin-> SELECT tab2.c2
dbadmin-> FROM (SELECT row_number() over () c1 FROM tables CROSS JOIN columns CROSS JOIN columns cols2) tab1
dbadmin-> LEFT JOIN (SELECT c1, c2 FROM t1_temp) tab2 ON tab2.c1 = tab1.c1
dbadmin-> WHERE tab1.c1 <= (SELECT MAX(c1) FROM t1_temp)
dbadmin-> ORDER BY tab1.c1;
OUTPUT
--------
5
(1 row)
dbadmin=> DELETE FROM t1 WHERE c2 IS NULL;
OUTPUT
--------
1
(1 row)
dbadmin=> SELECT * FROM t1 ORDER BY c1;
c1 | c2
----+------
1 | val1
2 | val2
3 | val3
5 | val5
(4 rows)
dbadmin=> DROP TABLE t1_temp;
DROP TABLE
Notice that the data in T1 now matches the original IDENTITY values. You can now INSERT okay using the IDENTITY:
dbadmin=> INSERT INTO t1 (c2) VALUES ('new-value');
OUTPUT
--------
1
(1 row)
dbadmin=> SELECT * FROM t1 ORDER BY c1;
c1 | c2
--------+-----------
1 | val1
2 | val2
3 | val3
5 | val5
250001 | new-value
(5 rows)
So if this is a new client, with three nodes for example, I can think of several options:
1. Create a three node cluster
- Create a database on only one of the nodes
- Load all of the data (including identities) using SQL like that was shown above
- Add the other two nodes to the database
- Alter the K-Safety to 1
- Rebalance the data
2. Create a three node cluster
- Create a database on only one of the nodes
- Create a database on all three nodes
- Lower the K-Safety to 1
- Create tables having all the projections on a single node
- Load all of the data (including identities) using SQL like that was shown above
- Alter the K-Safety to 1
- Rebalance the data
If the client doesn't care about using IDENTITY columns in Vertica, I think a a better solution is to use a sequence as a default value for an INT column.
Example Code:
dbadmin=> DROP TABLE IF EXISTS t1;
DROP TABLE
dbadmin=> CREATE TABLE t1 (c1 INT, c2 VARCHAR); -- Load this table with client data
CREATE TABLE
dbadmin=> -- Fake data load
dbadmin=> INSERT INTO t1 select 1, 'val1';
OUTPUT
--------
1
(1 row)
dbadmin=> INSERT INTO t1 select 2, 'val2';
OUTPUT
--------
1
(1 row)
dbadmin=> INSERT INTO t1 select 3, 'val3';
OUTPUT
--------
1
(1 row)
dbadmin=> INSERT INTO t1 select 5, 'val5';
OUTPUT
--------
1
(1 row)
dbadmin=> COMMIT;
COMMIT
dbadmin=> SELECT * FROM t1 ORDER BY 1;
c1 | c2
----+------
1 | val1
2 | val2
3 | val3
5 | val5
(4 rows)
dbadmin=> CREATE SEQUENCE t1_seq START WITH 6; -- Start with the max Identity value + 1
CREATE SEQUENCE
dbadmin=> ALTER TABLE t1 ALTER COLUMN c1 SET DEFAULT t1_seq.NEXTVAL;
ALTER TABLE
dbadmin=> INSERT INTO t1 (c2) SELECT 'new-value';
OUTPUT
--------
1
(1 row)
dbadmin=> SELECT * FROM t1 ORDER BY c1;
c1 | c2
----+-----------
1 | val1
2 | val2
3 | val3
5 | val5
6 | new-value
(5 rows)