How can I Improve performance when loading data by copy?

Moderator: NorbertKrupa

muaythai_duan
Newbie
Newbie
Posts: 18
Joined: Sun Aug 25, 2013 11:43 am

How can I Improve performance when loading data by copy?

Post by muaythai_duan » Mon Aug 26, 2013 2:47 am

hi:
everyone,I`m new user.Don`t joy me,if my question so easy.
my disk write rate is 54MB/s,I use cpoy from load date.The datefile is 487MB,row number is 10239857.I load date use 46343.161 ms.how can I Improve performance,thanks!
the node only-one,not cluster!

User avatar
nnani
Master
Master
Posts: 302
Joined: Fri Apr 13, 2012 6:28 am
Contact:

Re: how can I Improve performance when load date by cpoy

Post by nnani » Tue Aug 27, 2013 7:17 am

Hello muaythai_duan

Welcome to Forums :)

As it is a single node cluster, you cannot expect performance too be the best, but in your case, it is not so good .

Can you show us the sample flat file data.

A trick to increase the performance can be

Segment the flat file into some files and then use 2-3 copy statements parallel to load all files.
Beside this, you should take care the no other process are running while loading on your system,
Use of Direct keyword can also make a difference to loading cost.

Hope this helps...
nnani........
Long way to go

You can check out my blogs at vertica-howto

muaythai_duan
Newbie
Newbie
Posts: 18
Joined: Sun Aug 25, 2013 11:43 am

Re: how can I Improve performance when load date by cpoy

Post by muaythai_duan » Tue Aug 27, 2013 2:00 pm

thanks nnani!
I know it not possible for single node to expect performance too be the best.
my test flat file

id,time,name,values
00000001,2013-06-21 14:40:10 140,data0,-86
00000002,2013-06-21 14:40:10 140,data1,C0
00000003,2013-06-21 14:40:10 140,data2,0.6318668418734668
00000004,2013-06-21 14:40:10 140,data3,10001110
00000005,2013-06-21 14:40:10 140,data4,-123
00000006,2013-06-21 14:40:10 140,data5,AE
00000007,2013-06-21 14:40:10 140,data6,0.5959188381581663
00000008,2013-06-21 14:40:10 140,data7,10010111
00000009,2013-06-21 14:40:10 140,data8,-92

the file is about 480MB

"Segment the flat file into some files" means that use "head" command to segment file?for example "head -n number > testxxx.csv"
the opration system is redhat.

other question is that the time use "copy auto"(by wos) is faster than "copy direct"(by ros),why not used "copy auto"?

Very much looking forward to get your teaching!and I'm not good at english don't mind.

muaythai_duan
Newbie
Newbie
Posts: 18
Joined: Sun Aug 25, 2013 11:43 am

Re: how can I Improve performance when load date by cpoy

Post by muaythai_duan » Tue Aug 27, 2013 2:11 pm

DDL
-------------------------------------------------------------------
create table test
(
id numeric,
time varchar(28),
name varchar(20),
value varchar(20)
)
-------------------------------------------------------------------
the script
-------------------------------------------------------------------
\set t_pwd `pwd`
\set input_file '''':t_pwd'/*.csv'''
copy test from :input_file delimiter ',' direct

User avatar
nnani
Master
Master
Posts: 302
Joined: Fri Apr 13, 2012 6:28 am
Contact:

Re: how can I Improve performance when load date by cpoy

Post by nnani » Wed Aug 28, 2013 8:22 am

Hello muaythai_duan,

Segmenting the file - Distribute the data in file in equal chunks of data into two to three files, so after distributing, you can have 2-3 files for loading in place of one. As you have a 480 MB file, you can have 3 files of 160 MB each.

Once all these files are ready, you can fire three copy commands from your system parallel one for each file.

AUTO/DIRECT - The default load method is AUTO, in which COPY loads data into the WOS (Write Optimized Store) in memory. When the WOS is full, the load continues directly into ROS (Read Optimized Store) on disk.
To support very fast data load speeds, the WOS stores records without data compression or indexing. The WOS organizes data by epoch and holds both committed and uncommitted transaction data. But the WOS is recommended only for use when you are loading the file less then 100 MB,So in your case you can use the DIRECT keyword to load data into ROS directly.

Try this
Issue a single multi-node COPY command that loads different files from different nodes specifying the nodename option for each file. In your case, specify the nodename of your single node for each file

Code: Select all

\set t_pwd `pwd`
\set input_file '''':t_pwd'/*.csv'''
copy test from :input_file1 nodename, :input_file2 nodename, :input_file3 nodename delimiter ',' direct
Or Try this
Issue three copy commands with one flat file each

Code: Select all

\set t_pwd `pwd`
\set input_file '''':t_pwd'/*.csv'''
copy test from :input_file1 nodename delimiter ',' direct ;
copy test from :input_file1 nodename delimiter ',' direct ;
copy test from :input_file3 nodename delimiter ',' direct ;
If you still want to use WOS for loading try reducing your 1 file of 480 MB to 6 files of 80 MB and then try with COPY command, but still as you have one node you are limited with your threading options on running multiple COPY commands from single node.

Hope this helps ..
nnani........
Long way to go

You can check out my blogs at vertica-howto

muaythai_duan
Newbie
Newbie
Posts: 18
Joined: Sun Aug 25, 2013 11:43 am

Re: how can I Improve performance when load date by cpoy

Post by muaythai_duan » Wed Aug 28, 2013 9:34 am

thnak you very much!I had try.But I had a question.
script:

Code: Select all

\set t_pwd `pwd`
\set input_file '''':t_pwd'/*.csv'''
copy test from :input_file1 nodename delimiter ',' direct ;
copy test from :input_file1 nodename delimiter ',' direct ;
copy test from :input_file3 nodename delimiter ',' direct ;
the command is executed in turn,is not parallel.maybe I am wrong,I will try again tomorrow.
Last edited by muaythai_duan on Wed Aug 28, 2013 9:43 am, edited 1 time in total.

muaythai_duan
Newbie
Newbie
Posts: 18
Joined: Sun Aug 25, 2013 11:43 am

Re: how can I Improve performance when load date by cpoy

Post by muaythai_duan » Wed Aug 28, 2013 9:42 am

I segment the file to 4 files(120M),put the file(120M) to every node(4 nodes).

excuted command on every node,the same time!

Code: Select all

copy test from file(120M) delimiter ',' dircte;
Time is nearly 4 times.

Post Reply

Return to “Vertica Data Load”