Here is a link that discusses the performance of Hadoop vs. Pig vs. Vertica for counting triangles ([spoiler alert] Vertica wins!):
http://nosql.mypopescu.com/post/1072614 ... -triangles
Hadoop vs PIG vs Vertica for Counting Triangles
Jim Knicely
Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.
Re: Hadoop vs PIG vs Vertica for Counting Triangles
I'm trying this but for the moment have gotten stuck in the first (Hadoop) part.

I don't have a Linux server currently free, so what I did was add 8 GB to a powerful Windows server, bringing it to 16 GB. The processor is a Sandy Bridge 2600K running at 3.4 GHz. I installed VirtualBox, since it's free and has worked for me in the past, and into that installed openSUSE 12.1.

This worked for Vertica: I downloaded and installed the Community Edition and it runs fine. I built the VMart database and can run queries against it, added another database for my own testing, etc. It's not 'at scale', but everything I've tried so far has run fine. In the VM, that is (since Vertica runs only on Linux).

Then I came upon this example and downloaded the zip file from the GitHub site. My VM has 4 GB of memory allotted but only 30 GB of disk space, so I put the contents of the zip file in a 'shared' folder: a folder on the host (Windows) system that's visible from both the Windows host and the Linux guest, and thus has access to a 500 GB drive with about 200 GB free.

So I tried the Hadoop example last night. With the addition of some packages to openSUSE (ant, jdk, subversion), it successfully builds mr-graphs.jar and runs. Note that I'm using edges.txt, that is, the much larger data file containing some 86 million edges. You put it in the input directory and modify build.xml appropriately.

The whole Hadoop job consists of 3 constituent jobs. In my case, job 1 runs fine but I seem to get stuck in job 2. In particular, deep into job 2, I see messages of this kind run for hours (log excerpt below).

The problem is that the disk (the shared 500 GB disk) continually fills with intermediate files. I was afraid it was going to fill entirely, so I killed the job after 150 GB of Hadoop files and 8 hours of running time. After killing the job, the jobcache directory (listing below) contains 1521 directories at 48-49 MB apiece. That's a lot.

There are no visible errors. htop (a variant of top) tells me that Hadoop (the only thing running) takes between 50% and 90% of the CPU; memory never breaks 1 GB out of the 4 GB. With Hadoop using an 'external' drive, I/O would be the largest consideration, so I ran iostat, but it too looks OK: %iowait is never > 2%. TPS started off at about 10-11/sec but later in the job degraded by half to 5, which indicates something, but again I see no explicit error.

So strictly speaking this is a Hadoop question, but using code provided by Vertica.

Any help appreciated.
Code:
[exec] 12/10/02 23:38:03 INFO mapred.Merger: Down to the last merge-pass, with 8 segments left of total size: 49509881 bytes
[exec] 12/10/02 23:38:05 INFO mapred.LocalJobRunner:
[exec] 12/10/02 23:38:06 INFO mapred.TaskRunner: Task:attempt_local_0002_m_001517_0 is done. And is in the process of commiting
[exec] 12/10/02 23:38:06 INFO mapred.LocalJobRunner:
[exec] 12/10/02 23:38:06 INFO mapred.TaskRunner: Task 'attempt_local_0002_m_001517_0' done.
[exec] 12/10/02 23:38:06 INFO mapred.MapTask: io.sort.mb = 100
[exec] 12/10/02 23:38:06 INFO mapred.MapTask: data buffer = 79691776/99614720
[exec] 12/10/02 23:38:06 INFO mapred.MapTask: record buffer = 262144/327680
[exec] 12/10/02 23:38:07 INFO mapred.MapTask: Spilling map output: record full = true
[exec] 12/10/02 23:38:07 INFO mapred.MapTask: bufstart = 0; bufend = 5941019; bufvoid = 99614720
[exec] 12/10/02 23:38:07 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
[exec] 12/10/02 23:38:07 INFO mapred.MapTask: Finished spill 0
[exec] 12/10/02 23:38:07 INFO mapred.MapTask: Spilling map output: record full = true
[exec] 12/10/02 23:38:07 INFO mapred.MapTask: bufstart = 5941019; bufend = 12007923; bufvoid = 99614720
[exec] 12/10/02 23:38:07 INFO mapred.MapTask: kvstart = 262144; kvend = 196607; length = 327680
[exec] 12/10/02 23:38:08 INFO mapred.MapTask: Finished spill 1
[exec] 12/10/02 23:38:08 INFO mapred.MapTask: Spilling map output: record full = true
[exec] 12/10/02 23:38:08 INFO mapred.MapTask: bufstart = 12007923; bufend = 18187103; bufvoid = 99614720
[exec] 12/10/02 23:38:08 INFO mapred.MapTask: kvstart = 196607; kvend = 131070; length = 327680
[exec] 12/10/02 23:38:09 INFO mapred.MapTask: Finished spill 2
[exec] 12/10/02 23:38:09 INFO mapred.MapTask: Spilling map output: record full = true
[exec] 12/10/02 23:38:09 INFO mapred.MapTask: bufstart = 18187103; bufend = 24161342; bufvoid = 99614720
[exec] 12/10/02 23:38:09 INFO mapred.MapTask: kvstart = 131070; kvend = 65533; length = 327680
[exec] 12/10/02 23:38:10 INFO mapred.MapTask: Finished spill 3
[exec] 12/10/02 23:38:10 INFO mapred.MapTask: Spilling map output: record full = true
[exec] 12/10/02 23:38:10 INFO mapred.MapTask: bufstart = 24161342; bufend = 30086990; bufvoid = 99614720
[exec] 12/10/02 23:38:10 INFO mapred.MapTask: kvstart = 65533; kvend = 327677; length = 327680
[exec] 12/10/02 23:38:10 INFO mapred.MapTask: Finished spill 4
[exec] 12/10/02 23:38:11 INFO mapred.MapTask: Spilling map output: record full = true
[exec] 12/10/02 23:38:11 INFO mapred.MapTask: bufstart = 30086990; bufend = 36015090; bufvoid = 99614720
[exec] 12/10/02 23:38:11 INFO mapred.MapTask: kvstart = 327677; kvend = 262140; length = 327680
[exec] 12/10/02 23:38:11 INFO mapred.MapTask: Finished spill 5
[exec] 12/10/02 23:38:11 INFO mapred.MapTask: Spilling map output: record full = true
[exec] 12/10/02 23:38:11 INFO mapred.MapTask: bufstart = 36015090; bufend = 41953832; bufvoid = 99614720
[exec] 12/10/02 23:38:11 INFO mapred.MapTask: kvstart = 262140; kvend = 196603; length = 327680
[exec] 12/10/02 23:38:12 INFO mapred.MapTask: Finished spill 6
[exec] 12/10/02 23:38:12 INFO mapred.MapTask: Starting flush of map output
[exec] 12/10/02 23:38:12 INFO mapred.MapTask: Finished spill 7
[exec] 12/10/02 23:38:12 INFO mapred.Merger: Merging 8 sorted segments
[exec] 12/10/02 23:38:12 INFO mapred.Merger: Down to the last merge-pass, with 8 segments left of total size: 49489852 bytes
[exec] 12/10/02 23:38:12 INFO mapred.LocalJobRunner:
[exec] 12/10/02 23:38:15 INFO mapred.LocalJobRunner:
[exec] 12/10/02 23:38:16 INFO mapred.TaskRunner: Task:attempt_local_0002_m_001518_0 is done. And is in the process of commiting
[exec] 12/10/02 23:38:16 INFO mapred.LocalJobRunner:
[exec] 12/10/02 23:38:16 INFO mapred.TaskRunner: Task 'attempt_local_0002_m_001518_0' done.
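For what it's worth, the buffer numbers in those MapTask lines look internally consistent with Hadoop 1.x defaults, if I'm reading them right: io.sort.mb = 100 reserves a slice of the 100 MB for record accounting (io.sort.record.percent defaults to 0.05), leaving the 95 MiB data buffer shown, and both buffers spill at the default io.sort.spill.percent of 0.80. A quick arithmetic sanity check (this only explains the log lines, not the disk-fill problem):

```python
# Sanity-check the MapTask buffer numbers from the log against Hadoop 1.x
# defaults: io.sort.mb = 100, io.sort.record.percent = 0.05 (record
# accounting), io.sort.spill.percent = 0.80 (soft limit triggering a spill).
MIB = 1024 * 1024

io_sort_mb = 100
data_buffer_total   = 99614720   # from "data buffer = 79691776/99614720"
data_buffer_soft    = 79691776
record_buffer_total = 327680     # from "record buffer = 262144/327680"
record_buffer_soft  = 262144

# 95% of the 100 MB sort buffer holds serialized map output...
print(data_buffer_total == int(io_sort_mb * MIB * 0.95))   # True
# ...and both buffers hit their soft limit at 80% full.
print(data_buffer_soft / data_buffer_total)                # 0.8
print(record_buffer_soft / record_buffer_total)            # 0.8
```

So the frequent "Spilling map output: record full = true" messages are normal buffer-full spills, not errors; the question is why job 2 produces so much intermediate data.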
Code:
vertica-Graph-Analytics-Triangle-Counting-59aa09e/tmp/mapred/local/taskTracker/jobcache #
Code:
linux-33ql:/media/sf_shared/vertica-Graph-Analytics-Triangle-Counting-59aa09e/tmp/mapred/local/taskTracker/jobcache # la
total 788
drwxrwx--- 1 root vboxsf 0 Oct 2 18:31 .
drwxrwx--- 1 root vboxsf 0 Oct 2 16:39 ..
drwxrwx--- 1 root vboxsf 20480 Oct 2 16:43 job_local_0001
drwxrwx--- 1 root vboxsf 786432 Oct 2 23:38 job_local_0002
linux-33ql:/media/sf_shared/vertica-Graph-Analytics-Triangle-Counting-59aa09e/tmp/mapred/local/taskTracker/jobcache # ls -l *2 | wc -l
1521
linux-33ql:/media/sf_shared/vertica-Graph-Analytics-Triangle-Counting-59aa09e/tmp/mapred/local/taskTracker/jobcache #
Re: Hadoop vs PIG vs Vertica for Counting Triangles
So I skipped Hadoop for the moment (but still want to figure out why it didn't work) and jumped ahead to the Vertica part of the exercise.

From the output below, it appears to load the 86 million rows just fine. It then churns away for a while on the actual count (the query below), but then craps out on disk space. The disk-space failure seems to be on the order of 1.3 to 1.4 GB, yet I have over 9 GB free in that partition (df output below).

Maybe it's time to learn something about Vertica tuning.
Code:
select count(*)
from edges e1
join edges e2 on e1.dest = e2.source and e1.source < e2.source
join edges e3 on e2.dest = e3.source and e3.dest = e1.source and e2.source < e3.source;
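For anyone puzzling over what that query counts: assuming edges.txt stores each undirected edge as a directed pair in both directions, the two `<` predicates keep only the orientation with source ordering a < b < c, so each triangle is counted exactly once. A minimal sketch of the same join logic in Python, on hypothetical toy data (not the real edges.txt):

```python
# Toy undirected graph stored as directed pairs in both directions,
# mirroring the assumed layout of edges.txt (hypothetical sample data).
undirected = [(1, 2), (2, 3), (1, 3), (3, 4)]
edges = undirected + [(b, a) for (a, b) in undirected]
edge_set = set(edges)

def count_triangles(edges, edge_set):
    """Mirror the SQL self-join:
       e1.dest = e2.source AND e1.source < e2.source        (a < b)
       e2.dest = e3.source AND e3.dest = e1.source
                           AND e2.source < e3.source        (b < c)"""
    count = 0
    for a, b in edges:                                    # e1 = (a, b)
        if a >= b:
            continue
        for b2, c in edges:                               # e2 = (b, c)
            if b2 == b and b < c and (c, a) in edge_set:  # e3 = (c, a)
                count += 1
    return count

print(count_triangles(edges, edge_set))  # 1 -> the single triangle {1, 2, 3}
```

The nested scan is quadratic and purely illustrative; the SQL version leaves the join strategy to Vertica's optimizer.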
Code:
dbadmin=> \i triangle_counter.sql
CREATE TABLE
Timing is on.
Rows Loaded
-------------
86220856
(1 row)
Time: First fetch (1 row): 65494.149 ms. All rows formatted: 65494.184 ms
vsql:triangle_counter.sql:12: ERROR 2927: Could not write to [/home/dbadmin/vtest/v_vtest_node0001_data]: [Volume /home/dbadmin/vtest/v_vtest_node0001_data has 1452744704 bytes free (1348275442 unreserved). Minimum free space is 1347433260 (Temporary Data).]
Timing is off.
Code:
dbadmin@linux-33ql:~/devel> df
Filesystem 1K-blocks Used Available Use% Mounted on
rootfs 15480816 5278044 9416392 36% /
devtmpfs 2020164 36 2020128 1% /dev
tmpfs 2027524 356 2027168 1% /dev/shm
tmpfs 2027524 596 2026928 1% /run
/dev/sda2 15480816 5278044 9416392 36% /
tmpfs 2027524 0 2027524 0% /sys/fs/cgroup
tmpfs 2027524 0 2027524 0% /media
tmpfs 2027524 596 2026928 1% /var/run
tmpfs 2027524 596 2026928 1% /var/lock
/dev/sda3 13158528 2990268 9499844 24% /home
none 488282108 291376320 196905788 60% /media/sf_shared
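Putting units on those numbers (plain arithmetic; presumably the join's temporary files were consuming the space at the moment of the error and were freed before this df was taken, but that is a guess): the error reports about 1.35 GiB free on the data volume against a roughly 1.25 GiB temp-space minimum, while df shows about 9.06 GiB available on /dev/sda3 (/home), where the data directory lives.

```python
# Convert the byte counts in the ERROR 2927 message and the df output to GiB.
GIB = 1024 ** 3

bytes_free      = 1452744704   # "has 1452744704 bytes free"
temp_minimum    = 1347433260   # "Minimum free space is 1347433260 (Temporary Data)"
df_home_free_kb = 9499844      # df: Available (1K-blocks) on /dev/sda3 (/home)

print(round(bytes_free / GIB, 2))              # 1.35
print(round(temp_minimum / GIB, 2))            # 1.25
print(round(df_home_free_kb * 1024 / GIB, 2))  # 9.06
```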
Re: Hadoop vs PIG vs Vertica for Counting Triangles
Maybe you can't get there from here in the Community Edition.

http://stackoverflow.com/questions/1248 ... a-database

What I might want is to turn JOIN SPILL on; however, searching the online Vertica CE documentation, I find no mention of vertica_set_options.

What occurs to me (just now) is that I could trim edges.txt down to, say, 1/4 of its size and see if that runs.
Re: Hadoop vs PIG vs Vertica for Counting Triangles
Gee, trimming it down to 20 million edges actually worked.

So you get 13+ million triangles out of 20 million edges. That seems not unreasonable; for a sparse graph like this you wouldn't expect vastly more triangles than edges.

And Vertica, running in a VM, took about 4.5 minutes to count the triangles.

I still want to figure out why Hadoop didn't work, and to try the Pig example. Maybe I should retry Hadoop with 20 million edges.
Code:
dbadmin@linux-33ql:~/devel/input> head -n 20000000 edges.txt > use
dbadmin@linux-33ql:~/devel/input> wc -l use
20000000 use
dbadmin@linux-33ql:~/devel/input> less use
dbadmin@linux-33ql:~/devel/input> rm -f edges.txt
dbadmin@linux-33ql:~/devel/input> mv use edges.txt
dbadmin@linux-33ql:~/devel> cat triangle_counter.sql
\set dir `pwd`
\set file '''':dir'/input/edges.txt'''
create table edges (source int not null, dest int not null) segmented by hash(source,dest) all nodes;
\timing
copy edges from :file direct delimiter ' ';
select count(*)
from edges e1
join edges e2 on e1.dest = e2.source and e1.source < e2.source
join edges e3 on e2.dest = e3.source and e3.dest = e1.source and e2.source < e3.source;
\timing
dbadmin@linux-33ql:~/devel> vsql
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type: \h for help with SQL commands
\? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
dbadmin=> drop table edges;
DROP TABLE
dbadmin=> \set dir `pwd`
dbadmin=> \set file '''':dir'/input/edges.txt'''
dbadmin=> create table edges (source int not null, dest int not null) segmented by hash(source,dest) all nodes;
CREATE TABLE
dbadmin=> copy edges from :file direct delimiter ' ';
Rows Loaded
-------------
20000000
(1 row)
dbadmin=> select count(*)
dbadmin-> from edges e1
dbadmin-> join edges e2 on e1.dest = e2.source and e1.source < e2.source
dbadmin-> join edges e3 on e2.dest = e3.source and e3.dest = e1.source and e2.source < e3.source;
count
----------
13487606
(1 row)
dbadmin=>
dbadmin=> \timing
Timing is on.
dbadmin=> select count(*)
dbadmin-> from edges e1
dbadmin-> join edges e2 on e1.dest = e2.source and e1.source < e2.source
dbadmin-> join edges e3 on e2.dest = e3.source and e3.dest = e1.source and e2.source < e3.source;
count
----------
13487606
(1 row)
Time: First fetch (1 row): 271037.383 ms. All rows formatted: 271054.429 ms
dbadmin=>
\q
dbadmin@linux-33ql:~/devel> bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
scale=5
271.0 / 60.0
4.51666
dbadmin@linux-33ql:~/devel>