Here is a link that discusses the performance of Hadoop vs. Pig vs. Vertica for counting triangles ([spoiler alert] Vertica wins!):
http://nosql.mypopescu.com/post/1072614 ... -triangles
Hadoop vs PIG vs Vertica for Counting Triangles
Jim Knicely
Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.
Re: Hadoop vs PIG vs Vertica for Counting Triangles
I'm trying this but for the moment have gotten stuck in the first (Hadoop) part.

I don't have a Linux server currently free, so what I did was add 8 GB to a powerful Windows server, bringing it to 16 GB. The processor is a Sandy Bridge 2600K running at 3.4 GHz. I installed VirtualBox, since it's free and has worked for me in the past, and into that installed openSUSE 12.1.

This worked for Vertica: I downloaded and installed the Community Edition and it runs fine. I built the VMart database and can run queries against it, added another database for my own testing, etc. It's not 'at scale', but everything I've tried so far has run fine. In the VM, that is (since Vertica runs only on Linux).

Then I came upon this example and downloaded the zip file from the GitHub site. My VM has 4 GB of memory allotted but only 30 GB of disk space, so I put the contents of the zip file in a 'shared' folder: a folder on the host (Windows) system that's visible from both the Windows host and the Linux guest, and thus has access to a 500 GB drive with about 200 GB free.

So I tried the Hadoop example last night. With the addition of some packages to openSUSE (ant, jdk, subversion), it successfully builds mr-graphs.jar and runs. Note that I'm using edges.txt, that is, the much larger data file containing some 86 million edges. You put it in the input directory and modify build.xml appropriately.

The whole Hadoop job consists of 3 constituent jobs. In my case, job 1 runs fine but I seem to get stuck in job 2. In particular, deep into job 2, I see messages of this kind run for hours (log excerpt below).

The problem is that the disk (the shared 500 GB disk) continually fills with intermediate files. I was afraid it was going to fill entirely, so I killed the job after 150 GB of Hadoop files and 8 hours of running time. After killing the job, the jobcache directory (listing below) contains 1521 directories at 48-49 MB apiece. That's a lot.

There are no visible errors. htop (a variant of top) tells me that Hadoop (the only thing running) takes between 50% and 90% of the CPU; memory never breaks 1 GB out of the 4 GB. With Hadoop using an 'external' drive, I/O would be the largest consideration, so I ran iostat, but it too looks OK: %iowait is never > 2%. TPS started off at about 10-11/sec but later in the job degraded by half to 5, which indicates something, but again I see no explicit error.

So strictly speaking this is a Hadoop question, but using code provided by Vertica.

Any help appreciated.
Code:
[exec] 12/10/02 23:38:03 INFO mapred.Merger: Down to the last merge-pass, with 8 segments left of total size: 49509881 bytes
[exec] 12/10/02 23:38:05 INFO mapred.LocalJobRunner:
[exec] 12/10/02 23:38:06 INFO mapred.TaskRunner: Task:attempt_local_0002_m_001517_0 is done. And is in the process of commiting
[exec] 12/10/02 23:38:06 INFO mapred.LocalJobRunner:
[exec] 12/10/02 23:38:06 INFO mapred.TaskRunner: Task 'attempt_local_0002_m_001517_0' done.
[exec] 12/10/02 23:38:06 INFO mapred.MapTask: io.sort.mb = 100
[exec] 12/10/02 23:38:06 INFO mapred.MapTask: data buffer = 79691776/99614720
[exec] 12/10/02 23:38:06 INFO mapred.MapTask: record buffer = 262144/327680
[exec] 12/10/02 23:38:07 INFO mapred.MapTask: Spilling map output: record full = true
[exec] 12/10/02 23:38:07 INFO mapred.MapTask: bufstart = 0; bufend = 5941019; bufvoid = 99614720
[exec] 12/10/02 23:38:07 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
[exec] 12/10/02 23:38:07 INFO mapred.MapTask: Finished spill 0
[exec] 12/10/02 23:38:07 INFO mapred.MapTask: Spilling map output: record full = true
[exec] 12/10/02 23:38:07 INFO mapred.MapTask: bufstart = 5941019; bufend = 12007923; bufvoid = 99614720
[exec] 12/10/02 23:38:07 INFO mapred.MapTask: kvstart = 262144; kvend = 196607; length = 327680
[exec] 12/10/02 23:38:08 INFO mapred.MapTask: Finished spill 1
[exec] 12/10/02 23:38:08 INFO mapred.MapTask: Spilling map output: record full = true
[exec] 12/10/02 23:38:08 INFO mapred.MapTask: bufstart = 12007923; bufend = 18187103; bufvoid = 99614720
[exec] 12/10/02 23:38:08 INFO mapred.MapTask: kvstart = 196607; kvend = 131070; length = 327680
[exec] 12/10/02 23:38:09 INFO mapred.MapTask: Finished spill 2
[exec] 12/10/02 23:38:09 INFO mapred.MapTask: Spilling map output: record full = true
[exec] 12/10/02 23:38:09 INFO mapred.MapTask: bufstart = 18187103; bufend = 24161342; bufvoid = 99614720
[exec] 12/10/02 23:38:09 INFO mapred.MapTask: kvstart = 131070; kvend = 65533; length = 327680
[exec] 12/10/02 23:38:10 INFO mapred.MapTask: Finished spill 3
[exec] 12/10/02 23:38:10 INFO mapred.MapTask: Spilling map output: record full = true
[exec] 12/10/02 23:38:10 INFO mapred.MapTask: bufstart = 24161342; bufend = 30086990; bufvoid = 99614720
[exec] 12/10/02 23:38:10 INFO mapred.MapTask: kvstart = 65533; kvend = 327677; length = 327680
[exec] 12/10/02 23:38:10 INFO mapred.MapTask: Finished spill 4
[exec] 12/10/02 23:38:11 INFO mapred.MapTask: Spilling map output: record full = true
[exec] 12/10/02 23:38:11 INFO mapred.MapTask: bufstart = 30086990; bufend = 36015090; bufvoid = 99614720
[exec] 12/10/02 23:38:11 INFO mapred.MapTask: kvstart = 327677; kvend = 262140; length = 327680
[exec] 12/10/02 23:38:11 INFO mapred.MapTask: Finished spill 5
[exec] 12/10/02 23:38:11 INFO mapred.MapTask: Spilling map output: record full = true
[exec] 12/10/02 23:38:11 INFO mapred.MapTask: bufstart = 36015090; bufend = 41953832; bufvoid = 99614720
[exec] 12/10/02 23:38:11 INFO mapred.MapTask: kvstart = 262140; kvend = 196603; length = 327680
[exec] 12/10/02 23:38:12 INFO mapred.MapTask: Finished spill 6
[exec] 12/10/02 23:38:12 INFO mapred.MapTask: Starting flush of map output
[exec] 12/10/02 23:38:12 INFO mapred.MapTask: Finished spill 7
[exec] 12/10/02 23:38:12 INFO mapred.Merger: Merging 8 sorted segments
[exec] 12/10/02 23:38:12 INFO mapred.Merger: Down to the last merge-pass, with 8 segments left of total size: 49489852 bytes
[exec] 12/10/02 23:38:12 INFO mapred.LocalJobRunner:
[exec] 12/10/02 23:38:15 INFO mapred.LocalJobRunner:
[exec] 12/10/02 23:38:16 INFO mapred.TaskRunner: Task:attempt_local_0002_m_001518_0 is done. And is in the process of commiting
[exec] 12/10/02 23:38:16 INFO mapred.LocalJobRunner:
[exec] 12/10/02 23:38:16 INFO mapred.TaskRunner: Task 'attempt_local_0002_m_001518_0' done.
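For what it's worth, the buffer numbers in those MapTask lines look internally consistent with Hadoop 1.x defaults, if I'm reading them right: io.sort.mb = 100 reserves a slice of the 100 MB for record accounting (io.sort.record.percent defaults to 0.05), leaving the 95 MiB data buffer shown, and both buffers spill at the default io.sort.spill.percent of 0.80. A quick arithmetic sanity check (this only explains the log lines, not the disk-fill problem):

```python
# Sanity-check the MapTask buffer numbers from the log against Hadoop 1.x
# defaults: io.sort.mb = 100, io.sort.record.percent = 0.05 (record
# accounting), io.sort.spill.percent = 0.80 (soft limit triggering a spill).
MIB = 1024 * 1024

io_sort_mb = 100
data_buffer_total   = 99614720   # from "data buffer = 79691776/99614720"
data_buffer_soft    = 79691776
record_buffer_total = 327680     # from "record buffer = 262144/327680"
record_buffer_soft  = 262144

# 95% of the 100 MB sort buffer holds serialized map output...
print(data_buffer_total == int(io_sort_mb * MIB * 0.95))   # True
# ...and both buffers hit their soft limit at 80% full.
print(data_buffer_soft / data_buffer_total)                # 0.8
print(record_buffer_soft / record_buffer_total)            # 0.8
```

So the frequent "Spilling map output: record full = true" messages are normal buffer-full spills, not errors; the question is why job 2 produces so much intermediate data.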
Code:
vertica-Graph-Analytics-Triangle-Counting-59aa09e/tmp/mapred/local/taskTracker/jobcache #
Code:
linux-33ql:/media/sf_shared/vertica-Graph-Analytics-Triangle-Counting-59aa09e/tmp/mapred/local/taskTracker/jobcache # la
total 788
drwxrwx--- 1 root vboxsf 0 Oct 2 18:31 .
drwxrwx--- 1 root vboxsf 0 Oct 2 16:39 ..
drwxrwx--- 1 root vboxsf 20480 Oct 2 16:43 job_local_0001
drwxrwx--- 1 root vboxsf 786432 Oct 2 23:38 job_local_0002
linux-33ql:/media/sf_shared/vertica-Graph-Analytics-Triangle-Counting-59aa09e/tmp/mapred/local/taskTracker/jobcache # ls -l *2 | wc -l
1521
linux-33ql:/media/sf_shared/vertica-Graph-Analytics-Triangle-Counting-59aa09e/tmp/mapred/local/taskTracker/jobcache #
Re: Hadoop vs PIG vs Vertica for Counting Triangles
So I skipped Hadoop for the moment (but still want to figure out why it didn't work) and jumped ahead to the Vertica part of the exercise.

From the output below, it appears to load the 86 million rows just fine. It then churns away for a while on the actual count (the query below), but then craps out on disk space. The disk-space failure seems to be on the order of 1.3 to 1.4 GB, yet I have over 9 GB free in that partition (df output below).

Maybe it's time to learn something about Vertica tuning.
Code:
select count(*)
from edges e1
join edges e2 on e1.dest = e2.source and e1.source < e2.source
join edges e3 on e2.dest = e3.source and e3.dest = e1.source and e2.source < e3.source;
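For anyone puzzling over what that query counts: assuming edges.txt stores each undirected edge as a directed pair in both directions, the two `<` predicates keep only the orientation with source ordering a < b < c, so each triangle is counted exactly once. A minimal sketch of the same join logic in Python, on hypothetical toy data (not the real edges.txt):

```python
# Toy undirected graph stored as directed pairs in both directions,
# mirroring the assumed layout of edges.txt (hypothetical sample data).
undirected = [(1, 2), (2, 3), (1, 3), (3, 4)]
edges = undirected + [(b, a) for (a, b) in undirected]
edge_set = set(edges)

def count_triangles(edges, edge_set):
    """Mirror the SQL self-join:
       e1.dest = e2.source AND e1.source < e2.source        (a < b)
       e2.dest = e3.source AND e3.dest = e1.source
                           AND e2.source < e3.source        (b < c)"""
    count = 0
    for a, b in edges:                                    # e1 = (a, b)
        if a >= b:
            continue
        for b2, c in edges:                               # e2 = (b, c)
            if b2 == b and b < c and (c, a) in edge_set:  # e3 = (c, a)
                count += 1
    return count

print(count_triangles(edges, edge_set))  # 1 -> the single triangle {1, 2, 3}
```

The nested scan is quadratic and purely illustrative; the SQL version leaves the join strategy to Vertica's optimizer.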
Code:
dbadmin=> \i triangle_counter.sql
CREATE TABLE
Timing is on.
Rows Loaded
-------------
86220856
(1 row)
Time: First fetch (1 row): 65494.149 ms. All rows formatted: 65494.184 ms
vsql:triangle_counter.sql:12: ERROR 2927: Could not write to [/home/dbadmin/vtest/v_vtest_node0001_data]: [Volume /home/dbadmin/vtest/v_vtest_node0001_data has 1452744704 bytes free (1348275442 unreserved). Minimum free space is 1347433260 (Temporary Data).]
Timing is off.
Code:
dbadmin@linux-33ql:~/devel> df
Filesystem 1K-blocks Used Available Use% Mounted on
rootfs 15480816 5278044 9416392 36% /
devtmpfs 2020164 36 2020128 1% /dev
tmpfs 2027524 356 2027168 1% /dev/shm
tmpfs 2027524 596 2026928 1% /run
/dev/sda2 15480816 5278044 9416392 36% /
tmpfs 2027524 0 2027524 0% /sys/fs/cgroup
tmpfs 2027524 0 2027524 0% /media
tmpfs 2027524 596 2026928 1% /var/run
tmpfs 2027524 596 2026928 1% /var/lock
/dev/sda3 13158528 2990268 9499844 24% /home
none 488282108 291376320 196905788 60% /media/sf_shared
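Putting units on those numbers (plain arithmetic; presumably the join's temporary files were consuming the space at the moment of the error and were freed before this df was taken, but that is a guess): the error reports about 1.35 GiB free on the data volume against a roughly 1.25 GiB temp-space minimum, while df shows about 9.06 GiB available on /dev/sda3 (/home), where the data directory lives.

```python
# Convert the byte counts in the ERROR 2927 message and the df output to GiB.
GIB = 1024 ** 3

bytes_free      = 1452744704   # "has 1452744704 bytes free"
temp_minimum    = 1347433260   # "Minimum free space is 1347433260 (Temporary Data)"
df_home_free_kb = 9499844      # df: Available (1K-blocks) on /dev/sda3 (/home)

print(round(bytes_free / GIB, 2))              # 1.35
print(round(temp_minimum / GIB, 2))            # 1.25
print(round(df_home_free_kb * 1024 / GIB, 2))  # 9.06
```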
Re: Hadoop vs PIG vs Vertica for Counting Triangles
Maybe you can't get there from here in the Community Edition.

http://stackoverflow.com/questions/1248 ... a-database

What I might want is to turn JOIN SPILL on; however, searching the online Vertica CE documentation, I find no mention of vertica_set_options.

What occurs to me (just now) is that I could trim edges.txt down to, say, 1/4 of its size and see if that runs.
Re: Hadoop vs PIG vs Vertica for Counting Triangles
Gee, trimming it down to 20 million edges actually worked.

So you get 13+ million triangles out of 20 million edges. That seems not unreasonable; for a sparse graph like this you wouldn't expect vastly more triangles than edges.

And Vertica, running in a VM, took about 4.5 minutes to count the triangles.

I still want to figure out why Hadoop didn't work, and to try the Pig example. Maybe I should retry Hadoop with 20 million edges.
Code:
dbadmin@linux-33ql:~/devel/input> head -n 20000000 edges.txt > use
dbadmin@linux-33ql:~/devel/input> wc -l use
20000000 use
dbadmin@linux-33ql:~/devel/input> less use
dbadmin@linux-33ql:~/devel/input> rm -f edges.txt
dbadmin@linux-33ql:~/devel/input> mv use edges.txt
dbadmin@linux-33ql:~/devel> cat triangle_counter.sql
\set dir `pwd`
\set file '''':dir'/input/edges.txt'''
create table edges (source int not null, dest int not null) segmented by hash(source,dest) all nodes;
\timing
copy edges from :file direct delimiter ' ';
select count(*)
from edges e1
join edges e2 on e1.dest = e2.source and e1.source < e2.source
join edges e3 on e2.dest = e3.source and e3.dest = e1.source and e2.source < e3.source;
\timing
dbadmin@linux-33ql:~/devel> vsql
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type: \h for help with SQL commands
\? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
dbadmin=> drop table edges;
DROP TABLE
dbadmin=> \set dir `pwd`
dbadmin=> \set file '''':dir'/input/edges.txt'''
dbadmin=> create table edges (source int not null, dest int not null) segmented by hash(source,dest) all nodes;
CREATE TABLE
dbadmin=> copy edges from :file direct delimiter ' ';
Rows Loaded
-------------
20000000
(1 row)
dbadmin=> select count(*)
dbadmin-> from edges e1
dbadmin-> join edges e2 on e1.dest = e2.source and e1.source < e2.source
dbadmin-> join edges e3 on e2.dest = e3.source and e3.dest = e1.source and e2.source < e3.source;
count
----------
13487606
(1 row)
dbadmin=>
dbadmin=> \timing
Timing is on.
dbadmin=> select count(*)
dbadmin-> from edges e1
dbadmin-> join edges e2 on e1.dest = e2.source and e1.source < e2.source
dbadmin-> join edges e3 on e2.dest = e3.source and e3.dest = e1.source and e2.source < e3.source;
count
----------
13487606
(1 row)
Time: First fetch (1 row): 271037.383 ms. All rows formatted: 271054.429 ms
dbadmin=>
\q
dbadmin@linux-33ql:~/devel> bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
scale=5
271.0 / 60.0
4.51666
dbadmin@linux-33ql:~/devel>