The Vertica Database Forums

Posted: **Tue Jul 23, 2013 12:43 pm**

Hi,

I was wondering if you could help me. I've looked through the documentation but can't seem to find a clear answer.

Is it possible to use the window_partition_clause [i.e. OVER( PARTITION BY blah )] with transform functions? And how do I do this with R?

More precisely:

I've been trying to run the k-means clustering algorithm as a polymorphic transform function, like the example in the documentation. My code is here:

# k-means polymorphic algorithm

#Input: A dataframe consisting of one column of labels and then n metrics
#Output: A dataframe with one column stating which cluster the data point belongs

kmeans_clusterPoly<-function(x,y)
{
   #load required packages
   #library(cluster)

   #Parameter Check: Number of clusters to be made, k.
   if(!is.null(y[['k']]))
      k=as.numeric(y[['k']])
   else
      stop("Expected parameter k. Syntax '...USING PARAMETERS k=3)'")


   # Get the number of columns in the input dataframe 
   cols <- ncol(x)
   
   #runs the k mean algorithm
   cl<-kmeans(x[,2:cols],k)

   #returns the clustering vector
   Result <- cl$cluster
   
   #Return result to vertica
   Result <- data.frame( x[,1], Result )
   Result
}

kmeans_clusterPolyFactory<-function()
{
   list(
        name=kmeans_clusterPoly,        #function that does the processing
        udxtype=c("transform"),         #type of the function
        intype=c("any"),                #input types
        outtype=c("int","int"),         #output types
        parametertypecallback=kmeans_clusterPolyParameters
   )
}

kmeans_clusterPolyParameters <- function()
{
   params <- data.frame( datatype=rep( NA, 1), length=rep( NA,1), scale=rep( NA,1), name=rep( NA,1) )
   params[1,1] = "int"
   params[1,4] = "k"
   params
}

I can run the code fine using the select statement below -

   SELECT 
      kmeans_clusterPoly( c2, c3 USING PARAMETERS k = 4) over( )
   FROM
      t1

I would like to be able to do the following..

   SELECT 
      kmeans_clusterPoly( c2, c3 USING PARAMETERS k = 4)  over( partition by c1 )
   FROM
      t1

But I get the following error

ERROR 3399: Failure in UDx RPC call InvokeProcessPartition(): Error calling processPartition() in User Defined Object [kmeans_clusterPoly] at [/scratch_a/release/vbuild/vertica/UDxFence/RInterface.cpp:1342], error code: 0, message: Exception in processPartitionForR: [cannot take a sample larger than the population when 'replace = FALSE']

What do I need to do? Is there a way to declare the partition in a similar way to the Factory or Parameters functions? Where can I find documentation for this?

Or do I need to write the partitioning in the R code directly? Is it that transform functions can only be applied as one big function acting on the whole data frame?

Thanks in advance for any help

Posted: **Tue Jul 23, 2013 1:42 pm**

Hi!

[cannot take a sample larger than the population when 'replace = FALSE']

This error means that the number of columns in the output matrix/data frame does not match the output definition (it is larger than defined). Your function should return only 2 columns ( outtype=c("int","int") ).

Posted: **Tue Jul 23, 2013 2:33 pm**

Hi,

Okay, cool, and thanks for the quick response.

I tried PARTITION BY with one of the other transform functions I made and had no problem, so the partitioning itself works. Thank you for clearing that up.

But I still get this error message when I introduce the PARTITION BY. (Also, I just noticed I have a few errors in the comments in the code: I do want the output matrix to consist of two columns, one an index from the input and the other the cluster it belongs to.) I see no reason why my R code would not output 2 columns :S and the code seems to work if I have just an OVER() clause.

Do you know of a reason why introducing the PARTITION BY would mean I no longer have 2 columns?

Posted: **Tue Jul 23, 2013 2:56 pm**

Ah! I'm sorry, I think I've found it now!

Your post was really useful, thanks.

I think that when I added the partition, some of the partitions contained too few data points to run kmeans in R, so they returned nothing, which would mean that the number of columns in the output matrix didn't match the definition.

On to the next error message!

Thanks a lot

I'll keep working at it
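
For anyone hitting the same message: it looks like it is raised inside R's kmeans() itself, which samples k starting centers from the rows it receives, so any partition with fewer rows than k fails before a result is built. Below is a minimal sketch of a guard, assuming it is acceptable to label such partitions with NA. The name kmeans_safe, the NA convention and the sample data are illustrative assumptions, not taken from the Vertica documentation, and how Vertica treats NA in an int output column is not verified here. The split()/lapply() part only imitates locally what PARTITION BY does, calling the function once per group.

```r
# Hypothetical helper: cluster one partition, or return NA labels when the
# partition is too small for the requested number of clusters.
kmeans_safe <- function(x, y) {
  if (is.null(y[["k"]])) stop("Expected parameter k")
  k <- as.numeric(y[["k"]])

  # First column is the label/index, the remaining columns are the metrics.
  metrics <- x[, -1, drop = FALSE]

  # Conservative check (an assumption): require more distinct rows than
  # centers, otherwise skip kmeans() for this partition.
  if (nrow(unique(metrics)) <= k) {
    cluster <- rep(NA_integer_, nrow(x))
  } else {
    cluster <- kmeans(metrics, k)$cluster
  }

  # Always return exactly two columns, matching outtype=c("int","int").
  data.frame(id = x[, 1], cluster = as.integer(cluster))
}

# Made-up data: partition "a" has 6 rows, partition "b" only 2.
set.seed(1)
t1 <- data.frame(
  c1 = c("a", "a", "a", "a", "a", "a", "b", "b"),
  c2 = 1:8,
  c3 = c(1.0, 1.1, 5.0, 5.2, 9.0, 9.1, 2.0, 2.5)
)

# Imitate OVER( PARTITION BY c1 ): one call per value of c1.
parts <- split(t1[, c("c2", "c3")], t1$c1)
out <- lapply(parts, kmeans_safe, y = list(k = 3))
print(out)
```

With this shape every partition returns the same two columns, so a small partition shows up as NA clusters rather than aborting the whole query.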

Posted: **Thu Jan 30, 2014 11:29 am**

Hi,

I am testing the above code in R and getting the error "Error in y[["k"]] : subscript out of bounds".

mykmeansPoly <- function(x,y)
{
   # get the number of clusters to be formed
   if(!is.null(y[['k']]))
      k=as.numeric(y[['k']])
   else
      stop("Expected parameter k")
   # get the number of columns in the input data frame
   cols = ncol(x)
   # run the kmeans algorithm
   # (note: 2:cols-1 is evaluated as (2:cols)-1, i.e. columns 1 to cols-1)
   cl <- kmeans(x[,2:cols-1], k)
   # get the cluster information from the result of above
   Result <- cl$cluster
   # return result to vertica
   Result <- data.frame(VCol=Result)
   Result
}
x<-read.csv('C:/R/Clustering/IRIS/telco.csv')
mykmeansPoly(x,3)

Could you please tell me how to resolve the above error?

Posted: **Mon Sep 28, 2015 1:15 am**

I would think that this:

mykmeansPoly(x,3)

Should be more like this:

mykmeansPoly(x,list('k'=3))
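
The likely reason: the function body reads its parameters by name with y[['k']], which works on a named list, whereas a bare 3 is an unnamed numeric vector and name-based [[ ]] indexing on it raises the error. A small check that runs in plain R, outside Vertica (get_k is a hypothetical helper written for this illustration only):

```r
# Hypothetical helper mirroring the parameter check in the posted function,
# with an extra is.list() test so a bare number gives a clear message.
get_k <- function(y) {
  if (!is.list(y) || is.null(y[["k"]])) stop("Expected parameter k")
  as.numeric(y[["k"]])
}

# A bare number has no element named "k", so [[ ]] by name fails.
y_bare <- 3
bare_msg <- tryCatch(y_bare[["k"]], error = function(e) conditionMessage(e))
print(bare_msg)

# A named list behaves the way the function expects.
k_value <- get_k(list(k = 3))
print(k_value)

# The helper turns the bare-number case into the friendlier error.
helper_msg <- tryCatch(get_k(3), error = function(e) conditionMessage(e))
print(helper_msg)
```

Separately, note that in the posted function 2:cols-1 parses as (2:cols)-1, so it selects columns 1 to cols-1 rather than starting at column 2. If the first column is meant to be skipped, 2:(cols-1) or 2:cols needs the parentheses or the form used in the original post.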

R TransformFunctions and Partition by