R TransformFunctions and Partition by

Moderator: NorbertKrupa

joejoe
Newbie
Posts: 3
Joined: Tue Jul 23, 2013 11:40 am

R TransformFunctions and Partition by

Post by joejoe » Tue Jul 23, 2013 12:43 pm

Hi,

I was wondering if you could help me. I've looked through the documentation but can't find a clear answer.

Is it possible to use the window_partition_clause, i.e. OVER (PARTITION BY blah), with transform functions? And how do I do this with R?


More precisely:


I've been trying to run the k-means clustering algorithm as a polymorphic transform function, like the example in the documentation. My code is here:

Code:

# k-means polymorphic algorithm

# Input: A data frame consisting of one column of labels and then n metrics
# Output: A data frame with one column stating which cluster the data point belongs to

kmeans_clusterPoly <- function(x, y)
{
   # Load required packages
   # library(cluster)

   # Parameter check: number of clusters to be made, k
   if (!is.null(y[['k']]))
      k <- as.numeric(y[['k']])
   else
      stop("Expected parameter k. Syntax: '... USING PARAMETERS k=3)'")

   # Get the number of columns in the input data frame
   cols <- ncol(x)

   # Run the k-means algorithm on every column after the first (label) column
   cl <- kmeans(x[, 2:cols], k)

   # Extract the clustering vector
   Result <- cl$cluster

   # Return the result to Vertica
   Result <- data.frame(x[, 1], Result)
   Result
}

kmeans_clusterPolyFactory <- function()
{
   list(
        name=kmeans_clusterPoly,          # function that does the processing
        udxtype=c("transform"),           # type of the function
        intype=c("any"),                  # input types
        outtype=c("int","int"),           # output types
        parametertypecallback=kmeans_clusterPolyParameters
   )
}

kmeans_clusterPolyParameters <- function()
{
   # One row per parameter: datatype, length, scale, name
   params <- data.frame(datatype=rep(NA, 1), length=rep(NA, 1), scale=rep(NA, 1), name=rep(NA, 1))
   params[1,1] = "int"
   params[1,4] = "k"
   params
}


The function runs fine with the SELECT statement below:

Code:

   SELECT
      kmeans_clusterPoly( c2, c3 USING PARAMETERS k = 4 ) OVER ()
   FROM
      t1

I would like to be able to do the following:

Code:

   SELECT
      kmeans_clusterPoly( c2, c3 USING PARAMETERS k = 4 ) OVER ( PARTITION BY c1 )
   FROM
      t1

But I get the following error:

ERROR 3399: Failure in UDx RPC call InvokeProcessPartition(): Error calling processPartition() in User Defined Object [kmeans_clusterPoly] at [/scratch_a/release/vbuild/vertica/UDxFence/RInterface.cpp:1342], error code: 0, message: Exception in processPartitionForR: [cannot take a sample larger than the population when 'replace = FALSE']

What do I need to do? Is there a way to declare the partition in a similar way to the Factory or Parameters functions? Where can I find documentation for this?

Or do I need to write the partitioning directly in the R code? Is it the case that a transform function can only be applied as one big function acting on the whole data frame?

Thanks in advance for any help.
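
For readers who want to explore the question outside the database, here is a rough local simulation in plain R. It assumes that OVER (PARTITION BY c1) results in one call of the function per partition, each call seeing only that partition's rows (the error text above mentions processPartition()). The table contents, the `run_per_partition` driver, and `cluster_fn` are all made up for illustration and are not part of any Vertica API.

```r
# Toy stand-in for table t1 (values are invented for illustration).
t1 <- data.frame(
  c1 = c("a", "a", "a", "a", "a", "b", "b"),
  c2 = 1:7,
  c3 = c(0.1, 0.2, 5.1, 5.3, 9.9, 1.0, 2.0)
)

# Hypothetical local driver: call f once per group of c1, passing only the
# argument columns (c2, c3) of that group. Errors are captured as text.
run_per_partition <- function(df, f, params) {
  parts <- split(df[, c("c2", "c3")], df$c1)
  lapply(parts, function(part) {
    tryCatch(f(part, params), error = function(e) conditionMessage(e))
  })
}

# Simplified version of the function body from the post above.
cluster_fn <- function(x, y) {
  cl <- kmeans(x[, 2:ncol(x)], as.numeric(y[["k"]]))
  data.frame(x[, 1], cl$cluster)
}

results <- run_per_partition(t1, cluster_fn, list(k = 3))

# Partition "a" has 5 rows and yields a data frame; partition "b" has only
# 2 rows, fewer than k, so its entry holds an error message instead.
str(results)
```

Running each partition separately like this makes it easy to see which groups the function can handle before deploying it.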

id10t
GURU
Posts: 732
Joined: Mon Apr 16, 2012 2:44 pm

Re: R TransformFunctions and Partition by

Post by id10t » Tue Jul 23, 2013 1:42 pm

Hi!
"[cannot take a sample larger than the population when 'replace = FALSE']"
This error means that the number of columns in the output matrix/data frame does not match the output definition (it is larger than defined). Your function should return only 2 columns ( outtype=c("int","int") ).

joejoe
Newbie
Posts: 3
Joined: Tue Jul 23, 2013 11:40 am

Re: R TransformFunctions and Partition by

Post by joejoe » Tue Jul 23, 2013 2:33 pm

Hi,

Okay, cool, and thanks for the quick response.

I tried PARTITION BY with one of the other transform functions I made and it worked without problems, so partitioning itself is not the issue. Thank you for clearing that up.


But I still get this error message when I introduce the PARTITION BY. (Also, I just noticed I have a few errors in the comments in the code: I do want the output matrix to consist of two columns, one an index from the input and the other the cluster it belongs to.) I see no reason why my R code would not output 2 columns, and the code seems to work if I use just an OVER() clause.

Do you know why introducing the PARTITION BY would mean I no longer have 2 columns?

joejoe
Newbie
Posts: 3
Joined: Tue Jul 23, 2013 11:40 am

Re: R TransformFunctions and Partition by

Post by joejoe » Tue Jul 23, 2013 2:56 pm

Ah! I'm sorry, I think I've found it now!

Your post was really useful, thanks.

I think that when I added the partition, some of the partitions contained too few data points to run kmeans in R, so it would return nothing, which would mean that the number of columns in the output matrix didn't match the definition.

On to the next error message!

Thanks a lot

I'll keep working at it :)
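
One possible way to handle small partitions is sketched below. The quoted message is the one R's sampling functions raise when asked for more items than exist without replacement, and `kmeans` draws its k starting centers from the input rows, so a partition with fewer than k rows would be expected to trigger it. `safe_kmeans_clusters` is a hypothetical helper, not something from the post or from Vertica, and returning NA for undersized partitions is only an assumption about what the caller wants (capping k or skipping the partition are alternatives).

```r
# Hypothetical helper: cluster one partition, falling back to NA cluster
# labels when the partition has fewer distinct points than k.
safe_kmeans_clusters <- function(x, k) {
  metrics <- x[, -1, drop = FALSE]        # every column except the label
  n_distinct <- nrow(unique(metrics))     # kmeans needs k distinct points
  if (n_distinct < k) {
    cluster <- rep(NA_integer_, nrow(x))  # too small: no clustering attempted
  } else {
    cluster <- as.integer(kmeans(metrics, k)$cluster)
  }
  # Always two columns and one row per input row, whatever the partition size.
  data.frame(label = x[, 1], cluster = cluster)
}

# A 2-row partition with k = 4 no longer raises an error.
small <- data.frame(id = 1:2, m = c(0.5, 1.5))
print(safe_kmeans_clusters(small, 4))
```

Because the shape of the result no longer depends on the partition size, the output should keep matching a two-column outtype declaration.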

ssrao
Newbie
Posts: 14
Joined: Fri Jun 07, 2013 2:37 pm

Re: R TransformFunctions and Partition by

Post by ssrao » Thu Jan 30, 2014 11:29 am

Hi,

I am testing the above code in R and getting the error "Error in y[["k"]] : subscript out of bounds".

mykmeansPoly <- function(x, y)
{
   # get the number of clusters to be formed
   if (!is.null(y[['k']]))
      k = as.numeric(y[['k']])
   else
      stop("Expected parameter k")
   # get the number of columns in the input data frame
   cols = ncol(x)
   # run the kmeans algorithm
   # (note: 2:cols-1 parses as (2:cols)-1, i.e. columns 1 to cols-1)
   cl <- kmeans(x[,2:cols-1], k)
   # get the cluster information from the result of above
   Result <- cl$cluster
   # return result to vertica
   Result <- data.frame(VCol=Result)
   Result
}
x <- read.csv('C:/R/Clustering/IRIS/telco.csv')
mykmeansPoly(x,3)

Could you please tell me how to resolve the above error?

VaughanR
Newbie
Posts: 3
Joined: Wed Mar 18, 2015 4:46 am

Re: R TransformFunctions and Partition by

Post by VaughanR » Mon Sep 28, 2015 1:15 am

I would think that this:

Code:

mykmeansPoly(x,3)

should be more like this:

Code:

mykmeansPoly(x,list('k'=3))
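
To see why the bare number fails, here is a small sketch in plain R. `get_k` is a hypothetical helper that only mirrors the parameter check in the function above; it is not part of any Vertica API.

```r
# Hypothetical helper mirroring the parameter check in mykmeansPoly.
get_k <- function(y) {
  if (!is.null(y[["k"]])) as.numeric(y[["k"]]) else stop("Expected parameter k")
}

# A named list behaves like the parameter object the function expects.
print(get_k(list(k = 3)))

# A bare number has no element named "k", so [[ "k" ]] raises an error
# before the is.null() check can run.
bare <- tryCatch(get_k(3), error = function(e) conditionMessage(e))
print(bare)

# A list without "k" reaches the intended "Expected parameter k" branch,
# because [[ ]] on a list returns NULL for a missing name.
missing_k <- tryCatch(get_k(list(a = 1)), error = function(e) conditionMessage(e))
print(missing_k)
```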
