### R Transform Functions and PARTITION BY

Posted: **Tue Jul 23, 2013 12:43 pm**

Hi,

I was wondering if you could help me. I've looked through the documentation but can't find a clear answer.

Is it possible to use the window_partition_clause [i.e. OVER (PARTITION BY blah)] with transform functions? And how do I do this with R?

More precisely:

I've been trying to run the k-means clustering algorithm as a polymorphic transform function, like the example in the documentation. My code is here:

```
# k-means polymorphic algorithm
# Input: a data frame with one column of labels followed by n metric columns
# Output: a data frame with two columns: the label and the cluster the data point belongs to
kmeans_clusterPoly <- function(x, y)
{
  # load required packages
  # library(cluster)

  # Parameter check: number of clusters to be made, k
  if (!is.null(y[['k']]))
    k <- as.numeric(y[['k']])
  else
    stop("Expected parameter k. Syntax '...USING PARAMETERS k=3)'")

  # Get the number of columns in the input data frame
  cols <- ncol(x)

  # Run the k-means algorithm on every column except the label
  cl <- kmeans(x[, 2:cols], k)

  # Extract the clustering vector (one cluster id per input row)
  Result <- cl$cluster

  # Return the label and its cluster to Vertica
  Result <- data.frame(x[, 1], Result)
  Result
}

kmeans_clusterPolyFactory <- function()
{
  list(
    name = kmeans_clusterPoly,       # function that does the processing
    udxtype = c("transform"),        # type of the function
    intype = c("any"),               # input types
    outtype = c("int", "int"),       # output types
    parametertypecallback = kmeans_clusterPolyParameters
  )
}

# Describes the single parameter k (an int) accepted by the function
kmeans_clusterPolyParameters <- function()
{
  params <- data.frame(datatype = rep(NA, 1), length = rep(NA, 1), scale = rep(NA, 1), name = rep(NA, 1))
  params[1, 1] <- "int"
  params[1, 4] <- "k"
  params
}
```

The function runs fine with the SELECT statement below:

```
SELECT
    kmeans_clusterPoly(c2, c3 USING PARAMETERS k = 4) OVER ()
FROM
    t1;
```

I would like to be able to do the following:

```
SELECT
    kmeans_clusterPoly(c2, c3 USING PARAMETERS k = 4) OVER (PARTITION BY c1)
FROM
    t1;
```

But I get the following error:

ERROR 3399: Failure in UDx RPC call InvokeProcessPartition(): Error calling processPartition() in User Defined Object [kmeans_clusterPoly] at [/scratch_a/release/vbuild/vertica/UDxFence/RInterface.cpp:1342], error code: 0, message: Exception in processPartitionForR: [cannot take a sample larger than the population when 'replace = FALSE']
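The bracketed message at the end looks like one that comes from R's sampling code rather than from Vertica itself. As a rough local check in plain R (outside Vertica), the sketch below asks `kmeans` for more clusters than there are rows, which is what could happen if one of the `c1` partitions holds fewer than 4 rows. The data frame `part` is a made-up stand-in for such a partition, not data from `t1`.

```r
# Hypothetical stand-in for one small partition: 3 rows, one metric column.
part <- data.frame(label = 1:3, metric = c(0.5, 1.5, 9.0))

# Ask for more clusters (4) than there are rows (3) and capture the
# error message instead of letting it abort the script.
msg <- tryCatch(
  {
    kmeans(part[, 2, drop = FALSE], centers = 4)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(msg)
```

If the printed message matches the one in the Vertica error, the partitioning itself may be working, with the failure coming from a partition that is too small for k = 4. This is only a guess based on the message text.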

What do I need to do? Is there a way to declare the partition in a similar way to the Factory or Parameters functions? Where can I find documentation for this?

Or do I need to write the partitioning in the R code directly? Is it the case that transform functions can only be applied as one big function acting on the whole data frame?
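For reference, this is roughly what I imagine "writing the partition in R directly" would look like. It is a plain-R sketch run outside Vertica, not something taken from the documentation. The helper name `cluster_by_group`, the sample data, and the choice to return `NA` for groups with fewer than k distinct rows are all assumptions of mine.

```r
# Hypothetical helper: cluster each group separately, skipping groups
# that do not have at least k distinct rows of metrics.
cluster_by_group <- function(df, key, k) {
  groups <- split(df, df[[key]])
  results <- lapply(groups, function(g) {
    # Every column except the grouping key is treated as a metric.
    metrics <- g[, setdiff(names(g), key), drop = FALSE]
    if (nrow(unique(metrics)) < k) {
      # Too few distinct points for k clusters: mark as unassigned.
      cluster <- rep(NA_integer_, nrow(g))
    } else {
      cluster <- kmeans(metrics, centers = k)$cluster
    }
    data.frame(key = g[[key]], cluster = cluster)
  })
  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

# Made-up data: group "a" has 6 rows, group "b" has only 2.
set.seed(1)
df <- data.frame(
  c1 = c(rep("a", 6), rep("b", 2)),
  c3 = c(1, 1.1, 5, 5.2, 9, 9.1, 3, 4)
)

res <- cluster_by_group(df, key = "c1", k = 3)
print(res)
```

Here group "a" gets cluster ids and group "b" is left as `NA` because it is too small for k = 3. I would rather have the database do the grouping through PARTITION BY than carry the key column into R like this.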

Thanks in advance for any help.