Thanks for this wonderful article.

I have a question about the code.

In the function domeanshift(), you wrote:

———————————————————————————–

curDataPoint = dataPoints(i,:);

euclideanDist = sqdist(curDataPoint',origDataPoints');

bandwidth = getBandWith(origDataPoints(i,:),origDataPoints,euclideanDist,useKNNToGetH);

kernelDist = exp(-euclideanDist ./ (bandwidth^2));

———————————————————————————–

But I think the following code may be more reasonable:

———————————————————————————–

curDataPoint = dataPoints(i,:);

euclideanDist = sqrt( sqdist(curDataPoint',origDataPoints') );

bandwidth = getBandWith(origDataPoints(i,:),origDataPoints,euclideanDist,useKNNToGetH);

kernelDist = exp(- (euclideanDist.^2) ./ (bandwidth^2));

———————————————————————————–

The reason is that in getBandWith(), we set the bandwidth to the k-th smallest Euclidean distance, not the k-th smallest squared Euclidean distance. So the distances passed to getBandWith() should be Euclidean, and the kernel should then square them to stay consistent with exp(-d^2 / h^2).
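To make the consistency point concrete, here is a minimal Python/NumPy sketch (not the article's MATLAB; `gaussian_weights` is an illustrative helper name, and the kNN bandwidth rule is assumed from the discussion above): the bandwidth is taken as the k-th smallest Euclidean distance, so the Gaussian kernel must use the squared Euclidean distance.

```python
import numpy as np

def gaussian_weights(x, points, k):
    """Gaussian kernel weights with a kNN bandwidth.

    The bandwidth h is the k-th smallest *Euclidean* distance from x
    to the data points, so the kernel must use the *squared* distance:
    exp(-d^2 / h^2). Mixing squared distances into h (as in the
    original snippet) makes the exponent dimensionally inconsistent.
    """
    d = np.sqrt(np.sum((points - x) ** 2, axis=1))  # Euclidean distances
    h = np.sort(d)[k - 1]                           # k-th smallest distance
    return np.exp(-(d ** 2) / h ** 2)

# Tiny 1-D example; note the self-distance 0 counts as the 1st neighbour
pts = np.array([[0.0], [1.0], [2.0], [10.0]])
w = gaussian_weights(np.array([0.0]), pts, k=2)     # h = 1.0 here
```

With h = 1, the weights fall off as exp(-d^2), so the far point at 10 contributes essentially nothing.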

However, I have been using a fixed bandwidth so far, and on some of my data I find it very hard (practically impossible) to set a bandwidth that works across all of it, given its diversity. I am now looking for ways to make the bandwidth data-driven or locally adaptive. I like the kNN idea (a large bandwidth in sparse areas and a smaller one in dense areas), but setting k doesn't seem any easier.
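The kNN idea can be sketched as follows, again in Python/NumPy rather than the article's MATLAB (`knn_bandwidths` is an illustrative name, not a function from the article): each point's bandwidth is its distance to its k-th nearest neighbour, which automatically grows in sparse regions and shrinks in dense ones.

```python
import numpy as np

def knn_bandwidths(points, k):
    """Per-point bandwidth = distance to the k-th nearest neighbour.

    Sparse regions get large bandwidths, dense regions small ones.
    This is O(n^2) in memory and time; fine for a sketch, not for
    large datasets.
    """
    diff = points[:, None, :] - points[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))   # pairwise Euclidean distances
    d_sorted = np.sort(d, axis=1)
    return d_sorted[:, k]                  # column 0 is the self-distance 0

# Dense cluster near 0, plus one isolated point far away
pts = np.array([[0.0], [0.1], [0.2], [5.0]])
h = knn_bandwidths(pts, k=2)
```

Here the isolated point at 5.0 receives a bandwidth an order of magnitude larger than the clustered points, which is exactly the adaptive behaviour described above; the open question of how to choose k remains.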

What does your intuition say on the bandwidth-selection issue? It would be great if you could point to some important papers with ideas that work robustly. I found lots of papers on Google, but I am not sure which of them work in practice.

The kNN idea would surely make the code run faster, but in terms of achieving good clustering consistently across different datasets (sparse to dense), is it any better than a fixed bandwidth with respect to parameter sensitivity?

Looking forward to hearing from you.

Regards,

Deepak
