
Let's say you have a set of numbers $S$ and you want to obtain subsets $s_1, s_2, \ldots, s_i$ of $S$, where $i \le N$.

Is there a particular operation that will group the numbers that are "close to each other"?

I will give an example to clarify the question:

Let's say you have a set $S$ which contains $\{1.4, 2, 2.5, 2.7, 14, 16, 49, 57, 58\}$

And let's say that you can have maximum of $N=4$ subsets.

So, you might end up with a result that looks like this:

$s_1 = \{1.4, 2, 2.5, 2.7\}$
$s_2 = \{14, 16\}$
$s_3 = \{49\}$
$s_4 = \{57, 58\}$

I am looking for either the name of such a problem (which I can then use to research and write an algorithm) or, even better, a simple solution.

Thank you for any help with this.

  • set-theory is not the tag. The algorithm will depend on what you mean by "close". (2010-10-05)
  • I agree that it depends on what I mean by "close". I am looking for an algorithm that at least generally does what I'm asking, so I have a starting point. Right now, I don't even know what to call this. Do you have any suggestions? (2010-10-05)
  • You are just looking for the $N-1$ largest values of $x_{i+1}-x_i$ after sorting your set $S = \{x_1, \dots, x_n\}$. I doubt this has a name. (2010-10-05)
  • Sounds like "clustering"... (2010-10-05)
  • The differences in your example are 0.6, 0.5, 0.2, 11.3, 2, 33, 8, 1. The 3 biggest values are 33 (= 49 − 16), 11.3 (= 14 − 2.7) and 8 (= 57 − 49), giving your result. (2010-10-05)
  • Thanks, J.M., for the name "clustering". It makes perfect sense, and sometimes giving something a name can make all the difference. (2010-10-05)
  • And a big thank you to jug for providing the answer. That is a very elegant solution: it does not require any recursion or special analysis. I think that's exactly what I'm looking for. (2010-10-05)
  • jug's answer makes sense, but it's not "the" answer. In some applications it would not be optimal. For instance, let $S$ have 100 values in the interval $[0,1]$, 100 values in $[2,3]$, 100 values in $[4,5]$, 100 values in $[6,7]$, and one value of 10. If the objective is to minimize the sums of squares to the centers of the clusters, 10 must be clustered with the values in $[6,7]$ even though that gap is largest. It merely turns out that in the OP's particular example the clustering is so strong that any remotely reasonable algorithm will work. (2010-10-05)
  • @whuber: I agree. That's why I posted it just as a comment. It's maybe the most trivial approach (which might anyway be enough for jeffp). @jeffp: Even if you don't implement the methods in whuber's answer, I'd recommend accepting his answer (or whatever better answer is posted), as it provides a better solution in general. (2010-10-05)
  • Thanks! I have implemented jug's comment and it is working for some cases, but as whuber pointed out, there are cases where it does not do what I would consider ideal. Either way, it is a great starting point. I will probably end up going with what whuber suggested for my final implementation. Thanks, everybody! (2010-10-06)
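The gap-cutting approach from the comments can be sketched in Python as follows. This is a minimal illustration, not code from the thread; the function name and signature are my own:

```python
def cluster_by_gaps(values, n):
    """Split values into at most n groups by sorting them and
    cutting at the n-1 largest consecutive gaps."""
    xs = sorted(values)
    if n >= len(xs):
        return [[x] for x in xs]
    # indices of consecutive gaps, largest gap first
    gaps = sorted(range(len(xs) - 1),
                  key=lambda i: xs[i + 1] - xs[i],
                  reverse=True)
    # cut after the n-1 widest gaps, in left-to-right order
    cuts = sorted(gaps[:n - 1])
    groups, start = [], 0
    for c in cuts:
        groups.append(xs[start:c + 1])
        start = c + 1
    groups.append(xs[start:])
    return groups
```

On the example set with $N = 4$ this reproduces the grouping in the question, since the three widest gaps are 33, 11.3 and 8.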

2 Answers


There are many extant solutions to this problem, each rediscovered many times. Search for "statistical clustering methods" or "cluster analysis." A particularly nice one for your problem is called "K-means". It was rediscovered by cartographers (!), who call it "Jenks' method" (worth mentioning because it's used by some popular mapping software, so perhaps more people have heard of it than of any other method). The idea behind K-means is to minimize the population-weighted sum of variances of the subsets.

BTW, the challenge in most clustering algorithms lies in determining the number of clusters; usually that needs to be specified (as $N = 4$ in this example) or at least hinted at by the user. For instance, there always exists a minimum K-means solution: partition $S$ into singletons. If you require $i < n$, merge the two nearest neighbors but leave all the other singletons as they are: the sum of variances is then just one-quarter the square of the smallest difference among the $x_i$.
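A minimal one-dimensional sketch of K-means (Lloyd's algorithm) may help illustrate the idea. This is my own illustrative code, not part of the answer; it assumes $k$ is given and uses random initialization, so it is not guaranteed to find the global optimum:

```python
import random

def kmeans_1d(values, k, iters=100, seed=0):
    """Sketch of Lloyd's algorithm in one dimension: alternately
    assign points to the nearest center, then move each center to
    the mean of its assigned points."""
    rng = random.Random(seed)
    xs = sorted(values)
    centers = rng.sample(xs, k)  # pick k distinct starting centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # empty clusters keep their old center
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return [c for c in clusters if c]
```

Unlike the gap-cutting heuristic, this objective can pull an outlier into a populous cluster when that lowers the total within-cluster variance, as in the counterexample above.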

  • I remembered "clustering", but forgot "k-means". Thanks! (2010-10-05)

There is a somewhat related operation, condensation, in the theory of linear orderings. For example, you can take a linear ordering and identify any two elements $x,y$ which are connected by some finite chain, i.e. there are elements $x = v_1,\ldots,v_n = y$ such that $v_1 < \cdots < v_n$ and there are no elements 'in between', i.e. no $z$ such that $v_i < z < v_{i+1}$. This will compress all copies of $\mathbb{Z}$ and $\mathbb{N}$ (or its opposite) to a single point.

This particular condensation can be used to (de)construct all countable linear orders, if I remember correctly. There are also some other condensations; take a look at Rosenstein's *Linear Orderings*.