Research Repository

K-means initialisation algorithms: an extensive comparative study

Harris, Simon (2021) K-means initialisation algorithms: an extensive comparative study. Masters thesis, University of Essex.

[img]
Preview
Text
thesis_simonharris_20210308.pdf

Download (987kB) | Preview

Abstract

The K-means data clustering algorithm, whilst widely popular, is not without its drawbacks. In this work, we are particularly interested in the sensitivity of K-means to its initialisation, in the form of a set of initial centroids. Since the cluster recovery performance of K-means can potentially be improved by better initialisation, numerous algorithms have been proposed with the intention of producing better initial centroids. However, despite several decades since K-means was first formalised, it is still unclear which initialisation algorithm should be used in any particular clustering scenario. With this in mind, we empirically compare \nalgs{} published K-means initialisation algorithms by running them against 6,000 synthetic and 28 real-world data sets. The synthetic data sets were produced under many different configurations, allowing us to explore how each algorithm performs in each scenario. Hence, the results of our experiments may be particularly useful for those considering K-means for a non-trivial clustering scenario. This work also introduces a new set of software libraries originally developed for use as part of our experiments, including implementations of the K-means initialisation algorithms covered in this work, along with data cleaning and data generation tools. These tools are made freely available to the wider community under a permissive open source licence. This is done in part to make sure that the research presented is replicable, but also in the hope that the research and results shared here may be of value to future researchers and developers in the field. As with all open source projects, contributions from the wider community are invited. It is intended that the research should be placed in its historical context, and so we conduct an extensive literature review, including an overview of previous comparable surveys, followed by more detailed coverage of the specific algorithms included in our survey as described in the respective original literature.

Item Type: Thesis (Masters)
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Faculty of Science and Health > Computer Science and Electronic Engineering, School of
Depositing User: Simon Harris
Date Deposited: 20 Apr 2021 15:05
Last Modified: 20 Apr 2021 15:05
URI: http://repository.essex.ac.uk/id/eprint/30207

Actions (login required)

View Item View Item