"Kernel Based Algorithms for Mining Huge Data Sets" is the first book treating the fields of supervised, semi-supervised and unsupervised machine learning collectively. The book presents both the theory and the algorithms for mining huge data sets by using support vector machines (SVMs) in an iterative way. It demonstrates how kernel based SVMs can be used for dimensionality reduction (feature elimination) and shows the similarities and differences between the two most popular unsupervised techniques, the principal component analysis (PCA) and the independent component analysis (ICA). The book presents various examples, software, algorithmic solutions enabling the reader to develop their own codes for solving the problems. The book is accompanied by a website for downloading both data and software for huge data sets modeling in a supervised and semisupervised manner, as well as MATLAB based PCA and ICA routines for unsupervised learning. The book focuses on a broad range of machine learning algorithms and it is particularly aimed at students, scientists, and practicing researchers in bioinformatics (gene microarrays), text-categorization, numerals recognition, as well as in the images and audio signals de-mixing (blind source separation) areas.
This is a book about (machine) learning from (experimental) data. Many books devoted to this broad field have been published recently; one is even tempted to qualify "many" in the previous sentence with "extremely". Thus, there is an urgent need to introduce both the motives for and the content of the present volume in order to highlight its distinguishing features.
Before doing that, a few words about the very broad meaning of data are in order. Today, we are surrounded by an ocean of all kinds of experimental data (i.e., examples, samples, measurements, records, patterns, pictures, tunes, observations, etc.) produced by various sensors, cameras, microphones, pieces of software, and other human-made devices. The amount of data produced is enormous and ever increasing. The first obvious consequence of this fact is that humans cannot handle such massive quantities of data, which usually appear in numeric form as huge (rectangular or square) matrices. Typically, the number of rows (n) gives the number of data pairs collected, and the number of columns (m) gives the dimensionality of the data. Thus, faced with giga- and terabyte-sized data files, one has to develop new approaches, algorithms, and procedures. A few techniques for coping with huge data sets are presented here, which may explain the appearance of the phrase 'huge data sets' in the title of the book.
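To make the row and column convention concrete, the following is a minimal MATLAB sketch; the matrix X and the chosen sizes are illustrative only and are not taken from the book's accompanying software.

    % Illustrative n-by-m data matrix: each of the n rows is one collected
    % data pair (sample), and each of the m columns is one input dimension.
    X = randn(1000, 20);            % e.g., n = 1000 samples in m = 20 dimensions
    [n, m] = size(X);               % recover the number of samples and dimensions
    fprintf('n = %d data pairs, m = %d dimensions\n', n, m);

At the scales discussed in the book, such a matrix would of course have many more rows and columns than in this toy example, which is precisely what motivates the iterative algorithms presented later.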