|
Many data-mining algorithms were developed for business applications such as
customer relationship management. The datasets in this environment, although
large, are simple in the sense that each value records a single fact: a customer
either did or did not buy three widgets, or did or did not fly from Chicago to
Albuquerque.
In contrast, the datasets collected in scientific, engineering, medical,
and social applications often contain values that represent a combination of
different properties of the real world. For example, an observation of a star
produces some value for the intensity of its radiation at a particular frequency.
But the observed value is the sum of (at least) three different components:
the actual intensity of the radiation that the star is (was) emitting, properties
of the atmosphere that the radiation encountered on its way from the star to
the telescope, and properties of the telescope itself. Astrophysicists who want
to model the actual properties of stars must remove (as far as possible) the
other components to get at the ‘actual’ data value. And it is not always clear
which components are of interest. For example, we could imagine a detection
system for stealth aircraft that relied on the way they disturb the image of
stellar objects behind them. In this case, a different component would be the
one of interest.
Most mainstream data-mining techniques ignore the fact that real-world
datasets are combinations of underlying data, and build single models from
them. If such datasets can first be separated into the components that underlie
them, we might expect that the quality of the models will improve significantly.
Matrix decompositions use the relationships among large amounts of
data, and the probable relationships between the components, to do this kind
of separation. In the astrophysical example, for instance, we can plausibly
assume that the changes to observed values caused by the atmosphere are independent
of those caused by the telescope. The changes in intensity might also
be independent of changes caused by the atmosphere, unless the atmosphere
attenuates intensity non-linearly.
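As a rough illustration of this kind of separation, the sketch below builds a
synthetic matrix of star observations and uses a truncated singular value
decomposition (one of several possible matrix decompositions) to pull a
structured component away from an unstructured one. Everything in it is an
assumption made for illustration: the additive model, the rank-1 form of the
'actual' intensities, the Gaussian instrument noise, and all names and sizes.
It is not a model of real astronomical data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stars, n_freqs = 200, 30  # hypothetical dataset size

# Assumed 'actual' component: every star has its own brightness, and all
# stars share one spectral profile, so this matrix has rank 1.
brightness = rng.uniform(1.0, 5.0, size=n_stars)
profile = np.exp(-np.linspace(-2.0, 2.0, n_freqs) ** 2)
actual = np.outer(brightness, profile)

# Assumed instrument component: small noise, independent of the signal.
instrument = rng.normal(scale=0.1, size=(n_stars, n_freqs))

# The observed value is the sum of the components (additive assumption).
observed = actual + instrument

# Truncated SVD: keep only the strongest component of the observed matrix.
U, s, Vt = np.linalg.svd(observed, full_matrices=False)
separated = s[0] * np.outer(U[:, 0], Vt[0])

# Compare how far each matrix is from the 'actual' component.
err_observed = np.linalg.norm(observed - actual)
err_separated = np.linalg.norm(separated - actual)
print("error of raw observations:  ", err_observed)
print("error after rank-1 truncation:", err_separated)
```

The truncation works here only because the assumed 'actual' component is
strongly structured while the noise is not. If the components interact
non-linearly, as in the attenuation case mentioned above, a simple additive
separation of this kind would not be expected to recover them.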
|