Ruderman, Avraham Pinchas
Description
Many machine learning models are based on similarities between
new examples and previously observed examples. Such models are
extremely flexible and can adapt to a wide range of tasks.
However, if examples are composed of many variables, then even if
we collect a large number of examples, it is possible that no two
examples will be significantly similar. This, in turn, means that
a learning algorithm may require an unreasonably large number of
examples to...[Show more] learn such models.
In this thesis, however, we show that a small modification of
such models, using sim- ilarities on subsets of the variables,
rather than similarities on all variables, can allow us to learn
using fewer examples and without a large increase in the required
com- putational complexity, if two conditions hold. The first
condition is that similarities on certain subsets are indicative
of the targets. The second condition is that the vari- ables can
be partitioned into a small number of groups such that the
variables within each group are highly dependent.
For instance, suppose we want to learn a classifier which takes
images of handwritten digits and outputs the digit drawn in the
image. We would like to learn such a classifier based on example
images. If we measure similarity to a new example image,
depicting a 4, it is quite possible that many of the most similar
previous images we have depict a 9. However, if we measure
similarity on certain subsets of the pixels, for example, on the
region corresponding to the top third of the image, then on this
subset of the pixels, it is unlikely that an image depicting a 4
will be similar to an image depicting a 9. For such a task, a
learning algorithm based on measuring similarities to previous
examples would be able to learn a more accurate classifier, from
fewer examples, by using similarities on subsets.
But how can the learning algorithm efficiently find suitable
subsets of variables on which to measure similarity? This seems
like a daunting task, as the number of possible subsets grows
exponentially in the number of variables. In this thesis, we
suggest that if similarities on certain subsets are indicative of
the targets, and if the variables can be partitioned into a small
number of groups, such that the variables within each group are
highly dependent, then one can find appropriate subsets on which
to measure similarity, by considering combinations of groups of
highly de- pendent variables. The small number of groups leads to
a small number of subsets we need to consider and hence to a
manageable computational effort. The high dependence of the
variables within each group allows us to achieve much of the re-
duction in the number of examples that can be achieved by
measuring similarities on subsets, even though we are only
considering a small portion of all possible subsets.
Items in Open Research are protected by copyright, with all rights reserved, unless otherwise indicated.