Abstract:
Progress in science is driven by the formulation of hypotheses about phenomena of interest and by the collection of evidence for or against them. While some hypotheses are amenable to deductive proofs, others can only be assessed in a data-driven manner. For most phenomena, scientists cannot
control all degrees of freedom, and hence data is often inherently stochastic. This stochasticity makes it impossible to test hypotheses with absolute certainty. The field of statistical hypothesis testing formalizes the probabilistic assessment of hypotheses, enabling researchers to control error rates, for example the rate at which they reject a true hypothesis, while aiming to reject false hypotheses as often as possible.
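In the standard notation of the field (used here only as an illustration, with $H_0$ the hypothesis under test and $\alpha$ the significance level), this requirement reads
\[
\Pr_{H_0}\bigl(\text{reject } H_0\bigr) \le \alpha,
\qquad \text{while maximizing the power} \quad
\Pr_{H_1}\bigl(\text{reject } H_0\bigr).
\]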
But how do we come up with promising hypotheses in the first place, and how can we test them efficiently? Can we use machine learning systems to generate promising hypotheses automatically? This thesis studies different aspects of these questions.
A simple rule of statistical hypothesis testing states that one should not peek at the data when formulating a hypothesis. This rule is justified if the peeking is done naively, that is, if the hypothesis is then simply tested on the data as if one had not looked at it yet. However, we show that using the same data for learning the hypothesis and for testing it is, in principle, feasible if we can correct for the selection of the hypothesis. We treat
this in the case of the two-sample problem. Given two samples, the hypothesis to be tested is whether the
samples originate from the same distribution. We can reformulate this by testing whether the maximum mean discrepancy (MMD, recalled below) over the unit ball of a reproducing kernel Hilbert space is zero. We show that we can
learn the kernel function, and hence the exact test we use, and perform the test with the same data, while still correctly controlling the Type-I error rate. Moreover, we demonstrate experimentally that taking all the data into account can lead to more powerful testing procedures than the data-splitting approach. However, deriving the formulae that correct for the selection procedure requires strong assumptions, which hold only for one specific estimate of the maximum mean discrepancy, the linear-time estimate. In more general settings it is difficult, if not impossible, to adjust for the selection.
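For reference, here is the standard definition from the two-sample testing literature (it is not spelled out in the abstract itself): for distributions $P$ and $Q$ and a reproducing kernel Hilbert space $\mathcal{H}$ with kernel $k$,
\[
\mathrm{MMD}(P, Q) = \sup_{\|f\|_{\mathcal{H}} \le 1} \bigl( \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)] \bigr),
\]
which is zero if and only if $P = Q$ whenever $k$ is characteristic. The linear-time estimate mentioned above averages the terms $h\bigl((x_{2i-1}, y_{2i-1}), (x_{2i}, y_{2i})\bigr)$ over the paired sample, with
\[
h\bigl((x, y), (x', y')\bigr) = k(x, x') + k(y, y') - k(x, y') - k(x', y).
\]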
We thus also analyze the case where we split the data and use part of it to learn a test statistic. The maximum mean discrepancy implicitly optimizes a mean discrepancy over the unit ball of a reproducing kernel Hilbert space, and often the kernel itself is optimized on held-out data. We instead propose to optimize a witness function directly on held-out data and to use its mean discrepancy as a test statistic. This allows us to directly maximize the test power, simplifies the theoretical treatment, and makes testing more efficient. We provide
and implement algorithms to learn the test statistics. Furthermore, we show analytically that the optimization objective for learning powerful two-sample tests is closely related to the objectives used in standard supervised learning tasks, namely the least-squares loss and the cross-entropy loss. This allows us to use existing machine learning tools when learning powerful hypotheses. Moreover, since we use held-out data for learning the test statistic, we can apply any model-selection and cross-validation technique to maximize performance. To facilitate this for practitioners, we provide an open-source Python package, 'autotst', implementing an interface to existing libraries and running the whole testing pipeline, including the learning of the hypothesis. The presented methods reach state-of-the-art performance on two-sample testing tasks. We also show how to reduce the computational resources required for the test by sacrificing some statistical power, which can be important in practice. Finally, our test makes the results easy to interpret.
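To make the pipeline concrete, the following is a minimal, self-contained sketch of the witness approach described above; it is an illustration only, not the autotst API, and the function name and the choice of logistic regression as the learner are ours. The witness is trained with the cross-entropy loss on one half of the data, and its mean discrepancy on the other half is compared against an asymptotic normal null.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

def witness_two_sample_test(X, Y):
    """Minimal sketch of a witness two-sample test (illustrative, not autotst).

    X, Y: arrays of shape (n, d) and (m, d) with i.i.d. samples from P and Q.
    Returns a p-value for the null hypothesis P = Q.
    """
    # Split each sample: one half to learn the witness, one half to test.
    X_tr, X_te = X[: len(X) // 2], X[len(X) // 2 :]
    Y_tr, Y_te = Y[: len(Y) // 2], Y[len(Y) // 2 :]

    # Learn the witness with a standard supervised objective (cross-entropy):
    # a classifier separating samples from P (label 1) and Q (label 0).
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.vstack([X_tr, Y_tr]),
            np.r_[np.ones(len(X_tr)), np.zeros(len(Y_tr))])

    # Test statistic: difference of the witness means on the held-out halves.
    fX = clf.predict_proba(X_te)[:, 1]
    fY = clf.predict_proba(Y_te)[:, 1]
    tau = fX.mean() - fY.mean()

    # Under the null, the pooled witness values are exchangeable; we use the
    # asymptotic normal null (a permutation test would work as well).
    pooled = np.concatenate([fX, fY])
    sigma = pooled.std(ddof=1) * np.sqrt(1.0 / len(fX) + 1.0 / len(fY))
    return 1.0 - norm.cdf(tau / sigma)
```

The test is one-sided because the witness was trained to take larger values on the sample from P; rejecting when the returned p-value is at most $\alpha$ controls the Type-I error at level $\alpha$ asymptotically.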
Having more computational power potentially allows us to extract more information from data and thus to obtain more significant results. Hence, investigating whether quantum computers can help with machine learning tasks has gained popularity in recent years. We study this question in light of the two-sample problem. We define
the quantum mean embedding, mapping probability distributions onto quantum states, and analyze when
this mapping is injective. While this is conceptually interesting in its own right, we do not find a straightforward way of harnessing any speed-up. The main obstacle is that there is no known way to efficiently create the quantum mean embedding. In fact, fundamental results in quantum information theory indicate that this might be hard in general.
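Schematically, and glossing over technical details, given a feature map $x \mapsto |\psi(x)\rangle$ into quantum states, the quantum mean embedding of a distribution $P$ is the density matrix
\[
\rho_P = \mathbb{E}_{x \sim P}\bigl[\, |\psi(x)\rangle\langle\psi(x)| \,\bigr],
\]
and injectivity of the map $P \mapsto \rho_P$ is precisely what would allow two distributions to be distinguished via their embeddings.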
For two-sample testing, the use of reproducing kernel Hilbert spaces has been established for many years and has proven important both theoretically and practically. We therefore focused on practically relevant aspects, making the tests as powerful and as easy to use as possible. For other hypothesis testing tasks, the use of advanced machine learning tools still lags far behind. For specification tests based on conditional moment restrictions, which are popular in econometrics, we take the first steps by defining a consistent test based on kernel methods. Our test already shows promising performance, but optimizing it, potentially using the other insights gained in this thesis, remains an open task.
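In the standard econometric notation (given here only for orientation), such specification tests ask whether a model with parameter $\theta$ satisfies a conditional moment restriction
\[
\mathbb{E}\bigl[\psi(V; \theta) \mid Z\bigr] = 0 \quad \text{almost surely},
\]
for a known residual function $\psi$ and conditioning variables $Z$; a kernel-based test can, for example, measure violations of this restriction uniformly over functions of $Z$ in a reproducing kernel Hilbert space.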