“Kernel” has many meanings (Wikipedia). I will not speak about the core of your OS, nor about the atomic nucleus or the fictional characters.
We will try to grasp the general concept of “what a kernel is” and “why it is called that” in machine learning. Let’s examine the two broad meanings of “kernel” in mathematics.
First meaning: a measure of injectiveness
In algebra (kernel of a linear transform, of a homomorphism, of a matrix), the kernel is the set of elements that map to the neutral element. For a linear map T : V → W:
ker(T) = {v ∈ V | T(v) = 0}
Edit: for readers not familiar with algebra, I made a picture.
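For a matrix, this definition can also be checked numerically. Here is a minimal sketch, assuming NumPy is available; the matrix `A` and the helper name `null_space_basis` are made up for illustration. It reads a basis of ker(A) off the singular value decomposition:

```python
import numpy as np

def null_space_basis(A, tol=1e-10):
    """Return an orthonormal basis (as columns) of ker(A) = {v | A v = 0}."""
    # SVD: A = U diag(s) Vh. The rows of Vh beyond the numerical rank
    # span the directions that A sends to 0.
    _, s, vh = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    return vh[rank:].T

# Illustrative map T : R^3 -> R^2; it sends every multiple of (1, 1, 1) to 0.
A = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0, -1.0]])

basis = null_space_basis(A)
print(basis.shape)                     # one basis vector: the kernel is a line
print(np.allclose(A @ basis, 0.0))     # every kernel vector maps to 0
```

The kernel here is one-dimensional, so A is not injective: all points on that line share the image 0.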
In category theory (which studies categories of mathematical objects and the morphisms between them), the kernel of a morphism f is more general: it is the most general morphism k such that f∘k is the zero morphism, i.e. applying k and then f returns the neutral element for every input.
In set theory, the kernel of a function f is the partition {D¹…Dⁿ} of the domain D in which two elements share a class exactly when they have the same image:
x, y ∈ Dⁱ ⇔ f(x) = f(y). Equivalently, as an equivalence relation: ker(f) = {(x, y) ∈ D × D | f(x) = f(y)}
All of these definitions tie the kernel to a measure of how far the transform/morphism/function is from being injective (how many elements of the domain fail to have distinct images).
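The set-theoretic version is easy to play with on a finite domain. Below is a small sketch in plain Python; the helper name `kernel_partition` and the example functions are my own choices for illustration. It groups the elements of the domain by their image, which is exactly the partition described above:

```python
from collections import defaultdict

def kernel_partition(f, domain):
    """Group the elements of `domain` by their image under f."""
    classes = defaultdict(list)
    for x in domain:
        classes[f(x)].append(x)
    return list(classes.values())

domain = range(-3, 4)

# x and -x have the same square, so they land in the same class.
square_parts = kernel_partition(lambda x: x * x, domain)
print(square_parts)

# An injective function gives only singleton classes.
double_parts = kernel_partition(lambda x: 2 * x, domain)
print(all(len(c) == 1 for c in double_parts))
```

The coarser the partition, the further the function is from being injective.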
Second meaning: defining an integral transform
An integral kernel, or kernel function, is the function k that defines an integral transform T:
(T∘f)(x) = ∫k(x,x’).f(x’).dx’
If we take k : (u,x) ↦ exp(-i.u.x)/√(2π), where u is the output variable and x the integration variable, we have a Fourier transform, whereas k : (u,x) ↦ exp(-u.x) leads to a Laplace transform.
When a kernel of a single variable has the properties:
∫k(x).dx = 1 and ∀ x, k(x) = k(-x), it can be used for density estimation of a random variable. Perhaps the best-known kernel in this regard is the Gaussian kernel, k : x ↦ exp(-x²/2)/√(2π), but there are many others (commons).
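To make the density estimation use concrete, here is a minimal sketch of a kernel density estimator built on the Gaussian kernel above. The sample values, the bandwidth `h` and the function names are assumptions chosen for illustration; the estimator is the usual x ↦ (1/(n.h)) Σ k((x - xᵢ)/h):

```python
import math

def gaussian_kernel(x):
    """k(x) = exp(-x^2 / 2) / sqrt(2 pi): symmetric and integrates to 1."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def kde(samples, h):
    """Return the density estimate x -> (1 / (n h)) * sum_i k((x - x_i) / h)."""
    n = len(samples)

    def density(x):
        return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)

    return density

samples = [-1.2, -0.4, 0.1, 0.3, 1.5]   # made-up observations
p = kde(samples, h=0.5)

# Crude Riemann sum over [-10, 10]: because the kernel integrates to 1,
# the estimate should integrate to (approximately) 1 as well.
dx = 0.01
total = sum(p(-10 + i * dx) for i in range(2001)) * dx
print(round(total, 3))
```

The bandwidth `h` controls the smoothing: a small `h` gives one bump per observation, a large `h` blurs them together.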
When the integral kernel depends only on the difference between its arguments, we have a convolution kernel:
(T∘f)(x) = ∫k(x - x’).f(x’).dx’
As the probability distribution of the sum of two independent random variables is the convolution of their individual distributions, kernels are central to probabilistic models.
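The discrete analogue of this fact is easy to verify. The sketch below (plain Python, with a hand-written `convolve` helper that is my own naming, not a library call) convolves the distribution of a fair die with itself to get the distribution of the sum of two dice:

```python
from fractions import Fraction

def convolve(p, q):
    """Discrete convolution: r[s] = sum over i + j = s of p[i] * q[j]."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

die = [Fraction(1, 6)] * 6        # index 0 is face 1, index 5 is face 6
two_dice = convolve(die, die)     # index 0 is the sum 2, index 10 is the sum 12

print(two_dice[5])                # probability of rolling a 7: 1/6
print(sum(two_dice))              # still a probability distribution: 1
```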
So, why “kernel machines/methods” for SVMs and Gaussian processes (and PCA etc.)? Because kernels are central to such methods. An SVM is basically a maximum-margin linear classifier. “Linear classifier” means that, given 2 sets of points in a p-dimensional space, it just builds a hyperplane (of dimension p-1) to separate them. “Maximum margin” means that it maximizes the distance between the hyperplane and the points of the 2 sets closest to it (the support vectors). So, how can an SVM separate clusters that are not linearly separable? Using the kernel trick.
Kernel trick
The kernel trick is a way to implicitly increase the dimensionality of the observations so that, in this higher-dimensional space, they become linearly separable. For that, we use a kernel, as defined in the section above, as the inner product of the higher-dimensional space. Observations from S are mapped to V by ϕ : S → V, and the kernel used is k : (x,y) ↦ ⟨ϕ(x),ϕ(y)⟩, the inner product of V. We don’t even need to know ϕ: it suffices to know that k is an inner product in some V, which is guaranteed when k satisfies Mercer’s condition.
There is a cool video on YouTube to visualize that (forgive them the Comic Sans).
To conclude, the name “kernel machines” comes from the fact that, without the kernel trick, SVMs and other linear classifiers would be very limited: it is what enables them to perform non-linear classification.
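A small worked example may help. It is a sketch under my own choice of kernel: the degree-2 polynomial kernel k(x,y) = ⟨x,y⟩² on S = ℝ², whose feature map ϕ(x) = (x₁², √2.x₁.x₂, x₂²) into V = ℝ³ happens to be known explicitly. The code checks that evaluating k in S gives the same number as the inner product in V, without ever needing ϕ to compute k:

```python
import math

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map R^2 -> R^3 for the degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def k(x, y):
    """Kernel evaluated directly in the original 2-D space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = k(x, y)                 # computed in S, phi is never called
rhs = dot(phi(x), phi(y))     # computed in V through the explicit map
print(math.isclose(lhs, rhs))

# In V, the circle x1^2 + x2^2 = 1 becomes the hyperplane <w, phi(x)> = 1,
# so "inside vs outside the circle" is linearly separable there.
w = (1.0, 0.0, 1.0)
inside, outside = (0.5, 0.5), (1.5, -1.0)
print(dot(w, phi(inside)) < 1.0 < dot(w, phi(outside)))
```

For kernels such as the Gaussian one, V is infinite-dimensional and ϕ cannot be written out like this, which is precisely why being able to work with k alone matters.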