Conditional KRR: Injecting Unpenalized Features into Kernel Methods with Applications to Kernel Thresholding

2026-05-25 • Machine Learning

Machine Learning • Artificial Intelligence

AI summary

The authors study conditional kernel ridge regression (conditional KRR), a regression method built on conditionally positive definite kernels. They show that it can be analyzed by reducing it to standard kernel ridge regression with a related "residual" kernel. Their main result is that this reduction costs only an additional term in the expected test risk, bounded by $\mathcal{O}(1/\sqrt{N})$ for sample size $N$. They also study the cases where the unpenalized features are principal eigenfunctions or random features of the kernel, and find that conditional KRR outperforms standard KRR when the component of the regression function captured by these features is more pronounced than the residual part. Both theory and experiments support these conclusions.

Conditionally positive definite kernels, Native space, Kernel ridge regression, Residual kernel, Mercer decomposition, Principal eigenfunctions, Random features, Expected test risk, Statistical learning theory

Authors

Rustem Takhanov, Zhenisbek Assylbekov

Abstract

Conditionally positive definite (CPD) kernels are defined with respect to a function class $\mathcal{F}$. It is well known that such a kernel $K$ is associated with its native space (defined analogously to an RKHS), which in turn gives rise to a learning method -- called conditional kernel ridge regression (conditional KRR) due to its analogy with KRR -- where the estimated regression function is penalized by the square of its native space norm. This method is of interest because it can be viewed as classical linear regression, with features specified by $\mathcal{F}$, followed by the application of standard KRR to the residual (unexplained) component of the target variable. Methods of this type have recently attracted increasing attention. We study the statistical properties of this method by reducing its behavior to that of KRR with another fixed kernel, called the residual kernel. Our main theoretical result shows that such a reduction is indeed possible, at the cost of an additional term in the expected test risk, bounded by $\mathcal{O}(1/\sqrt{N})$, where $N$ is the sample size and the hidden constant depends on the class $\mathcal{F}$ and the input distribution. This reduction enables us to analyze conditional KRR in the case where $K$ is positive definite and $\mathcal{F}$ is given by the first $k$ principal eigenfunctions in the Mercer decomposition of $K$. We also consider the setting where $\mathcal{F}$ consists of $k$ random features from a random feature representation of $K$. It turns out that these two settings are closely related. Both our theoretical analysis and experiments confirm that conditional KRR outperforms standard KRR in these cases whenever the $\mathcal{F}$-component of the regression function is more pronounced than the residual part.
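
As a rough illustration of the "unpenalized features plus kernel part" structure described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the textbook saddle-point formulation used for smoothing with CPD kernels: the fit has the form $f(x) = \sum_i \alpha_i K(x_i, x) + \sum_j \beta_j p_j(x)$, with $(K + \lambda I)\alpha + P\beta = y$ and $P^\top \alpha = 0$. The Gaussian kernel, the feature class (constant and linear terms), the scaling of $\lambda$, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    # Pairwise Gaussian kernel; positive definite, hence also CPD
    # with respect to any finite-dimensional feature class.
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def features(X):
    # Hypothetical choice of the class F: constant and linear terms.
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_conditional_krr(X, y, lam=1e-2):
    # Solve the block system
    #   [K + lam*I  P] [alpha]   [y]
    #   [P^T        0] [beta ] = [0]
    # so the coefficients beta of the F-features are not penalized.
    n = X.shape[0]
    K = gaussian_kernel(X, X)
    P = features(X)
    k = P.shape[1]
    A = np.block([[K + lam * np.eye(n), P],
                  [P.T, np.zeros((k, k))]])
    rhs = np.concatenate([y, np.zeros(k)])
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n:]

def predict(X_train, alpha, beta, X_new):
    # Kernel part plus unpenalized feature part.
    return gaussian_kernel(X_new, X_train) @ alpha + features(X_new) @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 1))

# A target lying entirely in the span of F is fit by the features alone.
y_lin = 2.0 - 3.0 * X[:, 0]
alpha_lin, beta_lin = fit_conditional_krr(X, y_lin)

# A noisy target with a nonlinear residual component.
y_noisy = y_lin + np.sin(4.0 * X[:, 0]) + 0.1 * rng.standard_normal(40)
alpha_n, beta_n = fit_conditional_krr(X, y_noisy)
y_hat = predict(X, alpha_n, beta_n, X)
```

In this sketch, a target that lies exactly in the span of the features is reproduced with `alpha_lin` numerically zero, which reflects the fact that the feature coefficients carry no penalty. The constraint $P^\top \alpha = 0$ keeps the kernel part orthogonal to the features on the training sample.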
