
Superkernels for RBF Networks Initialization (Short Paper)


Abstract

One of the basic tasks solved using artificial neural networks is regression. In its canonical form, one seeks to adjust the network's parameters so that its response on the input training data fits the desired outputs reasonably well. The training data $\{x_i, y_i\}_{i=1}^{n}$, $n \in \mathbb{N}$, consist of points from the Euclidean space $\mathbb{R}^{d+1}$, i.e., $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. The quality of the fit is typically measured by the mean integrated squared error (MISE). Various regularization techniques are employed to prevent overfitting. The optimal setting of the parameters can be specified analytically in the linear model (linear computational units); for nonlinear units, however, the network's parameters are set using different variants of stochastic optimization [1].

The formulation and solution of the regression task is relatively straightforward in the realm of probability theory. The training data are considered a random sample from the distribution of the random vector $(X,Y) : (\Omega, \mathcal{A}) \to (\mathbb{R}^{d+1}, \mathcal{B}(\mathbb{R}^{d+1}))$. It is well known that the MISE-optimal estimator of $Y$ given $X$ is the conditional expectation $E[Y \mid X]$; that is, given $X = x$, the regression function is $E[Y \mid X = x]$. An explicit form of $E[Y \mid X = x] : \mathbb{R}^d \to \mathbb{R}$ is computed using the joint density $f$ of the distribution of $(X,Y)$. Having access to $f(x,y) : \mathbb{R}^{d+1} \to [0,\infty)$, it is a classical result that the conditional distribution of $Y$ given $X$ has the density $f(y \mid x) = f(x,y)/f(x)$ and
$$E[Y \mid X = x] = \int y\, f(y \mid x)\, dy = \int y\, \frac{f(x,y)}{f(x)}\, dy = \frac{\int y\, f(x,y)\, dy}{\int f(x,y)\, dy}.$$
Thus the regression function can, at least in principle, be computed in closed form (of course, analytical integration may cause problems). The key to this computation is the joint density $f(x,y)$.

The theory of non-parametric estimation [2] deals with the approximation of $f(x,y)$ on the basis of a random sample $\{x_i, y_i\}_{i=1}^{n} \sim (X,Y)$. Namely, we work with the nonparametric approximation of $E[Y \mid X = x]$ known as the Nadaraya-Watson estimator $f_n^{NW}$ [2, Sec. 1.5]. Given the data $\{x_i, y_i\}_{i=1}^{n}$, the kernel estimate
$$\hat f(x,y) = \frac{1}{n h_n^{d+1}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h_n}\right) K\!\left(\frac{y - y_i}{h_n}\right)$$
of $f$ is constructed for a suitable kernel function $K$ and a bandwidth $h_n > 0$, which depends on the number of data points $n \in \mathbb{N}$. The approximating capabilities of the kernel are related to its order $\ell \in \mathbb{N}$. The Nadaraya-Watson estimator $f_n^{NW}$ uses $\hat f$ to approximate $f$ and consequently $E[Y \mid X = x]$ as follows:
$$f_n^{NW}(x) = \frac{\int y\, \hat f(x,y)\, dy}{\int \hat f(x,y)\, dy} = \frac{\sum_{i=1}^{n} y_i\, K\!\left(\frac{x - x_i}{h_n}\right)}{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h_n}\right)}.$$

We present the idea of using $f_n^{NW}$ to initialize shallow RBF networks for further training so as to meet some regularization criteria. The straightforward approach to regularization is to limit the number of computational units. In RBF networks, one selects $N \ll n$ units whose centers and widths can be specified on the basis of clustering the training data. Instead of setting the coefficients of the linear combination in the network directly from the training data, we linearly regress with respect to $\{x_k, f_n^{NW}(x_k)\}_{k=1}^{N'}$, $N' \in \mathbb{N}$, where $\{x_k\}_{k=1}^{N'}$ regularly spans some region of interest, for example $[\min_i x_i^1, \max_i x_i^1] \times \dots \times [\min_i x_i^d, \max_i x_i^d]$ with $x_i = (x_i^1, \dots, x_i^d)$. The granularity of the span then determines the number of points $N'$. Other schemes for utilizing $f_n^{NW}$ in initializing and learning RBF networks can be presented. The main issue discussed is how to deal with the convergence of $\hat f$ to $f$ in dependence on the properties of $f$.
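The pipeline described above can be sketched in a few lines of NumPy. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it uses a Gaussian product kernel for $K$, a random subset of training points as a stand-in for the clustering step, and hypothetical helper names (`nadaraya_watson`, `regular_grid`).

```python
import numpy as np

def gaussian_kernel(u):
    """Product Gaussian kernel; a common second-order choice of K (an assumption here)."""
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u**2, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)

def nadaraya_watson(x_train, y_train, x_query, h):
    """f_NW(x) = sum_i y_i K((x - x_i)/h) / sum_i K((x - x_i)/h)."""
    u = (x_query[:, None, :] - x_train[None, :, :]) / h    # shape (m, n, d)
    w = gaussian_kernel(u)                                  # kernel weights, shape (m, n)
    return (w * y_train[None, :]).sum(axis=1) / w.sum(axis=1)

def regular_grid(x_train, points_per_dim):
    """Regular span of the bounding box [min_i x_i^j, max_i x_i^j] in each coordinate."""
    axes = [np.linspace(x_train[:, j].min(), x_train[:, j].max(), points_per_dim)
            for j in range(x_train.shape[1])]
    mesh = np.meshgrid(*axes, indexing="ij")
    return np.stack([m.ravel() for m in mesh], axis=1)      # shape (N', d)

# Toy data: n = 200 points in R^2 with noisy scalar responses.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.sin(3.0 * x[:, 0]) + 0.1 * rng.normal(size=200)

# Smoothed targets {x_k, f_NW(x_k)} on a regular grid (N' = 15^2 points here).
grid = regular_grid(x, points_per_dim=15)
targets = nadaraya_watson(x, y, grid, h=0.2)

# RBF initialization: N << n centers (a random subset stands in for clustering),
# a fixed width, and output weights obtained by least squares against the NW targets.
centers = x[rng.choice(len(x), size=20, replace=False)]
width = 0.3
Phi = np.exp(-np.sum((grid[:, None, :] - centers[None, :, :]) ** 2, axis=2) / (2.0 * width**2))
weights, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
```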
The following upper bound applies to the MISE of the presented kernel density estimate [3, Theorem 3.5]:
$$E\!\left[\int_{\mathbb{R}^{d+1}} \big(\hat f(x,y) - f(x,y)\big)^2\, dx\, dy\right] \le C \cdot n^{-\frac{2\beta}{2\beta + d}},$$
where $C$ is a constant with respect to $n$ and $\beta \in \mathbb{N}$ refers to the Sobolev character of the density $f$, which relates to its smoothness. For the bound to be valid, the order of the kernel $K$ is assumed to match $\beta$, i.e., $\ell = \beta$. While the above upper bound increases with $d$, it decreases with $\beta$, and $\lim_{\beta \to \infty} n^{-\frac{2\beta}{2\beta + d}} = n^{-1}$ for the dimension $d$ fixed. So, increasing smoothness can in some sense override the curse of dimensionality. However, the substantial issue here is that the Sobolev character of $f$ is unknown when working with empirical data, and consequently one cannot use a kernel $K$ with the corresponding order $\ell = \beta$ to construct $f_n^{NW}$. In this contribution, we discuss using superkernels [2, p. 27] to construct kernel density estimates and $f_n^{NW}$ for RBF network initialization. Superkernels are kernels that enjoy all orders $\ell \in \mathbb{N}$ simultaneously. If a superkernel is used to construct $\hat f$, then the maximal rate of convergence applies in the upper bound without an exact specification of $\beta$, which overcomes the mentioned problem of the unknown Sobolev character. We discuss the construction of multidimensional superkernels, a relation to the Fourier transform, and results from experiments showing performance in concrete tasks.
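As one concrete illustration of an infinite-order kernel (an assumption for illustration, not necessarily the construction used in the paper), the sketch below uses the trapezoidal flat-top kernel, whose Fourier transform equals 1 on a neighbourhood of the origin, and forms a multidimensional superkernel as a product of one-dimensional ones; the printed values are rough sanity checks on a truncated grid.

```python
import numpy as np

def flat_top_kernel_1d(u):
    """Trapezoidal flat-top kernel K(u) = (cos u - cos 2u) / (pi * u^2).
    Its Fourier transform is 1 on [-1, 1] and 0 outside [-2, 2]; the transform
    is flat near the origin, so the kernel behaves as a kernel of every order l."""
    u = np.asarray(u, dtype=float)
    small = np.abs(u) < 1e-8
    safe = np.where(small, 1.0, u)
    vals = (np.cos(safe) - np.cos(2.0 * safe)) / (np.pi * safe**2)
    return np.where(small, 3.0 / (2.0 * np.pi), vals)   # limit 3/(2*pi) at u = 0

def superkernel_product(U):
    """Multidimensional superkernel as a product of 1-D flat-top kernels over the last axis."""
    return np.prod(flat_top_kernel_1d(U), axis=-1)

# Sanity checks on a truncated grid: unit mass, and a Fourier transform that is
# close to 1 inside [-1, 1] (the flat top) and decays linearly on [1, 2].
u = np.linspace(-200.0, 200.0, 400001)
du = u[1] - u[0]
K = flat_top_kernel_1d(u)
print(np.sum(K) * du)                      # ~ 1.0 (unit mass)
print(np.sum(K * np.cos(0.5 * u)) * du)    # ~ 1.0 (transform at s = 0.5)
print(np.sum(K * np.cos(1.5 * u)) * du)    # ~ 0.5 (transform at s = 1.5)
```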