Venue: Room A1514, Science Building
Title: A distance metric-based uniform subsampling method for nonparametric models
Subsampling the original data set is an efficient and popular strategy for handling massive data that are too large to be modeled directly. While most theoretical work assumes that the subsamples are drawn independently with equal probability, accurate inference and prediction benefit from a fast subsampling scheme that selects observations intelligently. The majority of existing subset selection methods are designed for linear models or their extensions. In this paper, we propose a novel subsampling method for efficient nonparametric modeling of high-volume data sets. It is a proportionate sampling method that uses distance metric-based strata to select subsamples. To minimize the maximal distance between pairs of samples that lie in the same stratum, Voronoi cells of thinnest covering lattices are used to partition the space. With the help of an algorithm that quickly identifies the cell in which an observation lies, the computational cost of our subsampling method is proportional to the number of observations and independent of the number of cells, which makes the method applicable to extremely large data sets. Results from simulation studies and real data analysis show that the new method performs remarkably better than existing approaches when used in conjunction with k-nearest neighbor or Gaussian process models.
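The stratify-then-sample idea in the abstract can be illustrated with a minimal Python sketch. This is not the speaker's algorithm: it swaps in the scaled integer lattice (hypercube cells) for the thinnest covering lattice, because the nearest point of the integer lattice is found by simple rounding. The function name lattice_stratified_subsample and the parameters cell_width and fraction are hypothetical, chosen only for this illustration.

```python
import numpy as np
from collections import defaultdict

def lattice_stratified_subsample(X, cell_width, fraction, seed=0):
    """Proportionate subsampling with strata given by cubic lattice cells.

    Assumption: the lattice cell_width * Z^d stands in for the thinnest
    covering lattice described in the abstract.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must lie in (0, 1]")
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Cell lookup: rounding gives the nearest lattice point, so the cost is
    # one operation per observation, regardless of how many cells exist.
    keys = np.rint(X / cell_width).astype(np.int64)
    strata = defaultdict(list)
    for i, key in enumerate(map(tuple, keys)):
        strata[key].append(i)
    chosen = []
    for members in strata.values():
        # Roughly proportionate allocation; keeping at least one point per
        # occupied cell is a choice made for this sketch.
        k = max(1, int(round(fraction * len(members))))
        chosen.extend(rng.choice(members, size=k, replace=False))
    return np.sort(np.array(chosen, dtype=np.int64))

# Usage on synthetic data (not data from the talk)
X = np.random.default_rng(1).uniform(0.0, 1.0, size=(10000, 2))
idx = lattice_stratified_subsample(X, cell_width=0.1, fraction=0.05)
X_sub = X[idx]
```

Because every occupied cell contributes points, the subsample covers the input space more evenly than simple random sampling of the same size would on average, which is the property that nearest-neighbor and Gaussian process models tend to benefit from.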
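Wang Dianpeng (王典朋) is an associate research fellow at Beijing Institute of Technology and an executive council member of both the Beijing Big Data Association and the China Association of Young Statisticians. His research focuses on the design of computer experiments, machine learning, and the statistical analysis of industrial big data. He has published more than ten papers in leading statistics journals, including Technometrics, Statistica Sinica, Reliability Engineering & System Safety, and Journal of Quality Technology. He has led a number of projects, including a Young Scientists Fund project and a General Program project of the National Natural Science Foundation of China, as well as a satellite and rocket common technology project of the State Administration of Science, Technology and Industry for National Defense.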