Zhiguang Qian (钱智光): Some Statistical Algorithms for Big Data Analytics

Posted: 2016-12-13

Time: December 29, 10:00

Venue: Room 105, Statistics Building

Title: Some Statistical Algorithms for Big Data Analytics

Speaker: Professor Zhiguang Qian (钱智光), Department of Statistics, University of Wisconsin-Madison, USA

Abstract:

 Big Data appear in a growing number of areas such as the Internet, finance, marketing, physics, biology and engineering. For example, every hour more than one million transaction records are stored in the Walmart database, and an HPC-based computer model can produce results from millions of runs. While a large volume of data offers more statistical power, it also brings computational challenges.

 In this talk, we will present several statistical algorithms for analyzing Big Data. The increasing access to large social network data has generated substantial interest in the IT industry. However, because of the scale of these data, traditional analysis methods often become inadequate. We propose a sequential-sampling-enhanced composite likelihood approach for efficient estimation of social intercorrelations in large-scale networks using a spatial model. The proposed approach sequentially takes small samples from the network and adaptively improves the model parameter estimates using what it has learned from previous samples. Through simulation studies based on simulated networks and real networks, we demonstrate significant advantages of the proposed approach over benchmark estimation methods in terms of both computing time and accuracy of parameter estimation.

 We then present a reformulation and generalization of an experimental design algorithm, called orthogonalizing EM (OEM), that leads to a reduction in computational complexity for least squares and penalized least squares problems. The reformulation, named the GOEM (Generalized Orthogonalizing EM) algorithm, is further extended to a wider class of models, including generalized linear models and Cox's proportional hazards model. Synthetic and real data examples are included to illustrate its efficiency compared with standard techniques.

 Finally, we discuss several statistical designs for distributed computing. A growing trend in science and engineering is to distribute the runs of a large job across different groups, machines or locations. Because of the complexity of both the software and the hardware, some batches in a distributed computing job may fail. By ensuring that the analysis can be done at both the batch level and the combined level, these new designs provide a robust solution to this problem.


