Big data and statistics: A statistician’s perspective


Abstract


Big Data brings unprecedented power to address scientific, economic and societal issues, but also amplifies the possibility of certain pitfalls. These include using purely data-driven approaches that disregard understanding the phenomenon under study, aiming at a dynamically moving target, ignoring critical data collection issues, summarizing or preprocessing the data inadequately and mistaking noise for signal. We review some success stories and illustrate how statistical principles can help obtain more reliable information from data. We also touch upon current challenges that require active methodological research, such as strategies for efficient computation, integration of heterogeneous data, extending the underlying theory to increasingly complex questions and, perhaps most importantly, training a new generation of scientists to develop and deploy these strategies.


Keywords


Big Data; statistics; case studies; pitfalls; challenges


Creative Commons License
Texts in the journal are, unless otherwise indicated, published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.