The Deluge of Spurious Correlations in Big Data: Randomness in Nature and Data

23 March 2016

- Cappella Guinigi

Very large databases are a major opportunity for science, and data analytics is a remarkable new field of investigation in computer science. Unfortunately, the effectiveness of these tools is being used to support a “philosophy” opposed to the scientific method as developed throughout history. According to this view, computer-discovered correlations should replace understanding and guide prediction and action. Consequently, there would be no need to give scientific meaning to phenomena by proposing, say, causal relations, since regularities in very large databases are enough: “with enough data, the numbers speak for themselves”. The “end of science” is proclaimed. Using classical results from ergodic theory, Ramsey theory and algorithmic information theory, we show that this “philosophy” is wrong. For example, we prove that very large databases have to contain arbitrary correlations. These correlations appear only due to the size, not the nature, of the data. They can be found in “randomly” generated, large enough databases, which, as we will prove, implies that most correlations are spurious. Too much information tends to behave like very little information, and the mathematics of randomness in nature and in data can help us understand this phenomenon. The scientific method can be enriched by computer mining in immense databases, but not replaced by it.
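The claim that correlations show up in randomly generated data simply because there is a lot of it can be illustrated numerically. The sketch below is not the Ramsey-theoretic or ergodic argument of the talk, only a small empirical analogue. It draws independent Gaussian series and reports the strongest pairwise Pearson correlation as the number of series grows. The function names, the series length of 20, and the seed are assumptions chosen for illustration.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def max_abs_correlation(num_series, length, seed=0):
    """Largest |r| over all pairs of independent random series.

    The series are generated one after another from a single seeded
    generator, so a larger num_series extends the same data set rather
    than drawing a fresh one.
    """
    rng = random.Random(seed)
    series = [
        [rng.gauss(0.0, 1.0) for _ in range(length)]
        for _ in range(num_series)
    ]
    return max(
        abs(pearson(a, b)) for a, b in itertools.combinations(series, 2)
    )


if __name__ == "__main__":
    # Every series is independent noise, so any correlation found is spurious.
    for k in (10, 50, 200):
        print(k, round(max_abs_correlation(k, 20), 3))
```

Because the series are independent by construction, every strong correlation the search finds is spurious. Adding more series can only raise the maximum, which is the small-scale version of the size effect described above.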

Speaker:

Longo, Giuseppe

Units:

SysMA
