Introduction to Big Data#

1. Terminologies#

  • Big data problems: not only the processing power but also the size of the data is a limiting factor in finding a timely solution.

  • Big Data (Industry, Social Sciences)​

    • Input data carry the characteristics of Big Data (the 4 Vs).

    • Computational process is simple and straightforward, with minimal intermediate data being generated​.

  • Data Intensive Computing (Natural Sciences)​

    • Input data may or may not be big data​.

    • Computational process produces massive and complex intermediate data that needs to be analyzed during the process.

2. Big Data in Science (Data-enabled Science)#

  • Scientific process:

    • empirical

    • theoretical

    • computational

    • data-enabled/data-intensive science (the fourth pillar of scientific discovery).

  • Big data analytics in science and engineering for data sets that are:

    • Too big

    • Too complex

    • Too fast (streaming)

    • Too noisy

    • Too heterogeneous

Figure: Examples of data analytic problems

3. Big Data in Industry#

Gartner Hype Cycle 2014#

Figure: Gartner Hype Cycle 2014

Gartner Hype Cycle 2019#

Figure: Gartner Hype Cycle 2019

Gartner Hype Cycle 2020#

Figure: Gartner Hype Cycle 2020

4. The Vs of Big Data#

  • Originally there were four: Volume, Velocity, Variety, Veracity.

  • Other Vs have been added over time.

  • Volume: the size of the files used to archive and spread data.

  • Velocity: the speed with which data is generated and processed.

  • Variety: formats and purposes of data, which may include objects as different as samples of animal tissue, free-text observations, humidity measurements, GPS coordinates, and the results of blood tests.

  • Veracity: the extent to which the quality and reliability of big data can be guaranteed. Data with high volume, velocity and variety are at significant risk of containing inaccuracies, errors and unaccounted-for bias.

  • Validity: the selection of appropriate data with respect to the intended use. The choice of a specific dataset as evidence base requires adequate and explicit justification, including recourse to relevant background knowledge to ground the identification of what counts as data in that context.

  • Volatility: the extent to which data can be relied upon to remain available, accessible and re-interpretable despite changes in archival technologies. This is significant given the tendency of formats and tools used to generate and analyse data to become obsolete, and the efforts required to update data infrastructures so as to guarantee data access in the long term.

  • Value: the multifaceted forms of significance attributed to big data by different sections of society, which depend as much on the intended use of the data as on historical, social and geographical circumstances.

5. Programming paradigm for big data#

  • Multi-faceted challenges:​

    • Require not only parallel computation but also parallel data processing​.

  • New computational tools and strategies​.

  • New data intensive scalable architectures​.

  • Science is moving increasingly from hypothesis-driven to data-driven discoveries​.

  • Industry is at a stage where big data infrastructures are integrated and big data sets are beginning to be analyzed to produce business insights.

  • Example general paradigm (a minimal code sketch follows the list below):

Figure: Data parallel programming
  • It is difficult to write parallel programs​

    • Difficulty in converting algorithms from serial to parallel.

    • Difficulty in identifying the different ways in which the program can fail.

    • No reliable way to detect failure of a process​.

  • It is even more difficult to write parallel programs at large scale​

    • The same set of errors, but they scale up with the size of the system.

  • It is even more difficult to debug large scale parallel programs​

    • What if the program doesn’t fail but only produces incorrect results?
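
To make the data-parallel paradigm concrete, here is a minimal sketch in Python: the same word-counting function is applied to independent chunks of the input, and the partial results are merged at the end. The file name, chunk count, and helper names are illustrative assumptions, not part of the lecture material.

```python
# A minimal data-parallel sketch (assumes a local file named input.txt;
# names and chunking scheme are illustrative only).
from collections import Counter
from multiprocessing import Pool


def count_words(lines):
    """Map step: count word occurrences in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def split_into_chunks(lines, n_chunks):
    """Partition the input so every worker gets roughly equal work."""
    size = max(1, len(lines) // n_chunks)
    return [lines[i:i + size] for i in range(0, len(lines), size)]


if __name__ == "__main__":
    with open("input.txt") as f:        # assumed sample input file
        lines = f.readlines()

    chunks = split_into_chunks(lines, n_chunks=4)

    # Parallel "map": each worker processes its own chunk of the data.
    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, chunks)

    # "Reduce": merge the partial results into one answer.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)

    print(total.most_common(10))
```

Even in this small sketch the failure modes listed above appear: a single worker crashing, a bad chunk boundary, or a merge bug can silently change the final counts without raising an error.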

6. Data-intensive approach#

  • Scale “out”, not “up”​

    • It is easier and cheaper to add nodes to an existing cluster than to build a faster cluster.​

  • Move computation to the data (illustrated in the sketch after this list)

    • Reduce data movement.​

  • Sequential processing, avoid random access​

    • Reduce seek movement on disks.​

  • Seamless scalability
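
As a rough illustration of "move computation to the data" and sequential processing, the toy sketch below pretends that each entry in a dictionary is a data partition stored on a different node; only the small counting function and the even smaller partial results move, and each partition is scanned once from start to finish. All names and the partition layout are invented for the illustration and do not come from any particular framework.

```python
from collections import Counter

# Pretend each entry is a data partition stored on a different node
# (an invented layout for illustration only).
nodes = {
    "node-1": ["big data needs new tools", "volume velocity variety veracity"],
    "node-2": ["move computation to the data", "reduce data movement"],
    "node-3": ["scale out not up", "sequential processing avoids random access"],
}


def local_word_count(partition):
    """The small piece of computation shipped to each node's local data."""
    counts = Counter()
    for line in partition:          # one sequential pass, no random access
        counts.update(line.split())
    return counts


# Only compact partial results travel back to be merged, not the raw data.
partial_results = [local_word_count(partition) for partition in nodes.values()]

total = Counter()
for partial in partial_results:
    total.update(partial)

print(total.most_common(5))
```

Adding another entry to the dictionary ("scaling out") only adds one more local task and one more small partial result to merge, which is the sense in which this approach scales without redesigning the computation.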