Department of Mathematics,
Faculty of Physics, MSU

Theoretical Basics of Big Data Analytics and Real Time Computation Algorithms

Lecturers

Course Goals

The main goal of this course is to provide students with an opportunity to acquire the conceptual background and mathematical tools applicable to Big Data Analytics and Real Time Computation. The course will briefly review specific challenges of Big Data Analytics, such as the problems of extracting, unifying, updating, and merging information, and the specific requirements of data processing, which should be highly parallel and distributed. With these features in mind, we will then study more closely a number of mathematical tools for Big Data analytics: regression analysis, linear estimation, calibration problems, and real-time processing of incoming (potentially infinite) data. We will see how these approaches can be adapted to meet the demands of Big Data. We will also discuss why most widely used algorithmic languages are not well suited to such problems and outline alternative approaches.

 

Course Ideas

Within the traditional approach to information processing we have to collect all the data in one array and apply a processing algorithm to it. If the raw data are distributed among many sites and their total volume is large, this immediately leads to several technical problems (a short sketch of this naive approach follows the list below):

  • Accumulating all the raw data in one place would require excessive storage resources.
  • Feeding huge arrays of data to an algorithm would require large amounts of RAM and computing power.
  • The idea of processing all the data at once does not reveal (and actually hides) possibilities for parallel or distributed computing.
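
To make these points concrete, here is a minimal sketch of the traditional approach, assuming the raw data are numeric files spread across several sites; the file-based setup and all names are illustrative and not taken from the course materials.

```python
# Illustrative sketch of the traditional "collect everything first" approach.
import numpy as np

def load_all_sites(paths):
    """Pull the raw data from every site into one big in-memory array."""
    return np.concatenate([np.loadtxt(p) for p in paths])

def batch_mean(paths):
    # Nothing can be computed until every site has delivered its data,
    # and the whole data set must fit into the memory of a single machine.
    data = load_all_sites(paths)
    return float(data.mean())
```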

The course shows that, instead of collecting all the raw data together and processing it all at once, we can naturally split the whole process into simple, highly independent steps (a minimal sketch follows the list below). Specifically:

  • Extract certain sufficient information from each instance of raw data and represent it in a convenient “canonical” form.
  • Combine pieces of canonical information.
  • Update accumulated canonical information when a new instance of raw data becomes available.
  • Obtain a final result from accumulated information in canonical form.
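
As a minimal sketch of these four steps, consider the illustrative task of computing the mean and variance of a numeric stream; the choice of task, the function names, and the triple used as the canonical form are assumptions made for this example rather than material from the course.

```python
# Canonical form for this example: the triple (count, sum, sum of squares).

def extract(x):
    """Step 1: turn one raw observation into canonical form."""
    return (1, x, x * x)

def combine(a, b):
    """Step 2: merge two pieces of canonical information."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def update(state, x):
    """Step 3: fold a newly arrived raw observation into the accumulated state."""
    return combine(state, extract(x))

def finalize(state):
    """Step 4: obtain the final result from the accumulated canonical form."""
    n, s, s2 = state
    mean = s / n
    variance = s2 / n - mean * mean
    return mean, variance

# Usage: process the stream one element at a time, never storing the raw data.
state = (0, 0.0, 0.0)
for x in [1.0, 2.0, 4.0, 8.0]:
    state = update(state, x)
print(finalize(state))   # (3.75, 7.1875)
```

Because combine is associative, the same four functions also cover the distributed case: independent workers can accumulate their own states and merge them at the end.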

It turns out that the information in canonical form often has a fixed size that does not depend on the amount of raw data used to produce it. As a result, the separate steps of extracting canonical information from the raw data, combining it, and obtaining the final result do not require excessive amounts of memory or computing power. Once two pieces of canonical information have been combined into one, the original pieces can be discarded immediately. Extracting and combining pieces of canonical information can be performed on different computers without any need for synchronization, which provides a wide range of natural options for massively parallel, distributed computing.
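
For example, for ordinary least-squares regression (one of the tools mentioned above) the canonical information can be taken as the pair XᵀX, Xᵀy, whose size depends only on the number of features and never on the number of observations. The sketch below assumes this particular choice of canonical form and is purely illustrative; the estimators treated in the course may differ.

```python
# Illustrative sketch: fixed-size canonical information for least-squares regression.
import numpy as np

def extract(X_chunk, y_chunk):
    """Canonical information of one chunk of raw data: the pair (X^T X, X^T y)."""
    return X_chunk.T @ X_chunk, X_chunk.T @ y_chunk

def combine(a, b):
    """Merging is plain addition, so chunks can be processed in any order,
    on any machine, with no synchronization between workers."""
    return a[0] + b[0], a[1] + b[1]

def finalize(state):
    """Final result: the least-squares coefficients."""
    XtX, Xty = state
    return np.linalg.solve(XtX, Xty)

# Usage: each chunk could live on a different site; only the small canonical
# pairs ever need to be stored or moved between machines.
rng = np.random.default_rng(0)
true_coef = np.array([1.0, -2.0, 0.5])
state = None
for _ in range(5):                      # five "sites", 100 observations each
    Xc = rng.normal(size=(100, 3))
    yc = Xc @ true_coef + 0.01 * rng.normal(size=100)
    piece = extract(Xc, yc)
    state = piece if state is None else combine(state, piece)
print(finalize(state))                  # close to [1.0, -2.0, 0.5]
```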

Passing
Test or examination
The course content
Additional literature
Course Materials