Sun. Jan 22nd, 2023

Big data, data warehousing and data mining

Big data characteristics

Big data is defined in terms of:

  • Volume of data (how much data is generated)
  • Velocity (how rapidly the data is being generated)
  • Variety (what is the mixture of data to be processed)
  • Storage (large datasets, cloud computing)
  • Processing (real time data analysis, parallel processing)

Key features of data warehouses

Subject orientation: this is an organisational method which ensures that data is arranged with the subject at the core. For example, if you have data about people, it makes sense to link each data item to the person it describes, rather than having to search for the person that a given item relates to. This not only improves the efficiency of any code that queries the data set, but also makes the data model easier for humans to understand.
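
As a rough illustration of subject orientation, the sketch below regroups a flat list of records so that everything known about a person is stored under that person. The record layout and the `by_subject` helper are hypothetical, invented for this example rather than taken from any particular warehouse product.

```python
from collections import defaultdict

# Hypothetical raw records: each item only carries a tag saying who it belongs to.
records = [
    {"person": "alice", "kind": "order", "value": 30},
    {"person": "bob", "kind": "order", "value": 12},
    {"person": "alice", "kind": "address", "value": "1 High St"},
]


def by_subject(rows):
    """Group data items under the subject (here, the person) they describe."""
    subjects = defaultdict(list)
    for row in rows:
        subjects[row["person"]].append((row["kind"], row["value"]))
    return dict(subjects)


people = by_subject(records)

# One lookup now returns everything held about a person.
print(people["alice"])
```

Once the data is arranged this way, a query about one person is a single lookup instead of a scan over every record.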

Data denormalisation: this is the reverse of normalisation. In a relational database, data is separated across multiple related tables. This reduces duplication and ensures atomicity of data. However, in high-load environments, having data split across multiple tables means that the tables must be ‘joined’ back together in order to fetch data. This adds processing overhead and can slow down responses. To offset this, it is sometimes worth denormalising data: that is, increasing repetition of data in order to reduce the number of table joins required to retrieve it. Whilst this leads to larger data sets, it can substantially increase the speed of data retrieval.
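
The trade-off can be sketched with Python's built-in `sqlite3` module. The table and column names (`customers`, `orders`, `orders_flat`) are assumptions made up for this example. The normalised version needs a join to show a customer's name next to an order, whereas the denormalised copy repeats the name on every row so that no join is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalised: customer names are stored once and referenced by id.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);

    -- Denormalised copy: the customer name is repeated on every order row.
    CREATE TABLE orders_flat AS
        SELECT o.id AS order_id, c.name AS customer_name, o.total
        FROM orders o JOIN customers c ON c.id = o.customer_id;
""")

# Reading from the normalised tables requires a join.
joined = conn.execute(
    "SELECT c.name, o.total FROM orders o "
    "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
).fetchall()

# Reading from the denormalised table is a single-table query.
flat = conn.execute(
    "SELECT customer_name, total FROM orders_flat ORDER BY order_id"
).fetchall()

print(joined == flat)
```

Both queries return the same rows. The cost of the flat table is that 'Alice' is stored twice, and a change of name would have to be applied to every copy.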

Non-volatile storage: a data warehouse should store data in a safe and durable format, which means that it relies on large volumes of secondary storage. The choice of secondary storage is typically between magnetic hard drives and SSDs, although in a performance-critical environment SSDs would usually be preferred.

Controlling data load: load balancing systems allow data to be distributed across multiple servers, and queries to be split between those servers. Whilst this increases availability and service capacity, it is not without drawbacks: separating a database across multiple servers creates a risk that data becomes inconsistent between one server and another. Keeping the servers synchronised also adds a significant overhead.
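
A minimal sketch of the two ideas above, assuming three hypothetical servers named `db-1` to `db-3`: data is placed on a server by hashing its key, and incoming queries are handed out in round-robin order. Real load balancers also handle health checks, replication and rebalancing, none of which is shown here.

```python
import zlib
from itertools import cycle

SERVERS = ["db-1", "db-2", "db-3"]  # hypothetical server names


def shard_for(key: str) -> str:
    """Choose the server that stores a key.

    CRC32 is a stable hash, so the same key always maps to the same server.
    """
    return SERVERS[zlib.crc32(key.encode("utf-8")) % len(SERVERS)]


_next_server = cycle(SERVERS)


def route_query() -> str:
    """Hand each incoming query to the next server in turn (round robin)."""
    return next(_next_server)


routed = [route_query() for _ in range(4)]
print(routed)                     # the fourth query wraps back to the first server
print(shard_for("customer:42"))   # always the same server for this key
```

The synchronisation problem appears as soon as the same data is held on more than one server: every write must reach each copy before all servers can be trusted to give the same answer.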

Data mining

Data mining is the process of automating searches across large volumes of data. For an in-depth discussion see here: Data Mining Methods | Top 8 Types Of Data Mining Method With Examples (