Introduction To Big Data

To understand Big Data, you first need to know what data is.

What Is Data?

Data is factual information in a form that can be input to, created by, processed by, stored in, and output by a computer. It can take the form of characters such as letters, numbers, punctuation marks, mathematical operators, and control characters. It can also take the form of photographic display elements, such as pixels.

Put another way, data is the quantities, characters, or symbols on which a computer performs operations, and which may be stored and transmitted as electrical signals and recorded on magnetic, optical, or mechanical recording media.

What Is Big Data?

Big Data is also data, but at a very large scale. The term describes a collection of data that is huge in size and still growing exponentially with time. In short, such data is so large and complex that traditional data management tools cannot store or process it efficiently.

A Short History Of Big Data

The term Big Data has been around since 2005, when it was launched by O’Reilly Media. However, the use of Big Data and the need to understand all available data have been around much longer.

The Properties Of Big Data

Big Data has certain characteristics and is commonly defined using four of them, known as the 4Vs:

  1. Volume
    • Amount of data generated
    • Online & Offline transactions
    • In kilobytes or terabytes
    • Saved in records, tables, files
  2. Velocity
    • Speed of generating data
    • Generated in real-time
    • Online & offline data
    • In streams, batch or bits
  3. Variety – The data that is generated is completely heterogeneous in the sense that it could be in various formats like video, text, database, numeric, sensor data and so on.
  4. Veracity – Knowing whether the data that is available is coming from a credible source is of utmost importance before deciphering and implementing Big Data for business needs.

Types Of Big Data

1. Structured Data

Structured data is data stored in rows and columns, mostly numerical, where the meaning of each data item is defined.

This type of data constitutes about 10% of today’s total data and is accessible through database management systems.

Example sources of structured (or traditional) data include official registers created by governmental institutions to store data on individuals, enterprises, and real estate, and industrial sensors that collect data about processes.

Today, sensor data is one of the fastest-growing areas, particularly because sensors are installed in plants to monitor movement, temperature, location, light, vibration, pressure, liquid, and flow.

The language used for managing structured data is Structured Query Language, also known as SQL.
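As a rough illustration of how rows, columns, and SQL fit together, the sketch below stores a few sensor readings in an in-memory SQLite database and averages one metric. The table name, column names, and values are hypothetical and chosen only for this example.

```python
import sqlite3

# In-memory database; the table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor_readings (sensor_id TEXT, metric TEXT, value REAL)"
)

# Every row has the same columns, and each column has a defined meaning.
rows = [
    ("s1", "temperature", 21.5),
    ("s1", "temperature", 22.5),
    ("s2", "pressure", 101.3),
]
conn.executemany("INSERT INTO sensor_readings VALUES (?, ?, ?)", rows)

# Because the structure is known in advance, SQL can filter and aggregate directly.
avg_temperature = conn.execute(
    "SELECT AVG(value) FROM sensor_readings WHERE metric = ?",
    ("temperature",),
).fetchone()[0]
print(avg_temperature)

conn.close()
```

The fixed schema is what makes the query possible: the database knows in advance that `value` is numeric and that `metric` names what was measured.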

2. Unstructured Data

Unstructured data can take the form of customer complaints, contracts, or internal emails. This type of data accounts for about 90% of the data created in this century.

Examples of unstructured data include text, video, audio, mobile activity, social media activity, satellite imagery, and surveillance imagery.

Unstructured data is difficult to deconstruct because it has no pre-defined model, meaning it cannot easily be organized in relational databases. Instead, non-relational (NoSQL) databases are a better fit for managing unstructured data.
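To make the idea of data without a pre-defined model more concrete, here is a minimal sketch in plain Python that mimics how a document-oriented store might keep records with differing fields. It does not use any real NoSQL database API; the `find` helper and all field names are hypothetical.

```python
import json

# Hypothetical documents: each record carries its own set of fields.
documents = [
    {"type": "email", "sender": "a@example.com", "body": "Invoice attached."},
    {"type": "complaint", "customer_id": 17, "text": "Delivery was late."},
    {"type": "contract", "parties": ["Acme", "Globex"], "pages": 12},
]

# Serialize each document to JSON, roughly as a document store might persist it.
stored = [json.dumps(doc) for doc in documents]


def find(stored_docs, **criteria):
    """Return documents whose fields equal every given criterion.

    A document that lacks one of the requested fields does not match.
    """
    results = []
    for raw in stored_docs:
        doc = json.loads(raw)
        if all(key in doc and doc[key] == value for key, value in criteria.items()):
            results.append(doc)
    return results


complaints = find(stored, type="complaint")
print(complaints)
```

No table definition is needed before storing a new kind of record, which is the flexibility that a fixed set of rows and columns does not offer.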

3. Geographic Data

Geographic data is data related to roads, buildings, lakes, addresses, people, workplaces, and transportation routes that is generated from geographic information systems.

These data link place, time, and attributes (i.e. descriptive information). Digital geographic data has huge benefits over traditional sources such as paper maps, written reports from explorers, and spoken accounts, in that digital data is easy to copy, store, and transmit.

More importantly, digital geographic data is easy to transform, process, and analyze. Such data is useful in urban planning and for monitoring environmental effects.
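As a small sketch of what linking place, time, and attributes can look like in practice, the example below defines a hypothetical `GeoRecord` type and computes the great-circle distance between two records with the haversine formula. The record fields and sample coordinates are assumptions, and the formula treats the Earth as a sphere, so the result is an approximation.

```python
import math
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class GeoRecord:
    """Hypothetical record linking place, time, and descriptive attributes."""
    latitude: float
    longitude: float
    observed_at: datetime
    attributes: dict = field(default_factory=dict)


def haversine_km(a, b, radius_km=6371.0):
    """Approximate great-circle distance in kilometres on a spherical Earth."""
    lat1, lon1 = math.radians(a.latitude), math.radians(a.longitude)
    lat2, lon2 = math.radians(b.latitude), math.radians(b.longitude)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


station_a = GeoRecord(0.0, 0.0, datetime(2020, 1, 1, 12, 0), {"kind": "weather station"})
station_b = GeoRecord(0.0, 1.0, datetime(2020, 1, 1, 12, 5), {"kind": "weather station"})

distance_km = haversine_km(station_a, station_b)
print(round(distance_km, 2))
```

A calculation like this is trivial on digital records, whereas doing the same from a paper map would mean measuring by hand.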

4. Real-time Media

Real-time media is the real-time streaming of live or stored media data. A special characteristic of real-time media is the amount of data being produced, which will become increasingly challenging to store and process in the future.

One of the main sources of media data is services such as YouTube, Flickr, and Vimeo, which produce a huge amount of video, pictures, and audio. Another important source of real-time media is video conferencing (or visual collaboration), which allows two or more locations to communicate simultaneously through two-way video and audio transmission.

5. Natural Language Data

Natural language data is human-generated data, particularly in verbal form. Such data differs in its level of abstraction and level of editorial quality.

The sources of natural language data include speech capture devices, landline phones, mobile phones, and Internet of Things devices, which generate large volumes of text-like communication between devices.

6. Network Data

Network data concerns very large networks, such as social networks (e.g. Facebook and Twitter), information networks (e.g. the World Wide Web), biological networks (e.g. biochemical, ecological and neural networks), and technological networks (e.g. the Internet, telephone and transportation networks).

Network data is represented as nodes connected via one or more types of relationship. In social networks, nodes typically represent people. In information networks, nodes represent data items (e.g. webpages). In technological networks, nodes may represent Internet devices (e.g. routers and hubs) or telephone switches. In biological networks, nodes may represent neural cells.
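The node-and-relationship representation described above can be sketched with a simple adjacency list. The people, relationship types, and the `neighbors` helper below are hypothetical and only illustrate the idea of nodes connected by more than one type of relationship.

```python
from collections import defaultdict

# Hypothetical edges: (source node, relationship type, target node).
edges = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("alice", "friend_of", "carol"),
    ("carol", "follows", "alice"),
]

# Adjacency list: each node maps to its outgoing (relationship, target) pairs.
adjacency = defaultdict(list)
for source, relation, target in edges:
    adjacency[source].append((relation, target))


def neighbors(node, relation=None):
    """Return the sorted targets reachable from node, optionally for one relationship type."""
    return sorted(
        {target for rel, target in adjacency.get(node, []) if relation is None or rel == relation}
    )


# Number of outgoing relationships per node.
out_degree = {node: len(links) for node, links in adjacency.items()}

print(neighbors("alice"))
print(neighbors("alice", "follows"))
print(out_degree)
```

The same structure applies to the other network types: swap people for webpages, routers, or neural cells, and the relationship labels for hyperlinks, cables, or synapses.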

7. Linked Data

Linked data is data built upon standard Web technologies such as HTTP, RDF, SPARQL and URIs to share information that can be semantically queried by computers.

This allows data from different sources to be connected and read. Linked data lets the Web connect related data that was not linked in the past, by providing the mechanisms for linking data and lowering the barriers to doing so.
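The following sketch shows, in plain Python, how data from two sources can be connected once both identify things by shared URIs. It models RDF-style statements as (subject, predicate, object) tuples and uses a hypothetical `match` helper with wildcards. This only imitates the spirit of a SPARQL pattern and is not real RDF or SPARQL tooling; all URIs under example.org are made up.

```python
# Hypothetical namespace; every URI below is invented for illustration.
EX = "http://example.org/"

# Two independent sources, each a set of (subject, predicate, object) triples.
source_a = {(EX + "alice", EX + "worksFor", EX + "acme")}
source_b = {(EX + "acme", EX + "locatedIn", EX + "berlin")}

# Because both sources use the same URI for "acme", a plain union links them.
merged = source_a | source_b


def match(pattern, data):
    """Return the sorted triples matching pattern; None acts as a wildcard."""
    return sorted(
        triple
        for triple in data
        if all(p is None or p == value for p, value in zip(pattern, triple))
    )


# Follow the link across sources: alice -> employer -> employer's city.
employer = match((EX + "alice", EX + "worksFor", None), merged)[0][2]
city = match((employer, EX + "locatedIn", None), merged)[0][2]
print(city)
```

Neither source alone can say where alice's employer is located; the answer only appears after the two sets are combined through the shared identifier.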
