Big Data Glossary

“Data is the new science. Big Data holds the answers.” When the Chief Operating Officer of EMC Corporation said this, he may not have realized that many people are still searching for answers to the terms surrounding big data itself. This extensive glossary of big data terminology may hold the answers they seek.

  • ALGORITHM - A set of mathematical steps or rules that a computer follows to analyze data. Its goal is to operate on data to solve problems.
  • ANALYTICS - Using algorithms and statistics to derive meaning from data. For example, a website can analyze a visitor’s profile and activity in real time to learn which relevant products or services they might like. Essentially, any kind of intelligence derived from big data is analytics.
  • ARTIFICIAL INTELLIGENCE - The capability of a machine to apply information gained from previous experience to new situations automatically, much as a human would, but far faster than humans could ever hope. Machine learning is one branch of it.
  • BEHAVIORAL ANALYTICS - Analytics that reveal behavioral aspects such as sentiment, mood, and taste.
  • BIG DATA SCIENTIST - What a scientist is to science, a big data scientist is to big data: someone who can analyze patterns in large data sets and develop the algorithms to exploit them.
  • BIOMETRICS - Identification of humans by their physical traits.
  • BUSINESS INTELLIGENCE - Where analytics uses statistical and mathematical tools to predict future outcomes, BI analyzes past information/data to report on what has already happened.
  • CLOUD - Internet-based computing in which applications and data are hosted remotely rather than locally, allowing users to store and access data from anywhere.
  • CLICKSTREAM ANALYTICS - The analysis of the sequence of clicks users make as they move through a specific web page or site.
  • COMPARATIVE ANALYSIS - Comparing two or more data sets to obtain more accurate results from very large data sets.
  • DASHBOARD - A visual display that gives a quick report on the status of analyses performed by the algorithms.
  • DATA ANALYTICS - Examining data in order to make business decisions and draw conclusions.
  • DATA SCIENCE - A field that comprises data cleansing, preparation, and analysis.
  • DATA SCIENTIST - An expert who develops algorithms in order to solve complex problems.
  • DATA MINING - The process of uncovering patterns from large data sets.
  • DISTRIBUTED CACHE - A data cache that is spread across multiple systems but works as one. It is used to improve performance.
  • EXABYTE - One million terabytes, or 1 billion gigabytes of information.
  • ETL (EXTRACT, TRANSFORM AND LOAD) - The process of extracting data from a source database, transforming it into a consistent format, and loading it into a target database, thus creating new combinations of data for analytics.
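The ETL pattern above can be sketched in a few lines of Python. This is a minimal illustration, not a real pipeline; the field names and records are hypothetical, and the "databases" are plain Python lists:

```python
# Minimal ETL sketch: extract rows from a source, transform them into a
# consistent format, and load them into a target store.

# Extract: hypothetical raw records pulled from a source system
source_rows = [
    {"name": "alice", "amount": "42.50"},
    {"name": "BOB", "amount": "7.00"},
]

def transform(row):
    # Normalize names and convert string amounts to numbers
    return {"name": row["name"].title(), "amount": float(row["amount"])}

# Load: write the cleaned rows into the target "database"
target_db = [transform(r) for r in source_rows]
```

Real ETL tools add scheduling, error handling, and incremental loads on top of this same three-step shape.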
  • HADOOP - Administered by the Apache Software Foundation, Hadoop is an open-source framework for storing and processing big data across clusters of commodity hardware.
  • HANA - SAP’s in-memory software/hardware platform: a high-performance analytical application designed for high-volume data transactions.
  • INTERNET OF THINGS - A giant network of digitally connected and interrelated “things” (sensors on people and animals, and in devices, machines, and even vehicles) that can transfer data over a network.
  • MACHINE LEARNING - A branch of artificial intelligence: the use of algorithms to allow a computer to learn what action to take when a specific pattern or event occurs.
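As a toy illustration of learning from examples, here is a 1-nearest-neighbor classifier in plain Python. The training points and labels are invented for illustration; real systems use far richer models:

```python
# Machine-learning sketch: a 1-nearest-neighbor classifier "learns" from
# labeled examples and predicts the label of the closest known point.
training = [
    ((1.0, 1.0), "small"),
    ((9.0, 9.0), "large"),
    ((8.0, 7.0), "large"),
]

def predict(point):
    # Squared Euclidean distance between two 2-D points
    def dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    # Pick the training example nearest to the query point
    nearest = min(training, key=lambda ex: dist(ex[0], point))
    return nearest[1]
```

The key idea, shared by far more sophisticated algorithms, is that behavior is induced from data rather than hand-coded rules.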
  • MAPREDUCE - A programming model that breaks a problem into pieces, distributes them across multiple computers on the same network, and then combines the individual results into one report. Google popularized the model, and Apache provides an implementation as part of its Hadoop framework.
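The classic demonstration of the model is a word count. This single-machine Python sketch mimics the two phases; a real MapReduce job would run the map phase on many machines in parallel:

```python
from collections import Counter
from functools import reduce

# MapReduce-style word count: "map" each document to its own word counts,
# then "reduce" the partial counts into one combined result.
documents = ["big data holds answers", "data is big"]

def map_phase(doc):
    return Counter(doc.split())   # per-document partial counts

def reduce_phase(a, b):
    return a + b                  # merge two partial counts

word_counts = reduce(reduce_phase, map(map_phase, documents))
```

Because each map call is independent, the work distributes naturally across a cluster, which is exactly what Hadoop automates.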
  • METADATA - Data used to describe other data. For example, a data file’s size and its location.
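The example in the entry above (a file's size and location) can be shown directly. This sketch creates a temporary file just so it has something to describe:

```python
import os
import tempfile

# Metadata example: the file's size and location describe the file
# without being part of its contents.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"hello metadata")
    path = f.name

metadata = {"location": path, "size_bytes": os.stat(path).st_size}
os.remove(path)  # clean up the temporary file
```

The dictionary holds data *about* the file, not the file's contents themselves.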
  • MULTI-THREADING - The process of breaking an operation into different threads within the system for faster execution.
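A minimal multi-threading sketch, splitting a summation across two threads and combining the partial results:

```python
import threading

# Multi-threading sketch: split a summation into two threads,
# each handling half the work, then combine the partial results.
numbers = list(range(1, 101))
partials = [0, 0]

def worker(idx, chunk):
    partials[idx] = sum(chunk)   # each thread sums its own chunk

threads = [
    threading.Thread(target=worker, args=(0, numbers[:50])),
    threading.Thread(target=worker, args=(1, numbers[50:])),
]
for t in threads:
    t.start()
for t in threads:
    t.join()                     # wait for both threads to finish

total = sum(partials)
```

Note that in CPython the global interpreter lock limits speedup for CPU-bound work like this; threads shine most for I/O-bound operations.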
  • NATURAL LANGUAGE PROCESSING - The ability of a computer to understand human language accurately.
  • NoSQL - A class of database management systems that do not use traditional relational database structures and are designed to handle large data volumes. Cassandra, initially developed by Facebook, is one NoSQL solution.
  • NewSQL - A class of modern relational databases that combine the SQL interface and transactional guarantees of traditional systems with the scalability of NoSQL.
  • OBJECT-ORIENTED DATABASE - Also called an Object Database Management System, it represents information as objects rather than as tables of values.
  • OBJECT BASED IMAGE ANALYSIS - Traditional image analysis works on data from individual pixels, whereas object-based image analysis groups pixels into meaningful objects and analyzes those.
  • OPERATIONAL DATASTORE - A location to gather and store data from various sources so that more operations can be performed on it before sending for the final reporting.
  • ONLINE TRANSACTIONAL PROCESSING - The process of giving users access to large amounts of transactional data so that they can derive meaning from it.
  • PARALLEL DATA ANALYSIS - Breaking up an analytical problem into multiple components and running an algorithm on each of them simultaneously within a system.
  • PETABYTE - A petabyte (PB) is 10^15 bytes of data: 1,000 terabytes (TB) or 1,000,000 gigabytes (GB).
  • PREDICTIVE ANALYSIS - Predicting what someone is most likely to buy, visit, do and even the prediction of future events. Based on analyzing different data sets like historical, transactional and social, it identifies different risks and opportunities.
  • QUERY ANALYSIS - The process of analyzing queries of users to provide the best search results in future.
  • QUANTIFIED SELF - Tracking one’s own behavior throughout the day to gain a better understanding of it.
  • R - Open source software used for statistical computing.
  • RFID - Radio Frequency Identification; RFID is a tracking system that uses intelligent bar codes and radio frequency electromagnetic waves to transfer data.
  • REAL-TIME DATA - Data that is created, processed, and analyzed within moments of arriving.
  • RECOMMENDATION ENGINE - An algorithm that analyzes a user’s previous buying behavior on an e-commerce website and suggests items they are likely to want next.
  • RISK ANALYSIS - The application of statistical methods to analyze the risks related to a project, action, or decision.
  • SQL - Structured Query Language, the standard language for accessing and manipulating databases.
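A quick taste of SQL, using the SQLite database built into Python's standard library (the table and rows are invented for illustration):

```python
import sqlite3

# SQL sketch: create a table, insert rows, and query them back
# with standard SQL statements, using an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("widget", 9.99), ("gadget", 24.50)])
rows = conn.execute(
    "SELECT item, amount FROM sales ORDER BY amount DESC"
).fetchall()
conn.close()
```

The same SELECT/INSERT/CREATE statements work, with minor dialect differences, across most relational databases.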
  • SENTIMENT ANALYSIS - The application of algorithms to comments, opinions, statuses, likes, and dislikes on social media to determine how people feel about a product.
  • SPATIAL ANALYSIS - Analyzing patterns based on demographics or topological data.
  • SOFTWARE-AS-A-SERVICE (SaaS) - Application software delivered over the internet and used through a web browser.
  • TERABYTE - 1,000 Gigabytes.
  • TRANSACTIONAL DATA - Data that records an organization’s transactions and therefore changes frequently and unpredictably.
  • TEXT ANALYTICS - The application of statistical, linguistic, and machine learning techniques to text-based sources in order to derive meaning.
  • TRANSPARENCY - Letting users know where the data is getting stored and for what purpose.
  • UNSTRUCTURED DATA - Data that has no definable structure.
  • VALUE - The benefit organizations ultimately extract from patterns in their data, creating value for society and consumers in turn.
  • VOLUME - Amount of data ranging from megabytes to brontobytes.
  • WEATHER DATA - An important Public Data source that can provide valuable insights if collaborated with the other sources.
  • YOTTABYTE - A unit of computer memory or data storage capacity equal to 1,024 zettabytes (2^80 bytes).
  • ZETTABYTE - A unit of computer memory or data storage capacity equal to 1,024 exabytes (2^70 bytes).
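The storage units scattered through this glossary (terabyte, petabyte, exabyte, zettabyte, yottabyte) form a simple ladder. This sketch uses the decimal convention, where each step is 1,000 times the previous; the 1,024-based figures above follow the binary convention instead:

```python
# Decimal storage-unit ladder: each unit is 1,000x the one before it.
units = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]
sizes = {name: 1000 ** (i + 1) for i, name in enumerate(units)}
```

This makes the glossary's arithmetic easy to check: an exabyte is a million terabytes, and a petabyte is a thousand terabytes.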