The term “big data” refers to extremely large and complex datasets that are challenging to store, process, and analyze using traditional data management and processing techniques. Big data is typically characterized using 5 key attributes known as the “5 Vs” – Volume, Velocity, Variety, Veracity and Value. This article provides an overview of what each of these 5 dimensions encompasses along with real-world examples.
The volume of big data refers to the vast amount of data being accumulated from an increasing number of sources at a rapid pace. By one commonly cited estimate, roughly 2.5 quintillion bytes of data are produced every day. Sources contributing to high data volumes include:
- Social Media: Facebook users upload 350 million+ photos per day. Twitter sees over 500 million tweets sent per day.
- Mobile Data: More than 7 billion people globally own mobile devices today. These devices produce data through apps, multimedia messages, call logs and location services.
- Web/Ecommerce Traffic: Popular websites record billions of page views per month. Retail sites collect data on product searches, transactions, ratings and more, producing massive datasets.
- Sensors and Internet of Things: Smart sensors embedded in equipment, appliances, vehicles and more are collecting temporal telemetry data across supply chains and smart spaces.
- Business Transactions: Point of sale systems, enterprise software, credit card payments and other business transactions generate large transactional datasets.
- Biomedical and Genomics Data: Medical devices, health trackers and genomics sequencing are producing biological datasets at unprecedented scales.
The volume of big data produced globally is growing explosively. By 2025, the world is projected to generate 97 zettabytes annually. Storing, processing and deriving insights from such massive volumes of multimodal data requires a distributed, scalable infrastructure with capabilities exceeding those of traditional database systems.
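One common building block of distributed, scalable storage is hash partitioning, where each record key is mapped deterministically to one of several nodes. The following is a minimal sketch, not a description of any particular system; the function names `assign_node` and `partition`, the key format and the node count are all hypothetical and chosen for illustration.

```python
import hashlib

def assign_node(record_key, node_count):
    """Map a record key to a node index in [0, node_count) using a stable hash."""
    digest = hashlib.sha256(record_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % node_count

def partition(keys, node_count):
    """Group keys by the node they are assigned to (hypothetical helper)."""
    shards = {node: [] for node in range(node_count)}
    for key in keys:
        shards[assign_node(key, node_count)].append(key)
    return shards

# Illustrative keys; real systems would partition rows, files or blocks.
keys = [f"user-{i}" for i in range(1000)]
shards = partition(keys, node_count=4)
for node, members in shards.items():
    print(node, len(members))
```

A stable hash (rather than Python's built-in `hash`, which is randomized per process for strings) keeps the key-to-node mapping reproducible across runs. Simple modulo partitioning reshuffles most keys when the node count changes, which is why production systems often use consistent hashing instead.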
Challenges with Volume
Dealing with enormous volumes of continuously arriving new data presents a number of key technical and organizational challenges:
- Scalable Storage is essential without blowing budgets. This requires leveraging clusters of cost-efficient commodity hardware and distributed file systems.
- Moving vast Data Volumes can strain networks. Data-locality awareness, which runs computation close to where the data is stored, reduces unnecessary data transfers.
- Identifying Relevant Data gets harder given storage constraints and limitations in indexing at scale. Tight integration with analytics is needed.
- Training Models on Ever-Growing Data is computationally demanding. Approaches such as online machine learning help by updating a model incrementally as new data arrives instead of retraining it from scratch.
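The online learning idea from the last point can be illustrated with a model that updates its parameters one example at a time and never stores the full dataset. This is a minimal sketch assuming a one-feature linear model trained with stochastic gradient descent; the class `OnlineLinearRegressor`, the synthetic `stream` generator and the learning rate are hypothetical choices for illustration.

```python
class OnlineLinearRegressor:
    """One-feature linear model y ~ w*x + b, updated one example at a time."""

    def __init__(self, learning_rate=0.05):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate
        self.n_seen = 0

    def predict(self, x):
        return self.w * x + self.b

    def learn_one(self, x, y):
        # Gradient step on the squared error of this single example.
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error
        self.n_seen += 1

def stream(n):
    """Synthetic, noiseless stream following y = 2x + 1 (assumed for the demo)."""
    for i in range(n):
        x = (i % 10) / 10.0
        yield x, 2.0 * x + 1.0

model = OnlineLinearRegressor()
for x, y in stream(10000):
    model.learn_one(x, y)  # constant memory: each example is seen once, then dropped

print(round(model.w, 3), round(model.b, 3))
```

Memory use stays constant no matter how many examples arrive, which is the property that matters when the dataset keeps growing.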
The velocity of big data refers to the speed at which data is created, accumulated and processed. With growing reliance on online services, real-time analytics and smart, internet-connected devices, data velocity has increased massively over the past decade.
Some examples of high velocity data sources include:
- Data Streams from User Interactions: Clickstream data from user sessions across web and mobile apps is generated continuously and requires rapid ingestion.
- Sensor Data: IoT deployments with thousands to millions of continually reporting smart sensors produce streaming telemetry requiring low-latency processing.
- Log Data: Activity log data streams from servers across IT systems record all events and errors. These high-throughput streams require rapid aggregation.
- Social Media Feeds: The firehose of tweets, status updates, photos and videos shared across social platforms calls for real-time capture and analysis.
- Financial Transactions: Each credit card swipe, trade and fund transfer produces data points that feed high-velocity streams, which continually update positions, balances and risk projections within milliseconds.
To extract value, big data systems need the capability to ingest streaming data feeds with minimal latency, run real-time analytics and turn the resulting insights into decisions and actions.
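A basic form of real-time analytics on a stream is windowed aggregation: events are grouped into fixed-length (tumbling) time windows and each window is emitted as soon as it closes. The sketch below assumes events arrive in timestamp order as `(timestamp, key)` pairs; the function name `tumbling_windows` and the sample clickstream are hypothetical.

```python
from collections import Counter

def tumbling_windows(events, window_seconds):
    """Yield (window_start, counts_by_key) as each fixed-size window closes.

    Assumes events are (timestamp, key) pairs arriving in timestamp order.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        start = (timestamp // window_seconds) * window_seconds
        if current_start is not None and start != current_start:
            # An event from a later window arrived, so the current one is complete.
            yield current_start, dict(counts)
            counts = Counter()
        current_start = start
        counts[key] += 1
    if current_start is not None:
        yield current_start, dict(counts)  # flush the final, still-open window

# Illustrative clickstream: (seconds since start, page name).
clicks = [(0, "home"), (3, "cart"), (9, "home"),
          (10, "home"), (14, "checkout"), (21, "home")]
windows = list(tumbling_windows(clicks, window_seconds=10))
for window_start, page_counts in windows:
    print(window_start, page_counts)
```

Because the function is a generator, results for early windows become available before the stream ends. Handling late or out-of-order events would need extra logic, such as watermarks, that is left out here.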
Challenges with Velocity
Challenges posed by ever-increasing velocities of new data include:
- Real-time Processing Complexity rises sharply in production deployments, which require predictable throughput, resilience to faults and zero data loss.
- Analytics Model Lifecycles shrink from months to weeks to days as data velocity shortens windows available for extracting training datasets. Retraining has to keep pace.
- Rapid Decision Making requires continuously sensing and responding based on the latest data. Shortening cycle times improves customer experiences and business performance.
- Detecting Anomalies Early gets harder with traditional tools. Tailored real-time anomaly detection on temporal data at scale becomes critical.
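As one illustration of real-time anomaly detection on temporal data, each new reading can be compared against the mean and standard deviation of a short sliding window of recent readings. This is a minimal sketch of a rolling z-score check, not a production detector; the function name `rolling_zscore_anomalies`, the window size, the threshold and the sample readings are assumptions.

```python
from collections import deque
from statistics import mean, pstdev

def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent rolling mean."""
    history = deque(maxlen=window)
    for index, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Flag the reading if it lies more than `threshold` std devs away.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield index, value
        history.append(value)  # the window only ever holds `window` readings

# Illustrative sensor readings with one obvious spike.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 25.0, 10.0, 9.8, 10.2]
anomalies = list(rolling_zscore_anomalies(readings))
print(anomalies)
```

The detector needs only a fixed-size buffer per stream, so it scales to many streams. One limitation of this simple version is that a flagged spike still enters the window and briefly inflates the standard deviation for the readings that follow.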
The variety dimension of big data refers to the wide diversity of data types, representations and sources, both structured and unstructured. Structured data includes things like relational data or timeseries data from sensors that conform to well-defined schemas.
Unstructured data encompasses everything else and can include:
- Text Content: This includes textual data as found in social media posts, webpages, books, documents, notes and electronic messaging systems like email and chat apps.
- Multimedia Content: Includes images, photos, audio files like podcasts, music files and speech; and video footage.
- Biological Data Types: Data produced by bioinformatics, genetic sequencing, medical tests and biometric devices uses specialized formats such as FASTA files.
- Observations and Sensor Readings: IoT deployments, earth/atmospheric sciences monitoring and business telemetry capture timeseries across different schemas.
- Metadata: Data defining and describing other data like author, date created, access permissions, tags and classifications.
Dealing with extensively heterogeneous data types, implicit schemas and multiple underlying semantics poses challenges for storage, mining, correlating and fusing data for analytics.
Challenges with Variety
Key technical and analytical challenges posed by widely varied data types and sources include:
- No One-Size-Fits-All Data Model works, which calls for polyglot persistence and schema-on-read.
- Understanding Implicit Semantics within unstructured data is technically hard but also crucial for value generation.
- Correlating Across Data Types requires tying together contextual data on entities while accounting for observational biases and systematic artifacts introduced by the capture systems.
- Infusing Domain Expertise into analytical workflows is non-trivial given specialized, multi-modal data.
- Adapting Analytical Methods to new, unseen data types remains an open research problem.
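Schema-on-read, mentioned in the first point above, means raw records are stored as they arrive and a schema is applied only when a particular analysis reads them. The sketch below assumes newline-delimited JSON records with inconsistent field names and types; the sample records, the field names and the function `read_temperature` are hypothetical.

```python
import json

# Raw records stored as-is; producers disagree on field names and types.
RAW_RECORDS = [
    '{"device": "t-1", "temp_c": "21.5", "ts": 1700000000}',
    '{"device": "t-2", "temperature": 22.1, "ts": "1700000060"}',
    '{"user": "alice", "text": "hello"}',
]

def read_temperature(raw_line):
    """Apply a temperature-reading schema at read time; return None if it does not fit."""
    record = json.loads(raw_line)
    value = record.get("temp_c", record.get("temperature"))
    if value is None or "device" not in record or "ts" not in record:
        return None  # not a temperature reading; other readers may still use it
    return {
        "device": record["device"],
        "temp_c": float(value),   # normalize strings and numbers to float
        "ts": int(record["ts"]),  # normalize timestamps to integer seconds
    }

parsed = (read_temperature(line) for line in RAW_RECORDS)
temperature_readings = [r for r in parsed if r is not None]
print(temperature_readings)
```

The raw store is never rewritten: a different analysis could read the same lines with a different schema, for example extracting the text messages instead.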
Veracity refers to the uncertainty around the quality and trustworthiness of big data. Characterizing and improving the veracity of analytical outcomes is crucial for informing decisions and research.
Common data quality challenges include:
- Inaccurate or Erroneous Data arising from faulty collection, corrupted storage and computational artifacts.
- Inconsistent Data across datasets can make fusing disparate data assets unreliable.
- Incomplete Data occurs frequently when capturing sparse and irregular observations especially from physical environments.
- Ambiguous Data happens when data capturing or labeling allows room for multiple interpretations especially with physical sensors and human tagging.
These veracity issues propagate into downstream analytics, degrading the quality and trustworthiness of results. Veracity also has an ethical dimension: fairness and the removal of unwanted bias are part of data credibility.
Challenges with Veracity
Key challenges in ensuring big data veracity include:
- Detecting anomalies early by characterizing expected statistical distributions.
- Identifying sparse, incomplete datasets and mitigating through collection improvements or modeling.
- Quantifying and improving dataset coverage relative to phenomena studied.
- Corroborating analytical outputs with ground truths gathered through domain-specific knowledge and painstaking human curation.
- Establishing rigorous approaches to quality assurance and confidence metrics for analytical results and machine learning predictions.
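Some of these checks can be automated as a simple data quality profile that measures completeness per field and counts values outside a plausible range. The following is a minimal sketch under stated assumptions; the function `profile_quality`, the field names and the valid range for temperature are hypothetical and would come from domain knowledge in practice.

```python
def profile_quality(records, required_fields, valid_ranges):
    """Report per-field completeness and out-of-range counts for dict records."""
    total = len(records)
    missing = {field: 0 for field in required_fields}
    out_of_range = {field: 0 for field in valid_ranges}
    for record in records:
        for field in required_fields:
            if record.get(field) is None:
                missing[field] += 1
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            if value is not None and not (low <= value <= high):
                out_of_range[field] += 1
    completeness = {
        # Fraction of records where the field is present; 0.0 for empty input.
        field: (1 - missing[field] / total) if total else 0.0
        for field in required_fields
    }
    return {"completeness": completeness, "out_of_range": out_of_range}

# Illustrative sensor records: one missing reading, one implausible value.
sensor_records = [
    {"sensor_id": "a", "temp_c": 21.0},
    {"sensor_id": "b", "temp_c": None},
    {"sensor_id": "c", "temp_c": 950.0},
    {"sensor_id": "d", "temp_c": 19.5},
]
report = profile_quality(
    sensor_records,
    required_fields=["sensor_id", "temp_c"],
    valid_ranges={"temp_c": (-50.0, 60.0)},
)
print(report)
```

A report like this can serve as a basic confidence metric attached to a dataset before it feeds downstream analytics.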
The value dimension focuses on achievable business gains from investments in big data programs. Generating value requires assessing organizational drivers, challenges and objectives to create high-impact analytical use cases.
Common sources of value from big data analytics include:
- Optimizing Pricing through demand modeling analytics
- Micro-segmentation to drive targeted marketing and sales
- Improving Customer Experiences thereby boosting loyalty
- Predictive maintenance helping avoid operational disruptions
- Early detection of fraud improving loss prevention
- Forecasting inventory needs preventing stock-outs
- Personalizing web and mobile experiences to improve engagement
- Optimized logistics through better demand forecasting
- Improving manufacturing yields using sensor analytics
- Enabling new intelligent services leveraging ML
- Exploring usage patterns and technology trends from data exhaust
- Driving R&D transformations through simulation and research data
Challenges with Value
Maximizing big data business value presents key leadership, organizational and computational challenges:
- Prioritizing High-Impact Use Cases needs contextual business understanding.
- Enabling Access and Analytics Democratization to spread benefits beyond specialized teams.
- Monitoring Metrics that Quantify Value from analytics and data science initiatives.
- Building Robust Data Pipelines that acquire, prepare, enrich and serve downstream analytics at scale.
- Promoting Platform Adoption through governance, data culture and upskilling.
With thoughtful strategies around these big data value dimensions, businesses can accelerate competitive advantages.
The 5 Vs (volume, velocity, variety, veracity and value) encapsulate the key attributes that distinguish big data problems, systems and initiatives. Architecting for scale, speed, adaptability, trust and business impact is essential to unlocking the full potential of big data. As big data techniques become integral across domains like commerce, research, governance and beyond, a deep understanding of the 5 Vs will serve both technology practitioners and leaders.