Data is the most important asset in data science. Making sense of data allows data scientists to uncover valuable insights that can improve business decisions, predict future trends, and solve complex problems. However, not all data is created equal.
Understanding the different types of data and their unique characteristics is crucial for using the right tools and techniques to transform raw data into impactful knowledge. This article provides an in-depth look at the main categories of data in data science.
Types of Data in Data Science
Here we discuss 9 types of data in data science, divided into four main categories: by measurement scale, by structure, by source, and by time.
Let’s get started!
By Measurement Scale
This category is divided into quantitative and qualitative data.
Quantitative data represents information that is expressed numerically and given a mathematical value. Because of this, quantitative data can be measured precisely and subjected to statistical analysis to identify patterns and trends. There are two main types:
- Discrete Data: Discrete data represents countable numbers that are separate and distinct from each other. There are gaps between each data point with no in-between values. Some examples include the number of employees in a company, inventory levels, or product ratings. Discrete data can only take certain values within a finite range.
- Continuous Data: Continuous data can take on any value within a continuous range. It represents measurable quantities that can be meaningfully divided into fractional values. Examples include metrics like temperature, time, geographic coordinates, or product dimensions. Continuous data brings virtually endless analysis possibilities.
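The distinction above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical values: discrete counts are whole numbers with no in-between values, while continuous measurements can take any fraction in a range, and both support precise calculation.

```python
import statistics

# Discrete data: countable, whole-number values with no in-between points
employees_per_branch = [12, 8, 15, 30]       # hypothetical head counts

# Continuous data: measurable values that can take any fraction in a range
temperatures_c = [21.5, 22.0, 19.5, 21.0]    # hypothetical readings

# Both kinds of quantitative data support exact statistical analysis
print(sum(employees_per_branch))              # total head count: 65
print(statistics.mean(temperatures_c))        # average temperature: 21.0
```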
Because it consists of precisely measured numbers, quantitative data allows data scientists to perform accurate calculations and advanced analytics using mathematical and statistical models. However, quantitative data lacks descriptive details and the ability to capture subtle human traits.
If quantitative data answers “how much,” qualitative data answers “what, how, and why.” Qualitative data captures intangible qualities and characteristics through descriptive details rather than numbers. This allows for a nuanced understanding of human behaviors, attitudes, and preferences that is impossible to achieve with quantitative data alone. There are three main types of qualitative data:
- Nominal Data: Nominal data places subjects into categorical groups or naming classifications without a set order or value. For example, gender, country of origin, or product color.
- Ordinal Data: Ordinal data categorizes subjects by relative degree, rank, or position along a scale. Data points have an intrinsic order or hierarchy. Examples include socioeconomic class, customer satisfaction scores, or movie ratings.
- Textual Data: All information captured in written words, sentences, and narratives constitutes textual qualitative data. This includes social media posts, online reviews, survey responses, interview transcripts, and more. Contemporary data science leverages natural language processing to extract insights.
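The three qualitative types above behave differently in code. Here is a minimal Python sketch with hypothetical labels: nominal categories can only be counted, ordinal categories can additionally be ranked via an explicit order, and textual data can be tokenized for simple frequency analysis.

```python
# Nominal: categories with no inherent order (hypothetical product colors)
colors = ["red", "blue", "red", "green"]
color_counts = {c: colors.count(c) for c in set(colors)}

# Ordinal: categories with an intrinsic rank (hypothetical satisfaction scale)
satisfaction_rank = {"poor": 1, "fair": 2, "good": 3, "excellent": 4}
responses = ["good", "excellent", "fair", "good"]
# The rank mapping makes sorting meaningful even though labels are not numbers
sorted_responses = sorted(responses, key=satisfaction_rank.get)

# Textual: free-form words, amenable to simple NLP-style processing
review = "Great product, fast shipping, great value"
word_freq = {}
for word in review.lower().replace(",", "").split():
    word_freq[word] = word_freq.get(word, 0) + 1

print(color_counts["red"])   # 2
print(sorted_responses[0])   # 'fair' (the lowest-ranked response)
print(word_freq["great"])    # 2
```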
While quantitative data provides precision, qualitative data brings the critical context needed to interpret numbers correctly. Combining both is ideal for impactful analysis.
By Structure
Structured data conforms to a predefined data model to organize information neatly. This includes delimited text files, spreadsheet tables, relational SQL databases, and bar-coded retail databases. Structured data fits neatly into rows and columns for easy searching, filtering, aggregation, and analysis. Its well-defined structure and consistent format enable seamless data processing. However, forcing messy real-world information into strict tables can result in information loss.
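The row-and-column convenience described above is easy to see with Python's standard `csv` module. This is a small sketch over a hypothetical product table: because every record shares the same fields, filtering and aggregation are one-liners.

```python
import csv
import io

# Structured data: rows and columns following a consistent, predefined schema
raw = """sku,product,price
A100,widget,9.99
A200,gadget,24.50
A300,gizmo,5.00
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# A uniform schema makes aggregation and filtering trivial
total = sum(float(r["price"]) for r in rows)
cheap = [r["product"] for r in rows if float(r["price"]) < 10]

print(round(total, 2))  # 39.49
print(cheap)            # ['widget', 'gizmo']
```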
Unstructured data does not conform to data models and lacks a defined structure. By most estimates, the large majority of today’s data is unstructured – including word documents, social media posts, digital images, video files, and audio streams.
While rich in descriptive detail, unstructured data is difficult to search, process, and analyze with traditional methods. Specialized big data analytics, computer vision, natural language processing, and machine learning techniques are required to handle unbounded streams of unstructured data.
As a blend of structured and unstructured data, semi-structured data contains both defined entities and elements without conforming formats. This includes JSON documents, XML files, and NoSQL databases. Semi-structured data combines aspects of flexible self-describing structure while retaining enough organization for analysis. It serves as a middle ground between rigid structured data and unpredictable unstructured data.
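The "self-describing but not rigid" nature of semi-structured data shows up clearly when parsing JSON. In this minimal sketch with hypothetical records, each field names itself, but records need not share the same schema, so code must tolerate missing keys.

```python
import json

# Semi-structured: self-describing fields, but records need not share a schema
raw = """[
  {"user": "ana", "age": 34, "tags": ["pro", "beta"]},
  {"user": "ben", "city": "Lagos"}
]
"""
records = json.loads(raw)

# Fields may be present or absent, so use .get() to handle missing keys
ages = [r.get("age") for r in records]
print(ages)  # [34, None]
```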
By Source
Primary data is collected at the source, directly from the subject being studied, by the data scientist – as opposed to using secondary data originally collected by others. Examples include clinical trial data, sensor data, website traffic data, survey results, or experimental measurements. Primary data grants data scientists full control to shape the desired analysis. However, collecting primary data first-hand is extremely time and resource intensive.
Secondary data includes any data originally sourced and compiled by others, repurposed for the current analysis. As opposed to costly primary data collection efforts, data scientists rely heavily on pre-existing secondary datasets. These include public government records, commercial transaction records, web scraped content, social media archives, and historic datasets in academia and industry. Secondary data is easily accessible and affordable but may have biases, inaccuracies, or missing information.
By Time
Cross-sectional data captures a dataset’s characteristics at a single point in time. It resembles a snapshot taken at one moment, allowing comparisons between groups, subsets, and variables within the same timeframe. Repeating cross-sectional collection consistently over extended periods lets longitudinal studies observe trends. Examples include customer satisfaction surveys, blood pressure screenings, or weekly sales figures. A key limitation is that a single cross-section cannot, on its own, establish causal relationships.
Time-series data tracks multiple data points over consistent time intervals to reveal patterns and trends over time. It adds the critical variable of time, recording what changes and what stays the same. Examples include stock market performance, climate readings, resource usage, and web traffic. Statistical time-series forecasting examines historical sequences to predict future values. This drives planning and strategy by estimating future capacities and demand. However, time-series analysis depends heavily on consistent data collection over long durations.
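A trailing moving average is one of the simplest ways to smooth a time series and produce a naive next-step estimate. The sketch below uses hypothetical weekly sales figures; the window size and the use of the last smoothed value as a forecast are illustrative choices, not a production forecasting method.

```python
# Time-series data: observations recorded at consistent time intervals
weekly_sales = [100, 120, 140, 130, 150, 170]  # hypothetical weekly totals

def moving_average(series, window=3):
    """Trailing moving average: each point averages the last `window` values."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

smoothed = moving_average(weekly_sales)
print(smoothed)       # [120.0, 130.0, 140.0, 150.0]

# A naive forecast: carry the latest smoothed value forward
forecast = smoothed[-1]
print(forecast)       # 150.0
```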
In summary, a deep knowledge of data types, characteristics, structure, and sources serves as the foundation for designing robust data science systems and workflows to extract actionable insights. Quantifying data relationships through analytics is ultimately less impactful without the human context provided by qualitative data. Understanding time-series trends is limited without comparisons to cross-sectional snapshots. Strategic combination of different data types will reap the most powerful and nuanced analysis results.