What Is Small Data In AI?
Not long ago, Big Data was seen as a management revolution. Enterprise IT invested heavily to acquire large volumes of information, all to drive business decision making. And it worked out well for large enterprises with enough computing resources and data engineers to extract a few meaningful insights from exploding volumes of raw data.
The technology and philosophy of big data was appealing to business decision makers because billions of connected devices and users produce several exabytes of data every day. Every data point was a continuation of a trend, pattern, or story that business organizations could — at least in theory — exploit to make profitable data-driven decisions.
But that’s not how it always turned out.
The failure of big data
Several years ago, Gartner estimated that 85% of all data projects failed to deliver the desired outcomes. This stat suggested that organizations were jumping onto the Big Data bandwagon without aligning their business objectives with the technology and data assets they were acquiring.
Customers and end users didn't always find data-driven technologies appealing, either. Consider the Facebook-Cambridge Analytica scandal from 2018, in which user information was harvested without explicit consent. Or take a peek at the countless ads on any ecommerce website that have no personal relevance.
These use cases turned out to be exploitative or annoying, perhaps both.
It turns out that you don’t always need Big Data. You don’t always need to aggregate every available source of data to make decisions unique to every user. In fact, modern AI technologies are developing capabilities to capture knowledge-based intelligence from data and information that is:
- Small
- Specific
- Feature-rich
Consider this simple example: you can train a deep learning model for self-driving cars to stop at a red traffic light. Such an approach must both:
- Use a training dataset large enough to capture all the varied situations that arise at a real-world traffic stop.
- Rely on a specialized model and learning algorithm that retains this knowledge when the model is exposed to other scenarios for different self-driving use cases.
A similar limitation is observed for modern LLMs trained on big data. GenAI tools such as ChatGPT perform well on some tasks, but not on all of them. They can't necessarily explain the reasoning or logic behind their answers (the ongoing problem of "black box" outputs).
Perhaps this is why we have yet to see a universal AGI model that performs exceptionally well on all tasks.
How do humans really learn?
Toward that goal, AI researchers and the scientific community are looking into how humans really learn: through logic and reasoning. In practice, this means integrating small but highly specific data with some established logic or knowledge.
If you think about the traffic-stop example again, humans simply need to identify a red light and then apply the logic of traffic rules to any scenario they encounter at a junction.
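To make that idea concrete, here is a minimal sketch of "small data plus logic." It is illustrative only: the handful of RGB samples stands in for a small labeled dataset, the nearest-centroid classifier stands in for a learned model, and the traffic rules are encoded directly as logic rather than learned.

```python
# Minimal sketch (illustrative only): a few labeled RGB samples stand in for a
# small traffic-light dataset, and fixed traffic rules supply the decision logic.
from statistics import mean

# Small, specific, feature-rich data: average RGB of the lit lamp in a few images.
SAMPLES = {
    "red":    [(210, 40, 35), (198, 52, 48), (220, 30, 25)],
    "yellow": [(230, 200, 40), (240, 210, 55)],
    "green":  [(40, 200, 80), (55, 215, 90), (35, 190, 70)],
}

# "Learning" step: compute one centroid per class from the small dataset.
CENTROIDS = {
    label: tuple(mean(channel) for channel in zip(*points))
    for label, points in SAMPLES.items()
}

def classify_light(rgb):
    """Nearest-centroid classification of the lit lamp's color."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(CENTROIDS, key=lambda label: dist(rgb, CENTROIDS[label]))

# Established knowledge: traffic rules are logic, not learned parameters.
RULES = {"red": "stop", "yellow": "prepare to stop", "green": "proceed"}

def decide(rgb):
    return RULES[classify_light(rgb)]

print(decide((205, 45, 40)))   # -> stop
print(decide((45, 205, 85)))   # -> proceed
```

The point of the sketch is the split: a tiny dataset is enough to recognize the light, while the rules that map a light color to an action come from established knowledge.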
This brings us to the definition of Small Data.
So, what is small data?
Small Data refers to a relatively small set of information that is sufficient to capture adequate insights about a specific use case. Here are some clear examples:
- A small set of data points from an IoT sensor
- EEG signals from a few subjects undergoing brain activity research
- A few images of traffic stop scenarios with different lights
As data analyst Austin Chia describes:
Analyzing small data doesn’t require large AI models with billions of parameters. Since the data distribution describes fewer features, it can be analyzed using traditional statistical methods, on low-power IoT and edge-computing devices.
(Related reading: predictive modeling & predictive vs. prescriptive analytics.)
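As a simple illustration of that point, here is a minimal sketch of small-data analysis using only basic statistics. The sensor readings and the two-standard-deviation anomaly threshold are made-up assumptions, not real device data.

```python
# Minimal sketch (hypothetical values): a small batch of IoT temperature readings
# analyzed with basic statistics -- no large model or heavy compute required.
from statistics import mean, stdev

readings_c = [21.4, 21.6, 21.5, 21.9, 22.1, 21.8, 25.3, 21.7]  # hypothetical sensor data

avg = mean(readings_c)
spread = stdev(readings_c)

# Flag readings more than 2 standard deviations from the mean as anomalies.
anomalies = [r for r in readings_c if abs(r - avg) > 2 * spread]

print(f"mean={avg:.2f} C, stdev={spread:.2f} C, anomalies={anomalies}")
```

This kind of analysis runs comfortably on a low-power edge device, which is exactly the appeal of small data for IoT scenarios.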
Use cases for small data
This capability can allow business organizations to build highly tailored services. For example:
- A personalized wearable healthcare monitoring device can infer patient health using relatively few measurements.
- A coffee shop can identify retail patterns — such as peak hours and flavor preference — based on only a few days of shopping data.
These use cases are simplistic, of course. Existing knowledge and logic define the relationships or model parameters, and an inference is produced when those parameters cross a threshold value.
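Here is a minimal sketch of that threshold-based pattern for the wearable example. The thresholds and function name are hypothetical, chosen only to show how a few measurements plus encoded knowledge can produce an inference.

```python
# Minimal sketch (hypothetical thresholds): existing knowledge is encoded as rules,
# and an inference fires when a small set of measurements crosses a threshold.
def assess_vitals(heart_rate_bpm, spo2_percent, temp_c):
    """Return simple alerts based on a few wearable measurements."""
    alerts = []
    if heart_rate_bpm > 120 or heart_rate_bpm < 40:
        alerts.append("abnormal heart rate")
    if spo2_percent < 92:
        alerts.append("low blood oxygen")
    if temp_c >= 38.0:
        alerts.append("fever")
    return alerts or ["no alerts"]

# A handful of readings is enough to produce an inference.
print(assess_vitals(heart_rate_bpm=128, spo2_percent=96, temp_c=37.2))
# -> ['abnormal heart rate']
```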
But what about the more advanced and complex use cases?
Take the example of LLMs. We know that LLMs perform well on generic conversation tasks. But what about specific math problems and programming styles? Do you need to train a model on every single code snippet published on Stack Overflow for it to learn a particular programming style or paradigm?
Small data vs. big data
In these cases, large models trained on big data can serve as backbone models: a base model state that is further fine-tuned and adapted to perform well on a specialized task. Fine-tuning an LLM may take more than a handful of examples, but the dataset is still small relative to the data used to pretrain the backbone model. It will, however, require knowledge or logic as a means to guide the training.
For example, models such as ChatGPT rely on the Reinforcement Learning from Human Feedback (RLHF) learning algorithm. In simple terms, we can say two things:
- The reinforcement learning aspect uses examples. (That is, interactions of a system with its environment in response to an input.)
- The human feedback aspect introduces logic and established knowledge as small data.
Indeed, it is this logic and established knowledge that redirects and adapts the model's learning so that it performs well on the tasks the small dataset represents.
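To make the backbone-plus-small-data idea concrete, here is a minimal sketch assuming PyTorch is available. The backbone, the task head, and the synthetic dataset are all stand-ins: in practice the backbone would be a real pretrained model loaded from a checkpoint, and the small dataset would be task-specific examples rather than random tensors.

```python
# Minimal sketch (toy model, synthetic data): a frozen "backbone" plus a small
# task head fine-tuned on a small dataset, standing in for adapting a large
# pretrained model to a specialized task.
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a large pretrained backbone; in practice this is loaded from a
# checkpoint rather than defined here.
backbone = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU())
for param in backbone.parameters():
    param.requires_grad = False  # keep the pretrained knowledge frozen

# Small task-specific head: the only part updated during fine-tuning.
head = nn.Linear(32, 2)

# "Small data": a few labeled examples for the specialized task (synthetic here).
x_small = torch.randn(24, 16)
y_small = torch.randint(0, 2, (24,))

optimizer = torch.optim.Adam(head.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):
    logits = head(backbone(x_small))
    loss = loss_fn(logits, y_small)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final fine-tuning loss: {loss.item():.3f}")
```

Only the small head is updated, which mirrors the big-data-versus-small-data split: the expensive, general knowledge lives in the frozen backbone, while the small dataset specializes the model for one task.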
Summarizing small data vs. big data
Drawing from our article on Big Data vs. Small Data Analytics, we can summarize their key differences as follows:
- Data size: Small data refers to datasets that are relatively smaller and can be easily processed using traditional methods. Big data is massive in volume and requires advanced tools and techniques for analysis.
- Variety: Small data is usually structured and organized, coming from well-defined sources such as databases or spreadsheets. Big data, however, comes from various sources and can be unstructured or semi-structured.
- Velocity: Small data is static and doesn’t change frequently. Big data streams in continuously at high speeds.
- Sources: Small data typically comes from internal sources (like customer databases) while big data can come from both internal and external sources, like social media platforms.
- Insights obtained: With small data, you can easily draw insights from the data using basic statistical methods. Big data requires advanced analytics tools and techniques to extract meaningful insights.
- Scope: Small data is usually focused on a specific problem or question while big data analytics aims to explore multiple questions, patterns, and correlations at once.
As more organizations experiment with language models and AI, our hunch is that small data will become increasingly important. Perhaps we'll see a time when small data itself is the star of many business experiments, and big data is reserved only for the use cases that truly require and benefit from it.