Dr. George Ng, Co-Founder and CTO of GGWP.
Recent advancements in large generative models have resulted in widespread interest in their ability to act on complex instructions. These so-called foundational large language models (LLMs), e.g., OpenAI’s GPT-4 or Google’s Gemini, demonstrate an uncanny ability to understand nuanced contexts and apply them to diverse and ambiguous tasks. Although deploying LLMs in real-time settings is constrained by factors such as latency, cost, privacy and “hallucination” risk, these challenges are manageable for valuable offline tasks.
Data labeling is a case in point. Conventional approaches require considerable time and financial investment, so most companies cannot justify the costs unless the upside is substantial and guaranteed or suitable open-source labels can be found. These barriers deter lightweight experimentation and ML deployment on use cases that may be uncommon but are key differentiators for a particular business.
For example, a specialty insurer may use LLM labeling to screen historical claims for fraudulent behavior, using its own fraud cases as a learning context, and then build a custom model for ongoing monitoring. Similarly, an online retailer may want to target complex personas, using LLM labeling to classify customers into each persona based only on their browsing history and natural language reviews. Such scenarios are easy to dream up across any industry and can impact key metrics like an insurer’s loss ratio or a retailer’s customer acquisition cost (CAC). Most importantly, they are simple and cost-effective to test.
Motivation
Though foundational models are highly capable generalized tools, they are often overkill for specific tasks while bringing downsides in cost, latency, privacy and explainability. Using an LLM, or an ensemble of LLMs, to instead label specific task data for training a smaller model often provides the best of both worlds. This can:
• Greatly reduce the labeling cost (often the biggest development hurdle) while retaining the high-quality reasoning of much larger models.
• Allow for an intermediate human review step to correct for errors/biases before a training set is deemed sufficiently accurate.
• Train smaller models (e.g., DistilBERT) that are performant for narrow tasks, faster, more predictable and much cheaper to run.
Using straightforward applications of few-shot learning and chain-of-thought (CoT) reasoning, LLMs can often generate high-precision labels with little setup beyond writing the prompt. And as discussed above, this generalized process is well-suited for the fast and affordable application of ML in a wide variety of business use cases.
Process Setup
To illustrate the potential of LLMs in data labeling, let’s consider the task of identifying sexual harassment within a social platform aimed at users aged 16-25. This process involves a few straightforward steps:
Create A System Prompt
LLMs such as GPT-4 can receive a system prompt that dictates their role, behavior and context. To tackle a specific labeling job, we simply provide the appropriate setting, including:
• Role Description: Describe the role and mentality the model should assume (e.g., “Community moderator for a social platform with users aged 16-25”).
• Task Details: Describe step-by-step how the task should be accomplished (e.g., “You will receive input messages in a list, evaluate them for sexual content or harassment inappropriate for our platform, and then output the following per message: MESSAGE, REASONING, LABEL”).
• Few-Shot Examples: Provide three to 10 challenging examples of real inputs and human-labeled outputs using the formats described above.
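The three components above can be assembled into a single prompt string. The following is a minimal sketch, not a prescribed implementation: the `FewShotExample` type, the `build_system_prompt` helper, the `SAFE`/`HARASSMENT` label names and the pipe-delimited one-line-per-message layout are all assumptions made for illustration, and the example rows are placeholders rather than real platform data.

```python
from dataclasses import dataclass


@dataclass
class FewShotExample:
    """One human-labeled example (hypothetical structure)."""
    message: str
    reasoning: str
    label: str


def build_system_prompt(role: str, task: str, examples: list[FewShotExample]) -> str:
    """Join the role, task details and few-shot examples into one prompt string."""
    # Assumed layout: one example per line, fields separated by " | ".
    example_lines = [
        f"MESSAGE: {ex.message} | REASONING: {ex.reasoning} | LABEL: {ex.label}"
        for ex in examples
    ]
    sections = [
        f"ROLE: {role}",
        f"TASK: {task}",
        "EXAMPLES:\n" + "\n".join(example_lines),
    ]
    return "\n\n".join(sections)


# Placeholder inputs; in practice the examples would be real, human-labeled messages.
prompt = build_system_prompt(
    role="Community moderator for a social platform with users aged 16-25",
    task=(
        "You will receive input messages in a list. Evaluate each for sexual "
        "content or harassment inappropriate for our platform, then output one "
        "line per message with MESSAGE, REASONING and LABEL."
    ),
    examples=[
        FewShotExample("good game, rematch tomorrow?", "Friendly and on topic.", "SAFE"),
        FewShotExample("<placeholder harassing message>", "Unwanted sexual remark.", "HARASSMENT"),
    ],
)
print(prompt)
```

Keeping the few-shot rows in exactly the layout the model is asked to emit makes the later parsing step simpler, since the model tends to mirror the format it is shown.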
Prepare Input Data
Collect relevant messages from your platform or from open-source datasets, and ensure they are anonymized and appropriate to process. Then feed lists of messages to the model, along with the System Prompt, in the format the prompt describes.
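As a rough sketch of this preparation step, the snippet below splits messages into fixed-size batches and formats each batch as a numbered list. The helper names (`scrub`, `batches`, `format_batch`), the batch size and the numbered-list layout are assumptions for illustration. The email regex is only a token gesture at scrubbing and is not a substitute for a proper anonymization process.

```python
import re
from typing import Iterator

# Illustrative only: catches simple email addresses, nothing else.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scrub(message: str) -> str:
    """Mask email addresses and collapse whitespace so each message stays on one line."""
    return " ".join(EMAIL_RE.sub("[EMAIL]", message).split())


def batches(messages: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` messages."""
    for start in range(0, len(messages), size):
        yield messages[start:start + size]


def format_batch(batch: list[str]) -> str:
    """Render one batch as a numbered list to send alongside the system prompt."""
    return "\n".join(f"{i}. {scrub(m)}" for i, m in enumerate(batch, start=1))


messages = ["hi there", "mail me at a.b@example.com  now", "see you later"]
for batch in batches(messages, size=2):
    print(format_batch(batch))
    print("---")
```

Smaller batches make it easier to match output rows back to inputs when the model drops or merges a line, at the cost of more requests.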
Process And Review Output Data
The model should return its decisions in the format the System Prompt describes (e.g., “MESSAGE, REASONING, LABEL” per input), which we can parse line by line. Primarily, we care about the label, but having the model output a reason as well helps to mitigate hallucinations; in oversimplified terms, the model would have to make two errors rather than one to hallucinate. However, across many examples, errors will certainly occur, so it is imperative that the processing is robust to malformed rows and that a human reviews samples of output labels to determine true performance. This review process then informs how the System Prompt context and examples should be updated to address the biases it uncovers.
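A parser that tolerates malformed rows might look like the sketch below. It assumes, purely for illustration, that the model was asked to emit one line per message in the form `MESSAGE: ... | REASONING: ... | LABEL: ...` with labels drawn from `SAFE` and `HARASSMENT`; none of these names come from a specific API. Rows that do not match are set aside instead of raising, and a random sample of parsed rows is drawn for human review.

```python
import random
import re
from typing import NamedTuple

# Assumed row layout and label set; adjust to whatever the system prompt specifies.
ROW_RE = re.compile(
    r"^MESSAGE:\s*(?P<message>.*?)\s*\|\s*REASONING:\s*(?P<reasoning>.*?)"
    r"\s*\|\s*LABEL:\s*(?P<label>\w+)\s*$"
)
ALLOWED_LABELS = {"SAFE", "HARASSMENT"}


class LabeledRow(NamedTuple):
    message: str
    reasoning: str
    label: str


def parse_output(raw: str) -> tuple[list[LabeledRow], list[str]]:
    """Split model output into well-formed rows and rejected lines."""
    parsed: list[LabeledRow] = []
    rejected: list[str] = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no information
        match = ROW_RE.match(line)
        if match and match["label"].upper() in ALLOWED_LABELS:
            parsed.append(
                LabeledRow(match["message"], match["reasoning"], match["label"].upper())
            )
        else:
            rejected.append(line)  # keep for inspection or a retry
    return parsed, rejected


def sample_for_review(rows: list[LabeledRow], k: int, seed: int = 0) -> list[LabeledRow]:
    """Draw up to k rows at random (seeded for repeatability) for a human to check."""
    return random.Random(seed).sample(rows, min(k, len(rows)))


raw_output = """MESSAGE: hey there | REASONING: Friendly greeting. | LABEL: SAFE
Sorry, I cannot label this message.
MESSAGE: <placeholder> | REASONING: Unwanted sexual remark. | LABEL: harassment
MESSAGE: hmm | REASONING: Unclear. | LABEL: MAYBE"""

rows, bad = parse_output(raw_output)
print(len(rows), "parsed;", len(bad), "rejected")
```

Tracking the rejected lines separately gives a simple health metric: a rising rejection rate usually means the prompt or the output format needs attention before any labels are trusted.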
Risk Mitigation
While the advantages of LLM labeling are evident, we must also address the potential risks.
• Cost: Given the small number of output tokens per label, even a pricier commercial LLM should only cost a few dollars per thousand labeled examples, which is far cheaper than human labelers and worthwhile for a training set that is reused for ongoing model training.
• Latency: While the process can take hours for tens of thousands of labeled examples, it remains much faster than human labeling. It can be sped up through parallelization or by self-hosting models and is sufficient for training, which may take hours to days anyway.
• Hallucinations: It is highly recommended to incorporate a human review component for evaluating label samples and determining true performance. For sophisticated users, it may be worthwhile to submit each example to multiple LLMs (including commercial and open-source) as an ensemble and aggregate their labels to produce “confidence.”
• Privacy: Especially when working with commercial models and APIs, it is important to ensure that the input data is anonymized, scrubbed for PII and legally viable to provide to a third-party service. When stronger privacy guarantees are required, consider running SOTA open-source LLMs such as the LLaMA family of models instead.
• Ethical Considerations: LLMs contain intrinsic biases that are often hard to measure and harder to mitigate. When avoidable, do not provide demographic, socioeconomic and correlated inputs to the model. When such factors must be considered, have a clear plan for evaluating labeling behavior and performance per cohort.
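The ensemble idea mentioned under Hallucinations can be sketched as a simple majority vote, with the share of agreeing models used as a rough “confidence” score. This is one possible aggregation under stated assumptions, not the only one: the function names and the 0.75 review threshold are hypothetical, and the votes are assumed to be label strings already parsed from each model’s output.

```python
from collections import Counter


def aggregate_labels(votes: list[str]) -> tuple[str, float]:
    """Return the majority label and the fraction of models that voted for it."""
    if not votes:
        raise ValueError("at least one vote is required")
    # On a tie, Counter.most_common keeps the label that was seen first.
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def needs_human_review(confidence: float, threshold: float = 0.75) -> bool:
    """Flag examples where the models disagree too much (threshold is an assumption)."""
    return confidence < threshold


# One vote per model for a single message (placeholder values).
label, confidence = aggregate_labels(["HARASSMENT", "HARASSMENT", "SAFE"])
print(label, round(confidence, 2), needs_human_review(confidence))
```

Low-agreement examples are natural candidates for the human review sample, since they are where a single model is most likely to be wrong.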
Conclusion
Using LLMs for data labeling streamlines the development of specialized models, reducing the cost and time barriers associated with machine learning. This enables businesses of all sizes and industries to explore niche applications, marking a significant step forward in practical and efficient ML deployment.