Here's a confession I don't see L&D leaders make very often: sometimes we already know the measurement frameworks. We can recite Kirkpatrick's four levels. We know what a Level 3 behavior change evaluation looks like. We've sat through the conference session where someone explained Phillips ROI and nodded along. The measurement literacy exists.
What doesn't always exist is the willingness to actually use it. Because real measurement means getting real answers. And real answers sometimes tell you that the program you spent months building, the one you presented to leadership as a strategic priority, didn't produce the behavior change you promised. That's uncomfortable. So instead we measure what's easy, report what looks good, and quietly avoid the question of whether any of it mattered.
I've been in that position. My team built 637 eLearning modules in a single year. We got fast. We got efficient. Our development cycle time dropped 31%. We got really good at measuring the pipeline. We were less consistently rigorous about measuring what happened after content left the LMS. That's the gap I want to talk about honestly, and the one I think most of the field is living with right now.
TL;DR
- 90% of organizations evaluate Level 1 (reaction); only 35% consistently measure Level 4 (business results); 65% use only one or two Kirkpatrick levels total
- The problem isn't lack of frameworks. It's the fear that real measurement will show our work didn't work, plus the real complexity of attributing behavior change to a single training intervention
- The practical path forward isn't measuring everything: it's measuring the right things, starting before the program launches, and building measurement into the design rather than bolting it on at the end
📊 What We're Actually Measuring (And What We're Not)
The numbers are stark and consistent across multiple sources. According to Sopact's 2025 analysis of industry data, 90% of organizations evaluate Level 1 (learner satisfaction). Only 35% consistently measure Level 4 (business results). Training Magazine's 2025 survey found that 65% of organizations use only one or two Kirkpatrick levels. Watershed LRS reports that 93% of organizations use satisfaction surveys while only 42% measure actual behavior change.
Gyde.ai's 2026 L&D benchmarking data is even more pointed: only 8% of L&D professionals report being highly confident in their ability to measure business impact. Only 16% say they're effective at tracking metrics for behavior change or business impact. We're spending $102.8 billion annually on corporate training ($874 per learner in 2025), and the overwhelming majority of us can't tell you with confidence whether it worked.
Kevin Yates, who's spent years in this conversation, put it plainly: "We've spent years reporting on activity, on attendance, completion, hours, and satisfaction." We've gotten very good at reporting the things that are easy to collect from an LMS and present in a dashboard. We've stayed much more comfortable avoiding the harder question of whether learning led to changed behavior, and whether changed behavior led to business outcomes.
Kirkpatrick Partners' framing from 2026 is worth holding onto: "The goal of evaluation was never to count. It was to learn." That reframe matters. Measurement isn't about building a case for your team's existence. It's about understanding what works so you can do more of it and less of what doesn't. When we treat measurement as a political activity (proving value) instead of a learning activity (improving decisions), we get exactly the measurement culture most of our field currently has.
😱 The Fear Nobody Talks About
I want to name the thing that doesn't usually get said in the polished conference talk version of this conversation: we avoid rigorous measurement because we're afraid of what we'll find.
There's a version I've heard from multiple L&D leaders in private conversations that goes like this: "If we start pushing back on what the business asks us to build, or if we start honestly evaluating whether our programs are working, leadership will stop coming to us." The fear is that real measurement creates friction. That if you do a genuine Level 3 evaluation and find that behavior change isn't happening, you'll either have to report bad news or explain a complicated attribution problem to executives who want simple answers.
That fear isn't irrational. Some organizations punish honesty about learning effectiveness. If your leadership culture treats "this program didn't produce the behavior change we expected" as a failure of the L&D team rather than useful information about the design or the organizational context, then your incentive to measure honestly is low.
But here's where I've landed: the alternative is worse. If we can't demonstrate that our work produces outcomes, we stay in the position of being a cost center that leadership tolerates rather than a function that drives business results. That's not a sustainable place to build a career or a team. And the measurement fear becomes a self-fulfilling prophecy: we don't measure because we might find the program didn't work, so we never learn how to build programs that do work.
The Learning Guild's framing of this is accurate: "Learning effectiveness sits at the intersection of human behavior, systems, time, and business context. Very few metrics were designed to operate in that complexity." That complexity is real, and it's hard. But it's the work.
🧩 The Attribution Problem Is Real (And Often Used as an Excuse)
Here's where I want to be honest about a nuance that often gets lost in this conversation: measurement of learning outcomes is hard. Not just politically hard. Technically hard.
When a sales rep's close rate improves six months after completing a negotiation training, how much of that improvement is attributable to the training versus their increased tenure, their manager's coaching, a change in the market, or any number of other factors? The attribution problem is real. Claiming that training caused a business outcome is almost always an oversimplification. A competent evaluation design acknowledges this.
Phillips ROI methodology is useful partly because it takes attribution seriously: it includes an isolation step that attempts to estimate how much of an observed performance change is attributable to the training specifically versus other factors. That's methodologically honest. It's also why Phillips himself recommends that only 5% to 10% of programs actually need a full ROI evaluation. Most programs don't have the stakes or the budget to warrant that level of rigor.
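To make that arithmetic concrete, here's a back-of-envelope sketch of a Phillips-style calculation in Python. The dollar figures, the isolation percentage, and the confidence adjustment are all illustrative, not drawn from any real program; the point is simply that the isolation step shrinks the claimed benefit before you ever compute ROI.

```python
def adjusted_roi(total_benefit, program_cost, isolation_pct, confidence_pct):
    """Phillips-style ROI with isolation and confidence adjustments.

    total_benefit  -- monetized performance improvement observed after the program
    program_cost   -- fully loaded cost of designing and delivering the program
    isolation_pct  -- estimated share of the improvement attributable to training
    confidence_pct -- how confident the estimators are in that isolation estimate
    (All inputs below are illustrative placeholders, not real program data.)
    """
    attributable_benefit = total_benefit * isolation_pct * confidence_pct
    net_benefit = attributable_benefit - program_cost
    return net_benefit / program_cost * 100  # ROI expressed as a percentage

# $250k of observed improvement and a $60k program, but we estimate only 40% of
# the improvement came from training and we're 70% confident in that estimate.
print(adjusted_roi(250_000, 60_000, 0.40, 0.70))   # ~16.7% ROI
print((250_000 - 60_000) / 60_000 * 100)           # ~316.7% -- the naive claim
```

The gap between the two printed numbers is the whole argument for taking isolation seriously: the honest figure may still be positive, but it's rarely the headline number a naive calculation produces.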
The practical implication is this: we shouldn't be trying to do a full ROI evaluation on every module in a 637-module portfolio. That would be absurd. But we should be able to identify, for any given program, what the intended behavior change is, how we'd know it's happening, and whether we're seeing leading indicators that suggest the training is contributing to that change.
That's different from what most teams do, which is either avoid the question entirely or point to completion rates as evidence of effectiveness. Completing a course is not learning. Learning is not behavior change. Behavior change is not business impact. The chain of inference matters, and each link requires deliberate design and measurement.
📐 Why LTEM Matters More Than We've Acknowledged
Will Thalheimer's Learning-Transfer Evaluation Model takes the Kirkpatrick foundation and makes it more operationally useful for practitioners. LTEM's eight tiers distinguish between attendance (Tier 1), learner activity (Tier 2), learner perceptions (Tier 3), knowledge (Tier 4), decision-making competence (Tier 5), task competence (Tier 6), transfer (Tier 7), and the effects of transfer, including business impact (Tier 8). The more precise distinction between knowledge, decision-making ability, and actual task performance is useful because it helps us identify specifically where the chain breaks down.
Most eLearning in corporate L&D operates at Tier 4 at best. We're testing whether learners can recall information immediately after the training. We're rarely testing whether they can make better decisions in context, whether they can actually perform the task under realistic conditions, or whether the performance improvement transfers to the job.
The honest version of this: when my team was producing modules at scale, most of our evaluations were Level 1 satisfaction surveys and LMS completion data. That's Tiers 1-3 in LTEM terms. It told us whether content existed and whether people consumed it. It told us almost nothing about whether people could do anything differently because of it.
I'm not saying that's uniquely bad or unusual. It's the norm in the field. But the norm is not defensible when we're asking organizations to spend hundreds of billions of dollars on training and claiming it changes behavior. At some point the gap between what we claim and what we measure becomes a credibility problem.
When my team was at peak production, the place where we pushed evaluation furthest was role-play assessment. We ran role plays with all of our learners, and the design decision that changed the most was who was in the room for them.
We stopped having the trainer who taught the class assess their own learners. Instead, we required a different trainer to conduct the evaluation. That's a small structural change that has a big effect: trainers who taught the material have an unconscious stake in learners performing well. Using a different assessor removes that bias and gives you a more honest read on whether the skill actually transferred.
The bigger shift was bringing managers and SMEs into the process, either as observers or as the ones conducting the role play itself. SMEs in particular were often the loudest critics after the fact, the ones in cross-functional meetings saying learners weren't trained properly or didn't understand the process. When we put those same people in the room during evaluation, the dynamic shifted. They could see directly what learners understood and where gaps existed. And when they saw a gap, instead of a general complaint weeks later, they'd come to us specifically: "Hey, this concept isn't landing" or "In practice it works differently than how you explained it."
That last piece, the gap between how a process was documented or described by the product team and how it actually worked in the field, surfaced consistently through role-play evaluation in a way it never had through satisfaction surveys. Learners were applying what we taught them. The issue was sometimes that what we taught them was slightly wrong, or that the real-world version of the process had diverged from the official one. You don't find that with a knowledge check. You find it when someone has to perform the task and a field expert watches them do it.
Leadership found this approach credible specifically because it involved the business in the evaluation rather than asking them to trust L&D's internal data. It answered the "are they actually trained?" question in a way that completion reports never could.
🔎 Leading Indicators vs. Lagging Ones
One of the most useful reframes I've encountered is the leading vs. lagging indicator distinction applied to learning measurement. Most L&D teams focus on lagging indicators: did performance improve? Did the business metric move? Those are real and important, but they show up late, they're hard to attribute cleanly, and by the time they appear, the window to adjust the design has often closed.
Leading indicators for learning effectiveness might include: Are learners applying specific skills within 30 days of training? Are managers observing the target behaviors in the field? Are we seeing a reduction in the problem the training was designed to address (error rates, support tickets, compliance incidents)? These are harder to design for than completion rates, but they show up faster and give you something to act on.
Josh Bersin's February 2026 research found that companies using AI-native learning approaches are 6x more likely to exceed their financial targets than those using traditional approaches. I think that finding is partly about AI efficiency and partly about something else: organizations that are thoughtfully designing learning systems are also organizations that are thinking more carefully about what learning is supposed to produce. The measurement rigor and the design rigor tend to show up together.
The connection to AI tools is real here. Better data infrastructure makes leading indicator measurement more feasible. If your LMS, your performance management system, and your operational data sources are connected, you can start to see patterns that were previously invisible. Not causation, but correlation that's worth investigating. The teams doing this well aren't treating AI as a content generation tool. They're treating it as a system that connects development to performance data in ways that were previously manual or impossible.
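To give a feel for what "connected data" looks like at its simplest, here's a minimal sketch in Python. The file names, columns, course ID, and metric are all hypothetical; the point is that once completion records and performance records share an employee identifier, "did anything change after training?" becomes a join and a group comparison rather than a guess.

```python
import pandas as pd

# Hypothetical exports: an LMS completion report and an operational quality report.
# Assumed columns: employee_id, course_id, completed_at / employee_id, review_date, error_rate
completions = pd.read_csv("lms_completions.csv", parse_dates=["completed_at"])
performance = pd.read_csv("quality_scores.csv", parse_dates=["review_date"])

# Narrow to one course of interest and join on the shared employee identifier.
course = completions.loc[completions["course_id"] == "DISPUTES-101",
                         ["employee_id", "completed_at"]]
merged = performance.merge(course, on="employee_id", how="inner")

# Tag each performance record as before or after that employee's completion date.
merged["period"] = (merged["review_date"] >= merged["completed_at"]).map(
    {True: "after training", False: "before training"}
)

# Compare the average error rate before vs. after. This is correlation, not
# causation -- a leading indicator worth investigating, nothing more.
print(merged.groupby("period")["error_rate"].mean())
```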
🔧 A Practical Framework That Doesn't Require Measuring Everything
The measurement problem doesn't have a solution that requires evaluating every program at Level 4. That's not realistic, and the researchers who designed these frameworks don't recommend it. What is realistic is building a tiered approach to measurement that matches evaluation rigor to program stakes.
A workable starting point:
Tier 1 (default): Completion + satisfaction. Use this for compliance training, orientation content, and low-stakes reference material. It tells you whether content was consumed and whether learners found it acceptable. That's sufficient for this tier.
Tier 2 (standard): Add a knowledge check + short application observation. Use this for any skills-based training with a clear behavioral objective. A knowledge check at the end of the course plus a structured manager observation 30 days post-training gives you a legitimate Kirkpatrick Level 3 data point without a massive evaluation investment. The key is designing the observation criteria before the course launches, not after.
Tier 3 (priority programs): Full Level 3-4 design. Reserve this for programs where behavior change directly connects to a measurable business outcome: sales training, safety procedures, customer-facing skills, leadership development at scale. Use isolation techniques (control groups where feasible, manager ratings before and after, business metric tracking) and document your attribution assumptions honestly. This level of rigor is expensive. Apply it where the stakes justify the cost.
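For those Tier 3 programs, the simplest isolation technique that's usually feasible is comparing trained and untrained groups on the same metric over the same window. Here's a minimal sketch of that difference-in-differences logic, with entirely made-up numbers:

```python
# Difference-in-differences: did the trained group improve *more* than a
# comparable untrained group over the same period? All numbers are made up.
trained    = {"before": 62.0, "after": 71.0}  # avg close rate (%), reps who took the program
comparison = {"before": 61.0, "after": 64.0}  # avg close rate (%), comparable reps who didn't

trained_change    = trained["after"] - trained["before"]        # +9.0 points
comparison_change = comparison["after"] - comparison["before"]  # +3.0 points (tenure, market, etc.)

# The difference of the differences is the part we can plausibly attribute to
# the program -- still an estimate, but an honest, documented one.
print(f"Estimated program effect: {trained_change - comparison_change:+.1f} points")
```

The comparison group doesn't have to be a formal control cohort; a later training wave or a region that hasn't rolled out yet often works, as long as you document the assumption.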
The practical starting point for most teams isn't redesigning the entire evaluation framework. It's identifying one program in your portfolio, ideally one you have influence over from design through delivery, and building measurement into the design from the beginning. Not as an add-on. As a design requirement. "How will we know if this worked?" is a question that should be answered before a single module is built.
That single shift, treating evaluation design as a prerequisite rather than a post-launch activity, changes the conversation with business partners, changes the kind of content that gets built, and starts building the data discipline that eventually lets you answer the harder questions.
The honest version of where our measurement practice landed: it was still evolving when I left. That's not a failure, but it's the truth.
One thing we did implement that I think more teams should try: we used Claude (Anthropic) in an evaluation design role. Not to replace an evaluation expert, but to pressure-test our thinking. We'd describe a program, its learning objectives, our current assessment approach, and ask the model to play the role of an evaluation specialist and help us identify gaps, suggest alternative assessment methods, and think through what we could realistically measure both during and after delivery. That process surfaced approaches we hadn't considered and pushed us toward more rigorous designs faster than we would have gotten there alone.
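If you want to try something similar, the setup is straightforward. Here's a rough sketch using Anthropic's Python SDK; the model name is a placeholder, and the role prompt and program brief are paraphrased illustrations rather than our actual wording.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

EVALUATOR_ROLE = (
    "You are a learning evaluation specialist. Given a program description, its "
    "learning objectives, and the current assessment approach, identify gaps in the "
    "evaluation design, suggest alternative assessment methods, and list what could "
    "realistically be measured during and after delivery."
)

program_brief = """
Program: 90-minute eLearning module on handling billing disputes.
Objective: reps resolve disputes within policy without escalating to a supervisor.
Current assessment: 10-question knowledge check, 80% pass threshold.
"""  # illustrative brief, not a real program

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use whatever current model you have access to
    max_tokens=1500,
    system=EVALUATOR_ROLE,
    messages=[{"role": "user", "content": program_brief}],
)
print(response.content[0].text)
```

The value wasn't the output itself so much as the forcing function: writing the brief made us state the behavioral objective and the current assessment plainly, which is half the evaluation design work.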
The area I wish we'd gotten further on before I left: connecting L&D measurement to coaching. We were doing well with content-based evaluation, but the missing link was a systematic way to work with the people team and line managers to build coaching that reinforced what was being trained. We had pockets of this. It wasn't systematic. The gap between "learner completed the training" and "manager is actively coaching to the skill" is where a lot of our behavior change data would have lived, and we never fully closed it.
That's the real frontier in L&D measurement right now. Not better smile sheets. Better integration between learning design, manager behavior, and the performance systems that capture what happens after the LMS completion event.
🎯 The One Thing to Do This Week
Pick one program currently in development on your team and write down the answer to this question before anything else gets built: "What would an employee do differently six months from now if this program worked?" If you can't answer that clearly, the program isn't ready to build yet.
I'm curious where other L&D leaders are on this. Not the aspirational version. What does your team actually measure, what have you stopped trying to measure because it was too hard, and what would need to change in your organization for honest evaluation to be possible? That's the conversation I want to keep having.
-- Eian
Sources
- Sopact. (2025). L&D evaluation practices benchmarking data. sopact.com
- Training Magazine. (2025). Training industry report 2025. Training Magazine. trainingmag.com
- Watershed LRS. (2024). The learning measurement report. watershedlrs.com
- Gyde.ai. (2026). L&D measurement benchmarking report 2026. gyde.ai
- Kirkpatrick Partners. (2026). The Kirkpatrick model: New directions in evaluation. kirkpatrickpartners.com
- Thalheimer, W. (2024). LTEM Version 13: The learning-transfer evaluation model. Work-Learning Research. worklearning.com
- Phillips, J. J., & Phillips, P. P. (2016). Handbook of training evaluation and measurement methods (4th ed.). Routledge.
- Bersin, J. (2026, February). AI and the reinvention of corporate training. Josh Bersin Company. joshbersin.com
- The Learning Guild. (2025). Measuring learning effectiveness: Research report. learningguild.com
- ATD. (2025). State of the industry report 2025. Association for Talent Development. td.org