By Gordon Rugg
So what is “evidence-based” anyway, and why do so many people make such a fuss about it?
In this article, I’ll look at the context of “evidence-based” and at some common misconceptions and mistakes about it.
It’s a journey through the limitations of logic, through the legacy of theology on modern debate, and through the nature of evidence.
It starts with a paradox that took over two thousand years to solve, involving pointy sticks and tortoises.
Zeno’s paradox, and the limits of logic
The Ancient Greeks were very keen on logic. They viewed it as a way of getting past messy surface details, and into the underlying principles of how the universe worked.
It’s a nice idea. However, reality has a habit of throwing up awkward surprises. One of them is a paradox that undermined the whole body of Ancient Greek assumptions about logic. It’s named Zeno’s paradox, after the Ancient Greek philosopher who invented it. It involves a chain of reasoning that is clearly completely wrong; however, it took well over two thousand years before anyone was able to explain why it was wrong. It’s still highly relevant today, because it points out the dangers in trusting plausible logical arguments that aren’t checked against reality at each step.
There are several versions of the paradox. One involves an archer and a target; another famous version involves Achilles and a tortoise. I’ll use the archer and target version.
It begins with an archer shooting at a target. What happens when we look in detail at the stages involved? One key stage is that the arrow needs to cover the distance between the bow and the target. Before it can reach the target, it needs to travel part way to it, like this.
So far, so good; there’s no obvious flaw in this reasoning.
However, before it can travel for that part of the distance, it needs to travel for a shorter part of the distance, like this.
You can probably see where this is going. Before it can travel that shorter part of the distance, it needs to travel an even smaller distance, like this.
Now comes the sting in the tail, and it’s a big sting.
You can infinitely repeat this process of reducing the distance it has to travel. So if the reduction process can go on for ever, how can the arrow even start to move, let alone cover the whole distance?
The conclusion is clearly wrong; that’s not in question. The problem is working out why it’s wrong.
It’s a very important problem, because if we can’t work out the flaw in a chain of reasoning that has such an obviously wrong conclusion, what hope do we have of spotting flaws in other chains of reasoning that come up with plausible-looking conclusions?
If you’re still unconvinced about whether this matters, here’s an example that might make the implications clearer. Imagine that you’re about to travel on an aircraft that’s had a very dodgy-looking repair to one wing. You ask the flight attendant if the repair is safe, and they reply that it ought to be safe. Would that reassure you? Probably not; most people would prefer to know that someone had actually tested the repair, and confirmed that it really was safe.
In the specific case of Zeno’s paradox, mathematicians eventually found the flaw in the chain of reasoning. It took them into some territory that was very strange indeed, involving the nature of infinity and infinitesimals. If you’d like a small taste, I’ve written about it here.
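The mathematical resolution can be sketched briefly. Zeno's halving steps form the series 1/2 + 1/4 + 1/8 + …, and the flaw is the hidden assumption that infinitely many steps must sum to an infinite total. They don't: the partial sums converge to a finite limit. This small sketch (my illustration, not part of the original article) shows the convergence numerically.

```python
# Zeno's halving steps: 1/2 + 1/4 + 1/8 + ... of a unit distance.
# The resolution of the paradox: infinitely many steps can still sum
# to a finite value, because the partial sums converge to a limit (1).

def partial_sum(n_terms):
    """Sum of the first n_terms halving steps of a unit distance."""
    return sum(0.5 ** k for k in range(1, n_terms + 1))

for n in (1, 2, 10, 50):
    print(f"after {n:2d} steps: {partial_sum(n):.12f}")

# The sums approach 1 (the full distance); the time per step also
# shrinks geometrically, so the arrow arrives in finite time.
```

The same point holds for the time taken: each ever-smaller distance takes an ever-smaller time, so the infinite subdivision is compatible with the arrow arriving.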
However, the more general point remains as valid as it was on the day when Zeno first made it. We can’t trust a chain of logical reasoning on its own merits; it might contain a hidden flaw as subtle and as profound as the one in Zeno’s paradox. Instead, we have to test each link in the chain of reasoning against reality, to make sure that each link is sound. That’s a key part of evidence-based approaches.
So far, so good.
Unfortunately, some interpretations of “testing each link” are as subtly but profoundly flawed as Zeno’s paradox. I’ll concentrate on a particularly widespread misinterpretation, which has much in common with one form of religious argument (and which probably started off in religious argument, as a way of demonstrating that the argument was consistent with accepted religious tenets).
It involves treating a chain of reasoning as if it were a charm bracelet, with reassuring talismans hanging off the links in the chain at frequent intervals. Instead of physical talismans, this approach usually invokes the names of Great Thinkers From The Past plus a recent authority, using quotations that look as if those prestigious people agree with the assertions in question.
The charm bracelet model
In this model of reasoning, the key principle is to have a supporting reference for each link in the chain of logic, preferably from a high-status name.
At first glance, this looks a lot like scientific writing, which consists of a series of assertions, each followed by references to the literature. In reality, though, the charm bracelet model is a weak, dark imitation of a scientific chain of reasoning; hence the thin links and the choice of colour in the image above.
So where’s the difference?
The difference is that the charm bracelet model is selectively looking for references that agree with the assertion in question. It’s not looking at the overall pattern of evidence relating to that assertion, to see how well the argument is corresponding with reality.
Why does that matter? Imagine that we’re testing a real, physical anchor chain, like the one in the image below, for the person who manufactured that chain.
If it breaks, people might die, so we would want to check each link as thoroughly as possible, using a range of tests, because different tests will pick up different types of flaw. A link might have a type of flaw that is missed by one type of test, but that is detected by a different type of test.
In that context, would we only report the result from the test that gives the “it’s all fine” result, and ignore the other results? If we did, and the chain failed, we’d be looking at some serious liabilities for negligence. We would have obvious moral and legal responsibilities to report the full set of results, including the ones that said the chain was flawed.
The moral and practical advantages are obvious. So why doesn’t everyone use the “full testing” model?
One simple reason is that the world is often messy and untidy. There’s usually a large number of tests that could be applied to a given problem, ranging from ultra high tech and highly respectable tests down to more questionable ones such as Dr Omura’s bi-digital O ring technique (and no, I’m not making that one up, and yes, it is just as bizarre as it sounds).
We have to make judgment calls about what’s worth considering and what isn’t. Different people make different calls. In science, there’s usually fairly close agreement between researchers about what should definitely be included and about what should definitely be excluded, but there’s usually a grey, uncertain area in between.
Another reason that people don’t use the “full testing” model is that they’ve simply not grasped the difference between it and the charm bracelet model. The charm bracelet model has been around for a long time, and most people are familiar with it. When they see something that looks superficially similar, in the form of the scientific “assertion plus relevant references” format, they will probably think it’s just another version of what they already know, rather than realising that it’s profoundly different beneath the surface.
The charm bracelet model is also less effort. You just have to find one supporting reference for a link, and then you can move on to the next link. That’s much easier than looking at the whole body of literature relevant to each link, and deciding which references are most appropriate for each link, which can take a very long time indeed if you’re dealing with a complex issue.
To misquote Terry Pratchett: a half-truth can be halfway round the world before the truth has even got its boots on. Half-truths are in many ways worse than lies, because a half-truth contains enough truth to be plausible, which means that it can mislead people for a very long time before the truth finally displaces it. If you’re dealing with a topic like finding causes or treatments or cures for major health problems, then misleading half-truths can cause a lot of damage, including suffering and death. That’s why getting it right is more than an academic search for understanding; it’s an issue with major implications for people’s lives.
Evidence-based approaches, in medicine and elsewhere
The sections above looked at the way that evidence is deployed within arguments. This section looks at a slightly different issue, namely evidence-based approaches to policy decisions.
When the medical community started looking systematically at the evidence relating to assorted medical issues, they found quite a few completely unexpected results, where the evidence showed the opposite of what had previously been universally assumed to be true. Usually this was because something had appeared so logical and self-evident that it had never been seriously tested before. Those unexpected discoveries were a major wake-up call.
Other disciplines have spotted the implications, and are keen to use evidence-based approaches to tackle their own problems.
One common way of attempting this is to use methods that have worked well in medicine, in particular the randomised controlled trial (RCT) and the Systematic Literature Review (SLR).
Both of these have been very effective in medicine, but there’s a major difference between medicine and many other disciplines that needs to be considered before deploying these approaches in other disciplines.
In medicine, the treatments being used, and the populations on which they are being used, are usually crisp sets that are homogeneous in relation to the variable being measured. These are key features of randomised controlled trials and systematic literature reviews. In most disciplines, the situation is different; either the treatments or the populations or both are fuzzy sets. In principle, it’s possible to work round this with sophisticated statistics and research designs, but in practice, this issue severely limits the applicability of RCTs and SLRs in fields other than medicine.
Here’s an example. In medicine, you might be testing the effectiveness of a drug versus a placebo. There will be extremely tight quality control on the drug and on the placebo. The “drug” patients will all receive essentially identical doses of exactly the same drug as each other; the “placebo” patients will all receive essentially identical doses of exactly the same placebo as each other. Similarly, if the drug is being tested on patients diagnosed as having a particular condition, there will usually be a diagnostic test that clearly identifies whether or not the patient has that condition. There may be different levels of severity of the condition, but the key point is that the test divides people crisply into those who definitely have the condition versus those who definitely don’t.
This makes it comparatively easy to combine the results from different studies of the same drug on patients with the same clearly defined condition. In all those studies, the drug will be chemically identical, and the patients will all have an unequivocal objective diagnosis as having the same condition. You’re comparing like with like.
The situation is different in many other fields, and also in some areas of medicine. In the case of medicine, RCTs and SLRs run into problems when you move away from drug-based interventions into interventions that can’t be so tightly constrained, such as testing massage as an intervention for back pain. Similarly, if you’re dealing with conditions such as dyslexia that don’t have a single objective diagnostic test, the populations involved in the tests probably won’t be homogeneous. You’re no longer comparing like with like. Yes, you can do statistical manipulations to address the variation, but you’re dealing with a much more complex problem area, and there’s the risk that you’re missing some key variable and in consequence producing results that look very impressive but that are fatally flawed.
So what can you do if you’re in a field like education, where the interventions and the populations are usually both very far from homogeneous?
For researchers, the answers are well known. You need to make sure that you’re using research methods properly. You take due care about the choice of intervention, and the choice of metrics and measures, and about operationalisation, and choice of research design, and choice of statistical tests, and all the other things that are part of proper research.
For people who aren’t professional researchers, one take-home message is that doing proper evidence-based research requires heavy-duty expertise in research methods.
That doesn’t mean that the rest of the world needs to stand aside and let the experts slug it out. There are a lot of approaches that anyone can use to clarify the issues, and to conduct a well-informed discussion about policy and practice.
One particularly clean, simple and powerful approach is systematic visual representation of evidence using methods such as formal argumentation. With this, you can show a chain of reasoning and evidence using clear, consistent diagrams. It’s a good way of spotting trends in evidence, and also gaps in evidence. I’ll discuss argumentation in more detail in a future article. Here’s one example of how you can get a quick, powerful overview of the evidence by using visualisation systematically.
In this hypothetical example, the vertical axis shows how strong the claimed effect of an intervention is, and the horizontal axis shows the strength of quality assurance in the publications involved. The numbers in the black shapes are identification numbers for the different publications, so you can keep track of which publication is where on the diagram.
What this diagram shows is that the publications with the loosest level of quality assurance (publications 4 and 8) are claiming that the intervention is a wonder cure, but as you move towards the publications with tighter quality control, the reported strength of effects grows progressively weaker, until it’s nearly zero in the publication with tightest quality control (publication 6).
This is a common pattern, where the tabloids and the blogosphere make huge claims about some new idea, but those claims rapidly fall apart when they’re looked at more closely and more rigorously. The diagram is simple, but it shows what’s happening systematically and clearly.
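The pattern the diagram describes can be mimicked with a small sketch. The publication IDs, quality-assurance scores, and effect strengths below are hypothetical numbers I have chosen to match the pattern in the text (loose-QA publications 4 and 8 claiming a wonder cure, tightest-QA publication 6 finding almost nothing); sorting by quality assurance makes the shrinking effect visible.

```python
# A sketch of the pattern described above, with made-up numbers:
# each publication gets a quality-assurance score (higher = tighter QA)
# and a claimed effect strength (0 to 1). Sorting by QA shows the
# claimed effect shrinking as scrutiny tightens.

# (publication id, quality-assurance score, claimed effect strength)
publications = [
    (4, 1, 0.95),  # loose QA, wonder-cure claim
    (8, 1, 0.90),  # loose QA, huge claim
    (2, 2, 0.60),
    (7, 3, 0.45),
    (1, 4, 0.30),
    (3, 5, 0.20),
    (5, 6, 0.10),
    (6, 7, 0.02),  # tightest QA: effect near zero
]

for pub_id, qa, effect in sorted(publications, key=lambda p: p[1]):
    bar = "#" * int(effect * 20)
    print(f"pub {pub_id}: QA={qa} effect={effect:.2f} {bar}")
```

Even this crude text-based version makes the trend obvious at a glance, which is the whole point of systematic visualisation: the pattern is in the data, and the representation just has to let it show.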
So, in conclusion:
- Argument from logic and “it stands to reason” doesn’t stand up well to reality
- You need to test each link in your chain of reasoning against the evidence
- You need to look at the big picture of the evidence, not just the bits of evidence that agree with your pet theory
- Assessing the evidence properly isn’t easy
- Assessing the evidence properly is possible; we can make progress.
I’ll end on that positive note.
Notes, links and sources
For fellow pedants:
- Yes, the archer is a Mongolian woman, not an ancient Greek. I’ve used this image because Mongolian archery isn’t as widely known as it should be.
- Yes, the tortoise is a respectable Greek-type tortoise like the ones that Zeno would have known.
- Yes, I know about anchor cables versus anchor chains; unfortunately, not everyone in the world appreciates such precise distinctions, so I’ve gone for simplicity this time.
I’ve indicated specialist terms from other fields in bold italic. All of the specialist terms I’ve used in this article are well covered in easily accessible sources such as Wikipedia, if you’d like to read about them in more detail.
You’re welcome to use Hyde & Rugg copyleft images for any non-commercial purpose, including lectures, provided that you state that they’re copyleft Hyde & Rugg.
There’s more about the theory behind this article in my latest book:
Blind Spot, by Gordon Rugg with Joseph D’Agnese
Sources for images