By Susan J. Leclair, PhD, CLS (NCA)
I know it is a common belief that statistics and lies are synonymous, mostly, I think, because statistics seems so foreign to most people. Statistics tries to answer only a few questions in numbers, but it uses several different methods to get the answers.
- Do groups of similar events/items/etc. have any commonality?
- If they do, is there a wide or narrow range?
- Do two, three or more groups that are not similar at first glance have anything in common?
- Can I rely on the information I have just learned?
Let’s start with some simple definitions using the language of science. For most people, an “anecdote” is an interesting story about a real person or event, as in “I went to work today.” It could be a lie, in that you didn’t go to work today but are saying so to avoid punishment, or it could be true. Either way, the sentence refers just to you. It doesn’t mean that everyone went to work today. It doesn’t mean that most people went to work today. Just that you did.
Data are anecdotes connected by time, people, place or some other significant element. So, you went to work today, and that is of interest to your boss and to the bookkeeper of your business. Data would be: of the 100 people who work with you, 80 came to work today. Data, then, must contain additional facts that are specific to the topic, share the same details, and can answer the questions of who, what, where, how and why. Again, for example, that you and a hundred others went to different worksites today would be important if you all lived in the same locale or took the same method of transportation. That you went to work in an office in one city and I went to work in another business over a thousand miles away is not data, since these events are not really connected. Now, if 900 people work in the same factory, then you might be interested in knowing the who, what, when and why of all those people coming together.
A lot of people confuse anecdotes with data. For example, in today’s world of social distancing, an anecdote might be that I shop for groceries every Tuesday morning rather than at any of the 6 other early morning times set aside for the elderly. Data might be that over 100 people from one village shopped during the time set aside for the elderly on Tuesday. That would cause someone to ask a) why then? or b) how many others did not choose to go at that time? or c) why did the store choose those hours? Anecdote is one; data is plural.
So, rule number one is that data used in statistics are population based and are valid for populations only. To use today’s concerns: if the prevalence of COVID-19 is 2% of a population, it doesn’t mean that I would be able to point out who in the population has COVID-19. It means that, somewhere in that population, about 2% of the people are probably positive.
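To make the difference between one anecdote and a set of data concrete, here is a minimal Python sketch. The attendance list mirrors the "80 of 100 came to work" example; the population size of 5,000 and the helper names `proportion` and `expected_positives` are assumptions made up purely for illustration.

```python
# Hypothetical records for 100 coworkers: True means "came to work today".
attendance = [True] * 80 + [False] * 20


def proportion(records):
    """Fraction of the group for which the record is True."""
    return sum(records) / len(records)


def expected_positives(prevalence, population_size):
    """How many people we expect to be positive, not which ones."""
    return prevalence * population_size


# One anecdote: a single person's record says nothing about the group.
anecdote = attendance[0]

# Data: the same kind of fact collected across the whole group.
attendance_rate = proportion(attendance)

# A 2% prevalence in an assumed population of 5,000 gives a count only.
positives = expected_positives(0.02, 5_000)

print(f"Anecdote (one person came to work): {anecdote}")
print(f"Data (share of the group at work): {attendance_rate:.0%}")
print(f"Expected positives at 2% prevalence: about {positives:.0f}")
```

The last figure is a statement about the group as a whole. Nothing in the calculation can say which individuals are the positive ones.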
If you can use only data from populations, then whom you choose to include in your populations becomes critical. For example, if I wanted to know the frequency of diabetes in the United States, then picking out 100 people entering a diabetes clinic would be silly, since those people would most likely already be diagnosed with diabetes or prediabetes. Otherwise, why would they be there? If I wanted to know the incidence of alcoholism in a city, would I really want to test children under the age of 10? If you recall, many news reports comment on research studies by including the number of people involved or other limitations. My personal favorite example of horrid statistics was a drug trial of 2 people in which one lived and one died, and the conclusion was a 50% success!
In an ideal world, you would include everyone, but that world does not exist, so you choose a population large enough to be reasonably reflective of the total population you wish to study. So, for a hypothetical example, let’s choose a drug trial for treatment of COVID-19. A lot of people are available for this, so if you need to choose 1000 people for this trial, whom do you pick? Who gets COVID-19? Everyone. So, the elderly, adults, teens, and children must be included, and you would have to make sure that the genders were reflective of the general population. One current question is whether you should include people with other illnesses (cardiac disease, diabetes, malignancies, etc.) in the general studies, or whether those conditions are significant enough by themselves to require unique studies.
When you hear the reporter say “it was a small study” or “it was just in one hospital” or “the study concentrated on the military” and so on, they are trying to describe the limitations inherent in the report. Your response to these reports should be to find out if the study participants “look like me”. If they do, then the next step is to find out the total number of participants. Ten is better than two. One thousand is better than five hundred. One million is better than one hundred thousand.
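One way to see why bigger numbers of participants are better is to look at how much uncertainty surrounds a result at each size. The sketch below uses the standard normal-approximation margin of error for a proportion, z * sqrt(p(1 - p) / n) with z = 1.96 for roughly 95% confidence. That formula is not part of this article and is brought in here as an assumption; it is also only a rough guide for very small groups such as the two-person trial. The function name `margin_of_error` is made up for illustration.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p seen in n people.

    Uses the normal approximation, which is crude when n is very small.
    """
    return z * math.sqrt(p * (1 - p) / n)


# The same observed result (50% success) at the group sizes from the text.
for n in (2, 10, 500, 1_000, 100_000, 1_000_000):
    moe = margin_of_error(0.5, n)
    print(f"n = {n:>9,}: 50% success, give or take about {moe:.1%}")
```

The margin shrinks as the group grows, which is the arithmetic behind "ten is better than two." Size alone does not rescue a badly chosen group, though: a million people from the diabetes clinic would still be the wrong population.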
Susan Leclair, PhD, CLS (NCA) is Chancellor Professor Emerita at the University of Massachusetts Dartmouth; Senior Scientist at Forensic DNA Associates; and Moderator and Speaker, PatientPower.info – an electronic resource for patients and health care providers.