It is often appealing to think of technical communication as a process of answering users’ questions. The difficulty with this view is that one answer can have many questions. If you answered each of those questions separately, you would be providing substantially the same answer over and over again.
This is very easy to see on StackOverflow, a question and answer site for programmers. Privileged users of StackOverflow can mark a question as a duplicate of another question. Here’s an example:
The question here is “How to check if a variable is a dictionary in python”. This is a question that programmers are going to ask themselves many times. It is a specific instance of a more general question, which is, “how do you check if a variable is of a specific type in Python?”
The programmer may know that their question is an instance of the more general question, or they may not. This depends on how much they know about types in programming languages generally. Even if they do know about type theory, however, they may not think to generalize their question (or to do a search for the more generalized question). Generalizing a question requires mental effort and a person struggling with this problem may not have a lot of additional mental energy to spare. Stress makes us dumb.
Whether or not the programmer realizes it, the answer to this question is largely the same whether you want to find out if your variable is a dictionary, a tuple, an exception, or a number: you use either the type() function or the isinstance() function.
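To make that concrete, here is a minimal sketch of the two approaches. The helper names is_exactly_dict and is_dict_or_subclass are hypothetical, made up for illustration and not taken from any of the StackOverflow answers; type(), isinstance(), and collections.OrderedDict are standard Python.

```python
from collections import OrderedDict


def is_exactly_dict(value):
    """Hypothetical helper: True only when value's type is dict itself."""
    return type(value) is dict


def is_dict_or_subclass(value):
    """Hypothetical helper: True for dict and for any subclass of dict."""
    return isinstance(value, dict)


plain = {"a": 1}
ordered = OrderedDict(a=1)  # OrderedDict is a subclass of dict

print(is_exactly_dict(plain), is_dict_or_subclass(plain))
print(is_exactly_dict(ordered), is_dict_or_subclass(ordered))

# The same pattern applies to other types; isinstance() also accepts
# a tuple of types to test against several at once.
print(isinstance((1, 2), tuple))
print(isinstance(3.5, (int, float)))
print(isinstance(ValueError("oops"), Exception))
```

The two checks agree on the plain dictionary and disagree on the OrderedDict, because type() compares the exact type while isinstance() also accepts subclasses.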
Since these functions work somewhat differently, you will need to decide which to use. And this means that the answer to this question is substantially similar to the answer to another question: “[What are the d]ifferences between isinstance() and type() in Python[?]”
And because of this, someone has marked the question about finding out if a variable is a dictionary as a duplicate of the question about the difference between isinstance() and type(). That question is shown in the screenshot below:
Now let’s be very clear about this. “How to check if a variable is a dictionary in python” and “[What are the d]ifferences between isinstance() and type() in Python” are not remotely the same question.
- The only words they have in common are “in Python”.
- One is asking how to do something; the other is asking about the difference between two features.
- You would have to know the answer to the first question to know that the second question was in any way relevant to what you are trying to do.
- A person asking the first question is not likely to think that the second question might contain the answer they are looking for. A person asking the second question is not likely to think that the first question might contain the answer they are looking for.
- No search engine is likely to identify one as the answer to the other either, though the answer to “How to check if a variable is a dictionary in python” will at least contain references to type() and isinstance(), which might give it a clue.
They are, in short, not duplicate questions at all.
They don’t even really have the same answers. Their answers contain substantially the same information, but that does not make them the same answer. A good answer relates information to the question that was asked, and to the user’s experience and vocabulary. The two questions are asked by people with different levels of knowledge, experience, and vocabulary. Their respective answers must take account of that. They are therefore not the same answer, though they may contain some of the same information.
And yet, someone has marked these questions as duplicates.
Fortunately, StackOverflow does not delete a question or its answers just because they are marked as duplicates. The question and the existing answers remain on the site where they can be found by people who are actually asking the first question and have no idea that the second one might contain the information they are looking for.
Unfortunately, StackOverflow does not let you add a new answer to a question once it has been marked duplicate, which is potentially a problem, since there might still be an opportunity to add a better answer to the specific question being asked.
Why do questions like these get marked as duplicates when neither the questions nor the answers are actually duplicates (just the information cited in the answers)? A big part of it is surely our old friend the Curse of Knowledge. Once you know a little bit about how types work in Python, your mind goes straight to the choice between type() and isinstance() for any question related to discovering types. You know the actual answer so well you skip over it entirely to talk about the interesting differences between the alternative approaches.
The interesting thing about the curse of knowledge is just how fast it strikes. Once we learn something, we are under the curse immediately. Thus the person who asked the question about the difference between type() and isinstance() edited their question to link to another question, saying “This seems to be discussed already”. The question they linked to is “What is the best (idiomatic) way to check the type of a Python variable?”
Now, this is actually a different question again. It is asking which method is most idiomatic, which is not an issue that either of the other questions raised. A good answer to this question should address the question of idiom, which the others might not address. So it is not the same answer either, though again the answer will contain substantially the same information.
And yet, once the person who asked the second question had received their answer, the curse of knowledge immediately took hold and they then regarded their own question as a duplicate of the third question.
The kicker here is: The third question had also been marked as a duplicate of the second one. Which, again, it is not, despite its answer containing substantially the same information. What we see, therefore, is not a hierarchy with the canonical question at the top and various deprecated variants below, but a pattern of random and often conflicting accusations of duplication, probably based on which question the particular person saw first.
Despite containing the same information in their answers, all three questions have attracted many thousands of views, and doubtless will continue to do so as other people continue to ask these three very different questions.
Despite the fact that they are actually distinct, however, marking these questions as duplicate does have a useful result. It serves to link the questions together, which is useful exactly because their answers do contain substantially the same information (despite not being the same answers). This makes a wider pool of information, more varied forms of expression, and more diverse code samples available to readers, all of which increases the chances that they will get the best solution to their problem.
It would be better, therefore, if StackOverflow made a distinction between duplicate questions and similar answers. Genuinely duplicate questions should certainly be marked as such. But a separate way to mark distinct questions as having similar answers would go a long way towards avoiding the false labeling of very different questions as duplicates.
The curse of knowledge might interfere with people’s ability to make the distinction, but having the two categories available would help people make the distinction correctly, and might go a long way to address some of the disputes you find over whether or not questions are duplicates.
This isn’t just a problem for StackOverflow and sites like it, though. The difference between duplicate questions and distinct questions with similar answers is one that matters to all of tech comm and directly affects our readers’ ability to find the answers they are looking for.
Whether they browse a TOC in a paper book or type a query into Google, people do not search for answers; they search for questions. Literally, they type their question into the search box and hit Enter. They search for questions because they don’t know the answer. They search for what they don’t know in terms of what they do know, because there is no other way to do it.
When we organize and categorize, it is very easy (and very convenient) to forget this fundamental fact. We organize and categorize content based on our full and blinding knowledge of the subject matter and of the content. If we imagine our users asking questions, we get the questions wrong because we already know the answers. If we look at actual questions, we tend to group and to paraphrase the questions into consistent forms that we recognize, and in so doing lose all hint of the original confusion and ignorance that went into forming those questions.
The reader’s path cannot be made straight. We must not imagine that we can lead every reader directly to every answer. Rather, we must provide many paths through the wilderness, pausing over and over to reorient the reader as they work their way through the quagmire of sense making.