Too Big to Browse; Too Small to Search

By Mark Baker | 2012/03/03

Findability continues to be the bete noire of technical communication. This may be a parallax error, but it seems that findability is more of a problem in technical communication than in other fields. The reason, I suspect, is that many technical documentation suites are too big to browse but too small to search.

I have commented before, in The Best Place to Find a Needle is a Haystack, on the somewhat counterintuitive phenomenon that on the Web it is easier to find a needle in a haystack. It is easy enough to explain, though: search (if it is any more sophisticated than simple string matching) is essentially a statistical analysis function. A search engine works by discovering a statistical correlation between a search string and a set of resources in its index.

All statistical calculations are dependent on the amount of data available. We are reminded of this every time we see a poll reported on the news. “The results are considered to be reliable plus or minus 5% 19 times out of 20.” The more people who are polled, the smaller the poll’s margin of error is considered to be. (Most of the informal polls that people post on the web or in forums have a margin of error the size of Texas.)
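The relationship between sample size and margin of error can be sketched with the standard textbook approximation for a polled proportion. This is only an illustration: the function name and the sample sizes are my own choices, and it assumes a simple random sample, which real polls (and certainly informal web polls) rarely are.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a polled proportion.

    z=1.96 corresponds to a 95% confidence level ("19 times out of 20").
    proportion=0.5 is the worst case, which gives the widest margin.
    Assumes a simple random sample (an assumption, not a given).
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Illustrative sample sizes only; the margin shrinks as the sample grows.
for n in (10, 100, 400, 1000, 10000):
    print(f"{n:>6} respondents: +/- {margin_of_error(n) * 100:.1f}%")
```

Under these assumptions, roughly 400 respondents are needed to get to about plus or minus 5%, and quadrupling the sample only halves the margin.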

What Google is saying, when it presents you with a page of search results, is: based on billions of documents, and billions of search strings, and what billions of people read after entering those search strings, these are the top ten pages that correlate most strongly to the search string you have entered. This works astonishingly well when you have billions of data points. It does not work at all well when you have a thousand pages and virtually no history of search strings or page selections. Intelligent search requires billions of data points to perform at its best. Individual help systems on desktop machines are too small to search.

People complain constantly that the search engine of their docs delivery platform of choice does not work well enough. Improved search performance is a perennial feature request made to vendors. But things don’t get better, and I suspect that there is very little more that can be done to improve the search function of most tools, given two fundamental problems that limit what it can do:

  • The average doc set simply isn’t big enough to provide enough data for the statistical correlations to be meaningful (even if you used this type of search engine, as opposed to a simple string matching engine).
  • The typical documentation deployment is so small (often a single-user desktop) that the search engine does not see nearly enough query strings to profile them successfully, and so can’t correlate search strings to content with any accuracy.
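The first of these limits can be illustrated with a toy ranking function. The sketch below uses plain TF-IDF scoring, which is my stand-in for "statistical" search and not a claim about how any particular search engine or help tool works; the function name and the three sample pages are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score each document against the query with a basic TF-IDF sum.

    A deliberately minimal sketch: whitespace tokenization, no stemming,
    no smoothing. Real search engines do far more than this.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in counts:
                tf = counts[term] / len(tokens)
                idf = math.log(n_docs / doc_freq[term])
                score += tf * idf
        scores.append(score)
    return scores


# Hypothetical three-page "doc set".
pages = [
    "install the printer driver",
    "configure the network printer",
    "reset the admin password",
]
print(tf_idf_scores("printer password", pages))
print(tf_idf_scores("the", pages))
```

With only three pages, the inverse document frequency can take just a few distinct values, so the two printer pages tie exactly and a word that appears on every page contributes nothing. A larger collection gives the statistics much more room to separate one page from another.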

There is only one way out of this problem: make the doc set part of a larger data set so that it can be profiled more accurately, and put the search portal somewhere where it will receive enough search strings so that they can be meaningfully profiled.

If a doc set is too small to search, what remains is browsing. But browsing starts to become very cumbersome when the content reaches a certain size. It may be feasible to browse an ordinary book, but then you have the problem that the user has to browse multiple books, and that is an inconvenience that the Google generation does not have the patience to tolerate. They want all the information in one place.

So, you put all the information together in one place. There are then thousands of pages of content in a single container, and the only way to make it feasible to browse that much content is to arrange it somehow, and the usual method chosen is to make it into a hierarchy.

I think we have lost sight of the fact that a book is not hierarchical by nature. The structured text movement has taught us to use hierarchy to encode books for processing, and from that we seem to have made the leap to thinking that books are read hierarchically. They aren’t. There is little more frustrating than constantly having to go up, down, back, or sideways while trying to read. If there is anything to the theory that links are a threshold event that triggers readers to forget what they have just read (as discussed in Are We Causing Readers to Forget?), then reading hierarchically is also going to seriously impair retention.

So, hierarchy is not something that is natural to an information set, but something we impose on it in an attempt to make a larger information set browsable. But the problem with hierarchies is that they impose artificial subordination of concepts. When you choose the groupings at each level of a hierarchy, you make an assumption about how the reader will narrow down their query in their own minds. The more items at each level, and the deeper the levels become, the worse the problem becomes, and you inevitably end up hiding information from the reader who has a different idea of what the primary concepts are, as well as sundering information that is closely related, but in a way that is low on the list of groupings you have chosen.

The artificial imposition of hierarchy on the reading experience, combined with the maze-like browsing experience that is created by a large and complex hierarchy, is what turns so many doc sets into Frankenbooks.

People continue to agonize over how to improve the findability of these Frankenbook help systems, but I think the fundamental and unavoidable truth is that Frankenbooks are too big to browse; too small to search.

The way out of this dilemma:

  • Create Every Page is Page One topics that address a single user need in a single linear topic.
  • Richly link those topics so that people can browse and surf the locale around those topics (that is not too big to browse).
  • Put those topics on the web where they will become part of a content set that is large enough to search effectively.
  • Where long narrative exposition is genuinely needed, write a conventional book.





9 thoughts on “Too Big to Browse; Too Small to Search”

  1. Tim Penner


    We’re clearly reading the same books – Weinberger and Morville. But I’m unhappy with your proposed solutions for the findability dilemma. I agree that it’s all we have for now, but I’m still unhappy.

    A problem with “books” (maybe THE problem that would make them “bad”) is that a really good subject index is just too expensive to create and maintain. And such an artifact would never be “excellent” because it’s all but impossible to inject experience into its logic. A monstrously large public content set is the index answer because a mixture of time, computers and people invokes a miracle.

    Other problems: EPPO is not accessible to organizations that are already struggling cost-wise just to keep up. It’s too big a renovation. Also, large proprietary doc sets can’t be published into the ocean. SPFE would help somewhat to create some of that cross-linked, oceanic feeling, but it too is not an option for everyone and, frankly, an unlikely direction for a great many.

    We need to test – at least dialectically – some new approaches.

    I’m thinking, for example, that an indexing tool that does an immense subject clustering trick to generate the “good” subject index might be an answer. It would lack the experiential component that people-over-time add, but perhaps the presence of the well-clustered index would amplify the value of a much smaller sample of users over a shorter time period. I had a few years as a software engineer in the text indexing business and I’ve sort of kept my ear to the ground in the domain since I left in ’97. Automated subject clustering works – in fact, it can be amazing – but I have yet to hear about a push-button clustering tool with hooks in the popular authoring environments.

    Anyway: I think findability is more than just the bete noire of techcomm. It’s the big nasty elephant that shows up uninvited at too many small doc team meetings.

    1. Mark Baker Post author

      Hi Tim. Thanks for the comment.

      I’m a big fan of Weinberger, but I haven’t read Morville. Sounds like I should put him on my reading list.

      In principle, I’m all for finding something better, but in this case I am skeptical. Indexes don’t scale well. At a certain point an index itself becomes too big to browse. The immense clustering trick that you propose would, like intelligent search, rely on statistical analysis, and would require huge amounts of data to be effective. But a single index of that much data would be too big to browse.

      Actually, I would suggest that Google search performs that immense clustering trick that you propose. This is what it is doing when it spiders the web and builds its databases. It is simply providing access to that index through a search box rather than a browse interface.

      It is easy to forget that in the book era, information finding was a multi-step process in which you first had to generate a list of books to consult (via footnotes or a card catalog), then physically locate the books (which could include an interlibrary loan, which took weeks), and then consult the index of each book as you located it. When people compare the efficacy of search to indexes, they generally leave out all the prior steps and compare only on the basis of finding information in a single book using the index vs a search. This is, of course, the place where search is at its least effective (single books are much too small to search), but it is a complete distortion of the whole information-finding process, where, for most purposes today, search wins hands down, by searching the entire web and delivering the results instantly.

      You are certainly right that moving from books to true EPPO topics is a huge challenge. Despite the urging of virtually every pundit, most tech pubs groups are moving from books to Frankenbooks (written in fragments). The question is, is the emergence of Frankenbooks a transitional phase or the new normal? A lot will depend on how people write new content over the next several years. If people start writing new content in EPPO, then there will be a market for SPFE. If they write new content in fragments and build the fragments into Frankenbooks, then DITA will likely remain the tool of choice for that task. SPFE is not an appropriate tool for the creation of Frankenbooks.

      Unlike DITA, SPFE does not aim to be the solution for everybody. Wikis, for instance, provide a great platform of EPPO topics that will be effective for many organizations. And SPFE does not aim to be an exchange format either: one of the principles of SPFE is that you should never write in an exchange format. SPFE is an architecture for building structured writing systems, designed to be used by those organizations that can benefit from a high level of automation in their content production, particularly where rich linking, heterogeneous reuse, or integration of content from multiple sources are priorities. That certainly isn’t everyone, but there are quite a few organizations for which these things are, or should be, priorities.

      1. Tim Penner

        Peter Morville wrote “Ambient Findability”. I found it a really good read. He also recently released “Search Patterns”, which I didn’t find quite so exciting.

        Just a piece of news:

        Today, I heard of a clustering product by a company named “Cirilab”. They have a couple consumer-scale products for summarizing documents and document piles, plus enterprise tools for scaling-up the mass summarization process. If what these guys say is true, this might help somewhat with findability issues although how they’d help us out on the open sea is another matter.

        1. Mark Baker Post author

          Cirilab sounds interesting, though my immediate prejudice is that it is apt to fall just short, as so many attempts at this do.

          But my real question about such attempts is less about whether they provide reliable summaries than about whether reading time is actually the constraint on the path to understanding. My impression is that people initially assign new information to their existing categories, and only reluctantly break their existing mental model and form a new one. Successful communication (at least of a new idea) has to trigger that breaking of the current mental model, and I’m not sure that the Reader’s Digest version of a piece is going to preserve those triggers. I’m not sure if automated vs human summarization would make any difference in this. Human summarization does put the article through the filter of another person’s mental model, which the automated solution might avoid. But my suspicion is that summarization itself would create the problem, no matter how it was done.

  2. Anne Gentle

    Appreciate this post, Mark. This would be a great discussion piece for students in tech comm.

    1. Mark Baker Post author

      Anne, thanks for the comment. I do think this is an issue that tech comm students would benefit from thinking about, as it is counterintuitive, at least for those of us educated in the 20th century. It will be interesting to see how people educated in the 21st century approach these things.

  3. Pingback: Findability vs. Searchability | Every Page is Page One

  4. Jan Wright

    If you can combine “every page is page one” or what I call “online help portal pages” and human-based indexing, you can get a vast improvement in accessibility. For one of my clients, with an immense help system, we targeted the indexing to lead users to portal-like pages for conceptual topics, and to the lead-pages for command-based information. We cut the index size in half, and with an XML-based writing platform, and having authors insert annotations for new or changed materials, we cut way back on the time needed to revise the documents for each release. The writers made sure that subsidiary and explanatory topics were linked on the portal page, and procedures were accessible as well. The index then could jump to the appropriate portal, and we were relying on “let search be search, and let the index do the rest.” Short easy finds were left to search. Browsing had the index as its mainstay. One solution doesn’t fit all cases, but this was a case where structuring the materials, and then streamlining the index, made it a feasible and findable help system.

  5. Pingback: ANZSI 2013 conference – references from EPUB talks | Web Indexing
