Hypothesis

We're going back to the basics today, for the non-technical people, to explain what an "index" is and why indexes are important to making your search engine work cost-effectively at scale.

Imagine you walked into a library back in the day before computers and asked the librarian to find you every book that mentioned the word "gazebo". You would probably get some pretty weird looks, because it would be horribly inefficient for the librarian to go through every single book in the library to satisfy your obscure query. It would likely take months or even years to answer a single query.

Now imagine you asked them for every book in the library by "Hunter S. Thompson". That would be a piece of cake, but why? Because the library maintains an index of all the books that come in, by title, author, etc. Each index is just a list of the possible values that people would search for. In our example, the author index is an alphabetical list of author names, each with the titles and shelf locations of that author's books, so you can go and get all the other information contained in the book itself. The index is built before any search is ever made. When a new book comes into the library, the librarian breaks out those old index cards and adds it to the related indexes before the book ever hits the shelves. We use this same technique when working with data at scale.

Let's circle back to that first query for the word "gazebo". Why wouldn't the library maintain an index for literally every word ever? Imagine a library filled with more index cards than books. It would be virtually unusable. The index for a common word like "the" would likely contain the name of every book in the library, rendering that index completely useless. I have seen databases where the indexes are twice the size of the data actually being indexed, and adding more of them quickly hits diminishing returns.

It is a delicate balance that people like me, who engineer these giant scalable search engines, have to strike: getting the performance we need without flooding our virtual library (the database) with unneeded indexes.

via u/schematical at https://reddit.com/user/schematical/comments/1oe41bx/what_is_a_database_index_as_explained_to_a_1930s/
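The card-catalog idea in the quoted post can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular database implements its indexes: the catalog, the book texts (apart from one familiar opening line), the author "A. N. Author", and all function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical catalog of (title, author, text) tuples, invented for illustration.
BOOKS = [
    ("Fear and Loathing in Las Vegas", "Hunter S. Thompson",
     "we were somewhere around barstow on the edge of the desert"),
    ("Hell's Angels", "Hunter S. Thompson",
     "the bikers rode past the gazebo in the park"),
    ("Garden Structures", "A. N. Author",
     "how to build the gazebo and the pergola"),
]


def build_author_index(books):
    """Built once, before any search: author -> list of titles."""
    index = defaultdict(list)
    for title, author, _text in books:
        index[author].append(title)
    return dict(index)


def scan_for_word(books, word):
    """No index available: read every book's text (the slow 'gazebo' query)."""
    return [title for title, _author, text in books if word in text.split()]


def build_word_index(books):
    """Index literally every word: word -> set of titles containing it."""
    index = defaultdict(set)
    for title, _author, text in books:
        for word in text.split():
            index[word].add(title)
    return index


AUTHOR_INDEX = build_author_index(BOOKS)
WORD_INDEX = build_word_index(BOOKS)

# One dictionary lookup, no matter how many books are on the shelves.
print(AUTHOR_INDEX["Hunter S. Thompson"])

# Every book has to be opened and read.
print(scan_for_word(BOOKS, "gazebo"))

# The entry for "the" points at every book, so it narrows nothing down.
print(len(WORD_INDEX["the"]), "of", len(BOOKS), "books contain 'the'")
```

The author lookup touches one entry, the word scan touches every book, and the every-word index shows the trade-off from the post: it makes the "gazebo" query fast, but it has one entry per distinct word, and entries for common words cover the whole library.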

Perhaps it's a question of the "long search" versus the "short search"? Long searches with the proper connective tissue are more often what produces innovation out of serendipity, and that is where the greatest value lies, compared with a short search like "What time does the Super Bowl start?". How do you build a database index to improve the "long search"?

See, for example, Keith Thomas' problem: https://hyp.is/DFLyZljJEe2dD-t046xWvQ/www.lrb.co.uk/the-paper/v32/n11/keith-thomas/diary

Tags: reply, database indexes, indexes, long search, search engines
