The principle of locality states that in computer systems, data that has been used recently will likely be used again soon, or if a particular location in memory was needed recently, it will likely be referenced again soon. In more general terms, we could state the principle of locality as “What has happened recently will likely happen again.”

But how does this help you?

Well, for one, if you recently used a program like Google Chrome (and exited), your operating system might assume that you’ll use it again soon and keep parts of that program cached so it can be loaded quickly next time. Go ahead and try it yourself: open a program you haven’t used in some time that typically takes a little while to start, close it, then reopen it. Did the program load noticeably faster the second time?

In one of my classes recently we were discussing principles a web crawler might use when the topic of locality came up. The professor stated that, when building an inverted index for a search engine, we could exploit the principle of locality.

What does that mean? Here’s a brief primer on web crawling and search engines. If you’re already familiar with all this, simply skip ahead to the paragraph beginning “When my professor stated”.

When you type a query into a search engine (like “United States” or something), what is the search engine doing? A simplistic engine will search a database of documents that its crawler previously fetched and scraped the text from. For all the terms in the database, an inverted index can be built. Think of an inverted index as the index in the back of your textbook: for each term, you get a list of the pages the term appears on. A normal (non-inverted) index would say “for each page, give a list of all the terms”. An inverted index makes much more sense to have, right?
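As a minimal sketch of the idea (the toy corpus and whitespace tokenization here are made up purely for illustration):

```python
from collections import defaultdict

# Toy corpus: document ID -> text the crawler scraped from that page.
docs = {
    1: "the united states of america",
    2: "united we stand",
    3: "the fifty states",
}

# Inverted index: term -> set of document IDs the term appears in.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Posting lists are conventionally kept sorted by document ID.
postings = {term: sorted(ids) for term, ids in index.items()}
print(postings["united"])  # [1, 2]
print(postings["states"])  # [1, 3]
```

A real engine would also normalize terms (lowercasing, stemming) and store positions or term frequencies alongside the document IDs, but the term-to-sorted-ID-list shape is the essential part.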

For each term in your query, the inverted index can be searched to find pages that the term appears in. Each entry in the simple inverted index pairs a term with its list of document IDs, something like: term -> docID1, docID2, docID3, ...

With “United States”, something like this: united states -> 1, 2, 4, 10, 12, 15, 18, 27

The document IDs are simply the numbers the search engine’s crawler assigned to pages during its crawl.

As you can imagine, storing these lists quickly balloons into a huge memory consumer, so any data-compression technique is welcome. One of them is gap notation with delta encoding.

When my professor stated that we could use the principle of locality, she was referring to this technique of gap notation with delta encoding. Instead of storing document IDs as 1, 2, 4, 10, 12, 15, 18, 27, etc. we could instead just store the gaps between document IDs. Our list then becomes 1, 1, 2, 6, 2, 3, 3, 9.
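The transform in both directions can be sketched in a few lines (the document IDs here are the example list from above):

```python
def to_gaps(doc_ids):
    """Keep the first ID as-is, then store only the difference to the previous ID."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Reverse the transform with a running sum."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

ids = [1, 2, 4, 10, 12, 15, 18, 27]
gaps = to_gaps(ids)
print(gaps)  # [1, 1, 2, 6, 2, 3, 3, 9]
assert from_gaps(gaps) == ids
```

Note that because posting lists are sorted and IDs are unique, every gap is a positive integer, which matters for the encoding below.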

To store these gaps in memory we could take one of two approaches: a fixed-size variable or a variable of dynamic size. There are a couple of problems with using a fixed-size variable. To do so requires an assumption about the maximum amount of space needed. If this assumption is too small, we’ll eventually encounter a gap larger than we can store; if the assumption is too large, we’ve wasted space instead of saving it during our compression!

Therefore, it is a good idea to use variable-width storage, and that is where the encoding comes in: it allows us to store our gap numbers in a dynamic fashion. Each gap is stored as a combination of a unary number and a binary number, together called a gamma code.

A unary number can be thought of as the “ticks” one might use to count quantities of something, so 3 units of a widget looks like three tick marks: I I I. In the scheme used here, the number n is written as n - 1 ones followed by a 0 “stop-bit” indicating the end of the count, so the number 3 appears as 110 and the number 4 as 1110.

The first half (unary) of the gamma code is equivalent to floor(log_{2} X) + 1, where X is the gap width between IDs (i.e. if we were comparing document 1 and document 10, the gap is 9). The floor function rounds any decimal result down to the nearest integer, so floor(log_{2} 9) = 3 just like floor(log_{2} 8) = 3. If our gap width is 9, then the first half of our gamma code becomes floor(log_{2} 9) + 1 = 4 (decimal) = 1110 (unary).
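As a quick sketch of this first half, using the unary convention described above (n written as n - 1 ones plus the stop-bit):

```python
import math

def unary(n: int) -> str:
    """Unary as used here: n is written as (n - 1) ones followed by a 0 stop-bit."""
    return "1" * (n - 1) + "0"

gap = 9
first_half = unary(math.floor(math.log2(gap)) + 1)
print(first_half)  # 1110
```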

The second half of the gamma code is equivalent to X − 2^{floor(log_{2} X)}, where X is again the gap width. This equation will never give a result larger than X/2 or smaller than 0. Think of it as a more efficient way to represent gap widths that could be really large. We don’t need to worry about X being 0 in the equation because documents have unique IDs, so the gap between them is always at least 1.

If our gap width is 9, then the second half of the gamma code (binary) is 9 − 2^{floor(log_{2} 9)} = 9 − 2^{3} = 9 − 8 = 1 (decimal) = 001 (binary). As you can see, I represented the binary number with a few 0s before the 1. Why? Calculating the first half of the code also determines the number of bits we’ll use to represent the second half. That is, floor(log_{2} X) will be the number of bits we use to represent the second half of the code. As floor(log_{2} 9) = 3, we’ll only use three bits to represent the second half of the gamma code – 001.

Our final gamma code for a gap size of 9 is then 1110001, or [1110][001]. We can interpret it as follows: count the 1’s until you reach a 0. So we count three 1s. This count is the number of bits used to represent the second number in binary. The second number, 001, is 1. So you see that the first half of the code is merely a way to inform whoever is reading it how to interpret the second half of the code. How can we ‘reverse’ this code and obtain the actual gap size of 9?
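Putting both halves together, an encoder for this scheme looks like the following (values kept small here, so the floating-point log is safe):

```python
import math

def gamma_encode(gap: int) -> str:
    """Gamma code: unary length prefix, then the remainder in binary."""
    assert gap >= 1, "gaps between unique, sorted document IDs are at least 1"
    n = math.floor(math.log2(gap))       # number of bits in the second half
    unary = "1" * n + "0"                # first half: n ones plus the 0 stop-bit
    remainder = gap - 2 ** n             # second half: gap minus the largest power of 2 in it
    binary = format(remainder, f"0{n}b") if n else ""
    return unary + binary

print(gamma_encode(9))  # 1110001
print(gamma_encode(7))  # 11011
print(gamma_encode(1))  # 0
```

Note the edge case: a gap of 1 has n = 0, so its entire code is just the stop-bit.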

Let’s compare the code for a gap size of 9 [1110][001] to the code for a gap size of 8 — [1110][000] — and a gap size of 7 — [110][11]. Are you seeing a trend yet? 6 is [110][10], 5 is [110][01], and 4 is [110][00]. Let’s call Y the number of 1s in the unary number. In the case of a gap size of 9, Y = 3. Let’s call Z the binary number in the second half of the code. For gap size 9, this is 001 (binary) = 1. 2^{Y} + Z gives us our actual gap size. That is, for the gap size of 9, the code [1110][001] translates to 2^{3} + 1 = 9. Cool, right!?
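That decoding recipe translates directly to code:

```python
def gamma_decode(code: str) -> int:
    """Count 1s up to the stop-bit (Y), read Y more bits as Z, return 2**Y + Z."""
    y = code.index("0")               # number of leading 1s before the stop-bit
    z_bits = code[y + 1 : y + 1 + y]  # the Y bits following the stop-bit
    z = int(z_bits, 2) if z_bits else 0
    return 2 ** y + z

print(gamma_decode("1110001"))  # 9
print(gamma_decode("11011"))    # 7
print(gamma_decode("0"))        # 1
```

Because the unary prefix tells the decoder exactly how many bits follow, a whole posting list’s codes can be concatenated into one bit stream and still be decoded unambiguously.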

Let’s think about how useful this gamma code is for a second. 1110001 is a 7-bit number. If we had simply represented document 10 in binary as 1010 instead of encoding the gap of 9 between documents 1 and 10, we would have needed only 4 bits — so the gamma code used 7 bits for an ID that could have been stored in 4! If every gap size were 9, we’d actually be using extra space to store all of our gaps!

But hang on for a minute. Consider the number of documents that a term could point to in an inverted index for, say, Google. A million would be a conservative estimate, and indeed searching “United States” turns up about 1.8 billion pages. What if the document IDs with a gap size of 9 were actually 128 and 137? Using delta encoding we get the gamma code 1110001, a 7-bit number, but representing 137 without any encoding at all gives 10001001, an 8-bit number. We just saved 1 bit of memory! If every gap size were 9 or less, we’d save at minimum about a billion bits for a term like “United States”, or about 125 megabytes. Over even just a few thousand different terms, storage savings add up fast.
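A quick back-of-the-envelope comparison makes the point concrete. The posting list below is invented for illustration; `gamma_bits` is just the code length implied by the construction above (1 stop-bit plus floor(log2 gap) bits in each half):

```python
import math

def gamma_bits(gap: int) -> int:
    """Bit length of the gamma code for a gap: 2 * floor(log2 gap) + 1."""
    return 2 * math.floor(math.log2(gap)) + 1

# Hypothetical posting list with small gaps between clustered documents.
doc_ids = [128, 137, 140, 149, 152]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

# Fixed-width storage must size every slot for the largest ID.
fixed = len(doc_ids) * max(d.bit_length() for d in doc_ids)
# Gamma-coded gaps only pay for the size of each gap.
coded = sum(gamma_bits(g) for g in gaps)
print(fixed, coded)  # 40 35
```

Even on this tiny list the coded form wins, and the advantage grows as IDs get larger while gaps stay small — which is exactly what locality promises.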

But how can we be sure that using these gap sizes will actually result in memory savings? In other words, how does the principle of locality work to save memory? How can we be sure that our gap sizes won’t vary wildly — say, a gap of a million, which would produce a very inefficient gamma code?

According to the principle of locality in a spatial sense, document IDs are going to cluster around related words. Think about NASA’s website: every page will likely have something in common with any other page, like space or stars, so the gaps between pages containing a term like “astronaut” are likely to be small. With semantic locality, we make the assumption that terms will be clustered according to their semantic meanings. Then our gap sizes for document IDs of pages within the same website are very small, with the larger gaps occurring between websites, and there are far fewer websites than there are pages within them.

For a small web, our memory savings will be insignificant, if any, but for humongous data sets that require entire data warehouses to store, applying the principle of locality with delta encoding means less memory required — and cost savings for you and your business.

Tags: Compression, Locality, Web Crawling, Web Mining
