To score the proximity of several terms, we first need to represent the
positions of the terms in each document. Let's say we use the following to
represent a set of term occurrences:
doc1 term1:1 term2:2 term3:3
doc2 term1:30 term2:10 term3:20

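In code, one simple in-memory form for these postings is a map from each term
to its list of positions. Here is a minimal Python sketch (the dict-of-lists
layout is just an illustrative choice):

    # Postings for the two example documents: term -> positions.
    doc1 = {"term1": [1], "term2": [2], "term3": [3]}
    doc2 = {"term1": [30], "term2": [10], "term3": [20]}
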
If we performed a search against these two documents using the query
(term1 && term2 && term3), we would like the weight of the results to reflect
the proximity of these terms. In doc1 the three terms are clustered together
without separation, while in doc2 they are separated by 10 positions, so we
would expect the weights of the search results to reflect this.

One possible weighting would take the list of matching term positions for
each document and construct a weight like:
Matching Weight = MAX(positions) - MIN(positions)

Using this metric on our test case, doc1 has a weight of 2 and doc2 has a
weight of 20: the smaller the span, the tighter the cluster, so we can easily
construct a method to determine the best match.

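A minimal sketch of this span weight over the postings layout from above (the
function name is an illustrative choice):

    def matching_weight(doc, query_terms):
        """Span of the matched positions: MAX(positions) - MIN(positions)."""
        positions = [p for t in query_terms for p in doc[t]]
        return max(positions) - min(positions)

    doc1 = {"term1": [1], "term2": [2], "term3": [3]}
    doc2 = {"term1": [30], "term2": [10], "term3": [20]}

    print(matching_weight(doc1, ["term1", "term2", "term3"]))  # 2
    print(matching_weight(doc2, ["term1", "term2", "term3"]))  # 20
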
But what if there are multiple positions at which the terms appear? For
instance, what if the term occurrences in documents 3 and 4 looked like this:
doc3 term1:1,40,4 term2:2 term3:3
doc4 term1:1 term2:2,30,50 term3:3,10

Now the proximity grouping of term1, term2, and term3 is obscured by our
previous weight: doc3 and doc4 evaluate to weights of 39 and 49 for this
search, even though each of them still contains a tight cluster of all three
terms near the start of the document.

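Running the same span weight over these postings shows the failure (a sketch
reusing the matching_weight helper from above):

    def matching_weight(doc, query_terms):
        positions = [p for t in query_terms for p in doc[t]]
        return max(positions) - min(positions)

    doc3 = {"term1": [1, 40, 4], "term2": [2], "term3": [3]}
    doc4 = {"term1": [1], "term2": [2, 30, 50], "term3": [3, 10]}

    q = ["term1", "term2", "term3"]
    print(matching_weight(doc3, q))  # 39, despite the cluster at positions 1-4
    print(matching_weight(doc4, q))  # 49, despite the cluster at positions 1-3
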
============================================
A sparse matrix method for term proximity
============================================
If we consider the list of positions for each term as a "position vector"
with an entry for each place the term occurs in a document, then we can write
each document as a matrix in which every column is a position vector: 1 where
the term occurs at that position and zero where it doesn't. For a document
(reusing the name doc1 for a fresh set of positions), that looks like this:
doc1
               term1 term2 term3 term4
count at pos1    0     1     0     0
count at pos2    1     0     0     0
count at pos3    0     0     1     0
count at pos4    1     0     0     0
...
count at posN    0     0     1     0

The dimensions of this matrix are (Npositions x Nterms): one row per
position, one column per term.

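As a sketch in numpy, truncated to the first 4 positions shown above:

    import numpy as np

    # Rows are positions, columns are terms (Npositions x Nterms).
    #             term1 term2 term3 term4
    D = np.array([[0,    1,    0,    0],   # pos1
                  [1,    0,    0,    0],   # pos2
                  [0,    0,    1,    0],   # pos3
                  [1,    0,    0,    0]])  # pos4
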
If we were to look for the query (term1 && term3) in the above document, we
could construct a query vector with a 1 in the term1 and term3 entries and
zeros elsewhere, like this:
query
term1 1
term2 0
term3 1
term4 0

Now we can get a histogram of term occurrences with two matrix-vector
products: doc1 %*% query marks every position that holds a queried term, and
multiplying by the transpose folds those matches back onto the terms:
histogram = t(doc1) %*% (doc1 %*% query)

This yields the following vector for this case (assuming we only have 4
positions):
histogram
term1 2
term2 0
term3 1
term4 0

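The same computation in numpy, where @ is the matrix product playing the role
of %*%:

    import numpy as np

    D = np.array([[0, 1, 0, 0],   # pos1: term2
                  [1, 0, 0, 0],   # pos2: term1
                  [0, 0, 1, 0],   # pos3: term3
                  [1, 0, 0, 0]])  # pos4: term1

    query = np.array([1, 0, 1, 0])   # (term1 && term3)

    matched = D @ query       # 1 at each position holding a queried term
    histogram = D.T @ matched # fold the matches back onto the terms
    print(histogram)          # [2 0 1 0]
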
What we want, though, is a weighting by proximity. Now that we have an
arithmetic for intersecting queries with occurrences, perhaps we can amend
the weights of the positions to produce a proximity weight.

Consider the following two position vectors for term1 and term3 in document 1
(simplified here to a single occurrence of term1):
               term1 term3
count at pos1    0     0
count at pos2    0     0
count at pos3    0     1
count at pos4    1     0
...
count at posN    0     1

In this case, we have two occurrences of term3, but only one is close to
term1. If we evaluate the histogram, we simply get two matches of term3 and
one match of term1, with no bias toward, and no record of, the terms that
lie close to each other:
histogram
term1 1
term2 0
term3 2
term4 0

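A small numpy check of that blind spot (N = 12 stands in for Npositions):

    import numpy as np

    N = 12
    term1 = np.zeros(N); term1[3] = 1                  # term1 at pos4
    term3 = np.zeros(N); term3[2] = 1; term3[N-1] = 1  # term3 at pos3, posN

    hits = term1 + term3               # positions holding queried terms
    print(term1 @ hits, term3 @ hits)  # 1.0 2.0 -> pure counts; the nearby
                                       # and faraway term3 weigh the same
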
What if we were to amend the term positions, allowing them to "bleed" across
the position boundaries like this:
                 term1 term3
count at pos1      0     0
count at pos2      0    .1
count at pos3     .1     1
count at pos4      1    .1
count at pos5     .1     0
...
count at posN-1    0    .1
count at posN      0     1

We can then compose a "blended term vector" as the element-wise sum of these
weighted term position vectors, so that overlapping positions reinforce one
another. Where the two terms are separated by one position, the blended
vector will peak at 1.1: the full 1.0 of one term plus the 0.1 bleed of its
neighbor.

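A sketch of that blend, assuming "bleed" means spreading each occurrence with
a small kernel and "blend" means element-wise addition (both read off the 1.1
value above):

    import numpy as np

    def spread(positions, n, kernel=(0.1, 1.0, 0.1)):
        """Blur each 1-based occurrence across neighboring positions."""
        v = np.zeros(n)
        half = len(kernel) // 2
        for p in positions:
            for k, w in enumerate(kernel):
                i = (p - 1) + (k - half)
                if 0 <= i < n:
                    v[i] = max(v[i], w)  # keep the strongest contribution
        return v

    N = 12                                        # stand-in for Npositions
    blended = spread([4], N) + spread([3, N], N)  # term1 + term3
    print(blended.max())                          # 1.1 -> adjacent terms
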
We can extend this blending to distribute values across a larger range of
positions. For example, we might use a distribution like this, centered on
the occurrence and covering offsets of -3 to +3:
0.2, 0.4, 0.8, 1.0, 0.8, 0.4, 0.2

Then a peak value of 1.8 (1.0 + 0.8) tells us the terms are right next to
each other, and a peak of 1.2 (1.0 + 0.2) tells us they are three positions
apart.

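A sketch of reading the separation back out of the peak, using convolution to
spread single occurrences with the wider kernel:

    import numpy as np

    kernel = np.array([0.2, 0.4, 0.8, 1.0, 0.8, 0.4, 0.2])

    def blended_peak(distance, n=20):
        """Peak of the blend for two occurrences `distance` apart."""
        a = np.zeros(n); a[5] = 1
        b = np.zeros(n); b[5 + distance] = 1
        blend = (np.convolve(a, kernel, mode="same")
                 + np.convolve(b, kernel, mode="same"))
        return blend.max()

    print(round(blended_peak(1), 2))  # 1.8 -> adjacent
    print(round(blended_peak(3), 2))  # 1.2 -> three positions apart
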
With multiple search terms, the peak of the blended vector becomes a
continuous weighting factor that can be normalized by the number of search
terms.
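
Pulling the pieces together, here is one way to realize that normalized
factor (a sketch: overlapping bleeds from the same term are combined with a
max, and the blended peak is divided by the term count so that a perfect
cluster scores 1.0):

    import numpy as np

    KERNEL = (0.2, 0.4, 0.8, 1.0, 0.8, 0.4, 0.2)

    def spread(positions, n, kernel=KERNEL):
        v = np.zeros(n)
        half = len(kernel) // 2
        for p in positions:
            for k, w in enumerate(kernel):
                i = (p - 1) + (k - half)
                if 0 <= i < n:
                    v[i] = max(v[i], w)
        return v

    def proximity_score(doc, query_terms, n):
        blended = sum(spread(doc[t], n) for t in query_terms)
        return blended.max() / len(query_terms)

    doc3 = {"term1": [1, 40, 4], "term2": [2], "term3": [3]}
    doc4 = {"term1": [1], "term2": [2, 30, 50], "term3": [3, 10]}
    q = ["term1", "term2", "term3"]

    # Both documents recover the tight cluster that MAX - MIN obscured.
    print(round(proximity_score(doc3, q, 60), 3))  # 0.867
    print(round(proximity_score(doc4, q, 60), 3))  # 0.867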