Chapter 3: SELECTING A BLOCKING SCHEME

Generally the files of records to be linked are too large to allow every record to be compared with every other record, so the use of an index is usually unavoidable. Probabilistic record linkage calls this blocking the records. Blocking cuts down on the number of comparisons that need to be made: the only comparisons made are among records in the same block, that is, records sharing the same key value. This raises the question of which field or fields would be suitable as a key. We want to choose one or more fields that bring back the most matches rather than losing them, and that at the same time do not bring back so many candidates that we spend inordinate amounts of time comparing non-matches. These are the two measures of blocking efficiency, recall and precision respectively. The actual comparing of the records within a block is called weighting.
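To make the idea of a block concrete, the following is a minimal sketch in Python. The toy records, the field layout, and the helper names block_by and candidate_pairs are all hypothetical and chosen only for illustration. It groups records by a key field and counts how many comparisons remain compared with comparing every record against every other.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (record_id, surname, birth_year).
records = [
    (1, "SMITH", 1950),
    (2, "SMITH", 1950),
    (3, "SMYTH", 1950),
    (4, "JONES", 1962),
    (5, "JONES", 1962),
    (6, "BROWN", 1971),
]

def block_by(records, key_func):
    """Group records into blocks that share the same key value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_func(rec)].append(rec)
    return blocks

def candidate_pairs(blocks):
    """List the record pairs that would actually be compared (weighted)."""
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs

# Block on surname (field index 1); this choice of key is only an example.
blocks = block_by(records, lambda rec: rec[1])
pairs = candidate_pairs(blocks)

# Number of comparisons without blocking: every record against every other.
all_pairs = len(records) * (len(records) - 1) // 2

print(len(pairs), "comparisons with blocking,", all_pairs, "without")
```

In this toy file, blocking on surname leaves 2 comparisons instead of 15, but if records 2 and 3 were in fact the same person, that match would be lost because SMITH and SMYTH fall into different blocks. That trade-off is what recall and precision measure.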

Knowing the definition of blocking efficiency, it is possible to run a test and arrive at the measures empirically. Having a set of matched duplicates, we simply take one record at random from each group, use it as a query, and count how many of the other records in its group are in the block that it defines and how many are not. As good as these numbers are, we cannot tell how certain they are without repeating the test several times with other sets of queries chosen at random. However, we can also determine the blocking efficiency from other measures that we need to take on our sample anyway. As mentioned above in ¶ 2-2.7, there are three principal field characteristics from the database of interest for which we need measures: presence, reliability, and coincidence. Based on these measures we may select certain fields as best suited for either blocking or weighting. These selections are the blocking and weighting schemes. In order to optimize the blocking scheme we would need an algorithm that maximizes recall and precision based on the cost of calculating the weight of a comparison. The greater the noise (that is, the lower the precision), the less desirable the blocking scheme. In this chapter we also develop the equations for duplication rate, which is important in getting at the blocking precision.
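The empirical test described above can be sketched as follows. This is only an illustration under stated assumptions: the function name blocking_efficiency, the dictionaries group_of and key_of, and the toy data are hypothetical, and precision is taken here to be the share of records brought back by the query's block that belong to the query's group, which may differ in detail from the definition developed later in the chapter.

```python
import random
from collections import defaultdict

def blocking_efficiency(record_ids, group_of, key_of, rng):
    """Estimate blocking recall and precision from known duplicate groups.

    group_of maps record id -> id of its group of matched duplicates.
    key_of   maps record id -> blocking key value.
    Both mappings are hypothetical inputs assumed for this sketch.
    """
    groups = defaultdict(list)
    blocks = defaultdict(set)
    for rid in record_ids:
        groups[group_of[rid]].append(rid)
        blocks[key_of[rid]].add(rid)

    found = missed = noise = 0
    for members in groups.values():
        if len(members) < 2:
            continue  # a record with no duplicates has nothing to recall
        query = rng.choice(members)              # one query per group, at random
        block = blocks[key_of[query]] - {query}  # what the query brings back
        others = set(members) - {query}          # what it should bring back
        found += len(others & block)             # matches in the block
        missed += len(others - block)            # matches lost by blocking
        noise += len(block - others)             # non-matches in the block

    recall = found / (found + missed) if found + missed else 0.0
    precision = found / (found + noise) if found + noise else 0.0
    return recall, precision

# Toy data: records 1 and 2 are duplicates, 3 and 4 are duplicates, 5 is alone.
record_ids = [1, 2, 3, 4, 5]
group_of = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C"}
key_of = {1: "k1", 2: "k1", 3: "k2", 4: "k3", 5: "k1"}

recall, precision = blocking_efficiency(
    record_ids, group_of, key_of, random.Random(0)
)
print(recall, precision)
```

Calling the function repeatedly with differently seeded generators gives the repeated trials needed to judge how certain the estimates are. On a realistic sample the results vary from run to run with the queries drawn.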

Section 3-1 BLOCKING RECALL
Section 3-2 DUPLICATION RATE
Section 3-3 BLOCKING PRECISION