Section 3-2 DUPLICATION RATE

To get at precision, we need to estimate the noise, and the noise depends on how many duplicates there are in the test database. Duplicates come in groups. Most often a duplicate group is a pair, but sometimes it is a triple, a quadruple, or an even larger group. The distribution of pairs, triples, etc., depends on the duplication rate and the size of the file. The larger the sample file and the higher the duplication rate, the more likely it is that larger duplicate groups will occur.
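
The grouping described above can be illustrated with a short sketch. It assumes each record carries an identifier for the entity it refers to, and it uses a purely hypothetical toy model in which every record is assigned to an entity drawn uniformly at random. The names group_size_counts and simulate_file are illustrative only; this is not the estimation method developed in the subsections that follow.

```python
import random
from collections import Counter


def group_size_counts(entity_ids):
    """Map duplicate-group size (2 = pair, 3 = triple, ...) to the
    number of groups of that size. Entities with one record are not
    duplicates and are left out."""
    records_per_entity = Counter(entity_ids)
    return Counter(n for n in records_per_entity.values() if n >= 2)


def simulate_file(n_records, n_entities, seed=0):
    """Toy model (an assumption, not taken from the text): each record
    belongs to an entity drawn uniformly at random from n_entities."""
    rng = random.Random(seed)
    return [rng.randrange(n_entities) for _ in range(n_records)]


if __name__ == "__main__":
    # Vary the file size and the number of underlying entities, then
    # compare how many pairs, triples, etc. turn up in each file.
    for n_records, n_entities in [(1_000, 950), (1_000, 700), (10_000, 7_000)]:
        counts = group_size_counts(simulate_file(n_records, n_entities))
        print(n_records, n_entities, dict(sorted(counts.items())))
```

Under this toy model, fewer entities for the same number of records means more duplication, so comparing the printed counts gives a rough feel for how group sizes shift with file size and duplication.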

3-2.1 Two definitions for duplication rate.
3-2.2 Proportions of records.
3-2.3 Calculating an entity duplication rate.
3-2.4 Estimating numbers of duplicate n-tuples.
3-2.5 Anomalous duplication rates.