Outline Document: Record Linking Approach Using MCDA

1. Objective

To unify campaign data from disparate source systems (Prisma, CARIA, Tag Sheets) by resolving and linking related records to a single canonical campaign entity. This will support automation, improve data integrity, and reduce manual reconciliation.


2. Overview of the Record Linking Challenge

Campaign-related data originates from multiple systems with:

  • Inconsistent naming conventions
  • Partial overlaps in metadata
  • Human-generated input leading to typos or variability

Traditional deterministic joins (e.g., exact matches on campaign name or advertiser) are insufficient. Therefore, a multi-criteria probabilistic matching approach is required.
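
For illustration, a deterministic equality check fails on trivial naming variance where a graded similarity measure does not. The snippet below is a minimal sketch assuming RapidFuzz (one of the libraries recommended in section 8):

from rapidfuzz import fuzz, utils

prisma_advertiser = "Tesco PLC"
caria_advertiser = "TESCO UK"

# Deterministic join: fails outright on naming variance.
print(prisma_advertiser == caria_advertiser)  # False

# Probabilistic comparison: a graded similarity in [0, 100] instead.
print(fuzz.token_set_ratio(prisma_advertiser, caria_advertiser,
                           processor=utils.default_process))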


3. Strategy: Multi-Criteria Decision Analysis (MCDA)

We adapt an MCDA-based record linking framework. Each field contributes to the overall match score based on:

  • Its reliability across systems
  • The degree of expected variation
  • Its availability in all systems

Each field is paired with an appropriate matching strategy, and weighted based on its importance and consistency.


4. Fields Considered for Matching

The following metadata fields are used to establish a link between records across systems:

Field         | Systems Present In        | Description
------------- | ------------------------- | -------------------------------------
Advertiser    | Prisma, CARIA, Tag Sheets | Client or brand name
Campaign Name | Prisma, CARIA, Tag Sheets | Human-readable campaign title
Start Date    | Prisma, CARIA, Tag Sheets | Campaign launch date
End Date      | Prisma, CARIA, Tag Sheets | Campaign termination date
Agency        | Prisma, CARIA             | Media or creative agency responsible

5. Matching Techniques per Field

Each field uses a strategy tailored to its data quality and semantic variability:

Field          | Matching Strategy                                              | Justification
-------------- | -------------------------------------------------------------- | --------------------------------------------------
Advertiser     | Fuzzy matching (Levenshtein, Jaro-Winkler) + NER normalization | Variants like "Tesco", "Tesco PLC", "TESCO UK"
Campaign Name  | Sentence embeddings + cosine similarity                        | Free-form strings with inconsistent formatting
Start/End Date | Date window matching with decay function                       | Temporal proximity constrains candidate matches
Agency         | Fuzzy matching                                                 | Typos and naming variance, e.g. "OMD UK" vs "OMD"
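
The sketches below illustrate one possible implementation of these per-field matchers, using libraries recommended in section 8. The embedding model name and the decay half-life are illustrative assumptions, and the NER normalization step for advertisers is omitted for brevity:

import math
from datetime import date

from rapidfuzz import fuzz, utils
from rapidfuzz.distance import JaroWinkler
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def fuzzy_score(a: str, b: str) -> float:
    """Fuzzy match for Advertiser/Agency, normalized to [0, 1]."""
    a, b = utils.default_process(a), utils.default_process(b)
    levenshtein = fuzz.ratio(a, b) / 100         # Levenshtein-based ratio (0-100)
    jaro_winkler = JaroWinkler.similarity(a, b)  # already in [0, 1]
    return max(levenshtein, jaro_winkler)

def campaign_name_score(a: str, b: str) -> float:
    """Cosine similarity of sentence embeddings, clipped to [0, 1]."""
    embeddings = _model.encode([a, b])
    return max(0.0, float(util.cos_sim(embeddings[0], embeddings[1])))

def date_score(a: date, b: date, half_life_days: int = 7) -> float:
    """Decay toward 0 as dates drift apart; 1.0 for the same day."""
    delta_days = abs((a - b).days)
    return math.exp(-math.log(2) * delta_days / half_life_days)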

6. Weighting Model (MCDA Scoring)

Each field’s match is weighted based on its predictive utility and availability:

Field         | Weight | Notes
------------- | ------ | -----------------------------------------------
Campaign Name | 0.35   | High variability but semantically rich
Advertiser    | 0.30   | Common and stable, but occasionally typo-prone
Agency        | 0.15   | Less reliable, but helpful in combination
Start Date    | 0.10   | Used to confirm closeness in flighting
End Date      | 0.10   | Used in tandem with start date

The overall match score is calculated as a weighted sum of individual field match scores.
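
As a minimal worked example of the weighted sum (the weights come from the table above; the per-field scores are illustrative values for one candidate pair):

# Weights from the table above.
field_weights = {
    "campaign_name": 0.35,
    "advertiser": 0.30,
    "agency": 0.15,
    "start_date": 0.10,
    "end_date": 0.10,
}

# Illustrative per-field match scores in [0, 1] for one candidate pair.
field_scores = {
    "campaign_name": 0.82,
    "advertiser": 0.95,
    "agency": 0.70,
    "start_date": 1.00,
    "end_date": 0.90,
}

composite = sum(field_weights[f] * field_scores[f] for f in field_weights)
print(round(composite, 3))  # 0.867 -> above the 0.75 threshold used in section 7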


7. Record Linking Workflow

Step 1: Extract relevant fields from all source systems:

  • Normalize date formats and name casing, and strip punctuation (see the sketch below).
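
A minimal normalization sketch; the accepted date formats are assumptions:

import re
from datetime import datetime

def normalize_name(raw: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", raw.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def normalize_date(raw: str) -> str:
    """Coerce common date formats to ISO 8601 (the accepted formats are illustrative)."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(normalize_name("Tesco, PLC"))  # tesco plc
print(normalize_date("01/06/2024"))  # 2024-06-01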

Step 2: Compute match scores per field:

  • Apply field-specific matching functions.
  • Normalize scores between 0 (no match) and 1 (perfect match).

Step 3: Combine using MCDA:

  • Multiply each field match score by its weight.
  • Sum to obtain a final composite match score.

Step 4: Apply thresholding:

  • Scores above a threshold (e.g., 0.75) are treated as linked.
  • Scores below may be routed to manual review or to a fallback fuzzy-linking pass (see the sketch below).
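
A sketch of the thresholding step; the 0.75 link threshold comes from the step above, while the lower boundary of the review band is an illustrative assumption:

LINK_THRESHOLD = 0.75    # from the step above
REVIEW_THRESHOLD = 0.50  # assumed lower bound for the manual-review band

def classify(score: float) -> str:
    """Map a composite match score to a linking decision."""
    if score >= LINK_THRESHOLD:
        return "linked"
    if score >= REVIEW_THRESHOLD:
        return "manual_review"
    return "no_match"

print(classify(0.867))  # linked
print(classify(0.62))   # manual_review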

8. Implementation Recommendations

  • NLP Toolkits: Use libraries like spaCy, sentence-transformers (SBERT) for semantic matching.
  • Fuzzy Matching Libraries: RapidFuzz, FuzzyWuzzy, or difflib for string-based comparisons.
  • Scoring Infrastructure: Build a scoring engine that operates on pandas DataFrames for scalability (a minimal sketch follows).
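
As a rough sketch of that scoring engine (the column names and example rows are hypothetical):

import pandas as pd
from rapidfuzz import fuzz, utils

prisma = pd.DataFrame({
    "prisma_id": [101, 102],
    "advertiser": ["Tesco PLC", "Acme Foods"],
})
caria = pd.DataFrame({
    "caria_id": ["A7", "B3"],
    "advertiser": ["TESCO UK", "Unrelated Brand"],
})

# Cross join to enumerate candidate record pairs across systems.
pairs = prisma.merge(caria, how="cross", suffixes=("_prisma", "_caria"))

def sim(a: str, b: str) -> float:
    return fuzz.token_set_ratio(a, b, processor=utils.default_process) / 100

pairs["advertiser_score"] = [
    sim(a, b) for a, b in zip(pairs["advertiser_prisma"], pairs["advertiser_caria"])
]
print(pairs[["prisma_id", "caria_id", "advertiser_score"]])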

9. Outcome

The result will be a single, unified campaign identity with reconciled data from all source systems, enabling:

  • Seamless automation of creative and tag validation
  • Full lifecycle visibility for campaign execution
  • Reduction in human-in-the-loop processing

Multi-Criteria Decision Analysis (MCDA)

MCDA Overview

Multi-Criteria Decision Analysis (MCDA) is a decision-making framework for evaluating multiple criteria in complex situations. The key component of MCDA is the pairwise comparison matrix, in which the importance of each criterion is compared directly against every other criterion. A decision-maker or expert assigns values based on how much more important one element is than another. These comparisons produce a matrix in which, under the convention used here, the diagonal is zero (a criterion compared with itself contributes no preference information; see the legend below) and the off-diagonal elements capture the relative importance of each pair.

Pairwise comparison is a fundamental component of MCDA and is used to determine the relative importance of each criterion (or method, in the context of this example). Here's how it fits into the process, with a focus on the pairwise comparison aspect:

  1. Pairwise Comparison of Criteria: The table represents a pairwise comparison matrix. For each pair of methods, a decision-maker has determined how important one is compared to the other in contributing to the overall goal, which in this case is the extraction of terms from text. For instance, 'token' compared with itself contributes no preference (hence the diagonal score of 0), but it is considered 4 times as important as 'semantic' (the 4 in the semantic row of the token column). This process is repeated for every pair of methods, filling in the off-diagonal entries of the matrix.

  2. Scoring the Pairwise Comparisons: The numbers in the matrix reflect the relative importance between pairs. A score of 1 indicates that both methods are of equal importance. Scores above 1 indicate that the column method is more important than the row method, with the magnitude of the number reflecting the strength of the preference; scores between 0 and 1 indicate the opposite. (This reading is consistent with the worked example: the 4 in the semantic row of the token column expresses that token outranks semantic.)

  3. Summing the Pairwise Comparisons: After all pairwise comparisons are made, each method's importance scores are summed (here, down each column of the matrix) to provide a total score reflecting its overall importance relative to all other methods.

  4. Normalizing the Weights: To make these sums usable as weights, they are normalized to sum up to 1 (or 100%). This is done by dividing each method's sum by the grand total of all sums. The resulting normalized weights reflect the proportionate importance of each method.

  5. Application of Weights: These normalized weights are then used in the decision-making process. For the term extraction example, when each provider (token, semantic, noun_phrase, noun, sentence) generates a score for a term, that score is multiplied by the method's weight to obtain a score that reflects both the term's relevance and the method's importance as determined by the pairwise comparison process.

In summary, the pairwise comparison matrix is a structured way to capture and quantify subjective judgments about the relative importance of each criterion. These judgments are then converted into a set of weights through normalization, which can be applied to the decision-making process. The method ensures that the final decision reflects the considered opinion of experts or decision-makers about the relative importance of the criteria involved.

MCDA Legend

Value | Meaning
----- | -----------------------------
0     | No effect (on the diagonal)
1     | As important
2     | More important
4     | Significantly more important
0.5   | Less important
0.25  | Significantly less important

MCDA Weights

            | token      | semantic   | noun_phrase | noun       | sentence
token       | 0          | 0.5        | 1           | 1          | 0.25
semantic    | 4          | 0          | 2           | 2          | 0.5
noun_phrase | 4          | 0.5        | 0           | 1          | 0.5
noun        | 4          | 0.5        | 2           | 0          | 0.25
sentence    | 4          | 2          | 4           | 2          | 0
sums        | 16         | 3.5        | 9           | 6          | 1.5
weight      | 0.44444444 | 0.09722222 | 0.25        | 0.16666667 | 0.04166667
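
The sums and weights above can be reproduced programmatically. The sketch below assumes the reading described in step 3 (column sums normalized by the grand total, here 36):

import numpy as np

methods = ["token", "semantic", "noun_phrase", "noun", "sentence"]

# Pairwise comparison matrix from the table above; rows and columns follow
# the order of `methods`, and entry [i][j] scores method j relative to method i.
comparisons = np.array([
    [0, 0.5, 1, 1, 0.25],
    [4, 0,   2, 2, 0.5],
    [4, 0.5, 0, 1, 0.5],
    [4, 0.5, 2, 0, 0.25],
    [4, 2,   4, 2, 0],
])

sums = comparisons.sum(axis=0)  # column sums: [16, 3.5, 9, 6, 1.5]
weights = sums / sums.sum()     # normalize by the grand total (36)
print(dict(zip(methods, weights.round(8))))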

mcda_weights = {
    "token": 0.44444444,
    "semantic": 0.09722222,
    "noun_phrase": 0.25,
    "noun": 0.16666667,
    "sentence": 0.04166667
}

MCDA Pseudo Implementation

The Python implementation would look something like this (assuming we have the data structures available):


import numpy as np

# Example data; in practice, load this from your source systems.
# terms_scores: dict mapping term_id -> list of scores from each provider,
# ordered as in mcda_weights (token, semantic, noun_phrase, noun, sentence)

terms_scores = {
    "d2ac602d-298f-4979-af43-63d189067136": [1, 0.78, 0, 1, 1],
    "423c405e-1bce-4ffa-884e-39dca3911436": [1, 0.65, 1, 0, 1],
    "b984dcc9-d71d-4c4b-a1e6-064bd823e48e": [0, 0.67, 1, 1, 0],
    "4d951431-4e24-4ee1-9d06-edbca35857e2": [1, 0.99, 1, 0, 1],
    "46b0a20a-3087-446d-8b9c-eb756c151b1c": [1, 0.67, 0, 1, 0],
    "f83b321d-c543-49d1-8ba7-2cc95d7003ae": [0, 0.12, 1, 0, 1],
    "1ec4693b-f57e-4179-bda2-db3004f6130c": [1, 0.67, 0, 1, 0]
}


# mcda_weights: dict where key is the method and the value is the weight

mcda_weights = {
    "token": 0.44444444,
    "semantic": 0.09722222,
    "noun_phrase": 0.25,
    "noun": 0.16666667,
    "sentence": 0.04166667
}

# Convert scores and weights into arrays for easier manipulation
# Assuming the order of the scores aligns with the order of methods in mcda_weights
scores_array = np.array(list(terms_scores.values()))
weights_array = np.array(list(mcda_weights.values()))

# Apply MCDA weights to the scores
weighted_scores = scores_array * weights_array

# Calculate the final weighted score for each term by summing across the methods
final_scores = weighted_scores.sum(axis=1)

print(final_scores)

# Apply a threshold to filter the terms
threshold = 0.7 # Define your threshold here
filtered_terms = {term_id: score for term_id, score in zip(terms_scores.keys(), final_scores) if score >= threshold}

# The filtered_terms now contains the term IDs and their weighted score above the threshold
print("Filtered Terms IDs with scores above the threshold:")
for term_id, score in filtered_terms.items():
    print(f"Term ID: {term_id}, Score: {score}")

In the code above:

  • terms_scores would be replaced by your actual data structure that holds the term IDs and their associated scores.
  • mcda_weights should contain the weights from the MCDA analysis for each method.
  • The scores are weighted using the MCDA weights, and then a threshold is applied to select the terms.

Example output:


Filtered Terms IDs with scores above the threshold:
Term ID: d2ac602d-298f-4979-af43-63d189067136, Score: 0.7286111116
Term ID: 423c405e-1bce-4ffa-884e-39dca3911436, Score: 0.7993055530000001
Term ID: 4d951431-4e24-4ee1-9d06-edbca35857e2, Score: 0.8323611078