Outline Document: Record Linking Approach Using MCDA
1. Objective
To unify campaign data from disparate source systems (Prisma, CARIA, Tag Sheets) by resolving and linking related records to a single canonical campaign entity. This will support automation, improve data integrity, and reduce manual reconciliation.
2. Overview of the Record Linking Challenge
Campaign-related data originates from multiple systems with:
- Inconsistent naming conventions
- Partial overlaps in metadata
- Human-generated input leading to typos or variability
Traditional deterministic joins (e.g., exact matches on campaign name or advertiser) are insufficient. Therefore, a multi-criteria probabilistic matching approach is required.
3. Strategy: Multi-Criteria Decision Analysis (MCDA)
We adapt an MCDA-based record linking framework. Each field contributes to the overall match score based on:
- Its reliability across systems
- The degree of expected variation
- Its availability in all systems
Each field is paired with an appropriate matching strategy, and weighted based on its importance and consistency.
4. Fields Considered for Matching
The following metadata fields are used to establish a link between records across systems:
| Field | Systems Present In | Description |
|---|---|---|
| Advertiser | Prisma, CARIA, Tag Sheets | Client or brand name |
| Campaign Name | Prisma, CARIA, Tag Sheets | Human-readable campaign title |
| Start Date | Prisma, CARIA, Tag Sheets | Campaign launch date |
| End Date | Prisma, CARIA, Tag Sheets | Campaign termination date |
| Agency | Prisma, CARIA | Media or creative agency responsible |
5. Matching Techniques per Field
Each field uses a strategy tailored to its data quality and semantic variability:
| Field | Matching Strategy | Justification |
|---|---|---|
| Advertiser | Fuzzy matching (Levenshtein, Jaro-Winkler) + NER normalization | Variants like "Tesco", "Tesco PLC", "TESCO UK" |
| Campaign Name | Sentence embeddings + cosine similarity | Free-form strings with inconsistent formatting |
| Start/End Date | Date window matching with decay function | Dates help constrain matching by temporal proximity |
| Agency | Fuzzy matching | Typos and naming variance, e.g., "OMD UK" vs "OMD" |
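To make the string and date strategies concrete, here is a minimal sketch assuming a RapidFuzz Jaro-Winkler scorer and an exponential date-decay function; the helper names and the seven-day decay constant are illustrative choices, not fixed design decisions:
import math
from datetime import date
from rapidfuzz.distance import JaroWinkler

def name_score(a: str, b: str) -> float:
    # Jaro-Winkler similarity on casefolded strings, already in [0, 1]
    return JaroWinkler.similarity(a.casefold(), b.casefold())

def date_score(d1: date, d2: date, tau_days: float = 7.0) -> float:
    # Exponential decay: identical dates score 1.0; a gap of tau_days scores ~0.37
    return math.exp(-abs((d1 - d2).days) / tau_days)

print(name_score("Tesco PLC", "TESCO UK"))             # high, but below 1.0
print(date_score(date(2024, 6, 1), date(2024, 6, 4)))  # ~0.65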
6. Weighting Model (MCDA Scoring)
Each field’s match is weighted based on its predictive utility and availability:
| Field | Weight | Notes |
|---|---|---|
| Campaign Name | 0.35 | High variability but semantically rich |
| Advertiser | 0.30 | Common and stable, but occasionally typo-prone |
| Agency | 0.15 | Less reliable, but helpful in combination |
| Start Date | 0.10 | Used to confirm closeness in flighting |
| End Date | 0.10 | Used in tandem with start date |
The overall match score is calculated as a weighted sum of individual field match scores.
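As a quick worked example of that sum (the per-field scores below are hypothetical values for a single candidate record pair):
field_weights = {
    "campaign_name": 0.35,
    "advertiser": 0.30,
    "agency": 0.15,
    "start_date": 0.10,
    "end_date": 0.10
}
# Hypothetical per-field match scores for one candidate pair
field_scores = {
    "campaign_name": 0.82,
    "advertiser": 0.95,
    "agency": 1.0,
    "start_date": 0.9,
    "end_date": 0.7
}
composite = sum(field_weights[f] * field_scores[f] for f in field_weights)
print(round(composite, 3))  # 0.882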
7. Record Linking Workflow
Step 1: Extract relevant fields from all source systems:
- Normalize date formats and name casing; remove punctuation.
Step 2: Compute match scores per field:
- Apply field-specific matching functions.
- Normalize scores between 0 (no match) and 1 (perfect match).
Step 3: Combine using MCDA:
- Multiply each field match score by its weight.
- Sum to obtain a final composite match score.
Step 4: Apply thresholding:
- Scores above a threshold (e.g., 0.75) are treated as linked.
- Scores below the threshold are routed to manual review or a fallback fuzzy-linking pass.
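Taken together, the workflow could be shaped roughly as follows (link_score, decide, and the scorers mapping are illustrative names; the field scorers from section 5 are assumed to return values in [0, 1]):
def link_score(rec_a: dict, rec_b: dict, weights: dict, scorers: dict) -> float:
    # Steps 2-3: compute each field's match score, weight it, and sum
    return sum(
        weights[field] * scorers[field](rec_a[field], rec_b[field])
        for field in weights
    )

THRESHOLD = 0.75  # Step 4: example cut-off from above

def decide(score: float) -> str:
    # Step 4: link confidently above the threshold, otherwise route to review
    return "linked" if score >= THRESHOLD else "manual review"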
8. Implementation Recommendations
- NLP Toolkits: Use libraries like spaCy, sentence-transformers (SBERT) for semantic matching.
- Fuzzy Matching Libraries: RapidFuzz, FuzzyWuzzy, or difflib for string-based comparisons.
- Scoring Infrastructure: Build a scoring engine that operates on pandas DataFrames for scalability.
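For the campaign-name comparison, a minimal sentence-transformers sketch (the model name is one common choice, not a requirement):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice
names = ["Summer Brand Push 2024 - UK Display", "summer_brand_push_uk_display_2024"]
embeddings = model.encode(names, convert_to_tensor=True)
# Cosine similarity lies in [-1, 1]; clamp or rescale to [0, 1] before weighting
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(similarity)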
9. Outcome
The result will be a single, unified campaign identity with reconciled data from all source systems, enabling:
- Seamless automation of creative and tag validation
- Full lifecycle visibility for campaign execution
- Reduction in human-in-the-loop processing
Multi-Criteria Decision Analysis (MCDA)
MCDA Overview
Multi-Criteria Decision Analysis (MCDA) is a decision-making framework that evaluates multiple criteria in complex situations. The key component of MCDA is the pairwise comparison matrix, where the importance of each criterion is compared directly against every other criterion. A decision-maker or expert assigns values based on how much more important one element is than another. These comparisons result in a matrix whose diagonal is zero in this convention (a self-comparison carries no preference information; see the legend below) and whose off-diagonal elements capture the relative importance of each pair.
Pairwise comparison is a fundamental component of MCDA and is used to determine the relative importance of each criterion (or method, in the context of this example). Here's how it fits into the process, with a focus on the pairwise comparison aspect:
- Pairwise Comparison of Criteria: The table below is a pairwise comparison matrix. For each pair of methods, a decision-maker has judged how important one is relative to the other in contributing to the overall goal, which in this example is the extraction of terms from text. For instance, 'token' scores 0 against itself (the diagonal), while the 4 in the 'semantic' row of the 'token' column records the judgment that 'token' is four times as important as 'semantic'. This process is repeated for every pair of methods until the matrix is filled.
- Scoring the Pairwise Comparisons: The numbers in the matrix reflect the relative importance between pairs. A score of 1 indicates that the two methods are equally important. Scores above 1 indicate that the column method is more important than the row method, with the magnitude reflecting the strength of the preference; scores between 0 and 1 indicate the opposite.
- Summing the Pairwise Comparisons: After all pairwise comparisons are made, each method's column is summed to produce a total score reflecting its overall importance relative to all other methods.
- Normalizing the Weights: To make these sums usable as weights, they are normalized so that they sum to 1 (or 100%). This is done by dividing each method's column sum by the grand total of all sums. The resulting normalized weights reflect the proportionate importance of each method.
- Application of Weights: These normalized weights are then used in the decision-making process. For the term extraction example, when each provider (token, semantic, noun_phrase, noun, sentence) generates a score for a term, that score is multiplied by the method's weight to obtain a score that reflects both the term's relevance and the method's importance as determined by the pairwise comparison process.
In summary, the pairwise comparison matrix is a structured way to capture and quantify subjective judgments about the relative importance of each criterion. These judgments are then converted into a set of weights through normalization, which can be applied to the decision-making process. The method ensures that the final decision reflects the considered opinion of experts or decision-makers about the relative importance of the criteria involved.
MCDA Legend
| Value | Meaning |
|---|---|
| 0 | No effect (on the diagonal) |
| 1 | As important |
| 2 | More important |
| 4 | Significantly more important |
| 0.5 | Less important |
| 0.25 | Significantly less important |
MCDA Weights
|  | token | semantic | noun_phrase | noun | sentence |
|---|---|---|---|---|---|
| token | 0 | 0.5 | 1 | 1 | 0.25 |
| semantic | 4 | 0 | 2 | 2 | 0.5 |
| noun_phrase | 4 | 0.5 | 0 | 1 | 0.5 |
| noun | 4 | 0.5 | 2 | 0 | 0.25 |
| sentence | 4 | 2 | 4 | 2 | 0 |
| sums | 16 | 3.5 | 9 | 6 | 1.5 |
| weight | 0.44444444 | 0.09722222 | 0.25 | 0.16666667 | 0.04166667 |
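The weight row above can be reproduced directly from the matrix: each method's column is summed and divided by the grand total of all entries (36 here). A minimal sketch using the values from the table:
import numpy as np

methods = ["token", "semantic", "noun_phrase", "noun", "sentence"]
# Pairwise comparison matrix, rows and columns ordered as in `methods`
matrix = np.array([
    [0, 0.5, 1, 1, 0.25],
    [4, 0,   2, 2, 0.5],
    [4, 0.5, 0, 1, 0.5],
    [4, 0.5, 2, 0, 0.25],
    [4, 2,   4, 2, 0],
])
column_sums = matrix.sum(axis=0)           # [16, 3.5, 9, 6, 1.5]
weights = column_sums / column_sums.sum()  # grand total is 36
print(dict(zip(methods, weights.round(8))))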
mcda_weights = {
    "token": 0.44444444,
    "semantic": 0.09722222,
    "noun_phrase": 0.25,
    "noun": 0.16666667,
    "sentence": 0.04166667
}
MCDA Pseudo Implementation
The Python implementation would look something like this (assuming we have the data structures available):
import numpy as np
# This is an example, you would actually load this data from your source
# terms_scores: dict where key is the term_id and the value is a list of scores from each provider
terms_scores = {
    "d2ac602d-298f-4979-af43-63d189067136": [1, 0.78, 0, 1, 1],
    "423c405e-1bce-4ffa-884e-39dca3911436": [1, 0.65, 1, 0, 1],
    "b984dcc9-d71d-4c4b-a1e6-064bd823e48e": [0, 0.67, 1, 1, 0],
    "4d951431-4e24-4ee1-9d06-edbca35857e2": [1, 0.99, 1, 0, 1],
    "46b0a20a-3087-446d-8b9c-eb756c151b1c": [1, 0.67, 0, 1, 0],
    "f83b321d-c543-49d1-8ba7-2cc95d7003ae": [0, 0.12, 1, 0, 1],
    "1ec4693b-f57e-4179-bda2-db3004f6130c": [1, 0.67, 0, 1, 0]
}
# mcda_weights: dict where key is the method and the value is the weight
mcda_weights = {
    "token": 0.44444444,
    "semantic": 0.09722222,
    "noun_phrase": 0.25,
    "noun": 0.16666667,
    "sentence": 0.04166667
}
# Convert scores and weights into arrays for easier manipulation
# Assuming the order of the scores aligns with the order of methods in mcda_weights
scores_array = np.array(list(terms_scores.values()))
weights_array = np.array(list(mcda_weights.values()))
# Apply MCDA weights to the scores
weighted_scores = scores_array * weights_array
# Calculate the final weighted score for each term by summing across the methods
final_scores = weighted_scores.sum(axis=1)
print(final_scores)
# Apply a threshold to filter the terms
threshold = 0.7 # Define your threshold here
filtered_terms = {term_id: score for term_id, score in zip(terms_scores.keys(), final_scores) if score >= threshold}
# The filtered_terms now contains the term IDs and their weighted score above the threshold
print("Filtered Terms IDs with scores above the threshold:")
for term_id, score in filtered_terms.items():
    print(f"Term ID: {term_id}, Score: {score}")
In the code above:
- `terms_scores` would be replaced by your actual data structure that holds the term IDs and their associated scores.
- `mcda_weights` should contain the weights from the MCDA analysis for each method.
- The scores are weighted using the MCDA weights, and then a threshold is applied to select the terms.
Example output:
Filtered Terms IDs with scores above the threshold:
Term ID: d2ac602d-298f-4979-af43-63d189067136, Score: 0.7286111116
Term ID: 423c405e-1bce-4ffa-884e-39dca3911436, Score: 0.7993055530000001
Term ID: 4d951431-4e24-4ee1-9d06-edbca35857e2, Score: 0.8323611078