Outline Document: Record Linking Approach Using MCDA
1. Objective
To unify campaign data from disparate source systems (Prisma, CARIA, Tag Sheets) by resolving and linking related records to a single canonical campaign entity. This will support automation, improve data integrity, and reduce manual reconciliation.
2. Overview of the Record Linking Challenge
Campaign-related data originates from multiple systems with:
- Inconsistent naming conventions
- Partial overlaps in metadata
- Human-generated input leading to typos or variability
Traditional deterministic joins (e.g., exact matches on campaign name or advertiser) are insufficient. Therefore, a multi-criteria probabilistic matching approach is required.
3. Strategy: Multi-Criteria Decision Analysis (MCDA)
We adapt an MCDA-based record linking framework. Each field contributes to the overall match score based on:
- Its reliability across systems
- The degree of expected variation
- Its availability in all systems
Each field is paired with an appropriate matching strategy, and weighted based on its importance and consistency.
4. Fields Considered for Matching
The following metadata fields are used to establish a link between records across systems:
| Field | Systems Present In | Description |
|---|---|---|
| Advertiser | Prisma, CARIA, Tag Sheets | Client or brand name |
| Campaign Name | Prisma, CARIA, Tag Sheets | Human-readable campaign title |
| Start Date | Prisma, CARIA, Tag Sheets | Campaign launch date |
| End Date | Prisma, CARIA, Tag Sheets | Campaign termination date |
| Agency | Prisma, CARIA | Media or creative agency responsible |
5. Matching Techniques per Field
Each field uses a strategy tailored to its data quality and semantic variability:
| Field | Matching Strategy | Justification |
|---|---|---|
| Advertiser | Fuzzy matching (Levenshtein, Jaro-Winkler) + NER normalization | Variants like "Tesco", "Tesco PLC", "TESCO UK" |
| Campaign Name | Sentence embeddings + cosine similarity | Free-form strings with inconsistent formatting |
| Start/End Date | Date window matching with decay function | Dates help constrain matching by temporal proximity |
| Agency | Fuzzy matching | Typos and naming variance, e.g., "OMD UK" vs "OMD" |
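To make the string and date strategies concrete, here is a minimal sketch assuming a RapidFuzz Jaro-Winkler scorer and an exponential date-decay function; the helper names and the seven-day decay constant are illustrative choices, not fixed design decisions:
import math
from datetime import date
from rapidfuzz.distance import JaroWinkler

def name_score(a: str, b: str) -> float:
    # Jaro-Winkler similarity on casefolded strings, already in [0, 1]
    return JaroWinkler.similarity(a.casefold(), b.casefold())

def date_score(d1: date, d2: date, tau_days: float = 7.0) -> float:
    # Exponential decay: identical dates score 1.0; a gap of tau_days scores ~0.37
    return math.exp(-abs((d1 - d2).days) / tau_days)

print(name_score("Tesco PLC", "TESCO UK"))             # high, but below 1.0
print(date_score(date(2024, 6, 1), date(2024, 6, 4)))  # ~0.65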
6. Weighting Model (MCDA Scoring)
Each field’s match is weighted based on its predictive utility and availability:
| Field | Weight | Notes |
|---|---|---|
| Campaign Name | 0.35 | High variability but semantically rich |
| Advertiser | 0.30 | Common and stable, but occasionally typo-prone |
| Agency | 0.15 | Less reliable, but helpful in combination |
| Start Date | 0.10 | Used to confirm closeness in flighting |
| End Date | 0.10 | Used in tandem with start date |
The overall match score is calculated as a weighted sum of individual field match scores.
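As a quick worked example of that sum (the per-field scores below are hypothetical values for a single candidate record pair):
field_weights = {
    "campaign_name": 0.35,
    "advertiser": 0.30,
    "agency": 0.15,
    "start_date": 0.10,
    "end_date": 0.10
}
# Hypothetical per-field match scores for one candidate pair
field_scores = {
    "campaign_name": 0.82,
    "advertiser": 0.95,
    "agency": 1.0,
    "start_date": 0.9,
    "end_date": 0.7
}
composite = sum(field_weights[f] * field_scores[f] for f in field_weights)
print(round(composite, 3))  # 0.882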
7. Record Linking Workflow
Step 1: Extract relevant fields from all source systems:
- Normalize date formats and name casing; remove punctuation.
Step 2: Compute match scores per field:
- Apply field-specific matching functions.
- Normalize scores between 0 (no match) and 1 (perfect match).
Step 3: Combine using MCDA:
- Multiply each field match score by its weight.
- Sum to obtain a final composite match score.
Step 4: Apply thresholding:
- Scores above a threshold (e.g., 0.75) are treated as linked.
- Scores below the threshold are routed to manual review or a fallback fuzzy-linking pass.
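Taken together, the workflow could be shaped roughly as follows (link_score, decide, and the scorers mapping are illustrative names; the field scorers from section 5 are assumed to return values in [0, 1]):
def link_score(rec_a: dict, rec_b: dict, weights: dict, scorers: dict) -> float:
    # Steps 2-3: compute each field's match score, weight it, and sum
    return sum(
        weights[field] * scorers[field](rec_a[field], rec_b[field])
        for field in weights
    )

THRESHOLD = 0.75  # Step 4: example cut-off from above

def decide(score: float) -> str:
    # Step 4: link confidently above the threshold, otherwise route to review
    return "linked" if score >= THRESHOLD else "manual review"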
8. Implementation Recommendations
- NLP Toolkits: Use libraries like spaCy, sentence-transformers (SBERT) for semantic matching.
- Fuzzy Matching Libraries: RapidFuzz, FuzzyWuzzy, or difflib for string-based comparisons.
- Scoring Infrastructure: Build a scoring engine that operates on pandas DataFrames for scalability.
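For the campaign-name comparison, a minimal sentence-transformers sketch (the model name is one common choice, not a requirement):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice
names = ["Summer Brand Push 2024 - UK Display", "summer_brand_push_uk_display_2024"]
embeddings = model.encode(names, convert_to_tensor=True)
# Cosine similarity lies in [-1, 1]; clamp or rescale to [0, 1] before weighting
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(similarity)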
9. Outcome
The result will be a single, unified campaign identity with reconciled data from all source systems, enabling:
- Seamless automation of creative and tag validation
- Full lifecycle visibility for campaign execution
- Reduction in human-in-the-loop processing
Multi-Criteria Decision Analysis (MCDA)
MCDA Overview
Multi-Criteria Decision Analysis (MCDA) is a decision-making framework that evaluates multiple criteria in complex situations. The key component of MCDA is the pairwise comparison matrix, where the importance of each criterion is compared directly against every other criterion. A decision-maker or expert assigns values based on how much more important one element is than another. These comparisons result in a matrix whose diagonal is zero in this convention (a self-comparison carries no preference information; see the legend below) and whose off-diagonal elements capture the relative importance of each pair.
Pairwise comparison is a fundamental component of MCDA and is used to determine the relative importance of each criterion (or method, in the context of this example). Here's how it fits into the process, with a focus on the pairwise comparison aspect:
- Pairwise Comparison of Criteria: The table below is a pairwise comparison matrix. For each pair of methods, a decision-maker has judged how important one is relative to the other in contributing to the overall goal, which in this example is the extraction of terms from text. For instance, 'token' scores 0 against itself (the diagonal), while the 4 in the 'semantic' row of the 'token' column records the judgment that 'token' is four times as important as 'semantic'. This process is repeated for every pair of methods until the matrix is filled.
- Scoring the Pairwise Comparisons: The numbers in the matrix reflect the relative importance between pairs. A score of 1 indicates that the two methods are equally important. Scores above 1 indicate that the column method is more important than the row method, with the magnitude reflecting the strength of the preference; scores between 0 and 1 indicate the opposite.
- Summing the Pairwise Comparisons: After all pairwise comparisons are made, each method's column is summed to produce a total score reflecting its overall importance relative to all other methods.
- Normalizing the Weights: To make these sums usable as weights, they are normalized so that they sum to 1 (or 100%). This is done by dividing each method's column sum by the grand total of all sums. The resulting normalized weights reflect the proportionate importance of each method.
- Application of Weights: These normalized weights are then used in the decision-making process. For the term extraction example, when each provider (token, semantic, noun_phrase, noun, sentence) generates a score for a term, that score is multiplied by the method's weight to obtain a score that reflects both the term's relevance and the method's importance as determined by the pairwise comparison process.
In summary, the pairwise comparison matrix is a structured way to capture and quantify subjective judgments about the relative importance of each criterion. These judgments are then converted into a set of weights through normalization, which can be applied to the decision-making process. The method ensures that the final decision reflects the considered opinion of experts or decision-makers about the relative importance of the criteria involved.
MCDA Legend
| Value | Meaning |
|---|---|
| 0 | No effect (on the diagonal) |
| 1 | As important |
| 2 | More important |
| 4 | Significantly more important |
| 0.5 | Less important |
| 0.25 | Significantly less important |
MCDA Weights
|  | token | semantic | noun_phrase | noun | sentence |
|---|---|---|---|---|---|
| token | 0 | 0.5 | 1 | 1 | 0.25 |
| semantic | 4 | 0 | 2 | 2 | 0.5 |
| noun_phrase | 4 | 0.5 | 0 | 1 | 0.5 |
| noun | 4 | 0.5 | 2 | 0 | 0.25 |
| sentence | 4 | 2 | 4 | 2 | 0 |
| sums | 16 | 3.5 | 9 | 6 | 1.5 |
| weight | 0.44444444 | 0.09722222 | 0.25 | 0.16666667 | 0.04166667 |
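The weight row above can be reproduced directly from the matrix: each method's column is summed and divided by the grand total of all entries (36 here). A minimal sketch using the values from the table:
import numpy as np

methods = ["token", "semantic", "noun_phrase", "noun", "sentence"]
# Pairwise comparison matrix, rows and columns ordered as in `methods`
matrix = np.array([
    [0, 0.5, 1, 1, 0.25],
    [4, 0,   2, 2, 0.5],
    [4, 0.5, 0, 1, 0.5],
    [4, 0.5, 2, 0, 0.25],
    [4, 2,   4, 2, 0],
])
column_sums = matrix.sum(axis=0)           # [16, 3.5, 9, 6, 1.5]
weights = column_sums / column_sums.sum()  # grand total is 36
print(dict(zip(methods, weights.round(8))))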
mcda_weights = {
    "token": 0.44444444,
    "semantic": 0.09722222,
    "noun_phrase": 0.25,
    "noun": 0.16666667,
    "sentence": 0.04166667
}
MCDA Pseudo Implementation
The Python implementation would look something like this (assuming we have the data structures available):
import numpy as np
# This is an example, you would actually load this data from your source
# terms_scores: dict where key is the term_id and the value is a list of scores from each provider
terms_scores = {
    "d2ac602d-298f-4979-af43-63d189067136": [1, 0.78, 0, 1, 1],
    "423c405e-1bce-4ffa-884e-39dca3911436": [1, 0.65, 1, 0, 1],
    "b984dcc9-d71d-4c4b-a1e6-064bd823e48e": [0, 0.67, 1, 1, 0],
    "4d951431-4e24-4ee1-9d06-edbca35857e2": [1, 0.99, 1, 0, 1],
    "46b0a20a-3087-446d-8b9c-eb756c151b1c": [1, 0.67, 0, 1, 0],
    "f83b321d-c543-49d1-8ba7-2cc95d7003ae": [0, 0.12, 1, 0, 1],
    "1ec4693b-f57e-4179-bda2-db3004f6130c": [1, 0.67, 0, 1, 0]
}
# mcda_weights: dict where key is the method and the value is the weight
mcda_weights = {
    "token": 0.44444444,
    "semantic": 0.09722222,
    "noun_phrase": 0.25,
    "noun": 0.16666667,
    "sentence": 0.04166667
}
# Convert scores and weights into arrays for easier manipulation
# Assuming the order of the scores aligns with the order of methods in mcda_weights
scores_array = np.array(list(terms_scores.values()))
weights_array = np.array(list(mcda_weights.values()))
# Apply MCDA weights to the scores
weighted_scores = scores_array * weights_array
# Calculate the final weighted score for each term by summing across the methods
final_scores = weighted_scores.sum(axis=1)
print(final_scores)
# Apply a threshold to filter the terms
threshold = 0.7 # Define your threshold here
filtered_terms = {term_id: score for term_id, score in zip(terms_scores.keys(), final_scores) if score >= threshold}
# The filtered_terms now contains the term IDs and their weighted score above the threshold
print("Filtered Terms IDs with scores above the threshold:")
for term_id, score in filtered_terms.items():
    print(f"Term ID: {term_id}, Score: {score}")
In the code above:
- `terms_scores` would be replaced by your actual data structure that holds the term IDs and their associated scores.
- `mcda_weights` should contain the weights from the MCDA analysis for each method.
- The scores are weighted using the MCDA weights, and then a threshold is applied to select the terms.
Example output:
Filtered Terms IDs with scores above the threshold:
Term ID: d2ac602d-298f-4979-af43-63d189067136, Score: 0.7286111116
Term ID: 423c405e-1bce-4ffa-884e-39dca3911436, Score: 0.7993055530000001
Term ID: 4d951431-4e24-4ee1-9d06-edbca35857e2, Score: 0.8323611078