FuzzyMatchPro


Scene 1 (0s)

[Audio] Hi, we are excited to present our experimental product, 'FuzzyMatchPro': decoding datacenter asset name matching with AI.

Scene 2 (49s)

[Audio] We are a team with expertise in data science, software engineering, and program management, united by a common goal: to leverage AI to achieve consistent asset name matching across various tools while improving the odds of data accuracy.

Scene 3 (4s)

[Audio] Today, Field Operations follow a set procedure for mitigating incidents or running maintenance tasks. Users verify whether any related incidents have been created using the IcM tool. Next, users log in to the WebTMA CMMS tool to validate any open work orders. In addition, they review troubleshooting guides or procedures, and asset details in the GDCO inventory, along with any hierarchy information that supports establishing a conclusive root cause of the issue.

Scene 4 (50s)

[Audio] A user takes anywhere from about 14 minutes to more than 30 minutes to identify the issue and mitigate the alert, and multiple days to complete a maintenance task. Each passing minute increases the potential risk of customer impact.

Scene 5 (1m 5s)

[Audio] The multitude of software tools has led to asset name nomenclature disparity across tools, making it arduous to identify, integrate, and correlate assets across the system.

Scene 6 (1m 23s)

[Audio] To address the disparate asset name issue, we conducted experiments using normalization, a pattern-based ML model, and ChatGPT Semantic Kernel techniques.

Scene 7 (1m 34s)

[Audio] Integration on asset names will allow users to have all the details at hand, improving the accuracy and speed of issue resolution and maintenance tasks, and scaling with datacenter growth.

Scene 8 (1m 48s)

[Audio] The impact of fuzzy asset name matching across datacenter tools is far-reaching, affecting the regional operations initiative, IRIS correlations at the device level, lease provider coordination, and innovation initiatives like DC Toolbox.

Scene 9 (2m 14s)

Procedural Approach. Normalization Strategy.

Scene 10 (2m 22s)

Regex to replace all non-alphanumeric separators in a name with a standard separator.
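
A minimal sketch of this separator normalization, assuming Python's re module and a hyphen as the standard separator; the sample names below are illustrative only.

```python
import re

def normalize_separators(name: str, standard_sep: str = "-") -> str:
    """Replace every run of non-alphanumeric characters with one standard separator."""
    return re.sub(r"[^A-Za-z0-9]+", standard_sep, name).strip(standard_sep)

# Illustrative inputs only; real tags vary by datacenter and tool.
print(normalize_separators("COLO1_CE2_UPS01_CTL01"))            # COLO1-CE2-UPS01-CTL01
print(normalize_separators(r"ADMIN_ATS01_ATS\D\Mode\Pref\S1"))  # ADMIN-ATS01-ATS-D-Mode-Pref-S1
```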

Scene 11 (2m 36s)

[Audio] The slide addresses the need for normalizing the naming process of EPMS/BAS asset tags. Currently, there is a high degree of variation in both format and naming conventions. The existing EPMS naming examples demonstrate the diverse formats used, such as "COLO1_CE2_UPS01_CTL01," "DB10-C3B-AH3A," "DB3-COLO3-MON," and "DB3-C4-CC." By normalizing the naming process, we can achieve several benefits, including consistency and improved clarity in asset tag names. One advantage of standardized naming is the ability to extract useful information from the asset tags, such as the name and location of the device within the data center. This standardization effort aligns with ongoing projects like the DC Toolbox, where having consistent and structured asset tag names can facilitate efficient data management and analysis.

Scene 12 (3m 43s)

[Audio] This slide highlights the process of using a regex to extract non-alphanumeric tokens from the EPMS asset tags and subsequently analyzing and categorizing them into different "buckets" representing naming conventions. By going through each asset tag, the tokens were classified into five classes: Data Center, Colo, Cell, Device, and Other. The "Other" class represents tokens that couldn't be categorized into any of the previous four classes. This classification process provided key bits of information for each asset tag, allowing for better understanding and analysis of the naming conventions used. Using these classes, a normalized version of the asset tag names was created, which helped establish a similar standard across different data centers and regions. The pre-normalized EPMS example showcases a complex asset tag with multiple tokens, while the post-normalized version demonstrates the standardized format achieved through the normalization process. The post-normalized format follows the pattern DataCenter-Colo-Cell-Device-Other, as shown in the example "SN7-COLO1-CELL4-PDU06-PDU-FCB01-A-E-Wh-3PTot".
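
As a rough illustration of this bucketing step, the sketch below splits a tag on its separators and assigns each token to one of the five classes; the regex patterns and class rules here are assumptions for illustration, not the production conventions.

```python
import re

# Illustrative token patterns per class; the real conventions differ per datacenter.
CLASS_PATTERNS = [
    ("DataCenter", re.compile(r"^(SN|DB|DUB)\d+$", re.I)),
    ("Colo",       re.compile(r"^COLO?\d+$", re.I)),
    ("Cell",       re.compile(r"^CELL\d+$", re.I)),
    ("Device",     re.compile(r"^(UPS|PDU|AHU|ATS|GEN)\d*[A-Z]?$", re.I)),
]

def classify_tokens(tag: str) -> list:
    """Split a tag on non-alphanumeric separators and assign each token to a class."""
    tokens = re.split(r"[^A-Za-z0-9]+", tag)
    labeled = []
    for tok in tokens:
        if not tok:
            continue
        cls = next((name for name, pat in CLASS_PATTERNS if pat.match(tok)), "Other")
        labeled.append((cls, tok))
    return labeled

print(classify_tokens("SN7_COLO1_CELL4_PDU06_FCB01"))
# [('DataCenter', 'SN7'), ('Colo', 'COLO1'), ('Cell', 'CELL4'), ('Device', 'PDU06'), ('Other', 'FCB01')]
```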

Scene 13 (5m 7s)

Pattern-based ML Model Strategy. Uses the AI/ML strategy best suited to the situation.

Scene 14 (5m 19s)

[Audio] Our Pattern-Based Asset Matcher utilizes a traditional machine learning approach to automate the process of identifying the best matches between two sets of asset names. This model is built upon key string features such as edit distance, common substring, and N-grams to evaluate the similarity of name pairs. The strength of this approach is reflected not only in our performance metrics but also in the robustness of the model. Even though the model was trained on SN7 data, it managed to generalize impressively well to unseen data, achieving a precision of 81% and a recall of 87% on DUB13 between EPMS and CIH names.
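
As a rough sketch of the kind of string features mentioned here (edit distance, common substring, character N-grams), the snippet below computes them for one illustrative name pair; the model's actual feature definitions are not shown in this deck.

```python
from difflib import SequenceMatcher

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest run of characters shared by both names."""
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size

def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of character n-gram sets."""
    grams = lambda s: {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

pair = ("SN7-COLO1-CELL4-PDU06", "COLO1_CE4_PDU06_CTL01")  # illustrative pair only
print(edit_distance(*pair), longest_common_substring(*pair), round(ngram_overlap(*pair), 2))
```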

Scene 15 (6m 2s)

[Audio] In this slide, we walk you through the key steps of our Pattern Learning Model. Starting from the 'Input Data', which consists of both matched and unmatched pairs of asset names drawn from our labeled dataset, we move on to 'Data Preprocessing', where we clean and normalize the data. Next, during 'Feature Extraction', we compute a set of features such as Edit Distance, Common Substring, and N-gram, among others. The 'Model Training' phase leverages XGBoost to learn from these features and generate a model that can predict the likelihood of a pair being a good match. Finally, in the 'Model Evaluation' step, we use metrics like precision and recall to assess the model's performance on unseen data. The end result is a robust pipeline that can identify good matches across diverse naming patterns with high scalability and strong performance. This is the essence of our Pattern Learning Model.
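
A hedged sketch of the 'Model Training' and 'Model Evaluation' steps using XGBoost, assuming a pair-feature matrix has already been computed; the random arrays below stand in for the real labeled SN7 pair features and are not the deck's data.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Placeholder feature matrix: one row per name pair (e.g. edit distance,
# longest common substring, n-gram overlap); label 1 = good match.
X = np.random.rand(1000, 3)           # stand-in for real pair features
y = (X[:, 2] > 0.5).astype(int)       # stand-in for human-labeled matches

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
print("precision", precision_score(y_te, pred), "recall", recall_score(y_te, pred))
```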

Scene 16 (7m 2s)

[Audio] In this slide, we're exploring the parallels between the way our Pattern Learning Model works and the way humans intuitively judge whether two names match. When humans compare two strings, they ask certain intuitive questions: Do they share common words or tokens? Is the order of characters or words similar? Are the lengths of the two names similar? Do they have similar character or word patterns? How many changes would I need to make for these two names to match? These factors mirror the features our model evaluates. In the table, you see a side-by-side comparison of these features for a good match and a bad match. The takeaway here is that our Pattern Learning Model emulates the intuitive human judgment process, leveraging a combination of features to accurately determine good matches and ensure high precision and recall.

Scene 17 (7m 56s)

[Audio] We trained the Pattern Learning Model on the SN7 dataset and tested its performance on the unseen DUB13 dataset. Despite the diversity in datasets, our model achieved a precision of 81%, meaning it correctly identified 81 out of every 100 matches. Further, it achieved a recall of 87%, indicating that it captured 87 out of every 100 actual matches. This performance, as shown in the confusion matrix, validates the robustness and adaptability of our model. This is crucial, as it means the model can easily extend to other datacenters, even without any labeled data. Our key takeaway here is that the Pattern Learning Model not only performs well but can also adapt effectively to unseen data.

Scene 18 (8m 47s)

Strategy Uses: transform a name using a few examples.

Scene 19 (9m 10s)

ChatGPT Strategy Advantages.

- Developed once; improves as ChatGPT and the examples in the database improve.
- Both humans and ChatGPT can author examples.
- Humans can validate examples, but very few are required: roughly 10 validated examples are needed from humans per major naming convention in the fleet.
- Examples are provided in natural language: name1 = name2 = name3 = name4, etc., plus a paragraph describing how name1 was transformed into the other names (a hypothetical stored example follows this list).
- ChatGPT can provide additional examples that humans can validate more quickly.
- Tokenization-to-embedding vector processing benefits from the Normalization Strategy.
- A vector database quickly identifies the most contextually relevant examples.
- Semantic Kernel plus examples generates a ReAct-style prompt.
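
As a small illustration of the natural-language example format described in the list above, the snippet below shows how one validated example might be stored for prompt construction; the tag names and explanation are hypothetical, not real fleet data.

```python
# One hypothetical validated example as it might be stored before prompt construction.
example = {
    "equivalent_names": [
        "COLO1_CE2_UPS01_CTL01",      # EPMS-style tag (illustrative)
        "SN7-COLO1-CELL2-UPS01",      # normalized-style tag (illustrative)
        "SN7.COLO1.CELL2.UPS01.CTL",  # CIH-style tag (illustrative)
    ],
    "explanation": (
        "The EPMS tag drops the datacenter prefix and uses underscores; "
        "CE2 is the same cell as CELL2; CTL01 is a controller suffix on UPS01."
    ),
}
print(example["equivalent_names"][0], "<->", example["equivalent_names"][2])
```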

Scene 20 (9m 35s)

[Audio] The proposed architecture leverages Azure OpenAI text prompts and the text-davinci-003 large language model (LLM) to implement a fuzzy matching system. The following steps outline the execution flow of the system:

Step 1: Preparation of Knowledge Data. The system prepares the necessary knowledge data for pattern matching and searching. This involves creating verified tag mappings from EPMS (Electrical Power Monitoring System) to CIH per datacenter. These mappings serve as the context for prompt generation.

Step 2: User/API Client Query. The user or an API client submits a query containing EPMS/CIH datapoint names to the system's API.

Step 3: Task Decomposition and Prompt Preparation. Upon receiving the query, the system's kernel decomposes the task and prepares a basic prompt. This prompt forms the foundation for further processing.

Step 4: Sample Integration and Prompt Construction. To enhance the prompt's effectiveness, the application retrieves samples from tagged data. These samples are then integrated into the prompt, enriching it with relevant information.

Step 5: Semantic Kernel Prompt Submission. The application sends the constructed Semantic Kernel prompt to Azure OpenAI, which handles the pattern matching and semantic understanding.

Step 6: Response from Azure OpenAI. Azure OpenAI processes the prompt and generates a response that contains the best pattern match for the given query.

Step 7: Formatting by Semantic Kernel. Upon receiving the response from Azure OpenAI, the Semantic Kernel component of the system formats the output in a structured and organized manner.

Step 8: Response Delivery to User/API Client. The formatted response is then sent back from the application to the user or API client, providing them with the relevant information they requested.

By following this architecture, the system applies device/datapoint name fuzzy matching powered by Azure OpenAI text prompts and the text-davinci-003 LLM to deliver accurate and contextually rich responses to user queries. A sketch of the core steps follows.
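
A minimal sketch of steps 3 through 6, assuming the pre-1.0 openai Python SDK configured for Azure OpenAI; the deployment name, environment variables, prompt wording, and example mapping are assumptions, and the deck's actual flow routes these steps through Semantic Kernel rather than calling the SDK directly.

```python
import os
import openai

# Azure OpenAI configuration (placeholder endpoint/key; pre-1.0 openai SDK style).
openai.api_type = "azure"
openai.api_base = os.environ["AZURE_OPENAI_ENDPOINT"]
openai.api_version = "2023-05-15"
openai.api_key = os.environ["AZURE_OPENAI_KEY"]

def build_prompt(query_name: str, examples: list) -> str:
    """Steps 3-4: basic prompt enriched with retrieved verified EPMS->CIH mapping samples."""
    context = "\n".join(examples)
    return (
        "You map EPMS datapoint names to their CIH equivalents.\n"
        f"Verified examples:\n{context}\n"
        f"EPMS name: {query_name}\nBest CIH match:"
    )

examples = ["COLO1_CE2_UPS01_CTL01 -> SN7.COLO1.CELL2.UPS01.CTL"]  # illustrative mapping
prompt = build_prompt("COLO1_CE2_UPS02_CTL01", examples)

# Steps 5-6: submit the prompt to a text-davinci-003 deployment and read the best match.
response = openai.Completion.create(
    engine="text-davinci-003",  # Azure deployment name (assumed)
    prompt=prompt,
    max_tokens=64,
    temperature=0,
)
print(response["choices"][0]["text"].strip())
```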

Scene 21 (12m 24s)

Demo: Semantic Kernel with Text Prompt.

Scene 22 (13m 6s)

[Audio] The proposed architecture utilizes Azure OpenAI's text-davinci-003 large language model (LLM) for prompting and text-embedding-ada-002 for creating embeddings to enhance search capabilities. The system's execution flow is outlined in the following steps:

Step 1: Preparation of Knowledge Data. The system prepares the necessary knowledge data, including verified tags and relevant information, to facilitate pattern matching and searching. These mappings establish the context for prompt relevance.

Step 2: Creation of Embeddings. The system employs the text-embedding-ada-002 model to create embeddings for the knowledge data. These embeddings enable efficient search and vector-based pattern matching.

Step 3: Storage of Embeddings. The system stores the generated embeddings in a vector database such as PostgreSQL or SQLite. This allows for persistent storage and improved performance by avoiding repeated embedding generation.

Step 4: User/API Client Query. The user or an API client submits a query to the system's API, providing information related to the anomaly or requested data.

Step 5: Task Decomposition and Prompt Preparation. The system's kernel decomposes the query into specific tasks and prepares a basic prompt that serves as the foundation for further processing.

Step 6: Embedding Preparation for Task Decomposition. The Semantic Kernel component feeds task information to Azure OpenAI for pattern matching. Azure OpenAI decomposes the tasks into basic embeddings for efficient search.

Step 7a/b: Embedding Search for Response. The Semantic Kernel uses the prompt to issue searches over the stored embeddings, aiming to find relevant matches for the queried task.

Step 8: Formatting by Semantic Kernel. Once the response is received from Azure OpenAI, the Semantic Kernel component formats the output in a structured and organized manner, ensuring readability and clarity.

Step 9: Response Delivery to User/API Client. The formatted response is then delivered from the application to the user or API client, providing them with the requested information and relevant insights.

By leveraging Azure OpenAI's text-davinci-003 model for prompting and text-embedding-ada-002 for creating efficient embeddings, this architecture enables accurate pattern matching, fast search, and effective information retrieval, enhancing the overall performance of the system. A sketch of the embedding and search steps follows.
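
A hedged sketch of steps 2, 3, and 7, again assuming the pre-1.0 openai Python SDK configured for Azure as in the previous sketch; an in-memory list stands in for the PostgreSQL/SQLite vector store, and the deployment name and sample mappings are assumptions.

```python
import numpy as np
import openai  # pre-1.0 SDK, configured for Azure as in the previous sketch

def embed(text: str) -> np.ndarray:
    """Step 2: create an embedding with a text-embedding-ada-002 deployment (name assumed)."""
    resp = openai.Embedding.create(engine="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

# Step 3: in-memory stand-in for the vector database of verified tag mappings.
knowledge = [
    "COLO1_CE2_UPS01_CTL01 -> SN7.COLO1.CELL2.UPS01.CTL",  # illustrative mappings
    "DB10-C3B-AH3A -> DB10.COLO3B.AHU3A",
]
store = [(entry, embed(entry)) for entry in knowledge]

def search(query: str, top_k: int = 1) -> list:
    """Step 7: cosine-similarity search over stored embeddings."""
    q = embed(query)
    scored = [(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)), entry)
              for entry, v in store]
    return [entry for _, entry in sorted(scored, reverse=True)[:top_k]]

print(search("COLO1_CE2_UPS01"))
```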

Scene 23 (16m 14s)

[Audio] The output of this matching process is an API that can be utilized across various processes for data integration and the deployment of machine learning (ML) models. While a supervised ML model requires a feedback loop for refinement, it is also beneficial to have a similar feedback loop for the Semantic Kernel approach. In the meantime, an unsupervised approach will be utilized and tested with the ROC program to begin with.

Scene 24 (16m 43s)

Thank you.

Scene 25 (16m 51s)

Appendix. Slides not for use beyond this point.

Scene 26 (16m 59s)

Resources. DC Naming Standards Data Entry Procedure.

Scene 27 (17m 7s)

Appendix. Sample data (SN7): 100 asset names; 60 similar (60%); 40 not similar (40%).

Scene 28 (17m 17s)

Q&A. Thank You. [Screenshot: WebTMA equipment record for a 1200 kW UPS at DUB-09; text not legible.]

Scene 29 (17m 28s)

Identifying related device IDs using Fuzzy Match Model & Heuristics.

Scene 30 (17m 55s)

Jaccard Similarity Model Strategy Considerations.
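
A minimal sketch of token-level Jaccard similarity between two device IDs, as one reading of the Jaccard similarity model named here; the tokenization and the example threshold are assumptions.

```python
import re

def jaccard_similarity(name_a: str, name_b: str) -> float:
    """Jaccard similarity of the token sets of two device IDs."""
    tok = lambda s: set(re.split(r"[^A-Za-z0-9]+", s.upper())) - {""}
    a, b = tok(name_a), tok(name_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Illustrative pair; a match score threshold such as 0.8 could flag related device IDs.
print(round(jaccard_similarity("SN7-COLO1-CELL4-PDU06", "COLO1_CELL4_PDU06_CTL01"), 2))
```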

Scene 31 (18m 10s)

[Drawing excerpt: asset labels such as COLO2, CELL3, CELL4, AHU01, AHU07, ELEC 2-4, and BATT, plus the note "AHU will be provided with internal controls and circuit setter (typ. for all)", illustrating naming variation on drawings.]

Scene 32 (18m 34s)

[Audio] Our product, 'FuzzyMatchPro', is a smart solution that aims to run a fuzzy match on asset names, allowing data integration among multiple tools: work orders from CMMS, requests for change, EPMS/BAS alarms, equipment and spares details from GDCO, and operational states, incident data, and hierarchies from CIH.

Scene 33 (19m 8s)

Separators: hyphen (-), backslash (\), dot (.).
ADMIN_ATS01_ATS\D\Mode\Pref\S1
SN7-ADMIN-ATS01.S1.Status.Preferred.

Scene 34 (19m 19s)


Scene 35 (19m 25s)

[Audio] Richard's update:

1. Assemble lists of device names from a variety of different sets of device names. Maybe start with just two.
2. Get examples of device names that do match.
3. Have an SME explain how to do this matching. I'm guessing it involves domain knowledge that says, for example, this is a generator, generator number 2, in cell B. I'm guessing there is a combination of attributes about a device which can be represented, which we can use to match up devices that come from different naming conventions.
4. Try teaching a computer to do this. Imagine a function that says, given a device name, tell us what you know about it (see the sketch after these notes). Example: DB3-AHU-4-8 gives Datacenter = db3, Type = AHU, Cell = 4, Unit number = 8. I imagine the list of attributes would be different for different device types. Hypothesis: if we could make such a function, we could use the resulting list of attributes to do the matching. When certain attributes are unclear, we can flag those as requiring SME input.
5. Long-term suggestion: think of this like assembling a database of the One True Device Name, plus synonyms.

Let's get HWIS involved. My understanding is that HWIS should own the idea of which assets/devices are contained within each other, including CE devices. Long-term this should be in their wheelhouse. HWIS is who needs to own the translation database. HWIS covers what asset is contained within what; Facility Master handles what asset is connected to what.
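
A hedged sketch of the attribute-extraction function imagined in item 4, covering only the DB3-AHU-4-8 style mentioned there; the regex and field names are assumptions for illustration.

```python
import re

def describe_device(name: str) -> dict:
    """Parse a device name like 'DB3-AHU-4-8' into attributes; None marks fields needing SME input."""
    m = re.match(
        r"^(?P<datacenter>[A-Za-z]+\d+)-(?P<type>[A-Za-z]+)-(?P<cell>\d+)-(?P<unit>\d+)$",
        name,
    )
    if not m:
        # Unrecognized convention: flag every attribute as unclear.
        return {"datacenter": None, "type": None, "cell": None, "unit": None}
    return {k: v.lower() for k, v in m.groupdict().items()}

print(describe_device("DB3-AHU-4-8"))
# {'datacenter': 'db3', 'type': 'ahu', 'cell': '4', 'unit': '8'}
```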

Scene 36 (21m 12s)

User Scenario 1.

Scene 37 (21m 19s)

[Audio] In 2023 (January to date), X of the Y total device IDs sampled were found to be matching. Scoping a new rule based on strongly related fuzzy device ID similarity, a match score of Z (e.g., 0.8, 0.9 or above), in a product set would identify M of these device IDs and improve our integration between tools by N%. The biggest lift from scoping heuristics based on fuzzy device ID matching would occur in the 'R' region, with P% integration.

Scene 38 (21m 54s)

Jaccard Similarity Model Strategy Success. Out of a sample of X device IDs over duration …, the model/heuristics flag a false positive rate of about FP%. Open questions: Is more research needed? How well did the pattern matching perform? Did the sample have enough fidelity to determine whether these conventions generate more false positives for certain types of device names in certain tools?

Scene 39 (22m 12s)

Normalization Strategy Examples.