[Audio] Background
Human errors in Operations often result in a Major Incident (MI) or an outage, which severely impacts our customers' business. A Major Incident or outage in production may directly affect our customers' revenue, trust, and reputation, and it damages the TCS relationship through loss of credibility. Most of these human errors are caused by incorrect execution or omission of a critical action, and they are avoidable with the required process controls and best practices. TCS Confidential.
[Audio] Prevent Human Errors by Design: to be eliminated with RiO best practices adoption
The 10 most common human errors in operations are detailed here to help identify and eliminate them.
Top 10 Human Errors: Know the top 10 common human errors in Operations.
Understand the Cause: Understand why it happens, with common scenarios / examples (RiO Network will help with details).
Learn How to Avoid: Learn best practices and process controls to avoid it.
Eliminate with Action: Eliminate causes by implementing Permanent Fix/Actions (check RiO Norms for details).
Sustain & Share: Perform PAV / RAQs Assessment regularly to mitigate risks, and share best practices with others.
These top 10 human errors have occurred repeatedly in Operations (AO/IS), and all of them are avoidable with RiO & PAV practices.
[Audio] Prevent Human Errors by Design: Top 10 Human Errors in Operations
1. Missing Critical Alert Notification: Missing critical alert notification in production resulting in business impacting Major Incident or unplanned outage
2. Accidental change in production: Unintended accidental change in production assuming non-production environment by privileged admin users
3. Direct SQL with incorrect 'where' clause: Unexpected data update due to incorrect SQL execution in command window without 'where' clause or parameters
4. Incorrect Trigger of Batch Job: Incorrect trigger or mishandling of batch job due to manual batch processing
5. Misuse of 'unnecessary' privileged access: Misuse of unwanted privileged access provisioned in production to non-admin users
6. Error in manual command execution: Manual command execution without peer review during change execution in production
7. Inadequate Change Impact Analysis: Inadequate change impact analysis leading to change induced issues in production
8. Non-Adherence to SOPs: Change failure due to missing a critical step of Standard Operating Procedure (SOP) during a change in production
9. Not following 'Maker Checker' process: Incorrect change in production without review due to Maker Checker process not followed for critical operations activity
10. Missing Server certificate renewal: Unplanned outage due to a miss in timely renewal of server certificate/license due to manual tracking of expiry dates
Note: The Top 10 Human Errors listed above are based on the Human Errors reported over the last 2 years by AO/IS Operations engagements across Units.
Delivery Excellence Group TCS Confidential.
[Audio] 1. Accidental change in production assuming Non-Production
Human Error Risk Category: Very High
Frequency: High; Impact: High
Risk Identifiable: Yes, with PAV/RAQs Assessment
Risk Ownership: Internal (TCS)
Avoidable: Yes, with process controls
Common Scenarios:
Scenario 1: A Support Executive or DBA updates or deletes data in the production environment, assuming it is a non-production database server.
Scenario 2: An Infrastructure Support executive updates a production resource (network / firewall / router / storage drive etc.), assuming it is a non-production resource.
Why does it happen?
➢ Unable to identify Production and Non-Production visually
➢ Permanent Write Access provisioned in Production
➢ Multiple Windows / Sessions opened side by side
How to avoid it? Workaround/Best Practice:
➢ Enable Visible Demarcation or alert users while logging in to Production
➢ Remove permanent direct access; enable multi-layered (VDI / VPN / 2-factor) authentication in production
➢ Do not work with production and non-production windows open side by side
➢ Maintain log entries for production server access, and enable audit checks
How to avoid it? Permanent Solution/Fix:
➢ Replace 'write' with 'read' access in production, if only 'read' is required
➢ Leverage a Privileged Access Management (PAM) tool for on-demand access
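The "visible demarcation" and "alert users while logging in" controls on this slide could be sketched roughly as follows. This is only an illustration: the host inventory `PROD_HOSTS`, the host names, and the retype-the-hostname rule are assumptions, not part of the RiO norms.

```python
# Hypothetical inventory of production hosts; in practice this would come
# from the CMDB or another authoritative source.
PROD_HOSTS = {"db-prod-01", "app-prod-01"}


def is_production(hostname: str) -> bool:
    """Look the host up in the inventory instead of guessing from its name."""
    return hostname in PROD_HOSTS


def banner(hostname: str) -> str:
    """Text shown at login so the environment is obvious at a glance."""
    if is_production(hostname):
        return f"*** PRODUCTION: {hostname} *** changes affect live business"
    return f"non-production: {hostname}"


def allow_write(hostname: str, typed_confirmation: str) -> bool:
    """Non-production writes pass; production writes require the operator
    to retype the exact host name (an assumed confirmation rule)."""
    if not is_production(hostname):
        return True
    return typed_confirmation.strip() == hostname
```

An operator who believes they are on `db-uat-01` will type that name, and the write on `db-prod-01` is refused.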
[Audio] 2. Missing critical alert notification in production resulting in Major Incident
Human Error Risk Category: Very High
Frequency: High; Impact: High/Medium
Risk Identifiable: Yes, with RAQs/RiO WB Assessment
Risk Ownership: Internal (TCS) / Shared
Avoidable: Yes, with monitoring best practices
Common Scenarios:
Scenario 1: A support executive misses critical production alerts received via email.
Scenario 2: The Command Center / Monitoring team fails to escalate a critical alert to the support team in time.
Scenario 3: Delayed alerts, or no alert, on a critical system or infrastructure failure because the monitoring alert is configured without the right threshold.
Scenario 4: A critical alert is dismissed as a duplicate or false alert while the support team is focused on resolving an ongoing critical issue.
Why does it happen?
➢ Alert monitoring process over email, at a very basic level of maturity
➢ Critical alerts missed because too many parameters and thresholds are monitored
➢ Default alert criticality used across all tiers of services / components
➢ Multiple decentralized monitoring systems and/or ticketing systems
➢ Monitoring alerts not configured adequately for the required coverage of system / infra resources, or thresholds not defined correctly
➢ Too many false alerts, with no duplicate alert suppression mechanism
How to avoid it? Workaround/Best Practice:
➢ Enable monitoring alerts for critical Apps / CIs / systems with the right threshold
➢ Enable a dedicated L1 / Command Center (CC) team for monitoring
How to avoid it? Permanent Solution/Fix:
➢ Enable automated alert-to-ticket creation in the ITSM tool for critical alerts, by monitoring tools or by the command center team
➢ Enable duplicate alert suppression techniques and elevate true alerts
➢ Automate time-bound escalations to the support group on repeat alerts
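Duplicate alert suppression combined with escalation on repeats, as listed under the permanent fixes above, might look like the sketch below. The class name, the 300-second window, and the "escalate on the third occurrence" rule are all assumed values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AlertDeduplicator:
    window_seconds: int = 300   # assumed suppression window
    escalate_after: int = 3     # assumed repeat count that triggers escalation
    _seen: dict = field(default_factory=dict)  # (source, check) -> (first_seen, count)

    def handle(self, source: str, check: str, now: float) -> str:
        """Return 'ticket' for a new alert, 'suppress' for a duplicate inside
        the window, and 'escalate' once when the repeat count is reached."""
        key = (source, check)
        first_seen, count = self._seen.get(key, (None, 0))
        if first_seen is None or now - first_seen > self.window_seconds:
            self._seen[key] = (now, 1)
            return "ticket"
        count += 1
        self._seen[key] = (first_seen, count)
        return "escalate" if count == self.escalate_after else "suppress"
```

The first occurrence raises a ticket, the duplicates stay quiet, and a persistent repeat is pushed to the support group instead of being lost in the noise.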
[Audio] 3. Incorrect Trigger of Batch Job process
Human Error Risk Category: Very High
Frequency: High/Medium; Impact: High/Medium
Risk Identifiable: Yes, with RAQs, RiO WB Assessment
Risk Ownership: Internal (TCS)
Avoidable: Yes, with Batch Operations best practices
Common Scenarios:
Scenario 1: A support executive triggers a batch job during business hours because of a mistake in the manual configuration of the batch job schedule, slowing down the server and impacting business transactions.
Scenario 2: A back-up support executive schedules a manual batch job late while the primary support executive is on leave, delaying the availability of a key business report.
Scenario 3: A support executive force-cancels a failed batch job but misses the rollback of already processed transactions, resulting in incorrect data processing due to lack of SOP adherence.
Why does it happen?
➢ Manual batch job schedule / trigger instead of automated batch job scheduling tools (like Control-M)
➢ Ad hoc batch jobs to fix an issue, or month-end / quarter-end / year-end processing, resulting in errors
➢ Incorrect batch job start/stop because batch job scheduling steps are not documented or the Standard Operating Procedure (SOP) is not followed
➢ Batch job failure noticed late or not at all because batch job processing is monitored manually
How to avoid it? Workaround/Best Practice:
➢ Define a Standard Operating Procedure (SOP) and checklist for the batch process
➢ Enable batch jobs with automatic execution through standard schedulers
➢ Enable alerts on critical batch job failure/delay with auto-ticket to ITSM tools
How to avoid it? Permanent Solution/Fix:
➢ Automated batch job schedule with cron jobs or tools (Control-M)
➢ Automated scripts / tools (Control-M) for self-heal or rollback on batch failure
➢ Auto reconciliation / checks on batch trigger monitoring and log monitoring to confirm batch success in time, with auto alert on failure
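Scenario 1 above (a heavy batch job started during business hours) can be guarded against with a simple pre-trigger check. The sketch below is illustrative only: the 08:00 to 18:00 weekday window, the notion of a "heavy job" set, and the `override_approved` flag are assumptions.

```python
from datetime import datetime, time

BUSINESS_START = time(8, 0)   # assumed start of business hours
BUSINESS_END = time(18, 0)    # assumed end of business hours


def in_business_hours(moment: datetime) -> bool:
    """Monday to Friday between BUSINESS_START (inclusive) and BUSINESS_END (exclusive)."""
    is_weekday = moment.weekday() < 5
    return is_weekday and BUSINESS_START <= moment.time() < BUSINESS_END


def may_trigger(job_name: str, moment: datetime, heavy_jobs: set,
                override_approved: bool = False) -> bool:
    """Heavy jobs may run in business hours only with an approved override."""
    if job_name not in heavy_jobs:
        return True
    if not in_business_hours(moment):
        return True
    return override_approved
```

A scheduler wrapper or manual-run script would call `may_trigger` before starting the job and refuse to proceed when it returns `False`.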
[Audio] 4. Direct SQL execution with incorrect 'where' clause in production command window
Human Error Risk Category: Very High
Frequency: High/Medium; Impact: High/Medium
Risk Identifiable: Yes, with PAV/RAQs Assessment
Risk Ownership: Internal (TCS)
Avoidable: Yes
Common Scenarios:
Scenario 1: A DBA executes SQL in Production without a 'where' clause because part of the statement was lost while copy-pasting it into the command window, resulting in accidental deletion of all records.
Scenario 2: A support executive accidentally updates all data records in a transaction table because the 'where' clause was missed while copy-pasting the SQL into the command window.
Scenario 3: A DB maintenance support executive corrupts production data during a data fix because not all required parameters were included while copy-pasting the SQLs, resulting in business impact.
Slide illustration: delete * from prod_table; 98,247,595 records deleted!
Why does it happen?
➢ Direct command line execution of SQL, with accidental copy-paste of a partial SQL command while auto-commit is enabled
How to avoid it? Workaround/Best Practice:
➢ Disable direct SQL execution using the command window
➢ Enable visual demarcation in production
➢ Follow the maker checker process during execution
How to avoid it? Permanent Solution/Fix:
➢ Use stored programs / scripts / tools for data fixes in prod, with version control of the stored programs / scripts
➢ Ensure database scripts are reviewed with the DB Administrator and approved before execution
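A data-fix script can refuse an unbounded UPDATE/DELETE and commit only when the affected row count matches what the reviewer approved. The following is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for the production database; the function names and the `expected_rows` parameter are assumptions, and the keyword check is a coarse lint, not a SQL parser.

```python
import re
import sqlite3

_DML = re.compile(r"^\s*(update|delete)\b", re.IGNORECASE)
_WHERE = re.compile(r"\bwhere\b", re.IGNORECASE)


def is_unbounded_dml(statement: str) -> bool:
    """True for an UPDATE or DELETE containing no WHERE keyword at all."""
    return bool(_DML.match(statement)) and not _WHERE.search(statement)


def guarded_execute(conn: sqlite3.Connection, statement: str,
                    params: tuple, expected_rows: int) -> bool:
    """Run one DML statement; commit only if exactly expected_rows rows
    were affected, otherwise roll back and report failure."""
    if is_unbounded_dml(statement):
        raise ValueError("UPDATE/DELETE without WHERE is refused")
    cursor = conn.execute(statement, params)
    if cursor.rowcount != expected_rows:
        conn.rollback()   # nothing is committed on a mismatch
        return False
    conn.commit()
    return True
```

The row-count check also catches a 'where' clause that is present but wrong, which the keyword check alone cannot detect. It relies on auto-commit being off, which is the legacy default behaviour of `sqlite3` connections for DML statements.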
[Audio] 5. Misuse of 'unnecessary' privileged access
Human Error Risk Category: Very High
Frequency: High/Medium; Impact: High
Risk Identifiable: Yes, with PAV/RAQs Assessment
Risk Ownership: Internal (TCS) / Shared
Avoidable: Yes, with process controls
Common Scenarios:
Scenario 1: A development team member has unwanted (excessive) write access to production and makes an accidental change in production using that access.
Scenario 2: A Server Migration Engineer accidentally updates the production server configuration during an application server migration, assuming it is non-prod, because they have unwanted (excessive) access to the production environment.
Scenario 3: Unauthorized use of production access because revocation was delayed after associates were off-boarded.
Why does it happen?
➢ Unwanted permanent access provisioned in Production
➢ Wrong access level provisioned, not aligned with the production role
➢ Usage of generic IDs or system IDs for routine production activities
➢ Access Control List (ACL) not maintained for production access as per Segregation of Duties
➢ No regular access reconciliation against the rights defined in the ACL
➢ Delayed removal of production access after off-boarding
➢ No Visible Demarcation in Production
How to avoid it? Workaround/Best Practice:
➢ Remove production access within 24 hours of off-boarding
➢ Enable Visible Demarcation in Production
➢ Maintain the ACL as per segregation of duties and role
➢ Ensure adequate coverage of required privileged access across shifts
➢ Reconcile production rights against the ACL and audit production access logs
How to avoid it? Permanent Solution/Fix:
➢ Remove unwanted access and generic IDs in Production immediately
➢ Replace 'write' with 'read' access in production, if only 'read' is required
➢ Leverage a Privileged Access Management (PAM) tool for on-demand access
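The two reconciliation controls above (revoking access within 24 hours of off-boarding, and reconciling provisioned rights against the ACL) can be checked by a periodic job. The data shapes below are hypothetical; a real check would read from the identity and access systems in use.

```python
from datetime import datetime, timedelta


def overdue_revocations(offboarded: dict, active_prod_users: set,
                        now: datetime,
                        limit: timedelta = timedelta(hours=24)) -> list:
    """Users who still hold production access more than `limit` after
    their off-boarding time. `offboarded` maps user -> off-boarding time."""
    return sorted(
        user for user, left_at in offboarded.items()
        if user in active_prod_users and now - left_at > limit
    )


def excess_rights(acl: dict, provisioned: dict) -> dict:
    """Rights provisioned beyond what the ACL grants, per user.
    Both arguments map user -> iterable of rights such as 'read' or 'write'."""
    result = {}
    for user, rights in provisioned.items():
        extra = set(rights) - set(acl.get(user, ()))
        if extra:
            result[user] = extra
    return result
```

Any non-empty result from either function is a finding to act on: revoke the access or correct the ACL.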
[Audio] 6. Error in manual command execution
Human Error Risk Category: Very High
Frequency: High/Medium; Impact: High/Medium
Risk Identifiable: Yes, with PAV/RAQs Assessment
Risk Ownership: Internal (TCS)
Avoidable: Yes, with process control
Common Scenarios:
Scenario 1: A Production Storage SME accidentally deleted the production data copy instead of the DR database while syncing the DR database; a command syntax error resulted in permanent loss of data.
Scenario 2: A network support executive accidentally disabled a switch by executing a command with an incorrect port, causing an outage for multiple applications; the wrong command syntax was executed without peer review.
Scenario 3: An infra support executive accidentally re-routed traffic to a faulty path or wrong server, resulting in an outage, because the change was executed without peer review.
Why does it happen?
➢ Wrong command syntax due to accidental copy-paste or a typo while executing a command in the production window
➢ Not using standard command script(s) for routine maintenance activities
➢ Maker Checker process not followed
How to avoid it? Workaround/Best Practice:
➢ Avoid manual typing / copy-paste of commands for changes in production
➢ Ensure a back-up or rollback plan is in place before change execution
How to avoid it? Permanent Solution/Fix:
➢ Use predefined and reviewed scripts for routine maintenance activities
➢ Ensure the maker checker process for command-based changes in production
➢ Do post-implementation checks to ensure no accidental changes
[Audio] 7. Inadequate change impact analysis leading to change induced issues in production
Human Error Risk Category: Very High
Frequency: High/Medium; Impact: High/Medium
Risk Identifiable: Yes, with PAV/RAQs & RiO WB Assessment
Risk Ownership: Internal (TCS)
Avoidable: Yes
Common Scenarios:
Scenario 1: A technical lead rolled out an application feature with a new data calculation without considering the impact on the downstream systems in the data flow. This resulted in a major incident because the downstream system could not process the changed data after the release.
Scenario 2: A DB maintenance executive updated data in core master tables in production without updating the respective transaction tables for historical data. Improper impact analysis of the data change requested by the business resulted in a business impacting major incident.
Why does it happen?
➢ Change impact evaluation is subjective, based on the implementer's interpretation
➢ Pseudo sense of urgency created by the customer to perform the change faster
➢ Change impact analysis is not planned for 'Normal' changes due to lack of change management process adherence
➢ Change impact evaluation done without using traceability metrics of features, requirement specifications and design elements
How to avoid it? Workaround/Best Practice:
➢ Ensure change impact analysis is performed and reviewed during the change approval process for all major changes in Operations before rollout in prod
➢ Ensure adequate, comprehensive testing in pre-prod before rollout in prod
➢ Establish appropriate lead time for change assessment and implementation
How to avoid it? Permanent Solution/Fix:
➢ Keep traceability metrics of features, requirement specifications and design elements up to date, and use them for an in-depth change impact analysis
➢ Evaluate change impact on downstream and dependent systems
➢ Leverage and maintain the CMDB for all dependent CI metrics for in-depth change impact analysis
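Evaluating the impact on downstream and dependent systems amounts to walking a dependency graph. As a rough sketch, assuming the CMDB can export "system X feeds system Y" relationships into a plain dictionary (the system names here are hypothetical):

```python
from collections import deque


def downstream_of(changed: str, feeds: dict) -> set:
    """All systems reachable from `changed` by following data-flow edges.
    `feeds` maps a system to the systems that consume its output."""
    impacted = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for consumer in feeds.get(current, ()):
            # Skip systems already recorded and the changed system itself,
            # so cycles in the data flow do not loop forever.
            if consumer not in impacted and consumer != changed:
                impacted.add(consumer)
                queue.append(consumer)
    return impacted
```

For Scenario 1, running this on the system whose calculation changed would list every consumer that needs testing or sign-off before rollout, including indirect ones that are easy to overlook.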
[Audio] 8. Non-Adherence to Standard Operating Procedure (SOP): missing a critical step due to lack of adherence to the SOP
Human Error Risk Category: Very High
Frequency: High/Medium; Impact: High/Medium
Risk Identifiable: Yes, with PAV, RAQ & RiO WB Assessment
Risk Ownership: Internal (TCS)
Avoidable: Yes
Common Scenarios:
Scenario 1: Hundreds of users do not receive emails in their Outlook inbox because a support executive, not following the SOP while renewing a certificate on the mail server, missed adding the certificate to the 'trusted root certificate' store.
Scenario 2: A support executive misses an important step in a production standard change implementation by not adhering to the documented SOP, leading to a business impacting major incident.
Scenario 3: An ETL admin delays batch job failure recovery by not using the recovery steps in the written knowledge article, resulting in an extended outage.
Why does it happen?
➢ Standard Operating Procedures (SOPs) are not referred to during task execution
➢ SOPs / Knowledge Articles / KEDB / Checklists are not kept up to date
➢ SOPs are documented in a generalized manner without specific actions/details
➢ Multiple SOP repositories/versions exist, and outdated SOPs are not updated
➢ Leverage of SOPs / Knowledge Articles / KEDB / Checklists is not being tracked
How to avoid it? Workaround/Best Practice:
➢ Develop and maintain SOPs / Knowledge Articles / KEDB / Checklists and tag them in the ITSM tool for leverage
➢ Maintain SOPs in a single centralized location and audit them periodically. Do not maintain or use local versions of SOPs
➢ Tag the SOPs to tickets / SRs, with evidence of execution of the SOP tasks captured in ITSM tools
➢ Track metrics on the leverage of Knowledge Articles / SOPs / Checklists while handling repeat incidents, SRs, Standard Changes or Minor Enhancements
How to avoid it? Permanent Solution/Fix:
➢ Identify a Knowledge Champion and create a standard template for SOPs / Knowledge Articles for recurring issues, service requests, activities etc.
➢ Automate SOPs / Checklists into scripts for routine maintenance activities
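Capturing evidence per SOP step and blocking closure until every step has evidence is one way to turn a checklist into an enforced control. The step names below are hypothetical examples modelled on Scenario 1; the functions are a sketch, not an ITSM tool feature.

```python
def missing_steps(sop_steps: list, evidence: dict) -> list:
    """SOP steps with no recorded evidence, returned in SOP order.
    `evidence` maps a step name to a note, log reference, or similar."""
    return [step for step in sop_steps if not evidence.get(step)]


def can_close(sop_steps: list, evidence: dict) -> bool:
    """A ticket may be closed only when every SOP step has evidence."""
    return not missing_steps(sop_steps, evidence)
```

For example, with the steps `["install certificate", "add to trusted root store", "restart mail service"]` and evidence recorded only for the first and last, `missing_steps` returns `["add to trusted root store"]` and the ticket stays open.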
[Audio] 9. Not following 'Maker Checker' process: Maker Checker process not followed for critical operations activity in production
Human Error Risk Category: Very High
Frequency: High/Medium; Impact: High/Medium
Risk Identifiable: Yes, with PAV/RAQs/RiO WB Assessment
Risk Ownership: Internal (TCS)
Avoidable: Yes
Common Scenarios:
Scenario 1: The production maintenance team accidentally included 'All Servers' in a patch deployment instead of selecting a collection of 10 servers, while executing the patch deployment in production single-handedly.
Scenario 2: Thousands of records were accidentally deleted by poorly written SQL executed on the production server without SQL review by a DBA.
Why does it happen?
➢ Peer review or the Maker Checker process is not planned while performing a critical change in production
➢ Availability of a checker is not planned during lean shifts while executing critical changes in production
How to avoid it? Workaround/Best Practice:
➢ Mandate the Maker Checker process for all critical changes in production
➢ Assign a competent senior SME as checker of each critical change rollout, and let the checker review and approve the steps performed by the maker before rollout
How to avoid it? Permanent Solution/Fix:
➢ Enable the Maker Checker process in the ITSM tool for change rollout and track it with change success KPIs
➢ Leverage multiple channels and modes for Maker Checker adherence during lean shifts
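A maker checker gate can be expressed as a small pre-execution check: the change needs an approver, the approver must not be the maker, and the target list must stay within the approved scope. The `ChangeRequest` shape and the limit of 10 targets (taken from Scenario 1's server group) are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ChangeRequest:
    maker: str
    targets: Tuple[str, ...]
    approved_by: Optional[str] = None


def can_execute(change: ChangeRequest, max_targets: int = 10) -> Tuple[bool, str]:
    """Return (allowed, reason). max_targets is an assumed approved scope."""
    if change.approved_by is None:
        return False, "no checker approval"
    if change.approved_by == change.maker:
        return False, "maker cannot approve own change"
    if len(change.targets) > max_targets:
        return False, "target count exceeds approved scope"
    return True, "ok"
```

With this gate in the deployment script, a patch run that expands to 'All Servers' is stopped before execution even if an approval exists for the smaller group.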
[Audio] 10. Missing Server certificate renewal: timely renewal of server certificate/license missed in production, impacting application availability
Human Error Risk Category: Very High
Frequency: High/Medium; Impact: High/Medium
Risk Identifiable: Yes, with RAQs Assessment
Risk Ownership: Internal (TCS) / Shared
Avoidable: Yes, with monitoring best practices
Common Scenarios:
Scenario 1: The production support team did not proactively renew certificates in the customer IT environment on time, which led to major incidents and impact on the customer's business.
Why does it happen?
➢ Certificate maintenance plan is not available / not detailed in the architecture
➢ No centralized repository of all server certificates with information on expiry dates
➢ Ownership of certificate maintenance is not well defined
➢ No proactive reminder alerts to the support team or command center to monitor server certificate expiry
➢ No ongoing maintenance checklist for the support team to periodically check upcoming expiry of server certificates
How to avoid it? Workaround/Best Practice:
➢ Capture all certificate maintenance during Transition and cover it in the checklist
➢ Define clear ownership, assigned to a specific role, for certificate renewals
➢ Establish a repository of all certificates in the IT environment, in the ITSM tool / CMDB software, to maintain the certificate lists
➢ Enable proactive alerting of upcoming certificate expiry to the relevant team
➢ Track and escalate the risk to relevant stakeholders until the certificate is renewed before expiry
How to avoid it? Permanent Solution/Fix:
➢ Enable automated renewals and deployment to avoid the outage
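Proactive expiry alerting needs little more than a central inventory and a daily check. The sketch below assumes the certificate repository can be exported as a mapping of certificate name to expiry date; the names and the 30-day warning window are illustrative.

```python
from datetime import date


def expiring(inventory: dict, today: date, warn_days: int = 30) -> list:
    """(name, days_left) for certificates expiring within warn_days,
    soonest first. Already expired certificates have negative days_left."""
    due = [
        (name, (expires - today).days)
        for name, expires in inventory.items()
        if (expires - today).days <= warn_days
    ]
    return sorted(due, key=lambda item: item[1])
```

Each returned entry would become a ticket or notification for the owning role, repeated daily until the certificate is renewed.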
[Audio] Did you know? A focus on Rigor in Operations (RiO) and RiO best practices helps to avoid these known human errors and common issues in operations.
[Audio] Questions? To know more about Rigor in Operations (RiO) and its practices, please connect with your Account's RiO Champion.
[Audio] Thank you. A Delivery Excellence Group (DEG) Presentation. Reach us at [email protected]. Copyright © 2024 Tata Consultancy Services Limited.