Incident Response Analyst II

🇸🇬 Singapore

Management

Analyst

Key Responsibilities

1. Real-Time Infrastructure Monitoring

  • Perform 24x7 monitoring of critical facility systems across global data centers, including:
    • Electrical power systems
    • Mechanical systems
    • HVAC and cooling infrastructure
    • Fire detection and suppression systems
    • Water systems and supporting infrastructure
  • Continuously monitor EPMS, BMS, DCIM, and centralized monitoring platforms.
  • Detect abnormal operating conditions and alarms.
  • Acknowledge and investigate alarms promptly.
  • Track incidents and issues through to closure.
  • Identify monitoring gaps and recommend improvements to monitoring coverage.


2. Incident Response and Coordination

  • Provide first-level incident triage and technical assessment.
  • Respond to facility alarms and operational events in real time.
  • Execute escalation procedures according to defined protocols.
  • Coordinate with internal teams, site personnel, vendors, and regional stakeholders to ensure timely issue resolution.
  • Support major incident management activities for events such as:
    • Utility power failures
    • UPS and generator events
    • Cooling/HVAC failures
    • Fire alarm activations
    • Water leakage events
    • Security and environmental alerts
  • Maintain end-to-end ownership of incidents until resolution.


3. Ticket Management and Change Coordination

  • Create, update, and manage event tickets within established SLA targets.
  • Process work orders and monitor completion quality.
  • Track maintenance activities and change requests.
  • Support change management processes and ensure operational compliance.
  • Maintain accurate records of facility maintenance activities and change windows.


4. Compliance and Operational Governance

  • Monitor and follow up on preventive maintenance activities and routine operational changes.
  • Review technical documentation submitted by vendors and service providers, including:
    • Method of Procedure (MOP)
    • Risk Assessment (RA)
    • Standard Operating Procedure (SOP)
  • Ensure maintenance activities comply with operational standards and freeze-period requirements.
  • Support risk management and operational audit activities.


5. Monitoring Platform and Data Administration

  • Maintain monitoring platform master data and infrastructure records.
  • Ensure the accuracy, completeness, and timeliness of asset and alarm information.
  • Support platform optimization and continuous improvement initiatives.
  • Maintain facility logs, event records, and operational documentation.


6. Reporting and Data Analysis

  • Analyze facility operational data and identify trends or recurring issues.
  • Prepare operational reports and performance summaries.
  • Provide recommendations to improve reliability and operational efficiency.
  • Maintain records required for audit, compliance, and management reporting.


7. Operational Support and Continuous Improvement

  • Participate in after-hours support and emergency escalations.
  • Provide remote support for overseas data center operations when required.
  • Support centralized cross-regional operations and collaboration.
  • Contribute to process improvements and monitoring platform enhancements.
  • Perform other duties as assigned to support business continuity and operational excellence.


Minimum Qualifications

  • Associate Degree, Diploma, or higher in Engineering, Information Technology, Facilities Management, or related disciplines.
  • Minimum 2 years of experience in data center operations, facility monitoring, NOC, command center, or mission-critical environments.
  • Working knowledge of:
    • Electrical systems
    • Mechanical systems
    • HVAC and cooling infrastructure
    • Fire detection and suppression systems
    • Building Management Systems (BMS)
    • Electrical Power Monitoring Systems (EPMS)
    • DCIM or centralized monitoring platforms
  • Experience working with incident management and escalation procedures.
  • Strong communication and coordination skills.
  • Ability to work in a 24x7 rotating shift environment.
  • Ability to manage multiple priorities in high-pressure situations.
  • Fluent in English.
  • Chinese language proficiency (reading, writing, and verbal communication) is preferred, as the role involves Chinese-language alarm messages, documentation, and communications.


Preferred Qualifications

  • Experience in:
    • Network Operations Center (NOC)
    • Facility Operations Center (FOC)
    • Data Center Operations
    • Critical Environment Operations
    • Mission Critical Facilities
  • Experience supporting global or cross-regional operations.
  • Familiarity with structured incident, change, and problem management processes.
  • Understanding of data center capacity management (space, power, cooling).
  • Experience working with CMMS, DCIM, EPMS, BMS, or ticketing platforms.
  • Ability to perform root cause analysis and drive issue resolution.


Desired Competencies

  • Strong sense of ownership and urgency.
  • Excellent communication and stakeholder management skills.
  • Detail-oriented with strong documentation practices.
  • Analytical and problem-solving mindset.
  • Ability to learn quickly and adapt to changing operational environments.
  • Team-oriented with a proactive and customer-focused attitude.


Preferred Certifications

Candidates with the following certifications will have an advantage:

  • CDCP – Certified Data Centre Professional
  • CDCS – Certified Data Centre Specialist
  • FSM – Facilities Systems Management
  • Uptime Institute ATD
  • ITIL Foundation
  • DCCA or DCT certifications
  • Electrical or Mechanical engineering certifications


Shift Requirements

  • Must be willing to work a 24x7 rotating shift schedule.
  • Work weekends and public holidays, and participate in on-call duty rotations when required.
  • Support emergency response activities and major incidents.


Key Performance Indicators (KPIs)

The successful candidate is expected to consistently achieve:

  • 100% shift attendance and handover compliance.
  • 24x7 continuous monitoring coverage.
  • Alarm acknowledgement within 1 minute.
  • Notification generation within 2 minutes.
  • Event ticket creation within 10 minutes.
  • Compliance with escalation and incident management SLAs.
  • Zero service-impacting human errors.
  • Accurate documentation and reporting.
  • Continuous improvement contributions to operational processes and monitoring platforms.
by @maxrusakovic