Technical Incident Manager

Location: San Diego, CA, USA

Notice

This position is no longer open.

Requisition Number: 204081

External Description:

Teradata's Global Support Organization is expanding and building capabilities to further drive the transformation of our company and function. We are seeking a highly skilled, technology-focused leader to drive the evolution of Teradata’s Critical Incidents Technical Response Group, which handles all deployment types.

As the Technical Incident Manager (TIM), you are responsible for major incident management. In this role, you will be the key decision maker and the authority who directs the problem resolution path toward the fastest restoration of service. As Major Incident Manager, you are responsible for managing the restoration of a service affected by real or potential interruptions that may impact its quality or availability. When a major or critical incident occurs, you ensure the right technical resources are activated, technically lead major incident calls, determine the client impact, agree on resolution actions with everybody involved, and manage the technical communication channel to keep the focus on return to service. This includes managing technical sub-channels, with tech leads taking point for each sub-channel to isolate issues affecting return to service. The TIM works hand in hand with the Incident Communication Manager (ICM), who is responsible for internal and external communication, leaving the TIM focused on the return to service.

The Major Incident Manager is responsible for the quality and integrity of the Major Incident Management process and is the interface to the other process managers.

This position will be located in San Diego, but we will consider virtual US locations for the right candidate. This is a fast-paced, high-tech environment and may require extended hours and after-hours follow-up, given that changes occur 24x7.

Key Responsibilities of the TIM / Teradata Critical Support Office:

* Handle severity 1s, critical customers, and complex problems impacting customer systems

* Serve as the central point for incident declaration, classification (priority), and triggers for SWARM

* Ensure appropriate skill sets and incoming signals are aligned with SWARM triggers and automated with notification tools (this includes Security teams, Vendor teams, and, in some cases, client technical teams)

* Technically lead all aspects of critical incidents, focused on the fastest service restoration/recovery using a SWARM approach: bridge, Slack channels, and sync points for sub-tech teams leading investigations (including third-party vendors and cloud providers).

* Responsible for the quality and integrity of Major Incident Management process and is the interface with SWARM members, communication manager, and problem manager.

* Support all SWARM activities globally when problems require the team's deep technical and problem resolution skills; this may include working across regions with other TIMs to support 24x7 coverage.

* Perform post-major-incident follow-up via a Post Incident Review (post-mortem) in concert with Problem Managers

* Enforce regular and systemic process control mechanisms to improve SWARM

* Participate in proactive design of new and innovative ways of simplifying future SWARM activities

* Clearly identify and drive backlog prioritization with appropriate development teams to improve availability and mean time between failures (MTBF)

* Success factors and metrics will be visibly focused on mean-time-to-restore-service (MTTRS) and mean-time-between-failure (MTBF) and underlying KPIs in the SWARM, RCA, and improvement backlogs.

* Acts as the voice/conscience of the customer experience, and administers problem-solving with customer advocacy front of mind.

Key Functions of the Technical Incident Manager:

* Communicate and advocate effectively with all levels of roles, in all geographies, across the entire company.

* Assume leadership responsibility during an S1 to direct the SWARM team as they work towards service restoration

* Lead S1 team calls, determine SMEs needed, identify problem and release/de-escalate after diagnosis

* Ensure incident management processes are efficient and automated for triggers, data collection, and diagnostics, and streamline the capture of artifacts into the incident record, including timelines and decision trees.

* Ensure SWARM team meets resolution specifications as designed in the SLA while also enabling reduction of mean time to resolution

* Work with problem managers and appropriate development teams as required to evolve monitoring/logging systems.

* Identify failure points that affect availability and opportunities to accelerate mean time to repair, including architecture, design, process improvements, software disciplines, and testing.

* Interact frequently with various stakeholders across the organization to prioritize backlog for availability as required.

* Build strong internal and external relationships with technical teams, customers and third parties

* Serve as a key contributor to post-mortem reviews as an SME

* Customer Advocate - focus on what is deemed to be the best outcome for the customer

* Is NOT responsible for communication planning or execution to the internal or external stakeholders; actively participates and interfaces with the Incident Communication Manager as needed

Skills & Qualifications

* Demonstrated strategic and tactical thinking, quantitative and analytical skills, while under pressure

* Knowledge of and exposure to distributed systems across hyper-scale, cloud-based environments

* Working knowledge of physical IT infrastructures such as Enterprise Server Platforms and related IT architectures and equipment

* Solid understanding of large-scale networking, including OSI Model, DNS, WINS, TCP/IP, VLANs, DHCP, Routing, ACLs, switching protocols, etc.

* Understanding and knowledge of physical datacenters and their related infrastructure and resources, such as power, rack space, and CE infrastructure (e.g. UPS, generators, AHU)

* Flexibility and willingness to support a 24x7 global operation via off-hours support, on-call availability, or other arrangements as needed per the rhythm and needs of the business

* Working knowledge of ITIL incident, problem, and change management components

* Excellent problem resolution, judgment, negotiation, and decision-making skills

* Practical experience with incident/outage and crisis management

* Ability to balance competing demands for resources and adapt to changing priorities

* Excellent written and oral communication skills, with a special focus on customer/client-level interaction

* Operations experience in a 24x7x365 support model (NOC experience beneficial)

Preferred Skills & Experience

* BS in Computer Science, math, or equivalent education, or experience with database development and applications, data warehousing operations, and analytical software applications or ecosystems.

Teradata is proud to be an equal opportunity employer. We do not discriminate based upon race, color, ancestry, religion, creed, sex (including pregnancy, childbirth, breastfeeding, or related conditions), national origin, sexual orientation, age, citizenship, marital status, disability, medical condition, genetic information, gender identity or expression, military and veteran status, or any other legally protected status. We welcome and encourage individuals from all backgrounds to apply and join our team, bringing their unique perspectives and experiences to help us innovate and grow.

City: San Diego

State: California

Community / Marketing Title: Technical Incident Manager

Job Category: Customer Support