Job Description
Overview
At LOGMEIN GUATEMALA we are looking for a Senior NOC Engineer who will contribute to a global team responsible for 1st and 2nd level Event, Incident, and Problem management in a complex, highly technical, predominantly Linux environment, working on a variety of issues across multiple network elements. This position also helps lead continuous product improvements, assists with non-negotiable projects that support the Department Goals, performs system-wide upgrades, and occasionally acts as an individual contributor or SME on special Ops projects. The Senior NOC Engineer will be accountable for monitoring multiple data centers and/or cloud environments at a local and worldwide level in a 24/7/365 production environment. This role is responsible for assisting in the development of all production monitoring tools and works closely with the Monitoring Team to prioritize enhancements and validate the monitoring infrastructure. This position is also responsible for starting and driving the RCA/RFO/PIR process for any P1 issues that occur during their shift. This position receives general direction on new assignments and has work reviewed by the Operations Management team for soundness of technical judgment, quality, and business sense.
Responsibilities
- Confirms and troubleshoots all alerts from remote monitoring tools, Inbound Care, Ops, or Dev and works to resolve all L1 & L2 issues related to our data center, cloud environments, network infrastructure, hardware and/or applications.
- Responsible for executing operational objectives and ensuring that the teams meet or exceed service level expectations by following defined resolution and escalation procedures and pre-defined intra-company outage communications and updates.
- Verifies that all reported Incident and Problem Management tickets created for our data center or cloud environments are accurate and kept up to date and acts as second tier response and technical support for incident & problem management and resolution.
- Verifies that all incoming operations incidents are in the ticketing system. Responsible for escalating and prioritizing any unresolved issues to the appropriate on-call staff so the ticket can be closed in a timely manner and reports any violations to Ops management.
- Manages Incident Communications regarding any Outages, Major Incidents, or Service Center Issues to Ops Managers and other appropriate teams or personnel within SLA (Service Level Agreement) parameters. Performs timely notification updates to middle and senior management electronically and via telephone for extended outages and Maintenance Windows.
- Maintains the Outage and Maintenance database, the official Outage Announcement Templates, and all other associated reports and documentation.
- Suggests integration of new monitoring applications into the build and deployment framework and how best to accommodate deployment and configuration management of these applications for monitoring.
- Provides upper management with weekly staffing reports and metrics to accurately measure the organizational needs and progression of the Site Reliability department.
- Accountable for the accuracy and timeliness of our Care and Trouble Free SLA Reporting.
- Meets regularly with operations teams, development, and other Site Reliability staff to prioritize future stage and live application, deployment, or project tasks.
- Aggressively follows up with NOC Engineers or Engineering staff on ticket resolution and information updates so tickets can be closed in a timely fashion.
- Evaluates and recommends updates, patches, replacements or upgrades to current Site Reliability Software tools and Monitoring systems. Responsible for researching and developing new Site Reliability monitoring tools as they become available.
- Works with other Department leads to develop, validate, and properly catalog SOP documents on the internal Opswiki and Knowledge Base.
- Creates and updates incident and problem management procedures to be used by the 1st Level and 2nd Level Site Reliability Technicians.
Other Duties and Responsibilities
- Regularly participates in the Shift Handover process with previous and incoming shift teams to help sync and transfer any ongoing issues or outages.
- Available for on-call and emergency response rotation as needed.
- Maintains the Escalation contact matrix and processes to ensure that all levels of the Support Organization are listed, audits this list frequently, and works with other staff and team members to maintain the on-call status of other Operations and Development personnel.
- Responds to any additional needs coming from his/her Direct Management.
- Ensures that the other members of the team follow and enforce the Ops Change Control procedures and immediately escalate any violations to Ops management.
Qualifications
- Bachelor’s degree or equivalent experience required.
- 6-9 years’ experience in a technical or network operations support environment.
- Strong English skills (written and verbal)
- Proven understanding of TCP/IP networking, SNMP, UNIX/Linux/Windows Server Operating Systems, HTTP/HTTPS, SMB, NFS, SMTP, IMAP, SSH, DNS, NTP
- Linux Certification or equivalent experience required, with demonstrated understanding of command-line tools (such as grep and sed) used to create, move, view, and investigate files and directories.
- Experience in Change Management & Problem Management domains.
- Based in Guatemala
Internal employees only:
- 1 year within the company (mandatory)
- No active disciplinary process in the last 6 months
- Outstanding performance
- Manager's approval
About
What we offer:
- Growth opportunities
- Great work environment
- Competitive salary
- Free life and medical insurance
- Free parking
- Work stability
- Learning and development
- Great compensation package
LogMeIn simplifies how people connect with each other and the world around them to drive meaningful interactions, deepen relationships, and create better outcomes for individuals and businesses. One of the world’s top 10 public SaaS companies, and a market leader in communication & conferencing, identity & access, and customer engagement & support solutions, LogMeIn has millions of customers spanning virtually every country across the globe. LogMeIn is headquartered in Boston with additional locations across North America, Europe, Middle East, Asia and Australia.
OUR VALUES
- Be Accountable - even when no one is looking
- Thrive Together - greatness comes from unlocking each other’s potential
- Advance Confidently - we find opportunity and act on it
- Collaborate Openly - our whole is greater than the sum of our parts
- Engage Fearlessly - we speak up and listen