DevOps AWS - SRE
Job Description:
Job Specification: SRE (Site Reliability Engineer)
Location: Buenos Aires
Team: Sales & Research Production Management
Work Pattern: M-F; Hybrid (Belgrano)
Role Overview
We are seeking experienced Site Reliability Engineers (SREs) to support the transition
and operational uplift of a key application being handed off to our team. This application
underpins critical workflows for our Global Markets business, and your work will directly
impact the stability, reliability, and efficiency of our client-facing technology.
You'll be joining a high-performing production management team responsible for
ensuring operational excellence across the platforms that power our Sales and
Research professionals. This is a hands-on role with a clear mandate: take ownership
of the application, drive improvements in supportability and observability, and ensure a
seamless handover to our production management team.
Key Responsibilities
1. BAU Support & Ownership
o Provide day-to-day (BAU) support for the application's processes and
workflows, ensuring stability, availability, and swift response to end-user
issues.
o Act as the primary point of contact for all production support matters
related to the application.
2. Application Maturity Assessment
o Evaluate the current state of the application with respect to supportability,
reliability, and observability.
o Identify gaps and areas for improvement, documenting findings and
recommendations.
3. Observability Integration & Enhancement
o Remediate observability gaps by integrating the application's processes
with our standard monitoring and alerting tools, including Dynatrace, Splunk,
Grafana, and Geneos.
o Ensure robust monitoring coverage and actionable alerting for all critical
workflows.
4. Operational Toil Reduction & Automation
o Identify and remediate sources of operational toil and manual intervention.
o Build automation solutions or integrate with existing in-house platforms to
streamline support activities and improve operational efficiency.
5. Documentation & Handover
o Develop comprehensive support documentation and runbooks, ensuring
all procedures, troubleshooting steps, and escalation paths are clearly
captured.
o Prepare and execute a structured handover to the permanent production
management team at the end of the engagement.
Technical Environment
Hosting: AWS and internal cloud platforms
Orchestration: Astronomer (Airflow jobs)
Monitoring & Observability: Dynatrace, Splunk, Grafana, Geneos
Required Skills & Experience
SRE & Production Support: Proven experience in SRE, production
management, or application support roles within large-scale, mission-critical
environments.
Cloud Platforms: Hands-on expertise with AWS and internal cloud platforms.
Programming: Proficiency in at least one programming language such as
Python or Java/Spring Boot, with the ability to script, automate, and troubleshoot
application workflows.
CI/CD Tools: Experience with continuous integration and continuous delivery
tools, such as Jules, Jenkins, GitLab, or Terraform, to support automated build,
deployment, and infrastructure management.
Observability Tooling: Strong background in observability practices, such as
white-box and black-box monitoring, service level objective (SLO) alerting, and
telemetry collection, using tools such as Grafana, Dynatrace, Prometheus,
Datadog, and Splunk.
Workflow Orchestration: Experience with workflow orchestration tools, ideally
Astronomer/Airflow.
Containers & Orchestration: Familiarity with container technologies and
orchestration platforms such as ECS and Kubernetes, including deployment and
operational best practices.
Automation & Toil Reduction: Demonstrated ability to automate operational
tasks and reduce manual toil, preferably using in-house or open-source
solutions.
Documentation: Excellent documentation skills and experience creating
runbooks for production support teams.
Communication: Strong communication and stakeholder management skills,
with a collaborative and proactive approach.