We are seeking an experienced Site Reliability Engineer (SRE) to join our Infrastructure Team in the hedge fund industry in Singapore. The role is both highly technical and strategic, focusing on scaling research and batch computing capabilities through robust infrastructure across Google Cloud and on-premise environments. You will play a vital role in building systems that deliver high availability, observability, and reliability across all trading services.
As part of the Infrastructure Team, your responsibilities will include designing and maintaining monitoring, logging, tracing, and alerting systems to ensure fast incident response and deep operational insights. You will architect and manage scalable solutions and contribute to research computing infrastructure while collaborating closely with development teams to enhance CI/CD pipelines and developer tooling. Driving an SRE culture across the organisation will also form a key part of your remit.
Key Responsibilities:
- Design and operate observability systems to enhance reliability and rapid issue detection
- Develop scalable infrastructure on Google Cloud and on-premise servers
- Support and evolve research clusters and batch/HPC workloads
- Troubleshoot and resolve infrastructure and application issues in production
- Enhance developer workflows through CI/CD improvements and infrastructure automation
- Drive adoption of SRE best practices and mindset across teams
Requirements:
- Minimum 4 years of experience in platform engineering or SRE roles
- Background in high-frequency trading or research/backtesting environments
- In-depth knowledge of Kubernetes architecture and operations
- Experience with GitOps tools such as Argo CD and GitLab CI
- Strong command of cloud platforms, notably GCP or AWS
- Excellent scripting and programming skills in Python or Go