Grafana SRE Architect

VRK IT Vision Inc.

Basking Ridge, NJ

Job Summary

The Grafana SRE Architect will lead the design, implementation, and management of scalable, reliable, and performant Grafana-based observability solutions. This role bridges Site Reliability Engineering (SRE) practices with Grafana’s ecosystem (Loki, Mimir, Tempo, etc.) to ensure robust monitoring, logging, tracing, and alerting for mission-critical systems. You will collaborate with DevOps, engineering, and infrastructure teams to align technical strategies with business objectives, driving automation, resilience, and cost efficiency across cloud and on-premises environments.

Key Responsibilities

  • Architecture & Design
      • Design end-to-end Grafana solutions for metrics, logs, traces, and dashboards, ensuring scalability, security, and compliance.
      • Architect integrations with Prometheus, Loki, Mimir, Tempo, and third-party tools (e.g., AWS CloudWatch, Datadog).
      • Define best practices for Grafana deployment (self-managed vs. Grafana Cloud) and optimize data storage/retention strategies.
  • SRE Leadership
      • Implement SRE principles: SLAs/SLOs/SLIs, error budgets, and blameless post-mortems.
      • Build automated monitoring/alerting systems to preemptively identify system bottlenecks and failures.
      • Lead incident response, root cause analysis, and remediation for observability-related outages.
  • Collaboration & Integration
      • Partner with DevOps teams to embed Grafana into CI/CD pipelines and automate provisioning via IaC (Terraform, Ansible).
      • Work with developers to instrument applications for observability (OpenTelemetry, custom exporters).
      • Advise stakeholders on cost-effective monitoring strategies and resource optimization.
  • Performance Optimization
      • Tune Grafana dashboards, queries, and data sources for high-performance environments.
      • Optimize PromQL/Loki LogQL queries and manage large-scale time-series databases (Mimir).
      • Conduct capacity planning and disaster recovery testing for Grafana ecosystems.
  • Governance & Security
      • Ensure compliance with security policies (RBAC, SSO, encryption) and audit requirements.
      • Monitor Grafana stack health, perform upgrades, and enforce version control.
  • Mentorship & Innovation
      • Mentor SRE/engineering teams on Grafana best practices and SRE culture.
      • Stay ahead of Grafana/Observability trends and pilot new tools (e.g., AI-driven anomaly detection).

Education & Experience

  • Bachelor’s/Master’s in Computer Science, Engineering, or related field.
  • 10+ years in SRE/DevOps roles, with 5+ years hands-on Grafana experience.
  • Proven track record in designing large-scale observability solutions.
  • Experience managing offshore teams.
  • Willingness to work hours that overlap with offshore teams.

Technical Skills

  • Expertise in Grafana: Dashboards, plugins, alerting, and integrations (Prometheus, Loki, Mimir, Tempo).
  • Cloud Platforms: AWS/GCP/Azure, Kubernetes, and serverless architectures.
  • Automation: Terraform, Ansible, Python/Go scripting.
  • Monitoring Tools: Thanos, Cortex, Jaeger, OpenTelemetry.
  • Database Optimization: Time-series data (Mimir), log management (Loki).

Certifications (Preferred)

  • Grafana Certified: Observability Engineer/Administrator.
  • AWS/GCP/Azure Architect or DevOps certifications.

Soft Skills

  • Leadership in cross-functional teams and crisis management.
  • Strong communication for technical and non-technical audiences.
  • Analytical problem-solving and strategic thinking.

Preferred Qualifications

  • Contributions to Grafana/Prometheus open-source projects.
  • Experience with AI/ML model monitoring.
  • Knowledge of regulatory frameworks (GDPR, HIPAA).
