Principal Site Reliability Engineer

Description:

As the Information Company, our mission at OpenText is to create software solutions and deliver services that redefine the future of digital. Be part of a winning team that leads the way in Enterprise Information Management.

The Opportunity:

You will join a team of globally located Site Reliability Engineers to design, automate, build, operate, and continuously improve some of the services that back our customer-facing SaaS products. Your focus will be on NoSQL and Big Data technologies such as Elasticsearch, Cassandra, Kafka, Redis, and RabbitMQ, integrated with the underlying IaaS (VMware, AWS, GCP, Azure) and PaaS (Cloud Foundry, K8s, Anthos). You will be responsible for delivering and operating a highly available design that includes security, scalability, monitoring, upgradeability, and data backup and recovery, across non-production and production environments. You’ll work in a fast-paced organization while quickly learning new skills and creating ways to consistently meet service-level agreements for our global cloud services.

The best person for this role is someone who has a collaborative spirit. In our world, it’s not about being a hero and having all the answers; it’s about sometimes saying "I don't know" and working toward solutions rather than starting with an assumption. The team needs someone who can ask questions, learn from others, and turn chaos into order. This role would be a great fit for someone with creative and innovative problem-solving skills. You will develop and implement solutions that operate at scale. Our teams are empowered and expected to improve our products to truly deliver a reliable experience to customers.

Your Responsibilities Will Include:

Designing, automating, building, operating, and continuously improving multiple backing services including Cassandra, Elasticsearch, Kafka, RabbitMQ, Redis, and Solr
Building software and systems to manage infrastructure and backing services for customer-facing OpenText applications through automation tools such as Terraform and Ansible
Working closely with development and application support teams to design, build, deploy, support, and monitor new and existing deployments, including gathering requirements and documenting the solution
Identifying tactical and strategic opportunities to improve service health, performance, reliability, and telemetry
Contributing to capacity planning and management processes
Supporting the migration of legacy deployments to modernized design patterns
Supporting and responding to service requests in line with our operational-level agreements (OLAs)
Supporting the incident resolution process for the backing services we are responsible for
Participating in training and information sharing activities
Interacting with third-party provider(s) who provide additional expertise and a layer of escalation support for our services
Implementing best practices and operating environments for Kafka, helping with topic creation and management, owning and managing the Kafka Schema Registry, helping new teams with Kafka usage, educating teams on Kafka capabilities, and helping teams to adopt new features
Implementing best practices and operating environments for Elasticsearch, helping with index creation and lifecycle management, shard scaling, index rollover and rollup strategies, and performance tuning
Implementing best practices and operating environments for Cassandra, helping with node sizing, datacenter and rack topology, replication factor, quorum configuration, and driver settings
Implementing best practices and operating environments for Redis, RabbitMQ, and Solr
Acting as backup for other team members when necessary
Problem solving and finding solutions to resolve issues
Building repeatable application technology design patterns
Learning new technology on your own or in conjunction with an online learning platform
Creating and updating documentation such as operational procedures, change execution plans, and incident write-ups
May require shift work
Participation in an on-call rotation is required to provide 24x7x365 support

Qualifications:

Bachelor’s Degree in Computer Engineering or related field
10+ years of Information Technology experience, working on large-scale enterprise systems
10+ years of experience working within the Linux operating system
5+ years of operations experience for one or more of the backing services that we are supporting (Kafka, Cassandra, Elasticsearch, RabbitMQ, Redis, Solr)
Experience supporting and operating distributed Java applications
Knowledge of private and public cloud infrastructure platforms (VMware/AWS/GCP)
Basic understanding of Java memory management including garbage collection and available GC methods
Hands-on experience with configuration of monitoring and alerting tools such as Prometheus, New Relic, Nagios, Zabbix, and/or PagerDuty
Experience with automation or CI/CD tools, such as Terraform, Ansible, and GitLab
Knowledge of Kafka concepts such as topics, partitions, replication factor, offsets, consumers and producers
Knowledge of Elasticsearch concepts such as roles, shards, replicas, indexes, index patterns, index lifecycle management, and aliases
Basic understanding of Cassandra, RabbitMQ, Redis, or Solr operations
Knowledge of application clustering and load balancing concepts

Organization	OpenText
Industry	Engineering
Occupational Category	Principal Site Reliability Engineer
Job Location	Cork, Ireland
Shift Type	Morning
Job Type	Full Time
Gender	No Preference
Career Level	Intermediate
Experience	2 Years
Posted at	2022-12-18 3:05 pm
Expires on	Expired