Site Reliability Engineer, AI Applications San Jose Regular R&D Job ID: A193830
Company: ByteDance
Location: San Jose
Posted on: May 24, 2025
Job Description:
Team Intro
The Speech team's mission is to empower content interaction and
creation using speech & audio related technologies. The team focuses
on cutting-edge R&D in areas like speech & audio, music processing,
natural language understanding and multimodal deep learning. We are
looking for top talent to work on these exciting technologies,
integrate them into various products and ultimately bring joy to our
global user base!

Responsibilities
We are seeking a talented and experienced
Site Reliability Engineer (SRE) to join our dynamic team. This role
focuses on the reliability, scalability, and performance of our AI
applications. The ideal candidate will have a strong background in
both software engineering and systems engineering, with a
particular emphasis on maintaining and optimizing AI and machine
learning infrastructure. The major responsibilities include:
- Monitoring and Incident Response: Develop and implement monitoring
  solutions to track the performance and reliability of AI systems.
  Respond to incidents, diagnose issues, and implement fixes to
  minimize downtime.
- Automation and Tooling: Automate repetitive tasks, streamline
  deployments, and create tools to improve the efficiency and
  reliability of AI operations.
- Performance Optimization: Analyze and optimize the performance of
  AI applications and the underlying infrastructure, including tuning
  algorithms and resource management.
- Capacity Planning: Forecast infrastructure needs and ensure that
  the AI applications have the necessary resources to handle future
  workloads.
- Security: Implement and maintain security best practices to protect
  data and applications, ensuring compliance with relevant
  regulations.
- Documentation: Create and maintain detailed documentation of
  infrastructure, processes, and procedures to ensure knowledge
  sharing and continuity.
- Continuous Improvement: Identify opportunities for process
  improvements and implement solutions to enhance the reliability and
  performance of AI systems.

Qualifications

Minimum Qualifications:
- Bachelor's or
Master's degree in Computer Science, Engineering, or a related
field.
- 3+ years of experience in site reliability engineering, DevOps, or
  a related role.
- Proven experience managing and optimizing AI and machine learning
  infrastructure.

Preferred Qualifications:
- Proficiency with cloud platforms such as AWS, Google Cloud, or
  Azure.
- Strong programming skills in languages such as Python, Go, or Java.
- Experience with containerization and orchestration tools like
  Docker and Kubernetes.
- Familiarity with CI/CD pipelines and tools such as Jenkins, GitLab
  CI, or CircleCI.
- Knowledge of monitoring and logging tools such as Prometheus,
  Grafana, ELK stack, or Datadog.
- Understanding of networking, security principles, and best
  practices in cloud environments.
- Strong problem-solving capabilities, with a detail-oriented and
  user-focused approach.
- Strong communication and interpersonal skills, capable of engaging
  effectively with both technical and non-technical stakeholders.

Job Information

About Us
Founded in 2012,
ByteDance's mission is to inspire creativity and enrich life. With
a suite of more than a dozen products, including TikTok, Lemon8,
CapCut and Pico as well as platforms specific to the China market,
including Toutiao, Douyin, and Xigua, ByteDance has made it easier
and more fun for people to connect with, consume, and create
content.

Why Join ByteDance
Inspiring creativity is at the core of
ByteDance's mission. Our innovative products are built to help
people authentically express themselves, discover and connect - and
our global, diverse teams make that possible. Together, we create
value for our communities, inspire creativity and enrich life - a
mission we work towards every day.

As ByteDancers, we strive to do
great things with great people. We lead with curiosity, humility,
and a desire to make an impact in a rapidly growing tech company. By
constantly iterating and fostering an "Always Day 1" mindset, we
achieve meaningful breakthroughs for ourselves, our Company, and
our users. When we create and grow together, the possibilities are
limitless. Join us.

Diversity & Inclusion
ByteDance is committed to
creating an inclusive space where employees are valued for their
skills, experiences, and unique perspectives. Our platform connects
people from across the globe and so does our workplace. At
ByteDance, our mission is to inspire creativity and enrich life. To
achieve that goal, we are committed to celebrating our diverse
voices and to creating an environment that reflects the many
communities we reach. We are passionate about this and hope you are
too.

Reasonable Accommodation
ByteDance is committed to providing
reasonable accommodations in our recruitment processes for
candidates with disabilities, pregnancy, sincerely held religious
beliefs or other reasons protected by applicable laws. If you need
assistance or a reasonable accommodation, please reach out to us at
https://tinyurl.com/RA-request