How we made our emailing service resilient — Part 1

Avinash Jaiswal
Feb 1, 2022


Electronic mails, dearly known as emails, are your friendly neighbourhood information exchange agents. Most people spend their day writing and reading tons of them as part of their jobs. For a decacorn like Gojek, emails are our primary way to reach our users. So it is imperative to build the service responsible for them in a way that ensures high availability and caters to the needs of our various clients (in-house services like GoPay, GoRide, etc.), each with unique emailing requirements.

We rely on emails to send out marketing and promotional info or private and confidential monthly transaction mails across Gojek's ecosystem. Ensuring a high uptime for our service means that Gojek can communicate with its customers seamlessly.

This is a multi-part series in which we talk about our mailing practices and the design enhancements that have helped us deliver consistently. But first, an overview of Postie.

Postie, the emailer

Postie (yeah, we know, it has a vibe!) is written purely in Golang and distributes its responsibilities across several actors in the design. It follows a server-worker model in which requests made to the servers are routed via a message queue and consumed by the workers.
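To make the server-worker split concrete, here is a minimal sketch in Go. It is not Postie's code: the buffered channel stands in for a real message queue, and MailRequest and deliverAll are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// MailRequest is a hypothetical, simplified mail request.
type MailRequest struct {
	To      string
	Subject string
}

// deliverAll publishes every request to an in-memory queue (the "server" side)
// and lets a pool of workers consume them (the "worker" side). It returns the
// recipients that were handled, sorted so the result is deterministic.
func deliverAll(reqs []MailRequest, workers int) []string {
	// Assumption: a buffered channel stands in for a real message queue.
	queue := make(chan MailRequest, len(reqs))

	var (
		mu   sync.Mutex
		sent []string
		wg   sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for req := range queue {
				// A real worker would call a third-party email vendor here.
				mu.Lock()
				sent = append(sent, req.To)
				mu.Unlock()
			}
		}()
	}

	// Server side: publish and move on; it never talks to the vendor itself.
	for _, req := range reqs {
		queue <- req
	}
	close(queue)

	wg.Wait()
	sort.Strings(sent)
	return sent
}

func main() {
	sent := deliverAll([]MailRequest{
		{To: "b@example.com", Subject: "Receipt"},
		{To: "a@example.com", Subject: "Promo"},
	}, 2)
	fmt.Println(sent)
}
```

The point of the split is that the server only has to accept and publish a request, so a slow vendor call delays the workers rather than the clients.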

We don't maintain an SMTP server because we understand that it requires dedicated effort in maintenance and development and is a full-time job in itself. Therefore, we entrust third-party providers with sending out our emails.

Postie High-Level Design

Postie was originally a gRPC-based service, but we added HTTP support over time as requirements changed. PDGs (Product Development Groups) like GoPay, GoRide, etc., call Postie via gRPC or HTTP, depending on which is easier for them to integrate and how they choose to interact with our servers.

The server validates each request and converts it into a domain message ready to be published to a queue. The validation covers multiple data points, including the sender and receiver emails, domain names and priority. Every mail request has a priority, and we use a priority queue to order the requests accordingly.
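The validate-then-prioritise step could look roughly like the sketch below. It uses only the Go standard library (net/mail for address parsing, container/heap for the priority queue). The Request type, the 0 to 9 priority range and the "higher number is more urgent" rule are assumptions made for the example, not Postie's actual rules, and domain-name checks are left out.

```go
package main

import (
	"container/heap"
	"errors"
	"fmt"
	"net/mail"
)

// Request is a hypothetical incoming mail request.
type Request struct {
	From, To string
	Priority int // assumption: a higher number means more urgent
}

// validate checks the sender and receiver addresses and the priority range.
func validate(r Request) error {
	if _, err := mail.ParseAddress(r.From); err != nil {
		return fmt.Errorf("invalid sender: %w", err)
	}
	if _, err := mail.ParseAddress(r.To); err != nil {
		return fmt.Errorf("invalid receiver: %w", err)
	}
	if r.Priority < 0 || r.Priority > 9 {
		return errors.New("priority must be between 0 and 9")
	}
	return nil
}

// requestQueue implements heap.Interface as a max-heap on Priority.
type requestQueue []Request

func (q requestQueue) Len() int            { return len(q) }
func (q requestQueue) Less(i, j int) bool  { return q[i].Priority > q[j].Priority }
func (q requestQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *requestQueue) Push(x interface{}) { *q = append(*q, x.(Request)) }
func (q *requestQueue) Pop() interface{} {
	old := *q
	last := old[len(old)-1]
	*q = old[:len(old)-1]
	return last
}

// drainByPriority validates each request, queues the valid ones and returns
// their receivers in the order a consumer would see them.
func drainByPriority(reqs []Request) []string {
	q := &requestQueue{}
	for _, r := range reqs {
		if err := validate(r); err != nil {
			continue // a real server would return this error to the caller
		}
		heap.Push(q, r)
	}
	var order []string
	for q.Len() > 0 {
		order = append(order, heap.Pop(q).(Request).To)
	}
	return order
}

func main() {
	fmt.Println(drainByPriority([]Request{
		{From: "noreply@example.com", To: "low@example.com", Priority: 1},
		{From: "noreply@example.com", To: "not-an-email", Priority: 9},
		{From: "noreply@example.com", To: "high@example.com", Priority: 5},
	}))
}
```

In this sketch the malformed receiver is rejected before it ever reaches the queue, so it never costs a vendor call.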

Postie provides multiple ways to send emails, including templates, attachments, or simple vanilla strings. Based on their type, these messages are routed to different queues.
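Routing by message type can be sketched as a single function. The Message fields and the queue names below are invented for illustration, and using one queue per type is an assumption; Postie's real queue layout is not described here.

```go
package main

import "fmt"

// Message is a hypothetical domain message.
type Message struct {
	Body          string
	TemplateID    string
	AttachmentIDs []string
}

// Queue names are made up for this example.
const (
	simpleQueue     = "mail.simple"
	templateQueue   = "mail.template"
	attachmentQueue = "mail.attachment"
)

// queueFor picks the destination queue from the shape of the message.
// Attachments are checked first because they need a dedicated worker.
func queueFor(m Message) string {
	switch {
	case len(m.AttachmentIDs) > 0:
		return attachmentQueue
	case m.TemplateID != "":
		return templateQueue
	default:
		return simpleQueue
	}
}

func main() {
	fmt.Println(queueFor(Message{Body: "Hello"}))
	fmt.Println(queueFor(Message{TemplateID: "monthly-statement"}))
	fmt.Println(queueFor(Message{AttachmentIDs: []string{"stmt-1"}}))
}
```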

The typical worker consumes the simple or template-based messages and pushes them out to our vendors. The attachment worker, however, first fetches the attachment from cloud storage and expands the mail request with it before sending the request off to vendors. To make this possible, the server receives an attachment ID along with the mail request, and this ID is used to fetch the attachment from cloud storage.
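A minimal sketch of what the attachment worker does with those IDs follows. AttachmentStore, memoryStore, QueuedMail, OutboundMail and expand are all hypothetical names, and the in-memory map only stands in for cloud storage.

```go
package main

import (
	"errors"
	"fmt"
)

// AttachmentStore is a hypothetical abstraction over cloud storage.
type AttachmentStore interface {
	Fetch(id string) ([]byte, error)
}

// memoryStore is an in-memory stand-in for cloud storage, used for the demo.
type memoryStore map[string][]byte

func (s memoryStore) Fetch(id string) ([]byte, error) {
	data, ok := s[id]
	if !ok {
		return nil, errors.New("attachment not found: " + id)
	}
	return data, nil
}

// QueuedMail is what travels through the queue: it carries only attachment IDs.
type QueuedMail struct {
	To            string
	AttachmentIDs []string
}

// OutboundMail is what is handed to the vendor: it carries the attachment bytes.
type OutboundMail struct {
	To          string
	Attachments [][]byte
}

// expand fetches every referenced attachment and builds the vendor-ready mail.
func expand(store AttachmentStore, m QueuedMail) (OutboundMail, error) {
	out := OutboundMail{To: m.To}
	for _, id := range m.AttachmentIDs {
		data, err := store.Fetch(id)
		if err != nil {
			return OutboundMail{}, fmt.Errorf("expanding mail to %s: %w", m.To, err)
		}
		out.Attachments = append(out.Attachments, data)
	}
	return out, nil
}

func main() {
	store := memoryStore{"stmt-1": []byte("PDF bytes")}
	out, err := expand(store, QueuedMail{To: "user@example.com", AttachmentIDs: []string{"stmt-1"}})
	fmt.Println(len(out.Attachments), err)
}
```

Passing IDs instead of file contents keeps the queued messages small; only the worker that actually sends the mail pays the cost of loading the attachment.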

An example email from Gojek

Outside the Gojek ecosystem, the vendors do their jobs, deliver the mails to our users, and send us callbacks with email delivery statuses. These are then collected and further analysed, but that is outside the scope of this story.

Everything above seems to be pretty straightforward, and so far, we have only talked about the general design. However, I wanted all our readers to be on the same page so that the deep dive coming up next is worthwhile.

Making a service resilient has a lot to do with its high-level design, but it also depends on the low-level design of one's code. Moreover, as long as your service is dumb, i.e. consuming requests and spitting out responses irrespective of the characteristics of the requests being served, significant improvements might not be possible.

I will discuss the changes we made to Postie that helped us stay available to all clients and decreased our external costs by blocking requests made for invalid mailboxes. But more on this in the next post!
