In a detailed engineering post, Yelp shared how it built a scalable, cost-efficient pipeline for processing Amazon S3 server-access logs (SAL) across its infrastructure, overcoming the traditional limitations of storing and querying raw logs at high volume. The article outlines both the challenges the company faced, such as log volume, storage cost, and query performance, and the technical strategies it used to make object-level logging practical at scale.
In essence, Yelp now writes terabytes of access logs daily but converts them into compact, Parquet-formatted archives that are easy to query with tools like Amazon Athena. Through a process of periodic “compaction,” raw plaintext log objects are merged into fewer, larger Parquet files, reducing storage usage by about 85% and cutting the number of objects by more than 99.99%. This transformation makes analytics efficient and cost-effective, enabling quick lookups for permission debugging, cost attribution, incident investigation, and data-retention analysis.
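To make the compaction step concrete, here is a minimal sketch of merging many small log objects under one prefix into a single Parquet file, using boto3 and pyarrow. The bucket names, prefix layout, and simplified field parsing are illustrative assumptions, not Yelp’s actual implementation (the real SAL format has roughly 25 space-delimited fields, some quoted):

```python
# Illustrative compaction sketch: merge many small S3 server-access log
# objects into one Parquet file. Buckets/prefixes are placeholders.
import re
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")

# Captures only a handful of the standard SAL fields; the real format
# carries ~25 fields, several of them quoted.
LINE_RE = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+)'
)

def compact(log_bucket: str, prefix: str, out_bucket: str, out_key: str) -> None:
    rows = {"request_time": [], "requester": [], "operation": [], "key": []}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=log_bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=log_bucket, Key=obj["Key"])["Body"]
            for line in body.iter_lines():
                m = LINE_RE.match(line.decode("utf-8", errors="replace"))
                if m:
                    rows["request_time"].append(m["time"])
                    rows["requester"].append(m["requester"])
                    rows["operation"].append(m["operation"])
                    rows["key"].append(m["key"])
    # Many small text objects become one columnar, compressed file.
    pq.write_table(pa.table(rows), "/tmp/compacted.parquet", compression="zstd")
    s3.upload_file("/tmp/compacted.parquet", out_bucket, out_key)
```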
Behind the scenes, the architecture leverages the AWS Glue Data Catalog to manage schemas across multiple AWS accounts, together with a mix of scheduled batch jobs, Lambda functions, and partition-projection-based tables for robust, automated log ingestion. The system is designed to tolerate delayed or duplicate log delivery, which SAL inherently allows, by making inserts idempotent; old log objects are tagged for lifecycle expiration once their contents are safely archived.
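Partition projection is worth illustrating, because it lets Athena resolve date partitions from a template rather than requiring crawlers or MSCK REPAIR as new data lands. The following sketch registers such a table over the compacted Parquet output; the bucket locations, table name, and column subset are placeholders:

```python
# Hedged sketch of an Athena table with partition projection over
# compacted Parquet logs. Locations and the date range are placeholders.
import boto3

athena = boto3.client("athena")

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS s3_access_logs (
    request_time string,
    requester    string,
    operation    string,
    key          string,
    httpstatus   string,
    errorcode    string,
    bytessent    bigint
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://example-compacted-logs/'
TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.dt.type' = 'date',
    'projection.dt.range' = '2020/01/01,NOW',
    'projection.dt.format' = 'yyyy/MM/dd',
    'storage.location.template' = 's3://example-compacted-logs/${dt}'
)
"""

athena.start_query_execution(
    QueryString=DDL,
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```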
Yelp’s system also supports key operational use cases. For debugging, engineers can query whether a particular object was accessed (or denied) at a given time. For cost analysis, it is possible to aggregate API usage by IAM role to understand which services or teams generate the most traffic. For data hygiene, combining access logs with S3 Inventory allows the team to identify and safely delete objects that haven’t been accessed for defined periods.
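The first two use cases translate directly into SQL. These example queries assume the hypothetical s3_access_logs table sketched above, with column names following the standard server-access-log fields; the dates and object key are placeholders:

```python
# Illustrative Athena queries for the use cases above, runnable via the
# same start_query_execution call shown earlier. Values are placeholders.

# Debugging: was this object denied access on a given day?
DEBUG_QUERY = """
SELECT request_time, requester, operation, httpstatus, errorcode
FROM s3_access_logs
WHERE dt = '2024/06/01'
  AND key = 'data/reports/report.csv'
  AND errorcode = 'AccessDenied'
"""

# Cost attribution: which IAM principals drive the most traffic?
COST_QUERY = """
SELECT requester, count(*) AS requests, sum(bytessent) AS bytes_out
FROM s3_access_logs
WHERE dt BETWEEN '2024/06/01' AND '2024/06/30'
GROUP BY requester
ORDER BY requests DESC
"""
```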
The significance of Yelp’s work is twofold: it demonstrates that object-level logging on S3, long considered too expensive or unwieldy at scale, can in fact be made efficient and operationally manageable, and it provides a reference architecture for other companies seeking a similar visibility or compliance posture. As demand grows for tighter data governance, auditing, and cost visibility in cloud storage environments, Yelp’s lessons offer a practical approach to scaling access logging without inflating storage costs or compromising queryability.
Alongside Yelp’s write-up, several other systems and published patterns echo or implement the same design ideas that Yelp described in its “S3 server-access logs at scale” architecture.
Upsolver is a data-lake/ETL platform that offers built-in support for ingesting S3 access logs, converting them into analytics-ready formats, and optimizing them for query engines. Its S3 access-logs processing workflow mirrors what Yelp did: ingest logs, transform them, and make them queryable by SQL engines such as Amazon Athena. This allows teams to skip writing custom log-processing pipelines and still get the benefit of scalable log analytics.
AWS itself published an example architecture for processing S3 server-access logs using a Glue job (particularly interesting when paired with Ray for scalable Python-based processing). The pipeline partitions the data, converts it to Parquet, catalogs the result, and then uses Athena (or, in some cases, visualization tools like QuickSight) to query and analyze access patterns at scale. This essentially matches the “compaction + table + catalog + query” pattern that Yelp adopted, but as a managed recipe from AWS.
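A condensed PySpark analogue of that Glue-job pattern might look like the sketch below: read raw log text, extract a subset of fields with a regex, and write date-partitioned Parquet. The S3 paths are placeholders, and AWS’s published example (which uses Glue for Ray) differs in detail:

```python
# Condensed sketch of the Glue-job pattern: parse raw SAL text and write
# date-partitioned Parquet. Paths are placeholders, not AWS's exact recipe.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, to_date, col

spark = SparkSession.builder.appName("sal-to-parquet").getOrCreate()

raw = spark.read.text("s3://example-raw-sal/")

# Capture groups: 3 = timestamp, 5 = requester, 7 = operation, 8 = key.
pattern = r'^(\S+) (\S+) \[([^\]]+)\] (\S+) (\S+) (\S+) (\S+) (\S+)'
parsed = raw.select(
    regexp_extract("value", pattern, 3).alias("request_time"),
    regexp_extract("value", pattern, 5).alias("requester"),
    regexp_extract("value", pattern, 7).alias("operation"),
    regexp_extract("value", pattern, 8).alias("key"),
)

# SAL timestamps look like "06/Feb/2024:00:00:38 +0000".
parsed = parsed.withColumn(
    "dt", to_date(col("request_time"), "dd/MMM/yyyy:HH:mm:ss Z")
)

parsed.write.mode("append").partitionBy("dt").parquet(
    "s3://example-compacted-logs/"
)
```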
Additionally, projects like Druid (for analytical workloads over time-series or event data) and Presto/Trino (for SQL querying over large datasets, including S3 object stores) are often used as the underlying query engines for large-scale log or event data lakes. With logs converted to columnar formats (e.g., Parquet or ORC, or managed via lake-table formats like Apache Iceberg), these engines can serve as scalable, low-latency query layers, making them useful backbones for access-log, audit-log, or event-log architectures.
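For instance, the same Parquet-backed table could be queried through Trino’s Python client instead of Athena. This is a sketch under assumed names: the host, catalog, schema, and table are hypothetical, and the dt string format follows the earlier partition-projection example:

```python
# Hypothetical sketch: querying the compacted logs via a self-managed
# Trino cluster. Host, catalog, schema, and table names are assumptions.
import trino

conn = trino.dbapi.connect(
    host="trino.internal.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="logs",
)
cur = conn.cursor()
cur.execute(
    "SELECT operation, count(*) AS n "
    "FROM s3_access_logs WHERE dt = '2024/06/01' "
    "GROUP BY operation ORDER BY n DESC"
)
for operation, n in cur.fetchall():
    print(operation, n)
```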
And for organizations that want near-real-time search and alerting (e.g., for security or anomaly detection), the AWS blog also describes a pattern to ingest server-access logs from S3 into OpenSearch (using Lambda and ingestion pipelines) and visualize them with OpenSearch Dashboards (formerly Kibana). Though this trades off some of the long-term storage efficiency that Parquet plus Athena offers, it delivers more immediacy and real-time investigative capability, useful in security, compliance, or operational monitoring contexts.
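A hand-rolled version of that ingestion path could be as simple as a Lambda triggered by S3 ObjectCreated events that bulk-indexes new log lines. This is only a sketch: the endpoint, index name, and handler are assumptions (the AWS pattern uses managed ingestion pipelines), and authentication is omitted:

```python
# Hypothetical near-real-time path: a Lambda that indexes newly delivered
# log objects into OpenSearch. Endpoint and index name are placeholders;
# real deployments need SigV4 or basic auth, omitted here for brevity.
import boto3
from opensearchpy import OpenSearch, helpers

s3 = boto3.client("s3")
client = OpenSearch(
    hosts=[{"host": "search-example.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

def handler(event, context):
    actions = []
    for record in event["Records"]:  # S3 ObjectCreated notifications
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        for line in body.iter_lines():
            actions.append({
                "_index": "s3-access-logs",
                "_source": {"raw": line.decode("utf-8", errors="replace")},
            })
    helpers.bulk(client, actions)
```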
