Amazon S3 is the best place to build data lakes because of its durability, availability, scalability, and security. Hundreds of thousands of data lakes are built on S3, storing diverse sets of unstructured data for use in analytics pipelines, machine learning training, business intelligence, and more. Often, these tasks build on top of open-source analytics engines like Apache Spark that natively connect to S3 as a data source, and benefit from S3’s elasticity to finish jobs quickly and at low cost. AWS supports and contributes to a variety of these open-source projects, including the s3a adapter for Apache Hadoop and the Hive connector for the Trino query engine, to improve performance and support new S3 features.
But many data lake customers use more domain-specific tools that don’t natively support S3’s object APIs and instead expect inputs and outputs to be files in a local file system. For example, many genomics research tools are open-source Linux applications that read sequencing data from a file system, and some machine learning training pipelines store checkpoints on the local file system. Adapting these applications to access S3 data and take advantage of S3’s elasticity and high throughput can be a lot of undifferentiated heavy lifting, or even impossible in cases where customers don’t own the application source code. We are committed to making these tools just as easy to use with S3 data lakes as Hadoop or Trino are today.
Today, we’re excited to announce Mountpoint for Amazon S3, a new open-source file client that makes it easy for Linux-based applications to connect directly to Amazon S3 buckets and access objects using file APIs. Mountpoint is designed for large-scale analytics applications that read and generate large amounts of S3 data in parallel, but don’t require the ability to write to the middle of existing objects. Mountpoint allows you to map S3 buckets or prefixes into your instance’s file system namespace, traverse the contents of your buckets as if they were local files, and achieve high throughput access to objects without worrying about performance tuning or provisioning. With Mountpoint, file operations map to GET and PUT operations against S3, allowing scalable file-based applications to burst to terabits per second of aggregate throughput, without any code changes.
It’s important to note that Mountpoint isn’t a general-purpose networked file system, and comes with some restrictions on file operations. For example, it doesn’t support writes in this first release, and in the future will only support sequential writes to new objects. If you’re looking to run file-based applications that need to collaborate or coordinate on shared data across instances or users, we recommend AWS fully managed file services, such as Amazon FSx or Amazon Elastic File System (EFS). For example, Amazon FSx for Lustre enables customers to link a fully POSIX-compliant file system to an S3 bucket. But if you’re building a data lake application that needs to read large objects without using other file system features like locking or POSIX permissions, or write objects sequentially from a single node, Mountpoint makes direct access to S3 simple and efficient.
Mountpoint for Amazon S3 is launching as an alpha release today, and the code is available on GitHub. It’s not yet ready for use in production workloads, but we’d love for you to kick the tires and let us know what you think! We’re also making public our roadmap for Mountpoint’s development, and we’d welcome feedback on features or use cases that you’d like to see added to the roadmap.
Getting started with Mountpoint for Amazon S3
Mountpoint for Amazon S3 is an application that runs on your compute node, such as an Amazon EC2 instance or Amazon EKS pod, and mounts an S3 bucket to a directory on your local file system. To get started with Mountpoint, we created an Amazon S3 bucket to store recordings of some of our favorite re:Invent 2022 videos:
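The original post shows the bucket's contents in the S3 console. As a sketch, the same contents can be listed with the AWS CLI (the bucket name `amzn-s3-demo-bucket` is a placeholder for our demo bucket):

```shell
# List the objects in the demo bucket (bucket name is a placeholder)
aws s3 ls s3://amzn-s3-demo-bucket/
```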
We’re going to use Mountpoint to produce downscaled versions of these videos to optimize their file size. First, we followed the instructions in the Mountpoint GitHub repository to install Mountpoint on our instance:
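The authoritative install steps live in the GitHub repository's README. As one hedged example of what that looked like for the alpha, you could build the `mount-s3` binary from source with a Rust toolchain and FUSE installed (package names below assume an RPM-based distribution such as Amazon Linux):

```shell
# Prerequisites: FUSE and its headers (package names assume an RPM-based distro)
sudo yum install -y fuse fuse-devel

# Clone the repository, including its AWS Common Runtime submodules, and build
git clone --recurse-submodules https://github.com/awslabs/mountpoint-s3.git
cd mountpoint-s3
cargo build --release
# The resulting binary is at target/release/mount-s3
```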
We need to create a new directory to which we’ll be attaching our S3 bucket:
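For example:

```shell
# Create an empty directory to use as the mount point
mkdir ~/mnt
```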
Now we launch Mountpoint, specifying the name of the S3 bucket we want to mount and the directory to mount into. Mountpoint doesn’t change anything about permissions or access management for S3 buckets—it respects all of a bucket’s access policy options—and so we’ll need credentials to access the bucket in order to mount it. In this case, our EC2 instance has an IAM role attached with permissions for the bucket, and Mountpoint picks up those credentials automatically, so there’s no further configuration required:
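With credentials coming from the instance's IAM role, mounting is a single command (the bucket name is a placeholder):

```shell
# Mount the bucket into the directory we just created;
# credentials are picked up automatically from the instance's IAM role
mount-s3 amzn-s3-demo-bucket ~/mnt
```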
Once the bucket is mounted, we can traverse our bucket as if it were a local file system, using familiar command line tools:
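For instance (object names here are placeholders for the videos in our demo bucket):

```shell
# Browse the mounted bucket with ordinary file tools
ls -l ~/mnt

# Check the size of an individual object as if it were a local file
du -h ~/mnt/reinvent-keynote.mp4
```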
To downscale a video, we’ll use the open-source ffmpeg utility. While ffmpeg can access S3 natively using presigned URLs, it’s a lot simpler to just give it a local path, which Mountpoint lets us do:
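A sketch of that step (the input filename is a placeholder; `-vf scale=1280:-2` scales the video to 1280 pixels wide while preserving the aspect ratio, writing the result to local instance storage):

```shell
# Read the source video through the mount, write a downscaled copy locally
ffmpeg -i ~/mnt/reinvent-keynote.mp4 -vf scale=1280:-2 ~/reinvent-keynote-720p.mp4
```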
And now we have a downscaled version of this video on our local instance storage. Mountpoint’s alpha release doesn’t support writes (PUTs), so we can’t yet directly write this output back to our S3 bucket through the file system. But we can use the AWS CLI to upload it to our bucket, and then it will be visible through Mountpoint:
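Continuing the sketch above (bucket and filenames are placeholders):

```shell
# Upload the downscaled file with the AWS CLI...
aws s3 cp ~/reinvent-keynote-720p.mp4 s3://amzn-s3-demo-bucket/

# ...and the new object is then visible through the mount
ls ~/mnt
```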
Stay tuned to our public roadmap for updates on write support.
How we’re building a reliable high-performance file client
We know that Mountpoint for Amazon S3 isn’t the first file client for accessing S3—there are several open-source file clients that our customers have experience with. A common theme we’ve heard from these customers, however, is that they want these clients to offer the same stability, performance, and technical support that they get from S3’s REST APIs and the AWS SDKs. To make this possible, we’re building Mountpoint on top of three foundational technologies.
First, Mountpoint builds on the same AWS Common Runtime (CRT) used by most AWS SDKs. The CRT is a software library for interacting with AWS services, offering components like IO, HTTP, and encryption. The CRT is purpose-built for high performance and low resource usage to make the most efficient use of your compute resources. For Amazon S3, the CRT includes a client that implements best practice performance design patterns, including timeouts, retries, and automatic request parallelization for high throughput. Applications that use Mountpoint for Amazon S3 to access objects through a file system interface get the benefits of all this work for free.
Second, Mountpoint is built in the Rust programming language, which offers strong compile-time safety guarantees without run-time overhead like garbage collection. Teams across Amazon S3 and AWS are building core services in Rust today, including the ShardStore storage node that durably stores S3 object data, and open-source projects like Firecracker and Bottlerocket. We use Rust because it offers reliability without compromise: we get strong safety guarantees like memory safety and a rich type system, without paying the cost of a language runtime or garbage collector to enforce those guarantees. Mountpoint uses Rust to reduce tail latency and minimize resource usage, leaving more memory and CPU available to your applications.
Third, we make use of automated reasoning to improve Mountpoint’s reliability and clearly define its behavior. Engineers across S3 use automated reasoning to mathematically define the behavior of key systems and then formally demonstrate that our implementations meet these requirements. For Mountpoint, we use automated reasoning to validate the file system semantics offered when mounting an S3 bucket. For example, the mapping from S3 object keys to file system names is full of edge cases like keys that end with / or keys that are not valid filenames. To raise our confidence in Mountpoint’s reliability, the open-source repository includes an executable specification of how S3 keys are mapped to file system structures, and we use automated reasoning to continuously validate that Mountpoint follows this specification. This automated reasoning effort makes it easy for applications to use Mountpoint with high confidence.
These foundations are a great example of what we do every day to build high-quality services across Amazon S3. High-performance libraries like the CRT reflect over 17 years of experience building for immense scale, and techniques like automated reasoning help us engineer not just for the average case but the one-in-a-trillion edge cases that matter at scale. Together, they allow us to make Mountpoint a reliable and high-performance file client for large-scale data lake workloads on S3.
So why isn’t it a full-featured file system?
AWS has several full-featured file system offerings like Amazon Elastic File System (EFS) and the Amazon FSx family of services. As we’ve talked to S3 customers about Mountpoint, we’re finding that they often ask why Mountpoint doesn’t offer a full-featured file system interface or POSIX compatibility. Our answer is that modern file system interfaces are rich and complex. They have a surprising number of features that don’t overlap with object storage, including operations like file and directory rename, specific consistency semantics, the ability to mutate the contents of files in place, and OS-managed permissions and attribute mappings. Almost all applications that use files depend on some subset of these file system APIs and features, and those applications just won’t work on Mountpoint.
But, while building Mountpoint, we’ve heard from our customers that the applications they want to run against their S3 data often don’t need the full POSIX feature set. They’re successfully using alternative file system specifications like the Apache Hadoop Distributed File System (HDFS) interface, which is optimized for streaming access to large files and only allows files to be written once. Adopting this model would make Mountpoint a really bad fit for use cases like your instance’s root file system, but it’s a great match for many data lake applications doing scale-out analytics on large datasets stored in S3.
With these applications in mind, we’re designing Mountpoint around the tenet that it exposes S3’s native performance, and will not support file system operations that cannot be implemented efficiently against S3’s object APIs. This means Mountpoint won’t try to emulate operations like directory renames that would require many S3 API calls and would not be atomic. Similarly, we don’t try to emulate POSIX file system features that have no close analog in S3’s object APIs. This tenet focuses us on building a high-throughput file client that lowers costs for large scale-out data lake workloads.
Our model for Mountpoint is that it’s a building block for Amazon S3 customers, just like the AWS SDKs are today. For our alpha release, we’ve focused on foundational features like high-throughput single-object reads that data lake applications need to work with S3. Some of the early customers that we’ve talked to about Mountpoint have already told us they’re interested in helping to evolve it to provide richer functionality, and we’re excited to collaborate on this as an open-source project open to contributions.
Mountpoint for Amazon S3 makes it easy for data lake applications to translate file operations into S3 API calls, allowing them to run against your S3 data without any code changes. With today’s launch of an alpha release for Mountpoint, we’re taking the first steps towards a reliable, high-performance file client for these applications. The alpha isn’t ready for production workloads, but it’s ready for testing and feedback. We hope this initial release on GitHub and our public roadmap give you an idea of what we’re planning for the future, and we’re excited to hear your feedback and to iterate quickly on Mountpoint as an open-source project. We can’t wait to see your S3 workloads running on Mountpoint soon!
Thank you for reading this blog. If you have any comments or questions, don’t hesitate to leave them in the comments section.