Storage administrators—whether they are managing user and departmental shares, high-availability databases like SQL Server, or business applications—need to understand how their workloads are performing. Understanding optimal throughput capacity, storage capacity, and storage type for your file systems helps ensure high performance and enables you to right-size your file storage to optimize cost.
In addition to the previously available metrics (described in the Storage Blog post “Monitor performance of Amazon FSx for Windows File Server with Amazon CloudWatch”), Amazon FSx for Windows File Server now supports metrics that enable you to easily monitor file system performance across a wider range of workloads, such as read/write-intensive workloads supporting SQL databases and time-varying workloads supporting user directories.
In this blog, we cover how these new metrics can help you understand the performance of your Windows, Linux, or macOS file-based workloads so you can optimize performance and costs. After reading this blog, you’ll be able to answer questions like:
- Have I chosen the right throughput capacity, storage capacity, and storage type for my workload?
- Will my file system be able to handle my workload’s performance needs?
- How can I set up monitoring to let me know if I’m approaching a performance limit?
Overview of file system performance
Amazon FSx for Windows File Server provides fully managed, scalable file storage built on Windows Server and delivers a rich set of administrative and security features. Before we dive into the new metrics, it’s worth reviewing the primary architectural components of every Amazon FSx file system:
- The file server that serves data to clients accessing the file system
- The storage volumes that host the data in your file system
The throughput capacity that you select determines the performance resources available for the file server—including the network I/O limits, the CPU and memory, and the disk I/O limits imposed by the file server. The storage capacity and storage type that you select determine the performance resources available for the storage volumes—the disk I/O limits imposed by the storage disks. For more information, see the Amazon FSx for Windows File Server performance page.
These architectural components, and the corresponding metrics, are displayed in the following diagram:
The metrics can be broadly divided into the following domains:
- Network I/O: measuring traffic and sessions between clients and the file server
- File server: measuring file server I/O, CPU, and memory utilization
- Disk I/O: measuring traffic between file server and storage volumes
- Storage volume I/O: measuring storage volume I/O utilization
- Storage volume capacity: measuring how your files are utilizing available storage
Network I/O metrics and how to use them
The network I/O metrics measure traffic and active sessions between your clients and your file system’s file server. All of the metrics mentioned below except ClientConnections were previously available and can be used to understand the amount and type of traffic for clients that are accessing the file system.
|Metric|Description|Units|
|---|---|---|
|DataReadBytes|The number of bytes for read operations for clients accessing the file system|bytes|
|DataWriteBytes|The number of bytes for write operations for clients accessing the file system|bytes|
|DataReadOperations|The number of read operations for clients accessing the file system|count|
|DataWriteOperations|The number of write operations for clients accessing the file system|count|
|MetadataOperations|The number of metadata operations for clients accessing the file system|count|
|ClientConnections|The number of connections established between clients and the file server|count|
Using these metrics, you can identify when your end users or processes are driving I/O to your file system, whether this I/O is throughput-heavy or IOPS-heavy, and whether traffic is coming from read, write, or metadata operations.
MetadataOperations include reads or writes to file metadata, such as filenames, file sizes, permissions, or directory structure. These operations can be a higher proportion of traffic when your file system has a large number of small files and there are frequent changes to files or directory structures. For workloads such as user or departmental shares, you’ll typically see a higher number of metadata operations given the large number of files. For workloads such as virtual desktop infrastructure (VDI) like Citrix, WorkSpaces, or AppStream, where data is stored in large virtual hard disk (VHD) files, you’ll typically see a lower number of metadata operations. You can determine the percent of metadata operations by calculating MetadataOperations/(MetadataOperations + DataReadOperations + DataWriteOperations).
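The calculation above can be sketched as a small helper. This is illustrative only; the sample operation counts are invented, not figures from this post:

```python
# Sketch: percent of client operations that are metadata operations,
# per MetadataOperations / (MetadataOperations + DataReadOperations
# + DataWriteOperations). Sample counts below are invented.
def metadata_operation_percent(metadata_ops, read_ops, write_ops):
    """Return metadata operations as a percent of all client operations."""
    total = metadata_ops + read_ops + write_ops
    return 100.0 * metadata_ops / total if total else 0.0

# A user-share-like sample with many small files and frequent changes:
print(metadata_operation_percent(600, 250, 150))  # -> 60.0
```

A high result suggests a metadata-heavy workload such as a user or departmental share; a low result is typical of VDI workloads storing large VHD files.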
A new metric that has been added to provide further visibility into performance is ClientConnections, which counts the number of active SMB sessions to your file system at a point in time. For example, you can use a historical chart of ClientConnections to identify a low-usage window for your file system and schedule activities for when their impact will be minimal. If you need to dive deeper, you can also view the exact list of open user sessions and open files by using the Windows-native Shared Folders GUI tool and the Amazon FSx CLI for remote management on PowerShell, as described in the Amazon FSx Windows User Guide.
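Finding that low-usage window from ClientConnections data can be sketched as follows; the hourly averages below are invented sample values, not real measurements:

```python
# Sketch: pick the quietest hour of day from hourly average
# ClientConnections samples (index 0 = midnight). In practice you would
# fetch these averages from CloudWatch; the values here are invented.
def quietest_hour(hourly_connections):
    """Return the hour of day with the fewest client connections."""
    return min(range(len(hourly_connections)),
               key=lambda h: hourly_connections[h])

hourly = [4, 2, 1, 1, 3, 8, 25, 60, 90, 95, 92, 88,
          85, 90, 93, 91, 80, 55, 30, 18, 12, 9, 7, 5]
print(quietest_hour(hourly))  # -> 2 (2:00 AM is quietest)
```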
File server metrics and how to use them
File server metrics measure how your workload is utilizing the network and disk I/O limits, CPU, and memory resources available for your file server. Note that the metrics starting with FileServerDisk* measure the disk I/O utilization relative to the limits imposed by the file server. (The Disk* metrics in the Storage volume section below measure the disk I/O utilization relative to the limits imposed by the storage volumes.)
|Metric|Description|Units|
|---|---|---|
|NetworkThroughputUtilization|Network throughput activity as a percentage of the provisioned limit|percent|
|MemoryUtilization|The percentage of file server memory in use|percent|
|CPUUtilization|The percentage of file server CPU capacity in use|percent|
|FileServerDiskThroughputUtilization|Disk throughput activity as a percentage of the file server’s provisioned maximum|percent|
|FileServerDiskThroughputBalance|The percentage of available disk throughput burst credits for the file server*|percent|
|FileServerDiskIOPSUtilization|Disk IOPS activity as a percentage of the file server’s provisioned maximum|percent|
|FileServerDiskIOPSBalance|The percentage of available burst credits for disk IOPS for the file server*|percent|
*The FileServerDiskThroughputBalance and FileServerDiskIOPSBalance metrics are available only for file systems with a throughput capacity of 256 MBps or less, because only these throughput capacities provide bursting for disk I/O. Visit the FSx for Windows File Server performance page for more details.
NetworkThroughputUtilization measures how the data transferred between your clients and the file server utilizes the available network throughput limit for your file system, which is determined by the throughput capacity you select. For example, if this utilization is 100 percent (during a stalled migration or slow access experience), your traffic is hitting the network throughput limits, and you might see increased latency, queueing, or even traffic shaping for requests you make to your file system.
You can also proactively use this metric to help you identify if network throughput is regularly approaching a limit. For example, if it exceeds 75 percent consistently, you may want to scale up throughput capacity.
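As a sketch of the proactive approach above, the following builds CloudWatch alarm parameters for sustained NetworkThroughputUtilization above 75 percent. The alarm name, file system ID, SNS topic ARN, and the 30-minute evaluation window are placeholder assumptions; with boto3 available you would pass the result to `put_metric_alarm`:

```python
# Sketch: alarm parameters for NetworkThroughputUtilization > 75 percent.
# The file system ID, topic ARN, and evaluation window are placeholders.
def network_alarm_params(file_system_id, topic_arn):
    return {
        "AlarmName": f"fsx-{file_system_id}-network-throughput",
        "Namespace": "AWS/FSx",
        "MetricName": "NetworkThroughputUtilization",
        "Dimensions": [{"Name": "FileSystemId", "Value": file_system_id}],
        "Statistic": "Average",
        "Period": 300,           # 5-minute periods
        "EvaluationPeriods": 6,  # sustained for 30 minutes (assumption)
        "Threshold": 75.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }

params = network_alarm_params("fs-0123456789abcdef0",
                              "arn:aws:sns:us-east-1:111122223333:alerts")
# With boto3: boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["AlarmName"])
```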
The available memory for your file server also depends on its throughput capacity. MemoryUtilization describes how your file system is using the memory available for caching and for performing background activities such as data deduplication and shadow copies.
For example, say you’re running Data Deduplication on your file system, but it isn’t freeing up any space. By examining the deduplication job output, you see that although the deduplication schedule is set, the most recent job failed to complete. You examine the MemoryUtilization metric and observe that it was pegged at 100 percent when the job started and dropped to 0 percent when the job failed. Based on this information, you could increase the throughput capacity of your file system, update your deduplication settings to consume less memory, or update your deduplication schedule to run only when your file system is idle. Alternatively, if your deduplication jobs run successfully and memory utilization stays well below 100 percent, you can decrease your throughput capacity, or you can allocate additional memory to deduplication by using the -memory parameter. Similarly, you can monitor CPUUtilization to determine whether any resource-intensive tasks might require you to increase throughput capacity.
You can use FileServerDiskThroughputUtilization and FileServerDiskIOPSUtilization to measure the disk throughput and disk IOPS activity your workload is driving as a percentage of the file server’s provisioned limits, which are determined by the throughput capacity you select. For example, if FileServerDiskThroughputUtilization is nearing 100 percent during a large data transfer, your workload is hitting the provisioned limit for throughput between the file server and the storage volumes, and you may see slow data transfer, share access, and folder enumeration.
The FileServerDiskThroughputBalance and FileServerDiskIOPSBalance metrics give you the percentage of available burst credits for disk throughput and disk IOPS on the file server. During an investigation of latency for a read/write-heavy activity like a SQL database backup, you can refer to these metrics to understand whether you are experiencing a bottleneck due to the I/O limits imposed by the file server or the I/O limits imposed by the storage volumes. Answering this question is critical to determining the correct action to take: if the bottleneck is at the file server limits, increasing throughput capacity will remove it; if the bottleneck is at the storage volume limits, you’ll need to change the storage capacity or the storage type.
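The decision logic above can be sketched as a simple triage helper. The 90 percent threshold is an assumption for illustration, not a documented value:

```python
# Sketch: triage a disk I/O bottleneck by comparing file-server-level
# and storage-volume-level utilization. The 90 percent threshold is an
# assumption chosen for illustration.
def disk_bottleneck_action(file_server_disk_util, storage_volume_disk_util,
                           threshold=90.0):
    if file_server_disk_util >= threshold:
        # Limit imposed by the file server: scale throughput capacity.
        return "increase throughput capacity"
    if storage_volume_disk_util >= threshold:
        # Limit imposed by the storage volumes: change capacity or type.
        return "increase storage capacity or change storage type"
    return "no disk bottleneck at this threshold"

# e.g. FileServerDiskThroughputUtilization=97%, DiskThroughputUtilization=40%
print(disk_bottleneck_action(97.0, 40.0))  # -> increase throughput capacity
```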
Disk I/O metrics and how to use them
These metrics measure traffic between the file server and the storage volumes. With the disk I/O metrics, you can observe read and write activity affecting your storage volumes regardless of whether your clients are actively driving traffic to the file system.
|Metric|Description|Units|
|---|---|---|
|DiskReadBytes|The number of bytes for read operations that access storage volumes|bytes|
|DiskWriteBytes|The number of bytes for write operations that access storage volumes|bytes|
|DiskReadOperations|The number of read operations for the file server accessing storage volumes|count|
|DiskWriteOperations|The number of write operations for the file server accessing storage volumes|count|
These metrics can help you differentiate activity that drives up your disk I/O (like shadow copy jobs) from activity that drives up your network I/O (like data transfer between two file systems).
Note that the maximum disk I/O levels your file system can achieve are the lower of the disk I/O limits imposed by your file server (based on the throughput capacity) and the disk I/O limits imposed by the storage volumes (based on the storage capacity and storage type). More examples are covered in the storage capacity and throughput capacity documentation.
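The "lower of the two limits" rule above reduces to a minimum. The example limits here are illustrative values, not figures from this post:

```python
# Sketch: the achievable disk I/O level is the lower of the file server
# limit (set by throughput capacity) and the storage volume limit (set by
# storage capacity and type). Example values are illustrative.
def effective_disk_limit(file_server_limit_mbps, storage_volume_limit_mbps):
    return min(file_server_limit_mbps, storage_volume_limit_mbps)

# A file server allowing 350 MBps paired with volumes allowing 200 MBps
# achieves at most 200 MBps of disk throughput.
print(effective_disk_limit(350, 200))  # -> 200
```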
Storage volume I/O metrics and how to use them
These metrics measure how your workload utilizes the disk I/O limits imposed by the storage volumes. HDD storage volumes provide 12 MBps of baseline and 80 MBps of burst disk throughput per TiB of storage. SSD storage volumes provide 750 MBps of disk throughput and 3,000 disk IOPS per TiB of storage, up to maximums of 2,048 MBps and 80,000 IOPS.
|Metric|Description|Units|
|---|---|---|
|DiskThroughputUtilization|(HDD only) Disk throughput activity as a percentage of the provisioned limit for the storage volumes|percent|
|DiskThroughputBalance|(HDD only) The percentage of available burst credits for disk throughput for the storage volumes|percent|
|DiskIOPSUtilization|(SSD only) Disk IOPS as a percentage of the provisioned maximum for the storage volumes|percent|
With these metrics, you can get visibility into DiskThroughputUtilization and DiskThroughputBalance for your HDD-backed file system and determine whether a potential bottleneck exists at the storage volume level. If so, you might want to consider increasing the storage capacity (which provides linearly higher throughput limits) or switching to the SSD storage type (which provides 250x higher disk IOPS).
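Because HDD storage volume throughput scales linearly with capacity, you can estimate the storage volume limits directly from the per-TiB figures quoted above (12 MBps baseline and 80 MBps burst per TiB):

```python
# Sketch: HDD storage volume disk throughput limits, using the per-TiB
# figures quoted in this post (12 MBps baseline, 80 MBps burst per TiB).
def hdd_volume_throughput_mbps(storage_tib):
    """Return baseline and burst disk throughput for HDD storage volumes."""
    return {"baseline": 12 * storage_tib, "burst": 80 * storage_tib}

# Doubling storage capacity doubles the storage volume throughput limits.
print(hdd_volume_throughput_mbps(4))  # -> {'baseline': 48, 'burst': 320}
```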
For SSD file systems, you can use DiskIOPSUtilization to understand how your performance- or latency-sensitive workload is using the IOPS available from the SSD storage volumes. For example, if you’re running a SQL Server high-availability deployment, you can verify that your database is getting the IOPS it needs.
You can also use these metrics to prevent shadow copies from being deleted. As explained in Troubleshooting shadow copies, insufficient I/O performance on your file system can cause Windows Server to delete all shadow copies, because it cannot maintain them with the available I/O performance capacity.
Storage capacity metrics and how to use them
These metrics measure how your data is using the available storage capacity of your file system and how much storage you’re saving by using Data Deduplication.
|Metric|Description|Units|
|---|---|---|
|StorageCapacityUtilization|Used physical storage capacity as a percentage of total storage capacity|percent|
|DeduplicationSavedStorage|The amount of storage space saved by data deduplication, if it is enabled|bytes|
|FreeStorageCapacity|The amount of available storage capacity|bytes|
While you can increase the storage capacity for your file system at any time, most storage administrators prefer to take a more proactive approach. With the new StorageCapacityUtilization metric, you can keep a closer eye on the utilized physical storage and take action before the space runs out.
Customers using Data Deduplication on their file systems previously had to run a PowerShell command (Get-FSxDedupStatus) to see how much space their deduplication jobs had saved. Now, with the DeduplicationSavedStorage metric, you can see the amount of storage space saved by data deduplication, in bytes, at a glance. You can also observe how it trends over time and set an alert that notifies you if it drops below a certain value, which may indicate that a deduplication job has failed.
In this post, we have explained how to use the new Amazon FSx performance metrics to better understand the performance characteristics of your workload, to diagnose and fix common performance issues, and to right-size your file system to optimize performance and cost.
Each of the architectural components of a file system, and its corresponding metric domain, is impacted by different types of file system activity, so it is important to understand your specific workload’s needs by using the full set of metrics. For example, if your end users are reporting slowness, you can look at the network I/O and disk I/O metrics to determine whether resource contention exists between the end users and the file server or between the file server and the storage volumes.
The new metrics are available in Amazon CloudWatch, which means you can also set up your own custom dashboards, alarms, and notifications to take proactive action and minimize any potential disruptions to your file system’s performance. To learn more, visit Monitoring metrics with Amazon CloudWatch.