My article three weeks ago on Linux file systems set off a firestorm unlike any other I've written in the decade I've been writing on storage and technology issues.
My intention was to draw on my experience as an HPC storage consultant and my knowledge of file systems and operating systems to advise readers on the best course of action. This is no different from the approach I take in all my articles. I spend most of my time reviewing storage technology issues for my customers. The installations I work with generally start with 500 TB of storage and go up from there. One site I work with currently has more than 12 PB, and many are planning for 60 PB by 2010.
There is a big difference in my world between the computation environments and the large storage environments. In the HPC computational environments I work with, I often see large clusters (yes, Linux clusters). Across the many hundreds of thousands of nodes that I am aware of, however, no one is using a large single instantiation of a Linux file system, where by large I mean 100 TB or greater. I have not even seen a 50 TB Linux file system. That does not mean that they don't exist, but I have not seen them, nor have I heard of any.
What I do see for large Linux clusters are clustered file systems such as Lustre. Lustre currently uses many ext-3/4 file systems and binds them together into a single name space. But this article and the last one have nothing to do with these clustered file systems and everything to do with implementation of Linux file systems of more than 100 TB. And that's where Linux file systems run into trouble.
Here are two of the problems I see with scaling for Linux file systems:
- The file system is not aligned to the RAID stripe unless you pad out the first stripe to align the superblock. Almost all high-performance file systems that work at large scale do this alignment automatically. They also always align metadata regions to RAID stripes: metadata is often kept on different devices from the data, so the RAID stripes can be matched to the allocation of both the data and the metadata.
- Fscking the log is not good enough when you have a hardware issue, ranging from a RAID hiccup to a hard crash of multiple components caused by something like a power incident. If this happens, you must fsck the whole file system, not just the log (a number of responders pointed this out). Because the metadata is spread throughout the file system and all of it must be checked, a full fsck involves a great deal of head seeking and read-modify-write on the RAID devices, since the metadata areas are not aligned with the RAID stripes.
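The alignment problem in the first point comes down to simple arithmetic. The following is a minimal sketch of that arithmetic, not a description of how any particular file system lays out its superblock or metadata. The chunk size, disk count, starting offset and function names are all illustrative assumptions.

```python
def stripe_width(chunk_bytes: int, data_disks: int) -> int:
    """Full-stripe size in bytes: one chunk on each data disk."""
    return chunk_bytes * data_disks


def padding_to_align(offset: int, width: int) -> int:
    """Bytes of padding needed to move offset up to the next stripe boundary."""
    return (-offset) % width


def is_full_stripe_write(offset: int, length: int, width: int) -> bool:
    """True when a write covers only whole stripes, so a parity RAID
    does not need a read-modify-write cycle to update parity."""
    return length > 0 and offset % width == 0 and length % width == 0


if __name__ == "__main__":
    # Hypothetical layout: 64 KiB chunks across 8 data disks.
    width = stripe_width(chunk_bytes=64 * 1024, data_disks=8)
    # Hypothetical: allocations begin 4 KiB into the device.
    start = 4096
    pad = padding_to_align(start, width)
    print("stripe width:", width)
    print("padding needed:", pad)
    print("unaligned full-stripe write?", is_full_stripe_write(start, width, width))
    print("aligned full-stripe write?", is_full_stripe_write(start + pad, width, width))
```

In this sketch, a write of exactly one stripe's worth of data that starts at the unaligned offset straddles two stripes, which is the case that forces read-modify-write on parity RAID. The same write placed after the padding covers exactly one stripe.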
One other thing I tried to make clear was that small SMP systems with two or even four sockets are not being used for the type of environment I've been talking about. If you have a 500 TB file system, you often need more bandwidth to the file system than can be provided in a four-socket system with, say, two PCIe 2.0 buses (10 GB/sec of theoretical bandwidth). Many times these types of systems have eight or even 16 PCIe buses and 10 GB/sec to 20 GB/sec (or more) of bandwidth. These types of environments are not using blades, nor can they, given that breaking up the large file systems is expensive in terms of management overhead and scaling performance.
Aside from the emotional responses and personal attacks (sorry folks, I'm an independent consultant and not paid by Microsoft or any other vendor, and my opinion is my own), a number of readers raised some good points.
One wondered about Google's use of large file systems. My response is that each of Google's file systems on each of the blade nodes is pretty small and the aggregation of the file systems is done by an application. Also, Google's file system is not part of the standard Linux release.
A number of readers noted that I didn't delve into the details of the extents in ext-3/4 and XFS. For more on the issue, see Choosing a File System or File Manager.
One reader wondered whether users with petabyte storage requirements would use a block device file system rather than a networked hot-add file system (I would think resizing would become quite a nightmare, not to mention a forced fsck), and whether they would run a stock Linux file system without investing some time and money in heavy tweaking. Additionally, most NAS file systems do not scale to petabytes.
My response is that I know people who need the performance of an SMP for file systems of this size today. Breaking the file system up across blades and a network increases the management overhead and therefore the cost. NAS performance doesn't cut it for these people, who are doing streaming I/O for large archives, almost always with HSM-based file systems. The reader basically agrees with my point that Linux file systems need to dramatically improve fsck performance. As for the last point, yes, these people are investing heavily in performance resources.
A few readers pointed out that I failed to mention other factors besides the file system, such as device drivers, the hardware platform and the application access patterns and what other applications were running. This is a fair comment, but my response is I was just trying to address the file system issues in Linux, not critique the whole data path.
A Call to Action
These are the opinions and analysis of one storage consultant, based on what I have seen in real-world environments at very large sites. My advice is that Linux file systems are probably okay in the tens of terabytes, but don't try to do hundreds of terabytes or more. And given the rapid growth of data everywhere, scaling issues with Linux file systems will likely move further and further down market over time.
If you disagree, try it yourself. Go mkfs a 500 TB ext-3/4 or other Linux file system, fill it up with multiple streams of data, and add and remove files for a few months with, say, 20 GB/sec of bandwidth from a single large SMP server. Then crash the system, fsck it, and tell me how long it takes. Does the I/O performance stay consistent during those few months of adding and removing files? Does the file system perform well with 1 million files in a single directory and 100 million files in the file system?
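For readers who want a starting point for the single-directory part of that experiment, here is a heavily scaled-down sketch. It is an assumption-laden illustration, not a benchmark: the function name and file naming scheme are made up for this example, and it only times creating, listing and removing empty files in one directory. It does not exercise streaming bandwidth, fragmentation over months, or fsck time.

```python
import os
import tempfile
import time


def time_directory_ops(directory: str, n_files: int) -> dict:
    """Create, list, and remove n_files empty files in one directory,
    returning the wall-clock time of each phase in seconds."""
    timings = {}

    # Phase 1: create n_files empty files.
    start = time.perf_counter()
    for i in range(n_files):
        with open(os.path.join(directory, f"f{i:09d}"), "wb"):
            pass
    timings["create_s"] = time.perf_counter() - start

    # Phase 2: walk the directory once and count the entries.
    start = time.perf_counter()
    with os.scandir(directory) as entries:
        timings["listed"] = sum(1 for _ in entries)
    timings["list_s"] = time.perf_counter() - start

    # Phase 3: remove every file that was created.
    start = time.perf_counter()
    for i in range(n_files):
        os.remove(os.path.join(directory, f"f{i:09d}"))
    timings["remove_s"] = time.perf_counter() - start

    return timings


if __name__ == "__main__":
    # A temporary directory keeps the sketch safe to run anywhere;
    # point it at a directory on the file system under test instead.
    with tempfile.TemporaryDirectory() as scratch:
        print(time_directory_ops(scratch, 10_000))
```

Running it with progressively larger counts on the file system under test, and watching whether the per-file cost stays flat, gives a rough first look at directory scaling before committing to the full-size experiment.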
My guess is the exercise would prove my point: Linux file systems have scaling issues that need to be addressed before 100 TB environments become commonplace. Addressing them now without rancor just might make Linux everything its proponents have hoped for.