Wednesday, March 17, 2010

Intro to IO Profiling of Applications

One of the sorely missing aspects of storage is analyzing and understanding the IO patterns of applications.

This article will examine some techniques for performing IO profiling of an application to illustrate what information you can gain.

One of the classic problems in designing storage solutions is that we don’t know what kind of IO performance applications need.

We may have some empirical knowledge, “switching to SSD’s improved by wall clock time by 25%,” or, “adding an extra drive improved my application performance by 11%.”

While this information is helpful to some degree what is missing is the understanding of why performance improved and by extension, what are the driving mechanisms behind the performance improvement.

Perhaps equally important for the application developers is that understanding the driving forces of IO performance for your application can be used to improve the IO performance, if needed.



This article is an introduction to using a common Linux system administration tool - strace.

Strace output from an application can be used to produce a huge amount of information about the IO behavior of your application.


Strace is your Friend
The goal behind IO profiling is to understand the application not necessarily to understand the storage system itself.

The information is to be about the application - how does it perform writes, how does it perform reads, how much lseek happens, what kind of IOPS performance does the application require, etc.

However, to obtain this information was can’t divorce ourselves completely from the storage system but we can at least gain something beyond an empirical knowledge.


The tool that we’re going to use is strace.

This tool is commonly used for understanding .so libraries that are added during runtime as well as gathering other debugging information about misbehaving applications.

What strace does is trace system calls and signals. Since the system is doing the IO strace can be used to trace what is happening with IO syscalls calls.

Moreover, you can use strace to gather additional information about the system calls (syscall) such as how much time was spent performing the syscall.

All of this information can be used to profile an application’s IO patterns.


Strace has a large number of options (just run “man strace”). Some of them are relevant to this discussion and most are not.

The specific options used in this article are:


% strace -T -ttt -o strace.out

The “-T” option cause strace to print out the elapsed time of the system call at the end of the line in the form, <0.000197>.

This is amount of time used to complete the syscall. The option “-ttt” causes microsecond timing but does so in seconds since the epoch (makes life easier for applications that run a long time).

The last option, “-o”, sends the output to a file - in this case “strace.out.”


Quick strace example
Let’s look at a quick example of using strace. Here is a simple C code that creates an array then dumps it to a file using the function fwrite().

#include 
#define BUFSIZE 100000
int main(int argc, int *argv[] )
{
   int i;
   float buf[BUFSIZE];
   char filename[128];
   FILE *myfile;

   for (i=0; i < BUFSIZE; i++)
      buf[i] = 2.5 + BUFSIZE * i;
   sprintf(filename, "testfile");
   myfile = fopen(filename, "w");
   fwrite(buf, sizeof(float), BUFSIZE, myfile);
   fclose(myfile);
}

The code takes an array of float variables and dumps the entire array to a file (”testfile”) using the fwrite function.


The code is built using gcc 4.3.2 (Ubuntu 4.3.2-lubuntul2) with no options. The strace output using the options “-T -ttt” is the following:


1267394331.467245 execve("./test1c", ["./test1c"], [/* 37 vars */]) = 0 <0.000211>
1267394331.467773 brk(0)                = 0×951e000 <0.000012>
1267394331.467862 access(”/etc/ld.so.nohwcap”, F_OK) = -1 ENOENT (No such file or directory) <0.000016>
1267394331.467973 mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb80d8000 <0.000014>
1267394331.468055 access(”/etc/ld.so.preload”, R_OK) = -1 ENOENT (No such file or directory) <0.000016>
1267394331.468137 open(”/etc/ld.so.cache”, O_RDONLY) = 3 <0.000017>
1267394331.468212 fstat64(3, {st_mode=S_IFREG|0644, st_size=121820, …}) = 0 <0.000012>
1267394331.468349 mmap2(NULL, 121820, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb80ba000 <0.000014>
1267394331.468405 close(3)              = 0 <0.000011>
1267394331.468480 access(”/etc/ld.so.nohwcap”, F_OK) = -1 ENOENT (No such file or directory) <0.000016>
1267394331.468573 open(”/lib/tls/i686/cmov/libc.so.6″, O_RDONLY) = 3 <0.000021>
1267394331.468655 read(3, “177ELF111���������3�3�1���340g1″…, 512) = 512 <0.000014>
1267394331.468750 fstat64(3, {st_mode=S_IFREG|0755, st_size=1425800, …}) = 0 <0.000012>
1267394331.468873 mmap2(NULL, 1431152, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7f5c000 <0.000014>
1267394331.468934 mmap2(0xb80b4000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0×158) = 0xb80b4000 <0.000022>
1267394331.469009 mmap2(0xb80b7000, 9840, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb80b7000 <0.000017>
1267394331.469078 close(3)              = 0 <0.000011>
1267394331.469152 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f5b000 <0.000014>
1267394331.469215 set_thread_area({entry_number:-1 -> 6, base_addr:0xb7f5b6b0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0 <0.000013>
1267394331.469382 mprotect(0xb80b4000, 8192, PROT_READ) = 0 <0.000016>
1267394331.469441 mprotect(0×8049000, 4096, PROT_READ) = 0 <0.000019>
1267394331.469506 mprotect(0xb80f5000, 4096, PROT_READ) = 0 <0.000014>
1267394331.469560 munmap(0xb80ba000, 121820) = 0 <0.000021>
1267394331.471100 brk(0)                = 0×951e000 <0.000012>
1267394331.471153 brk(0×953f000)        = 0×953f000 <0.000013>
1267394331.471224 open(”testfile”, O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 <0.000035>
1267394331.471325 fstat64(3, {st_mode=S_IFREG|0644, st_size=0, …}) = 0 <0.000013>
1267394331.471446 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb80d7000 <0.000013>
1267394331.471509 write(3, “ @@Q303G240PCHP|222HPP303HP$364H(|22I(346*I”…, 397312) = 397312 <0.001326>
1267394331.472913 write(3, “25240Nr30240N3333240N(36240N5!240NB$240N”…, 2688) = 2688 <0.000031>
1267394331.473014 close(3)              = 0 <0.000014>
1267394331.473067 munmap(0xb80d7000, 4096) = 0 <0.000018>
1267394331.473133 exit_group(0)         = ?

Sorry about the line wraps but I think you can read the output well enough.


There are a few interesting things to note about the strace output. First, the open() syscall is used even for .so libraries (sharable object - dynamic link libraries).

The lines 1 to 25 do a bunch of “setup” work to start the application, allocate memory, read and load any needed functions from any .so libraries.

Finally, on line 26 we get to the point where our application starts to do some IO. Line 26 opens the file and assigns the file descriptor “unit” of 3.

Then line 29 actually writes 397,312 bytes of data to the file and line 30 writes 2,688 bytes to the file. The sum of the two writes is 400,000 bytes which is what we expected given that the test system is 32 bits, we’re using floats, and we have an array of 100,000 elements (4 bytes each).

Finally, the application closes the file on line 31 and the application stops shortly after that line.


You can also see that for the write() syscalls, the first few bits of data that are being written inside the quotations in the strace output.

For this example, the data is binary so the data doesn’t much sense.

But if the data is text or non-binary data, then you might be able to read the first part of the data. Strace does not list all of the data being written to the file.

The very last number, after the equals sign, is the number of bytes actually written to the file descriptor. For this example, there are two write syscalls.

The first one writes 397,312 bytes and the second writes 2,688 bytes.

The total of the fwrite functions is 400,000 bytes (as it should be).


Another interesting observation, and one that I can’t explain, is why the compiler forces the single fwrite command into two write() syscalls.

Through some experimentation, if the number of float elements is set with anything up to 1,024 elements (4,096 bytes) then you will get a single write() syscall in the strace output.

However, for 1,025 elements and up (at least to 1,000,000 elements), you will get two write() syscalls in the strace output.

My only thoughts are that 4,096 Bytes (4KB) is the virtual memory pagesize as well as the sector size on the drive.

I’m sure it has something to do with the virtual memory pagesize but I haven’t been able to determine why (if you have any information please post it!).


Further Analysis of the strace Output

The really cool part of using strace is that you have a great deal of information at your finger tips. For example you have the amount of data written or read and the amount of time it took to perform a syscall.

You can use this information to compute the throughput of each individual write() or read() system call.

You can also add the amount of data written or read to produce the total amount of data.

In addition, you can count the number of read() and write() system calls in a given amount of time to compute the Read IOPS or Write IOPS.

Moreover, you can count the total number of IO syscalls to get the Total IOPS.

There is a great deal you can do with this information including performing a statistical analysis of the results.

You can look at the average size and standard deviation size of read() and write() syscalls.

You can look at the distribution of the “sizes” in the read() and write() syscalls (i.e. the amount of data in the syscall).

You can examine histograms (results as a function of time) for reading, writing, IOPS, movement of the file offset (lseek), etc.

In other words - there is a huge number of possibilities you can do with the data.

Let’s walk through the strace output and examine some of the statistics for the IO sycalls. First let’s examine the first write() syscall.

The elapsed time is 0.001326 seconds (1326 microseconds).

The amount of data written was 397,312 bytes.

So the throughput was,

397312 Bytes / 0.001326 secs. = 299,631,975.867 Bytes/s = 299.632 MB/s

Don’t forget that we are just measuring what the OS gives us from strace.

We don’t have any way to measure what is going in the storage system stack using strace (to my knowledge).

It is very likely that what we are seeing is the amount of time it took to write the data (0.4 MB) to the system cache or the disk cache - not the time it took to write the data to the drive.

But, perhaps most important, we are seeing the amount of time the application took to write the data.

Since we are interested in the IO behavior from the perspective of the application then this amount of time is very pertinent to our investigation.


We can do the same computations to the second write():
2688 Bytes / 0.000031 secs = 86,709,677.4194 Bytes/s = 86.7097 MB/s



The total elapsed time for the application is also easy to compute.

You just take the difference in the time from the time of the last syscall plus it’s elapsed time (line 33) and the beginning of the fist syscall (line 1).

Since the last syscall did not have any elapsed time then it’s just the time for the start of the last syscall:

1267394331.473133 - 1267394331.467245 = 0.005888 seconds (5,888 microseconds).



We can also compute the amount of time for all IO syscalls.

This gets a little more involved since we need to add the time for the open(), fstat64(), write(), and close() syscalls.

Skipping the open syscalls for files that the application does not specifically open or close, results in the following:

0.000035 (line 26) + 0.000013 (line 27) + 0.001326 (line 29) + 0.000031 (line 30) + 0.000014 (line 31) = 0.001419 seconds (1419 microseconds)

We can then determine how much time was spent doing IO versus everything else:

(0.001419 / 0.005888) * 100 = 24.1%

So for this application, almost 25% of the time is spent performing IO.

Since the application only does IO, except for filling up the array, 25% is quite understandable.


The last thing I will examine in this article is the IOPS.

The basic definition of Write IOPS is the number of write syscalls in a given amount time. It is usually measured in number of Write IO operationss per second.

The same is true for Read IOPS except for read() syscalls. And finally, the Total IOPS includes all IO syscalls.


The Write IOPS is fairly easy to compute. Since the application runs for less than 1 second, you just count the number of write() syscalls to get the Write IOPS.

Write IOPS = 2 IOPS

The Total IOPS is found in a similar manner by just counting the number of IO related operations.

Total IOPS = 5 IOPS

This application is almost as simple as you can get so these IOPS are very, very small.


Summary
Understanding the IO profile or IO behavior of your application is almost non-existent yet it is one of the most important things needed when designing a storage system.

If you understand how your application(s) is performing IO then you can determine the performance requirements of your storage.

How much capacity? How much throughput? How many IOPS? How much lseek capability? How much time is spent performing IO? And on and on.

Empirical evidence such as “adding a disk improved throughput by 11%”, can help but it falls short in truly understanding the storage requirements of your application.

Moreover, if the requirements aren’t to you liking (or your budget), then if you own the application and have the source, you can modify the application to adjust the requirements.


In this article strace is used as the mechanism to gather the IO system information from the application. Yes, running strace impacts the behavior of the application and if you dump the strace information to the same file system being used by the application, you are impacting the timing behavior.

But, to be bluntly honest, there are no other tools to gather IO system information (to my knowledge).

In addition, strace does not require the application to be modified to use an independent library for tracing IO syscalls nor does it require that a new library be installed to capture this information (which also impacts IO behavior).

Finally, strace is easy to use and can produce some extremely useful information (as we saw).


This article is the first in a series examining the use of strace to gather a great deal of IO statistics about an application.

It’s a quick introduction but I hope it shows that you can generate a great of information from simple computations.

The example application was trivial but I wanted to show how you go about computing statistics and IO measurements such as throughput and IOPS.


In the next article I will discuss how to use strace for MPI (Message Passing Interface) applications commonly encountered in HPC (High Performance Computing).

Future articles will also present tools for processing the strace output files since performing the math for several hundred thousand IO syscalls can get a little tedious (and time consuming).

No comments:

Post a Comment

Post a Comment