Concepts and Terminology for Disks and Filesystems

This case study uses the terminology from the most excellent IRIX Admin: Disks and Filesystems IRIS InSight book by Susan Ellis, Dany Galgani, and Gloria Ackley, which you'll find at the above link and also in eoe.books.IA_DiskFiles. Refer to the "Logical Volume Concepts" chapter for some good pictures.

Here we will summarize the terminology from that InSight book and fill in many additional assumptions and facts that are relevant to this case study.

This document ignores efs. This document ignores lv, which is an obsolete efs-based logical volume scheme.

Disks and Partitions

A physical disk device that sits on a bus (usually a SCSI bus) and has one bus ID is called a disk. Each hard drive inside your SGI is one disk. A disk array is an external device which may contain many hard drives, but looks like one disk (has one ID) to a connected SGI machine. Internally the array may have many of the concepts we'll describe below (striping, plexing, etc.) but to the SGI, it's just one disk (usually a really fast one). A disk enclosure or disk vault is an external device which also contains disk drives, but merely acts as an extension of your SCSI bus; each disk drive has a separate ID to the SGI machine, and thus is a different disk.

You divide a disk into partitions with fx. You can see how a disk is partitioned with prtvtoc or fx.

XLV Logical Volumes

Although you can access a partition directly, video disk I/O often demands more storage or bandwidth than one disk partition can deliver. In this case you use XLV to group many partitions together into a logical volume using xlv_make. Here is the key example picture, stolen outright from the IRIX Admin: Disks and Filesystems IRIS InSight book:

Your disks, divided into partitions, appear at the bottom of this diagram. Like a file, a partition is an addressable object: it has a size, and you can think of reading or writing at any offset (or address) from zero up to its size. The lower three layers of XLV work by taking a group of addressable objects (partitions or other XLV addressable objects) and making them look like one addressable object. Each layer maps the address range of underlying objects into its own address range.

At the lowest layer of XLV, you group one or more partitions together into an addressable object called a volume element.

You can logically abut partitions by creating a multipartition volume element, meaning that as you access the volume element from beginning to end, you move across the first partition from beginning to end, then the second partition from beginning to end, etc. But this is boring.
Much more interesting is to create a striped volume element consisting of partitions on different disks, as is done within the real-time subvolume in the diagram above. Striping allows you to record and play uncompressed video, whose bandwidth we described in How Big is Video?, even when no single disk can sustain that throughput. Say you have a striped volume element with N partitions on N disks. If you were to access the volume element from beginning to end one byte at a time, you would access the first S bytes from the first disk, then the first S bytes from the second disk, and so on until you ran out of disks. Then you would access the second S bytes of the first disk, the second S bytes of the second disk, and so on. If instead you access the volume element with much larger reads and writes (at least N*S), XLV will be able to split up your request on these stripe boundaries and execute the pieces in parallel on each disk. This gives you up to an N-fold throughput improvement over one disk. You choose S, the volume element's stripe unit, when you create the volume element. In order to get the speedup, you must choose an appropriate stripe unit and I/O size for striped volume elements, as explained elsewhere in this document.
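
To make the striped address mapping concrete, here is a minimal Python sketch of the arithmetic described above. It is only a model of the mapping, not XLV's actual code: the function names locate_in_stripe and split_request are hypothetical, and offsets and the stripe unit S are expressed in bytes purely for illustration.

```python
def locate_in_stripe(offset, stripe_unit, num_disks):
    """Map a byte offset in a striped volume element to
    (disk index, byte offset within that disk's partition)."""
    if offset < 0 or stripe_unit <= 0 or num_disks <= 0:
        raise ValueError("offset must be >= 0; stripe_unit and num_disks must be > 0")
    chunk = offset // stripe_unit   # which S-byte chunk of the volume element
    disk = chunk % num_disks        # chunks rotate across the N disks
    row = chunk // num_disks        # full stripes (N*S bytes) that precede this chunk
    return disk, row * stripe_unit + offset % stripe_unit


def split_request(offset, length, stripe_unit, num_disks):
    """Split one request into (disk, disk_offset, length) pieces on stripe
    boundaries; pieces that land on different disks could run in parallel."""
    pieces = []
    end = offset + length
    while offset < end:
        disk, disk_offset = locate_in_stripe(offset, stripe_unit, num_disks)
        # A piece stops at the end of the current chunk or the end of the request.
        run = min(stripe_unit - offset % stripe_unit, end - offset)
        pieces.append((disk, disk_offset, run))
        offset += run
    return pieces
```

With S = 4 and N = 3, a 12-byte request at offset 0 (exactly N*S) splits into one S-byte piece per disk, which is the case where all N disks can work at once.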

At the next layer up, you concatenate the address range of one or more volume elements into the address range of a plex, as is done within the data subvolume in the diagram above. This performs a function very similar to multipartition volume elements, but one level higher. For example, if you are using striping and you want to create a volume that is larger than any individual striped volume element, you can concatenate several volume elements together into a plex by mapping them contiguously into the plex's address range.
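
Concatenation is a simpler mapping than striping: each volume element occupies a contiguous slice of the plex's address range. The sketch below models that lookup in Python. The name locate_in_plex is hypothetical, sizes are in bytes for illustration, and the sketch assumes a plex with no holes.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_in_plex(offset, element_sizes):
    """Map a byte offset in a plex to
    (volume element index, byte offset within that volume element)."""
    # ends[i] is the plex address just past the end of volume element i.
    ends = list(accumulate(element_sizes))
    if not ends or not 0 <= offset < ends[-1]:
        raise ValueError("offset is outside the plex's address range")
    index = bisect_right(ends, offset)          # first element ending after offset
    start = ends[index - 1] if index else 0     # where that element begins in the plex
    return index, offset - start
```

For a plex built from volume elements of 100, 50, and 200 bytes, offset 100 is the first byte of the second volume element, so locate_in_plex(100, [100, 50, 200]) returns (1, 0).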

Then you specify one or more plexes for a logical subvolume. The plex (also called the mirror) is the level of redundancy. The diagram above includes a logical subvolume (the log subvolume in this case) with two plexes, meaning anything written to the logical subvolume will be replicated in each plex for reliability. For our purposes, a logical subvolume's address range is the same as that of each of its plexes. If you have a logical subvolume with more than one plex, XLV sometimes lets you use plexes with "holes" in their address range that do not map to any volume element, but this is beyond the scope of this document.

The logical subvolume is the highest level of addressable object in XLV. A logical volume is a collection of separately addressed subvolumes, consisting of a data subvolume, an optional real-time subvolume, and an optional log subvolume. These subvolumes are like three different files: they have independent sizes and address ranges and are not logically concatenated, striped, or replicated. We'll describe the purpose of these subvolumes below.
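
If it helps to see the layers as data, here is a rough Python model of the hierarchy and how sizes roll up from partitions to subvolumes. All class names are hypothetical and this is not how XLV represents volumes internally. Two simplifications are assumed: a striped volume element uses the same amount of each partition (the size of its smallest partition), and every plex of a subvolume covers the whole subvolume with no holes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VolumeElement:
    partition_sizes: List[int]   # sizes of the member partitions, in bytes
    striped: bool = False        # False means partitions are logically abutted

    def size(self) -> int:
        if self.striped:
            # Assumption: every stripe member contributes as much as the smallest.
            return len(self.partition_sizes) * min(self.partition_sizes)
        return sum(self.partition_sizes)


@dataclass
class Plex:
    elements: List[VolumeElement]   # concatenated into the plex's address range

    def size(self) -> int:
        return sum(element.size() for element in self.elements)


@dataclass
class Subvolume:
    plexes: List[Plex]   # more than one plex means the data is replicated

    def size(self) -> int:
        # Assumption: no holes, so any plex gives the subvolume's address range.
        return self.plexes[0].size()


@dataclass
class LogicalVolume:
    data: Subvolume
    log: Optional[Subvolume] = None        # optional
    realtime: Optional[Subvolume] = None   # optional
```

Note that LogicalVolume has no size() of its own: the three subvolumes are separately addressed, so there is no single address range to measure.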

This document makes the following simplifying assumptions about your XLV setup:

Each of your logical subvolumes has one plex that encompasses the entire subvolume. Therefore your volume has no redundancy.
Each of your plexes has one or more concatenated volume elements that cover the entire range of the plex with no holes.
Each of your volume elements consists of either a single partition, or several partitions striped together. There's no real need to discuss multipartition volume elements.

Accessing Partitions and Logical Volumes

When you access disks from IRIX tools or your program, you access either partitions or logical volumes. There are two ways you can access either of these objects:

Normally you use mkfs to create an XFS filesystem on the partition or logical volume, specifying the device file for the partition (/dev/dsk/dks*) or logical volume (/dev/dsk/xlv/*). Then you mount the filesystem (again providing the device file) and use it.
You can also open the raw device file for the partition (/dev/rdsk/dks*) or logical volume (/dev/rdsk/xlv/*) and directly read() and write() its raw bits. Some video applications choose this option because they want to roll their own filesystem. When you access an XLV logical volume with the raw device file, you are accessing the data subvolume. There is not currently a way to access the log or real-time subvolumes of an XLV volume with a raw device file. Whenever you access a raw device file, you must follow certain disk alignment, memory alignment, and I/O size rules. We'll go over those in Software Methods for Disk I/O.

This document assumes that your application accesses one filesystem or one raw device file to do its video I/O.

XFS Filesystems

An XFS filesystem has three separately addressed sections: a log section, a data section, and an optional real-time section.

When you create an XFS filesystem on a single partition, mkfs divides the partition into two parts and uses one for the log section and one for the data section (this is called an internal log). Filesystems on single partitions never have real-time sections.
When you create an XFS filesystem on an XLV logical volume,
- mkfs creates the data section of the XFS filesystem on the XLV data subvolume.
- mkfs creates a real-time section for the XFS filesystem on the XLV real-time subvolume, if the subvolume is present.
- mkfs creates the log section of the XFS filesystem on the XLV log subvolume, if the subvolume is present (this is called an external log). Otherwise, mkfs creates the log section alongside the data section in the XLV data subvolume (internal log).
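
The placement rules above boil down to a small decision table. The following Python sketch restates them; place_sections is a hypothetical helper for illustration and does not call or imitate mkfs itself.

```python
def place_sections(has_log_subvolume, has_realtime_subvolume):
    """Return which XLV subvolume holds each XFS section, per the rules above."""
    placement = {"data": "data subvolume"}
    if has_log_subvolume:
        placement["log"] = "log subvolume (external log)"
    else:
        # No log subvolume: the log shares the data subvolume.
        placement["log"] = "data subvolume (internal log)"
    if has_realtime_subvolume:
        placement["real-time"] = "real-time subvolume"
    return placement
```

A volume with only a data subvolume therefore yields a filesystem with an internal log and no real-time section, just like a filesystem on a single partition.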

Here is more on the three sections:

The data section contains file data and the metadata normally associated with a UNIX filesystem (superblock, inodes, directories, extent tables, ...).
The optional real-time section is an alternate place where you can store file data. Unless otherwise specified, this document will assume that you have only a data and log section. The real-time section:
- has different blocksize/extent properties which are often useful for video disk I/O to a striped XLV volume. We'll discuss these elsewhere in this document.
- contains no metadata; inode and extent information for a file in a real-time section is stored in the filesystem's data section. Therefore, the XLV real-time subvolume that contains the real-time section may include disks on which you have disabled retries. This provides a tradeoff between reliability (writes are unreliable, reads can fail) and latency (the disk only ever tries a read or write once, reducing the worst-case probabilistic command completion time) which is important for some applications.
A file's data is either stored in the data section (the normal case) or the real-time section. Creating files in the real-time section requires specially written code (one or more XFS-specific fcntl()s). Reading from, writing to, or accounting for those files (stat() and statvfs()) also requires special code. Standard UNIX tools like ls, df or du require special IRIX-specific flags to tell you about the real-time section storage of a file. Very few current GUI tools can robustly deal with files in the real-time section.
As you make changes to your filesystem that require updates to the filesystem metadata, XFS makes a low-bandwidth log of those changes in the log section. This log greatly increases the speed and likelihood of recovering your filesystem if your machine crashes (the log is why fsck is no more). XFS lazily updates the real metadata in the data section based on the changes described in the log section.

If you need to stripe disks together to get enough bandwidth to read and write uncompressed video, you want to create striped volume elements in either the data or real-time subvolumes of your XLV logical volume, since that is where your file data will get stored when you create an XFS filesystem on the XLV logical volume.
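
As a rough planning aid, you can estimate the minimum stripe width from the "up to N-fold" rule. The Python sketch below does that arithmetic; disks_needed is a hypothetical helper, and the figures in the usage note are made-up illustration values, not measurements. Because the N-fold speedup is only an upper bound, treat the result as a lower bound on the number of disks.

```python
def disks_needed(required_bytes_per_sec, per_disk_bytes_per_sec):
    """Smallest N such that N disks, at best-case N-fold speedup,
    could sustain the required bandwidth."""
    if required_bytes_per_sec <= 0 or per_disk_bytes_per_sec <= 0:
        raise ValueError("bandwidths must be positive")
    # Ceiling division on integers avoids floating-point rounding surprises.
    return -(-required_bytes_per_sec // per_disk_bytes_per_sec)
```

For example, if your video needed 21,000,000 bytes per second and each disk sustained 8,000,000 bytes per second (both hypothetical), disks_needed(21_000_000, 8_000_000) returns 3.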