Assignment 1. Reading Unix v6 File Systems

An archaeological expedition in Murray Hill, NJ has uncovered some magnetic disk drives from the mid 1970s. After considerable effort, the dig’s technical expert read the contents of the drives, and for each disk drive, she created a local file to be an exact bitwise copy of the disk’s data.

The archaeologists determined the data on the disks was stored using the Version 6 Unix file system. Sadly, none of the archaeologists made it through CS103, so they haven’t been able to read the image contents. Your task is to write a program that understands the Unix v6 file system to extract the file system data.

Learning Goals

  • To fully understand the design and implementation of a legacy filesystem as outlined in class

  • To fully grasp the benefit of layered decomposition in the design of a larger system that works to make bland hardware work to be a fully operational filesystem

  • To gain even more practice working with a multiple-file codebase

Overview

The key design principle we are trying to drive home this week is that of layering: one way to manage complexity in systems is to break it down into layers that sit on top of each other. There are several layers involved in the Unix v6 filesystem:

  • The block layer is among the lowest software layers. It manages the details of communication with the disk, enabling us to read or write sectors. This layer sits underneath almost every filesystem operation, and above this layer, we don’t want to have to think about the nitty-gritty of talking to the hardware.

  • The inode layer supplies higher layers with metadata about files. When we need to know which block number is storing a particular portion of a file, the inode layer tells us that. Above this layer, we don’t want to think about the mechanics of retrieving or updating metadata (which isn’t simple, considering inodes are smaller than sectors).

  • The file layer supplies higher layers with actual file contents. We request some number of bytes from a file at a particular offset, and the file layer populates a buffer with that information.

  • The filename layer allows us to find a file within a directory. Given a filename and a directory presumably containing that file, this layer figures out what inode stores the information for that file.

  • Lastly, the pathname layer implements full absolute path lookups, starting from the root directory. If you hand /usr/cs103/hello.txt to the pathname layer, it will utilize the filename layer beneath it, looping through the components of that full path, until it finds hello.txt.

Any software accessing the filesystem will sit on top of the file and pathname layers. If we’re trying to write an application that does meaningful work, we almost certainly don’t want to be bogged up thinking about the organization of bytes on disk, so we leave that to the aforementioned layers beneath us.

For this assignment, we give you a fully implemented block layer, and you’ll flesh out the others. By the end of the assignment, your program will be able to list and read files from a disk formatted with the Unix v6 filesystem.

Notes

Expect that it will take some time to figure out the starter code before you can actually start doing anything. We throw a lot of files at you, and it isn’t trivial to figure out what’s going on and what’s expected of you. This may be frustrating, but we want to push you to get better at reading code and finding your way around an existing system; such skills are critical if you want to work on systems comprised of hundreds of thousands or millions of lines of code. Take time to read through all the starter code and to figure out how the files all fit together.

There are several files in the starter code that you will need to jump back and forth between, and future assignments will also involve work with many files. It may help you to sketch out how the files interact on pencil and paper, or to make some notes for yourself somewhere. If you are using VSCode, command/ctrl-click will be very helpful for jumping between definitions and uses of structs and functions. If you are using vim, the ctags plugin may be helpful.

Unix v6 file system supplement

Section 2.5 of the Salzer and Kaashoek contains most of what you need to know about the Unix v6 file system in order to do this assignment. The information below is supplementary to the textbook, so it assumes you’ve already read and understand the material there.

Getting Started

To get started on the assignment, clone the starter project using the command:

$ cp -r assign1 assign1

This line creates a new folder called assign1 in your current working directory containing a copy of the starter code.

Building and Testing

To build the project, cd into the project directory and type

$ make

This will create an executable file diskimageaccess. To test out your code on one of the sample disk images, type

$ ./diskimageaccess <options> <diskimagePath>

You can visually compare your output to that generated by my own solution by typing

$ ./samples/diskimageaccess_soln <options> <diskimagePath>

The <diskimagePath> should be the path to one of the test disks located in ./samples/testdisks. We currently have three test disks: basicDiskImage, depthFileDiskImage, and dirFnameSizeDiskImage.

The <options> indicates which test to perform:

  • -i: test the inode and file layers. This test will read all of the inodes in order of i-number and print out the contents of the inode, along with a SHA-1 checksum that uniquely summarizes the contents of each file.

  • -p: test the directory and pathname layers. This test will walk the entire file system directory tree, starting at the root directory, and print out all of the paths that it finds, along with inode and checksum information for each file or directory found.

You can specify -ip to perform both tests. If you type the command

$ ./samples/diskimageaccess_soln -ip samples/testdisks/basicDiskImage

The expected output for running diskimageaccess on each disk image X is stored in the X.gold file inside the testdisks directory.

In fact, we’re leveraging even another CS102 tool: the sanitycheck tool. In particular, if you type

$ sanitycheck

at the command prompt while in your assign1 directory, the sanitycheck script will exercise your implementation in precisely the same way our own grading scripts will. We won’t always expose all of the functionality tests like we are for Assignment 1, but we are this time.

What We Provide

The files we provide you fall into three categories.

  • The Version 6 Unix header files (filsys.h, ino.h, direntv6.h) These are described below.

  • The test harness (diskimageaccess.c, chksumfile.[ch], unixfilesystems.[ch]) These files provide the infrastructure for building your code and running our tests against it.

  • File system module (diskimg.[ch], inode.[ch], file.[ch], pathname.[ch])

The test code we give you interfaces with the file system using a layered API that we’ve already designed. For each of the layers that you’ll need to implement for this assignment, there is a header file that defines the interface to the layer and a corresponding .c file that should contain the implementation.

The layers are:

  • Block layer (diskimg.h and diskimg.c) This defines and implements the interface for reading and writing sectors (note that the words block and sector are used interchangeably) on the disk image. We give you an implementation of this layer that should be sufficient for this assignment.

  • Inode layer (inode.h, inode.c) This defines and implements the interface for reading the filesystem’s inodes. This includes the ability to look up inodes by inumber and to get the block/sector number of the nth block of the inode’s data.

  • File layer (file.h and file.c) This defines and implements the interface for reading blocks of data from a file by specifying its inumber. This is one of the two layers our tests explicitly exercise.

  • Directory layer (directory.h and directory.c) This defines and implements the interface for implementing Unix directories on top of files. Its primary function is to get information about a single directory entry.

  • Pathname layer (pathname.h, pathname.c) This defines and implements the interface to look up a file by its absolute pathname. This is the second of the two layers we explicitly test.

You are welcome to modify any of the files we give you. When making changes in the test harness, do so in a backward compatible way so that the testing scripts we give you continue to work. In other words, it’s fine to add testing and debugging code, but make sure that when we run your program with -i and -p, its output has the same format with or without your changes, so our automated tests won’t break.

Header files and structures

In the starter code we’ve provided C structs corresponding to the filesystem’s on-disk data structures. These structs have the same layout in memory as the structures have on disk. They include:

struct filsys (filsys.h)

Corresponds to the superblock of the file system. This is a slightly modified copy of the header file from Version 6 Unix.

struct inode (ino.h)

Corresponds to a single inode. Again this comes from Version 6 of Unix, with some small modifications.

struct direntv6 (direntv6.h)

Corresponds to a directory entry. This was copied from section 2.5 in the textbook.

In addition, unixfilesystem.h contains a description of the file system layout, including the sector address of the superblock and of the start of the inode region.

Legacy of an old machine

Back in the 1970s, storage space — both on disk and in main memory — was at a premium. As a result, the Unix v6 file system goes to lengths to reduce the size of data it stores. You’ll notice that many integer values in the structs we provided are stored using only 16 bits, rather than today’s more standard 32 or 64. (In our code we use the type uint16_t from stdint.h to get a 16-bit integer, but back in the ‘70s, the C int type was 16 bits wide.)

In another space-saving move, the designers of the file system stored the inode’s size field as a 24-bit integer. There’s no 24-bit integer type in C, so we represent this value using two fields in the inode struct: i_size1, which contains the least-significant 16 bits of the value, and i_size0, which contains the most-significant 8 bits of the value. We provide a function inode_getsize in inode.c that assembles these two fields into a normal C integer for you.

The first inode

Since there is no inode with an inumber of 0, the designers of the file system decided not to waste the 32 bytes of disk space to store it. The first inode in the first inode block has inumber 1; this inode corresponds to the root directory for the file system. (See unixfilesystem.h for details.)

Be careful not to assume that the first inode has an inumber of 0. Off-by-one errors are the worst.

inode’s i_mode

The 16-bit integer i_mode in the inode struct isn’t really a number; rather, the individual bits of the field indicate various properties of the inode. ino.h contains #define which describe what each bit means.

For instance, we say an inode is allocated if it points to an existing file. The most-significant bit (i.e. bit 15) of i_mode indicates whether the inode is allocated or not. So the C expression (i_mode & IALLOC) == 0 is true if the inode is unallocated and false otherwise.

Similarly, bit 12 of i_mode indicates whether the file uses the large file mapping scheme. So if (i_mode & ILARG) != 0, then the inode’s i_addr fields point at indirect and doubly-indirect blocks rather than directly at the data blocks.

Bits 14 and 13 form a 2-bit wide field specifying the type of file. This field is 0 for regular files and 2 (i.e. binary 10, or the constant IFDIR) for directories. So the expression (i_mode & IFMT) == IFDIR is true if the inode is a directory, and false otherwise.

Suggested Implementation Order

The starter code contains a number of unfinished functions. We suggest you attack them in the following order:

  • inode_iget and inode_indexlookup in inode.c.

  • file_getblock in file.c. After this step, your code should pass the inode level functionality tests.

  • directory_findname in directory.c.

  • pathname_lookup in pathname.c. Refer to Section 2.5.6 of the Salzer and Kaashoek book for this part. Afterward, all the pathname tests should pass.

Grading

We’ll be grading your submission for functionality and style, and this time—because the program is written in pure C—we expect valgrind reports to be clean. We’re exposing the vast majority of the tests via sanitycheck, but not all of them. We do promise, however, that the exposed tests will comprise at least 75% of the functionality score.

  • Code test results: 70%

  • Clean build and clean valgrind reports: 25%

  • Code quality: 5%

Our tests run your code on a set of test disk images and check that your program publishes the expected output. You have access to all of the images we’ll use for grading and the correct output for each of them.

Tips and Tidbits

This assignment requires some careful low-level C coding, so I thought I’d share a list of tips and tidbits to help you arrive at a working solution more quickly and elegantly. Virtually all of these bullet points are based on points of feedback the course staff has given countless students over the course of the past several years. Many of these will be incredibly helpful to the students who haven’t passed through our own CS102 course and may not be as comfy with pointers and memory management.

Here goes:

  • Check out unixfilesystem.h for the unixfilesystem definition and a few constants you need to use (in particular, INODE_START_SECTOR and ROOT_INUMBER are pretty important). Don’t hard code 2 and 1 in your code, even if it works. These constants make the implementation narrative much easier to follow.

  • You can largely ignore the struct filsys and all of the fields defined inside. The data structure is included for the purposes of the testing framework, but none of the code you write needs to deal with the struct filsys.

  • The implementation of inode_iget needs to search the inode table for a particular inode, and place a copy of that particular inode in the space addressed by inp. Define some of you own constants (I have two: INODES_PER_BLOCK for my inode_iget implementation, and a second called NUM_BLOCK_NUMS_PER_BLOCK for my inode_indexlookup implementation.)

  • When implementing inode_iget, past students have declared generic character buffers (e.g. char buffer[DISKIMG_SECTOR_SIZE]) to store the contents of a sector in the inode table, and then go on to use some of desperate type casting and manual pointer arithmetic to replicate the contents of an inode in the space addressed by inp.

    Don’t do that. Instead, declare an array of struct inode records—after all, you know the contents of sectors in the inode table store inode records, right?—and pass the address of that array to diskimg_readsector. That way, you can manipulate the contents of a strongly typed array of struct inode instead of a weakly typed array of char. Stated differently, a lot of you rely on the void * trickery you learned in CS102 here on the assumption that’s what we’re looking for, when if fact, there’s no reason for it. You only go with void *, manual pointer arithmetic, and memcpy when you don’t know the data types that are being stored, copied, and passed around. In the case of inode_iget, you know the sectors you’re examining contain arrays of struct inode.

    This coding style issue is the most common error so I want people to get it right this time before you even submit. You’ll just be happier with your work.

  • The implementation of inode_indexlookup needs to manage direct, singly indirect, and doubly indirect block numbers. Translating direct block numbers to sector numbers is trivial, but marshaling indirect block numbers to sector numbers is hard to get right.

    • Hint 1: Understand that indirect block numbers identify other blocks that contain 256 two-bytes integers (best managed via the uint16_t type).

    • Hint 2: Understand that doubly indirect block numbers lead to singly indirect block numbers, which lead to sector numbers. Make sure whatever code you write to translate singly indirect block numbers to sector numbers also contributes (without cutting and pasting or otherwise copying it) to the process used to ultimately convert doubly indirect block number to sector numbers.

  • file_getblock returns the number of meaningful bytes in a block/sector that contribute to the payload of some file (be it a regular file or a directory). If the file size happens to be a multiple of 512, then this function always returns 512. When a file isn’t a multiple of 512 and you’re to get the payload stored in the very last block, you should expect to return a number that’s less than 512 (because only a part of what’s stored in the block is relevant).

  • Your implementation of directory_findname should declare an array of struct direntv6’s instead of a character array of length 512. See discussion of struct inode arrays vsv. the unnecessarily error-prone void * approach above.

  • Your implementation of directory_findname should make use of strncmp instead of strcmp, because some of the stored directory names aren’t '\0'-terminated. They are, however, guaranteed to be of length 14 or less. Check out the man page for strncmp and strcmp so you understand the difference.

  • My own implementation of pathname_lookup uses strlen, strcpy, strsep, and directory_findname. Most of you probably haven’t used strsep before, but fortunately there’s a man page for it. Check it out.

  • My implementation of pathname_lookup is iterative, although in principle I’m fine with recursive implementations provided they’re clean and tidy. In general, you only use recursion over iteration when it buys you something—simplicity, code clarity, etc. My feeling is that iteration is just as easy and clear as the recursive implementation, but that’s more of an opinion than some larger truth, so go nuts with recursion if that’s your preference.

  • None of the code any of you need to write needs to rely on dynamic memory allocation (i.e. malloc, realloc, free).

  • Again, we will be running your code under valgrind this time, so make sure your implementation is devoid of memory leaks and access errors, else valgrind will snitch on you.


Enjoy!