Filesystem System Call

User Perspective

  • Studying how we interact with the filesystem as users will inform how we interact with it as programmers.

  • As users, we can run ls to list the files in a directory. The -a flag shows all files (even hidden ones), and -l shows more details about each file.

    cs103@stickmind:~/assign1$ ls -al
    total 42
    drwx------ 5  cs103 operator 2048 Jan 9 08:41 .
    drwxr-xr-x 51 cs103 operator 6144 Jan 9 15:12 ..
    drwx------ 8  cs103 operator 2048 Jan 9 15:12 .git
    -rw------- 1  cs103 operator 259  Jan 5 15:31 .gitignore
    -rw------- 1  cs103 operator 1750 Jan 9 15:12 imdb.cc
    -rw------- 1  cs103 operator 3501 Jan 5 15:31 imdb.h
    -rw------- 1  cs103 operator 6439 Jan 5 15:31 imdbtest.cc
    -rw------- 1  cs103 operator 1720 Jan 5 15:31 imdb-utils.h
    -rw------- 1  cs103 operator 964  Jan 5 15:31 Makefile
    drwx------ 2  cs103 operator 2048 Jan 5 15:31 .metadata
    -rw------- 1  cs103 operator 2146 Jan 5 15:31 path.cc
    -rw------- 1  cs103 operator 4122 Jan 5 15:31 path.h
    -rw------- 1  cs103 operator 1829 Jan 5 15:31 search.cc
    drwx------ 2  cs103 operator 2048 Jan 5 15:31 tools
                                              
    1. Type and permissions

    2. Hard Links

    3. Owner name

    4. Group name

    5. Size (bytes)

    6. Last modified time

    7. Filename

    8. . means the current directory

    9. .. means the parent directory

File Permissions

  • In the example below, the owner has read, write, and execute permissions, while the group and all other users each have only read and execute permissions.

     rwx   r-x   r-x
      |     |     |
    owner group other
    
  • The filesystem represents permissions in binary (1 or 0 for each permission option):

    • e.g. for permissions above: 111 101 101

    • we can further convert each group of 3 into one base-8 digit

      • base 8: 7 5 5

    • so, the permissions for the file would be 755
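
The conversion above can be sketched in C. This is a minimal illustration, not part of any library: permsToMode is a hypothetical helper name, and it assumes a plain 9-character string made only of r, w, x and - (no special bits such as setuid).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: converts a 9-character permission string such as
 * "rwxr-xr-x" into its numeric mode. Returns -1 if the string is malformed. */
int permsToMode(const char *perms) {
    assert(perms != NULL);
    if (strlen(perms) != 9) return -1;

    const char expected[] = "rwxrwxrwx";  /* letter allowed at each position */
    int mode = 0;
    for (int i = 0; i < 9; i++) {
        mode <<= 1;                       /* make room for the next bit */
        if (perms[i] == expected[i]) {
            mode |= 1;                    /* permission granted: bit is 1 */
        } else if (perms[i] != '-') {
            return -1;                    /* neither the expected letter nor '-' */
        }
    }
    return mode;                          /* three bits per octal digit */
}
```

For example, permsToMode("rwxr-xr-x") returns 0755 (an octal literal in C), matching the 7 5 5 worked out above.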


Programmer Perspective

System Calls

  • Functions to interact with the operating system are part of a group of functions called system calls.

  • A system call is a public function provided by the operating system.

  • The operating system handles these tasks because they require special privileges that we do not have in our programs.

  • The operating system kernel actually runs the code for a system call, completely isolating the system-level interaction from your (potentially harmful) user program.

  • We are going to examine the system calls for interacting with files. When writing production code, you will often use higher-level methods that build on these (like C++ streams or FILE *), but let’s see how they work!

  • A function that a program can call to open a file:

    int open(const char *pathname, int flags);
    
    • pathname: the path to the file you wish to open

    • flags: a bitwise OR of options specifying the behavior for opening the file

    • the return value is a file descriptor representing the opened file, or -1 on error

  • Many possible flags (see man page for full list).

    You must include exactly one of the following flags. These say how you will use the file in this program:

    • O_RDONLY: read only

    • O_WRONLY: write only

    • O_RDWR: read and write

    Optional:

    • O_TRUNC: if the file exists already, clear it (“truncate it”)

    • O_EXCL: used together with O_CREAT; the file must be created from scratch, and open fails if it already exists.

  • A function that a program can call to open (and potentially create) a file:

    int open(const char *pathname, int flags, mode_t mode);
    
    • mode: the permissions to attempt to set for a created file, e.g. 0644 (octal!)

    Optional:

    • O_CREAT: You can also create a new file if the specified file doesn’t exist, by including O_CREAT as one of the flags. You must also specify a third mode parameter.
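
As a sketch of how the two forms of open might be used together, the helpers below (createNewFile and openForReading are hypothetical names chosen for illustration) combine the flags described above. The 0644 mode is just the example value from the text.

```c
#include <assert.h>
#include <fcntl.h>      /* open and the O_* flags */
#include <stddef.h>
#include <unistd.h>     /* close */

/* Hypothetical helper: creates a brand-new file for writing.
 * Returns a file descriptor, or -1 on error (e.g. the file already exists). */
int createNewFile(const char *pathname) {
    assert(pathname != NULL);
    /* O_CREAT | O_EXCL: create the file, and fail if it already exists.
     * 0644 (octal) requests rw- for the owner and r-- for group and other. */
    return open(pathname, O_WRONLY | O_CREAT | O_EXCL, 0644);
}

/* Hypothetical helper: opens an existing file for reading only.
 * No mode argument is needed because no file is created. */
int openForReading(const char *pathname) {
    assert(pathname != NULL);
    return open(pathname, O_RDONLY);
}
```

A caller would check the result against -1 before using the descriptor, and pass it to close when finished.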

Aside

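How are there multiple signatures for open in C, a language without function overloading? open is declared as a variadic function, int open(const char *pathname, int flags, ...), and it reads the mode argument only when the flags call for it (e.g. O_CREAT). See “open() system call polymorphism” on Stack Overflow.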

  • A function that a program can call to close a file when done with it.

    int close(int fd);
    
    • fd: the file descriptor you’d like to close.

    • Returns: 0 on success, -1 on error (we usually won’t error-check close)

  • It’s important to close files when you are done with them to preserve system resources.

  • You can use valgrind to check if you forgot to close any files (--track-fds=yes).

  • A function that a program can call to attempt to read up to count bytes from an open file referenced by fd into the buffer starting at buf.

    ssize_t read(int fd, void *buf, size_t count);
    
    • fd: the file descriptor for the file you’d like to read from

    • buf: the memory location where the read-in bytes should be put

    • count: the number of bytes you wish to read

    • The function returns -1 on error, 0 at end of file, or otherwise the number of bytes that were read. Assuming count > 0, it never returns 0 unless it is at end of file.

Key idea

  • read may not read all the bytes you ask it to! This is not necessarily an error - e.g. if there aren’t that many bytes available, or if it is interrupted. The return value tells you how many were actually read. If we must have all the bytes, we can keep calling read until we have them.

  • The operating system keeps track of where in a file a file descriptor is reading from. So the next time you read, it will resume where you left off.
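
One way to "call read more" is a loop like the sketch below. readFully is a hypothetical helper name, and retrying when errno is EINTR is one common convention for handling interruption, assumed here rather than taken from these notes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: keeps calling read until count bytes have arrived,
 * end of file is reached, or an error occurs.
 * Returns the number of bytes actually read, or -1 on error. */
ssize_t readFully(int fd, void *buf, size_t count) {
    assert(buf != NULL);
    char *dest = buf;
    size_t total = 0;
    while (total < count) {
        /* Ask only for what is still missing, placed after what we have. */
        ssize_t n = read(fd, dest + total, count - total);
        if (n == -1) {
            if (errno == EINTR) continue;   /* interrupted: try again */
            return -1;
        }
        if (n == 0) break;                  /* end of file */
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

Because the operating system remembers the position for the descriptor, each call picks up where the previous one stopped; the loop only has to track where in buf the next bytes belong.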

  • A function that a program can call to write up to count bytes from the buffer starting at buf to an open file referenced by fd.

    ssize_t write(int fd, const void *buf, size_t count);
    
    • fd: the file descriptor for the file you’d like to write to

    • buf: the memory location storing the bytes that should be written

    • count: the number of bytes you wish to write from buf

    • The function returns -1 on error, or otherwise the number of bytes that were written (nonzero assuming count > 0)

Key idea

  • write may not write all the bytes you ask it to! This is not necessarily an error - e.g. if there is not enough space, or if it is interrupted. The return value tells you how many were actually written. If we must write all the bytes, we can keep calling write until they are all written.

  • The operating system keeps track of where in a file a file descriptor is writing to. So the next time you write, it will write to where you left off.
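
The same idea applies to write. The sketch below uses a hypothetical helper name, writeFully, and again assumes that retrying on EINTR is acceptable for the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: keeps calling write until all count bytes from buf
 * have been written, or an error occurs.
 * Returns count on success, or -1 on error. */
ssize_t writeFully(int fd, const void *buf, size_t count) {
    assert(buf != NULL);
    const char *src = buf;
    size_t total = 0;
    while (total < count) {
        /* Write only the bytes that have not gone out yet. */
        ssize_t n = write(fd, src + total, count - total);
        if (n == -1) {
            if (errno == EINTR) continue;   /* interrupted: try again */
            return -1;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

Note that on error this sketch does not report how many bytes were written before the failure; a caller that needs that information would have to track it separately.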

File descriptors

  • A file descriptor is like a “ticket number” representing your currently-open file.

    • It is a unique number assigned by the operating system to refer to that instance of that file in this program.

    • Each program has its own file descriptors.

    • When you wish to refer to the file (e.g. read from it, write to it) you must provide the file descriptor.

    • File descriptors are assigned in ascending order (next FD is lowest unused)

  • File descriptors are just integers - for that reason, we can store and access them just like integers.

    • If you’re interacting with many files, it may be helpful to have an array of file descriptors.

  • There are 3 special file descriptors provided by default to each program:

    • 0: standard input (user input from the terminal) - STDIN_FILENO

    • 1: standard output (output to the terminal) - STDOUT_FILENO

    • 2: standard error (error output to the terminal) - STDERR_FILENO

  • File descriptors are a powerful abstraction for working with files and other resources. They are used for files, networking and user input/output!
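
The "lowest unused" rule can be observed directly. In the sketch below, lowestDescriptorIsReused is a hypothetical name, and the reasoning assumes a single-threaded program in which nothing else opens or closes descriptors in between.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical demo: opens the same file twice, closes the first descriptor,
 * then opens the file again. Returns 1 if the new descriptor has the same
 * number as the one that was closed, 0 if not, and -1 on error. */
int lowestDescriptorIsReused(const char *pathname) {
    assert(pathname != NULL);
    int first = open(pathname, O_RDONLY);
    if (first == -1) return -1;
    int second = open(pathname, O_RDONLY);    /* next lowest unused number */
    if (second == -1) {
        close(first);
        return -1;
    }
    close(first);                             /* first's number is free again */
    int third = open(pathname, O_RDONLY);     /* should take first's number */
    int reused = (third == first);
    if (third != -1) close(third);
    close(second);
    return reused;
}
```

Since file descriptors are plain ints, the function stores and compares them like any other integers.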

Operating System Data Structures

Thinking

What is a file descriptor really mapping to behind the scenes? How does the operating system manage open files?

All of these data structures are private to the operating system. They are layered on top of the filesystem data itself.

[Figure: fds.png]

  • For each active process, Linux maintains a data structure called a process control block (PCB): a set of relevant information about the process’s execution (user who launched it, CPU state, etc.). These blocks live in the process table.

  • A process control block also stores a file descriptor table. This is a list of info about the open files/resources for this process.

  • A file descriptor (used by your program) is a small integer that’s an index into the file descriptor table.

    • Descriptors 0, 1, and 2 are standard input, standard output, and standard error, but there are no predefined meanings for descriptors 3 and up. When you run a program from the terminal, descriptors 0, 1, and 2 are most often bound to the terminal.

  • A file descriptor is the identifier needed to interact with a resource (most often a file) via system calls (e.g., read, write, and close)

  • A name has semantic meaning and an address denotes a location, but an identifier has no inherent meaning

    • /etc/passwd vs. 34.196.104.129 vs. file descriptor 5

  • Many system calls allocate file descriptors

    • open: open a file

    • pipe: create two unidirectional byte streams (one read, one write) between processes

    • accept: accept a TCP connection request, returns descriptor to new socket

  • When allocating a new file descriptor, kernel chooses the smallest available number

    • These semantics are important! If you close stdout (1) and then open a file, the file is assigned file descriptor 1 and so acts as stdout (this is how $ cat in.txt > out.txt works)

  • An entry in the file descriptor table is really a pointer to an entry in another table, the open file table.

  • The open file table is a single array of file table entries, which hold the information about open files across all processes.

  • Multiple file descriptor entries (even across processes!) can point to the same open file table entry.

  • An open file table entry stores changing info like “cursor” (how far into file are we?)

  • E.g., a file table entry (for a regular file) keeps track of a current position in the file

    • If you read 1000 bytes, the next read will be from 1000 bytes after the preceding one

    • If you write 380 bytes, the next write will start 380 bytes after the preceding one

  • This structure allows the OS to share resources across processes. If you want multiple processes to write to the same log file and have the results be intelligible, then you have all of them share a single file table entry: their calls to write will be serialized and occur in some linear order

  • Each process maintains its own descriptor table, but there is one, system-wide open file table. This allows for file resources to be shared between processes, as we’ve seen

  • As drawn above, descriptors 0, 1, and 2 in each of the three PCBs alias the same three open files. That’s why each of the referenced table entries has a refcount of 3 instead of 1.

  • This shouldn’t surprise you. If your bash shell calls make, which itself calls g++, each of them inserts text into the same terminal window: those three files could be stdin, stdout, and stderr for a terminal

  • Each open file entry has a pointer to a vnode, which is a structure housing static information about a file or file-like resource.

  • The vnode is the kernel’s abstraction of an actual file: it includes information on what kind of file it is, how many file table entries reference it, and function pointers for performing operations.

  • A vnode’s interface is file-system independent, but its implementation is file-system specific; any file system (or file abstraction) can put state it needs to in the vnode (e.g., inode number)

  • The term vnode comes from BSD UNIX; in Linux source it’s called a generic inode

  • vnodes live in the vnode table, a single table referenced by all open file table entries.

  • There is one system-wide vnode table for the same reason there is one system-wide open file table. Independent file sessions reading from the same file don’t need independent copies of the vnode. They can all alias the same one.

  • These resources are all freed over time:

    • Free a file table entry when the last file descriptor closes it

    • Free a vnode when the last file table entry is freed

    • Free a file when its reference count is 0 and there is no vnode

  • None of these kernel-resident data structures are visible to users. Note the filesystem itself is a completely different component, and that filesystem inodes of open files are loaded into vnode table entries. The yellow inode in the vnode is an in-memory replica of the yellow sliver of memory in the filesystem.
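
Although the tables themselves are hidden, the sharing of the cursor in an open file table entry can be observed from a user program. The sketch below relies on two system calls not covered in these notes, so treat them as assumptions of the example: dup, which makes a second descriptor pointing at the same open file table entry, and lseek, which (called with offset 0 and SEEK_CUR) reports the current cursor. observeCursors is a hypothetical name.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical demo: reads 4 bytes through one descriptor, then reports the
 * cursor seen through (a) a dup'ed descriptor, which shares the same open
 * file table entry, and (b) a descriptor from a separate open call, which
 * has its own entry. Returns 0 on success, -1 on error. */
int observeCursors(const char *pathname, off_t *dupOffset, off_t *reopenOffset) {
    assert(pathname != NULL && dupOffset != NULL && reopenOffset != NULL);
    int original = open(pathname, O_RDONLY);
    int alias = (original == -1) ? -1 : dup(original);  /* same table entry */
    int independent = open(pathname, O_RDONLY);         /* new table entry */
    int status = -1;

    if (original != -1 && alias != -1 && independent != -1) {
        char scratch[4];
        /* Reading through `original` advances the cursor in its entry. */
        if (read(original, scratch, sizeof(scratch)) != -1) {
            *dupOffset = lseek(alias, 0, SEEK_CUR);
            *reopenOffset = lseek(independent, 0, SEEK_CUR);
            status = 0;
        }
    }
    if (original != -1) close(original);
    if (alias != -1) close(alias);
    if (independent != -1) close(independent);
    return status;
}
```

On a file of at least 4 bytes, the dup'ed descriptor reports an offset of 4 because it shares the entry that was advanced, while the independently opened descriptor still reports 0. Both entries would nonetheless point to the same vnode.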

How are System Calls Made?

  • Key idea: the OS performs private, privileged tasks that regular user programs cannot do, with data that user programs cannot access.

  • Problem: because of this, we can’t have system calls behave like regular function calls - there are security risks to having OS data in user-accessible memory!

Function Call Semantics

[Figure: stack.png]

  • Refresher: for a normal function call, the stack grows downwards to add a new stack frame. Parameters are passed in registers like %rdi and %rsi, and the return value (if any) is put in %rax.

  • This means stack frames are adjacent, and can in theory be manipulated via pointer arithmetic when they’re not supposed to.

    • E.g. loadFiles can poke around in main’s stack frame, or main can poke around in the values left behind by loadFiles after it finishes.

    • Functions are supposed to be modular, but the function call and return protocol’s support for modularity and privacy is pretty soft.

  • Solution: a range of addresses will be reserved as “kernel space”; user programs cannot access this memory. Instead of using the user stack and memory space, system calls will use kernel space and execute in a “privileged mode”. But this means function calls must work differently!

System Call Semantics

  • New approach for calling functions if they are system calls:

    • put the system call “opcode” in %rax (e.g. 0 for read, 1 for write, 2 for open, 3 for close, and so forth). Each has its own unique opcode.

    • put up to 6 parameters in normal registers except for %rcx (use %r10 instead)

    • store the address of the next user program instruction in %rcx instead of %rip

    • The syscall assembly instruction triggers a software interrupt that switches execution over to privileged (“kernel”) mode.

    • The system call executes in privileged mode and puts its return value in %rax, and returns (using iretq, “interrupt” version of retq)

    • If %rax is negative, the global errno is set to abs(%rax), and %rax is changed to -1.

    • The system transfers control back to the user program.
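
On Linux, the C library exposes this mechanism through a generic syscall function that takes the system call number followed by its arguments. The sketch below assumes Linux with glibc; rawWrite is a hypothetical name, and SYS_write is the library's constant for the write "opcode". The register setup and the errno conversion described above happen inside syscall itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE        /* assumption: needed on glibc to declare syscall */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write: the system call number for write */
#include <unistd.h>        /* syscall */

/* Hypothetical helper: performs a write by system call number rather than
 * through the usual write wrapper. Returns the number of bytes written, or
 * -1 with errno set, just as the steps above describe. */
long rawWrite(int fd, const void *buf, size_t count) {
    assert(buf != NULL);
    return syscall(SYS_write, fd, buf, count);
}
```

Calling rawWrite with an invalid descriptor such as -1 returns -1 and leaves EBADF in errno: the kernel reported a negative value, and the library converted it into the -1/errno convention.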

Filesystem Recap

  • User programs interact with the filesystem via file descriptors and the system calls open, close, read and write.

  • The operating system stores a per-process file descriptor table with pointers to open file table entries containing info about the open files

  • The open file table entries point to vnodes, which cache inodes

  • inodes are file system representations of files/directories. We can look at an inode to find the blocks containing the file/directory data.

  • Inodes can use indirect addressing to support large files/directories

  • Key principles: abstraction, layers, naming

[Figure: fs.png]