[Unit 2] File Operation

Sep 17, 2023

open() vs fopen() … What if 2 processes open files and get the same fd (file descriptor) ? In other words, if 2 fds have the same value, do they point to the same file ?

1. Overview

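An fd (file descriptor) is a non-negative integer: the index of an open file in the process's file descriptor table, not a count of open files
When a program starts, it normally already has 3 descriptors open, inherited from its parent (in a terminal session they usually refer to the terminal device under /dev)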

fd = 0 : represents stdin : keyboard
fd = 1 : represents stdout : screen
fd = 2 : represents stderr : screen

If we open a new file, we get the lowest unused number, typically fd = 3; the next file will get fd = 4, and so on

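#include <stdio.h>     // for using printf(), perror()
#include <sys/types.h> // for using O_XXX
#include <sys/stat.h>  // -- same --
#include <fcntl.h>     // for using open() and O_XXX
#include <unistd.h>    // for using write(), close(), fsync()
#include <string.h>    // for using strlen()

int main()
{
  const char* msg = "Hello World\n";
  int fd;
  
  // O_CREAT  : create the file if it does not exist (requires the mode argument)
  // O_WRONLY : write-only
  // O_TRUNC  : empty the file if it already exists
  // 0644     : rw-r--r-- permissions for a newly created file
  fd = open("mylog", O_CREAT | O_WRONLY | O_TRUNC, 0644);
  if (fd == -1)
  {
    perror("open");
    return 1;
  }
  printf("fd = %d\n", fd);

  if (write(fd, msg, strlen(msg)) == -1)
  {
    perror("write");
  }
  
  // sync the data of this file from RAM into the hard drive
  // (sync() would flush every file on the system instead)
  fsync(fd);

  close(fd);
  
  return 0;
}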

btnguyen@DESKTOP-UAIA29B:/mnt/h/DEVELOPER/Linux/prog$ gcc test.cpp -lstdc++ -o test
btnguyen@DESKTOP-UAIA29B:/mnt/h/DEVELOPER/Linux/prog$ ./test
fd = 3
btnguyen@DESKTOP-UAIA29B:/mnt/h/DEVELOPER/Linux/prog$ cat mylog
Hello World
btnguyen@DESKTOP-UAIA29B:/mnt/h/DEVELOPER/Linux/prog$

2. File table of process

Every cell in the “file descriptor table” is a pointer to a kernel structure in RAM (struct file) that describes one open file.
By default, when a process starts, the first 3 rows are already used for stdin, stdout, stderr
When we open a new file, the system looks for the lowest-numbered empty cell (one that doesn't point to any file), opens the file, stores the memory address of its struct file in that cell, then returns the index of the cell to us.

What if 2 processes open files and get the same fd ? In other words, if 2 fds have the same value, do they point to the same file ?

Each process has its own “file descriptor table”, so an fd value only has meaning inside the process that owns it
2 processes open the same file only when they open the same path (file name)
There is no relation between the file name and the index in the “file descriptor table”
The index only reflects which cell was free when the file was opened; it says nothing about the file name
If 2 processes both get fd = 5, it only means that cell 5 was the lowest free cell in each of them. We can't tell from the number which file is behind it: maybe they opened the same file, maybe not. Since the tables are separate, there is no conflict

3. Redirect the flow of stdin, stdout, stderr

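We could use < to redirect the stdin flow from a file, or | to feed the stdout of one program into the stdin of the next one
We could use > to redirect the stdout flow: instead of printing to the screen, the output goes to a file (2> does the same for stderr)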

#include <stdio.h> // for using printf()

int main()
{
  printf("Hello World\n");
  
  return 0;
}

btnguyen@DESKTOP-UAIA29B:/mnt/h/DEVELOPER/Linux/prog$ gcc test.cpp -lstdc++ -o test
btnguyen@DESKTOP-UAIA29B:/mnt/h/DEVELOPER/Linux/prog$ ./test > log
btnguyen@DESKTOP-UAIA29B:/mnt/h/DEVELOPER/Linux/prog$ cat log
Hello World
btnguyen@DESKTOP-UAIA29B:/mnt/h/DEVELOPER/Linux/prog$

4. Block devices

Most storage devices (e.g. hard drives) are block devices
A block device transfers data in fixed-size units (blocks); it can't read or write less than one block. For example, when we format a hard drive, the tool usually asks for the block size (e.g. 512 bytes, 1 KB, 4 KB, …), which is tied to the physical characteristics of the device

How does the hard drive work ?

The hard drive has a read/write head (the “eye reader”) and a rotating disk underneath
When we need to read some data (e.g. 1 byte), the drive determines where the data is located, moves the “eye reader” to the right track, and waits for the disk to rotate until the data passes under it
The disk doesn't stop there: it keeps rotating, and the “eye reader” may move to a new location to read data for other programs.
Because the data is laid out in fixed-size sectors that pass under the head at high speed, the drive can't pick out exactly 1 byte; it reads a whole range of nearby bytes. That's why there is a minimum read size (e.g. 512 bytes): the drive can only read in units of at least one sector.

What is the idea ?

Reading from RAM is several orders of magnitude faster than reading from a hard drive (roughly nanoseconds versus milliseconds)
Since it takes time to move the “eye reader”, even when we read just 1 byte, the OS in fact reads a whole block of 512 bytes around it (called a block or a sector), loads it into RAM and returns 1 byte to us. The other 511 bytes are kept in RAM to be reused next time (if needed)
Next time we read another 1 byte or 10 bytes near the 1st read, the OS returns them directly from RAM instead of moving the “eye reader” to that location again and reading a new 512 bytes, which is time consuming (the same applies to writes)

5. Asynchronous File I/O

RAM acts as a transit place (cache memory)
The system uses RAM as a cache for file reads and writes, because reading or writing the hard drive directly is time consuming
The cache of a file can be flushed passively (by the OS) or actively (by us)

Passive mechanism for hardware optimization

When we open a file on the hard drive and write data, we can't control whether the data goes directly to the hard drive or stays in RAM (cache memory) for a while
For example, if a block is 512 bytes but we write only 2 bytes, the OS will probably keep the modified block in RAM, in case we write the remaining 510 bytes, and write the whole block to the hard drive later
Periodically, the OS syncs from RAM (cache memory) to the hard drive, and we can't control this passive mechanism

Active flush cache

We could force the OS to sync from RAM to the hard drive by using the function sync() or fsync()

Practical examples

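We have 2 programs: program A writes into a file, and program B periodically reads from that file to process it.
Sometimes program A has already written the data but program B can't see it. On the same machine this usually means the data is still in A's user-space buffer (e.g. the stdio buffer when A uses fwrite()/fprintf()), which needs fflush(). Data passed to write() is in the OS cache, where other processes can already read it, but it is not yet safe on the hard drive
So program A should call fflush() on buffered streams, and sync() or fsync() when the data must actually reach the hard drive (e.g. before a power loss or before unplugging a removable drive).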

6. Useful functions

Below are the synchronous functions, which means the blocking ones: the call returns only when the operation has finished

int open(const char* pathname, int flags);

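The kernel allocates memory for a struct file and looks up the inode of the file (a new inode is created only when the file doesn't exist and O_CREAT is given). Every inode points to its cached data in RAM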

int close(int fd);

It'll free the row in the file table and, if no other descriptor still refers to it, free the memory of the struct file. For a regular file, close() does not guarantee that the data has been synced from RAM to the hard drive; call fsync() first if that matters
For a device file (/dev), writes usually go directly to the hardware instead of via the RAM cache

ssize_t read(int fd, void* buf, size_t count);

ssize_t write(int fd, const void* buf, size_t count);

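buf : for read(), a buffer we allocate beforehand; the data read is copied into it. For write(), the buffer that holds the data to write
count : maximum number of bytes to read or write; the return value is the number of bytes actually transferred (-1 on error, 0 at end of file for read())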

off_t lseek(int fd, off_t offset, int whence);

When we open a file but don't want to read from the first byte, for example we want to skip the first 100 bytes, we could use lseek() to move the “reader” there with :
offset : the position in bytes, relative to whence and counted from 0, e.g. to skip the first 100 bytes, offset = 100 with SEEK_SET
whence : the reference point for the calculation: SEEK_SET (start of file), SEEK_CUR (current position) or SEEK_END (end of file). With SEEK_END and offset = -100, the “reader” moves to 100 bytes before the end of the file, so the next read returns the last 100 bytes

int fsync(int fd);

Actively force the OS to sync the data of only the current file from RAM into the hard drive. It returns 0 on success and -1 on error

void sync(void);

Sync all the data of all the files of all the programs from RAM to the hard drive.
Therefore, even when we have just written a config file of a few hundred bytes, calling sync() could take minutes to finish if a lot of other data is waiting in the cache, which could hang the application.

7. Asynchronous File I/O

Read/Write functions block the program until they finish

For example, when the user clicks a save button, the program writes 20 MB of data to the hard drive, which takes 10 seconds. In the meantime, the GUI is blocked/hung.

To resolve it, we could use asynchronous read/write or create a new thread to read/write the file.

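#include <stdio.h>   // for using printf()
#include <stdlib.h>  // for using calloc(), free()
#include <string.h>  // for using memset()
#include <errno.h>   // for using EINPROGRESS
#include <fcntl.h>   // for using open(), O_RDONLY
#include <unistd.h>  // for using close(), usleep()
#include <aio.h>     // for using struct aiocb, aio_read(), aio_error(), aio_return()

#define SIZE_TO_READ 100  // example value: number of bytes to request

int main()
{
  printf("Hello World\n");
  
  int fd = open("text_aoi.txt", O_RDONLY, 0);
  
  if(fd == -1)
  {
   printf("Unable to open file !\n");
   return 1;
  }

  // create a buffer
  char* buffer = (char*)calloc(SIZE_TO_READ, 1);
  if(buffer == NULL)
  {
   close(fd);
   return 1;
  }
  
  // create a control block structure
  struct aiocb cb;
  
  // init the control block
  memset(&cb, 0, sizeof(struct aiocb));
  cb.aio_nbytes = SIZE_TO_READ;
  cb.aio_fildes = fd;
  cb.aio_offset = 0;
  cb.aio_buf    = buffer;
  
  // enqueue the read and jump immediately into the next code without waiting for the result
  // (glibc implements this with a helper thread behind the scenes)
  if(aio_read(&cb) == -1)
  {
   printf("Unable to create the request!\n");
   free(buffer);
   close(fd);
   return 1;
  }

  // do_anything_we_want_without_waiting 
  printf("Request enqueued!\n");
  
  // wait until the request has finished (to check if it is done)
  while(aio_error(&cb) == EINPROGRESS)
  {
   printf("Working...\n");
   usleep(1000);  // sleep 1 ms instead of spinning at full speed
  }
  
  // success ?
  int numBytes = aio_return(&cb);
  
  if(numBytes != -1)
  {
   printf("Success! Read %d bytes\n", numBytes);
  }
  else
  {
   printf("Read failed!\n");
  }
  
  free(buffer);
  close(fd);
  return 0;
}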

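A callback is a pointer which points to a function
The example above polls with aio_error() instead. To be called back when the read is done, we can register our function in cb.aio_sigevent (sigev_notify = SIGEV_THREAD, sigev_notify_function = our function) before calling aio_read()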

8. open() vs fopen()

int open(const char *pathname, int flags, mode_t mode);
FILE *fopen(const char *path, const char *mode);

fopen is a library function while open is a system call.
fopen provides buffered I/O, which may be faster than open, which is unbuffered in user space.
fopen is portable while open is not (open is environment specific).
fopen does line ending translation if the file is not opened in binary mode, which can be very helpful if your program is ever ported to a non-Unix environment (though the world appears to be converging on LF-only (except IETF text-based networking protocols like SMTP and HTTP and such)).
fopen returns a pointer to a FILE structure (FILE *) while open returns an integer that identifies the file.
A FILE * gives you the ability to use fscanf and other stdio.h functions.
Your code may someday need to be ported to some other platform that only supports ANSI C and does not support the open function.

Why is fopen() portable ?

open() is a system call and specific to Unix-based systems and it returns a file descriptor. You can write to a file descriptor using write() which is another system call
fopen() is an ANSI C function call which returns a file pointer and it is portable to other OSes. We can write to a file pointer using fprintf

How about fdopen(), fileno() ?

As far as fdopen is concerned, if you aren't playing with file descriptors, you don't need that call.
fdopen is what you would use if you first called open and then wanted a FILE *. There is no sense doing that if you have the choice to just call fopen instead
fdopen converts an OS-level file descriptor to the higher-level FILE abstraction of the C language. It does not open anything itself: it wraps a descriptor that open() (or pipe(), socket(), …) already returned and gives you a FILE pointer for it.
In Unix, we can get a file pointer from the file descriptor using

fP = fdopen(fD, "a");

In Unix, we can get a file descriptor from the file pointer using

fD = fileno (fP);

What is recommended ?

fopen and its family of methods (fwrite, fread, fprintf, fscanf, fget…)

9. Advanced functions

9.1 Read and write file properties

int stat(const char* restrict pathname, struct stat* restrict buf);
int chmod(const char* pathname, mode_t mode);
int chown(const char* pathname, uid_t owner, gid_t group);
int link(const char* existingpath, const char* newpath);

9.2 Manipulate directories

int mkdir(const char* pathname, mode_t mode);
DIR* opendir(const char* pathname);
// open the folder at pathname, then read its entries one by one; each entry gives the file name, and calling stat() on that name gives the size, modify time, …
struct dirent* readdir(DIR* dp);
int closedir(DIR* dp);