CPSC 2310 - DAY 27
NOVEMBER 21, 2016

================================================================================

Clemson University -- CPSC 2310 


I/O - input/output

  system components: CPU, memory, and bus -- now add I/O controllers and
      peripheral devices

            +-----+
            | CPU |
            +-----+
            |cache|
            +-----+
               |
   +============================================+ bus
       |                |                 |
   +--------+     +-----------+     +-----------+
   | memory |     |controller |     |controller |
   |        |     +-----------+     +-----------+
   |+------+|           |                 |     
   || I/O  ||     +-----------+     +-----------+     
   ||buffer||     |  device   |     |  device   |     
   |+------+|     +-----------+     +-----------+
   |        |
   |        |
   +--------+


Devices

  source: e.g., keyboard, mouse, scanner
  sink: e.g., monitor, printer
  source/sink: e.g., modem, network connection
  slow memory: e.g., disk, tape

  keyboard - e.g., consider how keyboard on PC works:
    - each key press causes an interrupt and sends a scan code
    - each key release also causes an interrupt and sends a scan code
    - keyboard ISR must keep track of scan codes and translate to ASCII (e.g.,
          consider that a capital 'A' requires four interrupts)
    - auto-repeat function requires a timer to be set at each key press and
          special processing should it go off prior to key release
    - "raw mode" - all characters sent to buffer and on to program
    - "cooked mode" - special characters processed in buffer before sending
          to program - e.g., backspace and/or delete
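
    The scan-code bookkeeping above can be sketched in C. This is a minimal
    illustration, not a real driver: the scan-code values and the name
    keyboard_isr are assumptions, and only the shift key and the A key are
    handled.

```c
#include <stdbool.h>

#define SC_RELEASE 0x80u  /* assumed: high bit set on key release */
#define SC_SHIFT   0x2Au  /* assumed scan code for a shift key */
#define SC_A       0x1Eu  /* assumed scan code for the A key */

static bool shift_down = false;  /* state remembered between interrupts */

/* Handle one keyboard interrupt.  Returns the ASCII character to place
   in the input buffer, or 0 if this interrupt produces no character. */
char keyboard_isr(unsigned char scan_code)
{
    bool released = (scan_code & SC_RELEASE) != 0;
    unsigned char key = (unsigned char)(scan_code & 0x7Fu);

    if (key == SC_SHIFT) {
        shift_down = !released;   /* track shift press and release */
        return 0;
    }
    if (released)
        return 0;                 /* other key releases produce nothing */
    if (key == SC_A)
        return shift_down ? 'A' : 'a';
    return 0;                     /* keys outside this sketch are ignored */
}
```

    A capital 'A' takes four calls (shift press, A press, A release, shift
    release), and only the second one yields a character.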

  monitor - again, consider a PC
    - memory-mapped display buffer
    - monitor adapter refreshes display from display buffer, will often use
          special dual-ported memory chips (VRAM)

  disk
    - rotating platters covered with magnetic oxide coating
    - collection of read/write heads, one per surface, mounted on "arm", which
          moves in and out
    - concentric circles defined by where r/w head can be positioned are called
          tracks, and the set of parallel tracks defined by one arm position
          is called a cylinder
    - simple disks have fixed number of sectors per track, and small number of
          bytes per sector (e.g., 512)
    - disk access requires:
        * "seek" - arm motion of 2-20 msecs
        * "rotational latency" - wait until proper sector rotates under the
              r/w head, 1/2 rotation on avg., 8.3 msecs for 3600 rpm
        * "transfer" - time to read or write sector, depends on sector size,
              fraction of msec
    e.g., disk specs - Fujitsu MAP 3147
        * formatted capacity: 147 GB
          + number of disks (i.e., platters): 4
          + number of data heads: 8
          + sector size: 512 bytes
          + number of recording zones per surface: 18
            - sectors/track ranges from 533 inner zone to 936 outer zone
          + recording density (maximum): 600,000 BPI
          + track density: 63,100 TPI
            - 48,000 cylinders
        * seek time (read): 4.5 ms (avg.); track to track 0.3 ms; full 10 ms
          + write seeks are slightly longer (5ms, 0.5ms, 11ms, respectively)
        * rotational latency: 2.99 ms (avg.)
          + rotational speed: 10,000 RPM
        * disk transfer rate: 62-107 MB/sec
        * interface transfer rate: 320 MB/sec (Ultra320 SCSI)
        * sustained data rate: 40-70 MB/sec
        * data buffer (i.e., cache on controller): 8 MB
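
    The three components of a disk access can be added up in a few lines of
    C. This is a rough sketch: the function names are made up for
    illustration, 1 MB is assumed to mean 10^6 bytes, and queueing and
    controller overhead are ignored.

```c
#include <assert.h>

/* Average rotational latency in ms: half of one revolution. */
double avg_rotational_latency_ms(double rpm)
{
    assert(rpm > 0.0);
    return 0.5 * 60000.0 / rpm;
}

/* Time in ms to transfer `bytes` at `mb_per_sec` (1 MB = 10^6 bytes). */
double transfer_ms(double bytes, double mb_per_sec)
{
    assert(mb_per_sec > 0.0);
    return bytes / (mb_per_sec * 1.0e6) * 1000.0;
}

/* Average access time in ms = seek + rotational latency + transfer. */
double avg_access_ms(double seek_ms, double rpm,
                     double bytes, double mb_per_sec)
{
    return seek_ms + avg_rotational_latency_ms(rpm)
                   + transfer_ms(bytes, mb_per_sec);
}
```

    With the figures above (4.5 ms average seek, 10,000 RPM, one 512-byte
    sector at 62 MB/sec), avg_access_ms gives roughly 7.5 ms, almost all of
    it seek and rotational latency.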


Bus

  address lines
  data lines
  control lines (read, write, interrupt request, ...)


Bus and disk interface standards

  PC bus: Industry Standard Architecture (ISA), Extended ISA (EISA),
          Video Electronics Standards Association (VESA),
          Peripheral Component Interconnect (PCI), ...
  

  PC disk: SCSI, IDE, EIDE, ...
    SCSI slow, but handles 7 devices 
    IDE  slow, limited to 2 small disks 
    EIDE fast, handles disks, CDROMs, etc. 
    wide SCSI fastest
    EIDE drives cheaper than SCSI 

RAID storage - "Redundant Array of Inexpensive Disks"
  (now common to see "Redundant Array of Independent Disks")


There are 2 important concepts to be understood in the design and 
implementation of disk arrays: 

1. Data striping, for improved performance. 
2. Redundancy for improved availability. 

Data Striping

Data striping transparently distributes data over multiple disks to 
make them appear as a single fast, large disk. Striping improves 
aggregate I/O performance by allowing multiple I/Os to be serviced 
in parallel. There are 2 aspects to this parallelism. 
  
- Multiple, independent requests can be serviced in parallel by 
  separate disks. This decreases the queueing time seen by I/O requests.
- A single multi-block request can be serviced by multiple disks 
  acting in coordination. This increases the effective transfer rate 
  seen by a single request. The performance benefits increase with the 
  number of disks in the array. Unfortunately, a large number of disks 
  lowers the overall reliability of the disk array.
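
One simple way to picture block-level striping is a round-robin mapping 
from logical block numbers to disks. The sketch below assumes a stripe 
unit of one block; the struct and function names are hypothetical. 

```c
#include <assert.h>

struct stripe_location {
    unsigned      disk;   /* which disk in the array holds the block */
    unsigned long block;  /* block number on that disk */
};

/* Map a logical block to a (disk, block) pair, round-robin over n disks. */
struct stripe_location stripe_map(unsigned long logical_block,
                                  unsigned n_disks)
{
    struct stripe_location loc;

    assert(n_disks > 0);
    loc.disk  = (unsigned)(logical_block % n_disks);
    loc.block = logical_block / n_disks;
    return loc;
}
```

With 4 disks, logical blocks 4, 5, 6, and 7 land on disks 0, 1, 2, and 3 
(all at block 1), so a four-block request can use all four disks at once. 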

Most of the redundant disk array organizations can be distinguished 
based on 2 features: 
1. The granularity of data interleaving, and 
2. The way in which the redundant data is computed and stored 
   across the disk array. 

  RAID    --concurrent--   redundancy   striping    number
  level   reads   writes     level                 of disks
  -----   -----   ------   ----------   --------   --------
    0      yes     yes       none         block      n
    1      yes     no      mirroring      none       2n
    2      no      no         ECC         bit        n+k
    3      no      no        parity     bit/byte     n+1
    4      yes     no        parity       block      n+1
    5      yes     yes    distributed     block      n+1
    6      yes     yes    distributed     block      n+2
  10/01    yes     no      mirroring      block      2n


  level 0 - no redundancy but parallel access, called "striping"
            offers the best performance but no fault-tolerance
            * n disks

  level 1 - disk mirroring/shadowing
            * 2n disks
            * multiple, simultaneous I/Os are allowed
            * each write goes to two disks
            * read from either (e.g., choose one with shortest seek time
              or with fewest queued requests)

  level 2 - bit-interleaved array - error correction similar to ECC memory
            * n+k disks
            * dedicated ECC disks
            * expensive and uncommon 

  level 5 - rotated parity
            * n+1 disks
            * rotated parity allows multiple, simultaneous writes 
            * most popular among levels 3,4,5

  level 10 - striped mirrors - 2n disks with n-way striping of n disk pairs

  level 01 - mirrored stripes - 2n disks with mirroring of n-way striped disks


  commercial systems emphasize reliability and use high quality disks, also
    use redundant power supplies, etc., to remove single points of failure

  enhancements include battery-backed write-back cache in controller and
    dynamic hot sparing

  adaptive RAID allows segments to be dynamically reconfigured at level 1 or
    level 5 (background relocation of sectors, called "rebalancing") 

Disadvantages due to Redundancy:

Every write changes data, and the change must also be reflected on the 
disks that store the redundant information. As a result, writes in 
redundant disk arrays perform significantly worse than writes in 
non-redundant disk arrays. 

Also, keeping the redundant information consistent in the presence of 
concurrent I/O operations and possible system crashes can be 
difficult.
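
For the parity-based levels, the extra write work can be seen in a short 
sketch. It assumes XOR parity over equal-sized blocks; the function names 
are made up for illustration. 

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] ^= src[i] for each byte of a block. */
void xor_into(unsigned char *dst, const unsigned char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < len; i++)
        dst[i] ^= src[i];
}

/* Update parity for a write to one data block:
   new parity = old parity ^ old data ^ new data. */
void parity_small_write(unsigned char *parity,
                        const unsigned char *old_data,
                        const unsigned char *new_data, size_t len)
{
    xor_into(parity, old_data, len);  /* remove old data's contribution */
    xor_into(parity, new_data, len);  /* add new data's contribution */
}
```

Under these assumptions, a write to a single block needs the old data and 
the old parity to be read before the new data and new parity are written, 
so one logical write becomes four disk operations. XORing the parity with 
the surviving data blocks rebuilds a lost block. 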

INTERRUPTS 
----------


            +-----+    CPU must perform all transfers to/from simple controller,
            | CPU |    e.g., CPU reads byte from buffer in memory and stores
            +-----+    it in controller's data register then stores a write-
            |cache|    to-device command in the controller's command register
            +-----+
               |
   +============================================+ bus
       |                |                 |
   +--------+     +-----------+     +-----------+   a simple controller will
   | memory |     |controller |     |controller |   respond to bus signals,
   |        |     |       +--+|     |       +--+|   will set status register
   |+------+|     |   data|  ||     |   data|  ||   for CPU to later check
   || I/O  ||     |       +--+|     |       +--+|
   ||buffer||     | status|  ||     | status|  ||
   |+------+|     |       +--+|     |       +--+|
   |        |     |command|  ||     |command|  ||
   |        |     |       +--+|     |       +--+|
   +--------+     +-----------+     +-----------+
                        |                 |
                  +-----------+     +-----------+
                  |  device   |     |  device   |
                  +-----------+     +-----------+


  controller registers
    - data register - holds data byte going to/from device
    - status register - holds bits indicating if device is ready, error, etc.
    - command register - bit for read, bit for write, etc. (may be combined
        with the status register)

  access to controller registers either by:
    - memory-mapped - registers respond to main memory addresses (typically
        high memory), so you can use normal load/store instructions to access
    - isolated I/O - special instructions (e.g., IN, OUT on Pentium) are
        required, use port numbers as addresses of the controller registers

  programmed I/O - CPU is involved with sending/receiving every byte, CPU
    must busy wait on device to be ready for sending/receiving

    ; write bytes from memory buffer to device
    ;
    ;   pseudo-code
    ;   |
    ;   |   int  count = N;
    ;   |   char *addr = memory_buffer;
    ;   |   char byte;
    ;   |
    ;   |   do{ byte = *addr;
    ;   |
    ;   |       while( io_device_status != READY ) /* busy wait */ ;
    ;   |
    ;   |       io_device_data    = byte;
    ;   |       io_device_command = WRITE;
    ;   |
    ;   |       addr++;
    ;   |       count--;
    ;   |
    ;   |   }while( count > 0 );
    ;
    ;

    consider 6 ppm printer, with 5,000 characters per page = 30,000 chars/min
    = 500 chars/sec = 0.002 sec/char = 2 ms/char

    for a 500 MHz processor (= 500 M cycles/sec), the cycle time is 2 ns,
    thus 1,000,000 cycles in 2 ms and thus 1,000,000 cycles between characters

    if the busy wait loop takes 100 cycles per iteration (of which most will
    be required for the latency of the load inst. accessing the device status
    register), the busy wait loop requires 10,000 iterations between characters

    1 success in 10,000 iterations => not an efficient use of the CPU
    (CPU spends 99.9+% of its time waiting)
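
    The same arithmetic can be written as a small C sketch so that other
    clock rates and device speeds can be tried. The function names are
    hypothetical.

```c
#include <assert.h>

/* CPU cycles that elapse between two characters accepted by the device. */
double cycles_per_char(double cpu_hz, double chars_per_sec)
{
    assert(chars_per_sec > 0.0);
    return cpu_hz / chars_per_sec;
}

/* Busy-wait loop iterations between characters. */
double polls_per_char(double cpu_hz, double chars_per_sec,
                      double cycles_per_poll)
{
    assert(cycles_per_poll > 0.0);
    return cycles_per_char(cpu_hz, chars_per_sec) / cycles_per_poll;
}

/* Fraction of iterations that find the device not ready. */
double wasted_fraction(double polls)
{
    assert(polls >= 1.0);
    return (polls - 1.0) / polls;
}
```

    For the printer above, polls_per_char(500e6, 500, 100) is 10,000 and
    wasted_fraction(10000) is 0.9999.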


    ; read bytes from device to memory buffer
    ;
    ;   pseudo-code
    ;   |
    ;   |   int  count = N;
    ;   |   char *addr = memory_buffer;
    ;   |   char byte;
    ;   |
    ;   |   do{ io_device_command = READ;
    ;   |
    ;   |       while( io_device_status != READY ) /* busy wait */ ;
    ;   |
    ;   |       byte = io_device_data;
    ;   |       *addr = byte;
    ;   |
    ;   |       addr++;
    ;   |       count--;
    ;   |
    ;   |   }while( count > 0 );
    ;
    ;
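
    With memory-mapped registers, the write loop might look like the C
    sketch below. The register layout, the bit values, and the name
    pio_write are assumptions for illustration. On real hardware the
    pointer would be set to the controller's fixed address; here the
    registers are an ordinary struct so the code can run on any host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register layout; a real controller's layout and bit
   values come from its data sheet. */
struct io_regs {
    volatile unsigned char data;
    volatile unsigned char status;
    volatile unsigned char command;
};

#define STATUS_READY 0x01u   /* assumed "device ready" bit */
#define CMD_WRITE    0x02u   /* assumed "write" command code */

/* Write n bytes from buf to the device, busy waiting before each byte. */
void pio_write(struct io_regs *dev, const unsigned char *buf, size_t n)
{
    assert(dev != NULL && buf != NULL);
    for (size_t i = 0; i < n; i++) {
        while ((dev->status & STATUS_READY) == 0)
            ;                       /* busy wait */
        dev->data    = buf[i];      /* store byte in data register */
        dev->command = CMD_WRITE;   /* tell controller to send it */
    }
}
```

    The volatile qualifier keeps the compiler from caching the status
    register in a CPU register, which would turn the busy wait into an
    infinite loop. Unlike the do-while pseudo-code, this version also
    handles a count of zero.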


  interrupt-driven I/O - CPU can do something else while controller and device
    are busy, the controller grabs the CPU's attention when needed by causing
    what is essentially an unplanned procedure call

            +-----+
            | CPU |--------.
            +-----+<-----. |
            |cache|      | |
            +-----+      | |
               |         | |
   +===+=======+=====+===|=|===========+=====+ bus
       |             |   +-|-----------|---+-+   interrupt request line (INTR)
   +--------+        |   | +-----------|---|-+   interrupt ack line (INTA)
   | memory |        |   | v           |   | v
   |+------+|     +-----------+     +-----------+   controllers that can
   ||buffer||     |controller |     |controller |   interrupt raise request
   |+------+|     |       +--+|     |       +--+|   signal on bus
   |        |     |   data|  ||     |   data|  ||
   |+------+|     |       +--+|     |       +--+|   when CPU responds with
   || ISR  ||     | status|  ||     | status|  ||   an acknowledgement, the
   |+------+|     |       +--+|     |       +--+|   controller places some
   |        |     |command|  ||     |command|  ||   type of identification
   |+------+|     |       +--+|     |       +--+|   on the bus
   || int. ||     +-----------+     +-----------+
   ||vector||           |                 |
   ||table ||     +-----------+     +-----------+
   |+------+|     |  device   |     |  device   |
   +--------+     +-----------+     +-----------+

     we rely on an external interrupt from the controller to signal that the
       device is ready (i.e., that the previous I/O operation is complete);
       this will cause the currently executing program to stop and the processor
       to enter the OS and start executing an interrupt service routine (ISR)
       - sometimes called an interrupt handler (IH)

       there are also internal interrupts (sometimes called exceptions) for
         divide by zero, unaligned memory accesses, memory protection errors,
         etc.

       moreover, to protect the OS, calls to the OS must be made by a special
         instruction that causes an interrupt - called SVC (supervisor call)
         on IBM mainframes, INT on x86, and trap on SPARC


     an interrupt must save a return address and information on the processor
       state to allow the interrupted program to be resumed later => save the
       program counter (PC) and processor state register (PSR)

     an interrupt switches execution mode to an OS-only mode by changing a
       mode bit (or bits) in the PSR

     there are typically interrupt control bits in the controller's command
       register, and interrupt enable bits (either a priority level or a bit
       mask) in the PSR - the processor typically disables interrupts (at
       least at that level and lower) whenever an ISR starts

     the entry point address to the interrupt service routine (ISR) is typically
       provided by a table of such addresses in low memory; for I/O the entry
       is chosen according to the interrupt code placed on bus by controller

           +------------------------+                    \
         0 | addr of ISR for type 0 |-------------.      |
           +------------------------+             |      |
         4 | addr of ISR for type 1 |----------.  |      | interrupt vector
           +------------------------+          |  |      |    table (IVT)
         8 | addr of ISR for type 2 |-------.  |  |      |
           +------------------------+       |  |  |      |
         c | addr of ISR for type 3 |----.  v  v  v      |
           +------------------------+    |               /
           |          ...           |    |
                                         |
           |          ...           |    |
           +------------------------+    |               \
           |  code for type 3 int   |<---'               | interrupt
           |          ...           |                    | service
           | return from interrupt  |                    | routine (ISR)
           +------------------------+                    /
           |          ...           |

     a special return from interrupt instruction at end of ISR switches back to
       previous processor state and restores saved PC

     the fetch-execute cycle is extended to check for interrupts after each
       instruction - the hardware response to an interrupt acts like procedure
       call if interrupt requested by device and if CPU has interrupts enabled

     note that the ISR is a software routine - and that instructions in the
       ISR are fetched, decoded, and executed, just like any other program


         PC - program counter, contains address of next instruction
         PSR - processor status register, contains:
           * processor execution mode (kernel/user)
           * interrupt enable/permission (can be single bit, mask, or
               priority code)
           * condition codes
         IVT - interrupt vector table, contains entry-point addresses for ISRs
         ISR - interrupt service routine


            +-------+
     .----->| fetch |
     |      +-------+
     |          v
     |       .------.
     |      < decode >
     |       `------'
     |          |
     |       +--+---------+------------+------- ...  -------------+
     |       v            v            v                          v
     |  +---------+  +---------+  +---------+         +-----------------------+
     |  | execute |  | execute |  | execute |   ...   | return from interrupt |
     |  |   load  |  |   add   |  |  store  |         |  (restore PC and PSR) |
     |  +---------+  +---------+  +---------+         +-----------------------+
     |       v            v            v                          v
     |       +--+---------+------------+------- ...  -------------+
     |          |
     |       .-------------------------------------------------.
     |      < if interrupt requested and interrupts are enabled >
     |   no  `-------------------------------------------------'
     +<--------'    | yes
     ^              v
     |          +--------------------------------------------------+
     |          | 1) save PC and PSR                               |
     |          | 2) switch execution mode to kernel (OS-only)     |
     |          | 3) disable/restrict further interrupts           |
     |          | 4) load new PC from IVT (interrupt vector table) |
     |          +--------------------------------------------------+
     |              |
     `--------------'


  nested interrupts

    if the PC and PSR are saved on a stack (or in a set of registers), a high
       priority device can interrupt the execution of the ISR for a lower
       priority device

                                  .----------.
                                  | disk ISR |
                                  `----------|
                    .-------------^          v-------------.
                    | printer ISR |          | printer ISR |
                    `-------------|          `-------------|
      --------------^             ^                        v--------------
       user program |             |                        |user program
      --------------|             |                        `--------------
                    ^             |
                    |             |         rti           rti
                 printer        disk
                interrupt     interrupt

    otherwise, the second interrupt is held pending until the first ISR
      finishes and executes its rti instruction; at that point, the rti
      briefly reestablishes user mode with interrupts enabled but
      immediately the highest-priority pending interrupt is accepted

                    .--------------------------..----------.
                    |        printer ISR       || disk ISR |
                    `--------------------------|`----------|
      --------------^                          |^          v--------------
       user program |                          ||          |user program
      --------------|                          v|          `--------------
                    ^                           |              
                    |                        rti|         rti
                    |                           |
                    |             ^^^^^^^^^^^^^^*
                 printer        disk
                interrupt     interrupt


  SPARC interrupts

    on SPARC, interrupts are called traps

    256 trap types, half software and half hardware

    invoke OS by trap instruction    

    trap enable bit (ET) and 4-bit interrupt level (PIL) in PSR
      - a synchronous trap is accepted only if the ET bit is set
      - an external interrupt is accepted only if the ET bit is set and
          priority level of the interrupt is greater than current PIL

    TBR - trap base register with leading 20 bits set by OS, an 8-bit field
      supplied internally based on trap type or externally from interrupting
      device, and four zeros
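
    The TBR layout and the acceptance rule can be written out in C. The
    function names are made up for illustration, and interrupt_accepted
    follows the rule exactly as stated in these notes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the handler address: leading 20 bits from the OS-supplied base,
   then the 8-bit trap type, then four zero bits. */
uint32_t trap_handler_address(uint32_t tbr_base, uint8_t trap_type)
{
    return (tbr_base & 0xFFFFF000u) | ((uint32_t)trap_type << 4);
}

/* External interrupt is accepted only if traps are enabled (ET set)
   and its level is greater than the current PIL. */
bool interrupt_accepted(bool et, unsigned pil, unsigned level)
{
    assert(pil <= 15 && level >= 1 && level <= 15);
    return et && level > pil;
}
```

    The four zero bits place table entries 16 bytes apart, which leaves
    room for four 4-byte instructions per trap type.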

    trap sequence
      1. ET bit in PSR cleared so that further traps are disabled
      2. S bit saved into PS bit and S bit set => places processor in
         supervisor mode
      3. CWP decremented (as by a save) to give trap handler new local registers
         (you can only use the eight local registers since the out registers
         are not guaranteed to be mapped to physical registers w/o window
         overflow processing; also the in registers and global registers
         might be in use by interrupted process)
      4. pc and npc are saved into %l1 and %l2 (caveat: hyperSPARC manual
         differs in register assignment from textbook)
      5. trap type field placed into TBR and pc = TBR, npc = TBR+4


  DMA - direct memory access - extra registers and logic in the controller
        allow it to transfer a whole block of bytes without CPU involvement,
        interrupts the CPU after completion or after an error
    - address register - address of buffer in main memory, controller increments
    - count register - length of block, controller decrements

            +-----+         INTA
            | CPU |----------------.
            +-----+<-------------. |
            |cache|         INTR | |
            +-----+              | |
               |                 | |
   +===+=======+=====+===========|=|=====+ bus
       |             |           | |
       |             |           | v
   +--------+     +----------------------+  DMA controller can be bus master,
   | memory |     | DMA controller       |  so need to arbitrate for bus among
   |        |     |       +-------------+|  DMA controllers and CPU
   |+------+<-.   |   data|             ||
   || I/O  || |   |       +-------------+|  extra logic in controller to
   ||buffer|| |   | status|             ||  implement loop:
   |+------+| |   |       +-------------+|
   |        | |   |command|             ||    while( count > 0 ){
   |        | |   |       +-------------+|       transfer byte at address;
   |        | `----address| & buffer    ||       address++;
   |        |     |       +-------------+|       count--;
   |        |     |  count| # in buffer ||    }
   |        |     |       +-------------+|
   +--------+     +----------------------+  interrupt once at end of buffer
                             |
                       +-----------+
                       |  device   |
                       +-----------+
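
    The controller's transfer loop can be simulated in C. In this sketch
    the device is modeled as a plain byte array and the interrupt as a
    flag; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dma_controller {
    const unsigned char *address;  /* next byte of buffer in memory */
    size_t count;                  /* bytes left to transfer */
    bool   interrupt_pending;      /* set once at end of block */
};

/* Transfer the whole block from memory to the device with no CPU
   involvement, then raise one interrupt.  Returns bytes moved. */
size_t dma_write_block(struct dma_controller *dma,
                       unsigned char *device_out)
{
    size_t moved = 0;

    assert(dma != NULL && device_out != NULL);
    while (dma->count > 0) {
        device_out[moved++] = *dma->address;  /* transfer byte */
        dma->address++;                       /* controller increments */
        dma->count--;                         /* controller decrements */
    }
    dma->interrupt_pending = true;            /* interrupt once at end */
    return moved;
}
```

    The CPU's part is limited to loading the address and count registers
    and giving the transfer command before the loop starts.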


  I/O methods                 CPU involvement            # interrupts
                              ---------------            ------------

  programmed I/O              completely dedicated       none
                                to the transfer

  interrupt-driven I/O        transfers each byte        after each byte

  DMA I/O                     initially loads the        one, at end of block
                                address and count
                                registers, gives
                                transfer command


  effect of offloading CPU
  
    do{ byte = *addr;                                       \
        while( status != READY );  <= interrupt-driven I/O  | complete block
        io_device_data    = byte;        relieves CPU of    | transfer is
        io_device_command = WRITE;       busy-wait loop     | offloaded onto
        addr++;                                             | DMA controller
        count--;                                            |
    }while( count > 0 );                                    /


  further offloading of I/O from the CPU


    mainframe channel - provides for transfers of multiple blocks by traversing
      linked-list-like channel programs, interrupt only when end of channel
      program reached (or on error)


        a channel has the equivalent of a program counter

        +---------+      each channel instruction is
        | channel |     called a "channel command word"
        | +----+  |    +-----+---------+-------+------+      each CCW has
        | | pc ------->| r/w | address | count | next --.    address and
        | +----+  |    +-----+---------+-------+------+ |    count fields
        +---------+                                     |    to support a
                       .--------------------------------'    block transfer
                       v
                       +-----+---------+-------+------+      additional
                       | r/w | address | count | next --.    fields indicate
                       +-----+---------+-------+------+ |    end of physical
                                                        v    block, etc.


      CCWs can also provide scatter-gather

        scatter - read data from a single physical block on an I/O device
          and send different parts to multiple, non-contiguous I/O buffers
          in memory

        gather - read data from multiple, non-contiguous I/O buffers in
          memory and write a single physical block to the device
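
        A gather operation over a chain of CCWs can be sketched in C. The
          struct below keeps only the address, count, and next fields (the
          r/w field and the end-of-block flags are left out), and the names
          are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct ccw {
    const unsigned char *address;  /* I/O buffer in memory */
    size_t count;                  /* bytes in this buffer */
    const struct ccw *next;        /* next CCW, or NULL at end */
};

/* Walk the channel program and gather the buffers into one physical
   block.  Returns the number of bytes placed in the block. */
size_t channel_gather(const struct ccw *pc, unsigned char *block,
                      size_t block_size)
{
    size_t used = 0;

    while (pc != NULL) {
        assert(used + pc->count <= block_size);
        memcpy(block + used, pc->address, pc->count);
        used += pc->count;
        pc = pc->next;             /* advance the channel's "pc" */
    }
    return used;
}
```

        Scatter is the same walk in the other direction, copying successive
          pieces of one block out to each buffer in turn.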


    I/O processors - offload I/O conversion, editing, etc., e.g., I^2O =
      "intelligent I/O", Intel i960 running its own real-time OS