CPSC 2310 - DAY 27
NOVEMBER 21, 2016
================================================================================

                        Clemson University -- CPSC 2310

I/O - input/output

system components: CPU, memory, and bus -- now add I/O controllers and
  peripheral devices

             +-----+
             | CPU |
             +-----+
             |cache|
             +-----+
                |
  +============================================+  bus
        |                |              |
    +--------+     +-----------+  +-----------+
    | memory |     |controller |  |controller |
    |        |     +-----------+  +-----------+
    |+------+|           |              |
    || I/O  ||     +-----------+  +-----------+
    ||buffer||     |  device   |  |  device   |
    |+------+|     +-----------+  +-----------+
    |        |
    |        |
    +--------+


Devices

  source: e.g., keyboard, mouse, scanner
  sink: e.g., monitor, printer
  source/sink: e.g., modem, network connection
  slow memory: e.g., disk, tape

  keyboard - e.g., consider how keyboard on PC works:
    - each key press causes an interrupt and sends a scan code
    - each key release also causes an interrupt and sends a scan code
    - keyboard ISR must keep track of scan codes and translate to ASCII
        (e.g., consider that a capital 'A' requires four interrupts:
        shift press, key press, key release, shift release)
    - auto-repeat function requires a timer to be set at each depress and
        special processing should it go off prior to key release
    - "raw mode" - all characters sent to buffer and on to program
    - "cooked mode" - special characters processed in buffer before
        sending to program - e.g., backspace and/or delete

  monitor - again, consider a PC
    - memory-mapped display buffer
    - monitor adapter refreshes display from display buffer, will often
        use special dual-ported memory chips (VRAM)

  disk
    - rotating platters covered with magnetic oxide coating
    - collection of read/write heads, one per surface, mounted on "arm",
        which moves in and out
    - concentric circles defined by where r/w head can be positioned are
        called tracks, and the set of parallel tracks defined by one arm
        position is called a cylinder
    - simple disks have fixed number of sectors per track, and small
        number of bytes per sector (e.g., 512)
    - disk access requires:
      * "seek" - arm motion of 2-20 msecs
      * "rotational latency" - wait until proper sector rotates under the
          r/w head, 1/2 rotation on avg., 8.3 msecs for 3600 rpm
      * "transfer" - time to read or write sector, depends on sector size,
          fraction of msec

  e.g., disk specs - Fujitsu MAP 3147
    * formatted capacity: 147 GB
      + number of disks (i.e., platters): 4
      + number of data heads: 8
      + sector size: 512 bytes
      + number of recording zones per surface: 18
        - sectors/track ranges from 533 inner zone to 936 outer zone
      + recording density (maximum): 600,000 BPI
      + track density: 63,100 TPI
        - 48,000 cylinders
    * seek time (read): 4.5 ms (avg.); track to track 0.3 ms; full 10 ms
      + write seeks are slightly longer (5 ms, 0.5 ms, 11 ms, respectively)
    * rotational latency: 2.99 ms (avg.)
      + rotational speed: 10,000 RPM
    * disk transfer rate: 62-107 MB/sec
    * interface transfer rate: 320 MB/sec (Ultra320 SCSI)
    * sustained data rate: 40-70 MB/sec
    * data buffer (i.e., cache on controller): 8 MB


Bus

  address lines
  data lines
  control lines (read, write, interrupt request, ...)


Bus and disk interface standards

  PC bus: Industry standard architecture (ISA), EISA, Video Electronics
    Standards Association (VESA), Peripheral Component Interconnect
    (PCI), ...

  PC disk: SCSI, IDE, EIDE, ...

    SCSI       slow, but handles 7 devices
    IDE        slow, limited to 2 small disks
    EIDE       fast, handles disks, CDROMs, etc.
    wide SCSI  fastest

    EIDE drives cheaper than SCSI


RAID storage - "Redundant Array of Inexpensive Disks"
  (now common to see "Redundant Array of Independent Disks")

  There are 2 important concepts to be understood in the design and
  implementation of disk arrays:

    1. Data striping, for improved performance.
    2. Redundancy for improved availability.

  Data Striping

  Data striping transparently distributes data over multiple disks to make
  them appear as a single fast, large disk. Striping improves aggregate
  I/O performance by allowing multiple I/Os to be serviced in parallel.
  There are 2 aspects to this parallelism.
    - Multiple, independent requests can be serviced in parallel by
      separate disks. This decreases the queueing time seen by I/O
      requests.
    - Single, multiple block requests can be serviced by multiple disks
      acting in co-ordination. This increases the effective transfer rate
      seen by a single request. The performance benefits increase with the
      number of disks in the array. Unfortunately, a large number of disks
      lowers the overall reliability of the disk array.

  Most of the redundant disk array organizations can be distinguished
  based on 2 features:

    1. the granularity of data interleaving and
    2. The way in which the redundant data is computed and stored across
       the disk array.

  RAID   --concurrent--                 striping   number
  level  reads  writes   redundancy     level      of disks
  -----  -----  ------   ----------     --------   --------
  0      yes    yes      none           block      n
  1      yes    no       mirroring      none       2n
  2      no     no       ECC            bit        n+k
  3      no     no       parity         bit/byte   n+1
  4      yes    no       parity         block      n+1
  5      yes    yes      distributed    block      n+1
  6      yes    yes      distributed    block      n+2
  10/01  yes    no       mirroring      block      2n

  level 0 - no redundancy but parallel access, called "striping"
    offers the best performance but no fault-tolerance
    * n disks

  level 1 - disk mirroring/shadowing
    * 2n disks
    * multiple, simultaneous I/Os are allowed
    * each write goes to two disks
    * read from either (e.g., choose one with shortest seek time or with
        fewest queued requests)

  level 2 - bit interleaved array - similar to ECC
    * n+k disks
    * dedicated ECC disks
    * expensive and uncommon

  level 5 - rotated parity
    * n+1 disks
    * rotated parity allows multiple, simultaneous writes
    * most popular among levels 3,4,5

  level 10 - striped mirrors - 2n disks with n-way striping of n disk pairs

  level 01 - mirrored stripes - 2n disks with mirroring of n-way striped
    disks

  commercial systems emphasize reliability and use high quality disks,
    also use redundant power supplies, etc., to remove single points of
    failure

  enhancements include battery-backed write-back cache in controller and
    dynamic hot sparing

  adaptive RAID
    allows segments to be dynamically reconfigured at level 1 or level 5
    (background relocation of sectors, called "rebalancing")

  Disadvantages due to Redundancy:

  Every write operation changes data, and that change must also be
  reflected on the disks storing the redundant information. This makes
  writes in redundant disk arrays significantly slower than writes in
  non-redundant disk arrays. Also, keeping the redundant information
  consistent in the presence of concurrent I/O operations and possible
  system crashes can be difficult.


INTERRUPTS
----------

I/O - input/output

system components: CPU, memory, and bus -- now add I/O controllers and
  peripheral devices

             +-----+     CPU must perform all transfers to/from simple
             | CPU |     controller, e.g., CPU reads byte from buffer in
             +-----+     memory and stores it in controller's data
             |cache|     register then stores a write-to-device command
             +-----+     in the controller's command register
                |
  +============================================+  bus
        |                |              |
    +--------+     +-----------+  +-----------+   a simple controller will
    | memory |     |controller |  |controller |   respond to bus signals,
    |        |     |       +--+|  |       +--+|   will set status register
    |+------+|     |   data|  ||  |   data|  ||   for CPU to later check
    || I/O  ||     |       +--+|  |       +--+|
    ||buffer||     | status|  ||  | status|  ||
    |+------+|     |       +--+|  |       +--+|
    |        |     |command|  ||  |command|  ||
    |        |     |       +--+|  |       +--+|
    +--------+     +-----------+  +-----------+
                         |              |
                   +-----------+  +-----------+
                   |  device   |  |  device   |
                   +-----------+  +-----------+

controller registers
  - data register - holds data byte going to/from device
  - status register - holds bits indicating if device is ready, error, etc.
  - command register - bit for read, bit for write, etc.
      (may be combined with the status register)

access to controller registers either by:
  - memory-mapped - registers respond to main memory addresses (typically
      high memory), so you can use normal load/store instructions to access
  - isolated I/O - special instructions (e.g., IN, OUT on Pentium) are
      required, use port numbers as addresses of the controller registers


programmed I/O - CPU is involved with sending/receiving every byte, CPU
  must busy wait on device to be ready for sending/receiving

  ; write bytes from memory buffer to device
  ;
  ; pseudo-code
  ;  |
  ;  |   int count = N;
  ;  |   char *addr = memory_buffer;
  ;  |   char byte;
  ;  |
  ;  |   do{ byte = *addr;
  ;  |
  ;  |       while( io_device_status != READY ) /* busy wait */ ;
  ;  |
  ;  |       io_device_data = byte;
  ;  |       io_device_command = WRITE;
  ;  |
  ;  |       addr++;
  ;  |       count--;
  ;  |
  ;  |   }while( count > 0 );
  ;
  ;

  consider 6 ppm printer, with 5,000 characters per page
    = 30,000 chars/min = 500 chars/sec = 0.002 sec/char = 2 ms/char

  for a 500 MHz processor (= 500 M cycles/sec), the cycle time is 2 ns,
    thus 1,000,000 cycles in 2 ms and thus 1,000,000 cycles between
    characters

  if the busy wait loop takes 100 cycles per iteration (of which most will
    be required for the latency of the load inst. accessing the device
    status register), the busy wait loop requires 10,000 iterations
    between characters

  1 success in 10,000 iterations => not an efficient use of the CPU
    (CPU spends 99.9+% of its time waiting)

  ; read bytes from device to memory buffer
  ;
  ; pseudo-code
  ;  |
  ;  |   int count = N;
  ;  |   char *addr = memory_buffer;
  ;  |   char byte;
  ;  |
  ;  |   do{ io_device_command = READ;
  ;  |
  ;  |       while( io_device_status != READY ) /* busy wait */ ;
  ;  |
  ;  |       byte = io_device_data;
  ;  |       *addr = byte;
  ;  |
  ;  |       addr++;
  ;  |       count--;
  ;  |
  ;  |   }while( count > 0 );
  ;
  ;


interrupt-driven I/O - CPU can do something else while controller and
  device are busy, the controller grabs the CPU's attention when needed by
  causing what is essentially an unplanned procedure call

           +-----+
           | CPU |--------.
           +-----+<-----. |
           |cache|      | |
           +-----+      | |
              |         | |
  +===+=======+=====+===|=|===========+=====+  bus
      |             |   +-|-----------|---+-+  interrupt request line (INTR)
  +--------+        |   | +-----------|---|-+  interrupt ack line (INTA)
  | memory |        |   | v           |   | v
  |+------+|   +-----------+     +-----------+   controllers that can
  ||buffer||   |controller |     |controller |   interrupt raise request
  |+------+|   |       +--+|     |       +--+|   signal on bus
  |        |   |   data|  ||     |   data|  ||
  |+------+|   |       +--+|     |       +--+|   when CPU responds with
  || ISR  ||   | status|  ||     | status|  ||   an acknowledgement, the
  |+------+|   |       +--+|     |       +--+|   controller places some
  |        |   |command|  ||     |command|  ||   type of identification
  |+------+|   |       +--+|     |       +--+|   on the bus
  || int. ||   +-----------+     +-----------+
  ||vector||         |                 |
  ||table ||   +-----------+     +-----------+
  |+------+|   |  device   |     |  device   |
  +--------+   +-----------+     +-----------+

we rely on an external interrupt from the controller to signal that the
  device is ready (i.e., that the previous I/O operation is complete); this
  will cause the currently executing program to stop and the processor to
  enter the OS and start executing an interrupt service routine (ISR) -
  sometimes called an interrupt handler (IH)

there are also internal interrupts (sometimes called exceptions) for
  divide by zero, unaligned memory accesses, memory protection errors, etc.
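
a rough C sketch of the interrupt-driven output described above, run as a
  pure software simulation rather than on real hardware; the controller
  struct, the tick-based device model, and every name in it (output_isr,
  device_tick, run_interrupt_driven) are assumptions made for illustration;
  the point is only that the ISR hands over one byte per interrupt while
  the main loop is free to count "useful" work in between

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software model of a simple controller; not a real device. */
enum { STATUS_BUSY = 0, STATUS_READY = 1 };
enum { CMD_NONE = 0, CMD_WRITE = 1 };

struct controller {
    unsigned char data;      /* data register */
    int status;              /* status register */
    int command;             /* command register */
    int busy_ticks;          /* ticks left until the device is ready again */
    unsigned char sink[64];  /* bytes the simulated device has consumed */
    size_t sunk;
};

struct transfer {
    const unsigned char *addr;  /* next byte in the memory buffer */
    size_t count;               /* bytes still to send */
};

/* ISR: runs once per "device ready" interrupt and hands over one byte. */
static void output_isr(struct controller *c, struct transfer *t)
{
    if (t->count == 0)
        return;                 /* nothing left: transfer is complete */
    c->data = *t->addr++;
    t->count--;
    c->command = CMD_WRITE;
}

/* Advance the simulated device by one tick.
   Returns 1 when the device becomes ready and raises an interrupt. */
static int device_tick(struct controller *c, int ticks_per_byte)
{
    if (c->command == CMD_WRITE) {          /* accept the byte, go busy */
        c->sink[c->sunk++] = c->data;
        c->command = CMD_NONE;
        c->status = STATUS_BUSY;
        c->busy_ticks = ticks_per_byte;
        return 0;
    }
    if (c->status == STATUS_BUSY && --c->busy_ticks == 0) {
        c->status = STATUS_READY;
        return 1;                           /* interrupt request */
    }
    return 0;
}

/* Send n bytes; returns how many ticks the CPU had free for other work. */
static long run_interrupt_driven(struct controller *c,
                                 const unsigned char *buf, size_t n,
                                 int ticks_per_byte)
{
    struct transfer t = { buf, n };
    long useful = 0;

    assert(n <= sizeof c->sink && ticks_per_byte > 0);
    c->status = STATUS_READY;
    c->command = CMD_NONE;
    c->busy_ticks = 0;
    c->sunk = 0;

    output_isr(c, &t);                      /* start the first byte */
    while (c->sunk < n || c->status == STATUS_BUSY) {
        if (device_tick(c, ticks_per_byte))
            output_isr(c, &t);              /* "interrupt": run the ISR */
        else
            useful++;                       /* CPU does something else */
    }
    return useful;
}
```

  e.g., calling run_interrupt_driven(&c, msg, 3, 4) on a 3-byte buffer
  leaves the three bytes in c.sink; in this toy model only the ticks on
  which an interrupt arrives are spent in the ISR, in contrast to the
  busy-wait loop where every iteration is spent polling
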

moreover, to protect the OS, calls to the OS must be made by a special
  instruction that causes an interrupt - called SVC (supervisor call) on
  IBM mainframes, INT on x86, and trap on SPARC

an interrupt must save a return address and information on the processor
  state to allow the interrupted program to be resumed later => save the
  program counter (PC) and processor state register (PSR)

an interrupt switches execution mode to an OS-only mode by changing a mode
  bit (or bits) in the PSR

there are typically interrupt control bits in the controller's command
  register, and interrupt enable bits (either a priority level or a bit
  mask) in the PSR - the processor typically disables interrupts (at least
  at that level and lower) whenever an ISR starts

the entry point address to the interrupt service routine (ISR) is
  typically provided by a table of such addresses in low memory; for I/O
  the entry is chosen according to the interrupt code placed on bus by
  controller

    +------------------------+                \
  0 | addr of ISR for type 0 |-------------.   |
    +------------------------+             |   |
  4 | addr of ISR for type 1 |----------.  |   |  interrupt vector
    +------------------------+          |  |   |  table (IVT)
  8 | addr of ISR for type 2 |-------.  |  |   |
    +------------------------+       |  |  |   |
  c | addr of ISR for type 3 |----.  v  v  v   |
    +------------------------+    |           /
    |          ...           |    |
    |          ...           |    |
    +------------------------+    |           \
    | code for type 3 int    |<---'            |  interrupt
    |          ...           |                 |  service
    | return from interrupt  |                 |  routine (ISR)
    +------------------------+                /
    |          ...           |

a special return from interrupt instruction at end of ISR switches back to
  previous processor state and restores saved PC

the fetch-execute cycle is extended to check for interrupts after each
  instruction - the hardware response to an interrupt acts like a
  procedure call if an interrupt is requested by a device and the CPU has
  interrupts enabled

note that the ISR is a software routine - and that instructions in the ISR
  are fetched, decoded, and executed, just like any other program

  PC  - program counter, contains address of next instruction
  PSR - processor status register, contains:
        * processor execution mode (kernel/user)
        * interrupt enable/permission (can be single bit, mask, or
            priority code)
        * condition codes
  IVT - interrupt vector table, contains entry-point addresses for ISRs
  ISR - interrupt service routine

         +-------+
  .----->| fetch |
  |      +-------+
  |          v
  |      .------.
  |     < decode >
  |      `------'
  |          |
  |       +--+---------+------------+------- ... -------------+
  |       v            v            v                         v
  |  +---------+  +---------+  +---------+        +-----------------------+
  |  | execute |  | execute |  | execute |  ...   | return from interrupt |
  |  |  load   |  |   add   |  |  store  |        | (restore PC and PSR)  |
  |  +---------+  +---------+  +---------+        +-----------------------+
  |       v            v            v                         v
  |       +--+---------+------------+------- ... -------------+
  |          |
  |     .-------------------------------------------------.
  |    < if interrupt requested and interrupts are enabled >
  |  no `-------------------------------------------------'
  +<--------' | yes
  ^           v
  | +--------------------------------------------------+
  | | 1) save PC and PSR                               |
  | | 2) switch execution mode to kernel (OS-only)     |
  | | 3) disable/restrict further interrupts           |
  | | 4) load new PC from IVT (interrupt vector table) |
  | +--------------------------------------------------+
  |              |
  `--------------'


nested interrupts

if the PC and PSR are saved on a stack (or in a set of registers), a high
  priority device can interrupt the execution of the ISR for a lower
  priority device

                              .----------.
                              | disk ISR |
                              `----------|
                .-------------^          v-------------.
                | printer ISR |          | printer ISR |
                `-------------|          `-------------|
  --------------^             ^                        v--------------
   user program |             |                        |user program
  --------------|             |                        `--------------
                ^             |
                |             |         rti           rti
             printer         disk
            interrupt     interrupt

otherwise, the second interrupt is held pending until the first ISR
  finishes and executes its rti instruction; at that point, the rti
  briefly reestablishes user mode with interrupts enabled but immediately
  the highest-priority pending interrupt is accepted

                .--------------------------..----------.
                |       printer ISR        || disk ISR |
                `--------------------------|`----------|
  --------------^                          |^          v--------------
   user program |                          ||          |user program
  --------------|                          v|          `--------------
                ^                           |
                |                        rti|         rti
                |             |             |
                              ^^^^^^^^^^^^^^*
             printer              disk
            interrupt           interrupt


on SPARC, interrupts are called traps

  256 trap types, half software and half hardware

  invoke OS by trap instruction

  trap enable bit (ET) and 4-bit interrupt level (PIL) in PSR
    - a synchronous trap is accepted only if the ET bit is set
    - an external interrupt is accepted only if the ET bit is set and
        priority level of the interrupt is greater than current PIL

  TBR - trap base register with leading 20 bits set by OS, an 8-bit field
    supplied internally based on trap type or externally from interrupting
    device, and four zeros

  trap sequence
    1. ET bit in PSR cleared so that further traps are disabled
    2. S bit saved into PS bit and S bit set => places processor in
       supervisor mode
    3. CWP incremented to give trap handler new set of local registers
       (you can only use the eight local registers since the out
       registers are not guaranteed to be mapped to physical registers
       w/o window overflow processing; also the in registers and global
       registers might be in use by interrupted process)
    4. pc and npc are saved into %l1 and %l2
       (caveat: hyperSPARC manual differs in register assignment from
       textbook)
    5. trap type field placed into TBR and pc = TBR, npc = TBR+4


DMA - direct memory access - extra registers and logic in the controller
  allow it to transfer a whole block of bytes without CPU involvement,
  interrupts the CPU after completion or after an error
  - address register - address of buffer in main memory, controller
      increments
  - count register - length of block, controller decrements

           +-----+    INTA
           | CPU |----------------.
           +-----+<-------------. |
           |cache|    INTR      | |
           +-----+              | |
              |                 | |
  +===+=======+=====+===========|=|=====+  bus
      |             |           | |
      |             |           | v
  +--------+     +----------------------+   DMA controller can be bus master,
  | memory |     | DMA controller       |   so need to arbitrate for bus among
  |        |     |       +-------------+|   DMA controllers and CPU
  |+------+<-.   |   data|             ||
  || I/O  || |   |       +-------------+|   extra logic in controller to
  ||buffer|| |   | status|             ||   implement loop:
  |+------+| |   |       +-------------+|
  |        | |   |command|             ||     while( count > 0 ){
  |        | |   |       +-------------+|       transfer byte at address;
  |        | `----address| & buffer    ||       address++;
  |        |     |       +-------------+|       count--;
  |        |     |  count| # in buffer ||     }
  |        |     |       +-------------+|
  +--------+     +----------------------+   interrupt once at end of buffer
                            |
                      +-----------+
                      |  device   |
                      +-----------+


  I/O methods            CPU involvement          # interrupts
  ---------------        ---------------          ------------
  programmed I/O         completely dedicated     none
                         to the transfer

  interrupt-driven I/O   transfers each byte      after each byte

  DMA I/O                initially loads the      one, at end of block
                         address and count
                         registers, gives
                         transfer command


effect of offloading CPU

  do{ byte = *addr;                                         \
      while( status != READY );  <= interrupt-driven I/O     |  complete block
      io_device_data = byte;        relieves CPU of          |  transfer is
      io_device_command = WRITE;    busy-wait loop           |  offloaded onto
      addr++;                                                |  DMA controller
      count--;                                               |
  }while( count > 0 );                                      /


further offloading of I/O from the CPU

mainframe channel - provides for transfers of multiple blocks by
  traversing linked-list-like channel programs, interrupt only when end of
  channel program reached (or on error)

  a channel has the equivalent of a program counter

  +---------+    each channel instruction is
  | channel |    called a "channel command word"
  | +----+  |    +-----+---------+-------+------+      each CCW has
  | | pc ------->| r/w | address | count | next --.    address and
  | +----+  |    +-----+---------+-------+------+ |    count fields
  +---------+                                     |    to support a
                 .--------------------------------'    block transfer
                 v
                 +-----+---------+-------+------+      additional
                 | r/w | address | count | next --.    fields indicate
                 +-----+---------+-------+------+ |    end of physical
                                                  v    block, etc.

  CCWs can also provide scatter-gather

    scatter - read data from a single physical block on an I/O device and
      send different parts to multiple, non-contiguous I/O buffers in
      memory

    gather - read data from multiple, non-contiguous I/O buffers in memory
      and write a single physical block to the device

I/O processors - offload I/O conversion, editing, etc., e.g., I^2O =
  "intelligent I/O", Intel i960 running its own real-time OS
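
a minimal C sketch of how a channel program with scatter-gather might be
  walked, assuming a simplified CCW with only the four fields shown in the
  diagram above (r/w, address, count, next); real channel command words are
  packed hardware formats, and the struct layout and the function names
  channel_gather and channel_scatter are hypothetical, chosen only to model
  the linked-list traversal in software

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum ccw_op { CCW_READ, CCW_WRITE };

/* Hypothetical CCW model following the diagram: r/w, address, count, next. */
struct ccw {
    enum ccw_op op;
    unsigned char *address;  /* I/O buffer in memory */
    size_t count;            /* length of that buffer */
    struct ccw *next;        /* NULL marks the end of the channel program */
};

/* Gather: walk a chain of write CCWs and pack their non-contiguous memory
   buffers into one physical block for the device. Returns bytes packed;
   stops early if the next buffer would not fit in the block. */
static size_t channel_gather(const struct ccw *pc,
                             unsigned char *block, size_t block_size)
{
    size_t used = 0;
    for (; pc != NULL; pc = pc->next) {
        assert(pc->op == CCW_WRITE);
        if (pc->count > block_size - used)
            break;                        /* treat as an error condition */
        memcpy(block + used, pc->address, pc->count);
        used += pc->count;
    }
    return used;   /* the channel would interrupt the CPU once, here */
}

/* Scatter: walk a chain of read CCWs and spread one physical block from
   the device across their memory buffers. Returns bytes delivered. */
static size_t channel_scatter(const struct ccw *pc,
                              const unsigned char *block, size_t block_size)
{
    size_t used = 0;
    for (; pc != NULL && used < block_size; pc = pc->next) {
        size_t n = pc->count;
        assert(pc->op == CCW_READ);
        if (n > block_size - used)
            n = block_size - used;        /* last buffer may be part full */
        memcpy(pc->address, block + used, n);
        used += n;
    }
    return used;
}
```

  e.g., two write CCWs with counts 3 and 2 gather into a single 5-byte
  block, and two read CCWs with counts 2 and 3 scatter that block back
  into separate buffers; the CPU only builds the chain, and the loop over
  pc->next stands in for the channel's own program counter
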