Data Storage

Variables - Easy as 1,2,3...

Early computers (1940s) were used only for calculations.  They could be programmed to do complex calculations, including some logical operations (e.g. if a number is negative, don't try to calculate the square root).  But the only data stored in the computer was numerical - no pictures or text.

Three types of numbers were needed:  integers (whole numbers), fixed-point decimal (like money that always has 2 decimal places), and floating point (also called scientific notation, with a mantissa and an exponent).  In modern computers, fixed-point is usually ignored, as floating point is "good enough".  This leads to round-off errors and salami-slice scams, where the small round-off errors from millions of calculations are collected into a single bank account, creating a sizable illegal profit.
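
A small Java sketch illustrates why "good enough" floating point is risky for money (the class name RoundOff is just for illustration; BigDecimal here stands in for fixed-point decimal):

    // RoundOff.java - double is binary floating point, so many decimal
    // fractions cannot be stored exactly and small errors creep in.
    import java.math.BigDecimal;

    public class RoundOff {
        public static void main(String[] args) {
            double d = 0.10 + 0.20;
            System.out.println(d);                 // prints 0.30000000000000004
            // BigDecimal behaves like fixed-point decimal: exact to 2 places
            BigDecimal a = new BigDecimal("0.10");
            BigDecimal b = new BigDecimal("0.20");
            System.out.println(a.add(b));          // prints 0.30
        }
    }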

Integers are generally stored in binary using two's complement notation, where the first bit has a large negative value, and all other bits are positive powers of 2.
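
For example, in an 8-bit two's complement number the first bit is worth -128 and the remaining bits are worth +64, +32, ... +1.  A short illustrative Java sketch (the class name is made up):

    // TwosComplement.java - the bit pattern of a negative byte
    public class TwosComplement {
        public static void main(String[] args) {
            byte b = -5;
            // In 8 bits, -5 is 11111011: -128 + 64 + 32 + 16 + 8 + 2 + 1 = -5
            System.out.println(Integer.toBinaryString(b & 0xFF));  // prints 11111011
        }
    }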

Floating point is generally stored with a long binary mantissa representing an integer value, and a shorter exponent used to adjust the decimal point appropriately.
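
One common concrete format is IEEE 754, which Java's float uses: a 32-bit value holding a 1-bit sign, an 8-bit exponent, and a 23-bit mantissa.  A sketch of pulling those fields apart (class and variable names are illustrative):

    // FloatBits.java - extract the sign, exponent, and mantissa fields of a float
    public class FloatBits {
        public static void main(String[] args) {
            int bits = Float.floatToIntBits(6.25f);
            int sign     = bits >>> 31;           // 1 bit
            int exponent = (bits >>> 23) & 0xFF;  // 8 bits (stored with a bias of 127)
            int mantissa = bits & 0x7FFFFF;       // 23 bits (with an implicit leading 1)
            System.out.println(sign + " " + exponent + " " + mantissa);
        }
    }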

Stored Programs

Early computers had Random Access Memory (RAM) built from vacuum tubes - this is called primary memory.  The Central Processing Unit (CPU) can save and retrieve values from RAM very quickly, and in any order (that is the "random" part).  Early computers had NO secondary (permanent) memory.  So the computer could run a program and do computations as long as the power stayed on - when the power went off, or a vacuum tube burned out (every couple of hours), the computer would crash - then repairs were made and the computer was restarted.  Under these circumstances, programs were generally hard-wired (using real copper wires), so the programs were not lost when the computer crashed.

John von Neumann is called the "father" of the modern electronic computer, and is credited with several significant ideas.  One of these was the stored program - the idea that a program could be represented by numbers and stored in RAM, just like all the numeric data.  This meant all the wires would no longer be necessary.  Then a machine with a CPU could be built that could execute various programs, without rewiring.  The stored-program concept significantly increased productivity and lowered costs.  But the stored-programs were no longer permanent, so external storage devices (disks, tapes, drums(?)) were crucial to the success of this concept.

Central Processing Unit (CPU)

The CPU in a modern computer is a microprocessor.  It contains registers - single-byte or multi-byte memory locations located directly in the CPU.  They are used to store individual values used in calculations.  The Intel 8088 processor in the original IBM PC (a member of the 8086 family) contained four 16-bit general-purpose registers named AX, BX, CX, and DX.  These could be used for addition, subtraction, multiplication, or division.

Higher Level Languages

A CPU can only execute commands stored as simple numbers - this is called machine language.  The CPU doesn't know anything about Java commands or Strings.  People aren't good at reading or writing machine code numbers.  So higher-level languages were invented, with commands that people can understand.  First, a simple translator (an assembler) was used to convert assembly language into machine language by a 1-to-1 mapping.  But the basic concepts of assembly/machine language are too primitive, requiring too many steps to perform simple operations.  FORTRAN (FORmula TRANslation) was invented in 1954 to allow "anyone" (well, at least engineers and scientists) to write their own programs.

The following example shows machine and assembly language code that loads 5 into the accumulator, multiplies by 10, and then stores the answer in memory location 24.  In FORTRAN, this might be accomplished with a single simpler command like:  X = 5 * 10

      Machine Language      <===      Assembly Language   <===    FORTRAN

    00101100   01011010                LOD   ACC , 5
    00001011   00001010                MUL   10                                X  =  5 * 10
    10101111   00011000                STO    [24]

The purpose of a higher-level language compiler is to translate the human-readable FORTRAN commands into executable machine code.  Many early compilers produced assembly language code, which was then assembled into machine language in a separate step.

More and more sophisticated programming languages have evolved.  Java is one of the latest developments.  Java's claims to fame are that it is: (1) Object-Oriented and (2) web-enabled.  Java doesn't compile all the way to machine language.  The Java compiler produces bytecode (similar to the older idea of "p-code"), which is then executed by a Java Virtual Machine (JVM).  The JVM is a program that "pretends" to be a CPU.
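
For example (file and class names are illustrative), the standard JDK tools compile Hello.java into Hello.class, which contains bytecode that the JVM then executes:

    // Hello.java - "javac Hello.java" produces bytecode in Hello.class,
    // which is then run by the JVM with "java Hello"
    public class Hello {
        public static void main(String[] args) {
            System.out.println("Hello from the JVM");
        }
    }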

Arrays

Modern computers store lots and lots of data - 512 MegaBytes or a GigaByte.  That doesn't mean the RAM is always full, but it could be.  To keep all that data organized, we need much better data-structures than simple bytes.  

An early development in data-structures was the array - a numbered list of storage locations.  The entire RAM of a computer is actually one big ARRAY!  It may be treated as an array of bytes, meaning that every byte needs an address (index), or it might be organized into words (e.g. 32 bits), with an address for each word.  If there are too many addresses (e.g. 5 billion) then the address bus must be very wide to transmit the entire address.  This explains why many recent PCs were limited to 4 GB of memory: the address bus was only 32 bits wide and could only carry addresses up to about 4 billion.
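
A quick check of that limit (a small illustrative Java calculation):

    // AddressSpace.java - how much RAM a 32-bit address can reach
    public class AddressSpace {
        public static void main(String[] args) {
            long addresses = 1L << 32;                 // 2^32 = 4,294,967,296
            System.out.println(addresses + " bytes");  // about 4 GB
        }
    }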

A 2-dimensional array "looks" like a matrix (table), with rows and columns.  But inside the computer, it is actually stored in the memory in a simple linear array, like this example for a Tic-Tac-Toe board:
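
A minimal sketch of that layout in Java (the formula row*3 + col is the usual row-major mapping; the class and variable names are illustrative):

    // TicTacToe.java - a 3x3 board stored as one linear array of 9 cells
    public class TicTacToe {
        public static void main(String[] args) {
            char[] board = new char[9];      // one linear block of memory
            int row = 1, col = 2;            // the middle-right square
            board[row * 3 + col] = 'X';      // row-major mapping: index = row*3 + col
            System.out.println(board[5]);    // prints X  (1*3 + 2 = 5)
        }
    }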

So a 2-D array is actually no different from a linear array inside the machine - the programming language simply translates each (row, column) pair into a single index.

Encodings

Text, graphics, sounds, and other data-types are stored using data-structures and encoding schemes.  This does not mean "secret" codes, but rather numbering systems that can represent various media.

A simple and common encoding system is ASCII - the American Standard Code for Information Interchange.  This is a set of 1-byte values representing letters, digits, and punctuation marks.  The original ASCII system used only 7 bits to represent 128 different characters.  Later this was expanded to 8 bits for 256 characters.  More recently the Unicode system increased the code size to 16 bits (and beyond), permitting at least 65,536 different characters, covering Persian, Chinese, Russian, and many other languages.
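
In Java, for example, a char is a 16-bit Unicode value, and the first 128 codes match ASCII (an illustrative sketch; the class name is made up):

    // CharCodes.java - characters are just numbers inside the computer
    public class CharCodes {
        public static void main(String[] args) {
            char letter = 'A';
            System.out.println((int) letter);  // prints 65, the ASCII/Unicode code
            char beh = '\u0628';               // the Persian/Arabic letter "beh"
            System.out.println(beh);
        }
    }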

Graphics can be encoded using Red-Green-Blue intensities, by recording a byte for each color.  That takes care of a single pixel - a more complex system organizes the data for all the pixels in an image.  In a .bmp image, a simple 2-D array of pixel values represents the image.  In compressed formats like .gif and .jpg, the system is vastly more complex.
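
A sketch of packing one pixel's red, green, and blue bytes into a single int, the 0xRRGGBB layout used by Java's own image classes (names are illustrative):

    // Pixel.java - one byte each for red, green, and blue, packed into an int
    public class Pixel {
        public static void main(String[] args) {
            int r = 255, g = 128, b = 0;                      // a shade of orange
            int pixel = (r << 16) | (g << 8) | b;             // 0xRRGGBB layout
            System.out.println(Integer.toHexString(pixel));   // prints ff8000
        }
    }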

Sound can be recorded by sampling the signal, recording about 44,100 measurements per second (the CD sampling rate).  This creates a data stream that is replayed sequentially.  MP3 compression manages to reduce the basic data by a factor of about 10 by removing sounds that the human ear doesn't actually notice.  The reduced size of the data is attractive, but sophisticated compression and decompression algorithms are required.  Fortunately the decompression of MP3 files has been embedded into dedicated chips that are small and cheap and can be included in MP3 players, like the popular iPod.
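
A rough calculation shows why that factor of 10 matters (assumptions: CD-quality audio at 44,100 samples per second, 2 bytes per sample, 2 stereo channels, and a 3-minute song):

    // AudioSize.java - uncompressed CD-quality audio vs. a rough MP3 estimate
    public class AudioSize {
        public static void main(String[] args) {
            int bytesPerSecond = 44100 * 2 * 2;       // 176,400 bytes per second
            int songBytes = bytesPerSecond * 180;     // a 3-minute song
            System.out.println(songBytes / 1_000_000 + " MB uncompressed");  // about 31 MB
            System.out.println(songBytes / 10 / 1_000_000 + " MB as MP3");   // roughly 3 MB
        }
    }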

Stack

Arrays are not very flexible.  Array storage (and RAM) generally uses fixed locations for storing data.  For example, the keyboard buffer in MS-DOS was stored in a specific location, which could not be moved.  The Tic-Tac-Toe array above uses specific memory locations to store specific squares of the board.

A more flexible data-structure is a stack.  This is also a list of numbers, but the list can grow and shrink in size.  That allows variable amounts of storage to be assigned, and it is not necessary to use fixed positions in the list.  Stacks are based on the LIFO concept - Last In First Out. A stack is an example of an abstract data type (ADT).  It needs a semi-infinite supply of storage, plus methods for pushing new data into the stack, and for popping the most recent item back out of the stack.
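
A minimal sketch of a stack ADT in Java (a fixed-size array stands in for the "semi-infinite" storage; the class and method names are illustrative):

    // IntStack.java - a minimal LIFO stack of ints
    public class IntStack {
        private int[] data = new int[100];   // stand-in for "semi-infinite" storage
        private int top = 0;                 // number of items currently stored

        public void push(int value) { data[top++] = value; }
        public int pop()            { return data[--top]; }
        public boolean isEmpty()    { return top == 0; }

        public static void main(String[] args) {
            IntStack s = new IntStack();
            s.push(1); s.push(2); s.push(3);
            System.out.println(s.pop());   // prints 3 - Last In, First Out
            System.out.println(s.pop());   // prints 2
        }
    }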

You have seen stacks used for the following purposes:

- keeping track of method calls and returns in a program
- storing the parameters and local variables of recursive methods
- evaluating expressions written in postfix (Reverse Polish) notation

At a more basic level, CPU's have a stack pointer (SP) that keeps track of the last item in a system stack.  The stack is stored in RAM, starting in a specific location, but can grow and shrink by incrementing or decrementing the SP.  This provides a practically unlimited amount of storage for temporary data.  When a CPU executes a program, it has several common tasks that require a stack:

- remembering the return address each time a method (subroutine) is called
- saving register values that must be restored later
- passing parameters and storing local (temporary) variables

All the tasks above are LIFO problems.  For example, if method A calls method B and it calls method C, then the return commands must return from C back to B and then back to A.

Stacks are especially important for recursive methods, where each new call of the method to itself must store a new copy of all its parameters and temporary variables - these are all stored on the system stack.
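
For example, each call in this recursive factorial sketch gets its own copy of n in its own frame on the call stack (the class name is illustrative):

    // Factorial.java - each recursive call pushes a new frame (with its own n)
    // onto the system stack; the returns then pop the frames in LIFO order.
    public class Factorial {
        public static long factorial(int n) {
            if (n <= 1) return 1;              // base case: the deepest frame
            return n * factorial(n - 1);       // each call waits for the one below it
        }
        public static void main(String[] args) {
            System.out.println(factorial(5));  // prints 120
        }
    }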

Some early computer systems used Reverse Polish Notation (postfix).  This uses a stack to keep track of values.  When an operation is encountered, the last two values are popped from the stack and used for the calculation, and the answer is pushed back onto the stack.  That means no other storage is required - everything is stored in the stack.  Some very cheap, low-performance microprocessors can be constructed on this basis, running programs in a simple programming language with postfix notation - Forth is an example.  Forth never achieved great popularity, but was used in embedded processors in the 1970's and 1980's, when small processors were very simple and had little processing power.
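
A sketch of evaluating the postfix expression "5 10 *" with a stack in Java (the token list and class name are illustrative):

    // Postfix.java - evaluate "5 10 *" (postfix for 5 * 10) using a stack
    import java.util.ArrayDeque;
    import java.util.Deque;

    public class Postfix {
        public static void main(String[] args) {
            String[] tokens = { "5", "10", "*" };
            Deque<Integer> stack = new ArrayDeque<>();
            for (String t : tokens) {
                if (t.equals("*")) {
                    int b = stack.pop();             // pop the last two values...
                    int a = stack.pop();
                    stack.push(a * b);               // ...and push the answer back
                } else {
                    stack.push(Integer.parseInt(t)); // operands go straight onto the stack
                }
            }
            System.out.println(stack.pop());         // prints 50
        }
    }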

Memory Management

Modern Operating Systems (OS) have lots of memory (RAM) to take care of.  Memory must be allocated for applications when they are loaded into memory and executed.  Data-structures such as arrays occupy memory.  Graphics images, sounds, and documents use RAM for storage.  Memory is generally allocated in blocks.  That is, if a graphics image requires 500 KB of memory, the whole area is allocated at once, not one small piece at a time.  Similarly, programs are usually allocated a large block of memory, e.g. 10 MegaBytes.

Dynamic Structures

A simple OS like MS-DOS might have only a specific RAM area available for graphics storage.  But a modern OS like Windows (or Linux) needs to be more flexible, allocating memory whenever it is needed - and the allocated memory might reside anywhere in RAM.  For this to work, the OS uses pointers to keep track of where things are sitting in memory.  In Java, a pointer is called a reference.  In C++, the program can directly manipulate these pointers, adding numbers to them or changing them as desired.  But this is dangerous - there is no guarantee that a program which requested a memory allocation will ever release that memory, especially if a bug prevents the deallocation code from running.  This is called a "memory leak".  When C++ programs fail to release memory, they tend to claim more and more RAM as they continue running, until they cause a RAM shortage.  Then the computer slows down due to low resources (e.g. little free memory), and the user must reboot the PC to fix the problem.

Java implements a sophisticated memory-management system including automatic garbage collection.  That means that if a program stops using a memory block, that memory is automatically reclaimed as free memory.  The basic idea is simple - each memory block is reachable through one or more pointers (references).  Whenever the last reference to a block disappears (e.g. a variable is set to null or goes out of scope), the block becomes garbage and the Java Virtual Machine can reclaim it, eventually returning unused memory to the OS.  Java does a very good job with this, so Java programs rarely suffer from "memory leaks".
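
A sketch of what "the reference disappears" means in Java (nothing here forces a collection; it just makes the array unreachable; names are illustrative):

    // Garbage.java - once no reference points at the array, the garbage
    // collector is free to reclaim its memory automatically.
    public class Garbage {
        public static void main(String[] args) {
            int[] big = new int[1_000_000];  // roughly 4 MB of heap
            big = null;                      // the only reference is gone...
            // ...so the block is now garbage; the JVM may reclaim it whenever it likes.
            System.out.println("done");
        }
    }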

Queue

A queue is similar to a stack, but is a FIFO (First In First Out) structure.  This is appropriate for tasks like:

- buffering keystrokes typed at the keyboard until a program reads them
- holding print jobs that are waiting their turn at a printer
- scheduling processes that are waiting for the CPU

Managing a queue is a lot more complex than managing a stack, so queues aren't implemented directly at the CPU/machine language level.  

A queue must provide the following elements:

- storage for the items in the queue
- a method to add (enqueue) a new item at the back
- a method to remove (dequeue) the oldest item from the front
- some way of keeping track of where the front and back currently are
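
These elements can be sketched in Java using the standard ArrayDeque class (the job names are illustrative):

    // PrintQueue.java - a FIFO queue: the first job in is the first job out
    import java.util.ArrayDeque;
    import java.util.Queue;

    public class PrintQueue {
        public static void main(String[] args) {
            Queue<String> jobs = new ArrayDeque<>();
            jobs.add("report.doc");            // enqueue at the back
            jobs.add("photo.jpg");
            jobs.add("essay.txt");
            System.out.println(jobs.remove()); // dequeue from the front: report.doc
            System.out.println(jobs.remove()); // photo.jpg
        }
    }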
