Files are used to store data on a disk drive. Without files, any data we typed in would be lost as soon as we shut down the computer. An easy way to create a data file is to use a text editor (e.g. Notepad) and type the data, like this bank-account file:
Arnie |
This is a sequential text file. It is called a text file because it is clearly readable, containing no special formatting or strange characters. It is sequential because a computer must read all the data in order when searching for something - there is no way to "jump around" in the file. Although this makes sequential text files inefficient, they are very flexible - you can type anything you want using any format. For example, there is no restriction on the size of a name or the size of a number. The example above is alphabetical, so a person looking for "Yoyo" would jump to the end. But if the file were not alphabetical, and contained all the words in a novel (in order), even a person would need to search sequentially to find a specific word. In a text file, computers always read sequentially, because the lines have different lengths and there is no way to calculate where a particular item starts.
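A sequential search can be sketched in Java as follows. This is only a minimal illustration: the class name, the method name and the sample amounts are made up for the example, and the data is read from a String instead of a real file so that the sketch runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class SequentialSearch {
    // Reads line after line until one starts with the given name.
    // Returns the line number (counting from 1), or -1 if nothing matches.
    static int findLine(BufferedReader reader, String name) throws IOException {
        int lineNumber = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            if (line.startsWith(name)) {
                return lineNumber;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data standing in for a text file on disk.
        String data = "Arnie 100.00\nErnie 250.50\nYoyo 75.25\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(data))) {
            System.out.println(findLine(reader, "Yoyo")); // prints 3
        }
    }
}
```

To find "Yoyo" the program has to read every line in front of it, which is exactly the cost of sequential access.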
If we are willing to use clear, inflexible structures in our files, we can make it possible for the computer to jump around in the file by counting bytes.
We break the file up into records - one for each customer. Then we decide what the maximum size of a record should be. For example, we can allocate 30 bytes (characters) for each name and 20 bytes for each number, giving each record a maximum size of 50 bytes.
Each record is divided up into fields - in this case, a NAME field of 30 bytes and a MONEY field of 20 bytes.
After making these decisions, we can say exactly where the 750th record in the file starts. It comes directly after the first 749 records, so (counting bytes from 0) it starts at:
==> 749 * 50 = 37450 bytes
Assuming the computer can count bytes, it can jump to byte #37450 and then read that record, without reading the first 749 records. This won't work in a text file, where the records and fields are variable length. It only works if we use fixed length fields and records.
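The byte arithmetic can be sketched in Java. This is a minimal illustration, assuming the 30 byte NAME field and 20 byte MONEY field from the example, and assuming that records and bytes are both counted from 0 (so index 749 is the 750th record in the file). The class and method names are hypothetical.

```java
class RecordOffsets {
    static final int NAME_SIZE = 30;                        // bytes in the NAME field
    static final int MONEY_SIZE = 20;                       // bytes in the MONEY field
    static final int RECORD_SIZE = NAME_SIZE + MONEY_SIZE;  // 50 bytes per record

    // Byte position where the record with this index begins (index counts from 0).
    static long recordStart(long index) {
        return index * RECORD_SIZE;
    }

    // Byte position where the MONEY field of that record begins.
    static long moneyStart(long index) {
        return recordStart(index) + NAME_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(recordStart(0));    // prints 0
        System.out.println(recordStart(749));  // prints 37450
        System.out.println(moneyStart(749));   // prints 37480
    }
}
```

Because every record has the same size, the position is found with one multiplication, no matter how large the file is.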
Jumping around in the file is potentially more efficient than reading sequentially. In addition to this efficiency in reading data, it enables us to write data much more efficiently. For example, to change Ernie's money, the computer can jump directly to that spot in the file and write a new number there. In a text file, the simplest way to change data is to copy all the data into arrays in the computer's memory, change an item, and then write all the data back onto the disk.
The increased efficiency of Random Access becomes more important with large amounts of data. For example, there are 80 million people in Germany and probably an equal number of telephones. So the telephone customer database contains 80 million records. If one piece of data changes, copying 80 million records could require 50 x 80 million bytes, or approximately 4000 megabytes (4 GB) of memory. That might not fit into the memory at all, making it impossible to change the file if it is supposed to all be in an array at once.
Java provides the RandomAccessFile class for creating and manipulating random-access data files. It does not specify the size of a record or the number of fields - this is controlled by the program.
SEEK
The most important command is seek, which jumps around in the file. Despite the name, this is not a search command. It simply jumps to a specific position in the file. This is not possible in a text file.
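As a rough sketch of what seek does, the following program stores a few double values (8 bytes each) and then jumps straight to one of them. The class name, helper methods, temporary file and sample values are assumptions made for this illustration; only RandomAccessFile and its seek, readDouble, writeDouble and length methods come from the Java library.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class SeekDemo {
    static final int VALUE_SIZE = 8; // bytes written by writeDouble

    // Writes the values one after another, starting at byte 0.
    static void writeAll(RandomAccessFile file, double[] values) throws IOException {
        file.seek(0);
        for (double value : values) {
            file.writeDouble(value);
        }
    }

    // Jumps straight to the value at the given index (counting from 0) and reads it.
    static double readAt(RandomAccessFile file, int index) throws IOException {
        file.seek((long) index * VALUE_SIZE);
        return file.readDouble();
    }

    // Jumps to one value and overwrites it; the other values are not touched.
    static void writeAt(RandomAccessFile file, int index, double value) throws IOException {
        file.seek((long) index * VALUE_SIZE);
        file.writeDouble(value);
    }

    public static void main(String[] args) throws IOException {
        File temp = File.createTempFile("seekdemo", ".dat");
        temp.deleteOnExit();
        try (RandomAccessFile file = new RandomAccessFile(temp, "rw")) {
            writeAll(file, new double[] {10.0, 20.0, 30.0, 40.0});
            System.out.println(readAt(file, 2));  // prints 30.0
            writeAt(file, 1, 99.5);
            System.out.println(readAt(file, 1));  // prints 99.5
            System.out.println(file.length());    // prints 32
        }
    }
}
```

Note that seek takes a byte position, not a record number, so the program has to do the multiplication itself.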
READ and WRITE UTF
The input and output commands are .readUTF() and .writeUTF(String) . UTF stands for Unicode Transformation Format. If you want lots of details, try this link: http://en.wikipedia.org/wiki/UTF-8 . Java's versions of these commands use a slightly modified form of UTF-8 and store a 2 byte length in front of each String. UTF supports Unicode characters in a standard way, so lots of computer software can read and write UTF successfully.
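A small round trip shows the two commands working together. This is a sketch only: the class name, the helper method and the temporary file are assumptions for the example. It also prints the file position afterwards, which shows that writeUTF stored 2 extra bytes in front of the 5 characters.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class UtfRoundTrip {
    // Writes the text at the current position, jumps back, and reads it again.
    static String writeThenRead(RandomAccessFile file, String text) throws IOException {
        long start = file.getFilePointer();
        file.writeUTF(text);
        file.seek(start);
        return file.readUTF();
    }

    public static void main(String[] args) throws IOException {
        File temp = File.createTempFile("utfdemo", ".dat");
        temp.deleteOnExit();
        try (RandomAccessFile file = new RandomAccessFile(temp, "rw")) {
            System.out.println(writeThenRead(file, "Arnie")); // prints Arnie
            System.out.println(file.getFilePointer());        // prints 7
        }
    }
}
```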
FIELD SIZES
Since the file has fixed size fields, the program must control the size of the data before writing. If a program writes 50 characters into a 30 character field, the extra characters spill over into the next field or the next record and overwrite whatever was stored there. The RandomAccessFile methods will let you make this mistake - they do not check the size of the data. So your program must check data size before writing.
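One possible way to do this check is to encode the String in memory first and count the bytes before anything is written to the file. The sketch below assumes this approach; the class and method names are hypothetical. It relies on DataOutputStream.writeUTF, which produces the same bytes as RandomAccessFile.writeUTF.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

class FieldGuard {
    // Number of bytes that writeUTF would store for this text (length prefix included).
    static int encodedSize(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeUTF(text);
        out.flush();
        return buffer.size();
    }

    // True if the text fits into a field of the given size in bytes.
    static boolean fits(String text, int fieldSize) throws IOException {
        return encodedSize(text) <= fieldSize;
    }

    // Writes the text at the given position, but only if it fits into the field.
    static void writeField(RandomAccessFile file, long position, String text, int fieldSize)
            throws IOException {
        if (!fits(text, fieldSize)) {
            throw new IllegalArgumentException("Text needs more than " + fieldSize + " bytes");
        }
        file.seek(position);
        file.writeUTF(text);
    }

    public static void main(String[] args) throws IOException {
        String tooLong = new String(new char[50]).replace('\0', 'x'); // 50 characters
        System.out.println(fits("Arnie", 30));  // prints true
        System.out.println(fits(tooLong, 30));  // prints false
    }
}
```

Rejecting the data is only one option. A program could also cut the text down to size, as long as it does so before calling writeUTF.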
SEEK FIRST
Programs should always seek to a specific position before reading or writing. If the program is writing two fields - name and money - into the file, there should be two seek commands, one before each write command. This is shown in the following sample program.
Here is a sample program that writes and reads bank data in a RandomAccessFile.
//== Create a RandomAccessFile ==
file.close();
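The seek-before-every-field pattern can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the class and method names are hypothetical, the name field is assumed to hold up to 40 plain English characters in 42 bytes, and the money is assumed to be stored with writeDouble (8 bytes), giving 50 bytes per record.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class BankFileSketch {
    static final int MAX_NAME_CHARS = 40;                   // longest name allowed
    static final int NAME_SIZE = MAX_NAME_CHARS + 2;        // 42 bytes (2 byte UTF prefix)
    static final int MONEY_SIZE = 8;                        // assumption: writeDouble
    static final int RECORD_SIZE = NAME_SIZE + MONEY_SIZE;  // 50 bytes per record

    // Writes both fields of one record, seeking before each field.
    static void writeRecord(RandomAccessFile file, int index, String name, double money)
            throws IOException {
        if (name.length() > MAX_NAME_CHARS) {
            throw new IllegalArgumentException("Name is longer than " + MAX_NAME_CHARS);
        }
        file.seek((long) index * RECORD_SIZE);
        file.writeUTF(name);
        writeMoney(file, index, money);
    }

    // Changes only the money field of one record.
    static void writeMoney(RandomAccessFile file, int index, double money) throws IOException {
        file.seek((long) index * RECORD_SIZE + NAME_SIZE);
        file.writeDouble(money);
    }

    static String readName(RandomAccessFile file, int index) throws IOException {
        file.seek((long) index * RECORD_SIZE);
        return file.readUTF();
    }

    static double readMoney(RandomAccessFile file, int index) throws IOException {
        file.seek((long) index * RECORD_SIZE + NAME_SIZE);
        return file.readDouble();
    }

    public static void main(String[] args) throws IOException {
        File temp = File.createTempFile("bank", ".dat");
        temp.deleteOnExit();
        try (RandomAccessFile file = new RandomAccessFile(temp, "rw")) {
            writeRecord(file, 0, "Arnie", 100.0);
            writeRecord(file, 1, "Ernie", 250.5);
            writeRecord(file, 2, "Yoyo", 75.25);
            writeMoney(file, 1, 300.0); // change Ernie's money in place
            System.out.println(readName(file, 1) + " " + readMoney(file, 1)); // prints Ernie 300.0
            System.out.println(file.length()); // prints 150
        }
    }
}
```

Seeking before the money field matters because a short name like "Arnie" only fills 7 of the 42 bytes. Without the second seek the money would land directly after the name instead of at its fixed position.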
Notice the following details:
Counting the bytes in a RandomAccessFile is tricky. The arithmetic is not so difficult (see SEEK above). The problem is knowing exactly how many bytes are actually used by various data types. The following chart shows the .write commands and the corresponding number of bytes required.
write command | bytes occupied |
.writeInt(int) | 4 |
.writeDouble(double) | 8 |
.writeChar(char) | 2 (this is not UTF) |
.writeLong(long) | 8 |
.writeByte(byte) | 1 |
.writeFloat(float) | 4 |
.writeBoolean(boolean) | 1 |
.writeUTF(String) | String.length() + 2 bytes *** |
In the sample Bank program, the name field is limited to 40 characters. But the program allows 42 bytes in the file. UTF Strings are written with a 2 byte prefix that tells how long the String is. So the UTF String actually occupies 42 bytes instead of 40.
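One way to check the byte counts in the chart is to measure how far the file position moves during each write command. The sketch below does this with getFilePointer; the class name, the WriteCommand interface and the sample values are assumptions made for the illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class ByteCounts {
    // A write command to be measured (hypothetical helper interface).
    interface WriteCommand {
        void run(RandomAccessFile file) throws IOException;
    }

    // Returns how far the file position moved while the command ran.
    static long measure(RandomAccessFile file, WriteCommand command) throws IOException {
        long before = file.getFilePointer();
        command.run(file);
        return file.getFilePointer() - before;
    }

    public static void main(String[] args) throws IOException {
        File temp = File.createTempFile("bytecounts", ".dat");
        temp.deleteOnExit();
        try (RandomAccessFile file = new RandomAccessFile(temp, "rw")) {
            System.out.println(measure(file, f -> f.writeInt(7)));         // prints 4
            System.out.println(measure(file, f -> f.writeDouble(2.5)));    // prints 8
            System.out.println(measure(file, f -> f.writeChar('A')));      // prints 2
            System.out.println(measure(file, f -> f.writeLong(7L)));       // prints 8
            System.out.println(measure(file, f -> f.writeByte(7)));        // prints 1
            System.out.println(measure(file, f -> f.writeFloat(2.5f)));    // prints 4
            System.out.println(measure(file, f -> f.writeBoolean(true)));  // prints 1
            System.out.println(measure(file, f -> f.writeUTF("Arnie")));   // prints 7
        }
    }
}
```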
*** Calculating UTF storage space is actually more complex. UTF does not always use 1 byte per character - it uses 1, 2, or 3 bytes per character, depending on the character. "Normal" English characters (those with ASCII codes below 128) require one byte per character. So the calculation above is fine as long as you have pure English language data. If the text might contain some Greek letters or special math symbols, then these characters will take more than one byte of storage. If you are unsure and you don't mind wasting disk storage space, allocate 3 bytes for every character (plus the 2 byte prefix), and you won't have any problems.
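The calculation can be sketched in Java. The byte ranges used here are the ones documented for Java's writeUTF format; the class and method names are hypothetical, and the Greek letter and euro sign are just sample characters chosen for the illustration.

```java
class UtfSize {
    // Bytes that writeUTF stores for the text: a 2 byte prefix plus 1, 2 or 3 bytes per character.
    static int utfBytes(String text) {
        int bytes = 2; // length prefix
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                bytes += 1;   // normal English characters
            } else if (c <= 0x07FF) {
                bytes += 2;   // e.g. Greek letters (and the character with code 0)
            } else {
                bytes += 3;   // everything else
            }
        }
        return bytes;
    }

    // Safe field size for any text with at most maxChars characters.
    static int worstCase(int maxChars) {
        return 2 + 3 * maxChars;
    }

    public static void main(String[] args) {
        System.out.println(utfBytes("Arnie"));   // prints 7
        System.out.println(utfBytes("\u03C0"));  // Greek letter pi, prints 4
        System.out.println(utfBytes("\u20AC"));  // euro sign, prints 5
        System.out.println(worstCase(20));       // prints 62
    }
}
```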
In general, there is nothing wrong with allocating a bit of extra space. For example, if you are writing a 20 character String, an int and a double, you calculate:
==> (20 + 2) + 4 + 8 = 34
You can allocate 40 bytes per record (or even 50), in case you miscounted.
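Adding up the sizes from the chart for such a record can be sketched as a set of constants. The class name, field names and the choice of 40 bytes per record are assumptions for this illustration, and the String is assumed to contain only one-byte characters.

```java
class PaddedLayout {
    static final int TEXT_SIZE = 20 + 2;   // 20 one-byte characters plus the 2 byte UTF prefix
    static final int INT_SIZE = 4;         // writeInt
    static final int DOUBLE_SIZE = 8;      // writeDouble
    static final int USED = TEXT_SIZE + INT_SIZE + DOUBLE_SIZE; // bytes actually needed
    static final int RECORD_SIZE = 40;     // bytes allocated per record
    static final int SPARE = RECORD_SIZE - USED;                // unused bytes per record

    // Byte position where each field of a record starts (index counts from 0).
    static long textStart(int index) {
        return (long) index * RECORD_SIZE;
    }

    static long intStart(int index) {
        return textStart(index) + TEXT_SIZE;
    }

    static long doubleStart(int index) {
        return intStart(index) + INT_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(USED);            // prints 34
        System.out.println(SPARE);           // prints 6
        System.out.println(doubleStart(1));  // prints 66
    }
}
```

The spare bytes are never read or written. They only make sure that a small counting mistake does not push one record into the next.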