Error Detection

When packets travel through the Internet or bytes travel around in the computer, sometimes errors occur. Then the byte or packet might arrive changed, containing incorrect values. A common cause is electromagnetic interference (EM noise), for example from a big electric motor like the one in a photocopier or a vacuum cleaner. Since there is lots and lots of data moving around all the time, it is important to prevent, detect, and handle these errors.

Parity Check for Bytes

A parity check is an error-detection mechanism. It consists of 1 extra bit sent along with each byte. It won't prevent or fix errors, but it does enable a receiver to detect that an error has occurred.

Even Parity

Even parity works like this:

count the number of 1-bits in a byte
if the number of 1-bits is odd, then send along a 1-bit with the byte
if the number of 1-bits is even, then send along a 0-bit with the byte
at the receiving end, count the total number of 1-bits (including the parity-bit)
if the total number of received 1-bits is odd, then an error has occurred
if the total number of received 1-bits is even, then assume no error has occurred

Examples:

Data Byte	Parity Bit	Data Received	Error?
00001111	0	00001111 , 0	Okay
11111111	0	11111110 , 0	Error
10101011	1	01010100 , 1	Okay
11110010	1	11110010 , 0	Error
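The rules and examples above can be tried out in a few lines of Python. This is only a minimal sketch: the function names even_parity_bit and even_parity_ok are made up for illustration, and real hardware does this with logic gates rather than software.

```python
def even_parity_bit(byte: int) -> int:
    """Sender side: return the bit that makes the total number of 1-bits even."""
    ones = bin(byte & 0xFF).count("1")
    return ones % 2  # 1 if the byte has an odd number of 1-bits, otherwise 0


def even_parity_ok(byte: int, parity_bit: int) -> bool:
    """Receiver side: the count of 1-bits, parity bit included, must be even."""
    ones = bin(byte & 0xFF).count("1") + parity_bit
    return ones % 2 == 0


# (byte sent, byte received, parity bit received), taken from the examples above
examples = [
    (0b00001111, 0b00001111, 0),
    (0b11111111, 0b11111110, 0),
    (0b10101011, 0b01010100, 1),
    (0b11110010, 0b11110010, 0),
]

for sent, received, parity_received in examples:
    verdict = "Okay" if even_parity_ok(received, parity_received) else "Error"
    print(f"{sent:08b} sent with parity {even_parity_bit(sent)}; "
          f"received {received:08b} , {parity_received}: {verdict}")
```

Note that the receiver never sees the original byte. It can only count the 1-bits that actually arrived.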

From the examples, it may appear that the parity-bit doesn't work very well. It is true that parity checking is not 100% reliable, because it does not signal an error if 2 bits (or any even number of bits) are incorrect. In the third example all 8 data bits arrived flipped, yet the check passed. But in real life it is fairly rare that a transmitted byte contains exactly two bit errors. Single-bit errors are much more common, and these are reliably detected.
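The claim about single-bit and double-bit errors can be checked by brute force. The sketch below tries every possible byte and every possible way of flipping 1 or 2 of its data bits. The helper names (flip_bits, detected_fraction) are invented for this illustration, and it assumes the parity bit itself arrives undamaged.

```python
from itertools import combinations


def ones(value: int) -> int:
    """Count the 1-bits in a non-negative integer."""
    return bin(value).count("1")


def even_parity_ok(byte: int, parity_bit: int) -> bool:
    return (ones(byte) + parity_bit) % 2 == 0


def flip_bits(byte: int, positions) -> int:
    """Return the byte with the bits at the given positions (0..7) inverted."""
    for p in positions:
        byte ^= 1 << p
    return byte


def detected_fraction(num_flips: int) -> float:
    """Fraction of all num_flips-bit errors in the data byte that even parity catches."""
    detected = total = 0
    for byte in range(256):
        parity = ones(byte) % 2  # parity bit the sender would attach
        for positions in combinations(range(8), num_flips):
            total += 1
            if not even_parity_ok(flip_bits(byte, positions), parity):
                detected += 1
    return detected / total


print(detected_fraction(1))  # 1.0 : every single-bit error is caught
print(detected_fraction(2))  # 0.0 : no double-bit error is caught
```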

The problem that actually DOES occur relatively often is that a connection is totally severed - e.g. wires are physically broken. Then the signal will appear to be ALL zeroes: 000000000000.... According to even-parity checking, this is a valid result, because every byte contains zero 1-bits and zero is an even number. For this reason, odd-parity checking is a better idea.

Odd Parity

It's exactly the opposite of even-parity: count the 1-bits in the byte and if there is an even number of 1-bits, send a 1 parity-bit. If there is an odd number of 1's in the byte, send a 0 parity-bit. So the receiver should always receive an odd number of 1-bits. If it receives an even number of 1-bits, that indicates an error. So a long stream of 000000000000.... indicates an error, and thus broken wires can be detected.
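Here is a small sketch of the difference between the two schemes on a "dead" line. The function names are made up for illustration. It assumes a broken wire delivers a 0 for every bit, including the parity bit.

```python
def count_ones(byte: int) -> int:
    return bin(byte & 0xFF).count("1")


def odd_parity_bit(byte: int) -> int:
    """Sender side: return the bit that makes the total number of 1-bits odd."""
    return 1 - count_ones(byte) % 2


def odd_parity_ok(byte: int, parity_bit: int) -> bool:
    """Receiver side: the count of 1-bits, parity bit included, must be odd."""
    return (count_ones(byte) + parity_bit) % 2 == 1


def even_parity_ok(byte: int, parity_bit: int) -> bool:
    return (count_ones(byte) + parity_bit) % 2 == 0


# A severed wire: the data byte and the parity bit all arrive as zeroes.
dead_byte, dead_parity = 0b00000000, 0

print(even_parity_ok(dead_byte, dead_parity))  # True  : the fault goes unnoticed
print(odd_parity_ok(dead_byte, dead_parity))   # False : the fault is detected
```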

Check-Sum for Packets

Since parity-bits are not 100% reliable, we need more error-checking. Since individual bytes are rarely sent, we can build higher-level error checking into packets or other collections of data. One very simple method is a check-sum. It works like this:

Add up all the bytes in the packet.
Send the sum along with the packet
The receiver must add up all the received bytes and check that the total matches the received sum

This would be a bit messy, because a larger packet could have a much bigger sum than a smaller packet, so we would not know in advance how many bytes are needed to transmit the total. So rather than sending the actual total, we send only its last (lowest-order) byte, or the last 4 bytes, or something like that. For example:

Bytes	Actual Sum	Last Byte of Sum
120 , 110 , 100 , 90	420	164
2 , 3 , 5 , 7 , 11 , 13 , 17 , 19	77	77
hex: FF , EE , DD , CC , BB, AA	4FB hex	FB hex
bin: 11111111 , 11111111	111111110	11111110
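The one-byte check-sum can be sketched in Python as follows. The names checksum and packet_ok are hypothetical, and keeping the sum modulo 256 is simply the "last byte" choice described above. Real protocols define their own variants. The loop works through the same byte lists as the table and prints the full sum and its last byte.

```python
def checksum(data: bytes) -> int:
    """Add up all the bytes and keep only the last (lowest-order) byte of the sum."""
    return sum(data) % 256


def packet_ok(data: bytes, received_checksum: int) -> bool:
    """Receiver side: recompute the check-sum and compare it with the one received."""
    return checksum(data) == received_checksum


packets = [
    bytes([120, 110, 100, 90]),
    bytes([2, 3, 5, 7, 11, 13, 17, 19]),
    bytes.fromhex("FFEEDDCCBBAA"),
    bytes([0b11111111, 0b11111111]),
]

for packet in packets:
    print(list(packet), "sum =", sum(packet), "last byte =", checksum(packet))
```

A receiver would call packet_ok(data, received_checksum) and treat False as "error detected".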

Of course computers carry out arithmetic in binary. This can easily be transcribed into hexadecimal, which is easier for a human being to read. Calculating the last byte in a decimal sum is a bit tricky, but it's easy in hex or binary.
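To see why hex and binary make the last byte easy to read off, here is the hex row from the table printed in three number systems. This is only an illustration. The format codes X and b are standard Python string formatting.

```python
total = 0xFF + 0xEE + 0xDD + 0xCC + 0xBB + 0xAA

print(total)           # decimal: 1275
print(f"{total:X}")    # hex:     4FB
print(f"{total:b}")    # binary:  10011111011

# In hex the last byte is just the last 2 digits, in binary the last 8 digits.
last_byte = total & 0xFF
print(f"{last_byte:02X}")   # FB
print(f"{last_byte:08b}")   # 11111011

# In decimal we have to divide by 256 and keep the remainder.
print(total % 256)     # 251
```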

Although check-sums do detect a lot of errors reliably, this mechanism is not 100% reliable. If two or more errors occur, they could cancel each other out. For example, the following has the SAME SUM as the hex total above: FA , EB , DC , CD , BE , AF. And NONE of the bytes are correct. Like the parity-bit problem, the probability of several errors occurring in the same packet and exactly cancelling each other out is extremely small. But just in case that does happen, there is a more sophisticated method available.
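The cancelling example above can be checked directly. In this sketch (variable names are made up) the per-byte differences add up to zero, which is exactly why the check-sum cannot see the damage.

```python
original = bytes.fromhex("FFEEDDCCBBAA")
corrupted = bytes.fromhex("FAEBDCCDBEAF")

# How far each received byte is from the byte that was sent.
differences = [c - o for o, c in zip(original, corrupted)]

print(differences)       # [-5, -3, -1, 1, 3, 5]
print(sum(differences))  # 0 : the errors cancel out

print(hex(sum(original)), hex(sum(corrupted)))  # 0x4fb 0x4fb
print(sum(original) % 256 == sum(corrupted) % 256)  # True : check-sum still matches
```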

Cyclic Redundancy Check (CRC)

This is the method built into hard-disk drives, and it is also used in network frames such as Ethernet. (IP packets themselves carry a simpler check-sum.) You can read some theory and details at: http://en.wikipedia.org/wiki/Cyclic_redundancy_check (not required reading).

CRC performs a complex calculation involving all the bytes in a packet (or sector on the hard-disk). The calculation is more complicated than simply adding up the bytes - it "scrambles" the bits during the calculation. So it is even MORE UNLIKELY that two errors can cancel each other out. The probability of a data-transmission-error going undetected depends on the length of the CRC: with a 32-bit CRC, randomly damaged data slips through with a probability of about 2^(-32), roughly 1 in 4 billion. Nevertheless, bad packets actually do arrive undetected from time to time, because there are lots and lots of them travelling around the Internet. What happens then? Can it be "fixed"?
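As a rough illustration, Python's standard zlib module provides a 32-bit CRC (zlib.crc32). This is an assumption made for the demonstration only: a particular disk drive or network may use a different CRC. The sketch swaps two neighbouring bytes, an error the simple check-sum cannot see because the sum stays the same.

```python
import zlib


def checksum(data: bytes) -> int:
    """The simple one-byte check-sum: last byte of the sum of all bytes."""
    return sum(data) % 256


original = bytes.fromhex("FFEEDDCCBBAA")
swapped = bytes.fromhex("EEFFDDCCBBAA")  # first two bytes arrived in the wrong order

# Adding is order-independent, so the check-sum does not notice the swap.
print(checksum(original) == checksum(swapped))      # True

# The CRC depends on where each bit sits, so the two values differ.
print(zlib.crc32(original) == zlib.crc32(swapped))  # False

print(f"{zlib.crc32(original):08X}", f"{zlib.crc32(swapped):08X}")
```

The swap only disturbs a short run of neighbouring bits, and short bursts like this are a kind of error that a 32-bit CRC always catches.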

Responding to Errors

Imagine an important essay is stored on a hard-disk, and when you try to load it you get a "disk error" message. The standard response to a transmission error is to send a request for retransmission to the sender. Then the receiver waits for the new copy to arrive, and hopefully it will be okay.
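The retransmission idea can be sketched as a simple loop. Everything here is hypothetical: receive_with_retries and make_flaky_sender are invented names, the one-byte check-sum stands in for whatever error check is really used, and the "sender" is just a function that misbehaves a fixed number of times.

```python
from typing import Callable, Optional, Tuple

Packet = Tuple[bytes, int]  # (data, check-sum sent along with the data)


def checksum(data: bytes) -> int:
    return sum(data) % 256


def receive_with_retries(request_packet: Callable[[], Packet],
                         max_attempts: int = 5) -> Optional[bytes]:
    """Keep asking for the packet until it passes the check or we give up."""
    for _ in range(max_attempts):
        data, received_sum = request_packet()
        if checksum(data) == received_sum:
            return data  # looks okay, accept it
        # otherwise: error detected, loop around and request retransmission
    return None  # no correct copy ever arrived


def make_flaky_sender(data: bytes, bad_attempts: int) -> Callable[[], Packet]:
    """Pretend sender whose first bad_attempts transmissions arrive damaged."""
    calls = {"count": 0}

    def send() -> Packet:
        calls["count"] += 1
        if calls["count"] <= bad_attempts:
            damaged = bytes([data[0] ^ 0x01]) + data[1:]  # flip one bit
            return damaged, checksum(data)
        return data, checksum(data)

    return send


print(receive_with_retries(make_flaky_sender(b"essay", 2)))   # b'essay'
print(receive_with_retries(make_flaky_sender(b"essay", 99)))  # None
```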

What if that doesn't happen - e.g. no correct packet ever arrives? This is a fairly common problem with a broken hard-disk (or a scratched CD-ROM). You can keep asking the disk drive to retransmit the damaged sector, but the fact is that the correct data is simply gone. You will never get it back.

So what about your essay? The hard-disk probably does contain some correct data. If you could get most of your essay back, you would probably be happy. You would just (?) sit down and re-type the missing bits. And anyway, you would certainly have a backup copy somewhere - or perhaps several copies.

Consider a different problem - a spacecraft, like Voyager, is travelling through the solar-system, taking pictures, and sending the pictures back to Earth. If a picture arrives "scrambled up" or defective in some other way, we could ask to have it retransmitted. Unfortunately, there may be a delay of several hours between the original transmission and the error-detection on Earth. So the spacecraft would need to save all pictures for several hours, just in case they need to be retransmitted. But a spacecraft like this does not have lots and lots of extra storage capacity, and it has a very limited amount of electrical power. So sending up lots and lots of hard-disks in the spacecraft is really not feasible.

The spacecraft needs to be able to take the picture and send it away, with reasonable certainty that the picture will arrive successfully. The simplest way to do this is through redundancy - that means multiple copies. The spacecraft could simply send ten copies of every picture. Then it should be possible to reconstruct a reasonable picture from whatever data does arrive on Earth. Except - what if there is a solar flare and ALL data is totally messed up for several minutes - e.g. ALL copies of the picture are messed up?
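One simple way to rebuild data from several damaged copies is a majority vote on every bit. The sketch below is only an illustration of that idea, not a description of what any real spacecraft does. The names majority_vote and damage are made up, and it uses 3 copies instead of ten to keep the example short.

```python
from typing import Sequence


def majority_vote(copies: Sequence[bytes]) -> bytes:
    """Rebuild the data bit by bit: each bit takes the value most copies agree on."""
    length = len(copies[0])
    result = bytearray(length)
    for i in range(length):
        for bit in range(8):
            votes_for_one = sum((copy[i] >> bit) & 1 for copy in copies)
            if votes_for_one * 2 > len(copies):
                result[i] |= 1 << bit
    return bytes(result)


def damage(data: bytes, index: int, mask: int) -> bytes:
    """Return a copy of data with some bits of one byte flipped."""
    out = bytearray(data)
    out[index] ^= mask
    return bytes(out)


picture = b"PIXELS"

# Each copy is damaged, but in a different place.
copies = [damage(picture, 0, 0xFF), damage(picture, 3, 0x0F), damage(picture, 5, 0x81)]
print(majority_vote(copies) == picture)  # True : the picture is recovered

# All copies damaged in the same place (the "solar flare" case).
flare = [damage(picture, 0, 0xFF)] * 3
print(majority_vote(flare) == picture)   # False : redundancy does not help
```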

More Sophisticated Error Handling

Parity bits and check-sums are only good for detecting errors - they don't permit us to reconstruct correct data from defective data.

RAID = Redundant Array of Independent Disks

RAID is a hardware + software error-PREVENTION and error-RECOVERY methodology. It depends on high redundancy - that is, having many copies of the data - together with distributing parts of the data across various physical disk-drives. Then if one disk-drive fails, it is possible to recover all the data. Read a bit more about RAID here: http://www.creativecow.net/articles/lindeboom_ron/how_raid_works/index.html
If you are REALLY INTERESTED in RAID, read even more here: RAID Systems (levels)
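Some RAID levels (RAID 5, for example) do not store complete extra copies. Instead they store one parity block, computed with XOR, alongside the data blocks. The sketch below shows only that arithmetic. It is not taken from the articles linked above, the "disks" are just short byte strings, and the names are invented for the illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR several equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "disks" holding parts of a file, plus one parity "disk".
disk_a = b"ERRO"
disk_b = b"R DE"
disk_c = b"TECT"
parity = xor_blocks([disk_a, disk_b, disk_c])

# Disk B fails. XOR-ing everything that is left gives back the lost block.
recovered_b = xor_blocks([disk_a, disk_c, parity])
print(recovered_b)            # b'R DE'
print(recovered_b == disk_b)  # True
```

This works because XOR-ing a value with itself gives zero, so the blocks that survived cancel out of the parity and only the missing block remains.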

Which Method is CHEAPEST?

The number 1 cheapest method for data-safety is: don't worry about it. That's simple and cheap, but not very effective. The cheapest and simplest dependable method is: BACKUPS!!!! The more backup copies you have, the more reliable things are. But backups are also about availability, not just data storage. Suppose your favorite web-site is down just when you need something - what do you do? You probably go to a different site - i.e. the BACKUP. On the web, these are called MIRROR sites.

Which Method is BEST?

Simply put - a combination of ALL safety methods is the best idea. Then individual errors are unlikely to defeat ALL the safety mechanisms at once. And you probably use a combination of them without even knowing it. Your hard-disk drive uses CRC to assure the quality of the data stored on the drive. You probably make backup copies by putting extra copies on the hard-disk or on a USB stick. Many computers use parity checks during memory access and when transferring data from one device to another. Modems create parity checks automatically (unless you turned this off), the OS creates check-sums on IP packets, and applications do some validity checking every time they open a file or load a library. And if your computer is broken, you probably use a different one - so even your computer has a backup.
