Database Design

Most of us are capable of designing a robust database that serves most of our needs by considering a few basic points.

  1. How will you design your computer database to hold your data?
  2. How will you use the data?
A good place to start is ask two basic questions:
  1. What was the "thing" about which I recorded information? Was it a bird, a sample of blood, an egg, a seabird colony, a person, a habitat, a plot, or an event? "Things" are called "entities".
  2. What information did I record about each thing? Information about "entities" are called "attributes".

Example 1.

Let's assume that you are running a bird-banding station at which you trap birds, band them, and measure their body mass. The "thing" or "entity" about which you are recording information is each individual bird you catch, and the "attributes" recorded for each thing are band number, species, location, and body mass. Traditionally we might record such information on paper in tabular format, with each bird representing a row in the table, and each item of information about each bird recorded in a column, as follows:

Banding Number Species Location Body Mass
110390788 COTE Gull Island 1 120
110390789 ARTE Gull Island 1 135
110390790 COTE Green Island 122
110390791 COTE Long Island 110
110390800 ARTE Gull Island 2 143
This is called a rectangular table and could be stored as a single file in a database management system. In a rectangular table each row is called a "record" and each column is called a "field". Records can be thought of as sampling units or replicants and fields as variables.

Although the rectangular format is usually the most intuitive, it might not be the most appropriate.

Example 2.

Imagine that you have taken 100 water samples in each of two lakes A and B and measured pH and water depth of each sample. You may record the data in your field book as follows:

pH A Depth A pH B Depth B
7.2 4 5.4 3
8.3 6 6.2 9
6.9 8 5.2 10
... ... ... ...
8.0 9 5.8 23
Although this arrangement might be suitable for a field note book, it is not appropriate for a computer database file. First, the variables are not unique (ie. pH is recorded in two fields), and second, the sampling unit (ie. what the data in each record or row of the data set represents) is unclear. Data arranged in this way would make analysis difficult or impossible without reformatting. For example, you could not produce a graph of depth on the x-axis and pH on the y-axis for both lakes unless you somehow overlaid data from both lakes.
Instead, use the following structure for storage of this data-set in a computer database:
Lake pH Depth
A 7.2 4
A 8.3 6
A 6.9 5
B 5.4 3
B 6.2 9
B 5.2 10

In this format, each variable is stored in a single field and it is clear what a row in the data-set represents one water sample. Note that it is necessary to add a variable to code for the lake from which the sample was taken. This was implicit in the fields names in the first attempt above.

Example 3.
Consider our bird banding example used previously (table shown below), but now imagine that you want to store more information about each location such as latitude and longitude, habitat, and nearest town.
Banding Number Species Location Body Mass
110390788 COTE Gull Island 1 120
110390789 ARTE Gull Island 1 135
110390790 COTE Green Island 122
110390791 COTE Long Island 110
110390800 ARTE Gull Island 2 143
You could do this by adding these fields to the flat file, however, the information would have to be repeated everytime a particular location appears in the database. This would lead to a lot of duplication and redundancy in the database and result in a much larger file. The solution is to store the location information in a separate file and link the two files together. Below is how the second table containing location information would look:
Location Latitude Longitude Nearest Town
Gull Island 1 44.54 60.02 Halifax
Gull Island 2 45.55 65.45 Liverpool
Green Island 47.40 64.20 Miscou
Long Island 47.63 53.01 Portugal Cove

Note how the Location field is common to both tables. The two tables are linked by this field (see image below). Whenever Island A is mentioned in the banding table, all the attributes in the location table for Island A are also available as if they resided in the banding table. Even though Island A is mentioned more than once in the banding table, it only needs to appear once in the Location table.

Match Fields

Match fields must be carefully planned because they play an important role in linking multi-table databases. In the example above, the values used in the match field "Location" need to be unique, so that references to the table are unambiguous. In the Banding table, two different Gull Islands were referred to, so a number was added to the end of the location name to make them unique and to allow the correct information to be matched in the Location table.

The number of characters used in the match field should be as few as possible because copies reside in each of the tables in your relational database. Short match fields reduce the amount of disk space used, and make the database run faster.