In the last chapter, we introduced the relational model of the database, and defined the fundamental mathematical object in the model, the relation. In this chapter, we discuss relational algebra, which is the set of algebraic operations that can be performed on relations. Relational algebra can be viewed as one mechanism for expressing queries on data stored in relations, and an understanding of relational algebra is important in understanding how relational databases represent and optimize queries. We will cover only basic relational algebra, excluding later extensions such as those for group and aggregate operations and those for outer joins.
A related topic, which we do not cover in this book, is relational calculus. Relational calculus provides another mathematical expression of queries on relations, and is equivalent in expressiveness to relational algebra.
The unary operations in relational algebra act on one relation and result in another relation. The unary operations are selection, projection, and renaming, and their associated operators are typically written as the Greek letters which match the starting letters of the operation:
We will explore each of these unary operators in application to the relation books shown below:
books ¶The House of the Spirits
The Fellowship of the Ring
House Made of Dawn
A Wizard of Earthsea
The books relation has primary key book_id, while author_id is a foreign key to another table we will use later in this chapter.
Selection applies a Boolean condition to the tuples in a relation. The result of a selection operation is a relation containing exactly those tuples for which the selection condition is true. For example, if we are interested in books published after 1960, we can write the selection operation to retrieve just those books as:
\[\sigma_ > 1960>(\text)\]\textThe operator is written with the Boolean condition as a subscript, and then the operand (the input relation) is given in parentheses. Note that the Boolean condition refers to an attribute of the books relation, comparing it to a constant value. The result of this operation is a relation with the same schema as books, but with no name:
The House of the Spirits
House Made of Dawn
A Wizard of Earthsea
Simple Boolean expressions in the relational algebra usually involve comparisons of an attribute with a constant, using any comparison operator. More complex Boolean expressions can be constructed from simple expressions using AND, OR, and NOT. For instance, if we are interested in books published after 1960 as well as books by the author with author_id equal to 6, we could write:
\[\sigma_ > 1960 \text < OR >\text = 6>(\text)\]\textSelection can result in a relation that has all of the tuples from the original (a relation equivalent to the original), some of the tuples from the original, or no tuples at all (an empty set). In the case of an empty relation, we still consider the relation to have the same schema as the original relation.
Since the result of a selection is a relation, we can apply another selection to the result. For example, we could find the books published after 1950, and then select from that result the books with author_id equal to 6:
\[\sigma_ = 6>(\sigma_ > 1950>(\text))\]\textThis would give us a result with one tuple:
The Fellowship of the Ring
This composition of selection operations is equivalent to a single selection operation using a conjunction (AND) of the selection conditions:
\[\sigma_ = 6 \text < AND >\text > 1950>(\text)\]\textThe projection operation creates a new relation which has a subset of the attributes of the input relation. We could use projection, for example, to get a set of tuples expressing just the title and publication year of our books. We write the projection operator with the list of attribute names in the subscript, followed by the operand in parentheses:
This result contains a tuple for each tuple in books, but the tuples in the result only have the attributes specified by the projection operation, thus the result relation has a different schema from the original:
The House of the Spirits
The Fellowship of the Ring
House Made of Dawn
A Wizard of Earthsea
At first glance, it might seem the result of a projection will always have the same number of tuples as the input relation, but this is not the case. Consider what happens if we project books onto the single attribute year. There are two tuples in books with the same year value of 1968. Since relations cannot contain duplicates, the result of our projection operation can contain only one tuple with year equal to 1968. Thus, the result has fewer tuples than the input relation:
Since the result of projection is a relation, we can apply selection to the result:
\[\sigma_=1968>(\pi_>(\text))\]\textNote the order of operations here: first, we supply books as an input to the projection operation; second, the result of the projection is given as the input to the selection operation.
Similarly, since the result of a selection is a relation, we can apply projection after selection. The above expression is equivalent to:
\[\pi_>(\sigma_=1968>(\text))\]\textThe result in both cases is:
House Made of Dawn
A Wizard of Earthsea
It is important to note, however, that you cannot always change the order of projection and selection for an equivalent result. Consider the following expressions:
\[\pi_>(\sigma_=1968>(\text))\]\text \[\sigma_=1968>(\pi_>(\text))\]\textIn the first expression, we select the books which were published in 1968, and then project the resulting tuples onto the title attribute. This result is:
House Made of Dawn
A Wizard of Earthsea
However, the second expression is not a correct expression. The projection occurs first, yielding a relation with just one attribute named title. The following selection is then incorrect, because it makes reference to an attribute, year, which does not exist in the input relation.
Projection can also be applied to the result of another projection; however, the result is equivalent to just performing the second projection. Compare:
\[\pi_>(\pi_>(\text))\]\textNote that we cannot change the order of the two projection operations in the first expression above, as the expression would then be incorrect.
The final unary operation allows for relations and their attributes to be renamed. As we will see, this operation is primarily useful in eliminating name conflicts in certain binary operations - that is, in expressions involving two relations in which the name of some attribute is the same in both relations. The general form of the renaming operator lets us provide new names for the relation and all of its attributes:
This results in a relation with the name mybooks with attributes b_id, a_id, title, and year. The tuples of the new relation have the same values as the tuples of the old relation, but the values are associated with the new attribute names.
As in this example, it is not necessary to alter the name of every attribute (we left unchanged the attribute names title and year), but some name must be provided for every attribute. A non-standard alternative notation allows us to rename only the attributes we want to change:
\[\rho_ <\textWe can optionally leave out either the relation name or the list of attributes. For example, the following expression is correct and results in a relation named books with attributes book_id, author_id, title, and publication_year:
\[\rho_ <\text<(year>\rightarrow \textWe now turn our attention to operations which extend tuples in one relation with tuples from another relation. For this section, we will be using books and a second relation, authors:
authors ¶N. Scott Momaday
Ursula K. Le Guin
The authors relation has a primary key of author_id. The books relation is related to authors via a foreign key on author_id.
The cross product (or Cartesian product) of two relations A and B is a new relation containing all tuples that can be created by concatenating some tuple from B onto some tuple from A [ 1 ] . Here we are using the definition of tuple as an ordered list of values. The attributes of the new relation are normally the attributes of A and B concatenated. However, if there is a name collision, e.g., if both A and B have some attribute x, we will disambiguate the attributes in the new relation by prepending the relation names, that is, the cross product will have attributes A.x and B.x; we can avoid having to do this if we first apply renaming to one relation or the other.
The cross product operator is denoted \(\times\) , and is written between its two operands. To start, consider two rather abstract relations S and T:
S ¶ T ¶We write the cross product of S and T as:
\[\textwhich gives us the relation containing every pairing of a tuple from S with every tuple from T:
From the definition, it is trivial to determine that the size of the cross product is the product of the sizes of the operands.
The cross product is a fundamental operation in relational algebra, but not a generally useful one when we consider actual data. Consider the cross product of books and authors:
\[\textThe full set of tuples in this relation is large (the number of books multiplied by the number of authors), so we only show a subset below:
The House of the Spirits
The House of the Spirits
The House of the Spirits
The author of The House of the Spirits is Isabel Allende. What meaning, then, can we make of a tuple that pairs The House of the Spirits with the author Ralph Ellison (the author of Invisible Man)?
We are typically interested in pairing only certain tuples of a relation with certain tuples of another. In the above example, we are interested in tuples where the author_id attribute from books agrees with the author_id attribute from authors. This relationship is indicated not only by the names we have used for attributes, but also by the foreign key constraint on books and authors. To retain only the tuples with matching author_id values, we can apply a selection operation to the result of our cross product:
\[\sigma_=\text>(\text \times \text)\]\textThis yields a useful result:
The House of the Spirits
The Fellowship of the Ring
House Made of Dawn
N. Scott Momaday
A Wizard of Earthsea
Ursula K. Le Guin
Since this pattern of applying a selection after a cross product is so common, we have an operator that combines the two into an operation known as a join [ 2 ] . Using the join operator, the above expression becomes:
\[\textor, you can instead format the expression as:
\[\textNote that one tuple from authors does not contribute to the join. This tuple’s author_id matches none of the tuples in books, and thus no combined tuple using it can appear in the join result. We call this tuple a dangling tuple. Dangling tuples may be an indication of a problem in the data; in this example, it may suggest that we are missing information about books by one author.
While an equality condition is typically used in joins, more generally any condition of the following form can be used:
\[\textwhere A.x is an attribute from one relation, B.y is an attribute from the other relation, and \(\Theta\) is a comparison operator (such as =, theta condition, and a join using such a condition or a conjunction (AND) of such conditions is known as a theta-join.
A theta-join using only equality comparisons (as in our example above) is further known as an equijoin.
This terminology is not especially important in understanding the algebra, but is something you may encounter if you intend a deeper study of relational algebra.
When we join books with authors we run into the issue that both relations contain an attribute named author_id. Since a relation cannot have more than one attribute with the same name, joining (or taking a cross product of) these two relations requires us to rename the attributes in some fashion. This can be done either by an explicit renaming operation prior to joining or by prepending the original relation name (as we did in our example). Because our join condition was equality on the author_id attributes, both the books.author_id and authors.author_id in the resulting relation always agree. This unnecessary redundancy can be removed using projection and renaming.
In this special situation in which we wish to join specifically by equating the attributes with the same names in both relations - subsequently removing the “duplicate” attributes - we can instead do a natural join. We can indicate a natural join using the join operator with no conditions [ 3 ] :
\[\textwhich yields the simplified relation:
The House of the Spirits
The Fellowship of the Ring
House Made of Dawn
N. Scott Momaday
A Wizard of Earthsea
Ursula K. Le Guin
Unsurprisingly, given that relations are sets, relational algebra includes the usual set operations - union, intersection, and set difference - with some restrictions. These binary operations are denoted by:
For example, let A and B be the relations below:
A ¶ B ¶ \(\text - \text\) ¶Note that union and intersection are commutative, but set difference is not.
The important restriction on set operations in relational algebra is that the relations must be compatible in terms of their schemas. The meaning of “compatible” varies, but for our purposes, assume we view the tuples in a relation as ordered lists, where each position in the list is associated with a particular attribute and type domain. Then, if we have two relations, we require that, for a given position in the tuples in either relation, the attribute and type domain are the same. For A and B shown above, we might assert that the first position corresponds to attribute x and contains character strings, while the second position (y) contains integers.
A looser requirement allows attribute names (but not type domains) to differ between relations. This requirement aligns less closely with the second definition of tuple given in the previous chapter, but it eliminates the occasional need for renaming operations prior to applying set operations. If the attribute names do not match in the two relations, we adopt the attribute names from the left-hand operand for the result relation.
While intersection is a useful operation, it is not strictly needed for the algebra, as the same result can be obtained using set difference:
The operations described above are sufficient for most query needs. However, one other binary operation, division, is typically included in the basic relational algebra. To divide a relation P by another relation R, we write:
Division is the most difficult operation to describe; in a very loose sense it acts as a kind of inverse to a cross product. That is, if P, Q, and R are relations and
\[\text= \text \times \text\]
then it is true that
\[\text\div \text = \text\]
However, the reverse is not necessarily true. Rather, let P be some relation, with attributes x and y [ 4 ] . We require that R has attribute y. Then \(\text \div \text\) will contain the values of x which are paired (in P) with every value of y listed in R.
We will start with an abstract example. Let P be the relation pictured below: