325lect09-1-projected


=====
CS 325 - Week 9 Lecture 1 - 2021-10-18
=====

=====
TODAY WE WILL
=====
*   announcements
*   intro to Normalization
*   prep for next class

*   current reading:
    *   DB Reading Packet 6 - Normalization
    *   SQL Reading Packet 5 - "order by, group by, and having"

*   you should start Homework 6 - due Friday, 11:59 pm
*   should be working on your CS 325 Project - Database Model Draft
    milestone - due Sunday, October 24, 11:59 pm

=====
Intro to Normalization and Normal Forms
=====
*   very related to functional dependencies;

*   we're assuming (mostly) so-called operational database,
    where we are updating it based on day-to-day operations;

*   For some relations, changing the data can have undesirable
    consequences -- these are called modification anomalies
    *   and it will turn out that we can eliminate certain classes
        of these by redefining relations to be in particular
	normal forms

    *   and "normalization" is the process where you redefine
        relations to get them to a particular normal form to
	prevent certain kinds of modification anomalies;

*   OK, let's start with modification anomalies!
    *   deletion anomalies - when deleting a row results in the
        loss of "additional" information it does not seem reasonable
	to lose

        *   deleting june arbro causes the loss of the info that job_class
            BioE is supposed to have a charge_per_hr of $55

        *   deleting anne ramoras causes the loss of the info that proj_num
            3 has the name satellite

    *   insertion anomalies - when you would like to store some information
        but you cannot without some other information, such that that
	restriction seems unreasonable

    *   some also include update anomalies -- but that could also be seen
        as a combo of the above;

======
*   Normalization is seeking to create/adapt sets of relations
    that will not have these modification anomalies (or that will
    have fewer of them, at least)

    We'd also like to reduce *unnecessary* data redundancy (in the original
    current relation we're discussing, if you change a project name,
    you have to change it in *all* the rows of employees who have
    done any work on that project)

    *   it turns out that the careful "breaking up" of a single
        relation into one or more other relations, including
	appropriate foreign keys, can sometimes get rid of a
	modification anomaly

    *   So, normalization is a process of carefully "breaking"
        a relation into multiple relations, each containing
	a separate "them", to reduce modification anomalies;

        (and adding in the appropriate referential integrity
	constraints -- foreign keys! 8-) 0-- to still let us
	relate what we want to relate)

=====
Normal Forms
=====
*   some people came up with a way to classify relations
    based on the types of modification anomalies to which they
    are vulnerable;

    these classifications are called NORMAL FORMS

    Here is that classification:
    *   first normal form (1NF)
    *   second normal form (2NF)
    *   third normal form (3NF)
    *   Boyce-Codd normal form (BCNF)
    *   fourth normal form (4NF)
    *   fifth normal form (5NF)
    *   sixth normal form (6NF)

    *   these are superset/subset -- that is,
        if your relations are in 2NF, they are also (by definition) in 1NF
	if your relations are in 3NF, they are also (by definition) in 1NF and 2NF
	if your relations are in BCNF, they are also (by definition) in 1NF, 2NF and 3NF
	if your relations are in 4NF, they are also (by definition) in 1NF, 2NF, 3NF, and BCNF
        and so on

    *   you COULD think of normalization as working through a series of stages
        involving these normal forms;
        *   and happily, usually it is considered sufficient for operational databases
	    if they are in 3NF (or close enough to 3NF for that scenario's purposes...)

-----
*   1NF: any (true) relation is, by definition, in 1NF

    ...the most common issue to get a set of relations to 1NF is if a multivalued
       attribute has slipped in there, and a relation is NOT to have multiple
       values in a single cell;

*   2NF: a relation is in 2NF if it is in 1NF and it has no
    PARTIAL dependencies

    *   a partial dependency is when a non-primary-key attribute is dependent
        on just PART of the primary key

*   3NF: a relation is in 3NF if it is in 2NF and has no TRANSITIVE dependencies

    *   a transitive dependency is when a non-primary-key attribute NPK1 functionally
        determines another non-primary-key attribute NPK2

        (so the primary key functionally determines NPK1 and so transitively also
	determines NPK2)

*   BCNF - a relation is in BCNF if it is in 3NF and every determinant in the table
           is a candidate key

*   4NF - a relation is in 4NF if it is in BCNF and includes no multiple sets of
    multivalued dependencies

*   ...we're stopping there!