PostGres - Getting it Together
#
Lesson Objectives- Table Normalization
- Joins
- Morning review
#
Data ModelingIt's unusual to have just one table. Typically, there are many tables to represent data in a meaningful and useful way.
Let's image we are designing an ice cream manufacturing management system.
We can start by thinking of the logistics of making one type of ice cream. What kind of data would we need?
First, just the name of the ice cream: 'Plain'
'Plain' ice cream is made up of
- milk
- sugar
- cream
We then have to think of how much of each ingredient goes into a pint
Milk | Sugar | Cream |
---|---|---|
480ml | 143g | 240ml |
Since we're manufacturing lots of ice cream. It would be useful to also have data on how much of these ingredients we have on hand
Milk | Sugar | Cream |
---|---|---|
500l | 14000g | 20l |
We want to keep track of how many pints we've sold
Flavor | Pints Sold |
---|---|
Plain | 543 |
And how many Pints we have on hand
Flavor | Pints in Storage |
---|---|
Plain | 5500 |
We'd also want to keep track of how much money we're making and how much money we're spending to keep operations going...
Having separate tables is useful. But also being able to join our tables together and look at our data in order to analyze it would be really useful as well.
We might want to look at
how many pints we have in storage vs sold to determine if we are making too much or too little
do we want to branch out to more flavors and keep track of how much of each ingredient we have and if we have enough for each flavor?
We might want to look at sales from one quarter to another
Compare sales in different states/cities
If we're just a small company we can probably get away with using excel spreadsheets. But there is a tipping point where excel just isn't powerful/useful enough.
Splitting data across multiple tables often referred to as normalization.
#
Our First TableLet's make a table that holds important information about our ice cream.
Let's say we're ready to branch out to 5 more flavors (we are currently advertising for a marketing person to assist with better ice cream names)
- plain
- coffee
- strawberry
- peanut butter
- vanilla
- chocolate
What other info is useful? For now let's list how many pints we have in storage and whether or not the ice cream has nuts in it.
We also will add a SERIAL PRIMARY KEY
, this means that each entry will get a unique key
(SERIAL is a pseudo datatype, which is 4 bytes and ranges from 1 to 2,147,283,647) and increments more info.
Primary Key adds constraints (like checking that the number is unique and not null)
If you don't have psql
open: Then in bash type
createdb sql_ice_cream
psql sql_ice_cream
If you already have psql open:
CREATE DATABASE sql_ice_cream;\c sql_ice_cream
Everyone now:
CREATE TABLE ice_creams (id SERIAL PRIMARY KEY, name VARCHAR(144), pints INT, has_nuts BOOLEAN DEFAULT false);
INSERT INTO ice_creams (name, pints, has_nuts) VALUES('Plain', 554, false),('Blueberry', 821, false ),('Strawberry', 932, false),('Peanut Butter', 22, true),('Vanilla', 404, false),('Chocolate', 203, false);SELECT * FROM ice_creams;
#
Relational DataAbove we started to think about our ice cream manufacturing data.
What kind of relations do we need with our ice creams?
One ice cream has many plants it is manufactured in. Due to concerns about allergies (and simplicity) Each plant can only manufacture one ice cream
We could make new columns for our plants
Flavor | Pints Sold | Plant 1 | Plant 2 | Plant 3 |
---|---|---|---|---|
Plain | 18 | A | B | C |
Chocolate | 120 | M | E | F |
Blueberry | 674 | G | S | |
Strawberry | 88 | J | Q | |
Peanut Butter | 273 | K | ||
Vanilla | 5 | H | N | P |
Chocolate | 444 | H | N | P |
We already have at least two problems:
- as our business grows we're going to need more and more plants, will our columns expand infinitely? How would we keep them organized? What if one plant closes? Do we move the other plants in that row over?
- Where is each plant? How many pints can it generate every month? We can't add any info about our plants to this table in sensible organized way - especially when the plants can change
#
One to Many RelationshipWe can address our problems by making another table plants
that holds information about our plants and has a foreign key
the foreign key will be the id
from our ice_cream
table.
Our Angel investors gave us money to buy more plants where we haven't had a chance to figure out which ice cream we'll be making there.
Nevertheless ONE ice cream flavor will have MANY manufacting plants. Again, for simplicity of this example, each plant can only manufacture ONE flavor.
When we create our new plants, they'll have a city (again for simplicity, but likely you'd want more information like the full address...), how many pints made, whether it has passed a food inspection, and the id
of the ice cream flavor that it will be manufacturing.
CREATE TABLE plants (id SERIAL PRIMARY KEY, city VARCHAR(144), pints_made INT, passed BOOLEAN, ice_cream_id INT);
INSERT INTO plants (city, pints_made, passed, ice_cream_id) VALUES('Stamford', 100, true, 1),('Greenwich', 20, false, 2),('Hartford', 200, true, 3),('Waterbury', null, null, null),('Darien', null, null, null),('New London', 100, true, 2),('Bridgeport', 150, true, 2),('Milford', null, false, null),('Norwalk', 40, true, 3),('Hamden', null, true, null),('New Britain', null, false, null),('Trumbull', null, null, null),('Danbury', 300, true, 3),('New Canaan', null, true, null),('Fairfield', 400, false, 4),('Stratford', 250, true, 1);
SELECT * FROM plants;
Tough to read? try toggling the extended view by using \x
Don't want to scroll to the bottom? press q
#
JOINSLooking at our two tables separately doesn't really help us figure out what is going on. We want to be able to join our data together.
There are many ways to join data. A great way to start visualizing the different types of joins is with Venn Diagrams
Caveat: Many people don't like using Venn Diagrams to demonstrate joins as they are not totally accurate as to what is actually happening. There are other visualizations out there, but they are not as simple as the Venn Diagrams. EXTRA: more accurate representation of joins
Let's look at the ice cream being produced at plants
SELECT * FROM ice_creams JOIN plants ON plants.ice_cream_id = ice_creams.id;
(press q to quit out of the data view and get your prompt back)
Ok, let's look at all the ice creams, whether or not they are at a plant
SELECT * FROM ice_creams LEFT JOIN plants ON plants.ice_cream_id = ice_creams.id;
Let's see all the plants, whether or not they are producing any ice cream
SELECT * FROM ice_creams RIGHT JOIN plants ON ice_creams.id = plants.ice_cream_id;
Let's see ALL the things.
We'll see our once ice cream that's not being produced anywhere and we can see all of our plants that do not yet have a flavor to produce.
SELECT * FROM ice_creams FULL OUTER JOIN plants ON ice_creams.id = plants.ice_cream_id;
We're seeing a lot of extra info. We can just see the ice creams and cities by altering our query
SELECT name, city FROM ice_creams FULL OUTER JOIN plants ON ice_creams.id = plants.ice_cream_id;
Finally, let's look at every combination, with a cross join:
SELECT * FROM ice_creams CROSS JOIN plants; -- wow that's every row from one table multiplied by the row of every table!
We can use what we've learned this morning to handle our budding business.
We finally have a marketing person! They have decided that each flavor will be renamed to the city of where it was first manufactured.
So Plain
will now be called The Stamford
.
We'll give our ice cream two names. The name
will be the primary marketing name. The description
will be our old name.
Our Vanilla just won an award for most provocative ice cream flavor of the year
! Let's give that one a name and get it into production and name it 'The Signature Connecticut'
Chocolate is still being tested by focus groups and is not yet ready for mass production.
ALTER TABLE ice_creams ADD COLUMN description TEXT;-- update Plain to The StamfordUPDATE ice_creams SET name = 'The Stamford', description = 'Plain' WHERE name = 'Plain';-- update Blueberry to The BridgeportUPDATE ice_creams SET name = 'The Bridgeport', description = 'Blueberry' WHERE name = 'Blueberry';-- update Strawberry to The DanburyUPDATE ice_creams SET name = 'The Danbury', description = 'Strawberry ' WHERE name = 'Strawberry';-- update Peanut butter to The FairfieldUPDATE ice_creams SET name = 'The Fairfield', description = 'Peanut Butter' WHERE name = 'Peanut Butter';-- update Vanilla to The Signature ConnecticutUPDATE ice_creams SET name = 'The Signature Connecticut', description = 'Vanilla' WHERE name = 'Vanilla';
Even though we made significant updates, our previous queries are still accurate.
SELECT city, name, description FROM ice_creams FULL OUTER JOIN plants ON ice_creams.id = plants.ice_cream_id;
#
Average Pints & Berry RecallsWe can build out more advanced queries using what we've learned this morning.
For example, what if we wanted to know the average number of pints being made of our signature The Danbury
?
SELECT AVG(pints)FROM plantsINNER JOIN ice_creamsON plants.ice_cream_id = ice_creams.idWHERE name = 'The Danbury';
Or what if there is a recall on all berries! We need to find all the plants that are using berries!
SELECT cityFROM plantsINNER JOIN ice_creamsON plants.ice_cream_id = ice_creams.idWHERE description LIKE '%berry%';
#
YOU DOWrite a query that will show all the cities that have not passed
inspection