PostGres - Getting it Together

Lesson Objectives#

Table Normalization
Joins
Morning review

Data Modeling#

It's unusual to have just one table. Typically, there are many tables to represent data in a meaningful and useful way.

Let's image we are designing an ice cream manufacturing management system.

We can start by thinking of the logistics of making one type of ice cream. What kind of data would we need?

First, just the name of the ice cream: 'Plain'

'Plain' ice cream is made up of

milk
sugar
cream

We then have to think of how much of each ingredient goes into a pint

Milk	Sugar	Cream
480ml	143g	240ml

Since we're manufacturing lots of ice cream. It would be useful to also have data on how much of these ingredients we have on hand

Milk	Sugar	Cream
500l	14000g	20l

We want to keep track of how many pints we've sold

Flavor	Pints Sold
Plain	543

And how many Pints we have on hand

Flavor	Pints in Storage
Plain	5500

We'd also want to keep track of how much money we're making and how much money we're spending to keep operations going...

Having separate tables is useful. But also being able to join our tables together and look at our data in order to analyze it would be really useful as well.

We might want to look at

how many pints we have in storage vs sold to determine if we are making too much or too little
do we want to branch out to more flavors and keep track of how much of each ingredient we have and if we have enough for each flavor?
We might want to look at sales from one quarter to another
Compare sales in different states/cities

If we're just a small company we can probably get away with using excel spreadsheets. But there is a tipping point where excel just isn't powerful/useful enough.

Splitting data across multiple tables often referred to as normalization.

Our First Table#

Let's make a table that holds important information about our ice cream.

Let's say we're ready to branch out to 5 more flavors (we are currently advertising for a marketing person to assist with better ice cream names)

plain
coffee
strawberry
peanut butter
vanilla
chocolate

What other info is useful? For now let's list how many pints we have in storage and whether or not the ice cream has nuts in it.

We also will add a SERIAL PRIMARY KEY, this means that each entry will get a unique key

(SERIAL is a pseudo datatype, which is 4 bytes and ranges from 1 to 2,147,283,647) and increments more info.

Primary Key adds constraints (like checking that the number is unique and not null)

If you don't have psql open: Then in bash type

createdb sql_ice_cream
psql sql_ice_cream

If you already have psql open:

CREATE DATABASE sql_ice_cream;\c sql_ice_cream

Everyone now:

CREATE TABLE ice_creams (id SERIAL PRIMARY KEY, name VARCHAR(144), pints INT, has_nuts BOOLEAN DEFAULT false);
INSERT INTO ice_creams (name, pints, has_nuts) VALUES('Plain', 554, false),('Blueberry', 821, false ),('Strawberry', 932, false),('Peanut Butter', 22, true),('Vanilla', 404, false),('Chocolate', 203, false);SELECT * FROM ice_creams;

Relational Data#

Above we started to think about our ice cream manufacturing data.

What kind of relations do we need with our ice creams?

One ice cream has many plants it is manufactured in. Due to concerns about allergies (and simplicity) Each plant can only manufacture one ice cream

We could make new columns for our plants

Flavor	Pints Sold	Plant 1	Plant 2	Plant 3
Plain	18	A	B	C
Chocolate	120	M	E	F
Blueberry	674	G	S
Strawberry	88	J	Q
Peanut Butter	273	K
Vanilla	5	H	N	P
Chocolate	444	H	N	P

We already have at least two problems:

as our business grows we're going to need more and more plants, will our columns expand infinitely? How would we keep them organized? What if one plant closes? Do we move the other plants in that row over?
Where is each plant? How many pints can it generate every month? We can't add any info about our plants to this table in sensible organized way - especially when the plants can change

One to Many Relationship#

We can address our problems by making another table plants that holds information about our plants and has a foreign key the foreign key will be the id from our ice_cream table.

Our Angel investors gave us money to buy more plants where we haven't had a chance to figure out which ice cream we'll be making there.

Nevertheless ONE ice cream flavor will have MANY manufacting plants. Again, for simplicity of this example, each plant can only manufacture ONE flavor.

When we create our new plants, they'll have a city (again for simplicity, but likely you'd want more information like the full address...), how many pints made, whether it has passed a food inspection, and the id of the ice cream flavor that it will be manufacturing.

CREATE TABLE plants (id SERIAL PRIMARY KEY, city VARCHAR(144), pints_made INT, passed BOOLEAN, ice_cream_id INT);
INSERT INTO plants (city, pints_made, passed, ice_cream_id) VALUES('Stamford', 100, true, 1),('Greenwich', 20, false, 2),('Hartford', 200, true, 3),('Waterbury', null, null, null),('Darien', null, null, null),('New London', 100, true, 2),('Bridgeport', 150, true, 2),('Milford', null, false, null),('Norwalk', 40, true, 3),('Hamden', null, true, null),('New Britain', null, false, null),('Trumbull', null, null, null),('Danbury', 300, true, 3),('New Canaan', null, true, null),('Fairfield', 400, false, 4),('Stratford', 250, true, 1);

SELECT * FROM plants;

Tough to read? try toggling the extended view by using \x

Don't want to scroll to the bottom? press q

JOINS#

Looking at our two tables separately doesn't really help us figure out what is going on. We want to be able to join our data together.

There are many ways to join data. A great way to start visualizing the different types of joins is with Venn Diagrams

Caveat: Many people don't like using Venn Diagrams to demonstrate joins as they are not totally accurate as to what is actually happening. There are other visualizations out there, but they are not as simple as the Venn Diagrams. EXTRA: more accurate representation of joins

Let's look at the ice cream being produced at plants

SELECT * FROM ice_creams JOIN plants ON plants.ice_cream_id = ice_creams.id;

(press q to quit out of the data view and get your prompt back)

Ok, let's look at all the ice creams, whether or not they are at a plant

SELECT * FROM ice_creams LEFT JOIN plants ON plants.ice_cream_id = ice_creams.id;

Let's see all the plants, whether or not they are producing any ice cream

SELECT * FROM ice_creams RIGHT JOIN plants ON ice_creams.id = plants.ice_cream_id;

Let's see ALL the things.

We'll see our once ice cream that's not being produced anywhere and we can see all of our plants that do not yet have a flavor to produce.

SELECT * FROM ice_creams FULL OUTER JOIN plants ON ice_creams.id = plants.ice_cream_id;

We're seeing a lot of extra info. We can just see the ice creams and cities by altering our query

SELECT name, city FROM ice_creams FULL OUTER JOIN plants ON ice_creams.id = plants.ice_cream_id;

Finally, let's look at every combination, with a cross join:

SELECT * FROM ice_creams CROSS JOIN plants; -- wow that's every row from one table multiplied by the row of every table!

We can use what we've learned this morning to handle our budding business.

We finally have a marketing person! They have decided that each flavor will be renamed to the city of where it was first manufactured.

So Plain will now be called The Stamford. We'll give our ice cream two names. The name will be the primary marketing name. The description will be our old name.

Our Vanilla just won an award for most provocative ice cream flavor of the year! Let's give that one a name and get it into production and name it 'The Signature Connecticut'

Chocolate is still being tested by focus groups and is not yet ready for mass production.

ALTER TABLE ice_creams ADD COLUMN description TEXT;-- update Plain to The StamfordUPDATE ice_creams SET name = 'The Stamford', description = 'Plain' WHERE name = 'Plain';-- update Blueberry to The BridgeportUPDATE ice_creams SET name = 'The Bridgeport', description = 'Blueberry' WHERE name = 'Blueberry';-- update Strawberry to The DanburyUPDATE ice_creams SET name = 'The Danbury', description = 'Strawberry ' WHERE name = 'Strawberry';-- update Peanut butter to The FairfieldUPDATE ice_creams SET name = 'The Fairfield', description = 'Peanut Butter' WHERE name = 'Peanut Butter';-- update Vanilla to The Signature ConnecticutUPDATE ice_creams SET name = 'The Signature Connecticut', description = 'Vanilla' WHERE name = 'Vanilla';

Even though we made significant updates, our previous queries are still accurate.

SELECT city, name, description FROM ice_creams FULL OUTER JOIN plants ON ice_creams.id = plants.ice_cream_id;

Average Pints & Berry Recalls#

We can build out more advanced queries using what we've learned this morning.

For example, what if we wanted to know the average number of pints being made of our signature The Danbury?

SELECT  AVG(pints)FROM  plantsINNER JOIN  ice_creamsON plants.ice_cream_id = ice_creams.idWHERE  name = 'The Danbury';

Or what if there is a recall on all berries! We need to find all the plants that are using berries!

SELECT  cityFROM  plantsINNER JOIN  ice_creamsON plants.ice_cream_id = ice_creams.idWHERE description LIKE '%berry%';

YOU DO#

Write a query that will show all the cities that have not passed inspection