Core Features
Terms defined: administration command, aggregation, aggregation function, cross join, exclusive or, filter, full outer join, group, in-memory database, inclusive or, join, join condition, left outer join, null, query, right outer join, ternary logic, tombstone
Selecting Constant
select 1;
1
select
is a keyword- Normally used to select data from table…
- …but if all we want is a constant value, we don't need to specify one
- Semi-colon terminator is required
Selecting All Values from Table
select * from little_penguins;
Gentoo|Biscoe|51.3|14.2|218.0|5300.0|MALE
Adelie|Dream|35.7|18.0|202.0|3550.0|FEMALE
Adelie|Torgersen|36.6|17.8|185.0|3700.0|FEMALE
Chinstrap|Dream|55.8|19.8|207.0|4000.0|MALE
Adelie|Dream|38.1|18.6|190.0|3700.0|FEMALE
Adelie|Dream|36.2|17.3|187.0|3300.0|FEMALE
Adelie|Dream|39.5|17.8|188.0|3300.0|FEMALE
Gentoo|Biscoe|42.6|13.7|213.0|4950.0|FEMALE
Gentoo|Biscoe|52.1|17.0|230.0|5550.0|MALE
Adelie|Torgersen|36.7|18.8|187.0|3800.0|FEMALE
- An actual query
- Use
*
to mean "all columns" - Use
from tablename
to specify table - Output format is not particularly readable
Administrative Commands
.headers on
.mode markdown
select * from little_penguins;
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----------|-----------|----------------|---------------|-------------------|-------------|--------|
| Gentoo | Biscoe | 51.3 | 14.2 | 218.0 | 5300.0 | MALE |
| Adelie | Dream | 35.7 | 18.0 | 202.0 | 3550.0 | FEMALE |
| Adelie | Torgersen | 36.6 | 17.8 | 185.0 | 3700.0 | FEMALE |
| Chinstrap | Dream | 55.8 | 19.8 | 207.0 | 4000.0 | MALE |
| Adelie | Dream | 38.1 | 18.6 | 190.0 | 3700.0 | FEMALE |
| Adelie | Dream | 36.2 | 17.3 | 187.0 | 3300.0 | FEMALE |
| Adelie | Dream | 39.5 | 17.8 | 188.0 | 3300.0 | FEMALE |
| Gentoo | Biscoe | 42.6 | 13.7 | 213.0 | 4950.0 | FEMALE |
| Gentoo | Biscoe | 52.1 | 17.0 | 230.0 | 5550.0 | MALE |
| Adelie | Torgersen | 36.7 | 18.8 | 187.0 | 3800.0 | FEMALE |
.mode markdown
and.headers on
make the output more readable- These SQLite administrative commands
start with
.
and aren't part of the SQL standard- PostgreSQL's special commands start with
\
- PostgreSQL's special commands start with
- Each command must appear on a line of its own
- Use
.help
for a complete list - And as mentioned earlier, use
.quit
to quit
Specifying Columns
select
species,
island,
sex
from little_penguins;
| species | island | sex |
|-----------|-----------|--------|
| Gentoo | Biscoe | MALE |
| Adelie | Dream | FEMALE |
| Adelie | Torgersen | FEMALE |
| Chinstrap | Dream | MALE |
| Adelie | Dream | FEMALE |
| Adelie | Dream | FEMALE |
| Adelie | Dream | FEMALE |
| Gentoo | Biscoe | FEMALE |
| Gentoo | Biscoe | MALE |
| Adelie | Torgersen | FEMALE |
- Specify column names separated by commas
- In any order
- Duplicates allowed
- Line breaks encouraged for readability
Sorting
select
species,
sex,
island
from little_penguins
order by island asc, sex desc;
| species | sex | island |
|-----------|--------|-----------|
| Gentoo | MALE | Biscoe |
| Gentoo | MALE | Biscoe |
| Gentoo | FEMALE | Biscoe |
| Chinstrap | MALE | Dream |
| Adelie | FEMALE | Dream |
| Adelie | FEMALE | Dream |
| Adelie | FEMALE | Dream |
| Adelie | FEMALE | Dream |
| Adelie | FEMALE | Torgersen |
| Adelie | FEMALE | Torgersen |
order by
must followfrom
(which must followselect
)asc
is ascending,desc
is descending- Default is ascending, but please specify
Exercise
Write a SQL query to select the sex and body mass columns from the little_penguins
in that order,
sorted such that the largest body mass appears first.
Limiting Output
- Full dataset has 344 rows
select
species,
sex,
island
from penguins
order by species, sex, island
limit 10;
| species | sex | island |
|---------|--------|-----------|
| Adelie | | Dream |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
- Comments start with
--
and run to the end of the line limit N
specifies maximum number of rows returned by query
Paging Output
select
species,
sex,
island
from penguins
order by species, sex, island
limit 10 offset 3;
| species | sex | island |
|---------|--------|-----------|
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
| Adelie | FEMALE | Biscoe |
offset N
must followlimit
- Specifies number of rows to skip from the start of the selection
- So this query skips the first 3 and shows the next 10
Removing Duplicates
select distinct
species,
sex,
island
from penguins;
| species | sex | island |
|-----------|--------|-----------|
| Adelie | MALE | Torgersen |
| Adelie | FEMALE | Torgersen |
| Adelie | | Torgersen |
| Adelie | FEMALE | Biscoe |
| Adelie | MALE | Biscoe |
| Adelie | FEMALE | Dream |
| Adelie | MALE | Dream |
| Adelie | | Dream |
| Chinstrap | FEMALE | Dream |
| Chinstrap | MALE | Dream |
| Gentoo | FEMALE | Biscoe |
| Gentoo | MALE | Biscoe |
| Gentoo | | Biscoe |
distinct
keyword must appear right afterselect
- SQL was supposed to read like English
- Shows distinct combinations
- Blanks in
sex
column show missing data- We'll talk about this in a bit
Exercise
-
Write a SQL query to select the islands and species from rows 50 to 60 inclusive of the
penguins
table. Your result should have 11 rows. -
Modify your query to select distinct combinations of island and species from the same rows and compare the result to what you got in part 1.
Filtering Results
select distinct
species,
sex,
island
from penguins
where island = 'Biscoe';
| species | sex | island |
|---------|--------|--------|
| Adelie | FEMALE | Biscoe |
| Adelie | MALE | Biscoe |
| Gentoo | FEMALE | Biscoe |
| Gentoo | MALE | Biscoe |
| Gentoo | | Biscoe |
where condition
filters the rows produced by selection- Condition is evaluated independently for each row
- Only rows that pass the test appear in results
- Use single quotes for
'text data'
and double quotes for"weird column names"
- SQLite will accept double-quoted text data but SQLFluff will complain
Exercise
-
Write a query to select the body masses from
penguins
that are less than 3000.0 grams. -
Write another query to select the species and sex of penguins that weight less than 3000.0 grams. This shows that the columns displayed and those used in filtering are independent of each other.
Filtering with More Complex Conditions
select distinct
species,
sex,
island
from penguins
where island = 'Biscoe' and sex != 'MALE';
| species | sex | island |
|---------|--------|--------|
| Adelie | FEMALE | Biscoe |
| Gentoo | FEMALE | Biscoe |
and
: both sub-conditions must be trueor
: either or both part must be true- Notice that the row for Gentoo penguins on Biscoe island with unknown (empty) sex didn't pass the test
- We'll talk about this in a bit
Exercise
-
Use the
not
operator to select penguins that are not Gentoos. -
SQL's
or
is an inclusive or: it succeeds if either or both conditions are true. SQL does not provide a specific operator for exclusive or, which is true if either but not both conditions are true, but the same effect can be achieved usingand
,or
, andnot
. Write a query to select penguins that are female or on Torgersen Island but not both.
Doing Calculations
select
flipper_length_mm / 10.0,
body_mass_g / 1000.0
from penguins
limit 3;
| flipper_length_mm / 10.0 | body_mass_g / 1000.0 |
|--------------------------|----------------------|
| 18.1 | 3.75 |
| 18.6 | 3.8 |
| 19.5 | 3.25 |
- Can do the usual kinds of arithmetic on individual values
- Calculation done for each row independently
- Column name shows the calculation done
Renaming Columns
select
flipper_length_mm / 10.0 as flipper_cm,
body_mass_g / 1000.0 as weight_kg,
island as where_found
from penguins
limit 3;
| flipper_cm | weight_kg | where_found |
|------------|-----------|-------------|
| 18.1 | 3.75 | Torgersen |
| 18.6 | 3.8 | Torgersen |
| 19.5 | 3.25 | Torgersen |
- Use
expression as name
to rename - Give result of calculation a meaningful name
- Can also rename columns without modifying
Exercise
Write a single query that calculates and returns:
- A column called
what_where
that has the species and island of each penguin separated by a single space. - A column called
bill_ratio
that has the ratio of bill length to bill depth.
You can use the ||
operator to concatenate text to solve part 1,
or look at the documentation for SQLite's format()
function.
Check Understanding
Calculating with Missing Values
select
flipper_length_mm / 10.0 as flipper_cm,
body_mass_g / 1000.0 as weight_kg,
island as where_found
from penguins
limit 5;
| flipper_cm | weight_kg | where_found |
|------------|-----------|-------------|
| 18.1 | 3.75 | Torgersen |
| 18.6 | 3.8 | Torgersen |
| 19.5 | 3.25 | Torgersen |
| | | Torgersen |
| 19.3 | 3.45 | Torgersen |
- SQL uses a special value
null
to representing missing data- Not 0 or empty string, but "I don't know"
- Flipper length and body weight not known for one of the first five penguins
- "I don't know" divided by 10 or 1000 is "I don't know"
Exercise
Use SQLite's .nullvalue
command
to change the printed representation of null to the string null
and then re-run the previous query.
When will displaying null as null
be easier to understand?
When might it be misleading?
Null Equality
- Repeated from earlier
select distinct
species,
sex,
island
from penguins
where island = 'Biscoe';
| species | sex | island |
|---------|--------|--------|
| Adelie | FEMALE | Biscoe |
| Adelie | MALE | Biscoe |
| Gentoo | FEMALE | Biscoe |
| Gentoo | MALE | Biscoe |
| Gentoo | | Biscoe |
- If we ask for female penguins the row with the missing sex drops out
select distinct
species,
sex,
island
from penguins
where island = 'Biscoe' and sex = 'FEMALE';
| species | sex | island |
|---------|--------|--------|
| Adelie | FEMALE | Biscoe |
| Gentoo | FEMALE | Biscoe |
Null Inequality
- But if we ask for penguins that aren't female it drops out as well
select distinct
species,
sex,
island
from penguins
where island = 'Biscoe' and sex != 'FEMALE';
| species | sex | island |
|---------|------|--------|
| Adelie | MALE | Biscoe |
| Gentoo | MALE | Biscoe |
Ternary Logic
select null = null;
| null = null |
|-------------|
| |
- If we don't know the left and right values, we don't know if they're equal or not
- So the result is
null
- Get the same answer for
null != null
- Ternary logic
equality | |||
---|---|---|---|
X | Y | null | |
X | true | false | null |
Y | false | true | null |
null | null | null | null |
Handling Null Safely
select
species,
sex,
island
from penguins
where sex is null;
| species | sex | island |
|---------|-----|-----------|
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Torgersen |
| Adelie | | Dream |
| Gentoo | | Biscoe |
| Gentoo | | Biscoe |
| Gentoo | | Biscoe |
| Gentoo | | Biscoe |
| Gentoo | | Biscoe |
- Use
is null
andis not null
to handle null safely - Other parts of SQL handle nulls specially
Exercise
-
Write a query to find penguins whose body mass is known but whose sex is not.
-
Write another query to find penguins whose sex is known but whose body mass is not.
Check Understanding
Aggregating
select sum(body_mass_g) as total_mass
from penguins;
| total_mass |
|------------|
| 1437000.0 |
- Aggregation combines many values to produce one
sum
is an aggregation function- Combines corresponding values from multiple rows
Common Aggregation Functions
select
max(bill_length_mm) as longest_bill,
min(flipper_length_mm) as shortest_flipper,
avg(bill_length_mm) / avg(bill_depth_mm) as weird_ratio
from penguins;
| longest_bill | shortest_flipper | weird_ratio |
|--------------|------------------|------------------|
| 59.6 | 172.0 | 2.56087082530644 |
- This actually shouldn't work: can't calculate maximum or average if any values are null
- SQL does the useful thing instead of the right one
Exercise
What is the average body mass of penguins that weight more than 3000.0 grams?
Counting
select
count(*) as count_star,
count(sex) as count_specific,
count(distinct sex) as count_distinct
from penguins;
| count_star | count_specific | count_distinct |
|------------|----------------|----------------|
| 344 | 333 | 2 |
count(*)
counts rowscount(column)
counts non-null entries in columncount(distinct column)
counts distinct non-null entries
Exercise
How many different body masses are in the penguins dataset?
Grouping
select avg(body_mass_g) as average_mass_g
from penguins
group by sex;
| average_mass_g |
|------------------|
| 4005.55555555556 |
| 3862.27272727273 |
| 4545.68452380952 |
- Put rows in groups based on distinct combinations of values in columns specified with
group by
- Then perform aggregation separately for each group
- But which is which?
Behavior of Unaggregated Columns
select
sex,
avg(body_mass_g) as average_mass_g
from penguins
group by sex;
| sex | average_mass_g |
|--------|------------------|
| | 4005.55555555556 |
| FEMALE | 3862.27272727273 |
| MALE | 4545.68452380952 |
- All rows in each group have the same value for
sex
, so no need to aggregate
Arbitrary Choice in Aggregation
select
sex,
body_mass_g
from penguins
group by sex;
| sex | body_mass_g |
|--------|-------------|
| | |
| FEMALE | 3800.0 |
| MALE | 3750.0 |
- If we don't specify how to aggregate a column,
SQLite chooses any arbitrary value from the group
- All penguins in each group have the same sex because we grouped by that, so we get the right answer
- The body mass values are in the data but unpredictable
- A common mistake
- Other database managers don't do this
- E.g., PostgreSQL complains that column must be used in an aggregation function
Exercise
Explain why the output of the previous query has a blank line before the rows for female and male penguins.
Write a query that shows each distinct body mass in the penguin dataset and the number of penguins that weigh that much.
Filtering Aggregated Values
select
sex,
avg(body_mass_g) as average_mass_g
from penguins
group by sex
having average_mass_g > 4000.0;
| sex | average_mass_g |
|------|------------------|
| | 4005.55555555556 |
| MALE | 4545.68452380952 |
- Using
having condition
instead ofwhere condition
for aggregates
Readable Output
select
sex,
round(avg(body_mass_g), 1) as average_mass_g
from penguins
group by sex
having average_mass_g > 4000.0;
| sex | average_mass_g |
|------|----------------|
| | 4005.6 |
| MALE | 4545.7 |
- Use
round(value, decimals)
to round off a number
Filtering Aggregate Inputs
select
sex,
round(
avg(body_mass_g) filter (where body_mass_g < 4000.0),
1
) as average_mass_g
from penguins
group by sex;
| sex | average_mass_g |
|--------|----------------|
| | 3362.5 |
| FEMALE | 3417.3 |
| MALE | 3729.6 |
filter (where condition)
applies to inputs
Exercise
Write a query that uses filter
to calculate the average body masses
of heavy penguins (those over 4500 grams)
and light penguins (those under 3500 grams)
simultaneously.
Is it possible to do this using where
instead of filter
?
Check Understanding
Creating In-memory Database
sqlite3 :memory:
- "Connect" to an in-memory database
- Changes aren't saved to disk
- Very useful for testing (discussed later)
Creating Tables
create table job (
name text not null,
billable real not null
);
create table work (
person text not null,
job text not null
);
create table name
followed by parenthesized list of columns- Each column is a name, a data type, and optional extra information
- E.g.,
not null
prevents nulls from being added
- E.g.,
.schema
is not standard SQL- SQLite has added a few things
create if not exists
- upper-case keywords (SQL is case insensitive)
Inserting Data
insert into job values
('calibrate', 1.5),
('clean', 0.5);
insert into work values
('mik', 'calibrate'),
('mik', 'clean'),
('mik', 'complain'),
('po', 'clean'),
('po', 'complain'),
('tay', 'complain');
| name | billable |
|-----------|----------|
| calibrate | 1.5 |
| clean | 0.5 |
| person | job |
|--------|-----------|
| mik | calibrate |
| mik | clean |
| mik | complain |
| po | clean |
| po | complain |
| tay | complain |
Following Along
- Download the examples
- Unzip that file
- To re-create this database:
.read create_work_job.sql
.read populate_work_job.sql
Exercise
Using an in-memory database,
define a table called notes
with two text columns author
and note
and then add three or four rows.
Use a query to check that the notes have been stored
and that you can (for example) select by author name.
What happens if you try to insert too many or too few values into notes
?
What happens if you insert a number instead of a string into the note
field?
Updating Rows
update work
set person = 'tae'
where person = 'tay';
| person | job |
|--------|-----------|
| mik | calibrate |
| mik | clean |
| mik | complain |
| po | clean |
| po | complain |
| tae | complain |
- (Almost) always specify row(s) to update using
where
- Otherwise update all rows in table, which is usually not wanted
Deleting Rows
delete from work
where person = 'tae';
select * from work;
| person | job |
|--------|-----------|
| mik | calibrate |
| mik | clean |
| mik | complain |
| po | clean |
| po | complain |
- Again, (almost) always specify row(s) to delete using
where
Exercise
What happens if you try to delete rows that don't exist
(e.g., all entries in work
that refer to juna
)?
Backing Up
create table backup (
person text not null,
job text not null
);
insert into backup
select
person,
job
from work
where person = 'tae';
delete from work
where person = 'tae';
select * from backup;
| person | job |
|--------|----------|
| tae | complain |
- We will explore another strategy based on tombstones below
Exercise
Saving and restoring data as text:
-
Re-create the
notes
table in an in-memory database and then use SQLite's.output
and.dump
commands to save the database to a file callednotes.sql
. Inspect the contents of this file: how has your data been stored? -
Start a fresh SQLite session and load
notes.sql
using the.read
command. Inspect the database using.schema
andselect *
: is everything as you expected?
Saving and restoring data in binary format:
-
Re-create the
notes
table in an in-memory database once again and use SQLite's.backup
command to save it to a file callednotes.db
. Inspect this file usingod -c notes.db
or a text editor that can handle binary data: how has your data been stored? -
Start a fresh SQLite session and load
notes.db
using the.restore
command. Inspect the database using.schema
andselect *
: is everything as you expected?
Check Understanding
Combining Information
select *
from work cross join job;
| person | job | name | billable |
|--------|-----------|-----------|----------|
| mik | calibrate | calibrate | 1.5 |
| mik | calibrate | clean | 0.5 |
| mik | clean | calibrate | 1.5 |
| mik | clean | clean | 0.5 |
| mik | complain | calibrate | 1.5 |
| mik | complain | clean | 0.5 |
| po | clean | calibrate | 1.5 |
| po | clean | clean | 0.5 |
| po | complain | calibrate | 1.5 |
| po | complain | clean | 0.5 |
| tay | complain | calibrate | 1.5 |
| tay | complain | clean | 0.5 |
- A join combines information from two tables
- cross join constructs their cross product
- All combinations of rows from each
- Result isn't particularly useful:
job
andname
values don't match- I.e., the combined data has records whose parts have nothing to do with each other
Inner Join
select *
from work inner join job
on work.job = job.name;
| person | job | name | billable |
|--------|-----------|-----------|----------|
| mik | calibrate | calibrate | 1.5 |
| mik | clean | clean | 0.5 |
| po | clean | clean | 0.5 |
- Use
table.column
notation to specify columns- A column can have the same name as a table
- Use
on condition
to specify join condition - Since
complain
doesn't appear injob.name
, none of those rows are kept
Exercise
Re-run the query shown above using where job = name
instead of the full table.name
notation.
Is the shortened form easier or harder to read
and more or less likely to cause errors?
Aggregating Joined Data
select
work.person,
sum(job.billable) as pay
from work inner join job
on work.job = job.name
group by work.person;
| person | pay |
|--------|-----|
| mik | 2.0 |
| po | 0.5 |
- Combines ideas we've seen before
- But Tay is missing from the table
- No records in the
job
table withtay
as name - So no records to be grouped and summed
- No records in the
Left Join
select *
from work left join job
on work.job = job.name;
| person | job | name | billable |
|--------|-----------|-----------|----------|
| mik | calibrate | calibrate | 1.5 |
| mik | clean | clean | 0.5 |
| mik | complain | | |
| po | clean | clean | 0.5 |
| po | complain | | |
| tay | complain | | |
- A left outer join keeps all rows from the left table
- Fills missing values from right table with null
Aggregating Left Joins
select
work.person,
sum(job.billable) as pay
from work left join job
on work.job = job.name
group by work.person;
| person | pay |
|--------|-----|
| mik | 2.0 |
| po | 0.5 |
| tay | |
- That's better, but we'd like to see 0 rather than a blank
Coalescing Values
select
work.person,
coalesce(sum(job.billable), 0.0) as pay
from work left join job
on work.job = job.name
group by work.person;
| person | pay |
|--------|-----|
| mik | 2.0 |
| po | 0.5 |
| tay | 0.0 |
coalesce(val1, val2, …)
returns first non-null value
Full Outer Join
- Full outer join is the union of left outer join and right outer join
- Almost the same as cross join, but consider:
create table size (
s text not null
);
insert into size values ('light'), ('heavy');
create table weight (
w text not null
);
select * from size full outer join weight;
| s | w |
|-------|---|
| light | |
| heavy | |
- A cross join would produce empty result
Exercise
Find the least time each person spent on any job.
Your output should show that mik
and po
each spent 0.5 hours on some job.
Can you find a way to show the name of the job as well
using the SQL you have seen so far?