Advanced Features
Terms defined: atomic, base case, Binary Large Object (blob), comma-separated values (CSV), consistent, denormalization, durable, infinite recursion, isolated, JavaScript Object Notation (JSON), materialized view, normal form, path expression, recursive case, recursive CTE, temporary table, trigger, upsert, view
Blobs
create table images (
name text not null,
content blob
);
insert into images (name, content) values
('biohazard', readfile('img/biohazard.png')),
('crush', readfile('img/crush.png')),
('fire', readfile('img/fire.png')),
('radioactive', readfile('img/radioactive.png')),
('tripping', readfile('img/tripping.png'));
select
name,
length(content)
from images;
| name | length(content) |
|-------------|-----------------|
| biohazard | 19629 |
| crush | 15967 |
| fire | 18699 |
| radioactive | 16661 |
| tripping | 17208 |
Exercise
Modify the query shown above to select the value of content
rather than its length.
How intelligible is the output?
Does using SQLite's hex()
function make it any more readable?
Yet Another Database
sqlite3 db/lab_log.db
.schema
CREATE TABLE sqlite_sequence(name,seq);
CREATE TABLE person(
ident integer primary key autoincrement,
details text not null
);
CREATE TABLE machine(
ident integer primary key autoincrement,
name text not null,
details text not null
);
CREATE TABLE usage(
ident integer primary key autoincrement,
log text not null
);
Storing JSON
select * from machine;
| ident | name | details |
|-------|----------------|---------------------------------------------------------|
| 1 | WY401 | {"acquired": "2023-05-01"} |
| 2 | Inphormex | {"acquired": "2021-07-15", "refurbished": "2023-10-22"} |
| 3 | AutoPlate 9000 | {"note": "needs software update"} |
- Store heterogeneous data as JSON-formatted text (with double-quoted strings)
- Database parses the text each time it is queried, so performance can be an issue
- Can alternatively store as blob (jsonb)
    - Can't view it directly
    - But more efficient
Select Fields from JSON
select
details->'$.acquired' as single_arrow,
details->>'$.acquired' as double_arrow
from machine;
| single_arrow | double_arrow |
|--------------|--------------|
| "2023-05-01" | 2023-05-01 |
| "2021-07-15" | 2021-07-15 |
| | |
- Single arrow `->` returns JSON representation of result
- Double arrow `->>` returns SQL text, integer, real, or null
- Left side is column
- Right side is path expression
    - Starts with `$` (meaning "root")
    - Fields separated by `.`
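Path lookups can also be tried outside the sqlite3 shell. This minimal sketch uses Python's built-in sqlite3 module (an assumption; the lesson itself uses the command-line shell) and `json_extract`, which, like `->>`, returns a plain SQL value:

```python
import sqlite3

# In-memory stand-in for the machine table's JSON "details" column.
con = sqlite3.connect(":memory:")
con.execute("create table machine (name text, details text)")
con.execute(
    "insert into machine values (?, ?)",
    ("WY401", '{"acquired": "2023-05-01"}'),
)

# json_extract follows a path expression rooted at $; like ->>,
# it returns a plain SQL value rather than a JSON fragment.
acquired = con.execute(
    "select json_extract(details, '$.acquired') from machine"
).fetchone()[0]
```

This assumes the underlying SQLite build includes the JSON functions, which is the default in modern releases.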
Exercise
Write a query that selects the year from the "refurbished"
field
of the JSON data associated with the Inphormex plate reader.
JSON Array Access
select
ident,
json_array_length(log->'$') as length,
log->'$[0]' as first
from usage;
| ident | length | first |
|-------|--------|--------------------------------------------------------------|
| 1 | 4 | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]} |
| 2 | 5 | {"machine":"Inphormex","person":["Marianne","Richer"]} |
| 3 | 2 | {"machine":"sterilizer","person":["Josette","Villeneuve"]} |
| 4 | 1 | {"machine":"sterilizer","person":["Maude","Goulet"]} |
| 5 | 2 | {"machine":"AutoPlate 9000","person":["Brigitte","Michaud"]} |
| 6 | 1 | {"machine":"sterilizer","person":["Marianne","Richer"]} |
| 7 | 3 | {"machine":"WY401","person":["Maude","Goulet"]} |
| 8 | 1 | {"machine":"AutoPlate 9000"} |
- SQLite and other database managers have many JSON manipulation functions
- `json_array_length` gives number of elements in selected array
    - Subscripts start with 0
- Characters outside 7-bit ASCII represented as Unicode escapes
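A small sketch of array length and subscripting, again through Python's sqlite3 module with a made-up two-entry log:

```python
import sqlite3

con = sqlite3.connect(":memory:")
log = '[{"machine": "WY401"}, {"machine": "sterilizer"}]'
# json_array_length counts the elements of the array at the given path;
# subscripts in path expressions start at 0.
length, first = con.execute(
    "select json_array_length(?, '$'), json_extract(?, '$[0].machine')",
    (log, log),
).fetchone()
```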
Unpacking JSON Arrays
select
ident,
json_each.key as key,
json_each.value as value
from usage, json_each(usage.log)
limit 10;
| ident | key | value |
|-------|-----|--------------------------------------------------------------|
| 1 | 0 | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]} |
| 1 | 1 | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]} |
| 1 | 2 | {"machine":"WY401","person":["Gabrielle","Dub\u00e9"]} |
| 1 | 3 | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]} |
| 2 | 0 | {"machine":"Inphormex","person":["Marianne","Richer"]} |
| 2 | 1 | {"machine":"AutoPlate 9000","person":["Marianne","Richer"]} |
| 2 | 2 | {"machine":"sterilizer","person":["Marianne","Richer"]} |
| 2 | 3 | {"machine":"AutoPlate 9000","person":["Monique","Marcotte"]} |
| 2 | 4 | {"machine":"sterilizer","person":["Marianne","Richer"]} |
| 3 | 0 | {"machine":"sterilizer","person":["Josette","Villeneuve"]} |
- `json_each` is another table-valued function
- Use `json_each.name` to get properties of unpacked array
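The unpacking can be sketched with Python's sqlite3 module and a made-up single-row usage table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table usage (ident integer, log text)")
con.execute("""insert into usage values (1, '["a", "b", "c"]')""")
# json_each is table-valued: joining a row against it produces one output
# row per array element, with json_each.key holding the zero-based index
# and json_each.value holding the element itself.
rows = con.execute(
    "select ident, json_each.key, json_each.value"
    " from usage, json_each(usage.log)"
).fetchall()
```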
Exercise
Write a query that counts how many times each person appears in the first log entry associated with any piece of equipment.
Selecting the Last Element of an Array
select
ident,
log->'$[#-1].machine' as final
from usage
limit 5;
| ident | final |
|-------|--------------|
| 1 | "Inphormex" |
| 2 | "sterilizer" |
| 3 | "Inphormex" |
| 4 | "sterilizer" |
| 5 | "sterilizer" |
Modifying JSON
select
ident,
name,
json_set(details, '$.sold', json_quote('2024-01-25')) as updated
from machine;
| ident | name | updated |
|-------|----------------|--------------------------------------------------------------|
| 1 | WY401 | {"acquired":"2023-05-01","sold":"2024-01-25"} |
| 2 | Inphormex | {"acquired":"2021-07-15","refurbished":"2023-10-22","sold":" |
| | | 2024-01-25"} |
| 3 | AutoPlate 9000 | {"note":"needs software update","sold":"2024-01-25"} |
- Updates the in-memory copy of the JSON, not the database record
- Please use `json_quote` rather than trying to format JSON with string operations
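A sketch of the same `json_set` call from Python, checking the result by parsing it back; the single-row table here is a stand-in for the lab log database:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table machine (name text, details text)")
con.execute(
    "insert into machine values ('WY401', '{\"acquired\": \"2023-05-01\"}')"
)
# json_set returns a modified copy of the JSON; the stored row is
# unchanged unless the result is written back with update.
updated = con.execute(
    "select json_set(details, '$.sold', json_quote('2024-01-25'))"
    " from machine"
).fetchone()[0]
```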
Exercise
As part of cleaning up the lab log database,
replace the machine names in the JSON records in usage
with the corresponding machine IDs from the machine
table.
Refreshing the Penguins Database
select
species,
count(*) as num
from penguins
group by species;
| species | num |
|-----------|-----|
| Adelie | 152 |
| Chinstrap | 68 |
| Gentoo | 124 |
- We will restore full database after each example
Tombstones
alter table penguins
add active integer not null default 1;
update penguins
set active = iif(species = 'Adelie', 0, 1);
select
species,
count(*) as num
from penguins
where active
group by species;
| species | num |
|-----------|-----|
| Chinstrap | 68 |
| Gentoo | 124 |
- Use a tombstone to mark (in)active records
- Every query must now include it
Importing CSV Data
- SQLite and most other database managers have tools for importing and exporting CSV
- In SQLite:
- Define table
- Import data
- Convert empty strings to nulls (if desired)
- Convert types from text to whatever (not shown below)
drop table if exists penguins;
.mode csv penguins
.import misc/penguins.csv penguins
update penguins set species = null where species = '';
update penguins set island = null where island = '';
update penguins set bill_length_mm = null where bill_length_mm = '';
update penguins set bill_depth_mm = null where bill_depth_mm = '';
update penguins set flipper_length_mm = null where flipper_length_mm = '';
update penguins set body_mass_g = null where body_mass_g = '';
update penguins set sex = null where sex = '';
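The same import steps can be sketched in Python with the csv and sqlite3 modules, using an inline stand-in for misc/penguins.csv (the real file has more columns):

```python
import csv
import io
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table penguins (species text, body_mass_g real)")

# Two-column stand-in for misc/penguins.csv; one mass is missing.
data = io.StringIO("species,body_mass_g\nAdelie,3750\nAdelie,\n")
# Convert empty strings to nulls and text to numbers while inserting.
rows = [
    (row["species"] or None,
     float(row["body_mass_g"]) if row["body_mass_g"] else None)
    for row in csv.DictReader(data)
]
con.executemany("insert into penguins values (?, ?)", rows)
masses = [m for (m,) in con.execute("select body_mass_g from penguins")]
```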
Exercise
What are the data types of the columns in the penguins
table
created by the CSV import shown above?
How can you correct the ones that need correcting?
Views
create view if not exists
active_penguins (
species,
island,
bill_length_mm,
bill_depth_mm,
flipper_length_mm,
body_mass_g,
sex
) as
select
species,
island,
bill_length_mm,
bill_depth_mm,
flipper_length_mm,
body_mass_g,
sex
from penguins
where active;
select
species,
count(*) as num
from active_penguins
group by species;
| species | num |
|-----------|-----|
| Chinstrap | 68 |
| Gentoo | 124 |
- A view is a saved query that other queries can invoke
- View is re-run each time it's used
- Like a CTE, but:
    - Can be shared between queries
    - Views came first
- Some databases offer materialized views
    - Update-on-demand temporary tables
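The re-run-on-use behavior can be demonstrated with Python's sqlite3 module and a cut-down penguins table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table penguins (species text, active integer)")
con.executemany(
    "insert into penguins values (?, ?)",
    [("Adelie", 0), ("Chinstrap", 1), ("Gentoo", 1)],
)
# The view stores the query, not its results, so it is re-run each use.
con.execute(
    "create view active_penguins as"
    " select species from penguins where active"
)
before = con.execute("select count(*) from active_penguins").fetchone()[0]
# Updating the base table immediately changes what the view returns.
con.execute("update penguins set active = 1 where species = 'Adelie'")
after = con.execute("select count(*) from active_penguins").fetchone()[0]
```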
Exercise
Create a view in the lab log database called busy
with two columns: machine_id and total_log_length.
The first column records the numeric ID of each machine;
the second shows the total number of log entries for that machine.
Check Understanding
Hours Reminder
create table job (
name text not null,
billable real not null
);
insert into job values
('calibrate', 1.5),
('clean', 0.5);
select * from job;
| name | billable |
|-----------|----------|
| calibrate | 1.5 |
| clean | 0.5 |
Adding Checks
create table job (
name text not null,
billable real not null,
check (billable > 0.0)
);
insert into job values ('calibrate', 1.5);
insert into job values ('reset', -0.5);
select * from job;
Runtime error near line 9: CHECK constraint failed: billable > 0.0 (19)
| name | billable |
|-----------|----------|
| calibrate | 1.5 |
- `check` adds constraint to table
    - Must produce a Boolean result
- Run each time values added or modified
- But changes made before the error have taken effect
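From Python's sqlite3 module, a failed check surfaces as IntegrityError; this sketch shows the earlier insert surviving the later failure:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "create table job ("
    " name text not null,"
    " billable real not null,"
    " check (billable > 0.0))"
)
con.execute("insert into job values ('calibrate', 1.5)")
# A failed check raises IntegrityError: the offending statement is
# rejected, but earlier, already-applied statements keep their effect.
try:
    con.execute("insert into job values ('reset', -0.5)")
    failed = False
except sqlite3.IntegrityError:
    failed = True
count = con.execute("select count(*) from job").fetchone()[0]
```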
Exercise
Rewrite the definition of the penguins
table to add the following constraints:
- `body_mass_g` must be null or non-negative.
- `island` must be one of "Biscoe", "Dream", or "Torgersen".
  (Hint: the `in` operator will be useful here.)
ACID
- Atomic: change cannot be broken down into smaller ones (i.e., all or nothing)
- Consistent: database goes from one consistent state to another
- Isolated: looks like changes happened one after another
- Durable: if change takes place, it's still there after a restart
Transactions
create table job (
name text not null,
billable real not null,
check (billable > 0.0)
);
insert into job values ('calibrate', 1.5);
begin transaction;
insert into job values ('clean', 0.5);
rollback;
select * from job;
| name | billable |
|-----------|----------|
| calibrate | 1.5 |
- Statements outside transaction execute and are committed immediately
- Statement(s) inside transaction don't take effect until:
    - `end transaction` (success)
    - `rollback` (undo)
- Can have any number of statements inside a transaction
- But cannot nest transactions in SQLite
    - Other databases support this
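A sketch of the same rollback from Python's sqlite3 module; `isolation_level=None` is needed so the module's own implicit transactions don't interfere:

```python
import sqlite3

# isolation_level=None gives autocommit mode, so the begin/rollback
# pair below is the only transaction in play.
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("create table job (name text not null, billable real not null)")
con.execute("insert into job values ('calibrate', 1.5)")  # committed at once
con.execute("begin transaction")
con.execute("insert into job values ('clean', 0.5)")
con.execute("rollback")  # undoes everything since begin transaction
names = [n for (n,) in con.execute("select name from job")]
```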
Rollback in Constraints
create table job (
name text not null,
billable real not null,
check (billable > 0.0) on conflict rollback
);
insert into job values
('calibrate', 1.5);
insert into job values
('clean', 0.5),
('reset', -0.5);
select * from job;
Runtime error near line 11: CHECK constraint failed: billable > 0.0 (19)
| name | billable |
|-----------|----------|
| calibrate | 1.5 |
- All of second `insert` rolled back as soon as error occurred
- But first `insert` took effect
Rollback in Statements
create table job (
name text not null,
billable real not null,
check (billable > 0.0)
);
insert or rollback into job values
('calibrate', 1.5);
insert or rollback into job values
('clean', 0.5),
('reset', -0.5);
select * from job;
Runtime error near line 11: CHECK constraint failed: billable > 0.0 (19)
| name | billable |
|-----------|----------|
| calibrate | 1.5 |
- Constraint is in table definition
- Action is in statement
Upsert
create table jobs_done (
person text unique,
num integer default 0
);
insert into jobs_done values
('zia', 1);
.print 'after first'
select * from jobs_done;
.print
insert into jobs_done values
('zia', 1);
.print 'after failed'
select * from jobs_done;
insert into jobs_done values
('zia', 1)
on conflict(person) do update set num = num + 1;
.print '\nafter upsert'
select * from jobs_done;
after first
| person | num |
|--------|-----|
| zia | 1 |
Runtime error near line 15: UNIQUE constraint failed: jobs_done.person (19)
after failed
| person | num |
|--------|-----|
| zia | 1 |
\nafter upsert
| person | num |
|--------|-----|
| zia | 2 |
- upsert stands for "update or insert"
    - Create if record doesn't exist
    - Update if it does
- Not standard SQL but widely implemented
- Example also shows use of SQLite `.print` command
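The same upsert, sketched with Python's sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "create table jobs_done (person text unique, num integer default 0)"
)
upsert = (
    "insert into jobs_done values (?, 1)"
    " on conflict(person) do update set num = num + 1"
)
con.execute(upsert, ("zia",))  # no existing row, so this inserts
con.execute(upsert, ("zia",))  # conflict on person, so this updates
num = con.execute(
    "select num from jobs_done where person = 'zia'"
).fetchone()[0]
```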
Exercise
Using the assay database,
write a query that adds or modifies people in the staff
table as shown:
| personal | family | dept | age |
|----------|--------|------|-----|
| Pranay   | Khanna | mb   | 41  |
| Riaan    | Dua    | gen  | 23  |
| Parth    | Johel  | gen  | 27  |
Normalization
- First normal form (1NF): every field of every record contains one indivisible value.
- Second normal form (2NF) and third normal form (3NF): every value in a record that isn't a key depends solely on the key, not on other values.
- Denormalization: explicitly store values that could be calculated on the fly
    - To simplify queries and/or make processing faster
Creating Triggers
-- Track hours of lab work.
create table job (
person text not null,
reported real not null check (reported >= 0.0)
);
-- Explicitly store per-person total rather than using sum().
create table total (
person text unique not null,
hours real
);
-- Initialize totals.
insert into total values
('gene', 0.0),
('august', 0.0);
-- Define a trigger.
create trigger total_trigger
before insert on job
begin
-- Check that the person exists.
select case
when not exists (select 1 from total where person = new.person)
then raise(rollback, 'Unknown person')
end;
-- Update their total hours (or fail if non-negative constraint violated).
update total
set hours = hours + new.reported
where total.person = new.person;
end;
- A trigger automatically runs before or after a specified operation
- Can have side effects (e.g., update some other table)
- And/or implement checks (e.g., make sure other records exist)
- Add processing overhead…
    - …but data is either cheap or correct, never both
- Inside trigger, refer to old and new versions of record
  as `old.column` and `new.column`
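A cut-down version of the trigger above, runnable through Python's sqlite3 module (the person-exists check is omitted to keep the sketch short):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table job (person text not null, reported real not null);
create table total (person text unique not null, hours real);
insert into total values ('gene', 0.0);

-- The trigger fires once per row being inserted; new.person and
-- new.reported refer to the incoming row.
create trigger total_trigger
before insert on job
begin
    update total
    set hours = hours + new.reported
    where total.person = new.person;
end;

insert into job values ('gene', 1.5), ('gene', 1.0);
""")
hours = con.execute(
    "select hours from total where person = 'gene'"
).fetchone()[0]
```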
Trigger Not Firing
insert into job values
('gene', 1.5),
('august', 0.5),
('gene', 1.0);
| person | reported |
|--------|----------|
| gene | 1.5 |
| august | 0.5 |
| gene | 1.0 |
| person | hours |
|--------|-------|
| gene | 2.5 |
| august | 0.5 |
Trigger Firing
insert into job values
('gene', 1.0),
('august', -1.0);
Runtime error near line 6: CHECK constraint failed: reported >= 0.0 (19)
| person | hours |
|--------|-------|
| gene | 0.0 |
| august | 0.0 |
Exercise
Using the penguins database:

- create a table called `species` with columns `name` and `count`; and
- define a trigger that increments the count associated with each species
  each time a new penguin is added to the `penguins` table.

Does your solution behave correctly when several penguins are added
by a single `insert` statement?
Representing Graphs
create table lineage (
parent text not null,
child text not null
);
insert into lineage values
('Arturo', 'Clemente'),
('Darío', 'Clemente'),
('Clemente', 'Homero'),
('Clemente', 'Ivonne'),
('Ivonne', 'Lourdes'),
('Soledad', 'Lourdes'),
('Lourdes', 'Santiago');
select * from lineage;
| parent | child |
|----------|----------|
| Arturo | Clemente |
| Darío | Clemente |
| Clemente | Homero |
| Clemente | Ivonne |
| Ivonne | Lourdes |
| Soledad | Lourdes |
| Lourdes | Santiago |
Exercise
Write a query that uses a self join to find every person's grandchildren.
Recursive Queries
with recursive descendent as (
select
'Clemente' as person,
0 as generations
union all
select
lineage.child as person,
descendent.generations + 1 as generations
from descendent inner join lineage
on descendent.person = lineage.parent
)
select
person,
generations
from descendent;
| person | generations |
|----------|-------------|
| Clemente | 0 |
| Homero | 1 |
| Ivonne | 1 |
| Lourdes | 2 |
| Santiago | 3 |
- Use a recursive CTE to create a temporary table (`descendent`)
- Base case seeds this table
- Recursive case relies on value(s) already in that table and external table(s)
    - `union all` to combine rows
    - Can use `union` but that has lower performance (must check uniqueness each time)
- Stops when the recursive case yields an empty row set (nothing new to add)
- Then select the desired values from the CTE
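The same recursive query is runnable through Python's sqlite3 module with a reduced lineage table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table lineage (parent text, child text)")
con.executemany("insert into lineage values (?, ?)", [
    ("Clemente", "Homero"), ("Clemente", "Ivonne"),
    ("Ivonne", "Lourdes"), ("Lourdes", "Santiago"),
])
# The base case seeds the CTE; each recursive pass joins the rows found
# so far against lineage, and the query stops when a pass adds nothing.
rows = con.execute("""
with recursive descendent as (
    select 'Clemente' as person, 0 as generations
    union all
    select lineage.child, descendent.generations + 1
    from descendent join lineage on descendent.person = lineage.parent
)
select person, generations from descendent
order by generations, person
""").fetchall()
```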
Exercise
Modify the recursive query shown above to use union
instead of union all
.
Does this affect the result?
Why or why not?
Contact Tracing Database
select * from person;
| ident | name |
|-------|-----------------------|
| 1 | Juana Baeza |
| 2 | Agustín Rodríquez |
| 3 | Ariadna Caraballo |
| 4 | Micaela Laboy |
| 5 | Verónica Altamirano |
| 6 | Reina Rivero |
| 7 | Elias Merino |
| 8 | Minerva Guerrero |
| 9 | Mauro Balderas |
| 10 | Pilar Alarcón |
| 11 | Daniela Menéndez |
| 12 | Marco Antonio Barrera |
| 13 | Cristal Soliz |
| 14 | Bernardo Narváez |
| 15 | Óscar Barrios |
select * from contact;
| left | right |
|-------------------|-----------------------|
| Agustín Rodríquez | Ariadna Caraballo |
| Agustín Rodríquez | Verónica Altamirano |
| Juana Baeza | Verónica Altamirano |
| Juana Baeza | Micaela Laboy |
| Pilar Alarcón | Reina Rivero |
| Cristal Soliz | Marco Antonio Barrera |
| Cristal Soliz | Daniela Menéndez |
| Daniela Menéndez | Marco Antonio Barrera |
Bidirectional Contacts
create temporary table bi_contact (
left text,
right text
);
insert into bi_contact
select left, right from contact
union all
select right, left from contact;
select count(*) as original_count from contact;

| original_count |
|----------------|
| 8 |

select count(*) as num_contact from bi_contact;

| num_contact |
|-------------|
| 16 |
- Create a temporary table rather than using a long chain of CTEs
- Only lasts as long as the session (not saved to disk)
- Duplicate information rather than writing more complicated query
Updating Group Identifiers
select
left.name as left_name,
left.ident as left_ident,
right.name as right_name,
right.ident as right_ident,
min(left.ident, right.ident) as new_ident
from
(person as left join bi_contact on left.name = bi_contact.left)
join person as right on bi_contact.right = right.name;
| left_name | left_ident | right_name | right_ident | new_ident |
|-----------------------|------------|-----------------------|-------------|-----------|
| Juana Baeza | 1 | Micaela Laboy | 4 | 1 |
| Juana Baeza | 1 | Verónica Altamirano | 5 | 1 |
| Agustín Rodríquez | 2 | Ariadna Caraballo | 3 | 2 |
| Agustín Rodríquez | 2 | Verónica Altamirano | 5 | 2 |
| Ariadna Caraballo | 3 | Agustín Rodríquez | 2 | 2 |
| Micaela Laboy | 4 | Juana Baeza | 1 | 1 |
| Verónica Altamirano | 5 | Agustín Rodríquez | 2 | 2 |
| Verónica Altamirano | 5 | Juana Baeza | 1 | 1 |
| Reina Rivero | 6 | Pilar Alarcón | 10 | 6 |
| Pilar Alarcón | 10 | Reina Rivero | 6 | 6 |
| Daniela Menéndez | 11 | Cristal Soliz | 13 | 11 |
| Daniela Menéndez | 11 | Marco Antonio Barrera | 12 | 11 |
| Marco Antonio Barrera | 12 | Cristal Soliz | 13 | 12 |
| Marco Antonio Barrera | 12 | Daniela Menéndez | 11 | 11 |
| Cristal Soliz | 13 | Daniela Menéndez | 11 | 11 |
| Cristal Soliz | 13 | Marco Antonio Barrera | 12 | 12 |
- `new_ident` is minimum of own identifier and identifiers one step away
- Doesn't keep people with no contacts
Recursive Labeling
with recursive labeled as (
select
person.name as name,
person.ident as label
from
person
union -- not 'union all'
select
person.name as name,
labeled.label as label
from
(person join bi_contact on person.name = bi_contact.left)
join labeled on bi_contact.right = labeled.name
where labeled.label < person.ident
)
select name, min(label) as group_id
from labeled
group by name
order by label, name;
| name | group_id |
|-----------------------|----------|
| Agustín Rodríquez | 1 |
| Ariadna Caraballo | 1 |
| Juana Baeza | 1 |
| Micaela Laboy | 1 |
| Verónica Altamirano | 1 |
| Pilar Alarcón | 6 |
| Reina Rivero | 6 |
| Elias Merino | 7 |
| Minerva Guerrero | 8 |
| Mauro Balderas | 9 |
| Cristal Soliz | 11 |
| Daniela Menéndez | 11 |
| Marco Antonio Barrera | 11 |
| Bernardo Narváez | 14 |
| Óscar Barrios | 15 |
- Use `union` instead of `union all` to prevent infinite recursion
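A minimal sketch of why `union` terminates on a cycle, using Python's sqlite3 module and a made-up two-node graph:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table edge (src text, dst text)")
# A two-way contact is a cycle: a <-> b.
con.executemany("insert into edge values (?, ?)", [("a", "b"), ("b", "a")])
# union drops rows the CTE has already produced, so the cycle cannot
# recurse forever; union all would bounce between a and b endlessly.
rows = con.execute("""
with recursive reachable as (
    select 'a' as name
    union
    select edge.dst from reachable join edge on reachable.name = edge.src
)
select name from reachable order by name
""").fetchall()
```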
Exercise
Modify the query above to use union all
instead of union
to trigger infinite recursion.
How can you modify the query so that it stops at a certain depth
so that you can trace its output?