Getting started with the VM
Please grab feeg6003_SQL from ngcmbits and install in the normal way. Once you've logged in, please start "DB Browser for SQLite" from the Desktop. This is all you'll really need for this workshop; the database we will be using is pre-loaded. Alternatively, you can install the database into another manager, or install another database; this is really easy to do, and we'll include a section on it at the end.
Let's talk about Databases
SQL is, as we will soon discover, a language for querying Databases, so before we start we need to make sure we understand what a database is, why we might need one.
Databases start with data. Data is fundamentally a set of values of qualitative or quantitative variables. It's important to understand that data is meaningless without structure or context; if I give you the sequence of characters:
K%4qtSYg!7ZpMJDz^CCXm$j@d!-q3$
It is effectively meaningless. If, however, the instant I tell you that it is a nuclear launch code, it's suddenly a lot more meaningful (and potentially alarming). Our data is only as good as it's structure; if somebody found this code on a piece of paper on the street, nothing would probably come of it, but if it was found by a hacker on a government website, the results would probably be undesirable.
When we write programs, we might store data in simple variables or csv files; a database is a 'step up' in complexity from this, but it has some unique advantages. It may not always be appropriate to store data like this; setting up and maintaining a database is complex. However, you gain four big avantages from a database:
-
Access: a properly database can be queried by a lot of people, very quicky.
-
Relationships: connections between data can be easily be maintained, in a modular way.
-
Updates: New data can be added, and old data can be delete quickly.
-
Searches: records can be searched for relatively easily.
More specifically, most databases used today are Relational Databases.
Relational Databases are at a basic level comprised of tables (and lots of other things that we'll worry about later). Understanding tables is the most important part of understanding SQL; if we aren't thinking about tables in the right way, SQL is really, really hard.
Fortunately, if we understand tables, SQL is really, really easy.
What is a table? It's easiest to see an example. The table below is called 'PhDStudents':
Let's get some names out of the way; this table has four fields; Name, Age, PhD and Funding. These are the columns. They correspond to a type of data.
It has four records, corresponding to the entries of Dave, James, Fred and Boris. The information corresponding to these individuals is contained in the rows, one row for each individual.
You might have seen people show tables the other way, with the fields corresponding to the rows and the records corresponding to the columns. These people are monsters, and you should probably avoid them. In all seriousness, it's best to be consistent in the way you think about databases; every database manager that I've ever encountered has organized it the way I'm showing you, so it's probably best to think about it this way. If you don't, you're just making it harder for yourself to understand.
A table is one part of a database. We'll see an example of a database that has more than one table in a moment; typically different tables will have a records that correspond, but different fields, with a 'linking' field to allow them to be joined. We'll talk about this more later, so don't worry if you can't quite picture how this will work right now.
Notice that the order of any of the fields or the records doesn't really matter here; I could put Boris first (as long as I moved up his entire row) and nobody would really mind. I could also swap the PhD and Funding column, and it wouldn't matter either. We should never assume that the order of the records in any table is necessarily meaningful; if we want some kind of specified order, we should insert a field called something like 'order' and give each person a number, so we can order them if they get mixed up.
This seems simple so far, but we've just covered a lot of concepts that are vital to the way SQL works, so it's worth going over them, even though they may seem obvious.
Game of Thrones
There are a lot of boring example databases out there. We could have used the semi-famous (free) 'northwind' database contains a lot of data on a fictitious specialty foods company. We didn't. Instead...
Our example database, as you might have already seen, is called GameOfThrones.sqlite. Somebody (specifically the reddit user 'Mynotoar', to give appropriate credit) put together a spreadsheet with the 2434 characters from George R R Martin's 'A Song of Ice and Fire' series (Game of Thrones to most of us), complete with a lot of interesting information about them. It was fairly easy to turn this into a database, with a handful of tables. This should be pre-loaded into DB Browser for SQLite, in the right panel of which under 'Tables' you should be able to see four tables (Books_Table, Characters_Table, ForenameFrequency_Table and Houses_Table). If the database is not loaded, click "open database" and you will find the .sqlite file in the "SQL" folder on the desktop. Note that it's not actually common for tables to actually have the word 'table' in the name, but we've put it in here to make everything clear.
For anyone worried about spoilers, there is a field called 'Alive' that shows if a character is alive at the end of the most recent book. None of our examples reference this field, so you should be able to safely avoid it. Consider it an incentive to write your queries well. One piece of advice here; avoid the all * (asterisk) syntax if you are worried about this (but we'll talk about this later).
SQL
So we've got this far, and not really talked about SQL yet. If you go onto the Wikipedia page for SQL, it will say that SQL is the standard language for storing, manipulating and retrieving data from databases. This is a fairly good description; we use SQL as a 'tool' to interact with databases. Wikipedia will also insist that SQL stands for 'Structured Query Language'; this isn't actually (technically) correct. IBM originally created the language in the 1970s, and wanted to call it SEQUEL (which is what you will often hear it called instead of S.Q.L). This stood for 'Structured English QUery Language'; or a language that created queries for databases using structured english. Unfortunately the patent for this was taken by an airline company, so they shortened the acronym to SQL, which it has remained to this day. People therefore often think it's okay to just remove the 'English' from the acronym, but this implies that the 'Structured' acts upon 'language' instead, and a structured programming language is an actual (separate and unrelated) concept.
But enough history/grammar. Why is SQL the most well known and well used database query language? Partly because it is the most well known and well used query language (it's been around 44 years!); in a circular fashion, effectively everyone uses it so if you don't use it your databases won't work with anyone else's. It's not just that though; SQL is both powerful and easy to understand (unlike any of its competitors). It's also capable of handling both 'detail' (small, complex queries) and extremely large, broad queries equally as well.
If you've only every used 'procedural' languages before (Python, C, Fortran etc), you might be surprised by the way SQL works. You tell SQL what you want, not how to get there, and it figures out how to do it. This can make it initially quite hard to understand.
In general, in SQL you write a 'query' which is sent to the database, performs some action, and may return some information to you. A single query may be very long, or very short. It's also common to write a series of queries that execute in sequence in a script.
There are many different types of SQL database available, and all have distinct advantages/disadvantages. We are using SQLite, primarily because it is lightweight and our database isn't really that complex, and is embedded in the software rather than requiring a server; you will be querying and modifying your own database on your own (virtual) machine. It's common however for a SQL database to run on a server, where it can be queried and/or edited by multiple users. Every time you search on an online store for a product, an SQL query is constructed and sent to the server hosting the database, which returns a list of products matching your search parameters.
As an aside, almost every modern major data breach (e.g. TalkTalk in 2015) was conducted using SQL; or more specifically, a technique called 'SQL Injection'.
And if you continue reading, I'll show you how to do it.
Anyway, for the moment we are querying an SQLite database, using a GUI called "DB Browser for SQLite" (an original name). If you prefer, you can also send queries directly from the command line (but we won't be covering that here).
SELECT
All programming languages start with Hello World. SQL is no different. In the box to the right of the DB browser, type:
Press the run button. This should output the following:
You can see that this has returned what looks like a 1x1 cell, with 'Hello World' within and above it. You can also add:
SELECT 'Hello World' AS Output
What you are actually doing here is telling the database to return you the text 'Hello World', and if you add the AS keyword, to label the field that this text appears in as 'Output'. This isn't good SQL. You're not actually querying a database, just displaying some text. A much better example is:
SELECT Character, Title, Gender
FROM Characters_table
This returns:
Here we've requested (and been returned) every record in the fields 'Character', 'Title' and 'Gender' from the table 'Characters_table'. The structure of a query is easy to understand; the field names after the SELECT call the fields, and the name after the FROM specifies the table.
SQL queries contain 'SQL Keywords'; we've already encountered SELECT, FROM and AS. It's syntactically correct to write them lowercase (try it!) but most SQL developers write them uppercase to differentiate them from other things (strings, variable names etc). It's also worth noting at this point that SQL does not have strict rules on whitespace, nor do returns it matter; you can write an SQL query on one line, or on as many lines as you like. Technically, we should end our queries with a semicolon (like in C) but in SQLite we get away with it; this is because we are only executing one query at a time. If we were to write two queries at once, we would need to end them with semicolons, so the database knows when one query ends and another begins.
You can also select all fields using the * (all) symbol:
But I would advise not doing this if you are trying to avoid spoilers! This will effectively show you the entire table.
Overall, SELECT is basically 'print' or 'display', but is probably most accurately described as 'select'. Amazing.
Where
The WHERE statement adds a condition, and it is placed after the FROM statement.
SELECT <field1>, <field2>
FROM <table>
WHERE <condition>
A simple example:
SELECT Character, Title, Gender
FROM Characters_Table
WHERE Gender = 'M'
This statement selects the fields Character, Title and Gender from the table Characters_Table, but only the rows where the Gender field has the value M (Male). Try it with Gender = 'F'.
Note that we don't actually need to select the field to use the WHERE condition:
SELECT Character, Title
FROM Charcters_Table
WHERE Gender = 'M'
Notice that even though we didn't select gender, all the names that are returned are male. This is because the query first looks at the Table we specified with the FROM, then filters it according to the WHERE, then from the product grabs and displays only the fields after the SELECT.
Other useful keywords and functions
The DISTINCT statement is how we get only unique entries from data. We apply DISTINCT right before we specify a field name after a select:
SELECT <other fields>, DISTINCT <field1>, <other fields>
FROM <table>
Notice how DISTINCT and field1 are not comma separated. We can think of DISTINCT as acting on field1. This will return only the rows where field1 is unique. An example:
SELECT DISTINCT Forename
FROM Characters_Table
This returns every unique forename in the database.
The order in which this happens is just the order in which the names appear in the database. For example, if Jon Snow is before Jon Smith in the database, then Jon Snow will be selected, but Jon Smith will not. As we saw before, this is semi-arbitrary, depending on how the database happens to be ordered at the time. In the case above, this isn't hugely important, as we are just interested in the first names, but it's likely that this isn't going to be enough. We will see later how you can order a database first before selecting.
Another useful function is the COUNT() function. This is literally a function in the same sense as in other programming languages, and it counts the number of rows in any 'table' it is fed. An example:
SELECT COUNT(*)
FROM <table>
This will return the number of rows in 'table'. All that is actually happening here is that the FROM is grabbing the table, and then the COUNT() is being applied to it. This returns a number (2434), which the SELECT prints, just like in the "Hello World" case we saw at the start. Because COUNT() only counts rows, we could also just put COUNT(rowname), where rowname is any field in the table, and it would give the same results. Most SQL developers just put COUNT(*), to save typing if they just want the number of rows in a table.
A more interesting application of this is:
SELECT count(DISTINCT Forename)
FROM Characters_Table
Woah! Mind blown! Although what is actually happening here does make sense; FROM grabs the table, which is then filtered into a smaller table by DISTINCT, acting on forename. This is then counted by the count function, and SELECT prints it. We have just retrieved the number of unique Forenames in the databases.
There are a few other useful keywords/conditions that we should talk about: AND/OR which lets us use multiple conditions (exactly like &&/|| in C) and ORDER BY which is placed at the end of the statement and sorts the selected result. E.g.
SELECT *
FROM <table>
WHERE <condition1> OR <condition2>
ORDER BY <field>
This will return all the fields, and all the rows of table where the conditions are met, ordered by field. If field is numeric, it will order numerically; if it's a string, it will order alphabetically.
More advanced techniques
The LIKE Keyword can be used as a less rigid filter in conjunction with a WHERE:
SELECT Character, Surname
FROM Characters_Table
WHERE Surname LIKE 'Sta%'
Here % is a wildcard that means any combination of characters: in effect we are looking for surnames that have 'Sta' at the front.
This is where things start to get hard to understand (hopefully you're not lost already!).
It’s possible to nest SELECT statements using the IN keyword. This is best illustrated with an example:
SELECT Character, Surname
FROM Characters_Table
WHERE Character
IN(SELECT POV FROM Books_Table WHERE Book = 'A Game of Thrones')
This is basically selecting the character that appears as a point of view character that appears in the book 'A Game of Thrones'. It's best to think about the statement in brackets first; we are creating an imaginary table, which has the field POV from Books_Table, on the condition that the book is 'A Game of Thrones'. This is basically a list of names, that all are the point of view characters from the book. The outer statement takes the Characters table, and compares the field Character to the imaginary table we just created. It only keeps the rows of the Characters table that match the names in the nested table. From this, it then only selects the fields Character and Surname.
Phew! Why would this be useful? Why can't we just select POV? The answer is the surname, which the Books_table doesn't have. We just connected the information in two table togethers, and ended up with a result that we wouldn't have otherwise have aquired. This is the essence of relational databases, and make sure you understand it.
Joins
Joins separate the men from the boys. A join is an instruction to combine data from two sets of data; normally two tables. We just did this in a sense with a nested select, but a join is a more direct way of doing this.
As an aside, this is a good time to talk about keys. SQL convention dictates that every row in a database must have a unique value so it can be identified; this is called a primary key. This is an actual construct inside SQL, not just an abstraction, but how this is exactly implemented is a little beyond the scope of this blog. If you click on the arrow to the right of the table name in the database browser you can see that you can see the fields the table has; the first in each case is a field called 'ID'. The symbol next to this field looks different to the others; you can see that (if you squint) it has a small key icon next to it. This is because it is the primary key of the table, a unique identifier of each row. What exactly this is can vary table to table; here it is just a number, but it might be a code or other sequence of characters. The only requirement is that it must be unique for each row.
There is also another type of key, called a foreign key. We don't have any examples in our tables, but a foreign key constraints the values of one column in one table to values in another table. A field like this would be a prime to perform a join.
In essence, a join behaves as follows: given two tables, A and B, an inner join will match each row in table A with rows in table B, on the specified join field. In the case where there are multiple identical entries on the join field, every possible combination will be created, potentially leading to some extremely large resulting tables (for this reason, it's extremely common to nest joins with SELECTS DISTINCTS, but don't worry about this too much yet). What you are presented with at the end depends on the exact 'type' of join.
There are a few different types of join: Inner Join, Left Outer Join, Right Outer Join, Full Outer Join, Cross Join and Self Join. The join type essentially specifies how the database should handle the entries that don't match. Sound confusing? Some Venn diagrams always help.
Inner join
The concept of an inner join is easy; only show the parts of both tables that match each other. Throw everything else away.
An example of an inner join is shown below. Note that we only actually need to use the keyword JOIN here; inner joins are so prevalent that the JOIN keyword is shorthand for INNER JOIN, although you can write this too.
SELECT Characters_Table.Character, Characters_Table.Surname, Houses_Table.Words
FROM Characters_Table
JOIN Houses_Table ON Characters_Table.Surname = Houses_Table.House
Woah, what is happening here? Why do we have to put full stops and table names where they shouldn't be? Firstly, this is because we are now referring to two separate tables in the same statement, so it's not obvious to SQL which fields we mean on their own. You can 'call' which table a field is from at any time if you want to, you just never need to until you reach this point. The notation is simply table.field, as you can see above.
Secondly, we can see here is that the JOIN statement goes after the table name referenced in the FROM. This is basically SELECT fields FROM (A JOIN B), but we don't write the brackets (although this whole thing might be a bit less confusing if we did). A and B are joined into an imaginary table, and from that we grab the fields we actually want to display or 'select' using the select statement.
So what is this statement doing? First it grabs the Houses_Table table, which contains the info on all of the houses (ruling families) in Game of Thrones. We then also grab the Characters_Table table, which is all the info on the characters. It then does a JOIN, which effectively cycles through all the rows in the Houses table, and compares the field 'House' (which is the name of the house) to the surname field in the Characters_Table table. If the fields match, a copy of the rows from each table are basically appended together, and every match is stacked into a new imaginary table. It is from this table that the next operation happens, in the case above the select. So for the house 'Stark', we would expect a row for each character, containing the same house information but different character information. The end result, with the select, is the Character’s forename, surname and their associated house words.
If there were duplicated houses with the same name, we would have a situation where every characters entry was duplicated twice in the resulting table; the join would match them to both the houses.
The INNER part of this join just means that in the end result we don't see any houses that don't have characters, or characters that don't belong to any of the houses.
Left Outer Join
But what if we did want to show houses that have no members? To do this, we would need a left outer join.
This is exactly the same as an inner join, but at the end we are also given the non-matching rows from the first 'left' table in the join. This begs the question of what happens to the fields in these rows that come from the other table; they are simply populated with NULL values.
Right Outer Join
A right outer join is the same, but just the other way around:
The trouble is, a right outer join does not exist in SQLite, because there isn't a situation where you can't just rearrange the tables to use a left join. How do we use a left join?
SELECT <fields>
FROM <TableA>
LEFT JOIN <TableB> ON <field>
Notice that the 'outer' keyword isn't really a keyword; SQL knows what you mean. An example follows:
SELECT Character_Table.Character, House_Table.Region
FROM Characters_Table
LEFT JOIN Houses_Table ON Characters_Table.Surname = Houses_Table.House
WHERE Characters_Table.Forename = 'Jon'
What does this do? First it left joins the characters table to houses, so all characters have at least one row, and those that have houses have the house info appended (characters without these houses have NULLs in these fields). If any houses have the same name twice (which is unlikely), then this may lead to duplicates. From this, we only accept rows where the 'Forename' field from the characters table is 'Jon', which will remove a lot of rows! Lastly we select from this the Character name fields and the regions that their house is from. What we end up with is the full names of everyone called Jon, and the region that their house 'lives'.
You will immediately see, if you know anything about Game of Thrones, that Jon Snow hasn't got a region, which is probably inappropriate considering his attachment to the north. This is because he's not a Stark, so hasn't been matched to a house. This is an example of real problems that can occur with data; common sense might match data in a certain way, but a database might match in another.
Full Outer Join
There is a final type of join (of this type): full outer join. This is exactly what it sounds like; it will show all the data from both tables, matching and non matching.
Once more, to confuse matters SQLite doesn't support full outer joins. But you can 'fake' it, like this:
SELECT DISTINCT
FROM (<Table_A> LEFT JOIN <Table_B>)
UNION ALL (<Table_B> LEFT JOIN <Table_A>)
What we are actually doing here is joining both ways, appending the data (this is what UNION ALL does) and filtering for distinct rows. This is the same as a full outer join.
There are two other types of joins; Cross and Self. I won't really cover both here, but Cross basically combines every row with one table from every row in another table (this isn't very useful in reality). Self join is used when you need to join a table to itself (which can be useful).
Populating and Modifying Tables
We've spent a lot of time looking at how we select (or retrieve) data from tables so far. What we're really doing is 'grabbing' a local copy of the data for our own uses. What we will show you next is how you modify the actual tables that make up your database. In most databases, how this can be done is strictly controlled; don't expect to ever have the kind of power we are about to show you if you ever go and work with a real database at a real company (so enjoy it).
It is relatively easy, to create, populate, modify and delete tables in SQL. It's especially easy to delete them, so you should always keep backups. The first command we will learn is the insert statement: this adds a new row to the table. It's structure is as follows:
INSERT INTO <Table_Name> (<Field_1>, <Field_2>)
VALUES (<Value1>, <Value2>)
You are not required to provide data for every column in the table (unspecified fields will be populated with NULLs). An example with the real database is:
INSERT INTO Characters_Table (Character, Forename, Surname)
VALUES ("Enrico Degregori", "Enrico", "Degregori")
Which does what you expect. Note that Enrico has unwisely left his status as 'Dead' or 'Alive' ambiguous. Whether or not NULL means he is alive or dead, it can't be good (maybe he is the Night King!).
The DELETE statement just deletes an entire row. It's not subtle, it just kills the entire row. If we want more control we need to use the update statement (which will be discussed in a moment).
We can put Enrico out of his misery like this:
DELETE FROM Characters_Table
WHERE Forename = "Enrico"
Anyway, we might want more control over editing values, and this can be done with the UPDATE statement. For example, if we wanted to add the title of "Night King" to Enrico (imagine we didn't just delete him), we would do:
UPDATE Characters_Table
SET Title = "Night King"
WHERE Character = "Enrico Degregori"
As you can imagine, if our condition is not clever, we might accidentally give all characters this title. If we did:
UPDATE Characters_Table
SET Title = "Night King"
WHERE 1=1
We would essentially be updating the rows where the laws of one being equal to itself still apply (specifically, all of them). It's easy to see how you could accidentally modify the table to lose a lot of information in one update statement. There is no 'undo' statement in SQL. In the game of SQL you keep backups, or you die.
Using SQL in Python
It's actually really, really easy to use SQL with Python. There are various python packages that can allow you to talk to an SQL server, but it all boils down to functions that send strings (that you can just type or construct) as SQL queries straight to the database, and the returned data will be imported as a data structure of some kind. A simple set of commands using a package sqlite3 is shown below:
import sqlite3
connection = sqlite3.connect("<databasename.db>")
cursor = connection.cursor()
sql_command = """ <SQL COMMANDS> """
cursor.execute(sql_command)
connection.commit()
connection.close()
The point is that once you've got a database set up for your data, you can use it to store, modify and filter it really easily. One of the advantages of SQL is that you can construct SQL queries using other code, and then send them onto your database. This is exactly what a website does when you search for a product, which leads me on to...
SQL Injection
So you've got this far, so it's only fair that we talk about how you can use SQL to bring down, steal from, and get sent to prison because of, other people's databases. When you enter text in any website that has some kind of 'search' or database related functionality, the website will use your search terms to construct an SQL query. How complex this is will vary depending on the website, but imagine if I enter "Toothbrush" in an imaginary online store, it might construct a query like:
SELECT *
FROM Products_Table
WHERE Name = "Toothbrush"
It then sends this to the SQL database, which will duly return all items with the name "Toothbrush". It does this by sandwiching your query in the statements: SELECT * FROM Products_Table WHERE Name = " and "; You can already see the danger here. If I were to enter " anywhere the search field, it would 'escape' the text of the product name and allow me to enter any code I want for example, a mundane use of this might be if I enter:
This would construct and send:
SELECT * FROM Products_Table
WHERE Name = “Toothbrush” OR 1=1;”;
This would send me all products, not just one with the name toothbrush. The strange command "; after the first would probably be just throw an error with the database. This isn't particularly malicious, but I could instead type:
Toothbrush"; DROP TABLE Products_Table;
This would construct and send:
SELECT *
FROM Products_Table
WHERE Name = "Toothbrush";
DROP TABLE Products_Table;";
This will do two things; firstly return me a list of all Toothbrushes, and then permanently delete the entire table "Products_Table" (where all the products are). The website will be unusable until the table can be restored from a backup. This is damaging, but there are more subtle things people can do; there are tables of tables in SQL, so if one experimented enough it would be entirely possible to acquire things like usernames, passwords, credit card information or worse.
In reality, there are several steps that developers can take (and are indeed required by law) to make this a lot harder than the examples that I have shown above. Websites scan the inputted text and send escape characters (e.g. ") as the character rather than the escape command, or just block or ignore certain characters entirely. Permissions can also help, as if the commands sent by the website can only do a limited number of tasks then the damage they can do is limited. Nonetheless, it is possible to find vulnerabilities in these measures, and websites suffer from SQL injection attacks all the time. TalkTalks 2015 attack was performed using a denial of service attack as a distraction to perform SQL injection on TalkTalks servers. The estimated damages to the company were in the region of £40 million.
Needless to say, never attempt this (unless it's on your own website); the bottom line is that in accessing anyone else's private data, you are breaking the law.
Exercises
- Exercise 1. Write and execute a query that shows the forename and the surname of a character, and the book they appear in. Solution
- Exercise 2. Write and execute a query that shows the name of first chapters of each book (show the name of the book too). Solution
- Exercise 3. Write and execute a query that shows all the houses in the region 'North'. Solution
- Exercise 4. Write and execute a query that counts the total number of forenames and also the number of unique forenames, ans displays both. Only use a single query. Solution
- Exercise 5. Write and execute a query to count the number of dead Targaryens. Solution
- Exercise 6. Write and execute a query to check that the number of characters with the word 'Stark' contained in their character is the same as the number of characters with the surname 'Stark'. Solution Efficient Solution
- Exercise 7. Write and execute a query that shows the characters with forename first letter "S" and last letter "a" that appear in the first book, the chapter name and the chapter number. Solution
- Exercise 8. Write and execute a query that shows the house, number of people in each house, appended with house words. Solution
- Exercise 9. Write and execute a query that adds the title of Lady to Sansa Stark. Solution
- Exercise 10. Write and execute a query that gives to Jon Snow the same title of Eddard Stark. Solution