Intro to SQL

Learning Objectives

Understand what is meant by a "relational" and "non-relational" database and their different use cases
Understand how to create a database and database table through the terminal
Recognise the different data types that can be assigned to columns in a database table
Be able to insert, retrieve and update data using common SQL operations
Understand the concept of atomicity
Understand what is meant by SQL injection

Introduction

Being able to retain and access data is a vital part of any application. That may sound like a bad thing (who wants records of what they viewed on eBay being used to serve targeted ads?) but in practice there are a lot of good things that it leads to. We may not like cookies tracking what we view on eBay, but if we weren't retaining data somewhere we also wouldn't be able to read a seller's feedback and would have to re-enter our details every time we buy something.

The information is stored in a database which may or may not be part of the same application. Databases vary wildly in size, structure and use case but they all have something in common: we put data into them so that we can access it again when we need it later. It doesn't matter if we are working with a small test database running locally or a massive third-party data center, we will be doing the same basic things every time. Here we will learn how to manipulate a database in isolation but they can be integrated with applications written in other languages too, for example by using some the libraries available to Java's Spring framework.

Types of Database

Databases come in two different types: relational and non-relational. The fundamentals are the same, but there is a key difference in terms of structure. Everything we enter into a relational database must follow a clearly-defined format and we can be sure when we read from it that the results will contain the data we need. Non-relational databases are much more flexible and give the user a greater degree of freedom in terms of how they are utilised.

The two types are often referred to as SQL and NoSQL databases. SQL stands for Structured Query Language but it in fact covers many different varieties - examples include MySQL, PostgreSQL, H2 and many more. Each has slight differences, but the fundamentals are the same. Often these fundamentals are taken care of for us by a tool known as an Object-Relational Mapper or ORM, but knowing how to manually check what the tools are doing for us is still a vitally important skill.

The type of database we use will be determined by a combination of business needs and the quality of the data we are working with. Data quality involves a number of different factors including structure, consistency and completeness. We can design an application to steer a user towards providing data in the format we need it, but this won't always be possible.

Consider a scenario where we ask users to register for a service. An online form can be designed to force the user to complete every section and format their responses in the required way, for example dates as DD/MM/YY. A paper submission of the same data doesn't have those restrictions and users can give details in a different format, make mistakes which could otherwise be picked up by a computer or even skip sections completely. A dataset with these inconsistencies is known as a dirty dataset and would be very difficult to force into a relational structure with its tight constraints on entries. However, it would easily fit into a non-relational system where each entry can be structured differently form the others.

Having our data stored in a format which accurately reflects what a user provides needs to be balanced against it being in a format which our applications can use. Dirty data can be restructured into a format suitable for use with a relational system (a process known as data cleaning) but this has the inherent risk of potentially losing important data or having the raw data distorted to fit the desired structure. The potential impact of this must be considered before deciding how to proceed.

In this course we will focus on relational databases and will require our data to be supplied in an appropriate format. We will, however, still likely encounter scenarios where a non-relational system could work as well.

Setting Up a Database

Recommend VSCode for this code-along

The data we enter into our database will be stored in tables. Tables consist of columns, which denote the values being stored, and rows, which represent the entries made. In order to set up a table in our database we need to specify the name we want to refer to it by, which should be plural. In this example we're going to keep track of actors and the movies they appear in.

We can begin by creating a psql database locally, to store our data. createdb is a command we have access to as we have previously installed Postgres.

createdb cinema

Now that our database exists, we can create a file to store our SQL code.

touch cinema.sql

We can now start by creating a movies table. We need to define the columns, including the type of data stored in each. These types are similar to those we've seen already, but have slightly different names. By specifying a type for each column we help prevent data being entered inaccurately.

Once we store our data we'll need to refer back to it at some point, but that can be quite a challenge. If we have two movies with the same title, how can we tell them apart? To address this we'll add a column called id which will store a unique identifier for each row. This will have a special type called SERIAL, which looks like any other number but has special significance to the database management tool. Serials are assigned automatically when a row is added and won't be reused when a row is deleted, removing a potential source of user error.

-- cinema.sql

CREATE TABLE movies (
    id SERIAL,
    title VARCHAR(255),
    duration INT,
    rating VARCHAR(255)
);

Note the semicolon! Semicolons are required in SQL, easily missed, and the source of many errors. You should also pay attention to commas (,) as they are also a common source of bugs in a SQL program.

VARCHAR(255) denotes a VARiable length CHARacter with an associated maximum size in memory. We often use 255 as it is the smallest amount of space the system can allocate, but we can provide a smaller number here to limit the number of characters entered.

Great! We now have the instructions for PSQL to create a table for us. But how do we run this code? We can use the database manager through Terminal using the command:

psql -d cinema -f cinema.sql

The general format is psql -d database_name -f path/to/file.sql

We should receive some feedback (see below) which indicates that we have successfully created a movies table.

CREATE TABLE

We can also check that this has worked by writing a SELECT query. We will talk more about reading from our database shortly, but let us write some code right now that allows us to look at our movies table. This lets us visually see the work we are doing as we go along.

-- cinema.sql

CREATE TABLE movies(
    id SERIAL,
    title VARCHAR(255),
    duration INT,
    rating VARCHAR(255),
)

SELECT * FROM movies;       -- ADDED

In order to see the result of the select, we will need to run the file again (psql -d cinema -f cinema.sql). Before we do that, we should note that throughout this lesson (and a future lesson on Advanced Select) we will be running this code many times. To avoid having multiple entries of the same data, are going to add another command to the top of our file:

-- cinema.sql

DROP TABLE movies;  -- ADDED

CREATE TABLE movies(
    id SERIAL,
    title VARCHAR(255),
    duration INT,
    rating VARCHAR(255),
)

Now we can run our file again to see the table we created.

psql -d cinema -f cinema.sql

Basic Operations - CRUD

Create

There's no point in having a database if we don't store any information in it. The keyword we use adding new information is INSERT and the query follows this format:

INSERT INTO table_name (column1, column2, column3) VALUES (value1, value2, value3);

We'll add some movies to our table:

INSERT INTO movies (title, duration, rating) VALUES ('Alien', 117, '18');
INSERT INTO movies (title, duration, rating) VALUES ('The Imitation Game', 114, '12A');
INSERT INTO movies (title, duration, rating) VALUES ('Iron Man', 126, '12A');
INSERT INTO movies (title, duration, rating) VALUES ('The Martian', 144, '12A');

Note that we don't need to include the id column here. Because we gave it the type SERIAL the database manager will take care of that for us and auto-assign a value to each new entry.

What happens if we miss out some of the information?

INSERT INTO movies (title, rating) VALUES ('Braveheart', 'PG');

Because we're using a relational database there needs to be something in the duration column, but we haven't provided anything. The database manager ensures there's a value there by inserting NULL instead. We can force the user to provide a value by adding NOT NULL after the column's data type when we set up our table.

Read

Just like there's no point in having an empty database, there's no point putting things in if we're never going to look at them again. We can read information from the database using the SELECT keyword. The simplest thing we could do is read everything from our table, which we do with this query:

SELECT * FROM movies;

Obviously that's not always going to be practical. We can limit the results in two ways: by column and by row. Replacing the * with one or more column names will only give us certain pieces of information back:

SELECT title FROM movies;

We get the requested information for each row in the table. If we want to limit the number of rows we can do so using the WHERE keyword:

SELECT * FROM movies WHERE rating = 'PG';

We aren't limited to checking equality, we can compare values (>, <) and check negatives using NOT. We can combine the two as well:

SELECT title FROM movies WHERE duration > 120;

Update

Once something has been added to our database we may want to change it at a later point. The UPDATE keyword lets us do this and the syntax is similar (although not exactly the same) as adding a new item. We need to be more careful here though, since we don't want to accidentally modify more than we need to. Just like we do when we SELECT, we can use the WHERE keyword to limit the rows we are updating. Usually we will use the id property to ensure we only update a specific row, but there may be occasions where we want to use something else to update multiple rows at once.

UPDATE movies SET rating = '12A' WHERE id = 5;

The syntax is slightly different if we want to update multiple columns, where we need to add brackets around the column names and values:

UPDATE movies SET (duration, rating) = (178, '15') WHERE id = 5;

Delete

The final core action will enable us to remove entries from our database. Again we need to be careful and use WHERE to make sure we only remove what we want to.

DELETE FROM movies WHERE id = 5;

If we leave out the WHERE clause we'll end up deleting everything in the table!

DELETE FROM movies;

Data Hygiene

Atomicity

It's not uncommon for a change in our data to be refected in multiple places at once. Consider a hypothetical scenario where once person is buying something from another. Person 1 starts off with £10 in their account and pays it to person 2, who had £0 to start with.

UPDATE people SET account_balance = 0 WHERE id = 1;
UPDATE people SET account_balance = 10 WHERE id = 2;

When these two statements together are executed together they form a transaction. Transactions can be made up of as many or as few statements as necessary, but it is critical that every statement is executed correctly. We call such a transaction an atomic transaction.

In practice there are two steps to complete this transaction: removing the money from person 1's account and adding it to person 2's account. If either statement fails to execute then we have an inconsistency: either both people will have £10 in their accounts or both will have £0. When we design a system to use atomic transactions we ensure that our program has checks in place to prevent the transaction being completed if a statement fails. Exactly how this is implemented varies between systems but from the user's perspective it means that they only see a change in the state of the application if the the transaction has completed fully.

We can take direct control of this transaction process if we wish and manually instruct the database tools when to begin, end and reset them. The keywords BEGIN and COMMIT mark the beginning and end of the transaction, as shown below.

BEGIN;
UPDATE people SET account_balance = 0 WHERE id = 1;
UPDATE people SET account_balance = 10 WHERE id = 2;
COMMIT;

Once we pass the BEGIN command nothing will actually happen to the database until we reach the COMMIT command. This gives us some opportunity to correct any errors which have crept in before executing them. The ROLLBACK command will disregard any commands entered since BEGIN. Alternatively we can use it in conjunction with SAVEPOINT to roll back to a specific point rather than simply starting again. Let's imagin we accidentally transfer too much money to person 2. Using ROLLBACK we have a chance to fix the mistake before we COMMIT.

BEGIN;
UPDATE people SET account_balance = 0 WHERE id = 1;
SAVEPOINT money_withdrawn;
UPDATE people SET account_balance = 100 WHERE id = 2;
ROLLBACK TO money_withdrawn;
UPDATE people SET account_balance = 10 WHERE id = 2;
COMMIT;

SQL Injection

Just like any other language, SQL isn't immune to someone attempting to do something untoward with it. In fact it's even more vulnerable since the results of anything other than a SELECT query are permanent. There are many ways in which a database could be attacked, but one of the simplest is SQL injection.

Consider a scenario where someone is registering for a service. They would complete a form asking for their username and password, let's imagine they're not particularly creative or security-conscious and go with myName and abcd1234. Our application would present them with a form before taking the values they submit and inserting them into an SQL query to add them to the users table which would look like this:

INSERT INTO users (username, password) VALUES ('myName', 'abcd1234');

The values have been taken from the form, which means anything the user writes there will be added to the. SQL injection is a process where SQL code is submitted to the form, resulting in unintended and potentially damaging consequences for the database. Submitting ''); DROP TABLE users; as a password, for example, looks like nonsense. But it turns the query into:

INSERT INTO users (username, password) VALUES ('myName', ''); DROP TABLE users;

All of our user data gets deleted! This sounds bad enough, but if a SELECT query is added our data could be exposed to the outside world, or someone's password could be changed and their account stolen with an UPDATE.

There are a number of ways to prevent SQL injection, with an extensive list maintained by OWASP. Most modern development frameworks support the use of prepared statements to submit the query and data separately, meaning the database management system is substituting in the potentially harmful code as a value to store in the table instead of receiving it as a query to be executed.