Part 1. SQL & Analysis Fundamental Concepts

  • Writer: Norma López-Sancho
  • Mar 24, 2024
  • 10 min read

Updated: Apr 12, 2024

In my previous guide we created a Docker container running MySQL Server and connected it to MySQL Workbench, which is an excellent way of having your own tools to practice SQL in a real-world-like setup.


This guide is split into several parts. We will review some basic statements and quickly build towards a more rounded intermediate knowledge, approaching everything from a data analysis mindset.

At the same time, we will start to get familiarised with the Workbench.


So let’s play a bit, shall we?


1. Creating tables


In your Workbench, select the connection we just created (or whichever one you have and want to use).


Open a new query tab if you do not have one already, write the statement below, and press run:

SHOW DATABASES;

Let’s create a little database to get to know the Workbench and to review some fundamental SQL statements (creating databases, tables, and records), along with some intermediate concepts around them that will give us a much more rounded knowledge of data management.


CREATE DATABASE play_ground;

Of course, use whichever_name_you_want!


Now run SHOW DATABASES again:

We have somewhere to play!


If you want to fully delete the database you will need to:

DROP DATABASE name_database;

To work on your database you can either use the following command:


USE play_ground;

Or make sure you select the correct database in the navigator panel on the left, under the Schemas tab.


If yours does not appear, refresh your schemas by pressing the circular-arrow icon; or, in case you have lost the connection to the server, press the last icon on the top right, ‘Reconnect to DBMS’.

The Workbench has day-to-day shortcuts that let you bypass writing SQL, but instead of exploring those now, let’s create a table the old-fashioned way; raw SQL is fundamental for being able to develop scripts:


CREATE TABLE employees (
    employee_id INT AUTO_INCREMENT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(50),
    hire_date DATE
);

employee_id INT AUTO_INCREMENT PRIMARY KEY sets an INTeger (whole-number data type) that acts as the unique identifier for each individual record/row (PRIMARY KEY) and automatically generates an incrementing value for each new row inserted.


We are also using VARCHAR because these are variable-length fields; the advantage over CHAR is that VARCHAR only consumes storage space for the actual length of the data.
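To make the difference concrete, here is a hypothetical sketch (the storage_demo table and its columns are made up for illustration): CHAR suits values with a fixed, known length, while VARCHAR suits values whose length varies.

```sql
-- Hypothetical example: CHAR for fixed-length values,
-- VARCHAR for values whose length varies row to row.
CREATE TABLE storage_demo (
    country_code CHAR(2),   -- always 2 characters, e.g. 'ES', 'GB'
    last_name VARCHAR(50)   -- consumes only as much space as the name needs
);
```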


Refresh your schema again and, pressing the drop-down arrows, our freshly baked table will appear!

Let’s create another table, department, but first a TIP in case you have not realised by now:


You can keep writing your SQL queries in the same query panel; just select the one you want to run and only that selection will be executed, rather than the whole set.

Selection can be done by dragging the mouse over the line numbers on the left (or just clicking a number for a one-line statement).


Or, even cooler, you can use the ‘Execute statement under keyboard cursor’ icon instead of the usual one: it runs only the statement in which your cursor currently sits (from the previous ; up to the next ;).


CREATE TABLE department (
    department_id INT AUTO_INCREMENT PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL,
    location VARCHAR(200)
);

The only new keyword we are using here is NOT NULL, which prevents this field from being left empty when creating a new row, while ‘location’, where we have not specified it, allows NULL values.
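As a quick sanity check once the table exists (a sketch against our department table, with made-up values): inserting a NULL into the constrained column should be rejected with an error along the lines of "Column 'department_name' cannot be null".

```sql
-- Accepted: department_name is provided, location may be NULL
INSERT INTO department (department_name, location)
VALUES ('Consulting', NULL);

-- Rejected: violates the NOT NULL constraint on department_name
INSERT INTO department (department_name, location)
VALUES (NULL, 'Madrid');
```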


Now let’s imagine our company has a complex structure, like in consulting firms or shared services, and one employee can be related to several departments. This would be a many-to-many relationship between employees and departments, and creating a junction (index) table helps represent this complexity; it can also improve query performance and simplify data management.


There are other scenarios in which you could use a junction table; for example, if you are using JOINs so often to bring certain columns from two or more tables that it is jeopardising query performance, a junction table can improve execution time.


So here we go:

CREATE TABLE employee_department (
    employee_id INT,
    department_id INT,
    PRIMARY KEY (employee_id, department_id),
    FOREIGN KEY (employee_id) REFERENCES employees(employee_id),
    FOREIGN KEY (department_id) REFERENCES department(department_id)
);

Two new commands we find in this statement:


PRIMARY KEY (employee_id, department_id) ensures that each combination of ‘employee_id’ and ‘department_id’ is a unique identifier.


FOREIGN KEY (employee_id) REFERENCES employees(employee_id) sets a constraint on the ‘employee_id’ column and ensures that every value in it must exist in the ‘employee_id’ column of the employees table.


Let’s dive a bit deeper into this concept:


FOREIGN KEY is the constraint command that maintains the referential integrity in the database. It ensures that the values inserted into the child column must already exist in the parent column.


If a value is deleted or updated in the parent column, the foreign key constraint can be further defined to specify what action should be taken on the child column.


REFERENCES is the clause that establishes the relationship between the foreign key column and the parent column it refers to.
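Those actions are declared with ON DELETE / ON UPDATE clauses. A hypothetical variant of our junction table (not the one we created above, just a sketch) might look like this:

```sql
-- Hypothetical: remove the linking rows automatically when an
-- employee is deleted, and follow any change to the parent id.
CREATE TABLE employee_department_cascade (
    employee_id INT,
    department_id INT,
    PRIMARY KEY (employee_id, department_id),
    FOREIGN KEY (employee_id) REFERENCES employees(employee_id)
        ON DELETE CASCADE
        ON UPDATE CASCADE
);
```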


Similarly, if our company has different offices (at national or international level) you will likely have a table with office addresses, and we could create an index table (department_office) to reflect another many-to-many relationship in our database.


Then why did we create a location column in our department table? Just to have fun (and practice). Let’s delete it now:


ALTER TABLE department
DROP COLUMN location;

2. Inserting values in a table


Ok, now let’s step back a bit and create some records in our employees table, yes?


Given that in both tables we specified AUTO_INCREMENT as an attribute of the PRIMARY KEY, we won’t need to fill ‘employee_id’ or ‘department_id’ when INSERTing records (rows, tuples) INTO the tables.


INSERT INTO employees (first_name, last_name, email)
VALUES ('Horza', 'Gobuchul', 'HGobuchul@havingfun.com'),
       ('Bail','Channis','BChannis@havingfun.com'),
       ('Salvor','Hardin','SHardin@havingfun.com'),
       ('Derfel','Gadarn',NULL);

This code is pretty much self-explanatory: in the INSERT INTO clause you first choose your table (in this case, employees), then write the column names you want to insert data into, and in VALUES you set the info in the specified column order.


Note that text values need quotation marks (' '), something you don’t need for numerical values.


Let’s do one more insert:

INSERT INTO employees (first_name, last_name, email)
VALUES ('Shandril', 'Shessair','');

As you can see, for Derfel Gadarn we have set email as NULL, while for Shandril Shessair we have set it as '' (an empty string).


Also, there is a column in the table, ‘hire_date’, that we have not even mentioned in the statement!


Let’s see how the commands have handled the different scenarios:


  1. Setting a value as NULL (Derfel Gadarn)

  2. Setting a value as '' (Shandril Shessair)

  3. Not setting any value at all (hire_date column)


SELECT * FROM employees; -- this selects all the columns of the table
-- If you want specific columns you would:
SELECT column1, column2, ...
FROM table_name;

For the first scenario, of course, the value has been set as NULL.


But as you can see, scenario number 3 also ends up as NULL, so you don’t actually need to specify NULL explicitly (scenario 1) for the statement to do its thing.


For the second scenario, the column has been left empty where we specified '' (an empty string). Note that an empty string is a real, zero-length value and not the same as NULL: a column with a NOT NULL restriction would still accept '' while rejecting NULL.
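You can tell the two cases apart in a query: NULL is matched with IS NULL (a plain = NULL never matches anything), while the empty string is matched with = ''. Against our employees table:

```sql
-- Rows where email was never provided (true NULL):
SELECT first_name, last_name FROM employees WHERE email IS NULL;

-- Rows where email was explicitly set to an empty string:
SELECT first_name, last_name FROM employees WHERE email = '';
```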


Ok, but why is my ID starting from 4 rather than from 1?


When a table is created using AUTO_INCREMENT, the database automatically generates a unique value for each new record inserted, starting from 1 (unless another initial value has been specified).
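For reference, a different starting value can be set when the table is created. A hypothetical example (the table name is made up):

```sql
-- Hypothetical: make the first generated employee_id 1000
CREATE TABLE employees_from_1000 (
    employee_id INT AUTO_INCREMENT PRIMARY KEY,
    first_name VARCHAR(50)
) AUTO_INCREMENT = 1000;
```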


If your ID starts at another number, there may be different scenarios at play:


  1. There were previous records in your table, and the next ID was created after the highest existing value.

  2. There were previous records that were later deleted, or the table was truncated (removing all rows while keeping the table’s structure), and the auto-increment counter was not reset to 1, so it continues from the next available value as if the records were still there. By default, TRUNCATE TABLE your_table_name; does reset the AUTO_INCREMENT counter, but you know how life is, sometimes things fail!

  3. The AUTO_INCREMENT counter has been set to a specific value in the table options.

  4. There have been failed insert attempts; the counter may have been incremented even though the records were not successfully inserted.


Let’s put on our Sherlock hat and investigate further to see which scenario applies to us.

For that, we can check the existing records and the current value of the auto-increment counter:


SELECT * FROM employees;
SHOW TABLE STATUS LIKE 'employees';


In the result grid we can confirm that we only have 5 rows, but the auto-increment counter is at 9.
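As an alternative to SHOW TABLE STATUS, the same counter can be read from information_schema, which is handy inside scripts. Note that on MySQL 8 this value may be cached for a while by default:

```sql
-- Read the current auto-increment counter for one table
SELECT AUTO_INCREMENT
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'play_ground'
  AND TABLE_NAME = 'employees';
```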


In our specific example it’s easy to play Sherlock, as we are the ones who created the table and we know the only scenario that can apply here. Have you guessed?


Yes, failed attempts on inserting rows!


Does it have a fix?


Well.. yes and no.


In reality it’s not even worth fixing. The ID is just meant to be a unique identifier; it does not matter what the number is, or whether the database starts at 20. It does not really tell you which record was entered first or last, and you have better means of dating data entries.


Having said that, if you would still like to close a gap, you can alter the counter:


ALTER TABLE employees AUTO_INCREMENT = 1;

Note that MySQL (with InnoDB) will not set the counter below the current maximum value in the column; the statement above resets it to max + 1, closing the gap left by failed inserts, but it won’t renumber existing rows.


Without diving deeper into this statement (in our particular scenario we don’t consider it useful), let’s jump to how to get information on when the records were inserted.


We can do it in different ways:


ALTER TABLE employees
ADD COLUMN created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP;

Or

ALTER TABLE employees
ADD COLUMN created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
ON UPDATE CURRENT_TIMESTAMP;

In both examples, ALTER TABLE will ADD a new COLUMN to the existing employees table.

TIMESTAMP is the data type, while CURRENT_TIMESTAMP is the (default) value.

This command automatically sets the ‘created_at’ column to the current date and time at which a new record is inserted into the table.


In the second statement, adding ON UPDATE CURRENT_TIMESTAMP means the timestamp in the ‘created_at’ column will also be updated whenever the row is modified after insertion.


But if you really want to keep track of when a row was created AND when it was modified, then we can do:


ALTER TABLE employees
ADD COLUMN created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
ADD COLUMN updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

The only thing we have done here is combine both statements, in separate columns.

This way you keep a strict record of both creation and modification; traceability is always a really useful thing to have.


Of course, you could create these two columns at table-creation time. Remember how we created it?


It would be as simple as adding these two new statements within it:


CREATE TABLE employees (
    employee_id INT AUTO_INCREMENT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(50),
    hire_date DATE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

The problem with TIMESTAMP columns added a posteriori rather than at table creation is that all existing records will be set to the timestamp at which the column was implemented.


It is important to take note of this to avoid bias if these columns are going to be used for data analysis, so you know that XXX date/hour must be treated as NULL.


But if so, why not set the existing rows to NULL now?


UPDATE employees
SET created_at = NULL,
    updated_at = NULL
WHERE employee_id BETWEEN 4 AND 10;

For this kind of scenario the WHERE clause would not strictly be necessary, given that we created those two columns a posteriori, so the change affects the whole table anyway. Still, I wanted to use the opportunity to stress the importance of always using WHERE, to make sure we target only the rows we want to modify; otherwise I can see us having nightmares, and we need to rest properly, sleep is very important!
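A habit worth forming when WHERE matters: preview the rows with a SELECT using the exact same WHERE clause, and only then run the UPDATE. A sketch with our statement:

```sql
-- 1) Preview exactly which rows would be touched
SELECT * FROM employees
WHERE employee_id BETWEEN 4 AND 10;

-- 2) Only then run the UPDATE with the identical WHERE clause
UPDATE employees
SET created_at = NULL,
    updated_at = NULL
WHERE employee_id BETWEEN 4 AND 10;
```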


And now is the perfect time to play with some modifications to our NULL & Empty values and see how our 'updated_at' column behaves.


The most obvious field to use when updating a record is the ID, as it is our unique identifier for a row and no mistakes are possible (unless you use the wrong ID, of course!).


UPDATE employees
SET email = 'DGadarn@havingfun.com'
WHERE employee_id = 7;

Or, if you don’t know the ID (though you could always search for it), you can use the values you do know, as long as you make sure they are unique across the whole dataset, to avoid the risk of modifying other records in our table:


UPDATE employees
SET email = 'SShessair@havingfun.com' 
WHERE first_name = 'Shandril' 
AND last_name = 'Shessair';

As you can see, Shandril Shessair’s email has not been updated. Why?


TROUBLESHOOTING


Error Code: 1175. You are using safe update mode and you tried to update a table without a WHERE that uses a KEY column.  To disable safe mode, toggle the option in Preferences -> SQL Editor and reconnect


There we have it. As mentioned, it is risky business to make changes by anything other than a unique identifier, but we know this is our only record with those exact values; and mostly, this is our own playground, and it is good to try things to understand what happens!


We can do what the error suggests, but in a much easier way:


SET SQL_SAFE_UPDATES = 0;

Now run your UPDATE statement again and it will go through.


Then, once we have done our task, make sure we enable safe mode again, so as not to cause issues with the integrity of the table, and so you are prompted with the same safety warning whenever something risky comes up:

SET SQL_SAFE_UPDATES = 1;

Let’s say you have inserted a row by mistake and you want to delete it:

DELETE FROM employees
WHERE employee_id = 3;

If you wanted to delete all rows, in theory you could use DELETE FROM, but in practice you would use TRUNCATE TABLE, which we have already mentioned.


TRUNCATE TABLE is faster and less resource-intensive, as it does not generate individual delete operations and a log entry for each row.

It also resets the auto-increment counter.


The only advantage of DELETE FROM is that it can be rolled back (potentially…).
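Assuming the default InnoDB storage engine, that rollback works inside an explicit transaction, which is something TRUNCATE TABLE cannot offer, because in MySQL it causes an implicit commit:

```sql
-- DELETE inside a transaction can be undone...
START TRANSACTION;
DELETE FROM employees WHERE employee_id = 3;
ROLLBACK;   -- the row is back

-- ...whereas TRUNCATE TABLE causes an implicit commit
-- in MySQL and cannot be rolled back.
```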


As a general rule, only use DELETE FROM accompanied by a WHERE clause, to target specific rows.


You can, for example, delete all the rows with values between two specific numbers:

DELETE FROM table_name
WHERE column_name BETWEEN lower_bound AND upper_bound;

Psst psst… did you notice a funny hour in the ‘updated_at’ column in the two previous images, on the data we have been modifying?


This is because I changed the time zone on my laptop, definitely not updating records at those hours!


We want our records to have the correct clock, no?


So how can we modify it in our MySQL database?


Glad you asked!


We will dive into clocks, formats, and the power of views on Part 2!


Happy coding!
