
Part 2. SQL & Analysis Fundamental Concepts

  • Writer: Norma López-Sancho
  • Apr 11, 2024
  • 13 min read

Updated: Apr 13, 2024

In Part 1 we started to explore basic SQL statements using MySQL Workbench on our own MySQL server inside a Docker container:


We created tables, altered their structure, inserted new records, and updated existing ones.

We used timestamps, learnt about Safe Updates, and saw the difference between truncating and deleting tables. And to keep it spicy, we went through some troubleshooting.


In this Part 2, no less spicy, we will learn more about timestamps and time zones, how to update our tables using spreadsheets and .csv files, and how useful Views are.


And while doing this we will also learn how to navigate certain security parameters and permission configurations of your server, which are definitely worth being aware of.


We will also briefly touch on JOIN (more on it in another guide!)


Let’s get started!


The last thing we saw was funny clocks appearing on our timestamps after changing the time zone of our laptop:


1:30 & 1:43 am, no way!


We can fix this in our MySQL database by making sure it is set to our time zone:


SET GLOBAL time_zone = 'Europe/London';

And if we now update the two records that got the wrong time zone by making a fake modification (as in, change something and then change it back again, or you can always fill in a NULL value if you have the info for it), there we have it: a correct time zone in our ‘updated_at’ column:
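One way to sketch that refresh, assuming the two affected rows have employee_id 1 and 2 (adjust to your own ids), is to re-stamp the column directly:

```sql
-- Re-stamp updated_at for the two rows carrying the wrong time zone.
-- Setting the column explicitly avoids a subtle no-op problem: MySQL does
-- not refresh an ON UPDATE CURRENT_TIMESTAMP column if no value changes.
UPDATE employees
SET updated_at = CURRENT_TIMESTAMP
WHERE employee_id IN (1, 2);
```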



TROUBLESHOOTING

If it does not take effect immediately, check whether it has actually been applied correctly:


SELECT @@global.time_zone;



If all is good there, try the following options:


  1. Restart your MySQL Workbench (you can quit the app and then reconnect to your database again)


  2. Double check that it is actually the same time zone you have set on your laptop.


  3. Review the MySQL configuration file (my.cnf or my.ini) to ensure that there are no conflicting time zone settings specified.


  4. Restart the container.


Apart from the one we used, here are some examples:


Named Time Zones:

  • 'UTC' (Coordinated Universal Time)

  • 'GMT' (Greenwich Mean Time)

  • 'America/New_York'

  • 'Asia/Tokyo'

  • 'Australia/Sydney'


Offset Time Zones:

  • '+00:00' (UTC)

  • '-05:00' (Eastern Standard Time, USA)

  • '+02:00' (Central European Time)

  • '-08:00' (Pacific Standard Time, USA)


Offset time zones work as a fixed shift over UTC, but they do not account for daylight saving time changes; if you need those to be taken into account, it is best to use named time zones.
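The difference only shows up when the clocks change; a small sketch at session level (so it does not need GLOBAL privileges):

```sql
-- A fixed offset never moves:
SET time_zone = '+01:00';

-- A named zone follows DST automatically (GMT in winter, BST in summer):
SET time_zone = 'Europe/London';
```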

If you want information about all the available time zones in MySQL you can run a basic:


SELECT *
FROM mysql.time_zone_name;

Or customise the query to filter or sort the results based on what you are fishing for!
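For example, a quick sketch that narrows the list down to the European zones:

```sql
-- Only the named time zones starting with 'Europe/', alphabetically
SELECT Name
FROM mysql.time_zone_name
WHERE Name LIKE 'Europe/%'
ORDER BY Name;
```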


The power of a spreadsheet for importing data in a table


Let’s suppose we first need to know which records need updating in our database table. We can export them into a spreadsheet, fill in the gaps there, and then import them all at once, rather than writing a complicated statement containing tons of records to update:


1. Finding the values:


SELECT employee_id, first_name, last_name, hire_date
FROM employees
WHERE hire_date IS NULL;

We want to update hire_date, which is empty. With this in our result grid, we could use the button with the floppy disk in our Workbench



But even better: we could do the selection and the export all at once by using a statement!


This way we up our SQL script-writing game.

Let’s also use the opportunity to start applying one of the many conventions on how SQL statements should be written for optimal readability:


SELECT
    employee_id,
    first_name,
    last_name,
    hire_date
INTO OUTFILE
    '/Users/Normasworld/Downloads/Null hire date employees.csv'
FIELDS TERMINATED BY
    ','
FROM
    employees
WHERE
    hire_date IS NULL;

INTO OUTFILE: tells MySQL where we want to write the file.

FIELDS TERMINATED BY: tells MySQL how we want the columns to be separated; in this case, with commas.

WHERE … IS NULL: we are requesting only the rows in which our ‘hire_date’ column is NULL.

This was not strictly necessary for us, as the column is entirely empty; but more often than not your table will have some records with data and others without, and you will want to target just the empty values.


And here we go:

Error Code: 1290. The MySQL server is running with the --secure-file-priv option so it cannot execute this statement


One of the options would be to grant permissions to the folder we are going to keep using:


SET GLOBAL
    secure_file_priv = '/path/to/allowed/directory';

Substitute the path with the one you want to use; for us it is /Users/Normasworld/Downloads


SET GLOBAL will change the setting only for the currently running server; once the server restarts, it will revert to what it was originally. If the change needs to be permanent you could use SET PERSIST.
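A sketch of the PERSIST variant, shown here with the time_zone variable as an illustration (MySQL stores persisted settings in mysqld-auto.cnf in the data directory, so they survive restarts):

```sql
-- Persist the time zone across server restarts
SET PERSIST time_zone = 'Europe/London';
```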


But for security reasons, it is very likely this command is not allowed.


You can check whether the variable is enabled in your MySQL configuration:


SHOW VARIABLES LIKE 'persist_only';

If nothing comes up, you may need to change it in your my.cnf or my.ini file and add the line persist_only=ON, or remove the # symbol at the start to uncomment it.


Of course, if it says OFF you would need to change it to ON.

For the change to apply it is necessary to restart the server; but before making changes in the configuration files, make sure you back everything up (you may have a oh..my.. moment otherwise, true story!)


Going back to our secure_file_priv variable: depending on which version of MySQL you are using, it may be read-only.


Which is the case for the version we are using (8.0); therefore we cannot set the path ourselves.


This is another security measure to prevent unauthorised users from changing the directory from which files can be read or written.


We will have to locate the MySQL configuration file (my.cnf or my.ini) to see what it says (it will state where the allowed directory is).


Common locations for a docker container include /etc/mysql/my.cnf and /etc/my.cnf


Go to your MySQL server container in Docker and open the container actions on the three dots, which will bring up the following menu:



View the files and search the common locations; but if you are struggling to find it, or you want to verify before going on an Indiana Jones mission, you can always copy your container ID (shown under your container name) and run the following command in your terminal (command console):


docker exec -it <your_container_id> bash

And then try to find it by one of the common names (my.cnf on Unix-like systems, which includes the Linux container; my.ini on Windows)

find / -name my.cnf

This will search through all the folders and throw ‘Permission denied’ for those that are not within your privileges to see, but at the end we got this:


There we go, now we know for a fact that it is in /etc/my.cnf.


Once you locate it, press Open File Editor at the top(ish) right of your Docker container view, or right click > Edit file


For our secure-file-priv it says the following: secure-file-priv=/var/lib/mysql-files


As mentioned before, the ‘secure-file-priv’ option specifies the directory where MySQL allows file operations such as LOAD DATA INFILE and SELECT ... INTO OUTFILE, meaning MySQL will only permit file operations within this directory.


Now that we have the allowed path, we can rewrite the statement as:

SELECT
    employee_id,
    first_name,
    last_name,
    hire_date
INTO OUTFILE
    '/var/lib/mysql-files/Null hire date employees.csv'
FIELDS TERMINATED BY
    ','
FROM
    employees
WHERE
    hire_date IS NULL;

And there we go:



If you cannot see it yet, make sure you refresh your container (MySQL Workbench will ask you for the password if you do this).


Selecting the file and opening the editor, we can see our data:

Now we can export it wherever we want by right clicking on it and saving it.


If your dataset is very small like this one, you could drag your mouse over the data, then copy and paste it into your spreadsheet directly.


Either way, here we have the data we need to modify, in a spreadsheet and ready for an update:


Once you have filled in your data, Save As or Export to a .csv file.


And the cherry on top: let’s use it to update our table, for which we need to follow these steps:


  1. Create a temp table:


You have two ways of creating it:


A. From scratch adding only the fields you need:

CREATE TEMPORARY TABLE temp_table (
    employee_id INT,
    hire_date DATE
);

Here we want to update hire_date by using the employee_id, so we don’t need anything else!


B. Copying the structure of your original table

CREATE TEMPORARY TABLE temp_table LIKE employees;

The advantage of A over B is that it minimises unnecessary overhead in temporary storage.


Now, if your temp table has only 2 fields and your .csv has more.. that is actually OK, we can bypass it; I will explain later!


Though, for the sake of not getting puzzled with the examples I give later on, bear in mind that we have created the table with the 4 fields: employee_id, first_name, last_name, and hire_date.


  2. Copy the file into our container’s permitted directory


We have two options for this, though for what we are learning here, use the first:


A. Manually import:

Go to the folder we already know is permitted, which is /var/lib/mysql-files, then right click and import:



B.   Do it through the terminal:


If you happened to be creating a script (not with SQL alone, as it does not allow terminal commands) you could use:

docker cp /your/document/path.csv your_container_name_or_ID:/var/lib/mysql-files/

TROUBLESHOOTING


If you are having trouble with the permissions of the file and the terminal is throwing an ‘operation not permitted’ error, check that the file has the appropriate permissions to allow Docker to read it:


ls -l /your/document/path.csv

If you get -rw-r--r--, it means: (rw-) the owner has read and write permissions; (r--) the group/staff has read-only permissions; and (r--) everyone else also has read-only permissions.


So you could go into your file and modify the permissions manually, as Docker may need write permissions to be able to copy the file (group/staff):



Or we could do it via the following command in the terminal:


chmod o+w /your/document/path.csv

Or

chmod 666 /your/document/path.csv

If it does not allow you to modify them, try with sudo in front of the command.


If nothing works, you will need to speak with whoever manages the permissions for using the terminal and these commands.


  3. Load the data of our .csv into the temp table.


This is the main statement, but we need to consider a few things before using it, so we can adapt it to our specific needs:


LOAD DATA INFILE '/var/lib/mysql-files/your_file.csv'
INTO TABLE temp_table
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(employee_id, first_name, last_name, @hire_date)
SET hire_date = STR_TO_DATE(@hire_date, '%d/%m/%y');

FIELDS TERMINATED BY ',': this is whatever character your .csv uses to separate the columns (a comma in this case).


LINES TERMINATED BY '\n': in a .csv file each row corresponds to a line of text, and lines can be terminated with a newline (\n), a carriage return (\r), or a combination of both (\r\n).


ENCLOSED BY '"': each field (column content) may be enclosed in quotation marks.


IGNORE 1 LINES (employee_id, first_name, last_name, @hire_date): IGNORE 1 LINES tells MySQL to skip the header row (in case your .csv contains the column names), and the column list maps the data from the .csv file to the appropriate columns in our temp_table.


If we had created our temp table with only 2 fields (employee_id and hire_date) but our .csv has all four of them, you can set the other two as @dummy_variable, like so: (employee_id, @dummy_variable, @dummy_variable, @hire_date).

This works as a placeholder to skip the columns you are not interested in, which is a handy technique for importing data from .csv files, especially when the file structure doesn’t perfectly match the table schema.


@hire_date is a user-defined variable used to temporarily store the hire date data from the .csv file; it is later used in the SET clause to define what has to happen to it before it is finally inserted into the temp_table.


SET hire_date = STR_TO_DATE(@hire_date, '%d/%m/%y'): MySQL’s date format is YYYY-MM-DD; we use this to convert the date string into a valid MySQL date.


  • If your date is DD/MM/YY: %d/%m/%y

  • If your date is DD/MM/YYYY: %d/%m/%Y

  • If your date has no leading zeroes (i.e. 5/1/24): %e/%c/%y
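You can sanity-check a format string before loading anything by feeding STR_TO_DATE a sample value (the dates below are just illustrative):

```sql
SELECT STR_TO_DATE('05/01/24', '%d/%m/%y');   -- DD/MM/YY, returns 2024-01-05
SELECT STR_TO_DATE('05/01/2024', '%d/%m/%Y'); -- DD/MM/YYYY, returns 2024-01-05
SELECT STR_TO_DATE('5/1/24', '%e/%c/%y');     -- no leading zeroes, returns 2024-01-05
```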


These are some of the most common formats when working with data in a spreadsheet, but of course there are as many possibilities as tastes, so if your date format is different or you want to know more, have a surf on the web!


Also fundamental for adjusting our statement is understanding how our .csv is formatted:


> Go to your file and open it in any text editor; if working with Docker, it is easiest to go to the copy in the container and open the file editor:



A .csv from the Numbers app separates fields with semicolons and creates extra ones at the end due to empty columns, something that does not happen in MS Excel (which separates by commas and does not add any dodgy symbols at the end).


Our final code will look like this:


LOAD DATA INFILE '/var/lib/mysql-files/your_file.csv'
INTO TABLE temp_table
FIELDS TERMINATED BY ';'
IGNORE 1 LINES(employee_id, first_name,last_name,@hire_date)
SET hire_date = STR_TO_DATE(@hire_date, '%d/%m/%y');

FIELDS TERMINATED BY ';' should in principle also take into account those extra semicolons at the end of each row, but in reality MySQL does not like them and will throw an error.


And LINES TERMINATED BY is not suitable for this, because it is meant for other scenarios such as \n, \r or \r\n. And here we go:


TROUBLESHOOTING:

Error Code: 1262. Row 1 was truncated; it contained more data than there were input columns


Therefore we need to make sure those semicolons at the end do not exist (it is as easy as deleting the empty extra columns in your Numbers spreadsheet before creating the .csv, or removing the semicolons directly in your container file editor if there are not too many records).



4. Now our last step: JOIN our temp_table to update our employees table:

UPDATE employees e
JOIN 
	temp_table t ON e.employee_id = t.employee_id
SET 
	e.hire_date = t.hire_date;

UPDATE employees e:


The e after your table name creates an alias for the table: a temporary, shorter name that makes the SET clause more convenient to write and the code easier to read.


Usually these names are representative; we are choosing e because we are referring to the employees table.


In the same way, we are using t for temp_table.


JOIN temp_table t ON e.employee_id = t.employee_id


This is where the JOIN occurs (which, by the way, is an INNER join): it joins the employees table with the temp_table on the rows where ‘employee_id’ matches in both tables.


SET e.hire_date = t.hire_date


This is where we specify the column in the employees table (e) that will be updated with whatever values we have in the temp_table (t), wherever the JOIN finds matching values between the two tables.


And here we go! Our ‘hire_date’ column updated in our employees table, all at once.



This bulk approach is way more efficient as you get to combine the power of a spreadsheet with a simple JOIN rather than using a lengthy UPDATE/SET/WHERE clause for each row.

When dealing with a big dataset, that kind of statement would get really messy real quick. Of course if it’s just one or two records, you can get away with it, but beyond that? No thanks!
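Before committing to the UPDATE, it can also be worth previewing exactly which rows would change, by running the same JOIN as a plain SELECT (a dry-run sketch using the same aliases):

```sql
-- Preview the rows the UPDATE would touch, old value next to new value
SELECT
    e.employee_id,
    e.hire_date AS current_value,
    t.hire_date AS new_value
FROM employees e
JOIN temp_table t ON e.employee_id = t.employee_id;
```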




What if we wanted to see the date in a different format?


You could apply the format in the SELECT statement:


SELECT 
	first_name,
	last_name,
	DATE_FORMAT(hire_date, '%d-%m-%Y') AS hire_date
FROM 
	employees;

But this approach can become lengthy in a heartbeat, and since it is not permanent it needs to be executed each time, potentially compromising execution times and performance.

A much better idea is the concept of Views.


Views


Instead, we can create a formatted view of our table. I will explain in a heartbeat what views are and how cool it is to have them, but first the code:


Let’s say we want our hire_date shown in the European format, and our timestamp columns (created_at and updated_at) not only in the European format but also without the clock:


CREATE VIEW ft_employees AS
SELECT 
    employee_id, 
    first_name, 
    last_name, 
    DATE_FORMAT(hire_date, '%d-%m-%Y') AS hire_date,
    DATE_FORMAT(created_at, '%d-%m-%Y') AS created_at,
    DATE_FORMAT(updated_at, '%d-%m-%Y') AS updated_at
FROM 
    employees;

And here we have:




But I forgot a column! How do I modify my view now, do I need to drop it and create it again?


Well, nope: you just replace it, adjusting your code to whatever you want. You can add the missing ‘email’, and let’s also add the clock back to ‘updated_at’, but without the seconds:


CREATE OR REPLACE VIEW ft_employees AS
SELECT 
    employee_id, 
    first_name, 
    last_name,
    email,
    DATE_FORMAT(hire_date, '%d-%m-%Y') AS hire_date,
    DATE_FORMAT(created_at, '%d-%m-%Y') AS created_at,
    DATE_FORMAT(updated_at, '%d-%m-%Y %H:%i') AS updated_at
FROM 
    employees;

This will create the view if it does not exist and replace it if it does. If you were not the one who created it though, be careful with replacements, so as not to undo your colleagues’ work!


You can see the existing views in your schema menu:



Or you can always run the statement:


SELECT table_name
FROM information_schema.views
WHERE table_schema = 'play_ground'; -- Use 'your_database_name'



table_name is what the column in information_schema.views is called, but you can always SELECT * to get more info about it.


In the unlikely case that you are not able to replace it, you can indeed drop it and create it again:

DROP VIEW ft_employees;

Remember, dropping views or tables cannot be undone, so it is always good to proceed with mindfulness (maybe a great opportunity to take a couple of deep breaths before clicking Run!)


And by the way, in clock formats %H works similarly to %Y/%y: %H is the 24-hour clock, and %h the 12-hour format. If you want the 12-hour format, add %p for AM/PM: %h:%i %p (notice the space; you can use it or not, or whatever separator you please).
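A quick side-by-side sketch of the two clock styles on the same value:

```sql
-- The same moment, in 24-hour and 12-hour form
SELECT
    DATE_FORMAT(NOW(), '%H:%i') AS clock_24h,     -- e.g. 14:30
    DATE_FORMAT(NOW(), '%h:%i %p') AS clock_12h;  -- e.g. 02:30 PM
```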





From an analysis and optimisation perspective I would ask: why add length and mix integers with strings when we can tell whether it is am or pm just from a 24-hour clock?


But date and clock formats are like colours: every person has their preference!


Though I would say: if there’s a convention in your company, that’s the one to follow.


Here we have created a view for a very simple transformation but Views can handle pretty complex queries while serving as a central point of integration.


Here are the main advantages of using views:


  1. Simplified Data Access & Code Reusability: Complex queries are wrapped in a user-friendly virtual table. This means you can access the data with a simple SELECT statement, and on top of this you can build your own queries on the view, saving you the time of re-running previous scripts or starting from scratch.

  2. Enhanced Security: Views allow you to limit access to sensitive or confidential information, ensuring it is only available to authorised users.

  3. Data Consistency: Views provide all users with a standardised view of the data, which reduces confusion and ensures reliable decision-making across an organisation.

  4. Performance Optimisation: By centralising a complex query in one place, a view makes it easier to tune, and some database systems even offer materialised views that precompute the results. This can mean faster access to data and smoother operations for your applications.

  5. Maintenance, Duplication and Errors: It is easier to locate and update the necessary logic within the view than to sift through numerous queries scattered throughout the codebase, reducing duplication and the chance of introducing errors, which can easily happen when manually updating the same logic in several locations.

  6. Improved Traceability: When all related queries run over a view, it is easier to track down where certain logic or transformations are applied, which is really handy for troubleshooting and for understanding the flow of data within the system.

  7. Encourages Collaboration: Acting as a common interface for accessing and querying data, views facilitate collaboration among teams by offering a shared resource for data analysis and reporting.


So if you want data that is easy to access, secure, and consistent, views are your new best friend. They will help you get the job done faster and with fewer headaches.


In Part 3 we will keep working on the use of spreadsheets to update our tables and see how timestamps behave in different scenarios. We will also review some statements and, of course, add some things on top!

The knowledge will sink in and we will master the concepts!


Happy coding!
