
Part 3. SQL & Analysis Fundamental Concepts

  • Writer: Norma López-Sancho
  • Apr 12, 2024
  • 9 min read

We ended our Part 2 talking about the advantages of views.


Let’s take a tiny step back (and forward!). Now that we’ve learnt how to use spreadsheets to update our tables, we will use the technique to fill our whole departments table, which is feeling a bit sad and empty:







We will add our timestamp columns as well, always handy to have as a way of controlling data manipulation. And talking about control.. let’s point fingers here!


It’s a very good idea to have ‘created_by’ and ‘updated_by’ columns in your tables to store the user IDs or usernames of the individuals who created or last modified a row.


And no, it’s not really for pointing fingers, to be honest, but it is incredibly handy for troubleshooting and traceability.


Anyway, this would mainly need to be implemented at the application level.

If your app has authentication methods that provide user IDs or usernames, you can leverage them to set an automated way of filling these types of columns; it would work similarly to a timestamp, but retrieving the user session info.


If you are reading these guides, it’s most likely not within your role to set up something like this, but if it’s something we can push to be implemented, it’s definitely worth it.
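If it does get pushed down to the database layer, MySQL can capture the connected database user by itself. The sketch below is purely hypothetical, not part of our schema: the column and trigger names are made up, and in most real apps the user info would come from the application’s auth layer instead.

```sql
-- Hypothetical sketch: let the database record who created a row
ALTER TABLE department
ADD COLUMN created_by VARCHAR(64),
ADD COLUMN updated_by VARCHAR(64);

-- BEFORE INSERT trigger filling both columns with the connected DB user
CREATE TRIGGER department_set_creator
BEFORE INSERT ON department
FOR EACH ROW
SET NEW.created_by = CURRENT_USER(),
    NEW.updated_by = CURRENT_USER();
```

Bear in mind this records the database account, not the end user of your app, which is why the application level is usually the right place for it.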


Having said that, let’s forget about pointing fingers (for now..) and add our timestamps plus a parent_id column to represent sub-departments on a complex hierarchy:


ALTER TABLE department 
ADD COLUMN parent_id INT, -- it will be a number, therefore data type integer 
ADD COLUMN created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
ADD COLUMN updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

The parent_id column will indicate the ID of the parent department (or NULL if the department has no parent, i.e. it is a top-level department).
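Purely to illustrate the relationship (we will load the real data from a CSV in a moment), two hypothetical rows could look like this:

```sql
-- Illustrative only: names and IDs are made up
INSERT INTO department (department_id, department_name, parent_id)
VALUES
  (1, 'Operations', NULL), -- top-level department: no parent
  (2, 'Logistics', 1);     -- child of Operations (parent_id = 1)
```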



Now that we have tweaked our table, let’s fill it with data by creating our content in a spreadsheet and importing it as we did before (Part 2).


Despite having set our department_id to be an AUTO_INCREMENT primary key column, we are setting up the initial data of the table and establishing the relationships between parent and child departments, so we will provide the IDs ourselves and use them to map the hierarchy in our spreadsheet.


A quick refresher of what we did in Part 2, also because the steps vary a bit here: our department table is completely empty, so there is no point in creating a temp table, as there is nothing to join it with:


  1. Export to/Save As your file in .csv

  2. Import your .csv into your permitted folder within the docker container (/var/lib/mysql-files)

  3. LOAD the DATA INTO your table


LOAD DATA INFILE '/var/lib/mysql-files/Department_table.csv'
INTO TABLE department
FIELDS TERMINATED BY ';'
IGNORE 1 LINES;

The problem with the above statement is that our CSV looks like this:



And our table has created_at and updated_at columns


If you run the statement as it is, 2 things will happen:


1. You will get the warning below:

24 row(s) affected, 64 warning(s): 1366 Incorrect integer value: '' for column 'parent_id' at row 1 1265 Data truncated for column 'created_at' at row 1 1265 Data truncated for column 'updated_at' at row 1 […]


Which is not really an issue, but can potentially become one.


This has happened because of the empty cells where there is no parent_id.


MySQL is taking them as an error and has set them to 0 (it’s an INT column, so no empty strings allowed), which in a way is perfect, but you would need to validate that the info has really been inserted correctly (we never trust an error without verifying), adding an extra task.. not great
















Here it is easy to see at a glance that all went well, given that the dataset is small, but if your table import is big, yup, no way to check by just eyeballing.
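For bigger imports, a quick count beats eyeballing. This assumes, as in our case, that the rows MySQL complained about ended up as 0; compare the result with the number of parentless departments in your spreadsheet:

```sql
-- How many rows got parent_id = 0 during the import?
SELECT COUNT(*) AS zero_parents
FROM department
WHERE parent_id = 0;
```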


It would be much better to set parent_id to 0 or NULL for those rows without a formal parent in your spreadsheet before uploading the data, so you don’t get any warnings (more on this later, as zero poses its own risks too).
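Alternatively, you can keep the empty cells in the spreadsheet and let MySQL do the conversion at load time, reading the raw value into a user variable and turning empty strings into NULL. A sketch, assuming the CSV has exactly these three columns:

```sql
LOAD DATA INFILE '/var/lib/mysql-files/Department_table.csv'
INTO TABLE department
FIELDS TERMINATED BY ';'
IGNORE 1 LINES
(department_id, department_name, @parent_id)
SET parent_id = NULLIF(@parent_id, ''); -- empty cell becomes NULL, no warning
```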


2. The timestamps error:

Error Code: 1262. Row 1 was truncated; it contained more data than there were input columns












Well, this is not nice..


I think by now we can figure out why: it has to do with the semicolons.

Given our .csv had 3 semicolons at the end, MySQL takes them as empty values, so each row carries more data than there are input columns, hence the error.


The semicolons are created by the Numbers app on Mac whenever you have empty columns in the spreadsheet, something you do not have to worry about if working with MS Excel.


Anyway, it is as easy as deleting the columns you won’t use.


Though then you would get this…:











Oh my..


Well, we could include that data in our spreadsheet timestamp columns using the formula =NOW() to have it like this in the .csv:



But now that we have created the timestamps in our file… do you remember that date formats in MySQL are the American way? This just keeps getting better, ey?


Of course it would be easier to set the correct format in your spreadsheet in the first place, but let’s keep things spicy for training purposes.


We would need to do some transformations so MySQL understands our more…normalised kind of date:


TRUNCATE TABLE department; -- this empties the table so we can start again
/* 
Use the CSV file in our container to update the empty table, 
and tell MySQL which format my date is in (23/3/24 19:14), given that it differs from how MySQL formats dates/timestamps
*/
LOAD DATA INFILE '/var/lib/mysql-files/Department_table_with_timestamp.csv'
INTO TABLE department
FIELDS TERMINATED BY ';'
IGNORE 1 LINES
(department_id, department_name, parent_id, @created_at, @updated_at)
SET 
   created_at = STR_TO_DATE(@created_at, '%d/%c/%y %H:%i'),
   updated_at = STR_TO_DATE(@updated_at, '%d/%c/%y %H:%i');
/* 
 %c because the month has no leading zero (otherwise it would be %m)
 %e would be used for the day in the same scenario (i.e. if it was the 9th and expressed as 9/3/24)
 %y because it is a 2-digit year (%Y for 4)
 %H:%i because it is a 24-hour clock (%h:%i %p for a 12-hour format like 01:45 PM)
*/

And here we have:















Let’s be frank: I gave you the worst solution first, but I think it is useful to see complexity in order to recognise simplicity.


Seeing how code behaves gives you a much deeper understanding of how a statement really works, and of the scenarios in which an approach you have taken before may not apply and you need something different.


Errors and wrong approaches are gems on our learning path. One thing I find with courses is that they offer a very protective environment in which you are led to the solutions way too directly, and you miss a lot of knowledge and well-roundedness with this “easiness”.


In case you are following all parts: in the previous one we had to go through this transformation because we were updating the hire_date column, so it was absolutely necessary to transform the dates from our file format to MySQL’s format.


In this scenario it is not, as we don’t really need to touch the timestamps at all; we can let the database do its thing!


If we go back to how our .csv was created in the first place:














The only thing we need to solve our issues is to remove those painful extra semicolons at the end (and add those zeroes where there is no parent), and instead of creating the timestamps ourselves we target the LOAD DATA INFILE at only those 3 columns:


LOAD DATA INFILE '/var/lib/mysql-files/Department_table_with_timestamp.csv'
INTO TABLE department
FIELDS TERMINATED BY ';'
IGNORE 1 LINES(department_id, department_name, parent_id);

With this we tell MySQL that those are the only columns we want to fill with our LOAD DATA INFILE.
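A quick check to confirm the database filled the timestamps by itself:

```sql
SELECT department_id, created_at, updated_at
FROM department
LIMIT 5; -- created_at/updated_at should show the load time, not zeros
```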


Now, let’s have a talk… about the parent_id column. Don’t worry, all good!


Depending on the data conventions of your company, you need to understand what value NULL and empty strings have: how are they used in your business?

We do like dashes, or the empty-cells idea, where there is no parent_id, because it is more visually striking.


But dashes may not be too conventional, and empty, same as NULL, can be ambiguous, meaning the data was unknown or left without updating properly; so having it as a zero can speak much louder and state that the data was not forgotten, but that the department indeed has no parent.


Also, you can always convert it later in a view or on extraction to fit your preferences.
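For instance, a view (the name here is just illustrative) could present the zero as something friendlier while the base table keeps its strict INT column:

```sql
CREATE VIEW department_readable AS
SELECT department_id,
       department_name,
       -- show 'No Parent' instead of 0; CAST keeps the column a single type
       IF(parent_id = 0, 'No Parent', CAST(parent_id AS CHAR)) AS parent
FROM department;
```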


The most important thing is normalisation: as long as data is always treated the same way, it does not matter how you decide to represent it, but that normalisation needs to be in line with your business conventions.


It makes a world of difference when companies and employees have that data mindfulness and data is standardised properly; analysis becomes much easier. Otherwise, the majority of your time as an analyst (sad but true) will be spent cleaning and normalising data before you can even start analysing it.


So, if you wanted to convert it to NULL in the table per se:


UPDATE department
SET parent_id = NULL
WHERE parent_id = 0;

For an empty string, or any other symbol, you would first need to change your column’s data type, as we had it as INTEGER:


ALTER TABLE your_table MODIFY COLUMN your_column VARCHAR(255);
-- Set parent_id to empty string for departments without a parent
UPDATE department
SET parent_id = '' -- or '-' if you want the dash
WHERE parent_id = '0'; -- quoted now, as the column is VARCHAR

If you don’t know what data type your column is because you did not create it or you do not remember (it happens..), it is really easy to find out in the Workbench:














Or you can always run the statement below, which provides further info:


SHOW COLUMNS FROM department;


Given our (initial..) decision to use zero for parent_id, let’s remove the possibility of NULL values, so whoever creates a row always has to put a zero if there is no parent (given that it is also an INT column, no other kind of value would be allowed: either a proper parent ID, or a zero):


ALTER TABLE department
MODIFY parent_id INT NOT NULL;
-- the ; marks the end of this command, but we can run both together
-- the statement below shows the change, bringing back only this column:
SHOW COLUMNS FROM department LIKE 'parent_id';




Now.. this can be risky too: someone may key 9 instead of 0, given that we are forcing them to enter a value even if there is no parent.


Wouldn’t it be better then to leave it as NULL, despite the ambiguity? Probably, yes.

But then again, what’s the convention for NULL in your company?


The least error-prone option would be to set it as No Parent for those instances:

  • Removes the ambiguity (the data has not mistakenly been left without updating)

  • Numbers cannot be entered by mistake producing errors in the data


So let’s alter our table once more:


ALTER TABLE department 
MODIFY COLUMN parent_id VARCHAR(255) NOT NULL; 
UPDATE department
SET parent_id = 'No Parent' 
WHERE parent_id = '0'; -- quoted, since the column is now VARCHAR




And let’s select only 2 of our rows, so we play a bit with the basic statements and give the brain a 'rest'!:


SELECT department_id, parent_id
FROM department
WHERE department_id IN (1,16);





But to get a grasp of the visual component, here it is in full:















I would say it is indeed striking: you can really quickly see those with a parent, and it is less prone to mistakes than a zero. Win win!


And of course it is still equally filterable, using the following statement:


SELECT *
FROM department
WHERE parent_id != 'No Parent'; -- you can also use <>

Filtering with IS NOT NULL can be more intuitive, but the less we allow data to be ambiguous, the better for analysis and for certainty in its results.


Also, if we wanted to add further restrictions to the column, to enforce as much as possible that users set it as No Parent and avoid 0 at all costs, we could add a constraint (and some kind of devilish laugh along with it!):


ALTER TABLE department -- our table is called department, not departments
ADD CONSTRAINT check_parent_id
CHECK (parent_id <> '0'); -- compare against the string '0', as the column is VARCHAR now

CHECK: This is where the constraint parameters are set.


If you encounter a scenario in which you have a VARCHAR column but you want to enforce that the column only contains letters (both uppercase and lowercase) and no numbers, you can use the regular expression '^[A-Za-z]+$':


ALTER TABLE your_table
ADD CONSTRAINT check_your_column_name 
CHECK (your_column REGEXP '^[A-Za-z]+$');

REGEXP: this sets the Regular Expression.

It’s a pattern-matching technique essentially used to ask the database to find rows where a particular column’s value matches a specific pattern defined by the REGEXP clause.


For example, you might use REGEXP to search for all rows where a column contains a certain sequence of characters, or where it matches a particular format like a phone number or email address.
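A couple of hypothetical query examples (table and column names are made up) to show REGEXP used for searching rather than as a constraint:

```sql
-- Rows where a code is exactly three uppercase letters (e.g. 'ABC')
SELECT * FROM products
WHERE product_code REGEXP '^[A-Z]{3}$';

-- Rows where an email column roughly looks like an address
SELECT * FROM users
WHERE email REGEXP '^[^@]+@[^@]+\\.[a-z]{2,}$';
```

Note that with case-insensitive collations MySQL’s REGEXP may match regardless of case; that also applies to the CHECK constraint above.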


And here is what the metacharacters above mean:


  1. ^: Anchors the match at the start of the string. It ensures that the pattern starts at the beginning of the string.


  2. [A-Za-z]: Denotes a character class that matches any uppercase or lowercase letter, i.e. any single letter from A to Z or a to z.


  3. +: Matches one or more occurrences of the preceding character class. In this case, it matches one or more letters.


  4. $: Anchors the match at the end of the string. It ensures that the pattern ends at the end of the string.


Regular expressions can be quite powerful and flexible, allowing you to create complex search patterns. They can be a bit tricky to master though, so let’s talk about this at a later time!


Part 4 will be around the concept of index tables and a brief introduction to Joins.


Until then, Happy coding!
