SQL aka structured query language is the base for the Pizza (Analytics and Data Science). While everyone goes gung-ho about the toppings and cheese (machine learning & fancy algorithms). SQL is critical for being able to query structured data for clear insights and start analysis with right questions. Thus the need to learn SQL basics and beyond.
“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.”
- Abraham Lincoln
Who is this for ?
These notes/pointers are folks (executives, new analysts etc) who are looking to setup or learn sql basics to be able to do basic analytics from time to time. Anyone, who’s already worked extensively with data and does so on a regular basis, can skip this.
Why is SQL necessary ? I’m planning to do AI/ML
The analytics setup at any firm starts with 3 steps: Analytical Base + Data Engineering (Cheese) Middle Layer + Final Toppings of Data Science to customise and specialise the Pizza. Go through this write up to understand how to setup an analytics program from scratch. To be able to setup the base correctly, one needs nothing more than a good DB setup and good old SQL. Learn SQL !
The first setup of requirements for learning SQL is to be able to practice it throughly on a test work bench. Where, one can quickly iterate over queries while learning basic commands etc. This may not necessarily true for an internal database setup in the early days, where one might not have a proper engine setup correctly. The data might not be correct and we have no feedback loop on what’s a correct outcome. LearnSQL comes in handy here!
Querying from a Single Table
The simplest of the requirements will be around querying from a single table. We might be required to fetch only a subset of data. The commands help filter or choose a subset of rows or columns once the base data has been decided.
- SELECT * : Choose all columns present in the underlying table
- FROM : To choose the table of choice
- WHERE : Helps filter rows basis conditions.
- >, <, =, != : Greater and Less than. Equal to and not equal to conditions. Goes with WHERE
- BETWEEN: When choosing between a range of values. You can do the opposite by also using NOT BETWEEN.
- OR/AND : Operands to combine multiple conditions, possibly over multiple columns with different constraints.
- (Parenthesis) : Useful, when adding having complex and long conditions, that are mix of OR & ANDs.
- % (Percent) : The percentage sign is useful for partial match like trying to match strings starting with particular letter or sequence of letters and so on.
- One caveat, SQL is case sensitive when matching strings. ‘FORD’ is not same as ‘Ford’.
- SQL commands and syntax are not case sensitive. Thus WHERE and where are both treated equally by the SQL engine.
- NULL : This keyword is used when one of the column values is missing. It’s a special case and is checked by using IS NULL or de-selected using IS NOT NULL. Equal to (=) does not work with NULL.
- Arithmetic: We can do the basic arithmetic operations like multiply, divide, add and subtract numbers.
Now the goal is to take this further, by being able to query and fetch data from multiple tables. The same set of commands are useful but we will need to go beyond those by learning JOINs and beyond.
- e.g. SELECT * FROM table1
- JOIN table2 ON table1.id = table2.id
- The writing of the table names as prefix is optional when the column names are unique and different between the tables.
- But mentioning them explicitly irrespective is a good practice.
GROUP BY, STATISTICS and beyond
Besides being able to query and fetch data from multiple tables and filtering rows basis conditions, the next step is to be able to do summary stats. This requires a couple of commands like GROUP BY and so on.
- e.g. SELECT * FROM table1 ORDER BY table1.columnx
- It’s the same set of data but the rows are now ordered by the values in columnx
- DISTINCT: Useful when selecting or looking for only unique entries or unique combinations of occurrences instead of all.
- COUNT: Helps count rows in the specific output. Know the count of instances meeting our specific conditions.
- MAX/MIN: Max and Min help choose the maximum or the minimum values from a column in our tables.
- AVG: Average is one way to compute the average of the values from a specific column.
- e.g. SELECT column1, COUNT(column2) FROM table1 GROUP BY column1
- We are selecting the column1 and the count of rows by using column1 in group by. It tells how many times, column1 occurs in the table.
- In terms of order of choosing GROUP BY and WHERE. We first filter rows of interest using WHERE and then Group them using GROUP BY.
- SELECT column1, COUNT(column2) FROM table1 GROUP BY column1 HAVING COUNT(column2) > 2
- The same query but this time filtering for rows where the count is greater than 2
JOINs in Detail
There are broadly four joins in total, JOIN aka LEFT JOIN, INNER JOIN, RIGHT JOIN and CROSS JOIN. The differentiation is best visualised using the Venn diagram with the syntax being the same as above with change in the key words.
Subqueries are useful when we need to run a query within a query to be able to fetch the desired results. It’s kinda similar to the recursive functions in programming. Calling a function or query within another. There are often multiple ways to execute something through a sub-query vs a JOIN.
- e.g. Filter for all values from Table1 for some values less than a threshold in Table2.
- We could do a LEFT JOIN and then explicitly filter for values in Table1 or we run a subquery within.
- Correlated Subquery: Subqueries can also depend on values from the main query. Thus correlated or dependent subqueries.
- ANY/ALL : These are some of operators useful when carrying out subqueries. Useful to compare when conditions from the main query compared against the output from the internal query.
- AS : Sometimes, we might want to run a bunch of subqueries and save their outputs in form of temporary tables. We can give them names aka alias using the AS key word.
- We created a new column using a subquery and name it as count
- The main query was fetching name from mountain.
- The subquery fetched the column count.
Besides these, there are a few basic set of commands to be able to combine outputs from multiple queries based on the set theory, namely UNION, INTERSECT.
- UNION: This combines the output from multiple queries. Like if you are fetching IDs from two queries and want a OR condition or a UNION, this command can help us combine the output and remove duplicates.
- INTERSECT: This is a representation of the AND condition, when the outputs occur in both the queries, we will have the outputs.
- EXCEPT/MINUS: This is basically outputs from one query subtracted from another.
Why buy a paid course? SQL is available freely
So, this wraps up all basic commands and their intent/usage patterns. The fastest way to pick these up is to go ahead and learn SQL basics from this course. It’s paid thus the platform works smoothly, you can practice different commands mentioned with a beautiful UI/UX. The money goes a long way in ensuring, you focus on learning. This course can be done dedicatedly in 2–3 days and goes a long way in making you self reliant.