Introduction

SQL stands for Structured Query Language. It is a programming language designed for managing data in a relational database. SQL SQL has a variety of functions that allow its users to read, manipulate, and change data. In this notebook, present basic to advanced level SQL programing for data analysis, for example filtering, aggregating, grouping, joining tables, subqueries, window functions and so forth.

What is SQL

SQL ( Structured Query Language) is a domain-specific language used in programming and designed for managing data held in a relational database management system. SQL is used for:

  • create databases
  • delete databases
  • create tables in a database
  • update data in a table
  • delete data from a table
  • read data from a table
  • insert data in a table
  • delete database tables
  • and many more database operations

All data such as Employee data, customer data, 3rd Party data will be stored in database or databases; then companies retrieve data by sql query to analysis and make business decision.

image.png

We should connect to server, which is a physical Machine, to access the data. Under each server, there are databases that all data are structured in tables. There are Schema Under each database. Schema is ownership structure. Within each database, there may be different owners (Schema). For example, sales team have their own schema. Under each Schema, there are objects like tables. For example, Person.salary table means it is owned by Person Schema. See Figure below:

image-2.png

Databases are normalized before they use. Normalization is the process of organizing data in database and creating tables and establishing relationships between those tables to make the database more flexible. Figure below shows how normalization works. All tables can be joined with ID table:

image-4.png

Select Records

All the tables below are Persona schema (owner): image.png

Top recodes for large data can be achieved by: image-2.png

WHERE CLAUSE, LIKE, IN and BETWEEN

The WHERE clause, LIKE, IN are used to filter records.

It is used to extract only those records that fulfill a specified condition.

LIKE operator finds a pattern and is the same as = for string. Not LIKE is the same as !=.

image.png

The queries below finds any records with Name string starts with ‘A’, end ‘D’ and have "-". The % wildcard will match zero, one, or many characters in text. image.png

The IN operator allows to specify multiple values in a WHERE clause. The IN operator is a shorthand for multiple OR conditions.

image.png

The BETWEEN operator selects values within a given range. The values can be numbers, text, or dates.

image.png

_ match a single character:

image.png

Join Multiple Tables

A JOIN clause is used to combine rows from two or more tables, based on a related column between them that have the same key. Here are the different types of the JOINs in SQL:

(INNER) JOIN

INNER JOIN is the same as JOIN: it Returns records that have matching values in both tables. See Figure below:

image-2.png

We use (INNER) JOIN most of the times.

LEFT (OUTER) JOIN

LEFT JOIN is the same as LEFT (OUTER) JOIN: it returns all records from the left table, and the matched records from the right table. See Figure below:

image.png

RIGHT (OUTER) JOIN

RIGHT JOIN is the same as RIGHT (OUTER) JOIN: it returns all records from the right table, and the matched records from the left table. See Figure below: image-2.png

FULL (OUTER) JOIN

FULL JOIN is the same as FULL (OUTER) JOIN: it returns all records when there is a match in either left or right table. See Figure below: image-2.png

ANTI LEFT JOIN

image.png

ANTI RIGHT JOIN

image.png

ANTI OUTER JOIN

image.png

CROSS JOIN

image.png

UNION

The UNION operator combines the result-set of two or more SELECT statements. There are conditions for this operator:

  • Every SELECT statement within UNION must have the same number of columns
  • The columns must also have similar data types
  • The columns in every SELECT statement must also be in the same order

The UNION operator selects only distinct values by default. We can use UNION ALL to allow duplicate values.

image.png

Date and Time Data Types

Data type Description
DATE A date. Format: YYYY-MM-DD. The supported range is from '1000-01-01' to '9999-12-31'
DATETIME(fsp) A date and time combination. Format: YYYY-MM-DD hh:mm:ss. The supported range is from '1000-01-01 00:00:00' to '9999-12-31 23:59:59'. Adding DEFAULT and ON UPDATE in the column definition to get automatic initialization and updating to the current date and time
TIMESTAMP(fsp) A timestamp. TIMESTAMP values are stored as the number of seconds since the Unix epoch ('1970-01-01 00:00:00' UTC). Format: YYYY-MM-DD hh:mm:ss. The supported range is from '1970-01-01 00:00:01' UTC to '2038-01-09 03:14:07' UTC. Automatic initialization and updating to the current date and time can be specified using DEFAULT CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP in the column definition
TIME(fsp) A time. Format: hh:mm:ss. The supported range is from '-838:59:59' to '838:59:59'
YEAR A year in four-digit format. Values allowed in four-digit format: 1901 to 2155, and 0000.MySQL 8.0 does not support year in two-digit format.

Aliasing in SQL

Aliasing can be applied for both table name and column name. The reason to use aliasing is because of:

  • Less code
  • Readability
  • Better Column Names

Figure below shows a query with aliasing for tables:

image.png

Figure below shows a query with aliasing for both tables and column names:

image-2.png

Joining Multiple Columns and String Values

We can merge multiple columns by + sign: Column1+Column2.... see examples below:

image.png

Merging columns and adding a character between two columns:

image.png

Apply aliasing after merging columns: image.png

We can use Concat for newer version instead of +.

image.png

Aggregating Data

  • GROUP BY: groups rows that have the same values into summary rows
  • GROUP BY is often used with aggregate functions (COUNT(*) (number of records) , MAX() (Maximum), MIN() (Minimum), SUM(), AVG() (Average) to group the result-set by one or more columns.
  • Null values are excluded from aggregation.
  • ORDER BY: sort the result-set in ascending (DESC) or descending (ASC) order.

See example below

  • GROUP BY and COUNT(*) image.png


  • GROUP BY, ORDER BY and COUNT(*) image-2.png


  • GROUP BY, ORDER BY, MIN() and MAX() image-3.png


  • GROUP BY, ORDER BY, MIN(), MAX() and COUNT(*)
    image-4.png


  • GROUP BY and SUM() image-6.png


  • GROUP BY, SUM() and AVG image-7.png

WHERE clause does not get aggregated values. Instead, we should use HAVING Clause.
image.png

Extracting Characters from Strings

We can use Left, Right and Substringto extract characters of records. See examples below:

  • Using LEFT image.png
  • Using SUBSTRING to Extract 3 characters from a string, starting in position 2: image.png

String Length and Character Position

LENis used to calculate length of string record: image.png

CHARINDEX is used to find position of a character in string: image-2.png

Creating a Database

We can use CREATE TABLE to design a table manually or programmatically:

image-3.png

We should always select a column IDENTITY as primary key. It is for connecting tables which is an ID number that does not allow NULL.

Adding data to a Table

Right click on database and click on “Edit Top 200 Rows” and then add data: image.png

Insert and Update Statement

For automatically insert in your database:

INSERT INTO table_name (column1, column2, column3...)

VALUES (value1,value2,value3...)

image.png

UPDATE can be used to add records: image.png

Delete, Drop and Alter Table

DELETE * FROM Current_Customers, will delete all the records from the table but the design will not disappear. Drop statement totally removes a table. See examples below:

image.png

The ALTER TABLE statement is used to add, delete, or modify columns in an existing table.

The ALTER TABLE statement is also used to add and drop various constraints on an existing table.

image.png

image-2.png

image-3.png

IS NULL and IS NOT NULL

image.png

ANY and ALL

The ANY and ALL operators allow to perform a comparison between a single column value and a range of other values.

The ANY operator:

  • returns a boolean value as a result
  • returns TRUE if ANY of the subquery values meet the condition

ANY means that the condition will be true if the operation is true for any of the values in the range.

Table below is Products

image.png

Table below is OrderDetails

image-3.png

query below show Any: image.png

image-2.png

Condition Meaning
x = ANY (…) The values in column c must match one or more values in the set to evaluate to true.
x != ANY (…) The values in column c must not match one or more values in the set to evaluate to true.
x > ANY (…) The values in column c must be greater than the smallest value in the set to evaluate to true.
x < ANY (…) The values in column c must be smaller than the biggest value in the set to evaluate to true.
x >= ANY (…) The values in column c must be greater than or equal to the smallest value in the set to evaluate to true.
x <= ANY (…) The values in column c must be smaller than or equal to the biggest value in the set to evaluate to true.

The ALL operator:

  • returns a boolean value as a result
  • returns TRUE if ALL of the subquery values meet the condition
  • is used with SELECT, WHERE and HAVING statements

ALL means that the condition will be true only if the operation is true for all values in the range.

image.png

Condition Meaning
c > ALL(…) The values in column c must greater than the biggest value in the set to evaluate to true.
c >= ALL(…) The values in column c must greater than or equal to the biggest value in the set to evaluate to true.
c < ALL(…) The values in column c must be less than the lowest value in the set to evaluate to true.
c >= ALL(…) The values in column c must be less than or equal to the lowest value in the set to evaluate to true.
c <> ALL(…) The values in column c must not be equal to any value in the set to evaluate to true.
c = ALL(…) The values in column c must be equal to any value in the set to evaluate to true.

Over Clause

The OVER clause in SQL Server is used with PARTITION BY to break up the data into partitions. Here is an example:

image-2.png

Over clause can be used with with Lead function which give image.png

Where 1=1

The statement where 1=1 in SQL means true. It is the same operation as running the select statement without the where clause.

In most cases, you will only need to use this clause when you need to build dynamic SQL statements. Once you apply the where 1=1 clause, all the subsequent statements after it can start with the ‘and’ keyword.

Consider an example where you are guessing an id of a column. Assuming you are not sure whether that id exists on the database, you can use something like where 1=1 to return all the rows even if the target id is not on the database. image.png

The query above uses an or statement. Hence, only one of the conditions needs to be true for the query to work.

Even if there is no user with an id of 10, the 1=1 will always evaluate to true, and the query will fetch all the rows in the specified table.

CONCAT_WS

The CONCAT_WS() function adds two or more strings together with a separator.

CONCAT_WS(separator, string1, string2, ...., string_n)

Add strings together. Use - to separate the concatenated string values:

image.png

Here is another example: image.png

Views

  • A view is a virtual table that is the result of a saved SQL SELECT statement
  • When accessed, views automatically update in response to updates in the underlying data

CREATE VIEW employee_years AS

SELECT id, name, year

FROM employees;

Then we can make a query to get the result from view:

SELECT id, name

FROM employee_years;

Two Popular SQL Flavors

PostgreSQL

  • Created at the University of California, Berkeley
  • Free and open-source relational database system
  • "PostgreSQL" refers to both the PostgreSQL database system

SQL Server

  • Created by Microsoft
  • Has free and paid versions
  • Queries using T-SQL

The difference between different types of SQL is similar to American and British English. There are key function in common. For example PostgreSQL uses LIMIT while SQL uses TOP at the end of query.

ROUND()

Round a number to a specific decimal

image.png Image retrieved from https://www.datacamp.com/

We can also round it to whole number: image-2.png Image retrieved from https://www.datacamp.com/

Negative value will round the digits from left: image-3.png

Image retrieved from https://www.datacamp.com/

TRUNC() Function

Return a number truncated to decimal places.

TRUNC(number, decimals)

if decimals>0 it returns number of digits after decimals:

SELECT TRUNC(15.47895, 2)= 15.47

if decimals<0 it returns number of digits before decimals:

SELECT TRUNC(12345, -3)= 12000

In [ ]:
 

CASE Expression

The CASE expression goes through conditions and returns a value when the first condition is met (like an if-then-else statement). So, once a condition is true, it will stop reading and return the result. If no conditions are true, it returns the value in the ELSE clause.

If there is no ELSE part and no conditions are true, it returns NULL.

image.png

See example below:

image-2.png

CASE statement can be located inside WHERE cluse.It can be ended to END IS NOT NULL to not have Null values:

image-2.png

CASE WHEN with Aggregate

  • CASE statements are great for
    • Categorizing data
    • Filtering data
    • Aggregating data

image.png Image retrieved from https://www.datacamp.com/

  • CASE WHEN with COUNT

image.png Image retrieved from https://www.datacamp.com/

  • CASE WHEN with SUM

image-2.png Image retrieved from https://www.datacamp.com/

image-3.png Image retrieved from https://www.datacamp.com/

In [ ]:
 

Subqueries

A subquery is a query nested inside another query: image.png

Subqueries can be placed in any part of a query:

  • SELECT
  • FROM
  • WHERE
  • GROUP BY

It can return scaler, a list or a table.

A subquery is nested inside another query that can be run on its own.

The example below shows a subquery inside a where clause: image.png

The query image-2.png can be run on its own.

SQL first process inside subquery as a single value table or list then the outer query is processed

Subquery in FROM

Subquery in where clause returns a single column, but what if we want to get more complex query. Subquery in from gives restructure and transform your data. For example:

  • Transform Data
  • Prefilter Data
  • Calculate aggregates of aggregates

Usually we should make the subquery first and then add to main query.

image.png Image retrieved from https://www.datacamp.com/

See the query below that used subquery to make a new table using AS then select its table in SELECT statement: image.png

Below is a complex subquery in FROM within outer query:

image.png

Subquery in Select

  • Subquery in Select should return a single value, otherwise it will generate an error
  • Should properly filter both the main and the subquery. Since subquery is processed before the main query, we only need to include the relevant filter int the subquery as well as the main query. image.png. See query below:

Correlated subquery

  • Correlated subquery uses values from outer query to generate a result
  • Subquery is re-run for every row generated in the final data set
  • It is usually used for advanced joining, filtering ...
Simple Subquery Correlated Subquery
Can be run independently from main query Is dependent on the main query to execute
Evaluated only once Evaluated in loops. This leads to significantly slow down query runtime.

Figure below shows simple subquery on the left and correlated subquery on the right image-2.png

Another complex correlated subquery:

image.png

Nested Subqueries

Subqueries nested inside other subqueries.

The EXTRACT() function extracts a part from a given date. For example EXTRACT(part FROM date): SELECT EXTRACT(WEEK FROM "2017-06-15");

Query below shows an example of simple subquery:

image.png

Another complex example of subquery:

image.png

WITH clause (Common Table Expressions)

The SQL WITH clause was introduced by Oracle. The SQL WITH clause allows a sub-query block a name, which can be referenced in several places within the main SQL query. It can be considered as a function.

The WITH clause is considered “temporary” because the result is not permanently stored anywhere in the database schema. image-2.png

  • CTE is only run once and store in memory. This leads to improve the amount of time to run your query.

  • CTE leads to improve organization of queries for long query

See complex example below:

image.png

Only the first CTE is declared using WITH. Afterward, use the cte's name, away, and AS.

Here is another complex cte query

image.png

Window Functions

It is impossible to get every single none aggregate value without `GROUPING BY`: it is not possible to get aggregate values with none aggregate values. Query returns an error. To resolve this, we can use window function. The PARTITION BY clause allows to calculate separate "windows" based on columns that we want to divide results.

Over

  • Calculation is performed on an already generated result set (a window)
  • Aggregated calculations can be applied without grouping data as Running totals, rankings, moving averages

Both queries below give the same answer. The left query is subquery but right query is using window function

image.png

The query below shows what the rank of matches based on number of goals. Window functions allows to creates a RANK() of information according to any variable and also sort the data. When setting this up, you will need to specify what column/calculation you want to use to calculate your rank. This is done by including an ORDER BY clause inside the OVER() clause:

image-2.png

This option is available in PostgreSQL, Oracle, MySQL, SQL Server....but NOT SQLite.

The OVER() clause allow to pass an aggregate function down a data set, similar to subqueries in SELECT. However, the OVER() clause has significant benefits over subqueries in select: 1- queries will run faster, 2- the OVER() clause has a wide range of additional functions and clauses.

OVER with a PARTITION

  • Partition allows you to calculate separate values for each categories.
  • Using only one column of data for different calculation
  • It is simply calculated byPARTITION BY within OVER:

AVG(ABC) OVER(PARTITION BY DEF)

First use an aggregate function to calculate aggregate values then such as average of ABC in above, then add an OVER clause afterward and inside the parentheses, say PARTITION BY followed by the column you want to average by (DEF)

Query below shows how many goals scored in each match, and how to compare with the season's average??

image.png

We can use partition by with multiple columns just having multiple columns after PARTITION BY. See query below:

image.png

Sliding Windows

  • Sliding functions perform calculation relative to the current row of data set
  • Can be used to calculate one row of data set at a time.
  • Can be partitioned by one or more columns

The code is written as

ROWS BETWEEN <start> AND <finish>

  • PRECEDING and FOLLOWING are the rows before and after the current row, respectively
  • UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWINGshowing the row from beginning or the end of data set in your calculation
  • CURRENT ROW tells sql you want to stop running at current row

The query below calculates sum of goals manchester city played as home team during '2011/2012' and then turn the calculation into running total order by day of the match and calculating from beginning of data set to the current row

image.png

The query below calculate the sum of goals between previous and current match.

image.png

COALESCE() Function

  • Operates row by row
  • Returns first non-NULL value

image.png

CAST() Function

One variable type can be converted to another type via cast() function. For example, the query below givs integer of 14.3 which is 14.

image.png

We can do this for entire column: image-2.png

Casting with ::

Another way of casting is using :: notation

image.png

image-2.png