Introduction
SQL stands for Structured Query Language. It is a programming language designed for managing data in a relational database. SQL SQL has a variety of functions that allow its users to read, manipulate, and change data. In this notebook, present basic to advanced level SQL programing for data analysis, for example filtering, aggregating, grouping, joining tables, subqueries, window functions and so forth.
SQL ( Structured Query Language) is a domain-specific language used in programming and designed for managing data held in a relational database management system. SQL is used for:
All data such as Employee data, customer data, 3rd Party data will be stored in database or databases; then companies retrieve data by sql query to analysis and make business decision.
We should connect to server, which is a physical Machine, to access the data. Under each server, there are databases that all data are structured in tables. There are Schema Under each database. Schema is ownership structure. Within each database, there may be different owners (Schema). For example, sales team have their own schema. Under each Schema, there are objects like tables. For example, Person.salary table means it is owned by Person Schema. See Figure below:
Databases are normalized before they use. Normalization is the process of organizing data in database and creating tables and establishing relationships between those tables to make the database more flexible. Figure below shows how normalization works. All tables can be joined with ID
table:
All the tables below are Persona schema (owner):
Top recodes for large data can be achieved by:
The WHERE
clause, LIKE
, IN
are used to filter records.
It is used to extract only those records that fulfill a specified condition.
LIKE
operator finds a pattern and is the same as =
for string. Not LIKE
is the same as !=
.
The queries below finds any records with Name string starts with ‘A’, end ‘D’ and have "-". The %
wildcard will match zero, one, or many characters in text.
The IN operator allows to specify multiple values in a WHERE clause. The IN operator is a shorthand for multiple OR conditions.
The BETWEEN
operator selects values within a given range. The values can be numbers, text, or dates.
_
match a single character:
A JOIN
clause is used to combine rows from two or more tables, based on a related column between them that have the same key. Here are the different types of the JOINs in SQL:
INNER JOIN
is the same as JOIN
: it Returns records that have matching values in both tables. See Figure below:
We use (INNER) JOIN most of the times.
LEFT JOIN
is the same as LEFT (OUTER) JOIN
: it returns all records from the left table, and the matched records from the right table. See Figure below:
RIGHT JOIN
is the same as RIGHT (OUTER) JOIN
: it returns all records from the right table, and the matched records from the left table. See Figure below:
FULL JOIN
is the same as FULL (OUTER) JOIN
: it returns all records when there is a match in either left or right table. See Figure below:
The UNION
operator combines the result-set of two or more SELECT statements. There are conditions for this operator:
The UNION
operator selects only distinct values by default. We can use UNION ALL
to allow duplicate values.
Data type | Description |
---|---|
DATE | A date. Format: YYYY-MM-DD. The supported range is from '1000-01-01' to '9999-12-31' |
DATETIME(fsp) | A date and time combination. Format: YYYY-MM-DD hh:mm:ss. The supported range is from '1000-01-01 00:00:00' to '9999-12-31 23:59:59'. Adding DEFAULT and ON UPDATE in the column definition to get automatic initialization and updating to the current date and time |
TIMESTAMP(fsp) | A timestamp. TIMESTAMP values are stored as the number of seconds since the Unix epoch ('1970-01-01 00:00:00' UTC). Format: YYYY-MM-DD hh:mm:ss. The supported range is from '1970-01-01 00:00:01' UTC to '2038-01-09 03:14:07' UTC. Automatic initialization and updating to the current date and time can be specified using DEFAULT CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP in the column definition |
TIME(fsp) | A time. Format: hh:mm:ss. The supported range is from '-838:59:59' to '838:59:59' |
YEAR | A year in four-digit format. Values allowed in four-digit format: 1901 to 2155, and 0000.MySQL 8.0 does not support year in two-digit format. |
Aliasing can be applied for both table name and column name. The reason to use aliasing is because of:
Figure below shows a query with aliasing for tables:
Figure below shows a query with aliasing for both tables and column names:
We can merge multiple columns by +
sign: Column1+Column2.... see examples below:
Merging columns and adding a character between two columns:
Apply aliasing after merging columns:
We can use Concat for newer version instead of +.
GROUP BY
: groups rows that have the same values into summary rowsGROUP BY
is often used with aggregate functions (COUNT(*)
(number of records) , MAX()
(Maximum), MIN()
(Minimum), SUM()
, AVG()
(Average) to group the result-set by one or more columns.ORDER BY
: sort the result-set in ascending (DESC
) or descending (ASC
) order.See example below
GROUP BY
and COUNT(*)
GROUP BY
, ORDER BY
and COUNT(*)
GROUP BY
, ORDER BY
, MIN()
and MAX()
GROUP BY
, ORDER BY
, MIN()
, MAX()
and COUNT(*)
GROUP BY
and SUM()
GROUP BY
, SUM()
and AVG
WHERE
clause does not get aggregated values. Instead, we should use HAVING
Clause.
We can use Left
, Right
and Substring
to extract characters of records. See examples below:
LEFT
SUBSTRING
to Extract 3 characters from a string, starting in position 2:
LEN
is used to calculate length of string record:
CHARINDEX
is used to find position of a character in string:
We can use CREATE TABLE
to design a table manually or programmatically:
We should always select a column IDENTITY as primary key. It is for connecting tables which is an ID number that does not allow NULL.
Right click on database and click on “Edit Top 200 Rows” and then add data:
For automatically insert in your database:
INSERT INTO
table_name (column1, column2, column3...)
VALUES
(value1,value2,value3...)
UPDATE
can be used to add records:
DELETE * FROM
Current_Customers, will delete all the records from the table but the design will not disappear. Drop
statement totally removes a table. See examples below:
The ALTER TABLE
statement is used to add, delete, or modify columns in an existing table.
The ALTER TABLE
statement is also used to add and drop various constraints on an existing table.
The ANY
and ALL
operators allow to perform a comparison between a single column value and a range of other values.
The ANY
operator:
ANY
means that the condition will be true if the operation is true for any of the values in the range.
Table below is Products
Table below is OrderDetails
query below show Any
:
Condition | Meaning |
---|---|
x = ANY (…) | The values in column c must match one or more values in the set to evaluate to true. |
x != ANY (…) | The values in column c must not match one or more values in the set to evaluate to true. |
x > ANY (…) | The values in column c must be greater than the smallest value in the set to evaluate to true. |
x < ANY (…) | The values in column c must be smaller than the biggest value in the set to evaluate to true. |
x >= ANY (…) | The values in column c must be greater than or equal to the smallest value in the set to evaluate to true. |
x <= ANY (…) | The values in column c must be smaller than or equal to the biggest value in the set to evaluate to true. |
The ALL
operator:
ALL means that the condition will be true only if the operation is true for all values in the range.
Condition | Meaning |
---|---|
c > ALL(…) | The values in column c must greater than the biggest value in the set to evaluate to true. |
c >= ALL(…) | The values in column c must greater than or equal to the biggest value in the set to evaluate to true. |
c < ALL(…) | The values in column c must be less than the lowest value in the set to evaluate to true. |
c >= ALL(…) | The values in column c must be less than or equal to the lowest value in the set to evaluate to true. |
c <> ALL(…) | The values in column c must not be equal to any value in the set to evaluate to true. |
c = ALL(…) | The values in column c must be equal to any value in the set to evaluate to true. |
The OVER
clause in SQL Server is used with PARTITION BY
to break up the data into partitions. Here is an example:
Over
clause can be used with with Lead
function which give
The statement where 1=1 in SQL means true. It is the same operation as running the select statement without the where clause.
In most cases, you will only need to use this clause when you need to build dynamic SQL statements. Once you apply the where 1=1 clause, all the subsequent statements after it can start with the ‘and’ keyword.
Consider an example where you are guessing an id of a column. Assuming you are not sure whether that id exists on the database, you can use something like where 1=1 to return all the rows even if the target id is not on the database.
The query above uses an or statement. Hence, only one of the conditions needs to be true for the query to work.
Even if there is no user with an id of 10, the 1=1 will always evaluate to true, and the query will fetch all the rows in the specified table.
The CONCAT_WS()
function adds two or more strings together with a separator.
CONCAT_WS(separator, string1, string2, ...., string_n)
Add strings together. Use -
to separate the concatenated string values:
Here is another example:
SELECT
statementCREATE VIEW
employee_years AS
SELECT
id, name, year
FROM
employees;
Then we can make a query to get the result from view:
SELECT
id, name
FROM
employee_years;
PostgreSQL
SQL Server
The difference between different types of SQL is similar to American and British English. There are key function in common. For example PostgreSQL uses LIMIT
while SQL uses TOP
at the end of query.
Round a number to a specific decimal
Image retrieved from https://www.datacamp.com/
We can also round it to whole number: Image retrieved from https://www.datacamp.com/
Negative value will round the digits from left:
Image retrieved from https://www.datacamp.com/
Return a number truncated to decimal places.
TRUNC(number, decimals)
if decimals>0 it returns number of digits after decimals:
SELECT TRUNC(15.47895, 2)= 15.47
if decimals<0 it returns number of digits before decimals:
SELECT TRUNC(12345, -3)= 12000
The CASE
expression goes through conditions and returns a value when the first condition is met (like an if-then-else statement). So, once a condition is true, it will stop reading and return the result. If no conditions are true, it returns the value in the ELSE clause.
If there is no ELSE part and no conditions are true, it returns NULL.
See example below:
CASE
statement can be located inside WHERE
cluse.It can be ended to END IS NOT NULL
to not have Null values:
Image retrieved from https://www.datacamp.com/
COUNT
Image retrieved from https://www.datacamp.com/
SUM
Image retrieved from https://www.datacamp.com/
Image retrieved from https://www.datacamp.com/
A subquery is a query nested inside another query:
Subqueries can be placed in any part of a query:
SELECT
FROM
WHERE
GROUP BY
It can return scaler, a list or a table.
A subquery is nested inside another query that can be run on its own.
The example below shows a subquery inside a where
clause:
The query can be run on its own.
SQL first process inside subquery as a single value table or list then the outer query is processed
Subquery in where clause returns a single column, but what if we want to get more complex query. Subquery in from gives restructure and transform your data. For example:
Usually we should make the subquery first and then add to main query.
Image retrieved from https://www.datacamp.com/
See the query below that used subquery to make a new table using AS
then select its table in SELECT
statement:
Below is a complex subquery in FROM
within outer query:
Simple Subquery | Correlated Subquery |
---|---|
Can be run independently from main query | Is dependent on the main query to execute |
Evaluated only once | Evaluated in loops. This leads to significantly slow down query runtime. |
Figure below shows simple subquery on the left and correlated subquery on the right
Another complex correlated subquery:
Subqueries nested inside other subqueries.
The EXTRACT()
function extracts a part from a given date. For example EXTRACT(part FROM date)
: SELECT EXTRACT(WEEK FROM "2017-06-15");
Query below shows an example of simple subquery:
Another complex example of subquery:
The SQL WITH
clause was introduced by Oracle. The SQL WITH
clause allows a sub-query block a name, which can be referenced in several places within the main SQL query. It can be considered as a function.
The WITH
clause is considered “temporary” because the result is not permanently stored anywhere in the database schema.
CTE is only run once and store in memory. This leads to improve the amount of time to run your query.
CTE leads to improve organization of queries for long query
See complex example below:
Only the first CTE is declared using WITH. Afterward, use the cte's name, away, and AS.
Here is another complex cte query
It is impossible to get every single none aggregate value without `GROUPING BY`: it is not possible to get aggregate values with none aggregate values. Query returns an error. To resolve this, we can use window function. The PARTITION BY clause allows to calculate separate "windows" based on columns that we want to divide results.
Both queries below give the same answer. The left query is subquery but right query is using window function
The query below shows what the rank of matches based on number of goals. Window functions allows to creates a RANK()
of information according to any variable and also sort the data. When setting this up, you will need to specify what column/calculation you want to use to calculate your rank. This is done by including an ORDER BY
clause inside the OVER()
clause:
This option is available in PostgreSQL, Oracle, MySQL, SQL Server....but NOT SQLite.
The OVER()
clause allow to pass an aggregate function down a data set, similar to subqueries in SELECT
. However, the OVER()
clause has significant benefits over subqueries in select: 1- queries will run faster, 2- the OVER()
clause has a wide range of additional functions and clauses.
PARTITION BY
within OVER
:AVG
(ABC) OVER
(PARTITION BY
DEF)
First use an aggregate function to calculate aggregate values then such as average of ABC in above, then add an OVER
clause afterward and inside the parentheses, say PARTITION BY
followed by the column you want to average by (DEF)
Query below shows how many goals scored in each match, and how to compare with the season's average??
We can use partition by with multiple columns just having multiple columns after PARTITION BY
. See query below:
The code is written as
ROWS
BETWEEN
<start
> AND
<finish
>
PRECEDING
and FOLLOWING
are the rows before and after the current row, respectivelyUNBOUNDED PRECEDING
AND
UNBOUNDED FOLLOWING
showing the row from beginning or the end of data set in your calculationCURRENT ROW
tells sql you want to stop running at current rowThe query below calculates sum of goals manchester city played as home team during '2011/2012' and then turn the calculation into running total order by day of the match and calculating from beginning of data set to the current row
The query below calculate the sum of goals between previous and current match.
One variable type can be converted to another type via cast()
function. For example, the query below givs integer of 14.3 which is 14.
We can do this for entire column:
Another way of casting is using ::
notation