Introduction
SQL stands for Structured Query Language. It is a programming language designed for managing data in a relational database. SQL SQL has a variety of functions that allow its users to read, manipulate, and change data. In this notebook, present basic to advanced level SQL programing for data analysis, for example filtering, aggregating, grouping, joining tables, subqueries, window functions and so forth.
Table of Contents
- 1 What is SQL
- 2 Select Records
- 3 WHERE CLAUSE, LIKE, IN and BETWEEN
- 4 Join Multiple Tables
- 5 Date and Time Data Types
- 6 Aliasing in SQL
- 7 Joining Multiple Columns and String Values
- 8 Aggregating Data
- 9 Extracting Characters from Strings
- 10 String Length and Character Position
- 11 Creating a Database
- 12 Adding data to a Table
- 13 Insert and Update Statement
- 14 Delete, Drop and Alter Table
- 15 IS NULL and IS NOT NULL
- 16 ANY and ALL
- 17 Over Clause
- 18 Where 1=1
- 19 CONCAT_WS
- 20 Views
- 21 Two Popular SQL Flavors
- 22 ROUND()
- 23 TRUNC() Function
- 24 CASE Expression
- 25 Subqueries
- 26 Window Functions
- 27 COALESCE() Function
- 28 CAST() Function
- 29 Casting with ::
What is SQL¶
SQL ( Structured Query Language) is a domain-specific language used in programming and designed for managing data held in a relational database management system. SQL is used for:
- create databases
- delete databases
- create tables in a database
- update data in a table
- delete data from a table
- read data from a table
- insert data in a table
- delete database tables
- and many more database operations
All data such as Employee data, customer data, 3rd Party data will be stored in database or databases; then companies retrieve data by sql query to analysis and make business decision.
We should connect to server, which is a physical Machine, to access the data. Under each server, there are databases that all data are structured in tables. There are Schema Under each database. Schema is ownership structure. Within each database, there may be different owners (Schema). For example, sales team have their own schema. Under each Schema, there are objects like tables. For example, Person.salary table means it is owned by Person Schema. See Figure below:
Databases are normalized before they use. Normalization is the process of organizing data in database and creating tables and establishing relationships between those tables to make the database more flexible. Figure below shows how normalization works. All tables can be joined with ID
table:
Select Records¶
All the tables below are Persona schema (owner):
Top recodes for large data can be achieved by:
WHERE CLAUSE, LIKE, IN and BETWEEN¶
The WHERE
clause, LIKE
, IN
are used to filter records.
It is used to extract only those records that fulfill a specified condition.
LIKE
operator finds a pattern and is the same as =
for string. Not LIKE
is the same as !=
.
The queries below finds any records with Name string starts with ‘A’, end ‘D’ and have "-". The %
wildcard will match zero, one, or many characters in text.
The IN operator allows to specify multiple values in a WHERE clause. The IN operator is a shorthand for multiple OR conditions.
The BETWEEN
operator selects values within a given range. The values can be numbers, text, or dates.
_
match a single character:
Join Multiple Tables¶
A JOIN
clause is used to combine rows from two or more tables, based on a related column between them that have the same key. Here are the different types of the JOINs in SQL:
(INNER) JOIN¶
INNER JOIN
is the same as JOIN
: it Returns records that have matching values in both tables. See Figure below:
We use (INNER) JOIN most of the times.
LEFT (OUTER) JOIN¶
LEFT JOIN
is the same as LEFT (OUTER) JOIN
: it returns all records from the left table, and the matched records from the right table. See Figure below:
RIGHT (OUTER) JOIN¶
RIGHT JOIN
is the same as RIGHT (OUTER) JOIN
: it returns all records from the right table, and the matched records from the left table. See Figure below:
FULL (OUTER) JOIN¶
FULL JOIN
is the same as FULL (OUTER) JOIN
: it returns all records when there is a match in either left or right table. See Figure below:
ANTI LEFT JOIN¶
ANTI RIGHT JOIN¶
ANTI OUTER JOIN¶
CROSS JOIN¶
UNION¶
The UNION
operator combines the result-set of two or more SELECT statements. There are conditions for this operator:
- Every SELECT statement within UNION must have the same number of columns
- The columns must also have similar data types
- The columns in every SELECT statement must also be in the same order
The UNION
operator selects only distinct values by default. We can use UNION ALL
to allow duplicate values.
Date and Time Data Types¶
Data type | Description |
---|---|
DATE | A date. Format: YYYY-MM-DD. The supported range is from '1000-01-01' to '9999-12-31' |
DATETIME(fsp) | A date and time combination. Format: YYYY-MM-DD hh:mm:ss. The supported range is from '1000-01-01 00:00:00' to '9999-12-31 23:59:59'. Adding DEFAULT and ON UPDATE in the column definition to get automatic initialization and updating to the current date and time |
TIMESTAMP(fsp) | A timestamp. TIMESTAMP values are stored as the number of seconds since the Unix epoch ('1970-01-01 00:00:00' UTC). Format: YYYY-MM-DD hh:mm:ss. The supported range is from '1970-01-01 00:00:01' UTC to '2038-01-09 03:14:07' UTC. Automatic initialization and updating to the current date and time can be specified using DEFAULT CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP in the column definition |
TIME(fsp) | A time. Format: hh:mm:ss. The supported range is from '-838:59:59' to '838:59:59' |
YEAR | A year in four-digit format. Values allowed in four-digit format: 1901 to 2155, and 0000.MySQL 8.0 does not support year in two-digit format. |
Aliasing in SQL¶
Aliasing can be applied for both table name and column name. The reason to use aliasing is because of:
- Less code
- Readability
- Better Column Names
Figure below shows a query with aliasing for tables:
Figure below shows a query with aliasing for both tables and column names:
Joining Multiple Columns and String Values¶
We can merge multiple columns by +
sign: Column1+Column2.... see examples below:
Merging columns and adding a character between two columns:
Apply aliasing after merging columns:
We can use Concat for newer version instead of +.
Aggregating Data¶
GROUP BY
: groups rows that have the same values into summary rows
GROUP BY
is often used with aggregate functions (COUNT(*)
(number of records) ,MAX()
(Maximum),MIN()
(Minimum),SUM()
,AVG()
(Average) to group the result-set by one or more columns.
- Null values are excluded from aggregation.
ORDER BY
: sort the result-set in ascending (DESC
) or descending (ASC
) order.
See example below
GROUP BY
andCOUNT(*)
GROUP BY
,ORDER BY
andCOUNT(*)
GROUP BY
,ORDER BY
,MIN()
andMAX()
GROUP BY
,ORDER BY
,MIN()
,MAX()
andCOUNT(*)
GROUP BY
andSUM()
GROUP BY
,SUM()
andAVG
WHERE
clause does not get aggregated values. Instead, we should use HAVING
Clause.
Extracting Characters from Strings¶
We can use Left
, Right
and Substring
to extract characters of records. See examples below:
- Using
LEFT
- Using
SUBSTRING
to Extract 3 characters from a string, starting in position 2:
String Length and Character Position¶
LEN
is used to calculate length of string record:
CHARINDEX
is used to find position of a character in string:
Creating a Database¶
We can use CREATE TABLE
to design a table manually or programmatically:
We should always select a column IDENTITY as primary key. It is for connecting tables which is an ID number that does not allow NULL.
Adding data to a Table¶
Right click on database and click on “Edit Top 200 Rows” and then add data:
Insert and Update Statement¶
For automatically insert in your database:
INSERT INTO
table_name (column1, column2, column3...)
VALUES
(value1,value2,value3...)
UPDATE
can be used to add records:
Delete, Drop and Alter Table¶
DELETE * FROM
Current_Customers, will delete all the records from the table but the design will not disappear. Drop
statement totally removes a table. See examples below:
The ALTER TABLE
statement is used to add, delete, or modify columns in an existing table.
The ALTER TABLE
statement is also used to add and drop various constraints on an existing table.
IS NULL and IS NOT NULL¶
ANY and ALL¶
The ANY
and ALL
operators allow to perform a comparison between a single column value and a range of other values.
The ANY
operator:
- returns a boolean value as a result
- returns TRUE if ANY of the subquery values meet the condition
ANY
means that the condition will be true if the operation is true for any of the values in the range.
Table below is Products
Table below is OrderDetails
query below show Any
:
Condition | Meaning |
---|---|
x = ANY (…) | The values in column c must match one or more values in the set to evaluate to true. |
x != ANY (…) | The values in column c must not match one or more values in the set to evaluate to true. |
x > ANY (…) | The values in column c must be greater than the smallest value in the set to evaluate to true. |
x < ANY (…) | The values in column c must be smaller than the biggest value in the set to evaluate to true. |
x >= ANY (…) | The values in column c must be greater than or equal to the smallest value in the set to evaluate to true. |
x <= ANY (…) | The values in column c must be smaller than or equal to the biggest value in the set to evaluate to true. |
The ALL
operator:
- returns a boolean value as a result
- returns TRUE if ALL of the subquery values meet the condition
- is used with SELECT, WHERE and HAVING statements
ALL means that the condition will be true only if the operation is true for all values in the range.
Condition | Meaning |
---|---|
c > ALL(…) | The values in column c must greater than the biggest value in the set to evaluate to true. |
c >= ALL(…) | The values in column c must greater than or equal to the biggest value in the set to evaluate to true. |
c < ALL(…) | The values in column c must be less than the lowest value in the set to evaluate to true. |
c >= ALL(…) | The values in column c must be less than or equal to the lowest value in the set to evaluate to true. |
c <> ALL(…) | The values in column c must not be equal to any value in the set to evaluate to true. |
c = ALL(…) | The values in column c must be equal to any value in the set to evaluate to true. |
Over Clause¶
The OVER
clause in SQL Server is used with PARTITION BY
to break up the data into partitions. Here is an example:
Over
clause can be used with with Lead
function which give
Where 1=1¶
The statement where 1=1 in SQL means true. It is the same operation as running the select statement without the where clause.
In most cases, you will only need to use this clause when you need to build dynamic SQL statements. Once you apply the where 1=1 clause, all the subsequent statements after it can start with the ‘and’ keyword.
Consider an example where you are guessing an id of a column. Assuming you are not sure whether that id exists on the database, you can use something like where 1=1 to return all the rows even if the target id is not on the database.
The query above uses an or statement. Hence, only one of the conditions needs to be true for the query to work.
Even if there is no user with an id of 10, the 1=1 will always evaluate to true, and the query will fetch all the rows in the specified table.
CONCAT_WS¶
The CONCAT_WS()
function adds two or more strings together with a separator.
CONCAT_WS(separator, string1, string2, ...., string_n)
Add strings together. Use -
to separate the concatenated string values:
Here is another example:
Views¶
- A view is a virtual table that is the result of a saved SQL
SELECT
statement - When accessed, views automatically update in response to updates in the underlying data
CREATE VIEW
employee_years AS
SELECT
id, name, year
FROM
employees;
Then we can make a query to get the result from view:
SELECT
id, name
FROM
employee_years;
Two Popular SQL Flavors¶
PostgreSQL
- Created at the University of California, Berkeley
- Free and open-source relational database system
- "PostgreSQL" refers to both the PostgreSQL database system
SQL Server
- Created by Microsoft
- Has free and paid versions
- Queries using T-SQL
The difference between different types of SQL is similar to American and British English. There are key function in common. For example PostgreSQL uses LIMIT
while SQL uses TOP
at the end of query.
ROUND()¶
Round a number to a specific decimal
Image retrieved from https://www.datacamp.com/
We can also round it to whole number: Image retrieved from https://www.datacamp.com/
Negative value will round the digits from left:
Image retrieved from https://www.datacamp.com/
TRUNC() Function¶
Return a number truncated to decimal places.
TRUNC(number, decimals)
if decimals>0 it returns number of digits after decimals:
SELECT TRUNC(15.47895, 2)= 15.47
if decimals<0 it returns number of digits before decimals:
SELECT TRUNC(12345, -3)= 12000
CASE Expression¶
The CASE
expression goes through conditions and returns a value when the first condition is met (like an if-then-else statement). So, once a condition is true, it will stop reading and return the result. If no conditions are true, it returns the value in the ELSE clause.
If there is no ELSE part and no conditions are true, it returns NULL.
See example below:
CASE
statement can be located inside WHERE
cluse.It can be ended to END IS NOT NULL
to not have Null values:
CASE WHEN with Aggregate¶
- CASE statements are great for
- Categorizing data
- Filtering data
- Aggregating data
Image retrieved from https://www.datacamp.com/
- CASE WHEN with
COUNT
Image retrieved from https://www.datacamp.com/
- CASE WHEN with
SUM
Image retrieved from https://www.datacamp.com/
Image retrieved from https://www.datacamp.com/
Subqueries¶
A subquery is a query nested inside another query:
Subqueries can be placed in any part of a query:
SELECT
FROM
WHERE
GROUP BY
It can return scaler, a list or a table.
A subquery is nested inside another query that can be run on its own.
The example below shows a subquery inside a where
clause:
The query can be run on its own.
SQL first process inside subquery as a single value table or list then the outer query is processed
Subquery in FROM¶
Subquery in where clause returns a single column, but what if we want to get more complex query. Subquery in from gives restructure and transform your data. For example:
- Transform Data
- Prefilter Data
- Calculate aggregates of aggregates
Usually we should make the subquery first and then add to main query.
Image retrieved from https://www.datacamp.com/
See the query below that used subquery to make a new table using AS
then select its table in SELECT
statement:
Below is a complex subquery in FROM
within outer query:
Subquery in Select¶
- Subquery in Select should return a single value, otherwise it will generate an error
- Should properly filter both the main and the subquery. Since subquery is processed before the main query, we only need to include the relevant filter int the subquery as well as the main query. . See query below:
Correlated subquery¶
- Correlated subquery uses values from outer query to generate a result
- Subquery is re-run for every row generated in the final data set
- It is usually used for advanced joining, filtering ...
Simple Subquery | Correlated Subquery |
---|---|
Can be run independently from main query | Is dependent on the main query to execute |
Evaluated only once | Evaluated in loops. This leads to significantly slow down query runtime. |
Figure below shows simple subquery on the left and correlated subquery on the right
Another complex correlated subquery:
Nested Subqueries¶
Subqueries nested inside other subqueries.
The EXTRACT()
function extracts a part from a given date. For example EXTRACT(part FROM date)
: SELECT EXTRACT(WEEK FROM "2017-06-15");
Query below shows an example of simple subquery:
Another complex example of subquery:
WITH clause (Common Table Expressions)¶
The SQL WITH
clause was introduced by Oracle. The SQL WITH
clause allows a sub-query block a name, which can be referenced in several places within the main SQL query. It can be considered as a function.
The WITH
clause is considered “temporary” because the result is not permanently stored anywhere in the database schema.
CTE is only run once and store in memory. This leads to improve the amount of time to run your query.
CTE leads to improve organization of queries for long query
See complex example below:
Only the first CTE is declared using WITH. Afterward, use the cte's name, away, and AS.
Here is another complex cte query
Window Functions¶
It is impossible to get every single none aggregate value without `GROUPING BY`: it is not possible to get aggregate values with none aggregate values. Query returns an error. To resolve this, we can use window function. The PARTITION BY clause allows to calculate separate "windows" based on columns that we want to divide results.
Over¶
- Calculation is performed on an already generated result set (a window)
- Aggregated calculations can be applied without grouping data as Running totals, rankings, moving averages
Both queries below give the same answer. The left query is subquery but right query is using window function
The query below shows what the rank of matches based on number of goals. Window functions allows to creates a RANK()
of information according to any variable and also sort the data. When setting this up, you will need to specify what column/calculation you want to use to calculate your rank. This is done by including an ORDER BY
clause inside the OVER()
clause:
This option is available in PostgreSQL, Oracle, MySQL, SQL Server....but NOT SQLite.
The OVER()
clause allow to pass an aggregate function down a data set, similar to subqueries in SELECT
. However, the OVER()
clause has significant benefits over subqueries in select: 1- queries will run faster, 2- the OVER()
clause has a wide range of additional functions and clauses.
OVER with a PARTITION¶
- Partition allows you to calculate separate values for each categories.
- Using only one column of data for different calculation
- It is simply calculated by
PARTITION BY
withinOVER
:
AVG
(ABC) OVER
(PARTITION BY
DEF)
First use an aggregate function to calculate aggregate values then such as average of ABC in above, then add an OVER
clause afterward and inside the parentheses, say PARTITION BY
followed by the column you want to average by (DEF)
Query below shows how many goals scored in each match, and how to compare with the season's average??
We can use partition by with multiple columns just having multiple columns after PARTITION BY
. See query below:
Sliding Windows¶
- Sliding functions perform calculation relative to the current row of data set
- Can be used to calculate one row of data set at a time.
- Can be partitioned by one or more columns
The code is written as
ROWS
BETWEEN
<start
> AND
<finish
>
PRECEDING
andFOLLOWING
are the rows before and after the current row, respectivelyUNBOUNDED PRECEDING
AND
UNBOUNDED FOLLOWING
showing the row from beginning or the end of data set in your calculationCURRENT ROW
tells sql you want to stop running at current row
The query below calculates sum of goals manchester city played as home team during '2011/2012' and then turn the calculation into running total order by day of the match and calculating from beginning of data set to the current row
The query below calculate the sum of goals between previous and current match.
COALESCE() Function¶
- Operates row by row
- Returns first non-NULL value
CAST() Function¶
One variable type can be converted to another type via cast()
function. For example, the query below givs integer of 14.3 which is 14.
We can do this for entire column:
Casting with ::¶
Another way of casting is using ::
notation
- Home
-
- Prediction of Movie Genre by Fine-tunning GPT
- Fine-tunning BERT for Fake News Detection
- Covid Tweet Classification by Fine-tunning BART
- Semantic Search Using BERT
- Abstractive Semantic Search by OpenAI Embedding
- Fine-tunning GPT for Style Completion
- Extractive Question-Answering by BERT
- Fine-tunning T5 Model for Abstract Title Prediction
- Image Captioning by Fine-tunning ViT
- Build Serverless ChatGPT API
- Statistical Analysis in Python
- Clustering Algorithms
- Customer Segmentation
- Time Series Forecasting
- PySpark Fundamentals for Big Data
- Predict Customer Churn
- Classification with Imbalanced Classes
- Feature Importance
- Feature Selection
- Text Similarity Measurement
- Dimensionality Reduction
- Prediction of Methane Leakage
- Imputation by LU Simulation
- Histogram Uncertainty
- Delustering to Improve Preferential Sampling
- Uncertainty in Spatial Correlation
-
- Machine Learning Overview
- Python and Pandas
- Main Steps of Machine Learning
- Classification
- Model Training
- Support Vector Machines
- Decision Trees
- Ensemble Learning & Random Forests
- Artificial Neural Network
- Deep Neural Network (DNN)
- Unsupervised Learning
- Multicollinearity
- Introduction to Git
- Introduction to R
- SQL Basic to Advanced Level
- Develop Python Package
- Introduction to BERT LLM
- Exploratory Data Analysis
- Object Oriented Programming in Python
- Natural Language Processing
- Convolutional Neural Network
- Publications