SQL for Data Science: Introduction & Tutorial

When it comes to working with large datasets, one of the most powerful tools at a data scientist's disposal is SQL (Structured Query Language). SQL allows for efficient management, manipulation and retrieval of data from relational databases. This makes it an essential skill for anyone in the field of data science.

In this article, we'll introduce SQL and how it can be used for data science. We'll cover the basics of SQL syntax and querying databases and also introduce some advanced techniques used for more complex analysis.

Getting started with SQL

Introduction to SQL

Structured Query Language, or SQL, is a programming language used for managing relational databases. It allows users to store, manipulate, and retrieve data in a database.

With the increasing demand for data-driven decision-making and big data analytics, knowledge of SQL has become an essential skill for data scientists. In this tutorial, we will cover the basics of SQL and introduce you to how it can be applied to data science.

(Related reading: query languages.)

Role of SQL in data science

Data science is a multidisciplinary field that combines different methods, algorithms, and tools to extract useful knowledge and insights from data. SQL plays a crucial role in this process by providing a standardized way to store and manage data.

SQL can handle large datasets efficiently and allows complex queries to be executed quickly. This makes it an ideal tool for data scientists working with large amounts of data. When training machine learning models, for example, data scientists often need to extract large amounts of relevant data from a data source. Using SQL in such scenarios can streamline the data extraction process and save time.

Additionally, many companies use relational databases to store and organize their data. As a result, being proficient in SQL is often a requirement for data science job positions.

Basic Structure of SQL

SQL statements are commonly grouped into three main categories:

  1. Data Definition Language (DDL)
  2. Data Manipulation Language (DML)
  3. Data Control Language (DCL)

Data Definition Language (DDL)

DDL is used for creating, modifying, and deleting database objects such as tables, views, and indexes. It allows users to define the structure of the database and specify the relationships between different entities.

Examples of DDL statements include CREATE, ALTER, and DROP.
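As a minimal sketch of creating, modifying, and deleting a database object, the snippet below runs DDL statements against an in-memory SQLite database through Python's standard sqlite3 module. The `customers` table and its columns are hypothetical names chosen for illustration, and the `PRAGMA table_info` call is specific to SQLite.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# CREATE: define a new table and its columns (hypothetical schema).
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        last_name   TEXT NOT NULL,
        country     TEXT
    )
""")

# ALTER: change the structure of an existing table.
conn.execute("ALTER TABLE customers ADD COLUMN email TEXT")

# Inspect the resulting column names (index 1 of each table_info row).
columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
print(columns)

# DROP: remove the table and its data entirely.
conn.execute("DROP TABLE customers")
remaining_tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(remaining_tables)
```

Other database systems accept the same CREATE, ALTER, and DROP statements in broadly similar form, though data types and supported ALTER operations vary.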

Data Manipulation Language (DML)

DML is used to manipulate data stored in the database. This includes inserting, updating, deleting, and retrieving data from tables.

Examples of DML statements include INSERT, UPDATE, DELETE, and SELECT.
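The four kinds of data manipulation described above can be sketched as follows, again using an in-memory SQLite database through Python's sqlite3 module. The table and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, last_name TEXT, country TEXT)"
)

# INSERT: add rows. The ? placeholders keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO customers (last_name, country) VALUES (?, ?)",
    [("Nguyen", "USA"), ("Okafor", "Nigeria"), ("Schmidt", "Germany")],
)

# UPDATE: change values in rows that match a condition.
conn.execute("UPDATE customers SET country = 'Canada' WHERE last_name = 'Nguyen'")

# DELETE: remove rows that match a condition.
conn.execute("DELETE FROM customers WHERE last_name = 'Schmidt'")

# SELECT: retrieve what is left.
rows = conn.execute(
    "SELECT last_name, country FROM customers ORDER BY customer_id"
).fetchall()
print(rows)
```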

Data Control Language (DCL)

DCL is used to control access to the database and its objects. It includes granting or revoking privileges to users and managing security permissions.

Examples of DCL statements include GRANT and REVOKE.

Setting up a SQL environment

Before starting with SQL for use in data science, you'll have to pick a relational database tool and IDE (integrated development environment). Popular options include:

For beginners who are new to SQL and plan to apply it to data science, either Jupyter Notebook or JupyterLab is a good place to start. Using the sqlite3 library within a Jupyter Notebook makes for a smooth transition from Python to SQL.

Alternatively, you can use the MySQL connector library in Python. Using SQL with Python in this manner makes for better-integrated data analysis.
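A minimal sketch of the sqlite3 route, which needs nothing beyond a standard Python installation. It can be pasted into a notebook cell as-is. The `measurements` table is a hypothetical example, and `:memory:` can be swapped for a file path if the data should persist.

```python
import sqlite3

# Open a connection; ":memory:" creates a throwaway database.
conn = sqlite3.connect(":memory:")

# Optional: let rows be accessed by column name as well as by position.
conn.row_factory = sqlite3.Row

conn.execute("CREATE TABLE measurements (sensor TEXT, reading REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.5), ("b", 2.5)],
)
conn.commit()  # make the inserts permanent for file-backed databases

first = conn.execute(
    "SELECT sensor, reading FROM measurements ORDER BY sensor"
).fetchone()

print("SQLite version:", sqlite3.sqlite_version)
print(first["sensor"], first["reading"])
```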

Basic SQL Syntax

To understand the grammar of SQL, you'll need to explore the syntax of the language.

SQL follows a specific syntax for writing queries, which consists of keywords, clauses, expressions, and operators. A basic SQL query has the following structure:

```sql
SELECT column1, column2
FROM table_name
WHERE condition;
```

This query selects specific columns from a table and applies a condition to filter the results. SELECT, FROM, and WHERE are SQL keywords. Keywords are not case-sensitive, but writing them in uppercase is a widely used convention that makes queries easier to read.

Core SQL queries for Data Analysis

To perform basic SQL queries, you'll need to know some core functions and clauses used in SQL. Here are some common ones.

Data retrieval with SELECT

The SELECT statement is the foundation of data retrieval in SQL, allowing users to specify the exact data needed from a database. You can retrieve specific columns or use wildcards to select all data from a table.

Specifying data sources using FROM

FROM specifies the table or tables to query data from. Tables can also be joined together with the FROM clause.

Filtering results Using WHERE

The WHERE clause allows users to specify conditions for filtering results. It allows you to choose the rows that meet a specific condition you have chosen.

This is useful when dealing with large datasets and wanting to extract only relevant information.
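As a rough illustration of SELECT, FROM, and WHERE working together, the sketch below filters a small, made-up `customer` table in an in-memory SQLite database. The column names and sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (last_name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        ("Adams", "USA", 34),
        ("Baker", "USA", 19),
        ("Chen", "China", 41),
        ("Diaz", "Mexico", 27),
    ],
)

# Conditions can be combined with AND / OR.
# The ? placeholders are filled from the tuple passed to execute().
adults_in_usa = conn.execute(
    "SELECT last_name FROM customer WHERE country = ? AND age >= ?",
    ("USA", 21),
).fetchall()

# IN matches any value from a list.
north_america = conn.execute(
    "SELECT last_name FROM customer "
    "WHERE country IN ('USA', 'Mexico') "
    "ORDER BY last_name"
).fetchall()

print(adults_in_usa)
print(north_america)
```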

Sorting results with ORDER BY

The ORDER BY clause is used to sort the results of a query in ascending or descending order. It takes one or more column names as arguments and sorts the results based on those columns.

The two main keywords used with ORDER BY are ASC (ascending order, the default) and DESC (descending order).

This clause is typically appended at the end of a query to define how the returned data should be ordered for better exploratory data analysis.

Example:

```sql
SELECT *
FROM customer
WHERE country = 'USA'
ORDER BY last_name ASC;
```

This query selects all columns for the rows in the customer table where the country is USA, then sorts the results by last name in ascending order.
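ORDER BY also accepts several columns, each with its own direction. The sketch below, using made-up rows in an in-memory SQLite database, sorts by country ascending and then by last name descending within each country.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (last_name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("Adams", "USA"), ("Baker", "USA"), ("Chen", "China")],
)

# Rows are sorted by the first column; ties are broken by the second.
rows = conn.execute(
    "SELECT country, last_name FROM customer "
    "ORDER BY country ASC, last_name DESC"
).fetchall()
print(rows)
```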

Summarizing Data

To present data in a more readable and concise way, SQL has several ways for summarizing data. These include:

Aggregate functions

To make better sense of data, you'll have to create aggregates. An aggregate function in SQL is used to perform calculations on the data retrieved from a query.

Some common aggregate functions include SUM, AVG, COUNT, MIN, and MAX.

GROUP BY for aggregation

The GROUP BY clause is used to group data by one or more columns, and then an aggregate function such as SUM, MAX, MIN, AVG, or COUNT can be applied.

Example of GROUP BY being used with the SUM aggregation:

```sql
SELECT country, SUM(total_sales)
FROM sales
GROUP BY country;
```

This query groups the sales data by country and calculates the total sales for each country.
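To see the grouping at work on actual rows, here is a small sketch that applies several aggregate functions at once. The `sales` rows are invented for illustration, and the query runs on an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, total_sales REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("USA", 100.0), ("USA", 250.0), ("Germany", 80.0)],
)

# One output row per country, with three aggregates computed per group.
summary = conn.execute(
    "SELECT country, COUNT(*), SUM(total_sales), AVG(total_sales) "
    "FROM sales "
    "GROUP BY country "
    "ORDER BY country"
).fetchall()

for country, n_rows, total, average in summary:
    print(country, n_rows, total, average)
```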

HAVING clause for filtered aggregates

HAVING complements the aggregation process by filtering aggregated data that matches specific criteria. It provides a more refined output, retaining only the most relevant insights for analysis.

While WHERE filters rows before aggregation, HAVING deals with rows after they're bunched together. It allows us to apply conditions on the summary to ensure the final display of results is precisely tuned to the question at hand.

Here is an example of how the HAVING clause can be used:

```sql
SELECT Country, COUNT(*) AS TotalCustomers
FROM Customers
GROUP BY Country
HAVING COUNT(*) > 10;
```

This statement will return a list of countries with more than ten customers. The use of the HAVING clause ensures only relevant data is returned from the aggregation.
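The difference between WHERE and HAVING is easiest to see when both appear in one query. In this sketch, WHERE first removes inactive customers row by row, and HAVING then keeps only the countries whose remaining count is at least two. The `Active` column, the threshold, and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (Country TEXT, Active INTEGER)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [
        ("USA", 1), ("USA", 1), ("USA", 0),
        ("Germany", 1), ("Germany", 1),
        ("Japan", 1),
    ],
)

result = conn.execute(
    "SELECT Country, COUNT(*) AS ActiveCustomers "
    "FROM Customers "
    "WHERE Active = 1 "        # filters individual rows before grouping
    "GROUP BY Country "
    "HAVING COUNT(*) >= 2 "    # filters whole groups after aggregation
    "ORDER BY Country"
).fetchall()
print(result)
```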

Advanced SQL techniques

JOIN operations for multiple data relationships

JOIN is a fundamental operation in SQL, bringing together data across tables. Data comes from multiple sources and may be stored independently, making JOINs vital for connecting the dots.

Here is an example of the INNER JOIN syntax:

```sql
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID;
```

This statement will return the order ID, customer name, and order date for all orders that have a matching customer ID in both the Orders and Customers tables.

The customer ID primary key in the Customers table is matched with the customer ID foreign key in the Orders table. This process is similar to the VLOOKUP function in Microsoft Excel.

For use cases in data science, this can be particularly useful, as data may need to be combined from multiple tables to provide a comprehensive and accurate view of the data.
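The following sketch runs an INNER JOIN on a few made-up rows and, for contrast, a LEFT JOIN, which is not covered above but is a standard SQL join type that also keeps customers who have no orders. The tables follow the naming of the example above, and the sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT)")
conn.execute("CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, OrderDate TEXT)")
conn.executemany("INSERT INTO Customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben"), (3, "Cy")])
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [(10, 1, "2024-01-05"), (11, 1, "2024-02-01"), (12, 2, "2024-01-20")],
)

# INNER JOIN: only rows with a match in both tables.
inner = conn.execute(
    "SELECT Customers.CustomerName, Orders.OrderID "
    "FROM Orders "
    "INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID "
    "ORDER BY Orders.OrderID"
).fetchall()

# LEFT JOIN: every customer, with NULL (None in Python) where no order exists.
left = conn.execute(
    "SELECT Customers.CustomerName, Orders.OrderID "
    "FROM Customers "
    "LEFT JOIN Orders ON Orders.CustomerID = Customers.CustomerID "
    "ORDER BY Customers.CustomerID, Orders.OrderID"
).fetchall()

print(inner)
print(left)
```

Customer "Cy" has no orders, so it is absent from the INNER JOIN result and appears with a missing order ID in the LEFT JOIN result.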

Subqueries for nested data extraction

Data scientists can also use subqueries to build more complex SQL queries.

Subqueries are queries nested within the main query and help extract data from a specific subset of tables, thus refining results to a higher degree.

Subqueries can be used in various ways, such as filtering out records based on an aggregate function or comparing values across different tables. Their versatility simplifies complex data manipulation and allows for more precise analysis.

Here's a simple example to show the syntax:

```sql
SELECT *
FROM Customers
WHERE Country IN (SELECT Country FROM Suppliers);
```

This statement will return all customers who share a country with any of the suppliers.
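Filtering on an aggregate, mentioned above as another use of subqueries, might look like the sketch below. The inner query computes the average salary, and the outer query keeps the employees who earn more than that. The `employees` rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", 40000), ("Ben", 60000), ("Cy", 80000)],
)

# The subquery in parentheses returns a single value (the average salary).
above_average = conn.execute(
    "SELECT employee_name "
    "FROM employees "
    "WHERE salary > (SELECT AVG(salary) FROM employees)"
).fetchall()
print(above_average)
```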

Common Table Expressions (CTEs)

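A Common Table Expression (CTE) is a named, temporary result set that exists only for the duration of a single SQL statement. CTEs are not stored as database objects, but they offer a convenient way to define an intermediate result and refer to it within a single query. CTEs can be used to break a complex query into named steps and to express recursive queries.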

They are essentially queries within queries, similar to subqueries, but with the added benefit of improved readability and maintainability.

Here's an example of a CTE query:

```sql
WITH temp_table AS (
  SELECT employee_name, salary FROM employees WHERE department = 'Sales'
)
SELECT * FROM temp_table WHERE salary > 50000;
```

In this example, the CTE (temp_table) is used to store a subset of data from the employees table and then retrieve only the relevant information in the subsequent query. This not only simplifies the query but also makes it easier to understand and modify in case of any changes.

CTEs can also be recursive, which means they can refer to themselves within the same query. This is especially useful for dealing with hierarchical data such as organizational charts or family trees.

Recursive CTEs also simplify the process of generating sequences and patterns, which are often encountered in data science tasks. CTEs can also act as temporary views within a query, making it easier to read and understand complex data relationships.
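Two short sketches of recursive CTEs, run on an in-memory SQLite database: one generates a sequence of numbers, the other walks a small organizational chart from the top down. The `staff` table, its rows, and the CTE names `seq` and `chain` are hypothetical. The `WITH RECURSIVE` keyword is used by SQLite, PostgreSQL, and MySQL, while some other systems use plain `WITH` for the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1) Sequence generation: start at 1, keep adding 1 while n < 5.
sequence = [
    row[0]
    for row in conn.execute(
        "WITH RECURSIVE seq(n) AS ("
        "  SELECT 1 "
        "  UNION ALL "
        "  SELECT n + 1 FROM seq WHERE n < 5"
        ") SELECT n FROM seq"
    )
]
print(sequence)

# 2) Hierarchy traversal: each row points at its manager's id.
conn.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [(1, "Dana", None), (2, "Eli", 1), (3, "Fay", 1), (4, "Gus", 2)],
)

org_chart = conn.execute(
    "WITH RECURSIVE chain(id, name, depth) AS ("
    "  SELECT id, name, 0 FROM staff WHERE manager_id IS NULL "   # anchor: the top of the chart
    "  UNION ALL "
    "  SELECT s.id, s.name, chain.depth + 1 "                     # recursive step: direct reports
    "  FROM staff AS s JOIN chain ON s.manager_id = chain.id"
    ") SELECT name, depth FROM chain ORDER BY depth, name"
).fetchall()
print(org_chart)
```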

Window functions for advanced analytics

For more advanced use cases, window functions will be needed to gain more control over the data you need to extract.

A window function in SQL is a calculation performed across a set of rows that are related to the current row. This is similar to an aggregate function, except that the rows are not collapsed into a single output row: each row is kept and gains a computed value.

With these functions, users can perform calculations over a specified range of rows, which is particularly useful when partitioning data. The user defines groups of rows, known as "windows", over which the calculations are performed.

With this level of granularity, window functions help to refine aggregation results and provide more insightful reports. Some popular examples of window functions include ROW_NUMBER, RANK, DENSE_RANK, LAG, and LEAD.

Let's have a look at what a ROW_NUMBER SQL code example looks like:

```sql
SELECT EmployeeID, OrderID, OrderDate,
      ROW_NUMBER() OVER(PARTITION BY EmployeeID ORDER BY OrderDate) AS RowNumber
FROM Orders;
```

This statement uses the ROW_NUMBER window function to number each employee's orders sequentially by order date, restarting at 1 for each employee.
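The sketch below runs the same ROW_NUMBER expression on a few invented rows and adds a running total per employee using SUM as a window function. The `Amount` column and the sample data are assumptions for illustration, and the example assumes the SQLite library bundled with Python is version 3.25 or later, which is when SQLite added window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (EmployeeID INTEGER, OrderID INTEGER, OrderDate TEXT, Amount REAL)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        (1, 100, "2024-01-01", 50.0),
        (1, 101, "2024-01-03", 20.0),
        (2, 102, "2024-01-02", 70.0),
    ],
)

# Every input row is kept; the window functions add per-partition values.
rows = conn.execute(
    "SELECT EmployeeID, OrderID, "
    "  ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY OrderDate) AS RowNumber, "
    "  SUM(Amount)  OVER (PARTITION BY EmployeeID ORDER BY OrderDate) AS RunningTotal "
    "FROM Orders "
    "ORDER BY EmployeeID, OrderDate"
).fetchall()

for row in rows:
    print(row)
```

Unlike GROUP BY, which would return one row per employee, the query returns all three orders, each annotated with its position and the cumulative amount within that employee's partition.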

Final thoughts

SQL is a powerful language, and its capabilities go far beyond just querying data from databases. With its versatile clauses and operations, it enables users to manipulate large datasets in various ways, providing the necessary data preparation needed for downstream data science applications like machine learning.

As clean data is always needed for accurate results in data science and AI, SQL will continue to play a crucial role in data science and analytics.

FAQs about SQL

What is SQL?
SQL (Structured Query Language) is a programming language designed for managing and querying data in relational databases.
Why is SQL important for data science?
SQL is important for data science because it allows data scientists to efficiently access, manipulate, and analyze large datasets stored in relational databases.
What are some common SQL operations used in data science?
Common SQL operations in data science include SELECT, JOIN, GROUP BY, ORDER BY, and filtering data using WHERE clauses.
Can SQL be used with other data science tools?
Yes, SQL can be integrated with other data science tools and programming languages such as Python and R to enhance data analysis workflows.
Is SQL only used for querying data?
No, SQL is also used for inserting, updating, and deleting data, as well as managing database structures.
