SQL is regarded as one of the most important skills for data engineers. You can see it appear in every job offer, every youtube video, and every blog. Why is it that data engineers are bound this technology? In this blog we answer that question.
Let’s talk about data
Before talking about SQL, we must talk about data. Data is getting generated everyday at an astronomical scale, and data engineers are at the front line of dealing with this data.
Depending on the company and it’s needs, the data engineer can deal with different types of data coming from different sources such as the Sales department ,the Marketing department, User Acquisition, Logging Systems, APIs, Application Databases, Web Scraping, Lead Generation.. and so much more. The data can be either Structured (such as RDBMS, CSV, Excel), Semi-structured (such as JSON, APIs, Logging system), or Unstructured (such as images, videos, files)
Data is stored in databases, it can be a Relational database (SQL Database), or a None-relational database (NoSQL Database).
SQL databases, also refered to as RDBMS (Relations Database Management System, store data in tables, each table is made out of rows and columns, that way the data is fully structured. One database can have many related tables (relations). Related tables have common column that we use to JOIN the data from one table to another.
Examples of SQL databases are: MySQL, PostgreSQL and SQL Server.
Another type of relational databases that we find on the cloud is what’s called a data warehouse. Think of these modern data warehouses are databases with powerful data storing and data retrieval capabilities.
NO SQL Databases:
NoSQL refers to “Not only SQL”, which are database systems that store and process data in different ways. The need of such systems became vital with the emerging of No structured data that are resulted from APIs, Logging Systems and so one.
Examples of NoSQL databases are: MongoDB, CassandraDB, and Redis.
Databases and SQL
Now we come to SQL. SQL or Structured Query Language is the the universal language we use to communicate with any Relational Database.
Databases allow us to store Billions of records, and SQL allows us to save those records, retrieve them, aggregate them and much more.
Note that different RDBMS can have slightly different SQL syntax. But the Over all language is very similar.
SQL in the job of a Data Engineer:
Data engineers are responsible of creating, designing, and implementing end-to-end data pipelines, as well as maintaining these pipelines, designing data warehouses
SQL in the world of data engineering
SQL can be crucial for data engineers to achieve the following tasks:
Implementing Data warehouses solutions:
Before saving moving the data to the data warehouse, The engineer first designs the warehouses usually following Ralph Kimball’s Dimensional modeling (Start Schema), or by following the OBT (One Big Table) approach. After that the engineer uses SQL to create all required database objects such as tables. Only then, one can start storing historical data.
SQL is also use after the data warehouse is implemented to create objects such as Views, Materialized Views, Procedures, Security Control… etc which usually makes it easy for other data professionals such as data scientists and data analysts to do their job, and keep the data safe.
Retrieve and Load data:
When we want to implement an ETL pipeline that would extract data from one an OLTP system (Transactions system such as application database) to be them transformed and stored in a OLAP system (Analytics system such as a Data Warehouse), We usually use a programming language with drivers. Those drivers use SQL. The Data engineers must define the SQL code for that.
We can also use Structured Query Language directly in the RDBMS GUI Editor which allows us to save a SELECT Query results.
Another approach is to use the Terminal to retrieve data using a DUMP command, or loading data using the COPY command. This method is simple and fast.
Cleaning data ELT
Data transformation can be done before the data is stored in the warehouse (ETL process), or when the data is already inside the database (ELT process).
When the data is already inside the database we can use SQL to transform it by applying filters, aggregation functions, handling Null values and much more.
Playing multiple roles
Sometimes, the data engineer play the role of data analyst in small companies. And with that comes the responsibility of developing dedicated Query codes to answer different business related questions.
As explained in the blog, SQL is necessary in the everyday life of a Data Engineer. The use cases of the technology are all over the place since it is the language of relational databases, One language to run them all!
Note that NoSQL databases don’t use SQL necessarily, But the good thing is that when is comes to reporting, storing historical data, and data analytics in general, we often use data warehouses to store the data, and all modern data warehouses use SQL.
Check out SQL Command Types.