1. What is ETL?
E = Extract data from a data sources such as a database, a CSV file, a Excel file, etc.
T = Transform the data extracted. It can include processing, cleaning, converting extracted data to new data model.
L = Load the data into a destination source such as a data warehouse, a database, a file, etc.
2. Data warehouse and BI solution
Data warehouse is a system used for reporting and data analysis, and is considered
a core component of business intelligence (BI) (ref:
https://en.wikipedia.org/wiki/Data_warehouse).
Nowadays, there are a lot of tools supporting you to build your BI solution such as:
Jaspersoft ETL,
Talend, etc. After trying some tools to build my BI solution, I see that a tool of Microsoft - SQL Server Integration Services (SSIS) is the one of the best tools for doing this. So I want to share my experiences on using SSIS in this post.
3. Setup SSTD for MS Visual Studio
To create SSIS packages, you must have SQL Server Data Tools (SSTD). It is a set of tools for MS Visual Studio. If you don't install MS Visual Studio (MS VS), let download
MS Visual Studio 2015 Community version (free), then download SSTD from here:
https://msdn.microsoft.com/en-us/mt186501.aspx
4. Create new Integration Services Project
After setting up SSTD for MS VS, you can create new Integration Services Project. In MS VS go to menu
File >> New >> Project, then expand
Templates >> Business Intelligence >> select Integration Services Project, like below picture:
5. Structure of an Integration Services Project and Tasks
Solution Explore (in right side) of a project is like below:
+
Project.params: containing parameters for the project. The parameters will be input when executing a SSIS package. They can be input from Environment Variables which are set up in a deployment environment. You can double click it and add new parameter.
+
Connection Managers: containing connection to data sources used in the project. You can right click to
New Connection Manager.
+
SSIS Packages: containing SSIS packages. A Package is a set of tasks and flows for an ETL purpose. You can right click to
New SSIS Package.
There are many tasks for
Control Flow of a package as you can see in the left side when opening it (see the following picture). Normally almost package starts with
Data Flow Task:
For another tasks, you can read here:
https://msdn.microsoft.com/en-us/library/ms139892(v=sql.120).aspx
6. Sample: merge an Excel file to a data table
Via this sample, you can know how to merge data (update existing and insert new one) from an Excel file with a data table.
Assuming that we have a data table
SSIS_Demo_Products as the following:
And its data:
Here is the content of the Excel file -
SSIS_Demo_Products.xlsx:
On MS VS, create new Integration Services Project (for example, you can name it as SSIS Demo) ==> then create New SSIS Package (for example, you can rename it to
MergeExcelFile.dtsx) ==> drag
Data Flow Task into
Control Flow of
MergeExcelFile.dtsx ==> double click Data Flow Task ==> drag
Excel Source and
OLE DB Source into the
Data Flow:
Double click
Excel Source ==> create New Excel Connection Manager:
Double click
OLE DB Source ==> create New OLE DB Connection Manager:
And configure other sections like below:
Note: in
OLE DB Source, we just need ID column to compare if the record is existing or not.
To compare and merge, Excel columns' data type must be same with columns of
SSIS_Demo_Products data table. Let drag
Data Conversion into for converting Excel data. Connect
blue line of
Excel Source with
Data Conversion, then double click
Data Conversion to configure it:
Now your data is ready to merge. But wait. We should sort data of both Excel file and source data table to boost up performance when merging. Drag 2
Sort tasks into this
Data Flow then connect
Data Conversion with one and
OLE DB Source with other. You can rename them to
Sort Excel and
Sort DB Source as below:
Here is the configuration of
Sort Excel:
and
Sort DB Source:
OK, now drag
Merge Join into ==> connect
Sort Excel to it:
Connect
Sort DB Source to
Merge Join ==> double click
Merge Join to configure it:
Note: we use
Left outer join and Join Key between converted.ID and ID
As result of
Merge Join, we will have a data with records having ID = NULL if records in Excel file doesn't exist in data source. So by using
Conditional Split task, we can split into New Record and Existing Record easily, let drag
Conditional Split into then configure it as below:
Yeah, it's time to load the data again
SSIS_Demo_Products table. Let drag
OLE DB Destination into then connect with
Conditional Split via
NewRecord output:
Configure
OLE DB Destination it to insert new record into
SSIS_Demo_Products table:
To update Existing Record, let drag
OLE DB Command into ==> with
Conditional Split via
ExistingRecord output ==> configure it as below:
Note: the SQLCommand as below:
UPDATE SSIS_Demo_Products
SET ProductID = ?,
ProductName = ?,
SellDate = ?,
UnitPrice = ?,
Quantity = ?,
Buyer = ?
WHERE ID = ?
The final diagram of
Data Flow is below, right click then
Execute Task then see data of
SSIS_Demo_Products table updated and inserted:
That's enough for Part 1. I'll publish Part 2 ASAP, in which I'll present how to Parameterize, Load Data Into Destination Using Look up and Deployment SSIS Package on SQL Server.
Let check it out! Happy coding!
Here is
Part 2.