Wednesday, November 16, 2016

ETL and Build Data Warehouse with SSIS - Part 1

1. What is ETL?

E = Extract data from a data sources such as a database, a CSV file, a Excel file, etc.
T = Transform the data extracted. It can include processing, cleaning, converting extracted data to new data model.
L = Load the data into a destination source such as a data warehouse, a database, a file, etc.

2. Data warehouse and BI solution

Data warehouse is a system used for reporting and data analysis, and is considered a core component of business intelligence (BI) (ref: https://en.wikipedia.org/wiki/Data_warehouse).

Nowadays, there are a lot of tools supporting you to build your BI solution such as: Jaspersoft ETL, Talend, etc. After trying some tools to build my BI solution, I see that a tool of Microsoft - SQL Server Integration Services (SSIS) is the one of the best tools for doing this. So I want to share my experiences on using SSIS in this post.

3. Setup SSTD for MS Visual Studio
To create SSIS packages, you must have SQL Server Data Tools (SSTD). It is a set of tools for MS Visual Studio. If you don't install MS Visual Studio (MS VS), let download MS Visual Studio 2015 Community version (free), then download SSTD from here: https://msdn.microsoft.com/en-us/mt186501.aspx

4. Create new Integration Services Project
After setting up SSTD for MS VS, you can create new Integration Services Project. In MS VS go to menu File >> New >> Project, then expand Templates >> Business Intelligence >> select Integration Services Project, like below picture:


5. Structure of an Integration Services Project and Tasks
Solution Explore (in right side) of a project is like below:
+Project.params: containing parameters for the project. The parameters will be input when executing a SSIS package. They can be input from Environment Variables which are set up in a deployment environment. You can double click it and add new parameter.
+Connection Managers: containing connection to data sources used in the project. You can right click to New Connection Manager.
+SSIS Packages: containing SSIS packages. A Package is a set of tasks and flows for an ETL purpose. You can right click to New SSIS Package.

There are many tasks for Control Flow of a package as you can see in the left side when opening it (see the following picture). Normally almost package starts with Data Flow Task:


For another tasks, you can read here: https://msdn.microsoft.com/en-us/library/ms139892(v=sql.120).aspx

6. Sample: merge an Excel file to a data table
Via this sample, you can know how to merge data (update existing and insert new one) from an Excel file with a data table.
Assuming that we have a data table SSIS_Demo_Products as the following:

And its data:
Here is the content of the Excel file - SSIS_Demo_Products.xlsx:

On MS VS, create new Integration Services Project (for example, you can name it as SSIS Demo) ==> then create New SSIS Package (for example, you can rename it to MergeExcelFile.dtsx) ==> drag Data Flow Task into Control Flow of MergeExcelFile.dtsx ==> double click Data Flow Task ==> drag Excel Source and OLE DB Source into the Data Flow:


Double click Excel Source ==> create New Excel Connection Manager:


Double click OLE DB Source ==> create New OLE DB Connection Manager:


And configure other sections like below:


Note: in OLE DB Source, we just need ID column to compare if the record is existing or not.

To compare and merge, Excel columns' data type must be same with columns of SSIS_Demo_Products data table. Let drag Data Conversion into for converting Excel data. Connect blue line of Excel Source with Data Conversion, then double click Data Conversion to configure it:


Now your data is ready to merge. But wait. We should sort data of both Excel file and source data table to boost up performance when merging. Drag 2 Sort tasks into this Data Flow then connect Data Conversion with one and OLE DB Source with other. You can rename them to Sort Excel and Sort DB Source as below:


Here is the configuration of Sort Excel:

and Sort DB Source:

OK, now drag Merge Join into ==> connect Sort Excel to it:
Connect Sort DB Source to Merge Join ==> double click Merge Join to configure it:

Note: we use Left outer join and Join Key between converted.ID and ID

As result of Merge Join, we will have a data with records having ID = NULL if records in Excel file doesn't exist in data source. So by using Conditional Split task, we can split into New Record and Existing Record easily, let drag Conditional Split into then configure it as below:

Yeah, it's time to load the data again SSIS_Demo_Products table. Let drag OLE DB Destination into then connect with Conditional Split via NewRecord output:
Configure OLE DB Destination it to insert new record into SSIS_Demo_Products table:


To update Existing Record, let drag OLE DB Command into ==> with Conditional Split via ExistingRecord output ==> configure it as below:

Note: the SQLCommand as below:
UPDATE SSIS_Demo_Products
SET ProductID = ?,
ProductName = ?,
SellDate = ?,
UnitPrice = ?,
Quantity = ?,
Buyer = ?
WHERE ID = ?


The final diagram of Data Flow is below, right click then Execute Task then see data of SSIS_Demo_Products table updated and inserted:

That's enough for Part 1. I'll publish Part 2 ASAP, in which I'll present how to Parameterize, Load Data Into Destination Using Look up and Deployment SSIS Package on SQL Server.

Let check it out! Happy coding!

Here is Part 2.

1 comment:

Subscribe to RSS Feed Follow me on Twitter!