SSIS – Running single SSIS packages in Parallel to get faster performance for multiple data loads
What we are going to show you below, is that by using ONE SSIS package, we can re-use this one SSIS package multiple times to get data from multiple sources, possibly from different databases hosted in different locations around the world. So in effect we are loading our data from one SSIS package in parallel.
NOTE: By completing the steps below we have seen our packages perform and complete a lot quicker due to running in parallel.
The thing to note with my example below is that all the database have exactly the same structure within the databases, where ever they are hosted.
Example Details
· Databases are hosted in different locations around the world.
o And there were multiple databases on each database system.
· All the databases have exactly the same table and schema structures.
· Within each hosting location we have different IP Addresses and database names.
Our requirement was to load the data as quickly as possible from our multiple locations. This is how we achieved this.
NOTE: I am not going to go into the details on how we got our connection details to connect to each system. If someone has a requirement for this then I can possibly put this into another blog post.
Creating your Control SSIS Package to run in Parallel
What I am going to explain below is how we use a control SSIS package, which controls how we run our SSIS packages in parallel.
1. The first thing that we did was to create the following variable called:
a. Name:
i. LoopQuitParent
b. Scope:
i. PackageName
c. Data Type:
i. Int32
d. NOTE: The reason that we create this is because of the following:
i. This is going to be used in the steps further down in our For Loop container to know when to exit the For Loop.
ii. It is also going to be used in the actual SSIS package that runs in parallel, and then it has completed to use a script task to populate this variable from our parent package.
2. As with normal data warehousing practice we first truncate our staging tables.
3. Next is where we have our sequence container.
a. It is within this sequence container that we put how many packages we want to run in parallel.
4. So within our Sequence container we put the following
a. First we dragged in a For Loop container inside our Sequence Container.
b. We then configured our For Loop container with the following:
ii. What we are saying above is when our variable from step 1 @LoopQuitparent not equals zero then finish or complete.
c. Next inside our For Loop Container we put an Execute Package Task
i. NOTE: This is our SSIS Package that is going to run multiple times or in Parallel.
ii. As per our example if you look at the Package Window on the left hand side you will see where we have configured our SSIS Package
d. Then all that you need to do is to copy and paste the For Loop Container as many times as you want to run in Parallel.
i. In our picture below we wanted to run it in parallel x 4
ii. NOTE: We just renamed each For Loop container so that we know how many we have.
iii. NOTE 2: It might be good to test how many will be the optimal number for you, as it is dependent on server resources, network interface and bandwidth to your locations of where the data is.
5. In the next steps we will explain how we configure our one SSIS package so that it can run in parallel as well as know when to exit.
Configuring our SSIS Package so that it can get the required details and run in parallel (SSIS-Parallel-LoadStaging.dtsx)
In the next steps I will explain how we get our one SSIS package (SSIS-Parallel-LoadStaging.dtsx) to get the required data into variables and then run in parallel.
1. It is going to be easiest to explain if I can show you an example of what our Source Table looks like below. This is where we store the following which will be explained after the picture for clarity.
b. SourceSystemSK
i. We use this to uniquely identify our Source Systems. This is used throughout our SSIS Packages and is put into our data warehouse and Cube so that we know from which system the data came from.
ii. NOTE: This also enables us to get an overview of a particular system when it has more than one database.
c. IPAddress
i. This is the actual IP address that we will pass as an expression in our Connection manager in order to connect to the SQL Instance.
d. SQLPortNumber
i. This is the standard SQL Port number, but if for some reason it was on a different port we would specify it here.
ii. NOTE: This can also get passed as an expression in our Connection manager.
e. IsProcess
i. This is used later within our SSIS package (SSIS-Parallel-LoadStaging.dtsx) and what we will explain later, is that when the row has been used and the data processed it changes the value from a 2 to a 1.
f. DatabaseName
i. This is the database name within the SQL Instance system we are connecting to.
ii. NOTE: This is gets passed as an expression in our Connection Manager.
g. Location
i. This is used in our dimensions later if we want to know where the actual data is coming from.
h. SQLInstanceName
i. This is the actual SQL Instance name that we are connecting to.
i. IsActive
i. This is part of our process, where we test to see if we can connect to the actual SQL Instance and databasename. If we can the IsActive is set to 1, and if not it is set to 0 (Zero)
ii. NOTE: We did not explain how we get the data into this table, as this would make this blog post a whole lot longer.
2. So what we do in our first step is to get the SourceSystemSK into a variable.
b. But we need to explain what else we do within our Execute SQL Task.
c. Below is the actual query with an explanation afterwards
— Section 1
DECLARE@ServerIDINT
— Section 2
BEGINTRAN
— Section 3
SETROWCOUNT 1
— Section 4
SELECT@ServerID=[SourceSystemSK]
FROM[dbo].[Mart_TD_SourceSystems]WITH (TABLOCKX,HOLDLOCK)
WHEREIsProcess= 2
— Section 5
UPDATE[dbo].[Mart_TD_SourceSystems]
WITH (TABLOCK)
SETIsProcess= 1
WHERE[SourceSystemSK]=@ServerID
— Section 6
SETROWCOUNT 0
— Section 7
SELECTISNULL(@ServerID, 0)asSourceSystemSK
— Section 8
COMMIT
i. — Section 1
ii. The first thing that we do is declare our @ServerID
iii. — Section 2
iv. Next we start a Begin Tran
1. NOTE: The reason for this is so that we can actually lock the table whilst this transaction is happening.
2. NOTE 2: This is so that when the SSIS Packages are running in parallel, we know for a fact that each package will only get a valid SourceSystemSK when trying to get another value. If it tries whilst another SSIS Package is busy it will wait until the transaction has been completed.
v. — Section 3
vi. We then set the RowCount to 1, this is so that we will only get one row back.
1. NOTE: we are doing this because we only ever want to get back one SourceSystemSK
vii. — Section 4
viii. Next is our TSQL Select statement
1. Here we are getting any SourceSystemSK from our table, where the IsProcess is set to 2.
2. We are then putting this into our @ServerID variable.
ix. — Section 5
x. Next we are updating our table using the TABLOCK hint so that we can once again lock the entire table, to ensure that for our SourceSystemSK row it gets updated from a 2 to 1.
1. NOTE: This then means that this data is and has been processed when the next package comes along looking for data to process.
xi. — Section 6
xii. We then set the RowCount back to 0 (zero)
xiii. — Section 7
xiv. Then we run a TSQL Select statement so that we can then pass our @ServerID as SourceSystemSK
1. NOTE: This is because we are using the SourceSystemSK and passing this into a variable in our Execute SQL Task.
2. NOTE 2: We also set the isNull to 0 (zero) so that when there are no more rows to process it defaults to zero, which is explained later.
xv. — Section 8
xvi. Finally we are then committing our transaction.
d. Within our Execute SQL Task we also map the result set to our variable called:
3. The next thing that we need to look into is our Exit Loop
b. NOTE: The reason that we have this is because we need to pass a variable from our child package back to our Parent Package.
c. So we put in a Script Task using Visual Basic 2010 and configured it with the following:
d. NOTE: Click on Edit script to open the Script Task editor
PublicSub Main()
‘
‘ Add your code here
‘
Dts.Variables(“LoopQuitParent”).Value = 1
Dts.TaskResult = ScriptResults.Success
i.
ii. What the above does is pass the Variable name LoopQuitParent value of 1 back to the Parent package.
iii. NOTE: This was configured in our For Loop Container in step 4 from the section above, where it was set to not equal to zero. So when it equals one then exit or complete
4. Next we creating another Execute SQL Task, and in this task it then uses a Query within a variable to get the IP Address and Database name for our server that we want to connect to.
b. NOTE: This is based on our SourceSystemSK variable which we populated in step 2 above.
c. This is what our Variable looks like, where we have configured it as an Expression
ii. As you can see from above when we run the above query, if our SourceSystemSK = 1 we would get the following details back
d. Then in our Result set for our Execute SQL Task we would have the following:
e. Now we currently have the following in our variables for our SSIS Package
i. SourceSystemSK, IPAddress and Database Name.
5. Now you need to create a local OLE DB Connection Manager in your SSIS Package.
a. NOTE: The reason that it must be a local connection, is so that when each package is running in parallel it will only populate the expressions for the current package.
b. With our example we created a local OLEDB connection with the following name:
i. ServerConnection.local-OLE
c. Next right click and go into the Properties for your ServerConnection.local-OLE.
d. Where it says Expressions click on the Ellipses button and configure it with the following:
i. Property:
1. InitialCatalog
2. NOTE: This is the actual DatabaseName
ii. Expression:
1. @[User::DatabaseName]
iii. Property:
1. ServerName
2. NOTE: This is the actual IPAddress to connect to the server.
iv. Expression:
1. @[User::IPAddress]
v. IT will look like the following:
e. Then click Ok.
f. NOTE: Ensure that your DatabaseName and IPAddress variables have valid default Values in your Variables
i. This is because when you put in your expressions above, it attempts to connect to the system to ensure that it is valid.
ii. This will save you a lot of time waiting for it to timeout and will ensure that it is indeed valid.
6. Now the next step is to put in what you want your SSIS Package to complete on each and every server.
b. With our example we would then get the Max RowID for each system which is stored in a separate table.
c. Within our table we simply have the Max RowID, SourceSystemSK and the date it got inserted.
d. Here is an example below:
f. This would then be in the Result Set
g. Then we have our Data Flow task which based on our Max RowID inserts the data into our Staging_tb_SalesInformation
7. Now the last thing to configure in this SSIS package (SSIS-Parallel-LoadStaging.dtsx) is the Precedence Constraint which can be seen in the picture below
b. The reason that we have this is so that we know for our package when it has completed going through our entire list of servers (and table with the connection details) and populate our LoopQuitParent variable back to our Parent Package, and let that particular For Loop Container complete.
c. NOTE: What we did in step 2 above is your will see that we set an IsNull in section 7 and set this to 0 (zero).
d. This is now used in our Precedence constraint and configured with the following for the “Get the IPAddress and DatabaseName into Variable based on SourceSystemSK” which is on the left hand side.
i. We configured the Precedence Constraint with the following:
iii. And here what we are saying is if we have any value greater than zero then get our values, and get the data down.
e. This is now used in our Precedence constraint and configured with the following for the “Exit Loop” which is on the right hand side.
i. We configured the Precedence Constraint with the following:
iii. And here what we are saying is when there are no more values for the SourceSystemSK, which will be set to zero, then exit out and populate the LoopQuitParent so that this can go back to the parent package.
8. Now we have configured our SSIS Package which can run in parallel.
9. This is what the entire package looks like
Final Notes
· Now when you run the Control package, and it runs, you will see all your For Loop containers running at once.
· The thing that you will not essentially see is if you have configured it to run in parallel with 4 SSIS packages, you will only see one running, but actually all 4 are running at once.
· Once it has all completed you will see your For Loop Container complete with the green arrow.
o You can then go into your Staging_tb_SalesInformation table and validate it to see that you have got all your required information from your Source Systems.
If you have any questions or queries then please do not hesitate to contact me.
Wow what a great write up. This is exactly what I needed. What I will like to do is to have an outer loop container that call a list of ELT packages that then execute the multiple connection string in parallel. Will report back on my progress.
Thnaks!
Great that you found it informative, as well as something that can help you in your job.
If you get stuck or got any questions please let me know.
Thanks
Thanks
Hi Glibert,
Would it be possible to download this project from some where ?
Thanks,Jerome
Hi Gilbert,
Is this correct : ii. What we are saying above is when our variable from step 1 @LoopQuitparent not equals zero then finish or complete.
Should be “equals zero” instead of “not equals zero” ?
Regards,
Jerome
Hi Jerome,
The logic is saying it will be completed when all he rows in the table are not equal to zero. So they will have any other value besides zero. The reason for this is when we are updating the table, we are changing the value from 2 to 1.
Hi Gilbert,
Great post there and an awesome logic.. Nails what i need right on the head. I have quick a number of servers to connect to over two dozen of them across the Globe all same structure except for their names and IP… So this helps alot. So with this each table .
Can you share how you get data into the table with IP and all. And how do u set the ‘IsActive’ flag.
NOTE: We did not explain how we get the data into this table, as this would make this blog post a whole lot longer.
Kind regards,
Freddie.
Hi there Freddie,
What I did was I simply had an Excel file in which all the details were stored. I then used SSIS to bring the data in from the Excel file into the database table. Or what I did on another project is the DBA team had a database with all the locations and IP addresses of the database servers. I then simply got access to this table and used it again to get the data into my table.
I hope that it helps?
Very nice. I have something similar but I control the foreach loop on the child package. For example I have 100 files to load and 4 parallel packages. The files are split by parallel packages, so 25 files each in this example. In theory the 4 packages will take a similar time because the 100 files are split evenly by size
I’m trying to understand if you logic could improve my process… basically you dont split the workload at the beginning, the parallel packages pick 1 file by 1 file and then when there are no more, the parallel packages exit the loop. Is that correct?
Yes it could improve the overall time, the reason I developed it this way, is because I had some systems which would complete very quickly and others that took longer.
If I had to divide them evenly, and a few were all together that process would take significantly longer, because it would have to wait for each one to complete.
But by just taking the next one in the list, the fast ones will finish quicker, and the next one can continue in the execution.