Power BI – Quick Tips for Working with Multiple Large CSV Files
I have recently been working a lot with multiple large CSV files, and thought it would be good to share what I have found that makes the process run faster. I am always looking for ways to make sure that not only the model, but also the development experience, is efficient.
Name your query first before applying steps in the Query Editor
I have found that when working with CSV files, if you rename your query in the Query Editor, it will then re-run all of the Applied Steps in your query. In my case this meant waiting a few minutes for a simple rename to take effect.
So always remember to give your query a name first, before applying any additional steps.
If possible, combine all your separate CSV files into ONE file
I found that when using the Folder connector to import my CSV files, putting all the data into one file loaded significantly faster.
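As a rough sketch, the single-file query then looks something like this in the Advanced Editor (the file path, delimiter and encoding below are just assumptions for the example):

    let
        // Load the one pre-combined CSV file instead of using the Folder connector
        Source = Csv.Document(File.Contents("C:\Data\Combined.csv"), [Delimiter = ",", Encoding = 1252]),
        // Use the first row as the column headers
        PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
    in
        PromotedHeaders

With a single file there is no per-file combine function to invoke, which should be part of why it loads faster.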
Disabling Background Data in Options
When working with multiple tables that all load from the same CSV files, I found that Power BI Desktop would start trying to refresh data in the background, which would either make the Query Editor I was working in very slow, or force me to wait for the background data preview to complete. Turning this off means you only refresh the data you are working on.
To disable this, click on File, then Options and settings.
Then click on Options.
Then go down to the section called CURRENT FILE and click on Data Load.
Now, in the right-hand pane under the Background Data section, remove the tick from Allow data preview to download data in the background.
Then click OK.
To make sure the change takes effect, I would suggest closing and re-opening Power BI Desktop.
Is there any documentation or experience on exactly what the different data load options do in terms of performance, particularly when loading a .pbix file from scratch? The use case is two years of locally stored daily 30MB CSV files, each containing 100,000 records. The final queries are broken down into historical data (not refreshed) and current data (for the last month, refreshed daily), but even this takes two hours on a 16GB machine. Any suggestions greatly appreciated.
Hi there Patrick
All the suggestions in the blog post really do help. If possible, I would suggest putting all your historical data into one CSV file, which should increase the loading performance for the historical data. Another option is to create one table called Historical which has all the historical data, then right-click on the table and de-select “Enable Load” and “Include in Report Refresh”. De-selecting “Include in Report Refresh” means the historical data is loaded once and then left out of subsequent refreshes, while de-selecting “Enable Load” stops the query itself from being loaded into the data model as a separate table. Then have another table which has your current data.
Finally, append (combine) the two tables together into a single query, which should hopefully improve the loading performance.
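As a minimal sketch in M, assuming the two queries are named Historical and Current and share the same columns, the combine step is just:

    let
        // Stack the once-loaded Historical rows underneath the refreshed Current rows
        Combined = Table.Combine({Historical, Current})
    in
        Combined

Table.Combine simply appends the rows of the listed queries, so the column names in the two tables need to match.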
Another thing I did while developing was to ensure that the files were as close to (or local to) the Power BI Desktop application as possible, to avoid slow transfers across the network. If the files are on a server, make sure that you connect via the LAN and, potentially, that the server has Jumbo Frames enabled.
Also, do you have a lot of transformations happening after you import the data? This could also slow things down, which you will see as high CPU usage. If the CPU is running at or near 100%, that could be another bottleneck to look at.
And finally, there is the option of installing an On-Premises Gateway on a server (with the Power Plan set to High Performance). Ideally the server would have decent CPUs, memory and connectivity to the source data, because when you refresh via the On-Premises Gateway, all the refresh work happens where the Gateway is installed.
Sorry, one last point: if possible, an alternative is to load the CSV data into a SQL Server database, where you can manage the data better using SSIS, and then simply consume it in Power BI using SQL Server as the source. SQL Server and SSIS would be a lot more efficient not only at loading the data, but also at making sure only the new data is loaded and not the older data.
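Once the data is in SQL Server, consuming it in Power BI is a very short source step in M; a sketch, where the server, database, schema and table names are all placeholders:

    let
        // Connect to the SQL Server database that SSIS keeps loaded
        Source = Sql.Database("MyServer", "MeterData"),
        // Navigate to the table that holds the CSV data (placeholder schema and name)
        Readings = Source{[Schema = "dbo", Item = "Readings"]}[Data]
    in
        Readings

An added benefit is that simple filters and groupings on a SQL Server source can fold back to the database, rather than running in the Query Editor.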
Also, if you are looking at Power BI Premium, it is going to have an option for incremental loads (coming soon), which will also help with loading your files.
Absolutely agree with splitting and only loading current data – I have already tried breaking the files into monthly gulps, but it seems that about 10 files (approx. 1M rows and 300MB) is the practical limit within the constraints of RAM and CPU locally. I’m doing all of this on the local box until I get the balance of performance and response right – then I plan to move the two data sets to the cloud so the historical data will never have to go through the query transformations again!
It is good to know that you have gone through the pain of testing them out as individual files. One thing that I have not tested is whether the performance is the same, better or worse with Excel files. No idea, but I wonder if it would be worth all the effort.
Absolutely – thanks for that!
I should have also pointed out that the 30MB daily CSV actually contains 2 effective data tables (once transformed) – so it might be a case of separately pre-processing (multiple times) in Excel, up to the limit of 1M rows, as a one-off for the historical data…
Cheers
Pat
No worries, I do hope that it has helped.
Regards
Gilbert
Hi Gilbert,
I spent some time over the weekend diagnosing and performance-testing various approaches to dealing with the CSV files, which you may be interested in.
The issue seems to be related to the size of the aggregated CSV file when applying transformations (grouping and extracting the last record of each group) within the query.
If the aggregated CSV file is >2GB and transformations are applied, the 16GB RAM PC takes forever to perform the transformation (to the point that I killed the process after 4 hours).
If the aggregated CSV file is >2GB but no transformations are applied, the file seems to process approx. 10x faster than processing the individual CSVs.
I’m not sure why the magic number seems to be 2GB – maybe something to do with the copying and storage of data while a transformation is applied?
Any thoughts appreciated – I can share the public data source and my query steps if needed.
Cheers
Pat
Hi there Pat,
The reason is that when you apply transformations, the Query Editor has to apply them across the entire file, potentially for each and every row.
Whereas with no transformations, it is typically just loading and compressing the data.
What transformations are you doing within your data?
What I would suggest is that the Query Editor is not very good at grouping data or at finding the last record in a group. There is often a different way to get the same output, especially if you are looking to aggregate and create measures from your data, as you can do this using DAX and possibly with calculated tables (which are in-memory only).
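For reference, the group-and-take-last pattern you describe typically looks something like this in the Query Editor (the file path and column names are assumptions), and it is this grouping step that forces the engine to materialise the whole file:

    let
        // Load the aggregated CSV (path, delimiter and encoding are assumptions)
        Source = Csv.Document(File.Contents("C:\Data\Combined.csv"), [Delimiter = ",", Encoding = 1252]),
        Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
        // Group on the key and keep only the last row of each group –
        // the step that cannot stream and becomes very slow on a >2GB file
        Grouped = Table.Group(Promoted, {"MeterID"}, {{"LastReading", each Table.Last(_), type record}})
    in
        Grouped

Doing a plain load instead and moving the group-by into DAX avoids paying this cost in the Query Editor.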
Always ready to try something different, Gilbert – I think you are right that if we apply the group-by and get the last record in DAX in-memory, the result may be much better.
Could you send me an email address so I can point you in the direction of the data and my query steps?
Cheers
Pat