Did you know you can query data from AWS S3 into Power BI?
I really enjoy a good challenge, and with my customer they have all their data stored in AWS S3. Whilst there is no native connector, I thought there must be a way for me to get the data from AWS S3 into Power BI.
I did a bit of Googling and could not find any suitable solution. I also found and learnt that I could use AWS Athena to query the data living in S3. (I am definitely NOT an expert of have a lot of knowledge in the AWS space. I am fortunate that I have other people who know AWS and were able to setup, configure and give me the details to connect to S3 via AWS Athena)
Below are the steps on how I got this working.
Connection Details required
Before I could complete the steps below, I did require the following information from the people who had setup AWS Athena
- AWS Region
- Schema
- Workgroup
- S3 Output Location
- Encryption Options
- User (Access Key ID)
- Password (Secret Access Key)
Installation and configuration of Athena ODBC Driver
The first step is to download and install the driver from the link below.
NOTE: You can find the latest version of the driver here: Connecting to Amazon Athena with ODBC
Configured 64bit ODBC using the Simba Athena Driver
-
Opened the ODBC Data Source Administrator (64-bit)
-
Clicked on Simba Athen and then configure
-
I then put in the following details
- AWS Region
- Schema
- Workgroup
- S3 Output Location
- Encryption Options
-
-
Next, I needed to update the Authentication Options for the authentication
- The user (in my working example was called the “Access Key ID”
- The password (in my working example was called “Secret Access Key”)
-
The final step in order to Power BI to query the data efficiently was to then click on Advanced Options
- I changed the Rows to Fetch Per Block to be 1000
- I then removed the tick from “Use Resultset Streaming”
-
Once I had put in all the details, I then clicked on Test . . . to test and make sure that the connection was successful.
- And it was successful
Connecting to Athena in Power BI Desktop
Below are the steps to connect to Athena in Power BI Desktop
-
I opened Power BI Desktop, then clicked on Get Data and selected ODBC
-
On the From ODBC Source I clicked on the Data source name (DSN) and selected Simba Athena
- I then clicked Ok
-
On the ODBC Driver window, I clicked on Database and put in the following
- The user (in my working example was called the “Access Key ID”
- The password (in my working example was called “Secret Access Key”)
- I then clicked Connect
-
I was able to then see the Navigator and select the area of the data that I wanted to connect to
- I could then load my data either into the Power Query Editor to transform it or load it into the data model.
- Once I was happy, I then uploaded the PBIX to Power BI
Configuration the Power BI Service Gateway to connect to Athena
Below are the steps to configure the Gateway in the Power BI Service to connect to Athena
- I logged into the Power BI Service and clicked on Managed Gateways
-
Then clicked on Add New Data Source
-
I then gave it a Data Source Name, and changed the Data Source Type to be ODBC
-
Then I put in the following into the Connection string
- “driver=Simba Athena ODBC Driver;dsn=Simba Athena; data source=AwsDataCatalog”
-
From the above I am using the details from the ODBC window as shown previously.
-
I also made sure to put in my data source name, which once again I had selected previously when connecting to Athena
-
Then for the Username and password I put in the following:
- The user (in my working example was called the “Access Key ID”
- The password (in my working example was called “Secret Access Key”)
-
And this is what it looked like
-
This is the ConnectionString Used
“driver=Simba Athena ODBC Driver;dsn=Simba Athena; data source=AwsDataCatalog;RowsToFetchPerBlock=1000”
-
- I then clicked on Add
-
I could then see that the connection was successful
-
The final step was for me to map the dataset to the Gateway.
- I went into the dataset settings and then under Gateway connection I selected the Gateway server and mapped it to “Athena” (The data source I created previously)
- I then clicked Apply
-
I could then see the it was successfully updated.
-
I then manually refreshed the dataset to make sure it is working as expected, which I could see that the refresh was successful.
[…] Gilbert Quevauvilliers loves a challenge: […]
Wow, congrats for figuring it out, this is not for the faints of heart! I’d be concerned however about performance in production. What kind of dataset sizes did you test this with?
Thanks Olivier.
The performance is actually pretty good. I think it also depends on how Athena has been setup and the underlying dataset size. My dataset that I actually am using is a few CSV files.
Thanks for this.
I have a question: according to the AWS pricing calculator (https://calculator.aws), Athena is charged at total number of queries per day/week/month and the data amount scanned per query in MB/GB/TB.
How is the cost calculated when you’re running Power BI reports via Athena ODBC? Let’s say I have 200 users who will run a report at least 3 times a day. The datasource(s) are are 3 CSV files in an S3 bucket
Hi there
I have found that the costs are really small when querying from Athena. I would suggest doing a test and see how much that costs and then take it from there.
You could also use a dataflow if all the users are using the same 3 CSV files to save on refreshing the same data in duplicate.
Hi, thanks for the post!
I followed your instructions but my tests with the connector are insanely slow.
The schemas and tables in the Navigation menu take ~10 minutes to load (sure, that can be circumvented via Power Query editor if I have the pattern and know which tables I want, but still)
I tried using your suggested ODBC settings (no result set streaming, # rows to fetch) but loading tables is still insanely slow.
Even a tiny, thin table with 10k rows took a whopping 7min30s to refresh. Haven’t even bothered with larger tables due to how long it takes to load.
Do you have any ideas on how to improve on this? We’re looking into testing in other environments, but I still feel like it shouldn’t be this bad in my machine.
I have the latest version of the connector (1.1.3) and am on v2.84.981.0 (aug/2020) of Power BI
Hi there
What I have found is that at times it can be slow from S3. This would be due to the way Athena is designed and is more used for storage than query performance.
For our larger datasets we have put it into Redshift or another database in AWS which improves the performance.
Hi, I followed the above steps but after successful connection with athena, powerBI is showing empty AWSDataCatalog. I think there is some problem
in Schema field value. Can you suggest me what will be schema name if my file is located at S3://buckateName/folder/file.parquet?
Hi Vipul,
You will need to chat to your AWS Admin to ensure that Athena has been configured to read from the required buckets.
And that your account has access to read from the same bucket?
Hey Vipul,
Have you figured it out? I have the exact same problem.
Thanks,
Nora
Hi,
Did you install the Data Management Gateway on an Amazon EC2 instance in the VPC ?
(as per whitepaper: https://d1.awsstatic.com/whitepapers/using-microsoft-power-bi-with-aws.pdf?did=wp_card&trk=wp_card)
Hi there
I used this section of the link you sent through
“Connecting the Microsoft Power BI service to AWS data sources”