Kellogg Data Cloud
Overview
The Kellogg Data Cloud (KDC) is an Amazon Athena platform that hosts various Kellogg datasets. Some of these were previously stored in our on-premises Microsoft SQL Server (the Kellogg Data Center). To check whether a dataset is available on this platform, see the “Data” tab.
For access, please contact us at rs@kellogg.northwestern.edu. Note that access is granted to each specific database and not to the platform as a whole.
Once you have access, you can use the data through one of the following methods:
- On the AWS Console
- On KLC through an ODBC connection
On the AWS Console:
Gaining access
Request access to a specific dataset by contacting rs@kellogg.northwestern.edu
Logging into AWS
Use the NUIT-provided link (https://www.it.northwestern.edu/support/login/aws.html) to log in to AWS with your NetID credentials.
- Select “ksm-rch-data” and then find the database you’d like to query.
- Select “Management console.”
Access restrictions
Once logged in, you only have access to Athena. Please note that this account will not grant you access to any other AWS tool.
AWS Athena setup
- From the “Search” field, navigate to Athena.
- Adjust your workgroup to match the database name.
- Check the Region in the upper-right corner of the page and make sure it is set to US East (Ohio), us-east-2.
Explore your dataset and submit queries
After setup, you can submit SQL queries directly from the Athena console.
Query Limits
Note that most AWS databases have a daily query limit of 2 TB. Please contact rs@kellogg.northwestern.edu if you need to increase this limit.
Download Results
After your query is complete, there are multiple ways to download the results:
- Download directly by clicking the “Download results” button on the screen
- Go to the “Recent queries” tab to download results for a specific query
On KLC:
Gaining access
Request access to a specific dataset by contacting rs@kellogg.northwestern.edu
Logging into KLC
Follow the instructions here (https://www.kellogg.northwestern.edu/research-support/computing/kellogg-linux-cluster/connect.aspx) to log in to KLC through any method you prefer.
Locate and copy AWS credentials
Navigate to the AWS login page here: https://www.it.northwestern.edu/support/login/aws.html
- Select “ksm-rch-data” and then find the database you’d like to query.
- Select “Command line and programmatic access.”
- Copy your temporary “AWS credentials file” from Option 2.
Note that these credentials will need to be updated every few hours.
Create a credentials file on KLC
After copying your credentials, create a hidden “.aws” folder in your KLC home directory. Within that folder, create a “credentials” file that contains the copied contents.
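As a minimal sketch of this step, the folder and file can also be created from Python instead of by hand. The function name `save_aws_credentials` is hypothetical, and restricting the file to its owner is a common precaution rather than a documented KLC requirement. The text passed in is assumed to be exactly what you copied from the AWS login page.

```python
from pathlib import Path
from typing import Optional


def save_aws_credentials(credentials_text: str, home: Optional[Path] = None) -> Path:
    """Write the copied credentials block to <home>/.aws/credentials."""
    home = Path.home() if home is None else Path(home)
    aws_dir = home / ".aws"
    # Create the hidden folder if it does not exist yet (owner-only access).
    aws_dir.mkdir(mode=0o700, exist_ok=True)
    cred_file = aws_dir / "credentials"
    # Overwrite any older (possibly expired) credentials with the new block.
    cred_file.write_text(credentials_text.strip() + "\n")
    # Assumption: keep the file readable and writable by the owner only.
    cred_file.chmod(0o600)
    return cred_file
```

Calling `save_aws_credentials(copied_text)` with no second argument writes to your own home directory. Because the credentials are temporary, the same call can be repeated whenever you copy a fresh set.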
Use the AWS command line interface (CLI)
Load the AWS command line tool with:
module load awscli/2
Check that your credentials work by displaying the S3 buckets you can access:
aws s3 ls --profile <account profile>
Please replace <account profile> with the name of your database account profile.
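If you are unsure what your profile is called, it is the name in square brackets at the top of the credentials block you copied. The sketch below lists those names with Python's standard `configparser` module. The function name `list_aws_profiles` is hypothetical, and the default path assumes the credentials file is in the `.aws` folder of your home directory.

```python
import configparser
from pathlib import Path
from typing import List, Optional


def list_aws_profiles(credentials_path: Optional[Path] = None) -> List[str]:
    """Return the profile names (the [bracketed] headers) in a credentials file."""
    if credentials_path is None:
        credentials_path = Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    # read() silently skips a missing file, so an absent file yields [].
    parser.read(credentials_path)
    return parser.sections()
```

Any name returned by `list_aws_profiles()` can be passed to the `--profile` option shown above.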
Set up the ODBC environment
Set the two environment variables that point to the ODBC configuration files with:
export ODBCSYSINI=/kellogg/software/.odbc/<database_name>
export ODBCINI=/kellogg/software/.odbc/<database_name>
Please replace <database_name> with the workgroup name you were given. Do not put spaces around the equals sign, or the shell will not set the variable.
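Before connecting, it can help to confirm that both variables are set and point to a path that exists. The following is a small sketch of such a check. The function name `check_odbc_env` and its messages are hypothetical, and it only inspects the variables without testing the ODBC driver itself.

```python
import os
from typing import List, Mapping, Optional


def check_odbc_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of problems; an empty list means both variables look fine."""
    env = os.environ if env is None else env
    problems = []
    for name in ("ODBCSYSINI", "ODBCINI"):
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif "<" in value or ">" in value:
            # The placeholder from the instructions was not replaced.
            problems.append(f"{name} still contains a placeholder: {value}")
        elif not os.path.exists(value):
            problems.append(f"{name} points to a missing path: {value}")
    return problems
```

Running `print(check_odbc_env())` in the same session where you ran the `export` commands should print an empty list if everything is in place.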
Query limits
Note that most AWS databases have a daily query limit of 2 TB. Please contact rs@kellogg.northwestern.edu if you need to increase this limit.
Connect from code
Connect to your Athena database through any software platform you prefer. Sample files for writing queries are available here:
/kellogg/software/aws_odbc_samples
Python
Load the version of Python you would like to use. Then modify the athena_odbc.py file with your preferred database name and table name. Run the file with:
python athena_odbc.py
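For orientation, an ODBC query from Python might look roughly like the sketch below. This is not the contents of athena_odbc.py. It assumes the `pyodbc` package is available in the Python version you loaded and that a data source name (DSN) is defined in the ODBC files configured earlier. The function names and the `dsn` argument are hypothetical.

```python
def build_preview_query(database: str, table: str, limit: int = 10) -> str:
    """Build a SELECT that previews a table by returning only a few rows."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # Athena accepts double-quoted identifiers in SELECT statements.
    return f'SELECT * FROM "{database}"."{table}" LIMIT {int(limit)}'


def run_query(dsn: str, sql: str):
    """Run a query over ODBC and return all rows.

    Assumes pyodbc is installed and that `dsn` names a data source
    defined in the ODBC configuration (both are assumptions).
    """
    import pyodbc  # imported lazily so the helper above works without it

    connection = pyodbc.connect(f"DSN={dsn}", autocommit=True)
    try:
        cursor = connection.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        # Always release the connection, even if the query fails.
        connection.close()
```

A call would then take the form `run_query("<dsn_name>", build_preview_query("<database_name>", "<table_name>"))`, with each placeholder replaced by your own values.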
R
Load the version of R you would like to use. Then modify the athena_odbc.R file with your preferred database name and table name. Run the file with:
Rscript athena_odbc.R
Stata
Load the version of Stata you would like to use. Then modify the athena_odbc.do file with your preferred database name and table name. Run the file with:
stata-mp -b athena_odbc.do