Using Glue jobs to create Athena tables

Shantanu Oak
3 min read · Oct 26, 2020

Whenever I have a CSV file, my go-to solution is pandas.

This command reads just the first 1000 rows of a very large file:

import pandas as pd

df = pd.read_csv('/tmp/pigl_user_logs_sept2019_feb2020_updated.csv', nrows=1000, engine='python', error_bad_lines=False)
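
With nrows, pandas only samples the file. To scan all of it without exhausting memory, a chunked read works; here is a minimal sketch, assuming the "destination" column that appears in the Athena query later on:

import pandas as pd

# Scan the whole file in pieces instead of loading it into memory at once.
# The chunk size is arbitrary; "destination" is the column queried in Athena below.
total = 0
for chunk in pd.read_csv('/tmp/pigl_user_logs_sept2019_feb2020_updated.csv', chunksize=100000):
    total += (chunk['destination'] == 156193).sum()
print(total)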

The only problem with pandas is that it cannot load a very large file into memory in one go, and even a chunked scan is slow. In this case I am using a 6 GB file! Where do I store my files? The go-to solution for storing files is S3. These are the simple commands to download and uncompress any file from S3…

# aws s3 cp s3://tpexport/pigl_user_logs_sept2019_feb2020.csv.gz .

# gunzip pigl_user_logs_sept2019_feb2020.csv.gz
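
If you prefer to stay in Python, the same two steps can be done with boto3 and the standard gzip module. A sketch, using the bucket and key from the commands above:

import boto3
import gzip
import shutil

# Download the compressed export from S3 and gunzip it locally.
s3 = boto3.client('s3')
s3.download_file('tpexport', 'pigl_user_logs_sept2019_feb2020.csv.gz',
                 'pigl_user_logs_sept2019_feb2020.csv.gz')
with gzip.open('pigl_user_logs_sept2019_feb2020.csv.gz', 'rb') as src, \
        open('pigl_user_logs_sept2019_feb2020.csv', 'wb') as dst:
    shutil.copyfileobj(src, dst)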

I can query the data right there at the Linux prompt using a command like this:

# grep '156193' pigl_user_logs_sept2019_feb2020.csv | wc -l
Returns: 38

The command above can take several minutes, if not hours, on a file this size. The same count comes back from a SELECT statement in Athena:

SELECT count(*) FROM "myglue"."myglue-glueathena" WHERE destination = 156193
Returns: 38

The CloudFormation stack name "myglue" becomes the database name, and the S3 folder name "glueathena" shows up in the table name "myglue-glueathena"! The naming convention is awesome.
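
The same query can also be fired programmatically with boto3. A sketch; note that the OutputLocation is my assumption, since Athena needs some S3 path to write its results:

import boto3

athena = boto3.client('athena')
response = athena.start_query_execution(
    QueryString='SELECT count(*) FROM "myglue"."myglue-glueathena" WHERE destination = 156193',
    QueryExecutionContext={'Database': 'myglue'},
    # Athena always writes results to S3; this output path is illustrative.
    ResultConfiguration={'OutputLocation': 's3://tpexport/athena-results/'},
)
print(response['QueryExecutionId'])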

Athena returns almost any query in a few seconds. Join queries are equally fast, and those are simply not possible with any Linux command. The only problem I faced with Athena was paragraph marks, i.e. newlines embedded in quoted fields. For example, I have a column called "message" containing text like

“Dear Customer,
Thanks for your business.”

The "new lines" between quotes need to be removed, using a command like this…

gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' pigl_user_logs_sept2019_feb2020.csv > pigl_user_logs_sept2019_feb2020_updated.csv
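
The trick here: RS='"' splits the file on quote characters, so every even-numbered record is text inside quotes, and its newlines get stripped. If gawk is not handy, Python's csv module does the same cleanup, since it parses quoted fields that span multiple lines. A sketch of the same idea:

import csv

# csv.reader understands newlines inside quoted fields,
# so each multi-line "message" arrives as a single field.
with open('pigl_user_logs_sept2019_feb2020.csv', newline='') as src, \
        open('pigl_user_logs_sept2019_feb2020_updated.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # Replace embedded newlines with a space so no words run together.
        writer.writerow([field.replace('\n', ' ') for field in row])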

The new, "clean" file should then be uploaded to S3:

aws s3 cp pigl_user_logs_sept2019_feb2020_updated.csv s3://tpexport/glueathena/

You can now click on the following button to launch a CloudFormation stack that sets up a Glue crawler to create the Athena table in your account.

crawler to create athena table

You will have to provide the full path where the updated file is stored, e.g. "s3://tpexport/glueathena/". Make sure that there are no other files in that location. If there are several files in that sub-folder, they should all have the same number of columns, because they will become part of a single table!

Once the stack is created, go to the Crawlers page in the Glue console and click the "Run crawler" button.
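
Under the hood, the stack boils down to a Glue crawler pointed at that S3 folder. If you would rather not use CloudFormation, here is a boto3 sketch of the equivalent; the crawler name and IAM role are my assumptions:

import boto3

glue = boto3.client('glue')
glue.create_crawler(
    Name='glueathena-crawler',          # illustrative name
    Role='AWSGlueServiceRole-myglue',   # assumed IAM role that can read the bucket
    DatabaseName='myglue',
    Targets={'S3Targets': [{'Path': 's3://tpexport/glueathena/'}]},
)
glue.start_crawler(Name='glueathena-crawler')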

You could also use the table creation wizard in the Athena console, or a plain CREATE TABLE statement, but both require you to know the column names and their types up front, which is not practical for very large files stored in S3. Glue reads the first few lines of the file and decides the column names and types for you, and that saves a lot of time. You just upload the curated file to any S3 location and run the CloudFormation template mentioned above to create the Athena table quickly.
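
To see what the crawler inferred, you can pull the schema back out of the Glue Data Catalog. A sketch, using the database and table names from the query above:

import boto3

# Print the column names and types that the crawler detected.
glue = boto3.client('glue')
table = glue.get_table(DatabaseName='myglue', Name='myglue-glueathena')
for column in table['Table']['StorageDescriptor']['Columns']:
    print(column['Name'], column['Type'])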

Once you are done querying the Athena table, delete the CloudFormation stack to remove the table from Glue and Athena. Your S3 files are not deleted even when the Athena tables are removed.
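
Deleting the stack is a single CLI call, assuming the stack was named "myglue":

# aws cloudformation delete-stack --stack-name myglue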

Life is easy with AWS! Thank you, Athena, for solving the big data problem once and for all.
