This is a simple guide to using boto3 to perform various operations on an S3 bucket.
How do you use Python to get data, upload data, and search for an object in S3?
If you are puzzled by these questions, then this post is for you. Is there an easy way to do this? When I first started using Amazon S3 with Python, I came across several different approaches that rather confused me. Here I will present the easiest, most straightforward way, the one I use now.
What do you need?
You need something called "Boto3". What is it?
Boto3 is a Python SDK/library that can be used for managing Amazon EC2, Amazon Simple Queue Service (SQS), Amazon S3 buckets, DynamoDB, CloudWatch, AWS Key Management Service (AWS KMS), and many more. It's essentially a way to access the different AWS services from Python scripts.
In this guide, we are going to look only at the operations that can be performed on AWS S3 using Python scripts.
Setup:
a. On the terminal/command prompt, type "pip list" to list the installed packages. If Boto3 is not installed, install it by typing "pip install boto3".
b. Make sure you have already added your AWS credentials, i.e. the access key and secret key, using the AWS CLI. Link to set up the AWS CLI
Note: Step 'b' is very important. If the credentials are not set in the configuration file as described in the link above, you cannot access the S3 bucket.
Coding:
Just as in MongoDB or SQL, where we first open a connection to the database, here we need to define a client. This lets Boto3 know which service our Python script is going to access.
To initialize Boto3 for the service we intend to use, specify the name of the service within the parentheses. For example, to access "S3" services, define the connection as "boto3.client('s3')"; to access "CloudWatch" services, define it as "boto3.client('cloudwatch')"; and similarly, to access "EC2" services, use "boto3.client('ec2')".
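A minimal sketch of this initialization (the variable names are just for illustration):

import boto3

# One client per AWS service you want to talk to
s3 = boto3.client('s3')                  # Amazon S3
cloudwatch = boto3.client('cloudwatch')  # Amazon CloudWatch
ec2 = boto3.client('ec2')                # Amazon EC2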
Operations performed on S3 using Python and the boto3 library:
Before anything else, don't forget to import boto3 in your script.
import boto3
a. List all the buckets accessible by your account:
In order to list all the buckets accessible by your user account, use the “list_buckets” method.
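A minimal sketch of such a listing; it simply prints the name of every bucket returned by the API:

import boto3

s3 = boto3.client('s3')

# list_buckets() returns a dictionary; the buckets sit under the 'Buckets' key
response = s3.list_buckets()
for bucket in response['Buckets']:
    print(bucket['Name'])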
b. To create a new bucket:
Method 1: Creating a new bucket in the default region
Use this when you are happy with the bucket being created in S3's default region. Note that the bucket is not automatically placed in the data center nearest to you: if you don't specify a region, S3 falls back to its default, us-east-1. So if you are in Paris and want the bucket stored in "eu-west-3", don't rely on this method; state the region explicitly as in Method 2.
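A minimal sketch, assuming a hypothetical, globally unique bucket name:

import boto3

s3 = boto3.client('s3')

# No region is specified, so S3 falls back to its default region
s3.create_bucket(Bucket='my-example-bucket-2021')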
Method 2: Creating a new bucket in a specific region
In some situations, such as GDPR compliance or other data-governance requirements, you are obliged to store the data within a specific region. In that case, we need to create the bucket in a particular region.
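A minimal sketch of creating a bucket in a specific region (eu-west-3 here; the bucket name is hypothetical):

import boto3

s3 = boto3.client('s3', region_name='eu-west-3')

# The LocationConstraint pins the bucket to the chosen region
s3.create_bucket(
    Bucket='my-example-bucket-2021',
    CreateBucketConfiguration={'LocationConstraint': 'eu-west-3'},
)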
c. To upload files to a bucket:
The first step is the same as described in the sections above. To upload a file, don't specify the file name alone; give the full path along with the file name. For example, if I have to upload a file called "mobilenet.h5" from my PC to the S3 bucket named "saved_models", I specify the file with its location on my PC, "c:/deeplearning/models/mobilenet.h5", and then the bucket name, "saved_models". If you want "mobilenet.h5" to be stored as "bestmodel.h5", pass "bestmodel.h5" as the key (the object name under which the file is stored), and the file will be saved as "bestmodel.h5" in the "saved_models" bucket. Note: make sure that the bucket you upload to already exists in S3.
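A minimal sketch of that upload, using the paths and names from the example above:

import boto3

s3 = boto3.client('s3')

# upload_file(local path, bucket name, key under which the object is stored)
s3.upload_file(
    Filename='c:/deeplearning/models/mobilenet.h5',  # full path on my PC
    Bucket='saved_models',                           # bucket must already exist
    Key='bestmodel.h5',                              # name the file gets in S3
)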
d. To download the file:
After specifying the type of service as "s3", specify the path of the file within S3 that needs to be downloaded. Note: you must specify the full path of the file, that is, its key inside the bucket. In my bucket the models are nested inside "folders", so to download "best_model.h5" I have to spell out the full path to the file, without the bucket name, as shown in the example below.
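A minimal sketch of the download; the key shown here is a hypothetical stand-in for the actual path inside your bucket:

import boto3

s3 = boto3.client('s3')

# download_file(bucket name, full key inside the bucket, local path to save to)
s3.download_file(
    Bucket='saved_models',
    Key='models/best_model.h5',                       # path within the bucket, no bucket name
    Filename='c:/deeplearning/models/best_model.h5',  # where to store it locally
)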
e. To search for a file within some bucket:
Let's say you remember the name of the file but don't know the bucket in which it sits. We can use the list_objects() method in boto3, as shown below, to get the objects within S3 and then filter out the file we need.
For example, consider that the objects within S3 are organized as in the example here: a bucket named "Models_Bucket" whose files are grouped under date "folders" such as "20/07/2021", "21/07/2021" and "22/07/2021".
If the hierarchy is nested like this, manually opening each folder to check for the presence of the object is a tedious job. Recently, at work, I wanted to check whether the mobilenet model saved by another Python script was actually there. So let's see how we can put together a script that searches for the file we need, sketched below.
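Here is a minimal sketch of such a search. The bucket name, the date and the file name are taken from the example above and are assumptions, not real resources. The script lists the top-level "folders" via the 'CommonPrefixes' key and then filters the object keys on the date and the file name, so it works even if the date spans several nested prefixes:

import boto3

s3 = boto3.client('s3')

bucket = 'Models_Bucket'
target_date = '22/07/2021'
target_file = 'mobilenet.h5'

# With a '/' delimiter, list_objects() groups the bucket's keys into
# top-level 'folders' returned under the 'CommonPrefixes' key
result = s3.list_objects(Bucket=bucket, Delimiter='/')

for entry in result.get('CommonPrefixes', []):
    prefix = entry['Prefix']
    # List every object stored under this prefix; the files sit under 'Contents'
    files = s3.list_objects(Bucket=bucket, Prefix=prefix)
    for obj in files.get('Contents', []):
        key = obj['Key']
        if target_date in key and key.endswith(target_file):
            print(target_file, 'found at', key)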
As usual, we first specify the service that we want to use as S3. Since I am searching for the model named "mobilenet.h5" under a particular date, I specify the bucket name as "Models_Bucket" and fetch all the objects within "Models_Bucket" using the "list_objects()" method. Note: the results are returned in dictionary format.
The result is therefore a dictionary whose "CommonPrefixes" key lists the top-level prefixes within "Models_Bucket". In my case it looked like [{'Prefix': '20/07/2021'}, {'Prefix': '21/07/2021'}, {'Prefix': '22/07/2021'}]. We loop through this list, read the value of each 'Prefix' key, and when we reach the date we are interested in, '22/07/2021', we list all the objects stored under that prefix. Note: again the result is a dictionary; the files are listed under its "Contents" key, so we read the contents and check whether the file 'mobilenet.h5' is present.
Let's see another scenario, where we need to check whether 'mobilenet.h5' was saved every day by another automation script that runs and saves a model each day. To accomplish this task we adapt the above script as shown below.
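A minimal sketch of the adapted script, using the same assumed names as before:

import boto3

s3 = boto3.client('s3')

bucket = 'Models_Bucket'
target_file = 'mobilenet.h5'

result = s3.list_objects(Bucket=bucket, Delimiter='/')

# One entry per date 'folder' at the top of the bucket
for entry in result.get('CommonPrefixes', []):
    prefix = entry['Prefix']
    files = s3.list_objects(Bucket=bucket, Prefix=prefix)
    # Print the date if the model file was saved under it
    if any(obj['Key'].endswith(target_file) for obj in files.get('Contents', [])):
        print(target_file, 'is present under', prefix)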
As seen in the sketch, we loop through all the prefixes (the "dates" in our example) returned under the "CommonPrefixes" key. For each prefix, we list the files it contains and check whether the returned "Contents" include the file we are looking for ('mobilenet.h5'). If the file is present, we print the date under which it was found.
For any help, reach out to me on LinkedIn. I would be happy to help you.