Syncing files to AWS with R

news
assets
aws
s3
r
paws
Author

al

Published

May 23, 2024

Modified

May 24, 2024

Code
knitr::opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE, dpi=60, out.width = "100%")
options(scipen=999)
options(knitr.kable.NA = '--')
options(knitr.kable.NAN = '--')

Inspired by https://blog.djnavarro.net/posts/2022-03-17_using-aws-s3-in-r/ by Danielle Navarro.

Note to self: /Users/airvine/Projects/repo/new_graphiti/_freeze/posts/aws-storage-processx/index/execute-results/html.json is created when I render this document. It seems to be what is published to the website after 1. the GitHub Actions workflow runs to generate the gh-pages branch (on a GitHub runner), and 2. the site is published from that branch with GitHub Pages.

“Quick” post to document where I got to with syncing files to AWS with R. I didn’t love the aws.s3::s3sync() function because, from what I could tell, there is no way to tell it to delete files when they are not present locally or in a bucket (I could be wrong).

Then I climbed into s3fs, which mirrors the fs package and seems a bit more user friendly than the aws.s3 package for managing files. It is created by Dyfan Jones, who is also the top contributor to paws!! He seems like perhaps as much of a beast as Scott Chamberlain, one of the contributors to s3fs.
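
The naming really does mirror fs almost one to one. A quick hypothetical illustration (the bucket and paths are made up):

Code
# local filesystem operations with `fs`
fs::dir_ls("posts")
fs::file_delete("posts/test.txt")

# the `s3fs` equivalents on S3
s3fs::s3_dir_ls("s3://my-bucket/posts")
s3fs::s3_file_delete("s3://my-bucket/posts/test.txt")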

For the sync issue I figured why not just call the aws command line tool from R. processx is an insane package that might be the mother of all packages. It lets you run command line tools from R with useful flexibility, such as setting the directory the command is called from (a big deal as far as I can tell).
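
For example, a minimal call (assuming the aws CLI is installed) might look like this, with the `wd` argument controlling where the command runs:

Code
# run `aws s3 ls` from the repo root, streaming the output live
processx::run("aws", args = c("s3", "ls"), wd = here::here(), echo = TRUE)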

We need to set up our AWS account online. The blog above from Danielle Navarro covers that, I believe (I struggled through it a long time ago). I should use a ~/.aws/credentials file but don’t yet. I have my credentials in my ~/.Renviron file as well as in my ~/.bash_profile (probably a ridiculous setup). They are:

AWS_ACCESS_KEY_ID='my_key'
AWS_DEFAULT_REGION='my_region'
AWS_SECRET_ACCESS_KEY='my_secret_key'
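
For reference, the equivalent ~/.aws/credentials file would hold the same values (with the region usually set separately in ~/.aws/config):

[default]
aws_access_key_id = my_key
aws_secret_access_key = my_secret_key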
Code
# library(aws.s3)
library(processx)
# library(paws) # this is the mom - a couple of examples of its use are hashed out here
library(s3fs)
# library(aws.iam) # not using - sets permissions
library(here) # helps us with working directory issues related to the `environment` we operate in when rendering

See buckets using the s3fs package.


Current buckets are:

Code
s3fs::s3_dir_ls(refresh = TRUE) 
[1] "s3://23cog"
Code
# First we set up our AWS s3 file system. I am actually not sure this is necessary but I did it.  Will turn the chunk off
# to not repeat.
# s3fs::s3_file_system(profile_name = "s3fs_example")

Create a Bucket

Let’s generate the name of the bucket from the name of the repo, but due to AWS bucket naming rules we need to swap out our underscores for hyphens! Maybe a good enough reason to change our naming conventions for our repos on GitHub!!

Code
bucket_name <- basename(here::here()) |>
  stringr::str_replace_all("_", "-")

bucket_path <- s3fs::s3_path(bucket_name)

s3fs::s3_bucket_create(bucket_path)
[1] "s3://new-graphiti"

Sync Files to Bucket

We build a little wrapper function to help us debug issues when running system commands with processx.

Code
sys_call <- function(){
  # Relies on `command`, `args`, and `working_directory` defined in the
  # global environment (set in the next chunk).
  result <- tryCatch({
    processx::run(
      command,
      args = args,
      echo = TRUE,            # Print the command output live
      wd = working_directory, # Set the working directory
      spinner = TRUE,         # Show a spinner
      timeout = 60            # Timeout after 60 seconds
    )
  }, error = function(e) {
    # Handle errors: e.g., print a custom error message
    cat("An error occurred: ", e$message, "\n")
    NULL  # Return NULL or another appropriate value
  })
  
  # Check if the command was successful. `echo = TRUE` already printed the
  # output live so we just report the exit status here.
  if (!is.null(result)) {
    cat("Exit status:", result$status, "\n")
  } else {
    cat("Failed to execute the command properly.\n")
  }
}


Then we specify our command and arguments. To include only files in the posts/* directory, we need to order the --exclude and --include flags appropriately (exclude everything first, then include what we want):

Code
command <- "aws"
args <- c('s3', 'sync', '.', bucket_path, '--delete', '--exclude', '*', '--include', 'posts/*')

working_directory = here::here() # we could just remove this from the function to get the current wd but it's nice to have so we leave it
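
For reference, with this bucket the call above is equivalent to running `aws s3 sync . s3://new-graphiti --delete --exclude '*' --include 'posts/*'` from the repo root.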

Now let’s put a tester file in our directory and sync it to our bucket. We will remove it later to test whether it is removed on sync.

Code
file.create(here::here('posts/test.txt'))
[1] TRUE

Run our little function to sync the files to the bucket.

Code
sys_call()
Completed 237 Bytes/511.7 KiB (3.0 KiB/s) with 12 file(s) remaining
upload: posts/_metadata.yml to s3://new-graphiti/posts/_metadata.yml
Completed 237 Bytes/511.7 KiB (3.0 KiB/s) with 11 file(s) remaining
Completed 2.0 KiB/511.7 KiB (11.9 KiB/s) with 11 file(s) remaining 
upload: posts/logos-equipment/index.qmd to s3://new-graphiti/posts/logos-equipment/index.qmd
Completed 2.0 KiB/511.7 KiB (11.9 KiB/s) with 10 file(s) remaining
Completed 3.6 KiB/511.7 KiB (20.8 KiB/s) with 10 file(s) remaining
upload: posts/snakecase/index.qmd to s3://new-graphiti/posts/snakecase/index.qmd
Completed 3.6 KiB/511.7 KiB (20.8 KiB/s) with 9 file(s) remaining
Completed 8.6 KiB/511.7 KiB (48.5 KiB/s) with 9 file(s) remaining
upload: posts/snakecase/thumbnail.jpg to s3://new-graphiti/posts/snakecase/thumbnail.jpg
Completed 8.6 KiB/511.7 KiB (48.5 KiB/s) with 8 file(s) remaining
Completed 13.9 KiB/511.7 KiB (74.3 KiB/s) with 8 file(s) remaining
upload: posts/aws-storage-permissions/index.qmd to s3://new-graphiti/posts/aws-storage-permissions/index.qmd
Completed 13.9 KiB/511.7 KiB (74.3 KiB/s) with 7 file(s) remaining
Completed 17.8 KiB/511.7 KiB (82.4 KiB/s) with 7 file(s) remaining
upload: posts/aws-storage-processx/image.jpg to s3://new-graphiti/posts/aws-storage-processx/image.jpg
Completed 17.8 KiB/511.7 KiB (82.4 KiB/s) with 6 file(s) remaining
upload: posts/test.txt to s3://new-graphiti/posts/test.txt        
Completed 17.8 KiB/511.7 KiB (82.4 KiB/s) with 5 file(s) remaining
Completed 26.0 KiB/511.7 KiB (119.1 KiB/s) with 5 file(s) remaining
upload: posts/aws-storage-processx/index.rmarkdown to s3://new-graphiti/posts/aws-storage-processx/index.rmarkdown
Completed 26.0 KiB/511.7 KiB (119.1 KiB/s) with 4 file(s) remaining
Completed 34.0 KiB/511.7 KiB (152.0 KiB/s) with 4 file(s) remaining
upload: posts/aws-storage-permissions/image.jpg to s3://new-graphiti/posts/aws-storage-permissions/image.jpg
Completed 34.0 KiB/511.7 KiB (152.0 KiB/s) with 3 file(s) remaining
Completed 42.2 KiB/511.7 KiB (180.5 KiB/s) with 3 file(s) remaining
upload: posts/aws-storage-processx/index.qmd to s3://new-graphiti/posts/aws-storage-processx/index.qmd
Completed 42.2 KiB/511.7 KiB (180.5 KiB/s) with 2 file(s) remaining
Completed 83.0 KiB/511.7 KiB (256.0 KiB/s) with 2 file(s) remaining
upload: posts/logos-equipment/image.jpg to s3://new-graphiti/posts/logos-equipment/image.jpg
Completed 83.0 KiB/511.7 KiB (256.0 KiB/s) with 1 file(s) remaining
Completed 339.0 KiB/511.7 KiB (830.9 KiB/s) with 1 file(s) remaining
Completed 511.7 KiB/511.7 KiB (654.4 KiB/s) with 1 file(s) remaining
upload: posts/snakecase/all.jpeg to s3://new-graphiti/posts/snakecase/all.jpeg
Exit status: 0 

Then we can see our bucket contents as a tree, as well as list them and capture the result.

Code
s3fs::s3_dir_tree(bucket_path)
s3://new-graphiti
└── posts
    ├── _metadata.yml
    ├── test.txt
    ├── aws-storage-permissions
    │   ├── image.jpg
    │   └── index.qmd
    ├── aws-storage-processx
    │   ├── image.jpg
    │   ├── index.qmd
    │   └── index.rmarkdown
    ├── logos-equipment
    │   ├── image.jpg
    │   └── index.qmd
    └── snakecase
        ├── all.jpeg
        ├── index.qmd
        └── thumbnail.jpg
Code
t <- s3fs::s3_dir_info(bucket_path, recurse = TRUE)
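
`s3_dir_info()` returns a data frame of object metadata; we hang onto it so we can compare the `key` column before and after the next sync.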

Now we will remove test.txt.

Code
file.remove(here::here('posts/test.txt'))
[1] TRUE

Now we sync again.

Code
sys_call()
delete: s3://new-graphiti/posts/test.txt
Exit status: 0 

List our bucket contents and capture them again.

Code
s3fs::s3_dir_tree(bucket_path)
s3://new-graphiti
└── posts
    ├── _metadata.yml
    ├── aws-storage-permissions
    │   ├── image.jpg
    │   └── index.qmd
    ├── aws-storage-processx
    │   ├── image.jpg
    │   ├── index.qmd
    │   └── index.rmarkdown
    ├── logos-equipment
    │   ├── image.jpg
    │   └── index.qmd
    └── snakecase
        ├── all.jpeg
        ├── index.qmd
        └── thumbnail.jpg
Code
t2 <- s3fs::s3_dir_info(bucket_path, recurse = TRUE)

Compare the file structure before and after our sync.

Code
waldo::compare(t$key, t2$key)
     old                             | new                                 
 [9] "posts/snakecase/all.jpeg"      | "posts/snakecase/all.jpeg"      [9] 
[10] "posts/snakecase/index.qmd"     | "posts/snakecase/index.qmd"     [10]
[11] "posts/snakecase/thumbnail.jpg" | "posts/snakecase/thumbnail.jpg" [11]
[12] "posts/test.txt"                -                                     

Success!!

To Do

We need to build the call to sync the other way (cloud to local) in a way that perhaps nukes local files if they are not on the cloud. This is because we collaborate within our team, so when one person renames images, the other person should end up with only the newly named image in their local directory after they sync.


This all deserves consideration as it could get really messy from a few different angles (i.e. one person adds files they don’t want nuked and then they get nuked). There are lots of different options for doing things so we will get there.
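
An untested sketch of what that reverse call might look like (note that --delete here would remove local files that are not in the bucket, so handle with care):

Code
# untested sketch: sync cloud -> local, deleting local files absent from the bucket
args <- c('s3', 'sync', bucket_path, '.', '--delete', '--exclude', '*', '--include', 'posts/*')
sys_call()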

Delete Bucket

Let’s delete the bucket.

Code
# Here is the command line approach, hashed out in favor of the `s3fs` approach below.
# args <- c('s3', 'rb', bucket_path, '--force')
# sys_call()
Code
# Here is the `s3fs` way to "delete" all the versions.  
# list all the files in the bucket
fl <- s3fs::s3_dir_ls(bucket_path, recurse = TRUE, refresh = TRUE)

# list all the version info for all the files
vi <- fl |> 
  purrr::map_df(s3fs::s3_file_version_info)

s3fs::s3_file_delete(path = vi$uri)
Code
s3fs::s3_bucket_delete(bucket_path)
[1] "s3://new-graphiti"

As we have tried this before, we know that if we want to delete a bucket with versioned files in it we need to empty the bucket first, including the delete markers. That is easy with the UI in the AWS console but seems tricky from the command line. There is a bunch of discussion of options for this at https://stackoverflow.com/questions/29809105/how-do-i-delete-a-versioned-bucket-in-aws-s3-using-the-cli . A good way around it (and a topic for another post) might be to apply a lifecycle configuration to the bucket that deletes all versions of files after a day, allowing you to delete the bucket once they expire (as per the above post). Really, we may want a lifecycle configuration on all our versioned buckets to keep costs down anyway, but that deserves more thought and perhaps another post.
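An untested sketch of what that lifecycle configuration might look like using paws (the rule ID is made up, and the exact parameter shapes should be checked against the paws docs):

Code
# untested sketch: expire all object versions after one day so the bucket
# can be deleted once they lapse
svc <- paws::s3()
svc$put_bucket_lifecycle_configuration(
  Bucket = bucket_name,
  LifecycleConfiguration = list(
    Rules = list(
      list(
        ID = "expire-everything",  # hypothetical rule name
        Status = "Enabled",
        Filter = list(Prefix = ""),  # apply to all objects
        Expiration = list(Days = 1),
        NoncurrentVersionExpiration = list(NoncurrentDays = 1),
        AbortIncompleteMultipartUpload = list(DaysAfterInitiation = 1)
      )
    )
  )
)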

Code
# old notes
# We are going to test creating a bucket with versioning on. This has large implications for billing, with some details
# of how it works [here](https://aws.amazon.com/blogs/aws/amazon-s3-enhancement-versioning/) and examples of costs [here](https://aws.amazon.com/s3/faqs/?nc1=h_ls). Thinking we may want versioned buckets for things like `sqlite`
# "snapshot" databases but definitely not for things like images.