Category:AWS
Revision as of 00:39, 24 September 2023 by RobertBushman
CLI
S3 Sync
aws --profile monkey-banana s3 sync <local_directory> s3://<bucket_name>/<optional_prefix> --exact-timestamps --size-only
Here's a breakdown of the options:
- <local_directory>: The local directory you want to sync with S3.
- s3://<bucket_name>/<optional_prefix>: The destination S3 bucket and an optional prefix (like a folder).
- --exact-timestamps: By default, aws s3 sync compares both the size and the LastModified time of each file. This option only affects syncs from S3 to a local directory: same-sized items are skipped only when their timestamps match exactly, whereas the default is to skip same-sized items unless the local version is newer than the S3 version. For a local-to-S3 upload like the command above, it has no effect.
- --size-only: Makes file size the only criterion for deciding whether to copy a file; the last modified timestamp is ignored. This can be useful if timestamps might differ but the content hasn't changed. The trade-off is that an edit which leaves the file size unchanged will not be synced.
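The limitation of a size-only comparison can be seen locally without touching S3. The following is a minimal sketch, not part of the AWS CLI: it creates two hypothetical files (old.txt and new.txt, names chosen only for illustration) whose contents differ but whose byte counts match, which is the case a size-based check cannot tell apart.

```shell
# Create two files with different content but the same size in bytes.
tmp="$(mktemp -d)"
printf 'abc\n' > "$tmp/old.txt"
printf 'xyz\n' > "$tmp/new.txt"

# wc -c may pad its output with spaces on some systems, so strip them.
size_old="$(wc -c < "$tmp/old.txt" | tr -d ' ')"
size_new="$(wc -c < "$tmp/new.txt" | tr -d ' ')"

# cmp -s exits non-zero when the file contents differ.
if [ "$size_old" -eq "$size_new" ] && ! cmp -s "$tmp/old.txt" "$tmp/new.txt"; then
  echo "same size ($size_old bytes), different content: a size-only check treats these as identical"
fi
```

If files in your data can change without changing size, dropping --size-only and relying on the default size-plus-timestamp comparison is probably the safer choice.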
Note:
- Ensure you've properly configured your AWS CLI with the necessary access rights to perform the s3 sync operation.
- By default, the aws s3 sync command won't delete files in the destination that are not present in the source. If you add the --delete option, it deletes files from the S3 bucket that are not present in the local directory.
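Because --delete removes objects from the destination, it can be worth previewing a sync first. The sketch below wraps the command in a hypothetical helper named sync_preview (not an AWS CLI feature) and relies on the CLI's --dryrun flag, which prints the operations that would be performed without running them. The paths in the usage comment are placeholders.

```shell
# Hypothetical helper: show what a --size-only --delete sync would do
# without changing anything in the bucket.
sync_preview() {
  local src="$1" dest="$2"
  # --dryrun makes the CLI list planned uploads and deletions only.
  aws s3 sync "$src" "$dest" --size-only --delete --dryrun
}

# Usage (placeholder paths):
# sync_preview <local_directory> s3://<bucket_name>/<optional_prefix>
```

Once the listed operations look right, the same command can be run without --dryrun.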
aws --profile iterative-chaos s3 sync dedupe s3://iterative-chaos/cyphernews/harvest/reddit/parquet/dedupe --size-only --delete
Sync the Link List JSONs
aws s3 sync ../data/reddit/json/link_list s3://iterative-chaos/cyphernews/harvest/reddit/json/link_list --size-only --delete
Sync the Discussion JSONs
aws s3 sync ../data/reddit/json/discussion s3://iterative-chaos/cyphernews/harvest/reddit/json/discussion --size-only --delete
Sync Parquet
aws s3 sync ../data/reddit/parquet s3://iterative-chaos/cyphernews/harvest/reddit/parquet --size-only --delete
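The three syncs above follow the same pattern, so they could be run in one pass. This is only a sketch: the function name sync_all and the variables DATA_ROOT and S3_ROOT are introduced here for illustration, and it assumes the default AWS profile has access to the bucket, as the commands above do.

```shell
# Local and remote roots, taken from the commands above.
DATA_ROOT="../data/reddit"
S3_ROOT="s3://iterative-chaos/cyphernews/harvest/reddit"

# Hypothetical helper: sync each subdirectory in turn and
# stop at the first failure.
sync_all() {
  for sub in json/link_list json/discussion parquet; do
    aws s3 sync "$DATA_ROOT/$sub" "$S3_ROOT/$sub" --size-only --delete || return 1
  done
}

# Usage:
# sync_all
```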