Revision as of 13:16, 26 September 2023

Data Redundancy

Data Classification

Catalog   Format   Dataset            Classification                    Notes
Harvest   JSON     link_list          create, delete
Harvest   JSON     archive_link_list  create
Harvest   JSON     discussion         create, (delete - 2 months only)  Restructure the storage, move date to year=2023/month=09
Pipeline  Parquet  raw
Pipeline  Parquet  compacted_raw
Pipeline  Parquet  dedupe


Pull Discussions

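Discussions have to be pulled separately because the remote harvester host does not keep the full discussion archive. It holds only the current and previous month, so only those month directories can be pulled with --delete. Running --delete against the whole discussion tree would remove the older months from the local copy.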

cd /opt/cypherpunk/data/reddit/json
find discussion -name '2023-09' -type d | awk '{print "rsync -avz --size-only --delete "$1"/ ./"$1"/"}'
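
The command above hard-codes the month and does not name a remote source. A minimal sketch of how both could be parameterised follows. The REMOTE value and the function names prev_month and discussion_pull_cmds are hypothetical and not part of the original setup; substitute the real harvester host. Like the original, it only prints the rsync commands so they can be reviewed before running.

```shell
#!/bin/sh
# ASSUMPTION: hypothetical remote source; the original command does not name one.
REMOTE="${REMOTE:-harvester.example:/opt/cypherpunk/data/reddit/json}"

# Print the month before a given YYYY-MM using plain shell arithmetic.
prev_month() {
    year=${1%-*}
    month=${1#*-}
    month=${month#0}    # strip a leading zero so 08 and 09 are not read as octal
    if [ "$month" -eq 1 ]; then
        year=$((year - 1))
        month=12
    else
        month=$((month - 1))
    fi
    printf '%04d-%02d\n' "$year" "$month"
}

# Print (not run) one rsync command per local discussion directory
# named after the given month, e.g. 2023-09.
discussion_pull_cmds() {
    find discussion -type d -name "$1" 2>/dev/null |
        awk -v remote="$REMOTE" \
            '{print "rsync -avz --size-only --delete " remote "/" $1 "/ ./" $1 "/"}'
}
```

From /opt/cypherpunk/data/reddit/json, calling discussion_pull_cmds "$(date +%Y-%m)" and then discussion_pull_cmds "$(prev_month "$(date +%Y-%m)")" would cover the two months the remote keeps. Because find runs locally, a month directory that does not exist locally yet is not picked up, which is the same limitation as the original one-liner.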

Pull All Other JSON

rsync -avz --size-only --delete /opt/cypherpunk/data/reddit/json/link_list/
rsync -avz --size-only --delete /opt/cypherpunk/data/reddit/json/archive_link_list/

S3 Backup

JSON, Daily

time aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/json/ s3://iterative-chaos/cyphernews/harvest/reddit/json/

Parquet Compacted, Weekly

time aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/parquet/compacted_raw/ s3://iterative-chaos/cyphernews/harvest/reddit/parquet/compacted_raw/
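
The two schedules above could be driven from one small wrapper, sketched here under assumptions: the function name backup_cmd and the schedule names daily and weekly are illustrative and not part of the original setup. The function only prints the command, so it can be inspected before it is run from cron or by hand.

```shell
#!/bin/sh
# Paths taken from the commands above.
DATA_ROOT=/opt/cypherpunk/data/reddit
BUCKET_ROOT=s3://iterative-chaos/cyphernews/harvest/reddit

# Print the aws s3 sync command for a schedule name.
# ASSUMPTION: "daily" and "weekly" are illustrative names for the two jobs.
backup_cmd() {
    case "$1" in
        daily)  sub=json ;;
        weekly) sub=parquet/compacted_raw ;;
        *)
            echo "usage: backup_cmd daily|weekly" >&2
            return 2
            ;;
    esac
    echo "aws s3 sync --size-only --delete $DATA_ROOT/$sub/ $BUCKET_ROOT/$sub/"
}
```

Since --delete removes objects from the bucket that no longer exist locally, appending --dryrun to the printed command previews the changes without applying them.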