CypherResilience
Revision as of 15:20, 26 September 2023
Data Redundancy
Data Classification
Catalog | Format | Dataset | Classification | Notes |
---|---|---|---|---|
Harvest | JSON | link_list | create, delete | |
Harvest | JSON | archive_link_list | create | revise sync calls - do not use delete switch |
Harvest | JSON | discussion | create, (delete - only affects 2 most recent months) | Restructure the storage, move date to year=2023/month=09 |
Pipeline | Parquet | raw/link_list | create, delete (2x daily) | |
Pipeline | Parquet | dedupe/link_list | create, delete (2x daily) | |
Pipeline | Parquet | compacted_raw/link_list | create, delete (~weekly) | |
Sync
Pull Discussions
Discussions have to be pulled separately because the remote harvester host does not keep the full discussion archive. The remote holds only the current and previous month, and those two months can be pulled with --delete.
cd /opt/cypherpunk/data/reddit/json
find discussion -name '2023-09' -type d | awk '{print "rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/"$1"/ ./"$1"/"}'
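The month in the command above is hard-coded. A minimal sketch of one way to generate the same rsync lines for both the current and previous month follows. The function names `recent_months` and `discussion_sync_commands` are hypothetical and not part of the original page. Like the original one-liner, the sketch only prints commands and only matches directories that already exist locally.

```shell
#!/bin/sh
REMOTE_HOST="www.iterativechaos.com"
JSON_ROOT="/opt/cypherpunk/data/reddit/json"

# Hypothetical helper: print the given month (YYYY-MM) and the month
# before it, one per line.
recent_months() (
  year="${1%-*}"
  month="${1#*-}"
  month="${month#0}"   # strip a leading zero so 08 and 09 are not read as octal
  printf '%04d-%02d\n' "$year" "$month"
  if [ "$month" -eq 1 ]; then
    year=$((year - 1)); month=12
  else
    month=$((month - 1))
  fi
  printf '%04d-%02d\n' "$year" "$month"
)

# Hypothetical helper: print one rsync command per local discussion
# directory named for either month. Run it from JSON_ROOT.
discussion_sync_commands() (
  for ym in $(recent_months "$1"); do
    find discussion -name "$ym" -type d | while read -r dir; do
      echo "rsync -avz --size-only --delete ${REMOTE_HOST}:${JSON_ROOT}/${dir}/ ./${dir}/"
    done
  done
)
```

To review the generated commands, run `cd /opt/cypherpunk/data/reddit/json && discussion_sync_commands "$(date +%Y-%m)"`. Append `| sh` to execute them.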
Pull All Other JSON
rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/link_list/ /opt/cypherpunk/data/reddit/json/link_list/
rsync -avz --size-only www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/archive_link_list/ /opt/cypherpunk/data/reddit/json/archive_link_list/
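The classification table marks archive_link_list as create-only, so its pull must never carry --delete. One way to make that rule hard to get wrong is to derive the flag from the dataset name. The sketch below assumes a hypothetical helper `pull_command`, which is not from the original page. It prints the command rather than running it.

```shell
#!/bin/sh
REMOTE_HOST="www.iterativechaos.com"
JSON_ROOT="/opt/cypherpunk/data/reddit/json"

# Hypothetical helper: print the rsync pull command for one Harvest JSON
# dataset, adding --delete only where the classification allows deletes.
pull_command() (
  dataset="$1"
  case "$dataset" in
    link_list)         delete_flag=" --delete" ;;  # classified: create, delete
    archive_link_list) delete_flag="" ;;           # classified: create only
    *) echo "unknown dataset: $dataset" >&2; exit 1 ;;
  esac
  echo "rsync -avz --size-only${delete_flag} ${REMOTE_HOST}:${JSON_ROOT}/${dataset}/ ${JSON_ROOT}/${dataset}/"
)
```

For example, `pull_command archive_link_list | sh` runs the create-only pull. An unrecognised dataset name returns a non-zero status instead of producing a command.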
S3 Backup
JSON, Daily
time aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/json/ s3://iterative-chaos/cyphernews/harvest/reddit/json/
Parquet Compacted, Weekly
time aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/parquet/compacted_raw/ s3://iterative-chaos/cyphernews/harvest/reddit/parquet/compacted_raw/
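Because both backups use --delete, it can help to preview what a sync would remove before running it. The sketch below assumes a hypothetical helper `backup_command`, which is not from the original page. It prints the sync command for either target and accepts an optional extra flag such as `--dryrun`, which `aws s3 sync` supports for listing operations without performing them.

```shell
#!/bin/sh
DATA_ROOT="/opt/cypherpunk/data/reddit"
S3_ROOT="s3://iterative-chaos/cyphernews/harvest/reddit"

# Hypothetical helper: print the aws s3 sync command for one backup target.
# $1 is "json" (daily) or "parquet/compacted_raw" (weekly).
# $2 is an optional extra flag, for example --dryrun.
backup_command() (
  target="$1"
  extra="${2:-}"
  case "$target" in
    json|parquet/compacted_raw) ;;
    *) echo "unknown backup target: $target" >&2; exit 1 ;;
  esac
  echo "aws s3 sync --size-only --delete${extra:+ $extra} ${DATA_ROOT}/${target}/ ${S3_ROOT}/${target}/"
)
```

Running `backup_command json --dryrun | sh` previews the daily JSON backup. Dropping `--dryrun` performs it.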