CypherResilience: Difference between revisions

From Traxel Wiki

Revision as of 13:04, 26 September 2023

Data Redundancy

Data Categorization

Sync

Pull Discussions

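Discussions have to be pulled separately because the remote harvester host does not keep the full discussion archive. The remote holds only the current and previous month, so only those month directories can be pulled with --delete. Running --delete against the whole discussion tree would remove the older months from the local copy.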

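cd /opt/cypherpunk/data/reddit/json
# Prints one rsync command per 2023-09 directory without running it; review the output, then append "| sh" to execute.
find discussion -name '2023-09' -type d | awk '{print "rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/"$1"/ ./"$1"/"}'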
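
The month in the find command above is hard-coded. A minimal sketch of deriving the current and previous month instead, assuming the month directories are always named YYYY-MM as in the example. The function names months_to_sync and print_sync_commands are hypothetical and are not part of the existing tooling. Like the original one-liner, the sketch only prints the rsync commands.

```shell
#!/bin/sh
# Sketch only: REMOTE and BASE are copied from the commands above;
# the function names are illustrative.
REMOTE="www.iterativechaos.com"
BASE="/opt/cypherpunk/data/reddit/json"

# Print the given month (YYYY-MM, default: this month) and the month before it.
# The body runs in a subshell, so its variables stay local.
months_to_sync() (
    ym="${1:-$(date +%Y-%m)}"
    year="${ym%-*}"
    month="${ym#*-}"
    month="${month#0}"            # strip the leading zero so 08/09 are not read as octal
    printf '%04d-%02d\n' "$year" "$month"
    if [ "$month" -eq 1 ]; then
        year=$((year - 1))
        month=12
    else
        month=$((month - 1))
    fi
    printf '%04d-%02d\n' "$year" "$month"
)

# Print (do not run) one rsync command per matching month directory.
# Run from $BASE so the relative paths line up on both hosts.
print_sync_commands() {
    for month in $(months_to_sync "$@"); do
        find discussion -name "$month" -type d | while read -r dir; do
            printf 'rsync -avz --size-only --delete %s:%s/%s/ ./%s/\n' \
                "$REMOTE" "$BASE" "$dir" "$dir"
        done
    done
}
```

Usage: cd to /opt/cypherpunk/data/reddit/json, run print_sync_commands, review the output, then pipe it to sh.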

Pull All Other JSON

rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/link_list/ /opt/cypherpunk/data/reddit/json/link_list/
rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/archive_link_list/ /opt/cypherpunk/data/reddit/json/archive_link_list/

S3 Backup

JSON, Daily

time aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/json/ s3://iterative-chaos/cyphernews/harvest/reddit/json/

Parquet Compacted, Weekly

time aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/parquet/compacted_raw/ s3://iterative-chaos/cyphernews/harvest/reddit/parquet/compacted_raw/
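
Because both backups use --delete, it can be worth previewing what a sync would change before running it. A minimal sketch, assuming the local and bucket prefixes shown above. The backup_cmd helper is hypothetical. --dryrun is the aws s3 sync option that lists the operations without performing them. The helper only prints the command.

```shell
#!/bin/sh
# Sketch only: prefixes are copied from the commands above; backup_cmd is illustrative.
LOCAL_PREFIX="/opt/cypherpunk/data/reddit"
BUCKET_PREFIX="s3://iterative-chaos/cyphernews/harvest/reddit"

# Print the sync command for a path relative to both prefixes.
# An optional second argument (for example --dryrun) is inserted before the paths.
backup_cmd() {
    printf 'aws s3 sync --size-only --delete %s%s/%s/ %s/%s/\n' \
        "${2:+$2 }" "$LOCAL_PREFIX" "$1" "$BUCKET_PREFIX" "$1"
}

backup_cmd json --dryrun                    # daily JSON backup, preview only
backup_cmd parquet/compacted_raw --dryrun   # weekly compacted Parquet backup, preview only
```

Dropping the --dryrun argument prints the same commands as listed above, without the leading time.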