# CypherResilience

## Data Redundancy

### Data Classification
| Catalog  | Format  | Dataset                 | Classification                 | Notes                                                    |
|----------|---------|-------------------------|--------------------------------|----------------------------------------------------------|
| Harvest  | JSON    | link_list               | create, delete                 |                                                          |
| Harvest  | JSON    | archive_link_list       | create                         |                                                          |
| Harvest  | JSON    | discussion              | create, delete (2 months only) | Restructure the storage; move date to year=2023/month=09 |
| Pipeline | Parquet | raw/link_list           | create, delete                 |                                                          |
| Pipeline | Parquet | compacted_raw/link_list | create, delete                 |                                                          |
| Pipeline | Parquet | dedupe/link_list        | create, delete                 |                                                          |
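The restructuring note for `discussion` suggests moving from flat date directories (the `2023-09` form targeted by the `find` command below) to a Hive-style `year=/month=` partition layout. A hypothetical before/after sketch; the file names are illustrative only:

```
# Current layout (flat date directory), assumed from the find command below
discussion/2023-09/<post_id>.json

# Proposed layout (Hive-style partitions), per the note above
discussion/year=2023/month=09/<post_id>.json
```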
## Sync

### Pull Discussions
Discussions have to be pulled separately because the remote harvester host does not keep the full discussion archive. The remote holds only the current and previous month, so those month directories can be pulled with `--delete`.
```bash
cd /opt/cypherpunk/data/reddit/json
find discussion -name '2023-09' -type d \
  | awk '{print "rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/"$1"/ ./"$1"/"}'
```
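The `awk` step only prints the per-directory `rsync` commands. Assuming the printed commands look right, they can be executed by piping the output to a shell; a sketch:

```bash
# From /opt/cypherpunk/data/reddit/json, generate the per-directory
# rsync commands and execute them. Review the printed commands first,
# since --delete removes local files that are gone from the remote.
find discussion -name '2023-09' -type d \
  | awk '{print "rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/"$1"/ ./"$1"/"}' \
  | sh
```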
### Pull All Other JSON
```bash
rsync -avz --size-only --delete \
  www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/link_list/ \
  /opt/cypherpunk/data/reddit/json/link_list/
rsync -avz --size-only --delete \
  www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/archive_link_list/ \
  /opt/cypherpunk/data/reddit/json/archive_link_list/
```
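Because `--delete` removes local files that are missing on the remote, previewing with rsync's dry-run flag first is a cheap safeguard; for example:

```bash
# Preview what would be transferred or deleted without changing anything
# (-n is rsync's dry-run flag).
rsync -avzn --size-only --delete \
  www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/link_list/ \
  /opt/cypherpunk/data/reddit/json/link_list/
```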
## S3 Backup

### JSON, Daily
```bash
time aws s3 sync --size-only --delete \
  /opt/cypherpunk/data/reddit/json/ \
  s3://iterative-chaos/cyphernews/harvest/reddit/json/
```
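Here `--delete` also removes objects from the bucket that no longer exist locally, so the AWS CLI's `--dryrun` flag can be used to preview the changes first:

```bash
# Show what would be uploaded or deleted without touching the bucket.
aws s3 sync --dryrun --size-only --delete \
  /opt/cypherpunk/data/reddit/json/ \
  s3://iterative-chaos/cyphernews/harvest/reddit/json/
```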
### Parquet Compacted, Weekly
```bash
time aws s3 sync --size-only --delete \
  /opt/cypherpunk/data/reddit/parquet/compacted_raw/ \
  s3://iterative-chaos/cyphernews/harvest/reddit/parquet/compacted_raw/
```
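The daily and weekly cadences suggest scheduled jobs. A minimal crontab sketch, assuming the host has the AWS CLI configured; the run times and log paths are placeholders:

```bash
# Daily JSON backup at 02:00; weekly Parquet backup Sundays at 03:00.
# Times and log paths are illustrative, not from the runbook.
0 2 * * * aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/json/ s3://iterative-chaos/cyphernews/harvest/reddit/json/ >> /var/log/cyphernews-s3-json.log 2>&1
0 3 * * 0 aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/parquet/compacted_raw/ s3://iterative-chaos/cyphernews/harvest/reddit/parquet/compacted_raw/ >> /var/log/cyphernews-s3-parquet.log 2>&1
```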