Compaction

Execute On Remote Host

# stop cron from running jobs while compacting (the crontab is restored from crontab.l at the end)
crontab -r
ps -fU admin
# check whether any Python or .sh job is still running; wait until it completes
# generate the parquets
export PYTHONPATH=/home/admin/projects/cypherpunk/cypherpunk_reddit/
python /home/admin/projects/cypherpunk/cypherpunk_reddit/step_02_parquet_link_list.py
# backup the archive_link_list
cd /opt/cypherpunk/data/reddit/json
cp -r archive_link_list bak.archive_link_list
# add the new link_list contents
rsync -av link_list/ archive_link_list/
# verify that the new size matches the sum of the old archive and the new link_list
du --max-depth 1
# good place to pull a copy to another machine
# rsync -avz --size-only www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/archive_link_list/ /opt/cypherpunk/data/reddit/json/archive_link_list/
# remove the live link_list jsons (-delete is safe with odd file names and with an empty tree)
find link_list -type f -delete
# move the existing compacted dir
cd /opt/cypherpunk/data/reddit/parquet/
mv compacted_raw old.compacted_raw
# copy the current raw parquets into the new compacted dir
cp -r raw compacted_raw
# remove the files from the existing raw structure
find raw -type f -delete
# regenerate parquets
python /home/admin/projects/cypherpunk/cypherpunk_reddit/step_02_parquet_link_list.py
# check parquet sizes
cd ../parquet
du --max-depth 1
# clean up
rm -rf old.compacted_raw
cd ../json
rm -rf bak.archive_link_list
crontab /home/admin/projects/cypherpunk/cypherpunk_reddit/crontab.l
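
The size check in the steps above (new archive equals old archive plus new link_list) can be made exact by summing file sizes rather than comparing du output by eye. A minimal sketch, assuming GNU find (for -printf) and that no file in link_list shares a relative path with a file already in the archive; tree_bytes and check_archive_size are hypothetical helper names, not part of the project.

```shell
#!/bin/sh
# Sum the size in bytes of all regular files under a directory.
# Assumes GNU find (-printf), in line with the GNU du used in the steps above.
tree_bytes() {
    find "$1" -type f -printf '%s\n' | awk '{ total += $1 } END { printf "%.0f\n", total }'
}

# Hypothetical helper: succeed only if new == old + added.
# $1 = backup of the old archive, $2 = newly added tree, $3 = merged archive
check_archive_size() {
    old=$(tree_bytes "$1")
    added=$(tree_bytes "$2")
    new=$(tree_bytes "$3")
    echo "old=$old added=$added new=$new"
    [ "$new" -eq $((old + added)) ]
}
```

Run from /opt/cypherpunk/data/reddit/json after the rsync into archive_link_list and before the live link_list files are removed, for example: check_archive_size bak.archive_link_list link_list archive_link_list && echo "sizes match". If a file name exists in both trees, rsync overwrites it and the totals will legitimately differ.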

Data Redundancy

Data Classification


Catalog	Format	Dataset	Classification	Notes
Harvest	JSON	link_list	create, delete
Harvest	JSON	archive_link_list	create	Revise sync calls: do not use the --delete switch
Harvest	JSON	discussion	create (delete only affects the 2 most recent months)	Restructure the storage: move the date to year=2023/month=09
Pipeline	Parquet	raw/link_list	create, delete (2x daily)
Pipeline	Parquet	dedupe/link_list	create, delete (2x daily)
Pipeline	Parquet	compacted_raw/link_list	create, delete (~weekly)

Sync

Pull Discussions

Discussions have to be pulled separately because the remote harvester host does not keep the full discussion archive. It holds only the current and previous month, each of which can be pulled with --delete.

This will not pick up subreddits that were newly added on the remote host, because it relies on the local directory structure to find the subreddits. Creating the current month's directory locally for each new subreddit fixes that.

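cd /opt/cypherpunk/data/reddit/json
# this only prints one rsync command per subreddit for the given month; review the output, then append "| sh" to run it
find discussion -name '2023-09' -type d | awk '{print "rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/"$1"/ ./"$1"/"}'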
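
Creating the month directories for subreddits that so far exist only on the remote could be sketched as below. This assumes a layout of discussion/<subreddit>/<YYYY-MM>, which is inferred from the find pattern and not confirmed by the page; make_month_dirs and the subreddit names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Assumed layout: discussion/<subreddit>/<YYYY-MM> (inferred from the find pattern).
# Create the month directory for each named subreddit so that the
# find | awk step also generates an rsync command for it.
# $1 = month as YYYY-MM, remaining arguments = subreddit names
make_month_dirs() {
    month=$1
    shift
    for sub in "$@"; do
        mkdir -p "discussion/$sub/$month"
    done
}

# Usage (placeholder subreddit names):
#   cd /opt/cypherpunk/data/reddit/json
#   make_month_dirs "$(date +%Y-%m)" new_subreddit_a new_subreddit_b
```

mkdir -p leaves existing directories untouched, so the helper is safe to rerun for subreddits that are already present.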

Pull All Other JSON

rsync -avz --size-only www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/archive_link_list/ /opt/cypherpunk/data/reddit/json/archive_link_list/
rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/link_list/ /opt/cypherpunk/data/reddit/json/link_list/

Pull Compacted Parquet

rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/parquet/compacted_raw/ /opt/cypherpunk/data/reddit/parquet/compacted_raw/
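
Since --delete removes local files that are missing on the remote, it can be worth previewing a pull before running it. A sketch in the same print-then-run style as the discussion pull, using rsync's --dry-run option; pull_cmd is a hypothetical helper and is meant only for the datasets that are pulled with --delete (link_list and compacted_raw, not archive_link_list).

```shell
#!/bin/sh
REMOTE=www.iterativechaos.com
BASE=/opt/cypherpunk/data/reddit

# Hypothetical helper: print the rsync command for one --delete dataset.
# $1 = dataset path relative to BASE, $2 = optional extra flag such as --dry-run
pull_cmd() {
    echo "rsync -avz --size-only --delete ${2:+$2 }$REMOTE:$BASE/$1/ $BASE/$1/"
}

# Preview first, then run for real once the listed deletions look right:
#   pull_cmd parquet/compacted_raw --dry-run | sh
#   pull_cmd parquet/compacted_raw | sh
```

With -v and --dry-run, rsync lists the files it would transfer or delete without changing anything locally.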

S3 Backup

JSON, Daily

time aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/json/ s3://iterative-chaos/cyphernews/harvest/reddit/json/

Parquet Compacted, Weekly

time aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/parquet/compacted_raw/ s3://iterative-chaos/cyphernews/harvest/reddit/parquet/compacted_raw/
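
Because aws s3 sync --delete removes objects that no longer exist locally, running it while compacted_raw is empty (for example partway through a compaction) could strip the backup copy. A hedged sketch of a guard, with hypothetical helper names has_files and safe_sync; as written it only previews the sync with the CLI's --dryrun flag.

```shell
#!/bin/sh
# Hypothetical helper: true if the directory exists and holds at least one regular file.
has_files() {
    [ -d "$1" ] && [ -n "$(find "$1" -type f 2>/dev/null | head -n 1)" ]
}

# Hypothetical wrapper: preview the sync only when the source looks populated.
# $1 = local source directory, $2 = S3 destination prefix
safe_sync() {
    if has_files "$1"; then
        # --dryrun prints the planned operations without performing them
        aws s3 sync --dryrun --size-only --delete "$1/" "$2"
    else
        echo "refusing to sync: $1 is missing or empty" >&2
        return 1
    fi
}

# Usage:
#   safe_sync /opt/cypherpunk/data/reddit/parquet/compacted_raw \
#       s3://iterative-chaos/cyphernews/harvest/reddit/parquet/compacted_raw/
```

Once the previewed operations look right, dropping --dryrun from the wrapper turns it into the real weekly backup.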

CypherResilience
