CypherResilience
Compaction
Concept
Compaction rolls the twice-daily raw parquet output into the long-lived compacted_raw dataset and merges the live link_list JSONs into archive_link_list, so the pipeline carries fewer, larger files without losing harvested data.
Execution
Execute On Remote Host
# stop cron from running jobs while compacting
crontab -r
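# (the crontab is restored from crontab.l in the final step of this procedure)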
ps -fU admin
# check whether any Python or .sh harvest job is still running; wait until they complete
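# optional: keep an eye on them until the list is empty (a sketch; assumes the jobs
# run as the admin user and show up as python or .sh processes)
# watch -n 30 "ps -fU admin | grep -E 'python|[.]sh' | grep -v grep"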
# generate the parquets
export PYTHONPATH=/home/admin/projects/cypherpunk/cypherpunk_reddit/
python /home/admin/projects/cypherpunk/cypherpunk_reddit/step_02_parquet_link_list.py
# backup the archive_link_list
cd /opt/cypherpunk/data/reddit/json
cp -r archive_link_list bak.archive_link_list
# add the new link_list contents
rsync -av link_list/ archive_link_list/
# verify that the new size matches the sum of the old archive and the new link_list
du --max-depth 1
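# a sketch of that check using byte totals (assumes no overlapping filenames between
# the old archive and the new link_list; du directory overhead means the merged total
# will only approximately equal old + new):
old=$(du -sb bak.archive_link_list | cut -f1)
new=$(du -sb link_list | cut -f1)
merged=$(du -sb archive_link_list | cut -f1)
echo "old=$old new=$new old+new=$((old + new)) merged=$merged"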
# good place to pull a copy to another machine
# rsync -avz --size-only www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/archive_link_list/ /opt/cypherpunk/data/reddit/json/archive_link_list/
# remove the live link_list jsons
find link_list -type f | xargs rm
# find link_list/ -type f ! -cnewer link_list/day/worldnews/1701374453-0.json.bz2 | xargs ls -l
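# (the commented-out variant above appears to be the more conservative option: it lists
# only files whose ctime is not newer than a reference json.bz2, presumably so JSONs
# harvested after the parquet run are left alone - review the ls output, then swap
# ls -l for rm)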
# backup the parquet
cd /opt/cypherpunk/data/reddit/
cp -r parquet bak.parquet
# check the active months
cd parquet
find compacted_raw -path '*created_year=2023/created_month=10/*' -type f
# Confirm you have replacements for all of these (see the comparison sketch below).
# Do not compact within one week of a month boundary (this rule could be improved).
# When compacting across a month boundary, keep at least a week of data on each side.
find raw -path '*created_year=2023/created_month=10/*' -type f
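# a sketch of that confirmation (assumes both trees share the same partition layout
# below raw/ and compacted_raw/; anything printed by comm is a compacted partition
# with no replacement in raw):
find compacted_raw -path '*created_year=2023/created_month=10/*' -type f | xargs -n1 dirname | sed 's|^compacted_raw/||' | sort -u > /tmp/compacted_parts
find raw -path '*created_year=2023/created_month=10/*' -type f | xargs -n1 dirname | sed 's|^raw/||' | sort -u > /tmp/raw_parts
comm -23 /tmp/compacted_parts /tmp/raw_parts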
# compare sizes ** THIS IS NOT A FOOL-PROOF CHECK **
find compacted_raw/ -path '*created_year=2023/created_month=10/*' -type f | xargs ls -l | awk '{sum += $5} END {print sum}'
find raw/ -path '*created_year=2023/created_month=10/*' -type f | xargs ls -l | awk '{sum += $5} END {print sum}'
# remove the active month files from compacted
find compacted_raw/ -path '*created_year=2023/created_month=10/*' -type f | xargs rm
# Move the active month files from raw to compacted_raw (the find below only generates the mv commands)
find raw -path '*created_year=2023/created_month=10/*' -type f | awk '{print "mv "$1" compacted_"$1}'
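# once the generated mv commands look right, pipe them to sh to execute (this relies on
# the compacted_raw partition directories still existing after the rm above):
find raw -path '*created_year=2023/created_month=10/*' -type f | awk '{print "mv "$1" compacted_"$1}' | sh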
python /home/admin/projects/cypherpunk/cypherpunk_reddit/step_02_parquet_link_list.py
# check parquet sizes
du --max-depth 1
# Run dedupe - it should have no effect; the sizes below should match the check above
du --max-depth 1
# clean up
cd ..
rm -rf bak.parquet
cd json
rm -rf bak.archive_link_list
crontab /home/admin/projects/cypherpunk/cypherpunk_reddit/crontab.l
Data Redundancy
Data Classification
| Catalog | Format | Dataset | Classification | Notes |
|---|---|---|---|---|
| Harvest | JSON | link_list | create, delete | |
| Harvest | JSON | archive_link_list | create | revise sync calls - do not use delete switch |
| Harvest | JSON | discussion | create, (delete - only affects 2 most recent months) | Restructure the storage, move date to year=2023/month=09 |
| Pipeline | Parquet | raw/link_list | create, delete (2x daily) | |
| Pipeline | Parquet | dedupe/link_list | create, delete (2x daily) | |
| Pipeline | Parquet | compacted_raw/link_list | create, delete (~weekly) | |
Sync
Execute On Local (Backup) Host
Pull Discussions
This has to be done separately because the remote harvester host does not keep the full discussion archive. The remote host only has the current and previous month, which can be pulled with --delete.
This will not work if new subreddits have been added on the remote host, because the command relies on the local directory structure to find the subreddits. Create the current-month directories locally to fix that (see the sketch after the command below).
cd /opt/cypherpunk/data/reddit/json
find discussion -name '2023-09' -type d | awk '{print "rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/"$1"/ ./"$1"/"}'
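A minimal sketch of that fix, assuming the month directory sits directly under each subreddit directory (match the depth of the existing subreddit directories; NEW_SUBREDDIT is a placeholder for a subreddit added on the remote host):
cd /opt/cypherpunk/data/reddit/json
mkdir -p discussion/NEW_SUBREDDIT/2023-09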
Pull All Other JSON
rsync -avz --size-only www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/archive_link_list/ /opt/cypherpunk/data/reddit/json/archive_link_list/
rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/link_list/ /opt/cypherpunk/data/reddit/json/link_list/
Pull Compacted Parquet
rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/parquet/compacted_raw/ /opt/cypherpunk/data/reddit/parquet/compacted_raw/
S3 Backup
JSON, Daily
time aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/json/ s3://iterative-chaos/cyphernews/harvest/reddit/json/
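If in doubt, preview what the sync would change first; --dryrun only prints the planned operations. The same applies to the weekly parquet sync below.
aws s3 sync --dryrun --size-only --delete /opt/cypherpunk/data/reddit/json/ s3://iterative-chaos/cyphernews/harvest/reddit/json/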
Parquet Compacted, Weekly
time aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/parquet/compacted_raw/ s3://iterative-chaos/cyphernews/harvest/reddit/parquet/compacted_raw/