CypherResilience: Difference between revisions
Revision as of 20:29, 27 September 2023
Compaction
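# stop cron from running jobs while compacting
# (crontab -r removes the installed crontab; it is reinstalled from crontab.l in the last step)
crontab -r
# check whether any Python or .sh job is still running; wait until it completes
ps -fU admin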
# generate the parquets
export PYTHONPATH=/home/admin/projects/cypherpunk/cypherpunk_reddit/
python /home/admin/projects/cypherpunk/cypherpunk_reddit/step_02_parquet_link_list.py
# backup the archive_link_list
cd /opt/cypherpunk/data/reddit/json
cp -r archive_link_list bak.archive_link_list
# add the new link_list contents
rsync -av link_list/ archive_link_list/
# verify that the new size matches the sum of the old archive and the new link_list
du --max-depth 1
# remove the live link_list jsons (-delete is safe for file names containing whitespace)
find link_list -type f -delete
# backup the existing compacted dir
cd /opt/cypherpunk/data/reddit/parquet/
cp -r compacted_raw bak.compacted_raw
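# copy the generated parquets to the compacted dir (raw/... -> compacted_raw/...)
# dry run: this only prints the cp commands; remove the '#' before '| sh' to execute them
find raw -name '*.parquet' | awk '{print "cp "$1" compacted_"$1}' # | sh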
crontab /home/admin/projects/cypherpunk/cypherpunk_reddit/crontab.l
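The `du` comparison above only checks sizes. As a complementary check before the live `link_list` jsons are removed, the following minimal sketch lists any file that exists under one directory but not under the other. The helper name `missing_from_archive` is hypothetical and not part of the existing scripts; it assumes the archive mirrors the relative paths of `link_list`, which is what the `rsync -av link_list/ archive_link_list/` step should produce.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the original runbook):
# print every file found under the source dir that is missing from the archive dir.
missing_from_archive() {
    src="$1"
    # resolve the archive to an absolute path so the check does not depend on the cwd
    archive="$(cd "$2" && pwd)" || return 1
    ( cd "$src" && find . -type f ) | while IFS= read -r f; do
        # f looks like ./sub/file.json; strip the leading ./ when reporting
        [ -f "$archive/$f" ] || printf '%s\n' "${f#./}"
    done
}

# Example usage (run after the rsync step, before removing the live jsons):
#   cd /opt/cypherpunk/data/reddit/json
#   missing_from_archive link_list archive_link_list
# No output means every live file has a counterpart in the archive.
```

This compares file names only, not contents, so it complements the size check rather than replacing it.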
Data Redundancy
Data Classification
Catalog | Format | Dataset | Classification | Notes |
---|---|---|---|---|
Harvest | JSON | link_list | create, delete | |
Harvest | JSON | archive_link_list | create | revise sync calls - do not use delete switch |
Harvest | JSON | discussion | create, (delete - only affects 2 most recent months) | Restructure the storage, move date to year=2023/month=09 |
Pipeline | Parquet | raw/link_list | create, delete (2x daily) | |
Pipeline | Parquet | dedupe/link_list | create, delete (2x daily) | |
Pipeline | Parquet | compacted_raw/link_list | create, delete (~weekly) | |
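The note on the discussion dataset proposes moving the date into a `year=2023/month=09` layout. A minimal sketch of that name mapping follows, assuming the current month directories are named `YYYY-MM` (as the `find discussion -name '2023-09'` call in the Sync section suggests). The function name `month_to_partition` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not an existing script):
# convert a YYYY-MM directory name into a year=YYYY/month=MM partition path.
month_to_partition() {
    ym="$1"
    case "$ym" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]) ;;
        *) echo "expected YYYY-MM, got: $ym" >&2; return 1 ;;
    esac
    # ${ym%-*} is the part before the dash, ${ym#*-} the part after it
    printf 'year=%s/month=%s\n' "${ym%-*}" "${ym#*-}"
}

# Example usage:
#   month_to_partition 2023-09
```

The sketch only computes the target path; moving the data and updating the sync calls would be separate steps.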
Sync
Pull Discussions
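Discussions have to be pulled separately because the remote harvester host does not hold the full discussion archive. The remote keeps only the current and previous month, so only those month directories can be pulled with --delete. The command below prints one rsync command per matching month directory; pipe its output to sh to run them.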
cd /opt/cypherpunk/data/reddit/json
find discussion -name '2023-09' -type d | awk '{print "rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/"$1"/ ./"$1"/"}'
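The month in the `find` call above is hardcoded. As a minimal sketch, the two months that the remote is said to keep (current and previous) could be derived instead. The function name `recent_months` is hypothetical, and the sketch assumes GNU `date` (for the `-d` option), which is consistent with the GNU-style `du --max-depth` used elsewhere on this page.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not an existing script):
# print the current and previous month as YYYY-MM, one per line.
# An optional YYYY-MM-DD argument overrides today's date. Requires GNU date.
recent_months() {
    ref="${1:-$(date +%Y-%m-%d)}"
    # anchor on the first of the month so month arithmetic cannot overflow
    # (e.g. "March 31 minus one month")
    first="$(date -d "$ref" +%Y-%m-01)" || return 1
    date -d "$first" +%Y-%m
    date -d "$first 1 month ago" +%Y-%m
}

# Example usage: list the month directories to pull
#   cd /opt/cypherpunk/data/reddit/json
#   for m in $(recent_months); do find discussion -name "$m" -type d; done
```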
Pull All Other JSON
rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/link_list/ /opt/cypherpunk/data/reddit/json/link_list/
rsync -avz --size-only www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/archive_link_list/ /opt/cypherpunk/data/reddit/json/archive_link_list/
S3 Backup
JSON, Daily
time aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/json/ s3://iterative-chaos/cyphernews/harvest/reddit/json/
Parquet Compacted, Weekly
time aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/parquet/compacted_raw/ s3://iterative-chaos/cyphernews/harvest/reddit/parquet/compacted_raw/