Latest revision as of 20:40, 9 March 2024

Compaction

Concept

Suppose today is 2023-10-18.

The active speed parquet is: parquet/raw/link_list/day/politics/created_year=2023/created_month=10/abcdefg.parquet

Aside: If it were 2023-10-02, the speed file for created_month=9 would also be active.

The active compacted dir is: parquet/compacted_raw/link_list/day/politics/created_year=2023/created_month=10/

The active link_list json files are in: json/link_list/day/politics/

The archive link_list json files are in: json/archive_link_list/day/politics/
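
The "active month" idea above can be sketched as a small helper. This is only an illustration: the function name active_months and the 7-day lookback window are assumptions (the page does not state how long the previous month stays active after a month boundary), and it relies on GNU date.

```bash
# Print the partition suffixes that are "active" for a given day.
# $1 = day (YYYY-MM-DD)
# $2 = lookback window in days (ASSUMPTION: defaults to 7; not from the source)
active_months() {
    day="$1"
    window="${2:-7}"
    {
        # Month that was current $window days ago (may be the previous month)
        date -d "$day $window days ago" '+created_year=%Y/created_month=%-m'
        # Month containing the day itself
        date -d "$day" '+created_year=%Y/created_month=%-m'
    } | awk '!seen[$0]++'   # drop the duplicate when both fall in one month
}
```

For example, active_months 2023-10-18 prints a single partition, while active_months 2023-10-02 prints the September partition followed by the October one.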

Process:

1. Stop the data pipeline.
2. Run the parquet generation step (cypherpipe/step_2_parquet_link_list.py), which reads from compacted and json.
3. Remove any parquet files in the active compacted_raw.
4. Move the generated parquet files from active raw (speed layer) to active compacted_raw (compacted layer).
5. Move the active json files to the archive (json/archive_link_list).
6. Restart the data pipeline.
7. Back up the data in archive link_list.

Execution

Execute On Remote Host

# stop cron from running jobs while compacting
crontab -r
ps -fU admin
# check to see if any Python or .sh is running, wait till it completes
# generate the parquets
export PYTHONPATH=/home/admin/projects/cypherpunk/cypherpunk_reddit/
python /home/admin/projects/cypherpunk/cypherpunk_reddit/step_02_parquet_link_list.py
# backup the archive_link_list
cd /opt/cypherpunk/data/reddit/json
cp -r archive_link_list bak.archive_link_list
# add the new link_list contents
rsync -av link_list/ archive_link_list/
# verify that the new size matches the sum of the old archive and the new link_list
du --max-depth 1
# good place to pull a copy to another machine
# rsync -avz --size-only www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/archive_link_list/ /opt/cypherpunk/data/reddit/json/archive_link_list/
# remove the live link_list jsons
find link_list -type f | xargs rm
# find link_list/ -type f ! -cnewer link_list/day/worldnews/1701374453-0.json.bz2 | xargs ls -l
# backup the parquet
cd /opt/cypherpunk/data/reddit/
cp -r parquet bak.parquet
# check the active months
cd parquet
find compacted_raw -path '*created_year=2023/created_month=10/*' -type f
# Confirm you have replacements for all of these.
# Do not compact within one week of a month boundary. (improve this)
# when doing a month boundary, have at least a week on each side.
find raw -path '*created_year=2023/created_month=10/*' -type f
# compare sizes ** THIS IS NOT A FOOL-PROOF CHECK **
find compacted_raw/ -path '*created_year=2023/created_month=10/*' -type f | xargs ls -l | awk '{sum += $5} END {print sum}'
find raw/ -path '*created_year=2023/created_month=10/*' -type f | xargs ls -l | awk '{sum += $5} END {print sum}'
# remove the active month files from compacted
find compacted_raw/ -path '*created_year=2023/created_month=10/*' -type f | xargs rm
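# make the dirs in compacted_raw
# (this only prints the mkdir commands; review them, then append "| sh" to run them)
find raw/ -path '*created_year=2023/created_month=10' -type d | awk '{print "mkdir -p compacted_"$1}'
# Move the active month files from raw to compacted_raw
# (this also only prints the mv commands; review them, then append "| sh" to run them)
find raw -path '*created_year=2023/created_month=10/*' -type f | awk '{print "mv "$1" compacted_"$1}'
# re-run the parquet generation step
python /home/admin/projects/cypherpunk/cypherpunk_reddit/step_02_parquet_link_list.py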
# check parquet sizes
du --max-depth 1
# Run dedupe - should have no effect
du --max-depth 1
# clean up
cd ..
rm -rf bak.parquet
cd json
rm -rf bak.archive_link_list
crontab /home/admin/projects/cypherpunk/cypherpunk_reddit/crontab.l
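
The size comparison and the "confirm you have replacements" step above can be made a little less error-prone. The following is a minimal sketch, not part of the original procedure: the function names total_bytes and missing_replacements are hypothetical, and it assumes that a "replacement" means the matching partition directory under raw contains at least one file. It needs GNU find and is meant to be run from the parquet directory.

```bash
# Sum the sizes (in bytes) of all regular files under a directory.
# Prints 0 for an empty tree instead of falling back to listing the cwd.
total_bytes() {
    find "$1" -type f -printf '%s\n' | awk '{sum += $1} END {print sum + 0}'
}

# Print every partition directory that holds files in compacted_raw but has
# no files in the matching directory under raw (ASSUMED meaning of
# "no replacement").
# $1 = month suffix, e.g. 'created_year=2023/created_month=10'
missing_replacements() {
    find compacted_raw -path "*$1/*" -type f -printf '%h\n' | sort -u |
    while IFS= read -r dir; do
        raw_dir="raw/${dir#compacted_raw/}"
        if [ -z "$(find "$raw_dir" -type f 2>/dev/null | head -n 1)" ]; then
            echo "$dir"
        fi
    done
}
```

An empty result from missing_replacements 'created_year=2023/created_month=10' suggests every compacted partition has a counterpart in raw. Like the size check, it is not a proof that the contents are complete.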

Data Redundancy

Data Classification


Catalog	Format	Dataset	Classification	Notes
Harvest	JSON	link_list	create, delete
Harvest	JSON	archive_link_list	create	revise sync calls - do not use delete switch
Harvest	JSON	discussion	create, (delete - only affects 2 most recent months)	Restructure the storage, move date to year=2023/month=09
Pipeline	Parquet	raw/link_list	create, delete (2x daily)
Pipeline	Parquet	dedupe/link_list	create, delete (2x daily)
Pipeline	Parquet	compacted_raw/link_list	create, delete (~weekly)

Sync

Execute On Local (Backup) Host

Pull All The New Stuff

rsync -avz --size-only www.iterativechaos.com:/opt/cypherpunk/data/reddit/ /opt/cypherpunk/data/reddit/

Delete Discussions

This has to be done separately because the remote harvester host does not hold the full discussion archive. The remote keeps only the current and previous month, so only those month directories can be pulled with --delete.

This will not work for subreddits newly added on the remote host, because the command relies on the local directory structure to find the subreddits. Create the current month directories locally to fix that.

cd /opt/cypherpunk/data/reddit/json
find discussion -name '2023-09' -type d | awk '{print "rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/"$1"/ ./"$1"/"}'
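
The find command above hard-codes a single month ('2023-09') and only prints the rsync commands. As a hedged sketch, the two month names that the remote is said to hold (current and previous) could be derived rather than typed by hand. The helper name recent_months is hypothetical, it assumes the month directories are named YYYY-MM as in the example, and it requires GNU date.

```bash
# Print the directory names (YYYY-MM) for the month containing the given day
# and for the month before it.
# $1 = day (YYYY-MM-DD)
recent_months() {
    # Anchor on the first of the month so "1 month ago" never skips a month
    first_of_month=$(date -d "$1" +%Y-%m-01)
    date -d "$first_of_month" +%Y-%m
    date -d "$first_of_month 1 month ago" +%Y-%m
}
```

A possible use is a loop such as: for m in $(recent_months "$(date +%F)"); do find discussion -name "$m" -type d; done, with the output fed to the same awk command as above.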

Delete All Other JSON

rsync -navz --size-only www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/archive_link_list/ /opt/cypherpunk/data/reddit/json/archive_link_list/
rsync -navz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/link_list/ /opt/cypherpunk/data/reddit/json/link_list/

Delete Parquet

rsync -navz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/parquet/ /opt/cypherpunk/data/reddit/parquet/

gigabrick backup

ssh 192.168.0.8
lsblk
udisksctl mount -b /dev/sdb2

rsync -navz --size-only --delete /opt/cypherpunk/data/reddit/ 192.168.0.8:/media/bob/backup/cypherpunk/data/reddit/

S3 Backup

JSON, Daily

time aws s3 sync --size-only --delete --dryrun /opt/cypherpunk/data/reddit/json/ s3://iterative-chaos/cyphernews/harvest/reddit/json/

Parquet Compacted, Weekly

time aws s3 sync --size-only --delete --dryrun /opt/cypherpunk/data/reddit/parquet/compacted_raw/ s3://iterative-chaos/cyphernews/harvest/reddit/parquet/compacted_raw/

CypherResilience: Difference between revisions


@@ Line 1: / Line 1: @@
+[[Category:CypherTech]]
+= Compaction =
+== Concept ==
+Suppose today is 2023-10-18.
+The active speed parquet is: parquet/raw/link_list/day/politics/created_year=2023/created_month=10/abcdefg.parquet
+Aside: If it were 2023-10-02, the speed file for created_month=9 would also be active.
+The active compacted dir is: parquet/compacted_raw/link_list/day/politics/created_year=2023/created_month=10/
+The active link_list json files are in: json/link_list/day/politics/
+The archive link_list json files are in: json/archive_link_list/day/politics/
+Process:
+# stop data pipeline
+# Run parquet generation step (cypherpipe/step_2_parquet_link_list.py), which reads from compacted and json.
+# Remove any parquet files in the active compacted_raw.
+# Move generated parquet files from active raw (speed layer) to active compacted_raw (compacted layer).
+# Move active json files to json_archive.
+# restart data pipeline
+# Backup data in archive link_list.
+== Execution ==
+'''''Execute On Remote Host'''''
+<syntaxhighlight lang="bash">
+# stop cron from running jobs while compacting
+crontab -r
+ps -fU admin
+# check to see if any Python or .sh is running, wait till it completes
+# generate the parquets
+export PYTHONPATH=/home/admin/projects/cypherpunk/cypherpunk_reddit/
+python /home/admin/projects/cypherpunk/cypherpunk_reddit/step_02_parquet_link_list.py
+# backup the archive_link_list
+cd /opt/cypherpunk/data/reddit/json
+cp -r archive_link_list bak.archive_link_list
+# add the new link_list contents
+rsync -av link_list/ archive_link_list/
+# verify that the new size matches the sum of the old archive and the new link_list
+du --max-depth 1
+# good place to pull a copy to another machine
+# rsync -avz --size-only www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/archive_link_list/ /opt/cypherpunk/data/reddit/json/archive_link_list/
+# remove the live link_list jsons
+find link_list -type f | xargs rm
+# find link_list/ -type f ! -cnewer link_list/day/worldnews/1701374453-0.json.bz2 | xargs ls -l
+# backup the parquet
+cd /opt/cypherpunk/data/reddit/
+cp -r parquet bak.parquet
+# check the active months
+cd parquet
+find compacted_raw -path '*created_year=2023/created_month=10/*' -type f
+# Confirm you have replacements for all of these.
+# Do not compact within one week of a month boundary. (improve this)
+# when doing a month boundary, have at least a week on each side.
+find raw -path '*created_year=2023/created_month=10/*' -type f
+# compare sizes ** THIS IS NOT A FOOL-PROOF CHECK **
+find compacted_raw/ -path '*created_year=2023/created_month=10/*' -type f | xargs ls -l | awk '{sum += $5} END {print sum}'
+find raw/ -path '*created_year=2023/created_month=10/*' -type f | xargs ls -l | awk '{sum += $5} END {print sum}'
+# remove the active month files from compacted
+find compacted_raw/ -path '*created_year=2023/created_month=10/*' -type f | xargs rm
+# make the dirs in compacted_raw
+find raw/ -path '*created_year=2024/created_month=2' -type d | awk '{print "mkdir compacted_"$1}'
+# Move the active month files from raw to compacted_raw
+find raw -path '*created_year=2023/created_month=10/*' -type f | awk '{print "mv "$1" compacted_"$1}'
+python /home/admin/projects/cypherpunk/cypherpunk_reddit/step_02_parquet_link_list.py
+# check parquet sizes
+du --max-depth 1
+# Run dedupe - should have no effect
+du --max-depth 1
+# clean up
+cd ..
+rm -rf bak.parquet
+cd json
+rm -rf bak.archive_link_list
+crontab /home/admin/projects/cypherpunk/cypherpunk_reddit/crontab.l
+</syntaxhighlight>
 = Data Redundancy =
-== Data Categorization ==
+== Data Classification ==
 {| class="wikitable"
 |+
-!
+!Catalog
-!
+!Format
-!
+!Dataset
-!
+!Classification
+!Notes
 |-
+|Harvest
+|JSON
+|link_list
+|create, delete
 |
-|
+|-
-|
+|Harvest
+|JSON
+|archive_link_list
+|create
+|revise sync calls - do not use delete switch
+|-
+|Harvest
+|JSON
+|discussion
+|create, (delete - only affects 2 most recent months)
+|Restructure the storage, move date to year=2023/month=09
+|-
+|Pipeline
+|Parquet
+|raw/link_list
+|create, delete (2x daily)
 |
 |-
-|
+|Pipeline
-|
+|Parquet
-|
+|dedupe/link_list
+|create, delete (2x daily)
 |
 |-
-|
+|Pipeline
-|
+|Parquet
-|
+|compacted_raw/link_list
+|create, delete (~weekly)
 |
 |}
 == Sync ==
-=== Pull Discussions ===
+'''''Execute On Local (Backup) Host'''''
+=== Pull All The New Stuff ===
+<pre>
+rsync -avz --size-only www.iterativechaos.com:/opt/cypherpunk/data/reddit/ /opt/cypherpunk/data/reddit/
+</pre>
+=== Delete Discussions ===
 This has to be done separately because the remote harvester host will not have the full discussion archive. Remote will have the current and previous month, which can be pulled with --delete.
+This will not work if new subreddits have been added on the remote host, as it relies on the local directory structure to find the subreddits. Create the current month directories to fix that.
 <pre>
 cd /opt/cypherpunk/data/reddit/json
@@ Line 32: / Line 139: @@
 </pre>
-=== Pull All Other JSON ===
+=== Delete All Other JSON ===
+<pre>
+rsync -navz --size-only www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/archive_link_list/ /opt/cypherpunk/data/reddit/json/archive_link_list/
+rsync -navz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/link_list/ /opt/cypherpunk/data/reddit/json/link_list/
+</pre>
+=== Delete Parquet ===
+<pre>
+rsync -navz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/parquet/ /opt/cypherpunk/data/reddit/parquet/
+</pre>
+=== gigabrick backup ===
+<pre>
+ssh 192.168.0.8
+lsblk
+udisksctl mount -b /dev/sdb2
+</pre>
 <pre>
-rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/link_list/ /opt/cypherpunk/data/reddit/json/link_list/
+rsync -navz --size-only --delete /opt/cypherpunk/data/reddit/ 192.168.0.8:/media/bob/backup/cypherpunk/data/reddit/
-rsync -avz --size-only --delete www.iterativechaos.com:/opt/cypherpunk/data/reddit/json/archive_link_list/ /opt/cypherpunk/data/reddit/json/archive_link_list/
 </pre>
@@ Line 41: / Line 163: @@
 ==== JSON, Daily ====
 <pre>
-time aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/json/ s3://iterative-chaos/cyphernews/harvest/reddit/json/
+time aws s3 sync --size-only --delete --dryrun /opt/cypherpunk/data/reddit/json/ s3://iterative-chaos/cyphernews/harvest/reddit/json/
 </pre>
 ==== Parquet Compacted, Weekly ====
 <pre>
-time aws s3 sync --size-only --delete /opt/cypherpunk/data/reddit/parquet/compacted_raw/ s3://iterative-chaos/cyphernews/harvest/reddit/parquet/compacted_raw/
+time aws s3 sync --size-only --delete --dryrun /opt/cypherpunk/data/reddit/parquet/compacted_raw/ s3://iterative-chaos/cyphernews/harvest/reddit/parquet/compacted_raw/
 </pre>
