CypherReddit

From Traxel Wiki


Data Processing Pipeline

  1. $ python step_01_get_link_lists.py
    1. Pulls 24h top 100 every 4 hours
    2. Pulls 7d top 1,000 every 24 hours
    3. Stores as json.bz2, in pages of 100 links.
  2. $ python step_02_parquet_link_list.py
    1. Unions all json.bz2 rows into Parquet (i.e., the raw layer).
    2. Uses compaction; it gets slow after about a week without a compaction run.
  3. $ python step_03_dedupe_link_list_parquet.py
    1. Takes the most recent entry for each link_id from the raw layer (see the sketch after this list).
    2. Writes to Parquet subreddit/day files.
  4. $ python step_04_get_discussions.py
    1. Finds discussions above a threshold of comments and upvotes.
    2. Downloads the discussion and stores it as json.bz2.
    3. Skips the download if a file already exists and the comment count has grown by no more than 20%.
    4. TODO: should do a "final download" after a week (or whatever).
  5. $ python step_05_discussion_to_gpt.py
    1. Finds the top few discussions in a set of subreddits.
    2. Pulls the top N tokens' worth of comments from each discussion.
    3. Sends to GPT to summarize.
    4. Writes each summary to html/summary/{link-id}.html
    5. Writes the list to archive/gpt_stories_{edition_id}.json
  6. $ python step_06_gpt_to_html.py
    1. Loads the 10 most recent gpt_stories_*.json files.
    2. Generates the news.html file.
  7. $ rsync -avz html/summary /var/www/html/
  8. $ cp html/news.html /var/www/html/
  • 2023-08-26: Overnight buildup: 3,649 (bug)
  • 2023-08-27: Overnight buildup: 369
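
Step 3 boils down to "keep the newest row for each link_id." Below is a minimal sketch of that dedupe with pandas, assuming the raw layer is a Parquet dataset with link_id and harvest_ts columns; the column names and paths are assumptions, not the actual step_03 code.

import pandas as pd

def dedupe_link_list(raw_path: str, out_path: str) -> None:
    # Read the raw layer; a directory of Parquet files works with the pyarrow engine.
    df = pd.read_parquet(raw_path)
    # Keep only the most recently harvested row for each link_id.
    latest = (
        df.sort_values("harvest_ts")
          .drop_duplicates(subset="link_id", keep="last")
    )
    latest.to_parquet(out_path, index=False)

dedupe_link_list(
    "../data/reddit/parquet/link_list/",                  # hypothetical input path
    "../data/reddit/parquet/link_list_deduped.parquet",   # hypothetical output path
)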

Interesting Subreddits

  • aitah
  • antiwork
  • ask
  • askmen
  • askreddit
  • askscience
  • chatgpt
  • conservative
  • dataisbeautiful
  • explainlikeimfive
  • latestagecapitalism
  • leopardsatemyface
  • lifeprotips
  • news
  • nostupidquestions
  • outoftheloop
  • personalfinance
  • politics
  • programmerhumor
  • science
  • technology
  • todayilearned
  • tooafraidtoask
  • twoxchromosomes
  • unpopularopinion
  • worldnews
  • youshouldknow

Reddit OAuth2

Example Curl Request

curl \
  -X POST \
  -d 'grant_type=password&username=reddit_bot&password=snoo' \
  --user 'p-jcoLKBynTLew:gko_LXELoV07ZBNUXrvWZfzE3aI' \
  https://www.reddit.com/api/v1/access_token

Real Curl Request

curl \
  -X POST \
  -d 'grant_type=client_credentials' \
  --user 'client_id:client_secret' \
  https://www.reddit.com/api/v1/access_token

One Line

curl -X POST -d 'grant_type=client_credentials' --user 'client_id:client_secret' https://www.reddit.com/api/v1/access_token

OAuth Data Calls

$ curl -H "Authorization: bearer J1qK1c18UUGJFAzz9xnH56584l4" -A "Traxelbot/0.1 by rbb36" https://oauth.reddit.com/api/v1/me
$ curl -H "Authorization: bearer J1qK1c18UUGJFAzz9xnH56584l4" -A "Traxelbot/0.1 by rbb36" https://oauth.reddit.com/r/news/top?t=day&limit=100
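
The same two calls can be made from Python with the requests library. This is a sketch of the client_credentials flow from the "Real Curl Request" above; the credentials are placeholders, and only the endpoints and user agent string come from the calls shown here.

import requests

CLIENT_ID = "client_id"          # placeholder
CLIENT_SECRET = "client_secret"  # placeholder
USER_AGENT = "Traxelbot/0.1 by rbb36"

def get_token() -> str:
    # POST to the token endpoint with HTTP basic auth (client_id:client_secret).
    resp = requests.post(
        "https://www.reddit.com/api/v1/access_token",
        data={"grant_type": "client_credentials"},
        auth=(CLIENT_ID, CLIENT_SECRET),
        headers={"User-Agent": USER_AGENT},
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def get_top_links(token: str, subreddit: str = "news") -> dict:
    # Authenticated listing call against oauth.reddit.com.
    resp = requests.get(
        f"https://oauth.reddit.com/r/{subreddit}/top",
        params={"t": "day", "limit": 100},
        headers={"Authorization": f"bearer {token}",
                 "User-Agent": USER_AGENT},
    )
    resp.raise_for_status()
    return resp.json()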

Reddit Python

t3 fields of interest

  1. "url_overridden_by_dest": "https://www.nbcnews.com/politics/donald-trump/live-blog/trump-georgia-indictment-rcna98900",
  2. "url": "https://www.nbcnews.com/politics/donald-trump/live-blog/trump-georgia-indictment-rcna98900",
  3. "title": "What infamous movie plot hole has an explanation that you're tired of explaining?",
  4. "downs": 0,
  5. "upvote_ratio": 0.94,
  6. "ups": 10891,
  7. "score": 10891,
  8. "created": 1692286512.0,
  9. "num_comments": 8112,
  10. "created_utc": 1692286512.0,

Minimal Term Set

hands, mouth, eyes, head, ears, nose, face, legs, teeth, fingers, breasts, skin, bones, blood,
be born, children, men, women, mother, father, wife, husband,
long, round, flat, hard, soft, sharp, smooth, heavy, sweet,
stone, wood, made of,
be on something, at the top, at the bottom, in front, around,
sky, ground, sun, during the day, at night, water, fire, rain, wind,
day,
creature, tree, grow (in ground), egg, tail, wings, feathers, bird, fish, dog,
we, know (someone), be called,
hold, sit, lie, stand, sleep,
play, laugh, sing, make, kill, eat, drink,
river, mountain, jungle/forest, desert, sea, island,
rain, wind, snow, ice, air,
flood, storm, drought, earthquake,
east, west, north, south,
bird, fish, tree,
dog, cat, horse, sheep, goat, cow, pig (camel, buffalo, caribou, seal, etc.),
mosquitoes, snake, flies,
family, we,
year, month, week, clock, hour,
house, village, city,
school, hospital, doctor, nurse, teacher, soldier,
country, government, the law, vote, border, flag, passport,
meat, rice, wheat, corn (yams, plantain, etc.), flour, salt, sugar, sweet,
knife, key, gun, bomb, medicines,
paper, iron, metal, glass, leather, wool, cloth, thread,
gold, rubber, plastic, oil, coal, petrol,
car, bicycle, plane, boat, train, road, wheel, wire, engine, pipe, telephone, television, phone, computer,
read, write, book, photo, newspaper, film,
money, God, war, poison, music,
go/went, burn, fight, buy/pay, learn,
clean