Revision as of 17:06, 5 October 2023

Data Processing Process

$ python get_link_lists.py # daily
1. Needs data archival / backup.
  1. Doing a brute-force full backup right now; that might be sufficient for the time being.
    1. $ scp -i key.pem reddit-link-list.tar.bz2 admin@www.iterativechaos.com:./
  2. Can do just week/day going forward, but that will quickly get slow.
  3. Storing the files as .json.bz2 would be a big improvement
$ python parquet_link_list.py
1. (done) Make this iterate and do a full refresh each time
2. Reconsider full refresh when processing time goes over a minute (currently runs in 12 seconds; not sure whether that is real or user time)
3. Old Version:
  1. $ python parquet_link_list.py ../data/reddit/link_list/day/science/ ../data/reddit/parquet/link_list/day/science/
$ python dedupe_link_list_parquet.py
1. (done) Make this iterate and do a full refresh each time
2. Reconsider full refresh when processing time goes over a minute (currently runs in 2 seconds, real time)
$ python get_discussions.py
1. Make this check the existing download before re-fetching (compare harvest timestamp and num_comments)
$ python discussion_to_gpt.py
1. Make this iterate and do a full refresh each time
2. Reconsider full refresh when processing time goes over a minute
Hit ChatGPT with output
1. Change this to API call to ChatGPT
2. Make this check the existing generated output (harvest timestamp versus generation timestamp)
3. Add data archival / backup
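
The note above about storing files as .json.bz2 could look roughly like the following. This is a minimal sketch using only the Python standard library; the function names `write_json_bz2` and `read_json_bz2` and the file layout are hypothetical and are not taken from get_link_lists.py.

```python
import bz2
import json
from pathlib import Path


def write_json_bz2(path, payload):
    """Serialize payload as JSON and write it bz2-compressed to path."""
    path = Path(path)
    # Create the parent directories (e.g. day/science/) if they are missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    with bz2.open(path, "wt", encoding="utf-8") as handle:
        json.dump(payload, handle)


def read_json_bz2(path):
    """Read a bz2-compressed JSON file back into Python objects."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Compressing each file at write time would keep the backup tarball small without a separate compression pass, at the cost of the files no longer being directly greppable.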

2023-08-26: Overnight buildup: 3,649 (bug)
2023-08-27: Overnight buildup: 369
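
The get_discussions.py item in the list above asks for a check against the existing download, the harvest timestamp, and num_comments. One possible reading of that check is sketched below. The record shape (dicts with `harvested_utc` and `num_comments` keys), the function name `needs_refresh`, and the `min_new_comments` threshold are all assumptions made for illustration.

```python
def needs_refresh(existing, link, min_new_comments=1):
    """Decide whether a discussion should be downloaded again.

    existing: metadata of the stored download, or None if never downloaded
              (hypothetical shape: {"harvested_utc": float, "num_comments": int}).
    link:     the latest link-list entry for the same post, same shape.
    """
    if existing is None:
        # Nothing on disk yet, so the discussion must be fetched.
        return True
    if link["harvested_utc"] <= existing["harvested_utc"]:
        # The stored download is at least as fresh as the link list entry.
        return False
    # Re-fetch only when enough new comments have appeared since the download.
    return link["num_comments"] - existing["num_comments"] >= min_new_comments
```

A threshold greater than 1 would trade freshness for fewer API calls on very active threads.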

Interesting Subreddits

aitah
antiwork
ask
askmen
askreddit
askscience
chatgpt
conservative
dataisbeautiful
explainlikeimfive
latestagecapitalism
leopardsatemyface
lifeprotips
news
nostupidquestions
outoftheloop
personalfinance
politics
programmerhumor
science
technology
todayilearned
tooafraidtoask
twoxchromosomes
unpopularopinion
worldnews
youshouldknow

Reddit OAuth2

Example Curl Request

curl \
  -X POST \
  -d 'grant_type=password&username=reddit_bot&password=snoo' \
  --user 'p-jcoLKBynTLew:gko_LXELoV07ZBNUXrvWZfzE3aI' \
  https://www.reddit.com/api/v1/access_token

Real Curl Request

curl \
  -X POST \
  -d 'grant_type=client_credentials' \
  --user 'client_id:client_secret' \
  https://www.reddit.com/api/v1/access_token

One Line

curl -X POST -d 'grant_type=client_credentials' --user 'client_id:client_secret' https://www.reddit.com/api/v1/access_token
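
The same client_credentials request can be made from Python. The sketch below uses only the standard library (urllib); the function names are hypothetical, and it assumes the token response is JSON with an `access_token` field. Building the request is kept separate from sending it so the request can be inspected without touching the network.

```python
import base64
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://www.reddit.com/api/v1/access_token"


def build_token_request(client_id, client_secret, user_agent):
    """Build the POST request equivalent to the curl command above."""
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode("ascii")
    # curl's --user flag sends HTTP Basic auth: base64("client_id:client_secret").
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    basic = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"Authorization": f"Basic {basic}", "User-Agent": user_agent},
        method="POST",
    )


def fetch_access_token(client_id, client_secret, user_agent):
    """Send the request and return the token string (performs a network call)."""
    request = build_token_request(client_id, client_secret, user_agent)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["access_token"]
```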

OAuth Data Call

$ curl -H "Authorization: bearer J1qK1c18UUGJFAzz9xnH56584l4" -A "Traxelbot/0.1 by rbb36" https://oauth.reddit.com/api/v1/me
$ curl -H "Authorization: bearer J1qK1c18UUGJFAzz9xnH56584l4" -A "Traxelbot/0.1 by rbb36" "https://oauth.reddit.com/r/news/top?t=day&limit=100"

https://old.reddit.com/r/worldnews/top/?sort=top&t=day
/r/subreddit/top?t=day&limit=100
count=100&
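
The /r/subreddit/top?t=day&limit=100 pattern above can be built in Python so the query string is always encoded correctly. This is a sketch with hypothetical helper names; it only constructs the URL and request and does not send anything.

```python
import urllib.parse
import urllib.request

OAUTH_BASE = "https://oauth.reddit.com"


def top_listing_url(subreddit, period="day", limit=100):
    """Return the OAuth listing URL for a subreddit's top posts."""
    query = urllib.parse.urlencode({"t": period, "limit": limit})
    return f"{OAUTH_BASE}/r/{subreddit}/top?{query}"


def build_listing_request(subreddit, access_token, user_agent, period="day", limit=100):
    """Build a GET request carrying the bearer token and User-Agent header."""
    return urllib.request.Request(
        top_listing_url(subreddit, period, limit),
        headers={
            "Authorization": f"bearer {access_token}",
            "User-Agent": user_agent,
        },
    )
```

Passing the result of `build_listing_request` to `urllib.request.urlopen` would perform the actual call.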

Reddit Python

pip install aiofiles aiohttp  # asyncio is part of the Python 3 standard library; no install needed
https://realpython.com/async-io-python/
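
A minimal sketch of the async pattern these notes point toward: fetching several subreddits concurrently while capping the number of requests in flight. It uses only asyncio from the standard library. `fetch_all` and its `fetch_one` parameter are hypothetical; in real use `fetch_one` would presumably wrap an aiohttp session call, which is not shown here.

```python
import asyncio


async def fetch_all(subreddits, fetch_one, max_concurrent=4):
    """Run fetch_one(name) for every subreddit, at most max_concurrent at a time.

    fetch_one is any coroutine function taking a subreddit name; it is passed in
    so this sketch does not depend on a particular HTTP client.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(name):
        # The semaphore limits how many fetches run at the same moment.
        async with semaphore:
            return name, await fetch_one(name)

    pairs = await asyncio.gather(*(guarded(name) for name in subreddits))
    return dict(pairs)
```

Usage would be along the lines of `asyncio.run(fetch_all(["news", "science"], fetch_one))`, returning a dict keyed by subreddit name.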

t3 fields of interest

"url_overridden_by_dest": "https://www.nbcnews.com/politics/donald-trump/live-blog/trump-georgia-indictment-rcna98900",
"url": "https://www.nbcnews.com/politics/donald-trump/live-blog/trump-georgia-indictment-rcna98900",
"title": "What infamous movie plot hole has an explanation that you're tired of explaining?",
"downs": 0,
"upvote_ratio": 0.94,
"ups": 10891,
"score": 10891,
"created": 1692286512.0,
"num_comments": 8112,
"created_utc": 1692286512.0,

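The fields listed above can be pulled out of a t3 (link) entry with a small helper. This sketch assumes the usual listing shape, where each child is `{"kind": "t3", "data": {...}}`; the function name `summarize_link`, the chosen field subset, and the added `created_at` key are illustrative choices, not part of the Reddit API.

```python
from datetime import datetime, timezone

# Subset of the t3 fields of interest listed above.
FIELDS = ("url", "title", "score", "upvote_ratio", "num_comments", "created_utc")


def summarize_link(child):
    """Reduce one t3 listing child to the fields of interest."""
    data = child["data"]
    row = {name: data.get(name) for name in FIELDS}
    # created_utc is a Unix timestamp in seconds; add a readable UTC form.
    row["created_at"] = datetime.fromtimestamp(
        data["created_utc"], tz=timezone.utc
    ).isoformat()
    return row
```
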
Minimal Term Set

hands, mouth, eyes, head, ears, nose, face, legs, teeth, fingers, breasts, skin, bones, blood,
be born, children, men, women, mother, father, wife, husband,
long, round, flat, hard, soft, sharp, smooth, heavy, sweet,
stone, wood, made of,
be on something, at the top, at the bottom, in front, around,
sky, ground, sun, during the day, at night, water, fire, rain, wind,
day,
creature, tree, grow (in ground), egg, tail, wings, feathers, bird, fish, dog,
we, know (someone), be called,
hold, sit, lie, stand, sleep,
play, laugh, sing, make, kill, eat, drink,
river, mountain, jungle/forest, desert, sea, island,
rain, wind, snow, ice, air,
flood, storm, drought, earthquake,
east, west, north, south,
bird, fish, tree,
dog, cat, horse, sheep, goat, cow, pig (camel, buffalo, caribou, seal, etc.),
mosquitoes, snake, flies,
family, we,
year, month, week, clock, hour,
house, village, city,
school, hospital, doctor, nurse, teacher, soldier,
country, government, the law, vote, border, flag, passport,
meat, rice, wheat, corn (yams, plantain, etc.), flour, salt, sugar, sweet,
knife, key, gun, bomb, medicines,
paper, iron, metal, glass, leather, wool, cloth, thread,
gold, rubber, plastic, oil, coal, petrol,
car, bicycle, plane, boat, train, road, wheel, wire, engine, pipe, telephone, television, phone, computer,
read, write, book, photo, newspaper, film,
money, God, war, poison, music,
go/went, burn, fight, buy/pay, learn,
clean

Reddit OAuth2 references

https://www.reddit.com/r/redditdev/wiki/oauth2/explanation/
https://www.reddit.com/dev/api/oauth/
https://github.com/reddit-archive/reddit/wiki/OAuth2
https://github.com/reddit-archive/reddit/wiki/OAuth2#application-only-oauth

CypherReddit: Difference between revisions
