CypherReddit
[[Category:CypherTech]]
= Data Processing Process =
# $ python step_01_get_link_lists.py
## Pulls the 24-hour top 100 every 4 hours.
## Pulls the 7-day top 1,000 every 24 hours.
## Stores results as json.bz2, in pages of 100 links.
# $ python step_02_parquet_link_list.py
## Unions all json.bz2 rows into Parquet (i.e., the raw layer).
## Uses compaction; gets slow after about a week without a compaction run.
# $ python step_03_dedupe_link_list_parquet.py
## Takes the most recent entry for each link_id from the raw layer.
## Writes to Parquet subreddit/day files.
# $ python step_04_get_discussions.py
## Finds discussions above a threshold of comments and upvotes.
## Downloads each discussion and stores it as json.bz2.
## Skips the download if a file already exists and the comment count has grown by no more than 20%.
## TODO: should do a "final download" after a week (or whatever).
# $ python step_05_discussion_to_gpt.py
## Finds the top few discussions in a set of subreddits.
## Pulls the top N tokens' worth of comments from each discussion.
## Sends them to GPT to summarize.
## Writes each summary to html/summary/{link-id}.html
## Writes the list to archive/gpt_stories_{edition_id}.json
# $ python step_06_gpt_to_html.py
## Loads the 10 most recent archive/gpt_stories_*.json files.
## Generates the html/news.html file.
# $ rsync -avz html/summary /var/www/html/
# $ cp html/news.html /var/www/html/
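The core of step 03 (keeping only the newest snapshot per link_id) can be sketched in plain Python. This is a minimal illustration, not the script itself; the retrieved_utc field name and row shape are assumptions about the raw-layer schema, not its actual columns.

```python
def dedupe_link_list(rows):
    """Keep the most recent snapshot per link_id (the core of step 03).

    Each row is one raw-layer snapshot of a link; 'retrieved_utc' is a
    hypothetical snapshot-time field standing in for the real schema.
    """
    latest = {}
    for row in rows:
        link_id = row["link_id"]
        if link_id not in latest or row["retrieved_utc"] > latest[link_id]["retrieved_utc"]:
            latest[link_id] = row
    return list(latest.values())

# Two snapshots of t3_a; the newer one (25 comments) should survive.
raw = [
    {"link_id": "t3_a", "num_comments": 10, "retrieved_utc": 1692286512.0},
    {"link_id": "t3_a", "num_comments": 25, "retrieved_utc": 1692300000.0},
    {"link_id": "t3_b", "num_comments": 7,  "retrieved_utc": 1692290000.0},
]
deduped = dedupe_link_list(raw)
```

In the real pipeline the same keep-latest-per-key operation would run over the Parquet raw layer rather than in-memory dicts.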
* 2023-08-26: Overnight buildup: 3,649 (bug)
* 2023-08-27: Overnight buildup: 369
* Note: The discussion mutates faster than the story summary does. Therefore, the regeneration rate of the discussion tree should be decoupled from the regeneration rate of the story summary.
** Corollary: Since the regeneration rate of the discussion tree differs from that of the summary, the content store may benefit from being updated at different rates. (e.g., if using static HTML files, those might be generated by different processes, or the summary might be static HTML while the discussion could live in a Nostr relay or graph database.)
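The decoupling idea above can be sketched as a single policy object holding two independent refresh intervals. The interval values and names here are hypothetical illustrations, not what the pipeline actually uses.

```python
from dataclasses import dataclass

@dataclass
class RefreshPolicy:
    """Hypothetical decoupled refresh intervals for the two artifacts."""
    discussion_interval_s: float = 4 * 3600   # discussion tree mutates fast
    summary_interval_s: float = 24 * 3600     # story summary mutates slowly

    def needs_refresh(self, last_discussion: float, last_summary: float, now: float) -> dict:
        # Each artifact is checked against its own clock, independently.
        return {
            "discussion": now - last_discussion >= self.discussion_interval_s,
            "summary": now - last_summary >= self.summary_interval_s,
        }

policy = RefreshPolicy()
# Five hours after both were last built: only the discussion tree is stale.
flags = policy.needs_refresh(last_discussion=0.0, last_summary=0.0, now=5 * 3600)
```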
= Interesting Subreddits =
* aitah
* antiwork
* ask
* askmen
* askreddit
* askscience
* chatgpt
* conservative
* dataisbeautiful
* explainlikeimfive
* latestagecapitalism
* leopardsatemyface
* lifeprotips
* news
* nostupidquestions
* outoftheloop
* personalfinance
* politics
* programmerhumor
* science
* technology
* todayilearned
* tooafraidtoask
* twoxchromosomes
* unpopularopinion
* worldnews
* youshouldknow
= Reddit OAuth2 =
* https://www.reddit.com/r/redditdev/wiki/oauth2/explanation/
* https://www.reddit.com/dev/api/oauth/
* https://github.com/reddit-archive/reddit/wiki/OAuth2
** https://github.com/reddit-archive/reddit/wiki/OAuth2#application-only-oauth
'''Example Curl Request'''
<syntaxhighlight lang="bash" line>
curl \
  -X POST \
  -d 'grant_type=password&username=reddit_bot&password=snoo' \
  --user 'p-jcoLKBynTLew:gko_LXELoV07ZBNUXrvWZfzE3aI' \
  https://www.reddit.com/api/v1/access_token
</syntaxhighlight>
'''Real Curl Request'''
<syntaxhighlight lang="bash" line>
curl \
  -X POST \
  -d 'grant_type=client_credentials' \
  --user 'client_id:client_secret' \
  https://www.reddit.com/api/v1/access_token
</syntaxhighlight>
'''One Line'''
<syntaxhighlight lang="bash" line>
curl -X POST -d 'grant_type=client_credentials' --user 'client_id:client_secret' https://www.reddit.com/api/v1/access_token
</syntaxhighlight>
'''OAuth Data Call'''
<syntaxhighlight lang="bash" line>
$ curl -H "Authorization: bearer J1qK1c18UUGJFAzz9xnH56584l4" -A "Traxelbot/0.1 by rbb36" https://oauth.reddit.com/api/v1/me
$ curl -H "Authorization: bearer J1qK1c18UUGJFAzz9xnH56584l4" -A "Traxelbot/0.1 by rbb36" "https://oauth.reddit.com/r/news/top?t=day&limit=100"
</syntaxhighlight>
Note: the second URL must be quoted, otherwise the shell treats the & as a background operator.
* https://old.reddit.com/r/worldnews/top/?sort=top&t=day
* /r/subreddit/top?t=day&limit=100
* count=100&
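The curl examples above translate roughly to Python using only the standard library. The client_id/client_secret values are placeholders, the user-agent string is copied from the notes, and nothing here touches the network; the sketch only builds the request objects and URLs.

```python
import base64
import urllib.parse
import urllib.request

TOKEN_URL = "https://www.reddit.com/api/v1/access_token"

def token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build the app-only (client_credentials) token request.

    Mirrors: curl -X POST -d 'grant_type=client_credentials' --user 'id:secret' ...
    curl's --user sends HTTP Basic auth, so we base64 the pair ourselves.
    """
    creds = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return urllib.request.Request(
        TOKEN_URL,
        data=urllib.parse.urlencode({"grant_type": "client_credentials"}).encode(),
        headers={
            "Authorization": f"Basic {creds}",
            "User-Agent": "Traxelbot/0.1 by rbb36",
        },
        method="POST",
    )

def top_links_url(subreddit: str, t: str = "day", limit: int = 100) -> str:
    """Build the oauth.reddit.com top-links URL from the notes above."""
    query = urllib.parse.urlencode({"t": t, "limit": limit})
    return f"https://oauth.reddit.com/r/{subreddit}/top?{query}"

# Placeholders, not real credentials; sending would need urllib.request.urlopen(req).
req = token_request("client_id", "client_secret")
url = top_links_url("news")
```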
= Reddit Python =
* pip install aiofiles aiohttp (asyncio is part of the Python standard library and does not need to be installed)
* https://realpython.com/async-io-python/
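The fan-out pattern that aiohttp/asyncio enables — fetching many subreddits' link lists concurrently — can be sketched without any network I/O. Here fetch_page is a stand-in (a sleep) for a real aiohttp GET; names and shapes are illustrative only.

```python
import asyncio

async def fetch_page(subreddit: str, page: int) -> dict:
    """Stand-in for an aiohttp GET; sleeps instead of doing real I/O."""
    await asyncio.sleep(0.01)
    return {"subreddit": subreddit, "page": page, "links": []}

async def fetch_all(subreddits):
    # gather() runs all fetches concurrently instead of one at a time,
    # so total wall time is ~one fetch, not len(subreddits) fetches.
    tasks = [fetch_page(s, 0) for s in subreddits]
    return await asyncio.gather(*tasks)

pages = asyncio.run(fetch_all(["news", "worldnews", "science"]))
```

With aiohttp, fetch_page would open the session's GET inside an async context manager and await resp.json(); aiofiles would play the same role for the json.bz2 writes.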
== t3 fields of interest ==
# "url_overridden_by_dest": "https://www.nbcnews.com/politics/donald-trump/live-blog/trump-georgia-indictment-rcna98900",
# "url": "https://www.nbcnews.com/politics/donald-trump/live-blog/trump-georgia-indictment-rcna98900",
# "title": "What infamous movie plot hole has an explanation that you're tired of explaining?",
# "downs": 0,
# "upvote_ratio": 0.94,
# "ups": 10891,
# "score": 10891,
# "created": 1692286512.0,
# "num_comments": 8112,
# "created_utc": 1692286512.0,
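A small sketch of pulling these fields out of a /top listing response. The nesting (data.children, kind "t3") is Reddit's standard listing format; the sample payload below is abbreviated and hypothetical.

```python
T3_FIELDS = [
    "url_overridden_by_dest", "url", "title", "downs", "upvote_ratio",
    "ups", "score", "created", "num_comments", "created_utc",
]

def extract_t3(listing: dict) -> list:
    """Pull the fields of interest out of a /top listing response.

    Reddit listings nest each link as {"kind": "t3", "data": {...}}
    under listing["data"]["children"]; absent fields come back as None.
    """
    rows = []
    for child in listing["data"]["children"]:
        if child.get("kind") != "t3":
            continue
        data = child["data"]
        rows.append({f: data.get(f) for f in T3_FIELDS})
    return rows

# Abbreviated sample listing; a real response carries many more fields.
sample = {"data": {"children": [
    {"kind": "t3", "data": {"title": "Example", "ups": 10891, "score": 10891,
                            "num_comments": 8112, "created_utc": 1692286512.0}},
]}}
rows = extract_t3(sample)
```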
= Minimal Term Set =
<pre>
hands, mouth, eyes, head, ears, nose, face, legs, teeth, fingers, breasts, skin, bones, blood,
be born, children, men, women, mother, father, wife, husband,
long, round, flat, hard, soft, sharp, smooth, heavy, sweet,
stone, wood, made of,
be on something, at the top, at the bottom, in front, around,
sky, ground, sun, during the day, at night, water, fire, rain, wind,
day,
creature, tree, grow (in ground), egg, tail, wings, feathers, bird, fish, dog,
we, know (someone), be called,
hold, sit, lie, stand, sleep,
play, laugh, sing, make, kill, eat, drink,
river, mountain, jungle/forest, desert, sea, island,
rain, wind, snow, ice, air,
flood, storm, drought, earthquake,
east, west, north, south,
bird, fish, tree,
dog, cat, horse, sheep, goat, cow, pig (camel, buffalo, caribou, seal, etc.),
mosquitoes, snake, flies,
family, we,
year, month, week, clock, hour,
house, village, city,
school, hospital, doctor, nurse, teacher, soldier,
country, government, the law, vote, border, flag, passport,
meat, rice, wheat, corn (yams, plantain, etc.), flour, salt, sugar, sweet,
knife, key, gun, bomb, medicines,
paper, iron, metal, glass, leather, wool, cloth, thread,
gold, rubber, plastic, oil, coal, petrol,
car, bicycle, plane, boat, train, road, wheel, wire, engine, pipe, telephone, television, phone, computer,
read, write, book, photo, newspaper, film,
money, God, war, poison, music,
go/went, burn, fight, buy/pay, learn,
clean
</pre>
Latest revision as of 15:46, 13 October 2023