
User:Dismounted Feudal Knight/Archiving/Bulk forum archiving

Revision as of 11:58, 12 October 2023 by Dismounted Feudal Knight (talk | contribs) (some meager progress on this topic)

Bulk archiving is the only way to save the forum in a timely manner. It's also the thing I'm most stumped about. Doing it efficiently would really need Tech support, and that's not an option right now, so we need a scraper (would that it were as easy as saving the wiki...).

I found this thing on GitHub, but it would obviously need heavy modification to be usable, and the way it works is just finding thread IDs and saving them. Given the sheer size and age of TWC, that is highly inefficient. A solution needs to be able to tackle whole forums at a time while making sure it does not overstay its welcome on TWC's server load. Play by Post RPGs is the perfect test case: it is already deprecated, will get no new posts, and is threatened with the axe, so feel free to try that one if you're at a loss for what to test with. If something effective can be found, the rest of the forum may follow.
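To make the "forums at a time" idea concrete, here is a minimal sketch of walking one forum's index pages and collecting its thread links, rather than guessing thread IDs. Everything specific in it is an assumption for illustration: the `showthread.php` marker, the `forum.example` URLs in the usage note, and the rule that the walk stops at the first index page with no new threads would all need checking against how TWC's pages actually look. `fetch` is any function that takes a URL and returns HTML, so throttling is left to whatever is passed in.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ThreadLinkParser(HTMLParser):
    """Collect hrefs that look like thread links from one forum index page."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # ASSUMED substring that identifies a thread URL
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:
                self.links.append(absolute)


def collect_thread_urls(fetch, index_url_for_page, marker="showthread.php",
                        max_pages=1000):
    """Walk a forum's index pages until one yields no new thread links.

    fetch              -- callable: URL -> HTML text (hypothetical, injected)
    index_url_for_page -- callable: page number -> index page URL (hypothetical)
    """
    seen = []
    for page in range(1, max_pages + 1):
        url = index_url_for_page(page)
        parser = ThreadLinkParser(url, marker)
        parser.feed(fetch(url))
        new = [link for link in parser.links if link not in seen]
        if not new:
            break  # empty or repeated page: assume the end of the forum
        seen.extend(new)
    return seen
```

A call might look like `collect_thread_urls(fetch, lambda p: f"https://forum.example/forumdisplay.php?f=1&page={p}")`, with the URL pattern swapped for whatever the target forum really uses. The resulting list is the work queue for one forum, which keeps each run bounded and easy to resume.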

To be optimistic for a moment: if we have a good system for this and the best does happen (GED's back and/or things pop off), forums that are no longer useful can simply be removed and offered instead as a far more usable archive for people to peruse at leisure without taxing TWC. Or, miraculously, the software is updated: TWC then has the option of straight up starting over without losing its most valuable asset, its old posts.

Odd findings

  • The SingleFile extension may be helpful. However, I want a reliable option that can do this more automatically (i.e. save pages immediately upon visit) for ease of use, so that, for example, I can run through an index here on the wiki and have all that content saved and usable in case of the worst. SingleFile is already a start, as it saves pages properly and completely, while manual CTRL + S has produced incredibly poor results on Firefox for me. It's not an elegant line of thought, but until/unless the scraper idea proceeds it's the best we've got.
  • Random script on GitHub that may be tangentially useful for TWC articles: here. I'm not planning on pushing for TWC articles to go to WordPress, but the proof of concept is something I'd like to remember, and it may lead to a finding that's useful for restoring articles if ever required.
  • This isn't what I originally found, but this scraper and also this scraper may merit a closer look. Again, I'm not enthusiastic about running a scraper unless I have fine control over which forums are selected at one time. The second script does have built-in rate limiting, which I like to see.
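For reference, the two properties wanted from a scraper in the last point (rate limiting and explicit control over which forums are touched) are small enough to sketch. This is not how either linked script does it. It is a rough illustration using only the Python standard library, and the class name, the 5-second default, the user agent string and the forum names are all made up for the example.

```python
import time
import urllib.request


class PoliteFetcher:
    """Fetch URLs with a minimum pause between consecutive requests."""

    def __init__(self, min_interval=5.0, clock=time.monotonic,
                 sleep=time.sleep, opener=None):
        self.min_interval = min_interval  # seconds; 5.0 is an arbitrary guess
        self._clock = clock
        self._sleep = sleep
        self._open = opener or self._http_get
        self._last = None  # time the previous request finished

    @staticmethod
    def _http_get(url):
        # Hypothetical user agent so the traffic is identifiable.
        request = urllib.request.Request(
            url, headers={"User-Agent": "twc-archive-sketch/0.1"})
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")

    def fetch(self, url):
        if self._last is not None:
            wait = self.min_interval - (self._clock() - self._last)
            if wait > 0:
                self._sleep(wait)
        try:
            return self._open(url)
        finally:
            self._last = self._clock()


def fetch_selected(fetcher, forum_urls, selected):
    """Fetch the index page of each explicitly selected forum and nothing else.

    forum_urls -- dict of forum name -> index URL (hypothetical mapping)
    selected   -- names to fetch; an unknown name raises KeyError on purpose
    """
    return {name: fetcher.fetch(forum_urls[name]) for name in selected}
```

With something like `fetch_selected(PoliteFetcher(), forum_urls, ["Play by Post RPGs"])`, only the named forum is requested, and every request after the first waits out the interval. The `clock`, `sleep` and `opener` parameters exist so the timing logic can be tried without touching the live site.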

What can't really be saved

Post histories and other minutiae can't be saved without tech on deck. And obviously any scraped result would be read-only, with no account use.

The Internet Archive

The Internet Archive is a wonderful thing, but its coverage is very spotty and inconsistent. That is why this idea came about: the Internet Archive is a poor substitute for TWC itself, and countless things would still be lost.
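One way to put a number on "spotty" would be to ask the Wayback Machine's availability endpoint about a list of thread URLs and see which come back with no snapshot. The sketch below assumes that endpoint's JSON keeps its usual shape (an `archived_snapshots` object with an optional `closest` entry). The function names are invented here, and `fetch` is again any URL-to-text callable. A reported snapshot only means some capture of that exact URL exists, not that the capture is complete or covers every page of the thread.

```python
import json
from urllib.parse import urlencode

WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(page_url):
    """Build the availability query for one page URL."""
    return WAYBACK_API + "?" + urlencode({"url": page_url})


def closest_snapshot(response_text):
    """Return the closest snapshot URL from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_gaps(page_urls, fetch):
    """Return the pages for which no snapshot is reported.

    fetch -- callable: URL -> response text (hypothetical, injected so the
             requests can be throttled or faked)
    """
    return [url for url in page_urls
            if closest_snapshot(fetch(availability_url(url))) is None]
```

Running `find_gaps` over the thread list for a single forum would give a rough list of what exists nowhere but on TWC, which is the material a scraper should reach first.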