Difference between revisions of "User:Dismounted Feudal Knight/Archiving/Bulk forum archiving"
Revision as of 10:03, 13 December 2023
This is the only way to save the forum in a timely manner. It's also the thing I'm most stumped about. The only way for this to possibly be efficient is with Tech support, and that's not an option right now, so we need a scraper (would that it were as easy as saving the wiki...).
I found a German option on GitHub, but it obviously requires heavy modification to be usable, and the way it works is just finding thread IDs and saving them. The sheer size and age of TWC means this is highly inefficient. A solution needs to be able to tackle whole forums at a time and make sure it does not put undue load on TWC's server. Play by Post RPGs is the perfect test case because it is already deprecated, will no longer get posts, and is threatened with the axe, so feel free to try that one out if you're at a loss for what to test with. If something effective can be found then the rest of the forum may follow.
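A rough sketch of the "one forum at a time, without hammering the server" idea, in Python. Everything specific in it is an assumption for illustration only: the `forumdisplay.php?f=...&page=...` URL pattern, the forum ID, the five-second delay and the function names are hypothetical and would need checking against how TWC actually serves its pages.

```python
import time
import urllib.request
from typing import Callable, Iterable, Iterator, Tuple


def forum_page_urls(base_url: str, forum_id: int, last_page: int) -> Iterator[str]:
    """Yield listing-page URLs for ONE forum (hypothetical URL pattern)."""
    for page in range(1, last_page + 1):
        yield f"{base_url}/forumdisplay.php?f={forum_id}&page={page}"


def fetch(url: str) -> bytes:
    """Download one page, identifying the archiver in the User-Agent header."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "twc-archive-sketch (hypothetical)"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def crawl_politely(
    urls: Iterable[str],
    fetch_page: Callable[[str], bytes] = fetch,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, bytes]]:
    """Fetch pages one by one, pausing between requests to spare the server."""
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        yield url, fetch_page(url)
```

Feeding `forum_page_urls(...)` for a single forum into `crawl_politely(...)` keeps each run limited to that forum, and the delay is the one knob that decides how hard the run leans on the server.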
To be optimistic for a moment: if we have a good system for this and the best does happen (GED's back and/or things pop off), forums that are no longer useful can simply be removed and offered instead as a far more usable archive for people to peruse at leisure without taxing TWC. Or, miraculously, the software is updated: TWC would have the option of starting over outright, and its most valuable asset, old posts, would not be lost because there would be a low-intensity read-only archive. But the absence of good controls is a concern that's grown on me, so this may have to be rethought. More updates on this in the future, hopefully.
Odd findings
- The SingleFile extension may be helpful. However, I want a reliable option that can do this more automatically (i.e. save pages immediately upon visit) so that, for example, I can run through an index here on the wiki and all that content will be on hand in case of the worst. SingleFile is already a start as it saves pages properly and completely, while manual CTRL + S has produced incredibly poor results on Firefox for me. It's not an elegant line of thought, but until/unless the scraper idea proceeds it's the best we've got.
- Random script on GitHub that may be tangentially useful for TWC articles: here. I'm not planning on pushing TWC articles to go to WordPress, but the proof of concept is something I'd like to remember, and it may lead to a finding that's useful for restoring articles if ever required.
- This isn't what I originally found, but this scraper and also this scraper may merit further observation. Again, I'm not enthusiastic about running a scraper unless I have fine control over which forums are selected at one time. The second script does have built-in rate limiting, which I like to see.
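On the "fine control" point, one possible approach is to read thread IDs off the listing pages of a chosen forum rather than probing every thread ID on the site. The sketch below uses only Python's standard library; the `showthread.php?t=...` link shape is an assumption about the forum software, not something confirmed for TWC, and `thread_ids_on_page` is a made-up helper name.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import parse_qs, urljoin, urlparse


class ThreadLinkParser(HTMLParser):
    """Collect thread IDs from <a href="showthread.php?t=ID"> links (assumed shape)."""

    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.page_url = page_url
        self.thread_ids: List[int] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the listing page they came from.
        parsed = urlparse(urljoin(self.page_url, href))
        if not parsed.path.endswith("showthread.php"):
            return
        values = parse_qs(parsed.query).get("t", [])
        if values and values[0].isdigit():
            thread_id = int(values[0])
            if thread_id not in self.thread_ids:  # keep first-seen order, no repeats
                self.thread_ids.append(thread_id)


def thread_ids_on_page(html_text: str, page_url: str) -> List[int]:
    """Return the unique thread IDs linked from one forum listing page."""
    parser = ThreadLinkParser(page_url)
    parser.feed(html_text)
    return parser.thread_ids
```

Because the IDs come only from pages of the forum that was picked, the set of threads to save stays scoped to that forum.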
What can't really be saved
Post histories and other minutiae. I used to say "without tech on deck", but this is an extremely tall order even with tech, and only worth thinking about if TWC makes progress on converting to modern software where that would fit in.
And obviously any scraping would mean the result is read-only, with no account use and none of the related bells and whistles.
The Internet Archive
The Internet Archive is a wonderful thing, but its coverage of TWC is very spotty and inconsistent, which makes it a poor substitute for TWC itself. This is why the bulk archiving idea came about: countless things would still be lost. I'm looking into™ a way to mass save TWC through IA or a related service so that option is available.
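For the mass-save idea, the Wayback Machine's Save Page Now feature accepts a page address appended to `https://web.archive.org/save/`. A minimal sketch of queuing several pages that way follows; the 15-second pause and the function names are assumptions, the service's real limits are not covered here and would need checking before any bulk run.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable

SAVE_PREFIX = "https://web.archive.org/save/"


def save_page_now_url(page_url: str) -> str:
    """Build the Save Page Now address for one page."""
    return SAVE_PREFIX + page_url


def request_snapshots(
    page_urls: Iterable[str],
    delay_seconds: float = 15.0,  # assumed pause, not an official figure
    open_url: Callable = urllib.request.urlopen,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Ask for a snapshot of each page in turn and record the HTTP status."""
    statuses: Dict[str, int] = {}
    for index, page_url in enumerate(page_urls):
        if index > 0:
            sleep(delay_seconds)  # space the requests out
        with open_url(save_page_now_url(page_url)) as response:
            statuses[page_url] = response.status
    return statuses
```

With the default `urlopen`, a failed request raises an exception instead of returning a status, so a real run would want to catch that and retry later.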