This is the only way to 'save' the forum in a timely manner. It's also the thing I'm most stumped about. Without guaranteed database copies from our techies (if they ever appear), we're left with scraping.
  
Options I've looked at are as follows:
* Bulk vb scraper: https://github.com/bendrick92/vbulletin_scraper/releases
* Bulk vb scraper 2: https://github.com/vizzerdrix55/web-scraping-vBulletin-forum
* Generic scraper (any number could fill this spot, see the sketch after this list): https://scrapy.org/
* A professional service: https://archive-it.org/
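To make the generic-scraper option concrete, here is a minimal sketch of what a per-section Scrapy spider could look like: crawl one forum section at a time and dump raw thread HTML to disk. This is an assumption-heavy illustration, not a working crawler. The forum id in the URL and the CSS selectors are placeholders, and real vBulletin markup varies by version and theme, so they would need checking against TWC's actual pages.

<syntaxhighlight lang="python">
# Sketch only: a Scrapy spider that walks a single forum section and saves
# each thread page as raw HTML. The URL and selectors are placeholders/guesses.
import pathlib

import scrapy


class ForumSectionSpider(scrapy.Spider):
    name = "twc_section"
    # Placeholder forum id for whichever section is being archived first.
    start_urls = ["https://www.twcenter.net/forums/forumdisplay.php?f=123"]

    def parse(self, response):
        # Thread links on a vBulletin forum listing page (selector is a guess).
        for href in response.css("a[id^='thread_title_']::attr(href)").getall():
            yield response.follow(href, callback=self.save_thread)
        # Follow the listing's own "next page" link, if there is one.
        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def save_thread(self, response):
        # Dump the raw HTML; turning it into a cleaner, browsable archive can
        # happen offline, away from TWC's servers.
        filename = response.url.split("/")[-1].replace("?", "_").replace("&", "_") + ".html"
        out = pathlib.Path("archive") / filename
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_bytes(response.body)
</syntaxhighlight>

Something like <code>scrapy runspider section_spider.py</code> would run it; the point is only that 'save a whole section' is a small amount of code once the selectors and pacing are right.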
  
The problem in general is collecting the forum in a way that does not tax its resources, saves TWC content accurately and cleanly, and can be done in a reasonable time. I ran the idea by a peer who specifically scrapes wikis on a regular basis, and the estimate for the entire forum was 'about a month' if run continuously. Ideally, sections can be collected piecemeal and in order of priority. Even better would be to skip all of this and have it done on the backend, with a guarantee that Hex as an entity is able to make a save of the forum in its current condition. That depends on reaching techies who are actually able and willing to make it so. Inquiries are pending which may change the direction of this page. Or not. Someone should hit me if I don't update this by May.
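On the 'does not tax its resources' point, most of a crawler's politeness is configuration rather than code. The snippet below shows hedged Scrapy settings for a slow, gentle crawl; the exact numbers are guesses at what an acceptable load looks like and would need to be agreed with whoever actually runs the servers.

<syntaxhighlight lang="python">
# Politeness settings for a long-running archive crawl (Scrapy settings.py).
# The numbers are assumptions about acceptable load, not agreed limits.
BOT_NAME = "twc_archive"

ROBOTSTXT_OBEY = True               # respect the forum's robots.txt
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # only one request in flight at a time
DOWNLOAD_DELAY = 2.0                # seconds between requests

# Back off automatically if the server starts responding slowly.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Cache responses locally so interrupted or repeated runs do not re-hit TWC.
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"
</syntaxhighlight>

The local HTTP cache also means a crawl can be stopped and resumed section by section without re-downloading pages TWC has already served, which fits the piecemeal, priority-ordered approach.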
 
 
==What can't really be saved==
 
Post histories and other minutiae, at least not without tech on deck. And obviously any scraping means the result is read-only, with no account use.
 
 
 
==The Internet Archive==

The Internet Archive is a wonderful thing, but its coverage of TWC is very spotty and inconsistent. That is why this idea came about: the Archive is a poor substitute for TWC itself, and countless things would still be lost.
 
