
User:Dismounted Feudal Knight/Archiving/Bulk forum archiving

This is the only way to 'save' the forum in a timely manner. It's also the thing I'm most stumped about. Without guaranteed database copies from our techies, if any ever appear, we're left with scraping.
  
I found a scraper on GitHub, but it obviously requires heavy modification to be usable, and the way it works is simply finding thread ids and saving them. The sheer size and age of TWC means this is highly inefficient. A solution needs to be able to tackle selected forums one at a time and ensure it does not overstay its welcome on TWC's servers. Play by Post RPGs is the perfect testing example because it is already deprecated, will no longer get posts, and is threatened with the axe, so feel free to try that one out if you're at a loss for what to test with. If something effective can be found then the rest of the forum may follow.
Options I've looked at are as follows:

* Bulk vb scraper: https://github.com/bendrick92/vbulletin_scraper/releases
* Bulk vb scraper 2: https://github.com/vizzerdrix55/web-scraping-vBulletin-forum
* Generic scraper (any number could fill this spot): https://scrapy.org/ (a rough spider sketch follows this list)
* A professional service: https://archive-it.org/
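
To make the Scrapy option concrete, below is a minimal sketch of what a spider confined to a single subforum could look like, throttled so it doesn't hammer the server. The start URL, the forum id, and the CSS selectors are placeholders made up for illustration; they would need checking against TWC's actual vBulletin markup before any real run.

<syntaxhighlight lang="python">
# Sketch only: crawl one subforum with Scrapy and save raw thread HTML.
# The start URL, forum id and CSS selectors below are assumptions for
# illustration, not verified against TWC's actual markup.
import os
import scrapy


class SubforumSpider(scrapy.Spider):
    name = "twc_subforum"
    # Hypothetical entry point: one forum display page, not the whole board.
    start_urls = ["https://www.twcenter.net/forums/forumdisplay.php?f=331"]

    custom_settings = {
        # Keep the load on TWC low: one request at a time, well spaced out.
        "CONCURRENT_REQUESTS": 1,
        "DOWNLOAD_DELAY": 5,
        "AUTOTHROTTLE_ENABLED": True,
        "ROBOTSTXT_OBEY": True,
    }

    def parse(self, response):
        # Follow every thread link on this index page (selector is a guess).
        for href in response.css("a.thread-title::attr(href)").getall():
            yield response.follow(href, callback=self.parse_thread)
        # Follow forum pagination, staying inside this one subforum.
        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_thread(self, response):
        # Dump each thread page as raw HTML; tidying it up can happen later.
        os.makedirs("saved", exist_ok=True)
        filename = response.url.rstrip("/").split("/")[-1].replace("?", "_")
        with open(os.path.join("saved", filename + ".html"), "wb") as f:
            f.write(response.body)
        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_thread)
</syntaxhighlight>

Run with <code>scrapy runspider subforum_spider.py</code>. The <code>custom_settings</code> block is the part that matters most here, since that's what keeps the crawl from overstaying its welcome.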
  
To be optimistic for a moment: if we have a good system for this and the best does happen (GED's back and/or things pop off), forums can simply be removed once they're no longer useful and offered as a far more usable archive for people to peruse at leisure without taxing TWC. Or, miraculously, the software is updated: TWC would have the option of starting over outright, and its most valuable asset, old posts, would not be lost.
The problem in general is collecting the forum in a way that does not tax its resources, accurately and cleanly saves TWC content, and is possible in a reasonable time. I ran the idea by a peer who regularly scrapes wikis, and the estimate to do the entire forum was 'about a month' if done continuously. Ideally sections can be collected piecemeal and by priority. Even better would be discounting all of this and having it done on the backend, with a guarantee that Hex as an entity is able to make a save of the forum in its current condition. That depends on reaching techies who are actually able and willing to make it so. Inquiries are pending which will change the track of thinking on this page. Or not. Someone should hit me if I don't update this by May.
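
For a rough sense of where a month-long figure could come from, here is a back-of-envelope calculation. Both numbers below are assumptions for illustration, not actual TWC figures.

<syntaxhighlight lang="python">
# Back-of-envelope check on the 'about a month' estimate.
# Both inputs are assumptions for illustration, not actual TWC figures.
pages_to_fetch = 500_000   # guessed number of thread and index pages
delay_seconds = 5          # per-request delay to keep server load down

total_seconds = pages_to_fetch * delay_seconds
print(f"{total_seconds / 86_400:.0f} days of continuous crawling")
# ~29 days with these numbers, so a month-long continuous run is plausible,
# and doing it piecemeal by section is the only realistic schedule.
</syntaxhighlight>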
 
 
==Odd findings==
 
* The SingleFile extension may be helpful. However, I want a reliable option that can do this more automatically (i.e. save pages immediately upon visit) for ease of use, so that, for example, I can run through an index here on the wiki and all that content will be useful in case of the worst. SingleFile is already a start, as it saves pages properly and completely, while manual CTRL + S has produced incredibly poor results on Firefox for me. It's not an elegant line of thought, but until/unless the scraper idea proceeds it's the best we've got for now. A rough sketch of batching such saves follows this list.
 
* Random script on GitHub that may be tangentially useful for TWC articles: [https://github.com/Budget101/vB4-CMS-Article-to-WP-Post-Converter here]. I'm not planning to push TWC articles to WordPress, but the proof of concept is something I'd like to remember, and it may lead to a finding that's useful for restoring articles if ever required.
 
* This isn't what I originally found, but [https://github.com/bendrick92/vbulletin_scraper this scraper] and [https://github.com/vizzerdrix55/web-scraping-vBulletin-forum/tree/master this scraper] may merit further observation. Again, I'm not enthusiastic about pulling the trigger on a scraper unless I have fine control over which forums are selected at one time. The second script does have built-in rate limiting, which I like to see.
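
As mentioned above, here is a sketch of how SingleFile saves could be batched from an index of URLs. It assumes the extension's command-line companion (single-file-cli) is installed and that the basic <code>single-file &lt;url&gt; &lt;output&gt;</code> invocation applies; the <code>urls.txt</code> file is a made-up stand-in for a list of pages pulled from a wiki index.

<syntaxhighlight lang="python">
# Sketch: batch-save a list of pages with SingleFile's command-line
# companion (single-file-cli). The urls.txt index and the exact CLI
# invocation are assumptions; check the single-file-cli docs first.
import subprocess
import time
from pathlib import Path

out_dir = Path("saved_pages")
out_dir.mkdir(exist_ok=True)

urls = Path("urls.txt").read_text().split()

for i, url in enumerate(urls):
    out_file = out_dir / f"page_{i:05d}.html"
    # Basic documented usage is: single-file <url> <output-file>
    subprocess.run(["single-file", url, str(out_file)], check=True)
    time.sleep(5)  # space the saves out so TWC isn't hammered
</syntaxhighlight>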
 
 
 
==What can't really be saved==
 
Post histories and other minutiae can't realistically be saved without tech on deck. And obviously any scraping means the result is read-only, with no account use.
 
 
 
==The internet archive==
 
The Internet Archive is a wonderful thing, but its coverage of TWC is very spotty and inconsistent. That is why this idea came about: it is a poor substitute for the forum itself, and countless things would still be lost.
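
One way to put a number on how spotty that coverage is: the Wayback Machine has a public 'availability' endpoint, so a short script can report which pages from a list have no snapshot at all. Treat this as a sketch; <code>urls.txt</code> is an assumed input, the same kind of index list as above.

<syntaxhighlight lang="python">
# Sketch: count how many pages from a URL list have no Wayback snapshot.
# Uses the public availability API; urls.txt is an assumed input file.
import json
import time
import urllib.parse
import urllib.request

urls = open("urls.txt").read().split()

missing = []
for url in urls:
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"https://archive.org/wayback/available?{query}") as resp:
        data = json.load(resp)
    # An empty 'archived_snapshots' object means no capture exists.
    if not data.get("archived_snapshots"):
        missing.append(url)
    time.sleep(1)  # stay polite with the API

print(f"{len(missing)} of {len(urls)} pages have no snapshot at all")
</syntaxhighlight>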
 
