User:Dismounted Feudal Knight/Archiving/Bulk forum archiving
Latest revision as of 12:41, 3 April 2024
This is the only way to 'save' the forum in a timely manner. It's also the thing I'm most stumped about. Without guaranteed database copies from our techies, if they ever materialise, we're left with scraping.
Options I've looked at are as follows:
- Bulk vb scraper: https://github.com/bendrick92/vbulletin_scraper/releases
- Bulk vb scraper 2: https://github.com/vizzerdrix55/web-scraping-vBulletin-forum
- Generic scraper (any number could fill this spot): https://scrapy.org/
- A professional service: https://archive-it.org/
The general problem is collecting the forum in a way that does not tax its resources, saves TWC content accurately and cleanly, and finishes in a reasonable time. I ran the idea by a peer who regularly scrapes wikis, and their estimate for the entire forum was 'about a month' of continuous crawling. Ideally sections can be collected piecemeal and by priority. Better still would be skipping all of this and having it done on the backend, with a guarantee that Hex as an entity can make a save of the forum in its current condition. That depends on reaching techies who are actually able and willing to make it happen. Inquiries are pending which may change the track of thinking on this page. Or not. Someone should hit me if I don't update this by May.
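For the scraping route, a minimal sketch of a throttled, section-by-section crawl with Scrapy is below. It is an assumption-heavy illustration, not a tested crawler: the subforum URL, the showthread.php link pattern, and the pagination selector are hypothetical stand-ins and would need to be checked against TWC's actual vBulletin markup before any real run.

 import scrapy


 class SubforumSpider(scrapy.Spider):
     """Crawl one subforum at a time so sections can be saved piecemeal, by priority."""
     name = "twc_subforum"
     # Hypothetical starting point; the real subforum URL would be filled in per run.
     start_urls = ["https://www.twcenter.net/forums/forumdisplay.php?f=1"]

     custom_settings = {
         # Keep load on the live forum low: one request at a time, politely spaced.
         "AUTOTHROTTLE_ENABLED": True,
         "DOWNLOAD_DELAY": 2,
         "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
         "ROBOTSTXT_OBEY": True,
     }

     def parse(self, response):
         # Thread links on a vBulletin index page usually point at showthread.php;
         # this selector is an assumption and needs checking against TWC's markup.
         for href in response.css("a[href*='showthread.php']::attr(href)").getall():
             yield response.follow(href, callback=self.parse_thread)
         # Follow subforum pagination, if present.
         next_page = response.css("a[rel='next']::attr(href)").get()
         if next_page:
             yield response.follow(next_page, callback=self.parse)

     def parse_thread(self, response):
         # Save the raw HTML so nothing is lost to a lossy parse.
         yield {"url": response.url, "html": response.text}
         # Follow thread pagination too, so long threads are captured in full.
         next_page = response.css("a[rel='next']::attr(href)").get()
         if next_page:
             yield response.follow(next_page, callback=self.parse_thread)

Run inside a Scrapy project with something like scrapy crawl twc_subforum -O subforum.jl. As a rough sanity check on the 'about a month' figure: at a two-second delay that's on the order of 43,000 pages a day, so a crawl of the whole forum (whose true page count I don't have) plausibly lands in the weeks-to-month range, which is exactly why collecting sections piecemeal and by priority matters.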