Mandy (or: My Quest to Stop Rare Media Being Lost Again)

Despite not being able to watch many cartoons as a kid, I have always been a fan of 90s and early 2000s animation. Recently I was thrilled to find out not only could I still acquire copies of most televised shows in good quality, but that some viewers had also taped, uploaded and collected many of the bonus materials that channels aired between shows (an excellent example is the CN City era bumpers). This led me to Lost Media sites, where various passionate users would search the internet to rediscover all kinds of rare footage and share their findings with others.

Unfortunately, most of the videos are hosted externally on a site named YouTube.

YouTube deletes videos all the time.

An issue with off-site archival is that the most popular video hosts such as YouTube will often take down videos automatically, regardless of context. Furthermore, many accounts are terminated by their owners or due to the owner getting banned. YouTube is not a safe site for videos, evident by Lost Media articles littered with comments stating, "This footage is now lost again".

This is why I created Mandy, a bot to scrape these wiki sites for links to YouTube, Vimeo and Dailymotion and then re-upload them to archive.org, a far safer site for media preservation.

Mandy relies on a slightly modified version of TubeUp, a great tool made by ArchiveTeam which uses youtube-dl to re-upload media from various sites onto archive.org along with useful metadata such as its uploader, description and tags.

Here is an image of the account page on archive.org with the backed up media Mandy has saved.

Some of you may recognise the namesake of this project; a small, commanding child who becomes the boss of the Grim Reaper. Similarly, this tiny bot controls YouTube's power to kill the collection of rare footage which can be found there.

I will discuss some details of the program later in this post. Some general statistics are probably more interesting to casual readers so I want to mention those first.

Statistics

This blog post was made in celebration of Mandy's 5,000th upload!

The Lost Media Archive wikia appears to have approximately 4800 unique YouTube, Dailymotion and Vimeo links (to playlists and videos). Mandy has processed about 2200 since launching on 28 November, 2018.

These are the current approximate statistics as of July, 2019 (obvious duplicates removed):

Videos successfully archived: 5000
Videos geoblocked from proxy: 44
Links unavailable when trying to archive: 315
Unresolved errors (will soon drop to 50): 234
Large playlists queued for manual archival: 72

(The manual approval of large playlists was added after a playlist containing over 4000 long videos was posted in the Lost PewDiePie Videos article, resulting in a denial of service)

Vampire videos

One statistic I have been eager to calculate is the number of vampire videos. These are the videos which have been killed on YouTube but live on because of this project.

137 of the successfully archived videos are no longer available at their source (3.6%). While I am happy this bot has many archived videos before they are potentially lost, it is alarming how volatile they are.

Another worrying statistic is counting the number of videos which were already unavailable before they could be archived. On the pages Mandy has searched, about 315 videos linked on the wikia were dead. It should be noted that this may include links which were already known to be dead when posted (they may be the target lost media they are trying to find).

Luckily, some videos are already archived by other people before the bot arrives. It's reassuring to see that my first target was already preseved by another user.

Finally, while this project was mostly focused on YouTube, other sites are used to host videos as well. YouTube, Dailymotion and Vimeo have been tested so far, with 69 archived videos coming from Vimeo and 21 coming from Dailymotion. There was also one audio file from Soundcloud, which was a *proposed* site to archive. I am mildly suprised that it was successfully archived. youtube-dl and TubeUp ftw.

How it works

Mandy has done a decent job of saving the videos from these sites. But how?

Mandy is a small collection of Python and shell scripts which manage parsing the site for new video links, mirroring those links and performing maintainance tasks.

The basic method of operation involves:

Downloading a feed of recent changes on a lost media website
Finding feed items which may contain links to external media (such as articles, comments and new video pages)
Isolating the page content of these items and parsing them for known link formats of YouTube, Vimeo and Dailymotion media
Feeding the links into TubeUp to download and archive them along with a note mentioning which page it was found on

There is also a script which runs twice daily to:

Push lists of successfully and unsuccessfully archived links to this online repostitory
Email a health report (so I can quickly learn about any errors as well as celebrate victories)
Update software (this helps minimise youtube-dl errors when YouTube changes how it stores videos)

This was designed with modularity and extensibility in mind so that multiple sites can be archived concurrently. Currently, only lostmediaarchive.fandom.com is archived, but lostmediawiki.com will soon be added.

Mandy runs on lostmediaarchive.fandom.com by taking its Recent Changes feed and downloading each feed item to perform a regex search for known video link formats, such as youtube.com/watch?v=xxxxxxxxxxx, as well as formats for channel and playlist links. Once a list of target media sources has been made, it saves a JSON file containing each link and the wiki page it was found on. Then it calls a script to take each of these links, split any playlists or channels into individual videos, check which have not already been uploaded and then archive them using TubeUp. This copy of TubeUp has been modified to append extra information to the description noting that the video was automatically archived and where it was archived from, as well as how it appeared if it was from a channel or playlist. Lists are then updated to record which links were successfully saved and which were not. These statistics are pushed to this remote git repo and added to a health report which is sent to me twice daily.

Since this project uses tools made by ArchiveTeam, youtube-dl and InternetArchive, the archival part is quite simple to write. The difficulties come from scraping and processing. Scraping can be difficult due to a few main factors:

Trying to scrape article comments without dynamic content/JavaScript.
Workaround: On lostmediaarchive.fandom.com, comments can be viewed without JavaScript by finding their history page and viewing it as an RSS feed.
Processing user input which may be poorly formatted.
Users are not expecting their content to be read by a bot so there is no standardisation whatsoever. This is rarely a serious issue.
Recognising valid links in unusual formats
(eg. youtube.com/watch?index=1&list=PL0FpjAA...)
Recognising unexpected links which should be ignored
(eg. youtube.com/search?q=show...)

youtube-dl does a great job of recognising links so in the future I should look at their code, but my regex parser is generally accurate by now.

Processing links can also be problematic:

Recognising error messages from various websites.
YouTube has many different error messages, including a vague "This video is unavailable" message, geoblock messages, account termination messages and copyright takedown messages, then there are some Dailymotion and Vimeo messages as well (including two persistant 'Publishing in progress...' messages on Dailymotion)
youtube-dl treats links to channels differently to links to the channel's playlist of all videos. This is important when enumerating a playlist, because...
Large playlists can cause denial of service. As stated earlier, sometimes a user will post a link to a playlist containing hundred or thousands of long videos.

Most of those issues were solved after launch whenever they happened. I still consider Mandy to be in beta, although operation has become exponentially more stable as more of these unusual cases are discovered.

Mandy is the sort of program you would rather not have to talk to often. From the start there was an effort to try and make the scripts robust and handle errors appropriately, avoid concurrency issues and send regular status updates. This has generally worked out well through use of file locks on shared resources, regular logging and a twice-daily cron job.
Some small oversights and assumptions during planning and writing resulted in rare critical failures from unexpected errors or input. Most of these have been fixed but there are certainly a few issues left just waiting to occur. Still, for a first project this has progressed relatively smoothly.

The source code is available here. I won't pretend that I have maintained great style or even correctness at this point in time, most of the code has been developed though hurried prototyping. The code has many inline comments but external documentation has mostly been neglected since it seemed there is no current reason and little chance anyone would want to run this program themselves. There is definitly some code maintainance needed (tubeFeeder.py currently has no functions defined and code duplication exists, among other issues) but since I am the only maintainer of this small bot, features and bug fixes are unfortunately more of a priority than good practices.

Mandy was very much worth all the effort. Creating an unmanned system to preserve this media for future generations has been a rewarding experience and a fun project to work on. It feels good to be an archivist.

mandy@firemail.cc