Making good sense of the 1TB Yahoo Geocities data archive

I got to know Yahoo Geocities in 2001 during a crash course on basic HTML and web design. Looking back, playing with HTML in the Geocities website builder over a dial-up connection was nothing fancy: it could take as much as 15 minutes to edit a simple page. And yet somehow I was fascinated by the idea of having my own website and spent hours on end learning HTML using nothing but my Pentium I machine and a 33.6Kbps modem.

Time passed and, with other commitments, I quickly forgot about my Geocities personal home page. It wasn’t until 2011 that I learned about Geocities’ shutdown back in 2009 and started to think about my very first website there. After a quick search, I found a 1TB archive of Geocities data that was released as a torrent in 2010 and is still available for download today. I was very excited to download it, especially since the metadata showed that the archive did include my website. For various reasons, not all Geocities sites are included in the archive, and nobody knows how complete it is.

Getting the archive

Believe it or not, even with a fast broadband connection, downloading this archive is a challenge. The torrent contains almost 1TB of data made up of an enormous number of small files, mostly HTML and text, which many popular torrent clients are simply not designed for. What’s more, as Geocities most likely ran on Unix and used a case-sensitive file system, the names of many files in the archive differ only by letter case, which confuses even the best Windows torrent clients. In my attempts I also encountered infinite symbolic links, which are simply evil in their own right.

All things considered, to download the torrent successfully, use a Linux torrent client such as Transmission and save the archive onto an empty 1TB ext2 or ext4 partition. It’s best not to use your boot partition, as having such a large number of files will most likely affect system performance. With this setup, it took me two weeks to download the torrent. I also made several copies of it onto 1TB external hard disk drives, as I did not want to spend time downloading it again.

With the archive in hand, it was time to find a use for it. Rehosting the long-forgotten sites was out of the question, since a default setup of Apache or nginx is unlikely to cope reliably with such a large number of files. After some thinking, I came up with the idea of writing code to extract sentences or paragraphs containing emotional thoughts from the archive. As many sites on Geocities are simple personal home pages where users wrote about their lives, work or families, the sites in the archive should be a reasonably good dataset for what I wanted. With this in mind, I began work on the project.

Pre-processing the downloaded files

For our purpose, we are only interested in files which contain mostly text. To keep things simple, I chose only TXT and HTM/HTML files (which can quickly be converted to text using regular expressions) of a reasonable size, e.g. between 2KB and 2MB. Other file types such as DOC/DOCX, RTF and PDF may also be of interest but are not included, since converting them to text would be much slower. Using MonoDevelop on Ubuntu, I was quickly able to write a tool that filters the archive down to the files I wanted.

If you want to write such a tool, take note that most file systems will struggle if there are too many files in the same folder. To get around this, I give each file an MD5 hash as its new name (since we are not interested in the original filename), such as 4ECAA709544FF6C237DAB60152A8E.txt, and store it under several nested folders corresponding to the first few characters of the filename, e.g. 4\E\C\A\A\7\0\4ECAA709544FF6C237DAB60152A8E.txt. This reduces the number of files in the same folder to a manageable level. In the process I used the following function to convert HTML to text paragraphs:

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Web; // HttpUtility.HtmlDecode (System.Web assembly)

static string stripHTML(string inputHTML)
{
	// Singleline lets '.' match line breaks, so multi-line <style>/<script> blocks and
	// tags are removed whole; IgnoreCase also catches upper-case tags such as <SCRIPT>.
	const RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

	// Normalize every line-ending variant to \n.
	inputHTML = inputHTML.Replace("\r\n", "\n");
	inputHTML = inputHTML.Replace("\n\r", "\n");
	inputHTML = inputHTML.Replace("\r", "\n");
	inputHTML = inputHTML.Replace((char)160, '\n'); // non-breaking space becomes a line break

	// Drop style and script blocks together with their contents, then all remaining tags.
	inputHTML = Regex.Replace(inputHTML, @"<style.*?</style>", " ", options);
	inputHTML = Regex.Replace(inputHTML, @"<script.*?</script>", " ", options);
	inputHTML = Regex.Replace(inputHTML, @"<.*?>", " ", options);
	inputHTML = HttpUtility.HtmlDecode(inputHTML); // &amp; -> &, &lt; -> <, ...
	// inputHTML = StripPunctuation(inputHTML).Trim();

	// One paragraph per non-empty line, with runs of whitespace collapsed to one space.
	var paragraphs = inputHTML.Split(new char[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
	StringBuilder sb = new StringBuilder();
	foreach (string paragraph in paragraphs)
	{
		string temp = Regex.Replace(paragraph.Trim(), @"\s{2,}", " ");
		if (temp.Length == 0) continue;
		if (sb.Length > 0) sb.Append("\n");
		sb.Append(temp);
	}

	return sb.ToString();
}
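
The hashed folder layout described before the function can be sketched roughly as follows. This is only an illustration, not the code from the download: hashing the file contents (rather than, say, the original path), the names ShardedStore, Md5Hex and ShardedPath, and the default of seven folder levels (taken from the example path) are all assumptions.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class ShardedStore
{
    // Upper-case hexadecimal MD5 of the given bytes (32 characters).
    public static string Md5Hex(byte[] content)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(content);
            var sb = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash) sb.Append(b.ToString("X2"));
            return sb.ToString();
        }
    }

    // Builds root/4/E/C/.../<hash><extension>: one folder level per leading
    // character of the hash, so no single folder holds too many files.
    public static string ShardedPath(string root, string hashHex, string extension, int levels = 7)
    {
        string path = root;
        for (int i = 0; i < levels; i++)
            path = Path.Combine(path, hashHex[i].ToString());
        return Path.Combine(path, hashHex + extension);
    }
}
```

A caller would compute the hash from File.ReadAllBytes(source), create the folder with Directory.CreateDirectory(Path.GetDirectoryName(target)) and then copy the file to the target path. Naming by content hash also means that byte-identical files end up under the same name.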

Extracting paragraph data

It should be noted that many files have HTM/HTML/TXT extensions but are actually binary; some can even be extracted as ZIP files or opened as Word documents if renamed. I suspect this was originally done to bypass Geocities’ file type restrictions. To get around this, we need to exclude binary files. Since we are only interested in English text, one way is to read every byte and conclude that a file is binary if there is an abundance of control characters other than CR and LF. See this for details.
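
A minimal sketch of such a control-character check is shown below. The 1% threshold, the decision to also tolerate TAB, and the name LooksBinary are assumptions made for illustration; the tool in the download may use different rules.

```csharp
using System;

static class BinaryCheck
{
    // Returns true when the share of control bytes in the data exceeds
    // maxControlRatio. TAB (9), LF (10) and CR (13) are not counted,
    // because they are normal in plain text.
    public static bool LooksBinary(byte[] data, double maxControlRatio = 0.01)
    {
        if (data.Length == 0) return false;

        int control = 0;
        foreach (byte b in data)
        {
            bool allowed = b == 9 || b == 10 || b == 13;
            if ((b < 32 && !allowed) || b == 127) control++;
        }
        return (double)control / data.Length > maxControlRatio;
    }
}
```

Since the selected files are at most 2MB, calling BinaryCheck.LooksBinary(File.ReadAllBytes(path)) on each one is affordable. A ZIP file renamed to .txt fails quickly because its header and compressed data are full of bytes below 32.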

After removing binary files, we need to extract paragraphs, which, for simplicity, are defined as groups of sentences separated by line breaks. A sentence is defined as a group of words ending in certain punctuation, such as a full stop, question mark or exclamation mark. Reasonable limits should be put in place to filter out sentences that are too short (<10 words) or too long (>1000 words). Paragraphs that don’t look like normal English text, e.g. those with words that are too long or with too many numbers, mixed-case words or special characters, are also excluded. Each word in every sentence is also checked against an English dictionary, and a paragraph is excluded if it contains too many misspelled words. I was able to remove a lot of technical content such as programming or mathematics lessons with this simple filter mechanism.
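
The filter can be sketched along these lines. Only the 10 and 1000 word limits come from the description above; the other thresholds, the choice to reject a whole paragraph when any sentence falls outside the limits, and all names are assumptions, and the dictionary is whatever English word list you supply.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class ParagraphFilter
{
    // Splits after '.', '?' or '!' when followed by whitespace.
    public static string[] SplitSentences(string paragraph)
    {
        return Regex.Split(paragraph.Trim(), @"(?<=[.?!])\s+")
                    .Where(s => s.Length > 0)
                    .ToArray();
    }

    // Lower-cased runs of letters and apostrophes.
    public static string[] Words(string sentence)
    {
        return Regex.Matches(sentence, @"[A-Za-z']+")
                    .Cast<Match>()
                    .Select(m => m.Value.ToLowerInvariant())
                    .ToArray();
    }

    public static bool IsAcceptable(string paragraph, ISet<string> dictionary,
        int minWords = 10, int maxWords = 1000, int maxWordLength = 20,
        double maxDigitRatio = 0.1, double maxMisspelledRatio = 0.2)
    {
        string[] sentences = SplitSentences(paragraph);
        if (sentences.Length == 0) return false;

        // Too many digits suggests tables, listings or other non-prose content.
        int digits = paragraph.Count(c => char.IsDigit(c));
        if ((double)digits / paragraph.Length > maxDigitRatio) return false;

        int total = 0, misspelled = 0;
        foreach (string sentence in sentences)
        {
            string[] words = Words(sentence);
            if (words.Length < minWords || words.Length > maxWords) return false;
            if (words.Any(w => w.Length > maxWordLength)) return false;
            total += words.Length;
            misspelled += words.Count(w => !dictionary.Contains(w));
        }
        return (double)misspelled / total <= maxMisspelledRatio;
    }
}
```

Loading the dictionary into a HashSet keeps each lookup constant-time, which matters when every word of millions of paragraphs is checked.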

All selected paragraphs are written into a single text file which looks something like this:

[Screenshot: geocities_extracted_paragraphs]

Using the above algorithm, I extracted approximately 5 million paragraphs of text. Most of them are readable; some are even worth further reading, as they were originally part of an essay or other literary work.

Finding the right paragraph

So how do we find out which paragraphs are worth reading among millions of them? Since we can’t read them all, we need a way to categorize them automatically. My idea is to assign each paragraph a score based on its ‘mood’ or tone, e.g. how cheerful or how depressing it sounds. Obviously not all paragraphs can be categorized this way, but since our dataset is a collection of mostly personal sites, this method should be good enough.

I started by forming a list of words that would make a paragraph sound cheerful (e.g. happy, great, amazed, delighted, etc.) and another list of words that would make it sound depressing (e.g. sad, upset, hateful, uncertain, etc.). Each paragraph starts with a score of 0, which is then increased or decreased according to the number of times such words appear and the length of the paragraph. Paragraphs that contain offensive words or words that could indicate illicit content are excluded. Each paragraph is also automatically tagged with its topics, identified by groups of pertinent words (nouns/verbs/adjectives/adverbs) that appear together in the paragraph. Duplicate paragraphs are removed in the process.
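
A minimal sketch of this kind of word-list scoring is shown below. The exact formula used by the tool is not given here, so the one in the sketch (net cheerful-minus-depressing hits per 1000 words) is an assumption, as are the name MoodScorer and the tiny word lists, which reuse only the examples above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class MoodScorer
{
    static readonly HashSet<string> Cheerful =
        new HashSet<string> { "happy", "great", "amazed", "delighted" };
    static readonly HashSet<string> Depressing =
        new HashSet<string> { "sad", "upset", "hateful", "uncertain" };

    // Starts at 0, adds 1 per cheerful word and subtracts 1 per depressing
    // word, then normalizes by paragraph length (hits per 1000 words).
    public static int Score(string paragraph)
    {
        string[] words = Regex.Matches(paragraph.ToLowerInvariant(), @"[a-z']+")
                              .Cast<Match>()
                              .Select(m => m.Value)
                              .ToArray();
        if (words.Length == 0) return 0;

        int net = words.Count(w => Cheerful.Contains(w))
                - words.Count(w => Depressing.Contains(w));
        return (int)Math.Round(1000.0 * net / words.Length);
    }
}
```

With a function like this, paragraphs.OrderBy(MoodScorer.Score) puts the most depressing paragraphs first and the most cheerful last. Normalizing by length stops long paragraphs from dominating simply because they contain more words.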

By using LINQ, I achieved the above in fewer than 1000 lines of code and successfully extracted many ‘good’ paragraphs from the dataset:

[Screenshot: best_paragraps]

As you can see from the screenshot, other criteria, such as the number of times certain punctuation marks (exclamation or question marks) appear in the paragraph, are also taken into account for a more objective assessment.

The first paragraph has a mood score of -8800. Its first few sentences indeed sound very depressing:

“Faith? how can i do that? how can i have faith in something? everything changes, everything leaves. You say have faith in myself. but i don’t know who i am. why does it hurt so much to say that? why do these tears fall? i look for faith within myself and i find pain. i don’t understand. how can happiness turn to such doubt and loathing? she has that power. they all do. anyone that i love, can kill me.”

The second-to-last paragraph has a mood score of 2400 and sounds rather delightful:

“Growing up/ Do you hear that sound? The sound of innocense breaking away. / Remembering back when i thought love was the answer, that i could change the world/suddenly paranoid. because i realize when they look at me, they’re thinking of something else. /who wants me? they all want me. why? because i looked at them and smiled. /love is a psychotic state/i never gave them any reason to think there was ever anything else. they made it all up in there head/ do i get mad? do i try to reason?/this is the trial of youth/am i strong enough for this game of love? will i make it out alright? don’t let them break you. do you think you hurt me? noone can hurt me.”

These two paragraphs were probably part of somebody’s long-forgotten diary from the 1990s. By sorting on the mood score, many such paragraphs can easily be found in the final output of the algorithm. With minimal effort, these paragraphs could be posted automatically on Twitter or Facebook once a day for a good read.

The algorithm is also able to identify words or phrases, known here as topics, that are commonly used, sorted by the number of times they are used:

[Screenshot: Capture]

At the top of the list, “my research”, “didn’t fit” and “limited information” are the most commonly used terms. Several other common themes such as “my life”, “long time” and “first time” can also be seen. Some relics of the Geocities era also show up if we look at 3-word topics:

[Screenshot: Capture2]

You can see “enjoyed my visit” and “signing my guestbook” in the list, which remind us of the times when guest books were still commonly used. It has been such a long time since I last visited any website that still had a functional guest book.
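
Counting 2-word and 3-word topics like these can be sketched with LINQ as follows. This simplified version counts every run of n consecutive words; it does not restrict topics to pertinent nouns, verbs, adjectives or adverbs as described earlier, and the names TopicCounter, NGrams and TopTopics are illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class TopicCounter
{
    // Yields every run of n consecutive words in the paragraph, lower-cased.
    public static IEnumerable<string> NGrams(string paragraph, int n)
    {
        // Letters plus straight and curly apostrophes, so "didn’t" stays whole.
        string[] words = Regex.Matches(paragraph.ToLowerInvariant(), @"[a-z'’]+")
                              .Cast<Match>()
                              .Select(m => m.Value)
                              .ToArray();
        for (int i = 0; i + n <= words.Length; i++)
            yield return string.Join(" ", words, i, n);
    }

    // Returns the 'take' most frequent n-word phrases with their counts,
    // most frequent first; ties are ordered alphabetically.
    public static List<KeyValuePair<string, int>> TopTopics(
        IEnumerable<string> paragraphs, int n, int take)
    {
        return paragraphs
            .SelectMany(p => NGrams(p, n))
            .GroupBy(g => g)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal)
            .Take(take)
            .ToList();
    }
}
```

Calling TopicCounter.TopTopics(paragraphs, 3, 50) would give a list comparable in spirit to the 3-word topics above. Grouping millions of phrases in memory is one reason this style of LINQ code needs a lot of RAM.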

Downloads

I have prepared a ZIP file containing the C# source code and text output mentioned in this article:

  • GeocitiesTest: tool to extract only HTML and TXT files from the archive
  • TextExtractor2: tool to extract paragraphs
  • TextExtractor_Cleanup: tool to do some post-processing on the extracted paragraphs
  • selected_para_5mil.txt: text file with 5 million extracted paragraphs in raw form
  • para_topics_300k_csv: CSV files with 300,000 extracted paragraphs and associated parameters (paragraphs.csv) as well as the topics list for each paragraph (topics.csv & para_topics.csv)

Take note that the code is not optimized and as LINQ is memory-hungry, you will need at least 64GB of RAM to work comfortably. The CSV files can be imported to an SQL database for easier access. You can download the ZIP file here.

ToughDev
