Objective: To provide the reader with an overview of how we approach the challenge of link rot within the parameters defined by the author, and to arrive at a high-level workflow that will form the basis of a solution architecture.


While the problem is fairly well defined on external sites (eg: Wikipedia), to me link rot quite simply translates to not being able to access the information and/or content one needs at point 1 from a resource that was on the web at point 0 (where points [0, 1, .. n] are incremental units of time, eg: days, weeks, months, years).

This could be because of a number of factors including (but not necessarily limited to):

  • Site structural changes without adequate redirection to new links
  • Takedown notices requiring the content or website to go offline
  • Domain and/or Service Provider expiry and non-renewal

 

Link rot affects us on multiple fronts for a variety of different content types.

We can frame the link rot challenge as a combination of the following (though not necessarily limited to) source and content types:

Content Types:

  • Documents (eg: PDF, Doc, XLS, Epub etc)
  • Images (eg: PNG, JPG, etc)
  • Sound (eg: WAV, Mp3, FLAC)
  • Videos (eg: Mp4, MKV)
  • Site Structure (HTML, CSS, JS, etc)

 

Source Types:

  • Content-specific Sources (eg: Image Dump)
  • Content-specific Hosts (eg: Imgur, Flickr, Soundcloud)
  • Combined-Content Sources (eg: Blogs, E-Libraries)
  • Service-based Content Sources (eg: Goodreads, G-Suite)

 

Here’s a visual outline of the problem at hand:

Content and Source Types can be thought of in two paradigms: one content type from many sources, or many content types from a single source

 

Framing the Challenge - High-Level Considerations

Before attempting to architect any solutions, this section of the document frames the challenge at hand when it comes to link rot. While this frame may not be all-encompassing, it is personalized and thus based on my current requirements.

Sources with multiple content-types and specific content types derived from multiple sources need to be classified into buckets based on the niche and/or category to enable easier retrieval of content at a later date.

This can be visually thought of as follows:

Categories can serve as catch-all buckets for the aforementioned two paradigms of content and source types

 

Given that each category comprises multiple sources and content types, the visualization above can easily be extended as follows to better represent the state of the challenge:

A single category can house multiple (deduplicated) content types from multiple sources
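To make the bucket idea slightly more concrete, here is a minimal sketch of how a category could be represented in code, assuming a Python-based toolchain; the class names, fields, and sample values are illustrative placeholders rather than a prescribed taxonomy.

    # A minimal sketch of the category-as-bucket idea described above.
    # Class names, fields, and sample values are illustrative assumptions.
    from dataclasses import dataclass, field


    @dataclass
    class ArchiveItem:
        url: str
        content_type: str    # eg: "pdf", "png", "html"
        source: str          # eg: "imgur.com", "personal-blog.example"


    @dataclass
    class Category:
        name: str
        items: list = field(default_factory=list)

        def add(self, item: ArchiveItem) -> None:
            # Deduplicate on URL so a category holds each resource only once.
            if all(existing.url != item.url for existing in self.items):
                self.items.append(item)


    reading = Category("reading")
    reading.add(ArchiveItem("https://example.com/paper.pdf", "pdf", "example.com"))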

 

There are three other key components to the above, which are:

  • Time
  • Access Control
  • Storage

 

At a high-level the considerations for each of those are as follows:

  1. Time
    • Does the source (Many from One) stay the same over time?
    • How often do we archive the sources and content-types?
    • Does our frequency affect the source? (eg: Bandwidth and Server Resource Constraints)
  2. Access Control
    • Who has access to the archive?
    • Can the information collected be shared, either privately with elected individuals or publicly on the net?
    • Should it be made available in alternate formats to the aforementioned parties (identified or otherwise)?
    • Do the archives deter users from visiting the primary source? Does this violate any usage and/or consumption policies?
  3. Storage
    • How do we store the collected archives? Are there any limitations? Are there any efficiencies that can be achieved? (eg: Compression)
    • How do we access the stored archives? Are there multiple and/or different ways to browse the archives based on the content and/or source type?
    • Do the access controls above translate to the storage of the content as well?
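One way to keep these considerations from staying purely abstract is to pin them down as explicit configuration that the archival tooling reads. The sketch below assumes a Python-based toolchain; every field name and default value is an illustrative assumption, not a fixed schema.

    # A sketch of the Time / Access Control / Storage considerations captured
    # as explicit configuration. Field names and defaults are assumptions.
    from dataclasses import dataclass


    @dataclass
    class ArchivePolicy:
        # Time: how often to re-archive, and how politely to crawl.
        archive_interval_days: int = 7
        request_delay_seconds: float = 2.0    # avoid straining the source server

        # Access control: who can see the resulting archive.
        visibility: str = "private"           # "private" | "shared" | "public"

        # Storage: where archives land and whether they are compressed.
        storage_root: str = "/srv/archives"
        compress: bool = True


    default_policy = ArchivePolicy()
    print(default_policy)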

 

These three additional considerations will be explored through the solutions below to better demonstrate how each can be addressed by the solution(s) elected.

 

Existing Solutions To the Challenge

Before proceeding to architect a new solution for the challenge, it is important to scope out and identify solutions that may already address the needs and requirements outlined above, and to note where they don’t completely fit those requirements.

Bookmark Archival Services and Tools

There are a few services and/or tools that I’ve discovered (but not extensively tested) that can serve not only as a bookmarking service but as a link archival service as well. They do this by creating an offline or archived snapshot of the webpage at the time of bookmarking (point 0), which then enables you to go back and refer to the content at a future date (point 1..n).

The tools and/or services (that I am aware of) that offer this functionality are as follows:

  • Pinboard.in Bookmark Archive
  • Pocket Premium
  • Evernote Web Clipper
  • Pirate’s Bookmark Archiver

 

Advantages:

  • Resolves the ‘Many from One’ scenario with varying degrees of depth (eg: Evernote’s Full page capture vs Simplified page capture)
  • Outsourced technology with simplified workflow - which enables us to ‘Bookmark as usual’ and yet keep an archived copy

 

Web Scrapers

Web scrapers enable us to create complete (or as complete as possible) copies of entire websites (or elected portions of them), along with all the structural components associated with said site.

Examples of tools and/or services (that I am aware of) that offer this functionality are as follows:

  • HTTrack
  • Wget
  • Scrapy

 

Advantages:

  • Really efficient for archiving large and/or complete (or partially elected) mirrors of identified resources (‘Many from One’)
  • Can be configured to perform tasks on an automated basis from a select list of URLs
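The second advantage above - running against a select list of URLs on an automated basis - lends itself to a short sketch. The one below drives wget from Python using its standard mirroring flags; the targets.txt file name and the archives output directory are assumptions made for illustration.

    # A sketch of rule-based mirroring with wget, driven by a plain-text list
    # of target URLs (one per line). File and directory names are assumptions;
    # the wget flags used are its standard mirroring options.
    import subprocess
    from pathlib import Path

    URL_LIST = Path("targets.txt")      # hypothetical: one URL per line
    ARCHIVE_ROOT = Path("archives")

    for url in URL_LIST.read_text().splitlines():
        url = url.strip()
        if not url or url.startswith("#"):
            continue
        subprocess.run(
            [
                "wget",
                "--mirror",              # recurse and keep timestamps
                "--convert-links",       # rewrite links for offline browsing
                "--adjust-extension",    # save pages with sensible extensions
                "--page-requisites",     # fetch the CSS, JS and images a page needs
                "--no-parent",           # stay within the elected portion of the site
                "--wait=2",              # be polite to the source server
                "--directory-prefix", str(ARCHIVE_ROOT),
                url,
            ],
            check=False,                 # one failed target should not stop the batch
        )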

 

Identifying Requirements - What are the “Nice to Haves”

This section will aim to cover some of the “nice to have” features for combating link rot. The goal is to potentially highlight requirements that are not readily covered by the tools and/or services above but may possibly be achieved through chaining and/or with some additional code.

“Nice to Have” Requirement List

  • Automated Archival of Bookmarks and OneTab lists
  • Ability to operate without relying on local resources (eg: Running on an External Server)
  • Ability to monitor progress and results of current and past archival activities
  • Ability to gauge existing storage utilization and potentially chart incremental requirements for archives
  • Ability to do delta updates on specific targets rather than full archivals

 

Requirement #1 - Automated Archival of Bookmarks and OneTab lists

OneTab is probably my most frequently used add-on. It basically keeps a local copy of tabs that I’ve yet to read or am planning to refer to at a later point in time. The advantage of the add-on is the (virtually) limitless number of URLs it can store locally. That unfortunately does lead to two challenges for me, which are:

  • Long lists going back to the date of initialization
  • Lists are machine- and browser-based and do not transfer over to other browsers on the same system, or to other systems used by the same user (even when the same user profile is used), creating multiple copies of different lists

NOTE: It is important to note, however, that this is not a negative criticism of the add-on. I am a HUGE fan of OneTab.

In addition, I occasionally (by force of habit) use either the browser’s native bookmarking feature or the external services offered by Pocket and Evernote. This creates a wide array of disparate collections that aren’t optimal when looking for information, or for that one article I bookmarked that one time.

Browser bookmarks, on the other hand, are less of a concern and slightly easier to deal with - especially because they automatically get backed up and are available at a later date - more often than not with a robust, searchable index through the built-in ‘Bookmark Managers’.

However - besides the few quick links - it would be great if I could selectively pick a specific folder or two from the bookmarks to be archived especially for offline browsing.
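As a first step towards automating this, the exported lists need to be reduced to plain URLs that an archival job can consume. The sketch below assumes OneTab’s ‘Export URLs’ output is one ‘URL | title’ pair per line with blank lines between tab groups - treat that format as an assumption to verify against an actual export.

    # A sketch of turning a OneTab export into a flat URL list for archival.
    # The "URL | title" per-line format is an assumption about the export.
    from pathlib import Path


    def parse_onetab_export(path: Path) -> list:
        urls = []
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line:
                continue                     # blank line = tab group separator
            url = line.split(" | ", 1)[0]    # drop the title, keep the URL
            if url.startswith("http"):
                urls.append(url)
        return urls


    if __name__ == "__main__":
        # Hypothetical file produced by OneTab's "Export URLs" feature.
        print(parse_onetab_export(Path("onetab-export.txt")))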

Requirement #2 - Non-reliance on local resources

All archival activities need to be flexible enough to run either locally on my current machine or remotely on a server of choice. This removes the need to keep a local machine available with active resources allocated towards archival activities, thus enabling both ‘business-as-usual’ and archival activities to function simultaneously.

Requirement #3 - Ability to monitor progress of present activities and past results

All automated activities need to be monitored (and tested) in some capacity over the course of their lifespan to ensure that:

  • They function as intended
  • Any new edge cases and/or issues that were not originally accounted for are discovered and subsequently addressed

More importantly, monitoring is required to ensure that when an automated process fails, it fails noisily. Monitoring alarms should be triggered in proportion to the seriousness of the fault to ensure that it is registered and subsequently addressed.
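A minimal sketch of the ‘fail noisily’ idea, assuming Python’s standard logging module: each archival task is wrapped so a failure is logged with full detail, an alert hook is fired, and the error is re-raised rather than swallowed. The notify() function is a hypothetical placeholder for whatever alerting channel is eventually elected (email, chat webhook, etc.).

    # A sketch of "fail noisily": log every task, alert on failure, re-raise.
    # notify() is a hypothetical placeholder for a real alerting channel.
    import logging

    logging.basicConfig(
        filename="archival.log",
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    log = logging.getLogger("archiver")


    def notify(message: str) -> None:
        # Placeholder alert hook - swap in email, chat webhook, etc.
        print(f"ALERT: {message}")


    def run_task(name, task):
        """Run one archival task, logging success and failing loudly on error."""
        log.info("starting task %s", name)
        try:
            task()
        except Exception:
            log.exception("task %s failed", name)
            notify(f"archival task {name} failed; see archival.log")
            raise                    # do not swallow the error
        log.info("task %s completed", name)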

Requirement #4 - Ability to gauge existing storage utilization

As we scale our archival activities to cover all sources and content types associated with personal link rot, it is important to ensure that one has the ability to assess and plan for storage requirements should one continue to automatically generate incremental full or delta snapshots of said sources.
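A small sketch of what gauging utilization could look like in practice, assuming archives are laid out as one sub-directory per category under a common root (that layout is an assumption carried over from the category buckets above): walk each directory, sum file sizes, and print or log the totals so growth can be charted over time.

    # A sketch of per-category storage utilization under the archive root.
    # The one-sub-directory-per-category layout is an assumption.
    from pathlib import Path

    ARCHIVE_ROOT = Path("archives")


    def directory_size_bytes(path: Path) -> int:
        return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


    for category_dir in sorted(p for p in ARCHIVE_ROOT.iterdir() if p.is_dir()):
        size_mib = directory_size_bytes(category_dir) / (1024 * 1024)
        print(f"{category_dir.name}: {size_mib:.1f} MiB")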

Requirement #5 - Ability to perform delta updates vs full updates

This requirement, while optional, directly impacts Requirement #4. It will be revisited based on the currently identified storage options available for the challenge.
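For a sense of what a delta update might look like if wget ends up being the archival tool, the sketch below leans on its timestamping option, which re-downloads a file only when the remote copy is newer than the local one; whether that level of ‘delta’ is sufficient depends on the storage option eventually elected.

    # A sketch of a delta-style refresh using wget timestamping: files that
    # have not changed on the remote side are skipped on repeat runs.
    import subprocess


    def delta_refresh(url: str, archive_dir: str = "archives") -> None:
        subprocess.run(
            [
                "wget",
                "--recursive",
                "--timestamping",      # only fetch files newer than the local copy
                "--level=inf",
                "--no-parent",
                "--directory-prefix", archive_dir,
                url,
            ],
            check=False,
        )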

 

Architecting a Solution

This section aims to demonstrate how we derive an architecture for a solution to the aforementioned challenge. The process described is not exhaustively detailed or all-encompassing; instead, it serves as a guideline to the overarching thought process at hand.

Based on the above, one can already visualize a high-level workflow for an intended solution:

A high-level solution workflow for the to-be-designed architecture, outlined below

The workflow outline is as follows:

  1. Bookmarks or target URL lists are created as usual on a local machine and/or an elected location (eg: a file share)
  2. An endpoint is triggered at a previously configured interval
  3. The triggered endpoint initiates pre-configured rule-based archival activities
  4. The triggering of the archival activities also initiates log monitoring through a Dashboard or preconfigured report that can be reviewed through a Browser on an External device
  5. Post completion of an archival thread from the overall activity, the result (or files) is stored at an elected and monitored storage solution for retrieval at a later point
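Tying the five steps together, here is a rough sketch of what the scheduled job could look like. It reuses the hypothetical helpers sketched earlier in this article (parse_onetab_export, run_task, delta_refresh), imagined here as living in a single archiver module; the scheduler itself (step 2) is left to the environment, eg: a cron entry or systemd timer invoking this script at the elected interval.

    # A rough sketch of the five-step workflow as one scheduled job.
    # parse_onetab_export(), run_task() and delta_refresh() are the hypothetical
    # helpers sketched earlier, imagined here as a single "archiver" module.
    from pathlib import Path

    from archiver import parse_onetab_export, run_task, delta_refresh

    # Step 1: bookmark / OneTab exports dropped at an elected location.
    URL_SOURCES = [Path("onetab-export.txt"), Path("bookmarks-export.txt")]


    def main() -> None:
        for source in URL_SOURCES:
            if not source.exists():
                continue
            for url in parse_onetab_export(source):
                # Steps 3-5: archive each target, logging noisily on failure,
                # into the monitored storage root used by the earlier sketches.
                run_task(url, lambda u=url: delta_refresh(u, archive_dir="archives"))


    if __name__ == "__main__":
        main()    # step 2: invoked by an external scheduler at the configured interval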

 

The solution can be broadly compressed into three major categories, namely:

  • Export
  • Extract
  • Store

 

Given the above, we can start working on the implementation of the solution based on the aforementioned requirements.

The next article in the series will aim to cover the implementation of a case solution based on the above.