The RSS Problem

Note: This post was originally part of my LinuxWorld blog; it has been migrated here following my resignation.
There has been much talk recently about the problems of RSS: the popularity of news aggregation is starting to cause bandwidth problems for certain sites, driven by the number of people subscribing to the various news feeds and selecting arbitrary refresh times for updating them.
The problems are caused by two different elements, human and technical. The human element is that the people who subscribe are generally hungry for news. I pick up maybe 1000-1200 items each day from about 200 feeds at intervals ranging from hourly to just once a day. There are others, though, who want to get the news the moment it hits the sites and specify update times of as little as 5 minutes between updates.
The technical side is a simple case of the system becoming far more popular than originally expected. The RSS/RDF format works by the site or blog generating an XML file containing the list of stories, summaries and other information, which is then downloaded by news aggregation software.
These files are generally static, generated at set intervals or only regenerated when a new story is posted, but to enable other users to download all the stories in an existing feed, each XML file may contain 20 or more stories. The downside to this approach is that for each user who ‘subscribes’ to the feed, the only way to update information is to download the entire file, parse the contents, determine which of the stories haven’t already been seen, and display those.
In essence, subscribing to a news feed is nothing more than an instruction to your aggregation software to keep downloading the same file at the interval you set. A typical feed file will be about 12-20KB. With a thousand ‘subscribers’ refreshing their feed hourly, that makes up to 20MB an hour of bandwidth. Over the course of a month we’re talking about 14-15GB of bandwidth just to support your readers’ habits.
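To make the arithmetic concrete, here is a rough sketch of the calculation. It assumes the upper figure of 20KB per feed file, a 30-day month and decimal units (1KB = 1000 bytes); the function name is purely illustrative.

```python
def monthly_bandwidth_bytes(feed_size_kb, subscribers, refreshes_per_day, days=30):
    """Bytes served per month when every refresh downloads the whole feed file."""
    bytes_per_download = feed_size_kb * 1000  # assumption: decimal units
    return bytes_per_download * subscribers * refreshes_per_day * days


# 1000 subscribers, 20KB feed, refreshing hourly (24 times a day)
hourly = monthly_bandwidth_bytes(20, 1000, 24)

# The same subscribers refreshing every 5 minutes (12 times an hour)
five_minute = monthly_bandwidth_bytes(20, 1000, 24 * 12)

print(f"hourly refresh:   {hourly / 1e9:.1f} GB/month")
print(f"5-minute refresh: {five_minute / 1e9:.1f} GB/month")
```

The hourly case works out to 14.4GB a month, in line with the estimate above; the same thousand readers on a 5-minute refresh would pull twelve times as much.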
The real issue is that these users aren’t ‘subscribing’ so much as constantly reloading the file in the vain hope that something might have changed. In reality there might only be one new story, probably making up less than 1KB of the content. Think about it: when you subscribe to a monthly magazine you don’t get the new issue with all the new stories and last month’s issue as well, just in case you haven’t already read it.
The solution? We need to change both ends of the system. On the server side we should be using the dynamic nature of 90% of the RSS solutions so that they only generate content that has been published within a given timeframe. This way, readers can specify how far back they want to load stories. Now instead of generating XML for the last 20 stories, the system will generate XML for the stories published since the last time the client asked.
Implementing these changes is not difficult. The server software will need to accept a timestamp in the URL request and generate the XML on the fly (rather than dumping to a static file). On the client end, the software needs to be adapted so that it only requests items posted and/or changed since the last time it refreshed the feed. Now it doesn’t matter whether readers choose an hour or a day for refreshes; the amount of information they actually have to download will consist only of what has changed.
These are simple, straightforward and easy-to-implement changes that would make an enormous difference to the bandwidth required to support RSS. Used properly, they would even help the client side, as clients wouldn’t have to remember or process the files so closely to ensure that already received (and read) stories were ignored.
I can’t believe I’m the only person to think of this solution, but I also haven’t heard of anybody proposing it. It doesn’t require any changes to the XML and only minor changes to the way the clients and servers work.
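The two changes described above could look something like the following minimal sketch. Everything named here is an assumption for illustration: the `since` URL parameter (Unix seconds, UTC), the in-memory `STORIES` list standing in for a site's database, and the example.com addresses. It is not taken from any existing feed software.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.parse import parse_qs, urlencode, urlparse
import xml.etree.ElementTree as ET

# Hypothetical story store; a real site would query its database instead.
STORIES = [
    {"title": "Older story", "link": "http://example.com/1",
     "published": datetime(2004, 9, 1, 8, 0, tzinfo=timezone.utc)},
    {"title": "Newer story", "link": "http://example.com/2",
     "published": datetime(2004, 9, 1, 14, 0, tzinfo=timezone.utc)},
]


def parse_since(url):
    """Server side: read the hypothetical 'since' parameter from the request URL."""
    values = parse_qs(urlparse(url).query).get("since")
    if not values:
        return None
    return datetime.fromtimestamp(int(values[0]), tz=timezone.utc)


def build_feed(stories, since=None, limit=20):
    """Server side: generate the feed XML on the fly.

    Without a timestamp, fall back to the newest `limit` stories;
    with one, include only stories published after it.
    """
    newest_first = sorted(stories, key=lambda s: s["published"], reverse=True)
    if since is None:
        selected = newest_first[:limit]
    else:
        selected = [s for s in newest_first if s["published"] > since]

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Example feed"
    ET.SubElement(channel, "link").text = "http://example.com/"
    ET.SubElement(channel, "description").text = "Illustrative feed"
    for story in selected:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = story["title"]
        ET.SubElement(item, "link").text = story["link"]
        ET.SubElement(item, "pubDate").text = format_datetime(story["published"])
    return ET.tostring(rss, encoding="unicode")


def refresh_url(base_url, last_refresh=None):
    """Client side: ask only for stories newer than the previous refresh."""
    if last_refresh is None:
        return base_url  # first visit: take the default feed
    return base_url + "?" + urlencode({"since": int(last_refresh.timestamp())})


# A client that last refreshed at noon only receives the 14:00 story.
last_refresh = datetime(2004, 9, 1, 12, 0, tzinfo=timezone.utc)
url = refresh_url("http://example.com/feed", last_refresh)
feed_xml = build_feed(STORIES, parse_since(url))
print(feed_xml)
```

The items themselves are ordinary feed entries, so the XML format is untouched; only the request URL and the server's selection of stories change. A client that has never refreshed simply omits the timestamp and gets the usual last-20 feed.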
So what are we all waiting for? Let’s cooperate and start solving the RSS problem before it becomes an even bigger issue.