I've been searching both here and on Google for an answer, and although I've found some pointers, I haven't quite found a solution.
If you have a simple RSS reader with a database, you might have a couple of tables for storing feeds (ignoring subscribers for now):
- Feeds (feed-id, feed-title, feed-url)
- Items (item-id, feed-id, item-title, item-content)
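To make the layout concrete, here is a minimal sketch of those two tables using Python's built-in sqlite3 module. The underscored column names, the helper functions and the example.com URLs are my own assumptions for illustration, not part of any real reader.

```python
import sqlite3


def create_simple_schema(conn):
    # Two tables only: every item row belongs to exactly one feed.
    conn.executescript(
        """
        CREATE TABLE feeds (
            feed_id    INTEGER PRIMARY KEY,
            feed_title TEXT NOT NULL,
            feed_url   TEXT NOT NULL UNIQUE
        );
        CREATE TABLE items (
            item_id      INTEGER PRIMARY KEY,
            feed_id      INTEGER NOT NULL REFERENCES feeds(feed_id),
            item_title   TEXT NOT NULL,
            item_content TEXT
        );
        """
    )


def add_feed(conn, title, url):
    # Hypothetical helper: insert a feed and return its generated id.
    cur = conn.execute(
        "INSERT INTO feeds (feed_title, feed_url) VALUES (?, ?)", (title, url)
    )
    return cur.lastrowid


def add_item(conn, feed_id, title, content):
    # Hypothetical helper: insert an item under one feed.
    cur = conn.execute(
        "INSERT INTO items (feed_id, item_title, item_content) VALUES (?, ?, ?)",
        (feed_id, title, content),
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
create_simple_schema(conn)

front_page = add_feed(conn, "Front page", "http://example.com/rss")
category = add_feed(conn, "Category", "http://example.com/category/rss")

# The same post arrives through both feeds, so it is stored twice.
add_item(conn, front_page, "Hello", "Post body")
add_item(conn, category, "Hello", "Post body")

duplicate_rows = conn.execute(
    "SELECT COUNT(*) FROM items WHERE item_title = ?", ("Hello",)
).fetchone()[0]
```

With this layout, a post that shows up in both feeds ends up as two separate rows in the items table.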
This works in most cases, but many websites and web-based applications have a main feed from the front page plus category feeds. If you pull both into the sort of system above, there's going to be a lot of replicated data, because the same post appears in several RSS feeds.
The two choices I've come up with are to either ignore it and accept the duplicates, or use a link table between the feeds and the items. But the link table also seems like quite a waste when probably 80% of the sites I'm looking to pull from won't have multiple feeds that could create this replication.
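For reference, the link-table option could look roughly like the sketch below, again using Python's sqlite3 module. It assumes each post carries some stable identifier (here a hypothetical item_guid column, e.g. the post's permalink) so the same post can be recognised across feeds. The table name feed_items and the store_item helper are also my own inventions.

```python
import sqlite3

# Assumption: item_guid uniquely identifies a post across all feeds.
SCHEMA = """
CREATE TABLE feeds (
    feed_id    INTEGER PRIMARY KEY,
    feed_title TEXT NOT NULL,
    feed_url   TEXT NOT NULL UNIQUE
);
CREATE TABLE items (
    item_id      INTEGER PRIMARY KEY,
    item_guid    TEXT NOT NULL UNIQUE,
    item_title   TEXT NOT NULL,
    item_content TEXT
);
CREATE TABLE feed_items (
    feed_id INTEGER NOT NULL REFERENCES feeds(feed_id),
    item_id INTEGER NOT NULL REFERENCES items(item_id),
    PRIMARY KEY (feed_id, item_id)
);
"""


def store_item(conn, feed_id, guid, title, content):
    # Insert the content once; a repeated guid is silently skipped.
    conn.execute(
        "INSERT OR IGNORE INTO items (item_guid, item_title, item_content) "
        "VALUES (?, ?, ?)",
        (guid, title, content),
    )
    item_id = conn.execute(
        "SELECT item_id FROM items WHERE item_guid = ?", (guid,)
    ).fetchone()[0]
    # Record that this feed contains the item; repeated links are skipped too.
    conn.execute(
        "INSERT OR IGNORE INTO feed_items (feed_id, item_id) VALUES (?, ?)",
        (feed_id, item_id),
    )
    return item_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO feeds (feed_id, feed_title, feed_url) VALUES (1, ?, ?)",
    ("Front page", "http://example.com/rss"),
)
conn.execute(
    "INSERT INTO feeds (feed_id, feed_title, feed_url) VALUES (2, ?, ?)",
    ("Category", "http://example.com/category/rss"),
)

# The same post seen through two feeds: one items row, two link rows.
first = store_item(conn, 1, "http://example.com/posts/1", "Hello", "Post body")
second = store_item(conn, 2, "http://example.com/posts/1", "Hello", "Post body")

item_count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM feed_items").fetchone()[0]
```

The trade-off is visible even at this size: the content is stored once, but every insert needs an extra lookup and every read needs a join through feed_items, which is the overhead I'm unsure is worth paying.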
Is there a better way of doing this, or am I looking at this in completely the wrong way?
Update
Thanks to both of you for the answers. The consensus seems to be that the space saving is probably not significant enough to worry about, and that it would be negated by the potential for unknown issues (such as the one mentioned by dbr).
Adding a link table or similar would probably increase the processing time as well, so overall it's not worth worrying about too much. After reading the responses, I had thought about linking content and removing duplicates only when the post is no longer in either RSS feed, to save on space. But again, as Assaf has said, the space savings are small enough that this could be a waste of time.