Scrape 'N' Feed: The Greasy Tag

Homemade RSS Feeds
Come On In

Scrape 'N' Feed is a simple Python wrapper around the PyRSS2Gen module. It implements almost all of the code you need to create RSS feeds out of web pages. All you have to write is the code that actually does the screen-scraping (and Beautiful Soup makes that easy). It stores feed state in a pickle file between invocations, freeing you from having to worry about most of the minor problems that get in the way of scraping RSS feeds.

Download

Get Scrape 'N' Feed v1.0

Generic Example

This example should give you an idea of how Scrape 'N' Feed interacts with the PyRSS2Gen module.

#!/usr/bin/env python
import ScrapeNFeed

class MyFeed(ScrapeNFeed.ScrapedFeed):    

    def HTML2RSS(self, headers, body):

	#Transform an HTTP response into a series of RSSItem objects
        items = []
	...
        #Scrape scrape scrape... 
	...
	#Then add the items to the feed.
        self.addRSSItems(items)

#Load the feed with its metadata. This will fetch the URL, scrape it
#for RSSItems, and possibly write out new RSS and pickle files.
MyFeed.load('Feed title',
            'Feed source: the URL to fetch and pass into HTML2RSS',
            'Feed description'
            'Path to destination RSS file', 
            'Path to pickle file containing feed state',
	    [Any other arguments you'd pass into the RSS2 constructor])

Specific Example

This example generates this RSS feed out of the list of upcoming books from O'Reilly.

#!/usr/bin/env python
import BeautifulSoup
from PyRSS2Gen import RSSItem, Guid
import ScrapeNFeed

class OReillyFeed(ScrapeNFeed.ScrapedFeed):    

    def HTML2RSS(self, headers, body):
        soup = BeautifulSoup.BeautifulSoup(body)
        headerText = soup.firstText('Upcoming Titles')
        titleList = headerText.findNext('ul')
        items = []
        for item in titleList('li'):
            link = self.baseURL + item.a['href']
            if not self.hasSeen(link):
                bookTitle = item.a.string
                releaseDate = item.em.string
                items.append(RSSItem(title=bookTitle,
                                     description=releaseDate,
                                     link=link))
        self.addRSSItems(items)

OReillyFeed.load("Newly announced O'Reilly books",
                 'http://www.oreilly.com/catalog/new.html',
                 "Keep track of O'Reilly books as they're announced",
                 'oreilly.xml', 
		 'oreilly.pickle',
                 managingEditor='leonardr@segfault.org (Leonard Richardson)')

Run this code once a day from a cron, and you'll have an up-to-date feed kept in oreilly.xml. The feed information is serialized to oreilly.pickle. The feed generated is a real PyRSS2Gen RSS2 object with real RSSItems in it, so you can put in any special PyRSS2Gen code you might need.

The Scrape 'N' Feed Advantage

Scrape 'N' Feed handles as much as possible of the work of turning a web page into an RSS feed, leaving you free to concentrate on the scraping code.

Scrape 'N' Feed saves you from having to write code to fetch the page. It respects the ETag and Last-Modified HTTP headers, which saves time and bandwidth over multiple runs.
As new items are put into the RSS feed, older items are automatically pushed out. A feed has a maximum length, but the maximum length can be temporarily violated if a single update yields a lot of new items. For instance, if the maximum feed length is 20, but an update reveals 25 new items, the feed will contain all 25 new items until the next update.
The items on some sites (such as the O'Reilly site) have no publication dates. This is no problem for Scrape 'N' Feed. It keeps track of when a link was seen for a feed, and it will use that date as the publication date if no other date is provided.
GUIDs are automatically created from the item links (If you want to create custom GUIDs, or if the page you're scraping doesn't have permalinks to each item, you can still create GUIDs manually).
Exceptions thrown while screen-scraping are turned into notifications in the RSS feed, so you're more likely to see them.

This document is part of Crummy, the webspace of Leonard Richardson (contact information). It was last modified on Tuesday, May 10 2005, 05:22:44 Nowhere Standard Time and last built on Friday, May 29 2026, 23:00:01 Nowhere Standard Time.

Document tree:

http://www.crummy.com/

software/

ScrapeNFeed/

Site Search:

Homemade RSS Feeds Come On In

Download

Generic Example

Specific Example

The Scrape 'N' Feed Advantage

Homemade RSS Feeds
Come On In