Scrape 'N' Feed

Homemade RSS Feeds
Come On In

Scrape 'N' Feed is a simple Python wrapper around the PyRSS2Gen module. It implements almost all of the code you need to create RSS feeds out of web pages. All you have to write is the code that actually does the screen-scraping (and Beautiful Soup makes that easy). It stores feed state in a pickle file between invocations, freeing you from having to worry about most of the minor problems that get in the way of scraping RSS feeds.

Download

Get Scrape 'N' Feed v1.0

Generic Example

This example should give you an idea of how Scrape 'N' Feed interacts with the PyRSS2Gen module.

#!/usr/bin/env python
import ScrapeNFeed

class MyFeed(ScrapeNFeed.ScrapedFeed):

    def HTML2RSS(self, headers, body):
        # Transform an HTTP response into a series of RSSItem objects.
        items = []
        ...
        # Scrape scrape scrape...
        ...
        # Then add the items to the feed.
        self.addRSSItems(items)

# Load the feed with its metadata. This will fetch the URL, scrape it
# for RSSItems, and possibly write out new RSS and pickle files.
MyFeed.load('Feed title',
            'Feed source: the URL to fetch and pass into HTML2RSS',
            'Feed description',
            'Path to destination RSS file',
            'Path to pickle file containing feed state',
            # Any other arguments you'd pass into the RSS2 constructor go here.
            )
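
The division of labor in the template above (the base class fetches and bookkeeps, your subclass only scrapes) can be sketched with a toy stand-in. This is a minimal illustration of the pattern under stated assumptions, not Scrape 'N' Feed's actual implementation: ToyScrapedFeed, LineFeed, fetch and add_items are hypothetical names, and the HTTP response is canned so the sketch runs without a network.

```python
class ToyScrapedFeed:
    """Hypothetical stand-in for a scraped-feed base class (not ScrapeNFeed)."""

    def __init__(self, title, url):
        self.title = title
        self.url = url
        self.items = []

    def fetch(self):
        """Return (headers, body). A real feed would make an HTTP request."""
        raise NotImplementedError

    def HTML2RSS(self, headers, body):
        """Hook: subclasses turn the response into feed items."""
        raise NotImplementedError

    def add_items(self, items):
        """Collect the items the subclass scraped."""
        self.items.extend(items)

    @classmethod
    def load(cls, title, url):
        """Drive the whole run: fetch, hand the response to the hook, return the feed."""
        feed = cls(title, url)
        headers, body = feed.fetch()
        feed.HTML2RSS(headers, body)
        return feed


class LineFeed(ToyScrapedFeed):
    def fetch(self):
        # Canned response standing in for a real page (assumption for the demo).
        return {"content-type": "text/plain"}, "First post\nSecond post\n"

    def HTML2RSS(self, headers, body):
        # The only feed-specific code: one item per non-empty line.
        self.add_items([line for line in body.splitlines() if line])


feed = LineFeed.load("Demo feed", "http://example.com/")
print(feed.items)
```

The subclass never calls fetch or load on itself. It fills in one hook, which is the same shape as overriding HTML2RSS in the template above.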

Specific Example

This example generates an RSS feed from O'Reilly's list of upcoming books.

#!/usr/bin/env python
import BeautifulSoup
from PyRSS2Gen import RSSItem, Guid
import ScrapeNFeed

class OReillyFeed(ScrapeNFeed.ScrapedFeed):    

    def HTML2RSS(self, headers, body):
        soup = BeautifulSoup.BeautifulSoup(body)
        headerText = soup.firstText('Upcoming Titles')
        titleList = headerText.findNext('ul')
        items = []
        for item in titleList('li'):
            link = self.baseURL + item.a['href']
            if not self.hasSeen(link):
                bookTitle = item.a.string
                releaseDate = item.em.string
                items.append(RSSItem(title=bookTitle,
                                     description=releaseDate,
                                     link=link))
        self.addRSSItems(items)

OReillyFeed.load("Newly announced O'Reilly books",
                 'http://www.oreilly.com/catalog/new.html',
                 "Keep track of O'Reilly books as they're announced",
                 'oreilly.xml', 
                 'oreilly.pickle',
                 managingEditor='leonardr@segfault.org (Leonard Richardson)')

Run this code once a day from a cron job, and you'll have an up-to-date feed kept in oreilly.xml. The feed state is serialized to oreilly.pickle between runs. The generated feed is a real PyRSS2Gen RSS2 object with real RSSItems in it, so you can add any special PyRSS2Gen code you might need.
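
To make the role of the pickle file concrete, here is a minimal sketch of the general idea: remember which links earlier runs have already reported, so each run only emits new ones. It uses only the standard library and is an assumption about the technique, not the code Scrape 'N' Feed uses internally. The helper names load_seen, save_seen and new_links are hypothetical, as are the sample links.

```python
import os
import pickle
import tempfile


def load_seen(path):
    """Return the set of links recorded by earlier runs (empty on the first run)."""
    if not os.path.exists(path):
        return set()
    with open(path, "rb") as f:
        return pickle.load(f)


def save_seen(path, seen):
    """Write the set of seen links back to disk for the next run."""
    with open(path, "wb") as f:
        pickle.dump(seen, f)


def new_links(path, scraped_links):
    """Return the links not seen before, in scrape order, and record them."""
    seen = load_seen(path)
    fresh = []
    for link in scraped_links:
        if link not in seen:
            fresh.append(link)
            seen.add(link)
    save_seen(path, seen)
    return fresh


# Two simulated daily runs against a throwaway state file.
state_path = os.path.join(tempfile.mkdtemp(), "demo.pickle")
first_run = new_links(state_path, ["/catalog/a", "/catalog/b"])
second_run = new_links(state_path, ["/catalog/a", "/catalog/b", "/catalog/c"])
print(first_run)
print(second_run)
```

The second run reports only the link that was not present on the first run, which is the behavior the hasSeen check in the example above relies on.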

The Scrape 'N' Feed Advantage

Scrape 'N' Feed handles as much as possible of the work of turning a web page into an RSS feed, leaving you free to concentrate on the scraping code.


This document is part of Crummy, the webspace of Leonard Richardson. It was last modified on Tuesday, May 10 2005, 05:22:44 Nowhere Standard Time and last built on Sunday, April 02 2023, 03:00:01 Nowhere Standard Time.

Crummy is © 1996-2023 Leonard Richardson. Unless otherwise noted, all text licensed under a Creative Commons License.
