Scrape 'N' Feed is a simple Python wrapper around the PyRSS2Gen module. It implements almost all of the code you need to create RSS feeds out of web pages. All you have to write is the code that actually does the screen-scraping (and Beautiful Soup makes that easy). It stores feed state in a pickle file between invocations, freeing you from having to worry about most of the minor problems that get in the way of scraping RSS feeds.
Get Scrape 'N' Feed v1.0
This example should give you an idea of how Scrape 'N' Feed interacts with the PyRSS2Gen module.
#!/usr/bin/env python
import ScrapeNFeed

class MyFeed(ScrapeNFeed.ScrapedFeed):

    def HTML2RSS(self, headers, body):
        # Transform an HTTP response into a series of RSSItem objects.
        items = []
        ...
        # Scrape scrape scrape...
        ...
        # Then add the items to the feed.
        self.addRSSItems(items)

# Load the feed with its metadata. This will fetch the URL, scrape it
# for RSSItems, and possibly write out new RSS and pickle files.
MyFeed.load('Feed title',
            'Feed source: the URL to fetch and pass into HTML2RSS',
            'Feed description',
            'Path to destination RSS file',
            'Path to pickle file containing feed state',
            # [Any other arguments you'd pass into the RSS2 constructor]
            )
This example generates this RSS feed out of the list of upcoming books from O'Reilly.
#!/usr/bin/env python
import BeautifulSoup
from PyRSS2Gen import RSSItem, Guid
import ScrapeNFeed

class OReillyFeed(ScrapeNFeed.ScrapedFeed):

    def HTML2RSS(self, headers, body):
        soup = BeautifulSoup.BeautifulSoup(body)
        # Find the list that follows the 'Upcoming Titles' header.
        headerText = soup.firstText('Upcoming Titles')
        titleList = headerText.findNext('ul')
        items = []
        for item in titleList('li'):
            link = self.baseURL + item.a['href']
            # Skip links the feed has already seen.
            if not self.hasSeen(link):
                bookTitle = item.a.string
                releaseDate = item.em.string
                items.append(RSSItem(title=bookTitle,
                                     description=releaseDate,
                                     link=link))
        self.addRSSItems(items)

OReillyFeed.load("Newly announced O'Reilly books",
                 'http://www.oreilly.com/catalog/new.html',
                 "Keep track of O'Reilly books as they're announced",
                 'oreilly.xml',
                 'oreilly.pickle',
                 managingEditor='firstname.lastname@example.org (Leonard Richardson)')
Run this code once a day from a cron job, and you'll have an up-to-date feed kept in oreilly.xml. The feed information is serialized in oreilly.pickle. The feed generated is a real PyRSS2Gen RSS2 object with real RSSItems in it, so you can put in any special PyRSS2Gen code you might need.
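To make the idea of state kept between invocations concrete, here is a minimal sketch of remembering already-seen links in a pickle file, using only the standard library. It is not Scrape 'N' Feed's actual implementation: the SeenLinks class, the new_links function, and the demo paths are all hypothetical names made up for illustration.

```python
import os
import pickle
import tempfile


class SeenLinks:
    """Hypothetical helper: remembers which links earlier runs already handled."""

    def __init__(self, pickle_path):
        self.pickle_path = pickle_path
        self.seen = set()
        # Reload the state left behind by a previous invocation, if any.
        if os.path.exists(pickle_path):
            with open(pickle_path, "rb") as f:
                self.seen = pickle.load(f)

    def has_seen(self, link):
        return link in self.seen

    def add(self, link):
        self.seen.add(link)

    def save(self):
        # Write the state back out so the next run can pick it up.
        with open(self.pickle_path, "wb") as f:
            pickle.dump(self.seen, f)


def new_links(links, pickle_path):
    """Return only the links not seen by earlier runs, then persist the state."""
    state = SeenLinks(pickle_path)
    fresh = [link for link in links if not state.has_seen(link)]
    for link in fresh:
        state.add(link)
    state.save()
    return fresh


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.pickle")
    print(new_links(["/a", "/b"], path))        # first run: both links are new
    print(new_links(["/a", "/b", "/c"], path))  # second run: only "/c" is new
```

Each call to new_links stands in for one cron run: the second call sees the pickle written by the first, so it reports only the link that was added in between.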
Scrape 'N' Feed handles as much as possible of the work of turning a web page into an RSS feed, leaving you free to concentrate on the scraping code.
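If you want a feel for the scraping step on its own, here is a rough sketch that pulls a link, title, and release date out of each list item using the standard library's html.parser in place of Beautiful Soup. The sample markup, the TitleListParser class, and the scrape_titles function are assumptions invented for this illustration; the markup is not the real O'Reilly page. Unlike the example above, it uses urljoin rather than string concatenation to build links, and it looks at every list item instead of only the list after a particular header.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Made-up sample markup, loosely shaped like a list of book links.
SAMPLE_HTML = """
<h2>Upcoming Titles</h2>
<ul>
  <li><a href="/catalog/book1/">First Book</a> <em>May 2005</em></li>
  <li><a href="/catalog/book2/">Second Book</a> <em>June 2005</em></li>
</ul>
"""


class TitleListParser(HTMLParser):
    """Collects a (link, title, date) tuple from each <li> it encounters."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.entries = []      # finished (link, title, date) tuples
        self._current = None   # dict for the <li> being parsed, if any
        self._field = None     # "title" inside <a>, "date" inside <em>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._current = {"link": None, "title": "", "date": ""}
        elif self._current is not None and tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current["link"] = urljoin(self.base_url, href)
            self._field = "title"
        elif self._current is not None and tag == "em":
            self._field = "date"

    def handle_data(self, data):
        # Only keep text that sits inside an <a> or <em> within a list item.
        if self._current is not None and self._field:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if tag in ("a", "em"):
            self._field = None
        elif tag == "li" and self._current is not None:
            cur = self._current
            if cur["link"]:
                self.entries.append(
                    (cur["link"], cur["title"].strip(), cur["date"].strip()))
            self._current = None


def scrape_titles(body, base_url):
    """Parse an HTML string and return the list of (link, title, date) tuples."""
    parser = TitleListParser(base_url)
    parser.feed(body)
    parser.close()
    return parser.entries


if __name__ == "__main__":
    for entry in scrape_titles(SAMPLE_HTML, "http://www.example.com/"):
        print(entry)
```

Inside an HTML2RSS method, tuples like these would be what you turn into RSSItem objects before handing them to the feed.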
This document (source) is part of Crummy, the webspace of Leonard Richardson (contact information). It was last modified on Tuesday, May 10 2005, 05:22:44 Nowhere Standard Time and last built on Wednesday, October 28 2020, 09:00:01 Nowhere Standard Time.