Scrape 'N' Feed is a simple Python wrapper around the PyRSS2Gen module. It implements almost all of the code you need to create RSS feeds out of web pages. All you have to write is the code that actually does the screen-scraping (and Beautiful Soup makes that easy). It stores feed state in a pickle file between invocations, freeing you from having to worry about most of the minor problems that get in the way of scraping RSS feeds.
This example should give you an idea of how Scrape 'N' Feed interacts with the PyRSS2Gen module.
#!/usr/bin/env python import ScrapeNFeed class MyFeed(ScrapeNFeed.ScrapedFeed): def HTML2RSS(self, headers, body): #Transform an HTTP response into a series of RSSItem objects items = [] ... #Scrape scrape scrape... ... #Then add the items to the feed. self.addRSSItems(items) #Load the feed with its metadata. This will fetch the URL, scrape it #for RSSItems, and possibly write out new RSS and pickle files. MyFeed.load('Feed title', 'Feed source: the URL to fetch and pass into HTML2RSS', 'Feed description' 'Path to destination RSS file', 'Path to pickle file containing feed state', [Any other arguments you'd pass into the RSS2 constructor])
This example generates this RSS feed out of the list of upcoming books from O'Reilly.
#!/usr/bin/env python import BeautifulSoup from PyRSS2Gen import RSSItem, Guid import ScrapeNFeed class OReillyFeed(ScrapeNFeed.ScrapedFeed): def HTML2RSS(self, headers, body): soup = BeautifulSoup.BeautifulSoup(body) headerText = soup.firstText('Upcoming Titles') titleList = headerText.findNext('ul') items = [] for item in titleList('li'): link = self.baseURL + item.a['href'] if not self.hasSeen(link): bookTitle = item.a.string releaseDate = item.em.string items.append(RSSItem(title=bookTitle, description=releaseDate, link=link)) self.addRSSItems(items) OReillyFeed.load("Newly announced O'Reilly books", 'http://www.oreilly.com/catalog/new.html', "Keep track of O'Reilly books as they're announced", 'oreilly.xml', 'oreilly.pickle', managingEditor='leonardr@segfault.org (Leonard Richardson)')
Run this code once a day from a cron, and you'll have an up-to-date
feed kept in oreilly.xml
. The feed information is serialized
to oreilly.pickle
. The feed generated is a real PyRSS2Gen
RSS2
object with real RSSItem
s in it, so you
can put in any special PyRSS2Gen code you might need.
Scrape 'N' Feed handles as much as possible of the work of turning a web page into an RSS feed, leaving you free to concentrate on the scraping code.
This document (source) is part of Crummy, the webspace of Leonard Richardson (contact information). It was last modified on Tuesday, May 10 2005, 05:22:44 Nowhere Standard Time and last built on Sunday, April 02 2023, 03:00:01 Nowhere Standard Time.
| Document tree: Site Search: |