This page contains automated test results for code from O'Reilly's Ruby Cookbook. If this code looks interesting or useful, you might want to buy the whole book.

Converting HTML Documents From the Web into Text
Code | Expected | Actual
require 'open-uri'
# On Ruby 3.0 and later, Kernel#open no longer fetches URLs; use URI.open.
example = URI.open('http://www.example.com/')
#<StringIO:0xb7bb601c> #<StringIO:0xb7d2ec90>
html = example.read
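# Unwrap the <body> element, replace every remaining tag with a space,
# then collapse runs of blank lines.
plain_text = html.sub(%r{<body.*?>(.*?)</body>}mi, '\1').gsub(/<.*?>/m, ' ').
	gsub(%r{(\n\s*){2}}, "\n\n")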
require 'cgi'
plain_text = CGI.unescapeHTML(plain_text)
puts plain_text	
Example Web Page

You have reached this web page by typing "example.com",
"example.net",
or "example.org" into your web browser.
These domain names are reserved for use in documentation and are not available
for registration. See  RFC
2606 , Section 3.
Example Web Page 

You have reached this web page by typing "example.com",
"example.net",
  or "example.org" into your web browser. 
 These domain names are reserved for use in documentation and are not available 
  for registration. See  RFC 
  2606 , Section 3.
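
The recipe above can also be tried without a network connection. The following is a minimal sketch that applies the same three substitutions to an inline string. SAMPLE_HTML and html_to_text are hypothetical names introduced here for illustration; they are not part of the book's code.

```ruby
require 'cgi'

# Hypothetical sample document standing in for a fetched page.
SAMPLE_HTML = <<~HTML
  <html><head><title>Ignored</title></head>
  <body><h1>Fish &amp; Chips</h1>
  <p>Served <b>hot</b>.</p></body></html>
HTML

# Hypothetical helper wrapping the substitutions used in the recipe.
def html_to_text(html)
  text = html.sub(%r{<body.*?>(.*?)</body>}mi, '\1') # drop the <body> and </body> tags themselves
  text = text.gsub(/<.*?>/m, ' ')                     # replace every remaining tag with a space
  text = text.gsub(/(\n\s*){2}/, "\n\n")              # collapse runs of blank lines
  CGI.unescapeHTML(text)                              # decode entities such as &amp;
end

puts html_to_text(SAMPLE_HTML)
```

Because sub only removes the body tags themselves, text outside the body (here the title, "Ignored") still appears in the result, just as the page title does in the output above.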
require 'open-uri'
require 'cgi'
class HTMLSanitizer
  attr_accessor :html
  @@ignore_tags = ['head', 'script', 'frameset' ]
  @@inline_tags = ['span', 'strong', 'i', 'u'   ]
  @@block_tags  = ['p', 'div', 'ul', 'ol'       ]
  def initialize(source='')
    begin
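      # URI.open fetches URLs and falls back to Kernel#open for local paths
      # (Kernel#open stopped handling URLs in Ruby 3.0).
      @html = URI.open(source).read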
    rescue Errno::ENOENT
      # If it's not a file, assume it's an HTML string
      @html = source
    end
  end
  def plain_text
    # remove pre-existing blank spaces between tags since we will
    # be adding spaces on our own
    @plain_text = @html.gsub(/\s*(<.*?>)/m, '\1')
    handle_ignore_tags
    handle_inline_tags
    handle_block_tags
    handle_all_other_tags
    return CGI.unescapeHTML(@plain_text)
  end
  private
  def tag_regex(tag)
    %r{<#{tag}.*?>(.*?)</#{tag}>}mi
  end
  def handle_ignore_tags
    @@ignore_tags.each { |tag| @plain_text.gsub!(tag_regex(tag), '') }
  end
  def handle_inline_tags
    @@inline_tags.each { |tag| @plain_text.gsub!(tag_regex(tag), '\1 ') }
  end
  def handle_block_tags
    @@block_tags.each { |tag| @plain_text.gsub!(tag_regex(tag), "\n\\1\n") }
  end
  def handle_all_other_tags
    @plain_text.gsub!(/<br.*?>/mi, "\n")
    @plain_text.gsub!(/<.*?>/m, ' ')
    @plain_text.gsub!(/(\n\s*){2}/, "\n\n")
  end
end
puts HTMLSanitizer.new('http://slashdot.org/').plain_text
Stories
Slash Boxes
Comments

Slashdot

News for nerds, stuff that matters

Login

Why Login?    Why Subscribe?
...
OSTG     SourceForge    ThinkGeek    ITMJ    Linux.com    NewsForge    freshmeat    Newsletters    Jobs    Broadband    Whitepapers     X   

Stories 
 Slash Boxes 
 Comments 

Slashdot  

News for nerds, stuff that matters 

Login  

Why Login?    Why Subscribe?  

Sections 

Main     Apple    AskSlashdot    Books    Developers    Games    Hardware    Interviews    IT    Linux    Politics    Science    YRO  

Vendors  

AMD  

Help 

FAQ    Bugs  

Stories 

Old Stories    Old Polls    Topics    Hall of Fame    Submit Story  

About 

Supporters    Code  

Services 

Broadband    PriceGrabber    Sponsor Solutions    Special Offers    Tech Jobs  

Advertisement 

Science  

Device Developed To Help Socially Challenged    
		NASA Priorities Out of Whack?    
		Google Accused of Bio-piracy    
		Quasars Used for Encryption    
		Scrutinizing a Stem Cell Trial    
		Inside DARPA's Robot Race    
		Want to Experience Zero G? Stay in Bed    
		Electrical Noise Causing Physiological Stress?    
		VR Treatment for Lazy Eye    
		Drugs May Offer AIDS Prevention  

Slashdot Login  

Log in  
		Nickname   
		Password      
		Public Terminal    

[  
			Create a new account   ]

Slashdot Poll 

Poll    P=NP 
 True
 False
 Undecidable
 Either way won't affect me in the slightest
 == CowboyNeal

[  Results   |  Polls   ]

Comments: 557  | Votes: 31843   

Older Stuff 

Wednesday  March 29   Pair-Programming with a Wide Gap in Talent?  (94)    DRM and the Myth of the Analog Hole  (277)    Slashback: Vista Rewrite, Tuttle Travesty, Mac Botnets  (223)    Help for an MMORPG Addict?  (494)    iPod Update to Address Volume-Level Concerns  (254)    HP Lets User Take Linux for a Virtual Spin  (35)    UK Government Passes ID Card Bill  (280)    Recounting Bioware's Baldur's Gate II  (83)    A Web Based Solution to Replace Exchange?  (64)    Trustix, a Worthy Contender?  (103)    An Elder Scrolls Retrospective  (80)    DesktopBSD 1.0 Final Released  (170)    Bioware and Pandemic - Story So Far  (10)    MS Gives 60-Day Deadline to Web Devs  (355)    Game Site Space For $$  (40)  

Older Articles  
  Yesterday's News  

Book Reviews  

Recent reviews from Slashdot readers:
   The Equation That Couldn't Be Solved   looks at the last 150 years of math development, with a focus on group theory and its impact. ( Joe Kauzlarich's review )    The Areas of My Expertise   is a great assorted set of lists, data and other odd pieces of data.  Great reading in the bathroom or plane riders ( Peter Wayner's
review )     Write Portable Code   does an excellent job of explaining how to write code for multiple environments and variant systems. ( Simon P. Chappell's review ) 

Submitting a review  for
consideration is easy; please first read Slashdot's
book review guidelines . Updated: 20051129 by hemos

IT : Anti-malware Vendors Stare Down Microsoft Threat 

Posted by Zonk 

on Thursday March 30, @02:28PM 
from the angry-eyes  dept. 

Captain Rose  writes "Matt Hines at eWEEK has stepped up to report the other side of the story CNET inked recently on the perceived death knell that Vista will deliver to independent anti-spyware vendors . There's definitely a fight in store (David v. Goliath), though who knows how long we'll have to wait to see it play out now that Vista's delayed yet again. Is this a bit of foreshadowing on how the new Microsoft OS will address the self-replicating, zero-day spyware threats?"  From the article:"Most industry watchers concede that it will be hard for Microsoft to easily displace the enterprise security businesses of leading vendors such as Symantec, McAfee and Trend Micro, which market integrated packages of applications to companies wishing to solve long lists of problems. However, for firms that are focused on only one of those problem areas, analysts said, Vista and the other Microsoft security products could pose a significant threat." 

(
   Read More...     57  of 69  comments   it.slashdot.org  

)

Games  : The Oblivion of Western RPGs    30 of 36 comments

Science : Device Developed To Help Socially Challenged 

Posted by Zonk 

on Thursday March 30, @01:55PM 
from the insert-your-own-self-referential-joke-here  dept. 

An anonymous reader writes "A device from MIT Media Labs that can pick up on people's emotions is being developed to help people with autism  relate to those around them. It will alert its autistic user if the person they are talking to starts showing signs of getting bored or annoyed."  From the article:"The 'emotional social intelligence prosthetic' device, which El Kaliouby is constructing along with MIT colleagues Rosalind Picard and Alea Teeters, consists of a camera small enough to be pinned to the side of a pair of glasses, connected to a hand-held computer running image recognition software plus software that can read the emotions these images show. If the wearer seems to be failing to engage his or her listener, the software makes the hand-held computer vibrate." 

(
   Read More...     109  of 143  comments   science.slashdot.org  

)

Theaters Unhappy About Faster DVD Releases 

Posted by Zonk 

on Thursday March 30, @01:25PM 
from the just-get-to-the-downloading-already  dept. 

dolphinlover  writes "As movie studios such as Walt Disney Co. have pushed for more rapid DVD releases  of movies to combat piracy on the Internet, executives of movie theater chains such as Regal Entertainment Group and National Amusements Inc. have countered, saying that seeing a movie in the theater is a 'fuller, more entertaining experience' and that the time window between movie and DVD releases should even be extended.  Their views run counter to Disney's Chief Executive Rober Iger view that DVDs ought to come out simultaneously with the theater releases of movies.  Both sides say their plans would benefit consumers.  Is either correct, or are both approaching the situation from the wrong angle?" 

(
   Read More...     289  of 352  comments 

)

Games  : Grand Theft Auto Civil Case Moves Forward    53 of 65 comments

IT : Why Phishing Works 

Posted by Zonk 

on Thursday March 30, @12:47PM 
from the lower-your-expectations  dept. 

h0neyp0t writes "Harvard and Berkeley have released a study that shows why phishing attacks work  (pdf).  When asked if a phishing site was legit or a spoof, 23% of users use only the content of the website to make the decision! The majority of users ignore the address and SSL indicators in the browser.  Some users think that favicons and lock icons in HTML are more important indicators.  The paper hints that the proposed IE7 security indicators and multi-colored address bar  will also suffer a similar fate.  This study is brought to you by the people who developed the security skins Firefox extension ." 

(
   Read More...     149  of 183  comments   it.slashdot.org  

)

Games  : More Xbox Titles Added to 360 List    40 of 48 comments

Science : NASA Priorities Out of Whack? 

Posted by Zonk 

on Thursday March 30, @12:11PM 
from the still-no-word-on-search-for-spaghetti-monster  dept. 

amerinese writes "Just last week, we saw a story on NASA reconsidering the fate of the DAWN mission , another reminder of the space agency's budget woes. Gregg Easterbrook over at Slate.com argues not only is the budget a little short, but NASA's priorities are all wrong . From the article: 'For at least a decade, it's been clear that the space shuttle program is a clunker. Nonetheless, NASA's funding remains heavy on the shuttle and the space station, while usually slighting science. This year's proposed budget for fiscal 2007 takes the cosmic cake.' Is NASA just not thinking creatively enough?" 

(
   Read More...     136  of 174  comments   science.slashdot.org  

)

Ask.Com's New Look Competes Well With Google 

Posted by Zonk 

on Thursday March 30, @11:44AM 
from the head-to-head-face-to-face  dept. 

Carl Bialik from WSJ  writes "Ask Jeeves has been overhauled and renamed Ask.com. The Wall Street Journal's Walt Mossberg tested the new site against Google and found that Ask.com holds its own  and even beats the search champ in some cases. 'It has some very nice features Google lacks, including previews of the sites it finds, an easy way to narrow or broaden your search results, and frequent top-of-the-screen answers that lead you directly to core information,' Mossberg writes." 

(
   Read More...     132  of 156  comments 

)

Games  : UMD Format's Death Rattle Begins    86 of 103 comments

Hardware : ILM's Datacenter 

Posted by CmdrTaco 

on Thursday March 30, @11:14AM 
from the something-to-read  dept. 

kylegordon  writes "CGW has inside scoop on Industrial Light and Magic's facilities  after they moved from San Rafeal to San Franciscos Presidio. With 3000 disks, it can shift 170Tb to 5000 rendernodes over 10GbE and 1GbE network links. It's an impressive system, for impressive films." 

(
   Read More...     87  of 114  comments   hardware.slashdot.org  

)

Apple : Will Apple Disappoint on 30th Anniversary? 

Posted by Zonk 

on Thursday March 30, @10:59AM 
from the omg-so-old-so-old  dept. 

An anonymous reader writes "We've seen the media get over-excited about an Apple launch before, but one CNET columnist is 'threatening suicide'  if Apple don't announce something for their 30th Anniversary this Saturday. CNET is concerned at the lack of any news from Apple: 'You'd guess that Steve Jobs will at least have to walk out onto the lawn in Cupertino, light a few fireworks and make some whooping noises. It's that or risk an international incident.' Is Apple going to keep a low profile for their 30th?" 

(
   Read More...     201  of 249  comments   apple.slashdot.org  

)

IT : Lenovo Under U.S. Probe for Spying 

Posted by Zonk 

on Thursday March 30, @10:21AM 
from the seekrit-agent-man  dept. 

BigControversy writes "The DailyTech has a report indicating that Lenovo, the giant Chinese PC manufacturer, is under a probe  by the U.S.-China Economic Security Review Commission (USCC) for possible bugging. Apparently, the government has ordered 16,000 PCs from Lenovo but is now requesting that Lenovo be investigated by intelligence agencies. The fear is of foreign intelligence applying pressure to Lenovo to equip its PCs so that the U.S. can be spied on."  From the article:"Despite the probe, Lenovo says that its international business, especially those that deal with the US, follow strictly laid out government regulations and rules. Lenovo also claims that even after purchasing IBM's PC division, its international business has not been affected negatively. Interestingly, in an interview with the BBC, Lenovo mentioned that an open investigation or probe may negatively affect the way that the company deals with future government contracts or bids."  There just has to be better uses of our intelligence community's time.

(
   Read More...     205  of 250  comments   it.slashdot.org  

)

Science : Google Accused of Bio-piracy 

Posted by Zonk 

on Thursday March 30, @09:56AM 
from the ahoy-maties-turn-over-those-ribonucleic-acids-if-you-please  dept. 

Simon Phillips writes "ZDNet is reporting
that Google has been accused of being the 'biggest
threat to genetic privacy' this year  for its plan to create a searchable database of genetic information. From the article: 'Google was presented with an award as part of the Captain
Hook Awards for Biopiracy  in Curitiba, Brazil, this week. The organisers allege that Google's collaboration with genomic research institute J. Craig Venter to create a searchable online database of all the genes on the planet is a clear example of biopiracy.'" 

(
   Read More...     159  of 206  comments   science.slashdot.org  

)

Games : Japan's Gaming History Now Safe 

Posted by Zonk 

on Thursday March 30, @09:10AM 
from the can't-keep-an-old-NES-down  dept. 

An anonymous reader writes "The Guardian today has covered the final part of the ongoing saga  regarding the Electrical Appliance and Material Safety Law  in Japan. Thankfully, the law has been almost reversed allowing the continued sale of second hand electrical goods (including games consoles)."  From the article:"The Japanese secondhand electrical goods market was officially estimated last year to be worth around &pound;500m ... The government probably hoped the law would go largely unnoticed and bring a variety of benefits. By taking the money out of the secondhand market and injecting it into the market for new goods, regulation (of old products) and revivalisation (of the economy) would be achieved in one fell swoop. On paper, anyway. In practice it was rather different." 

(
   Read More...     70  of 85  comments   games.slashdot.org  

)

IT : Hotmail On Your Desktop 

Posted by

samzenpus

on Thursday March 30, @08:05AM 
from the mail-everywhere  dept. 

thomas2you  writes "Microsoft has just began its beta testing on a new program, made to have Microsoft's hotmail on your own desktop  according to an article on CNET. it's going to be free software and your going to be able to manage multiple accounts and they are attempting to include the ability to also just control all pop3 and smtp accounts you have, including Google's gmail as well as Windows Live Mail, the successor to Hotmail. From the article, "The move is a shift for the Hotmail business, which in the past, has charged users who wanted to read their mail using desktop software, rather than a Web browser. Microsoft charged $20 and up for its paid service."" 

(
   Read More...     160  of 203  comments   it.slashdot.org  

)

Games : Revolution Horsepower Revealed 

Posted by

samzenpus

on Wednesday March 29, @11:15PM 
from the look-inside  dept. 

Revo writes "IGN.com unveiled leaked specs for Nintendo's upcoming Revolution console  today. The system really is about twice as powerful as a GameCube and a far cry from the Xbox 360 and PS3. Of course, the focus is on the innovative controller and the affordable price." 

(
   Read More...     531  of 668  comments   games.slashdot.org  

)

IT : Australian Rules to Crackdown on Spam 

Posted by

samzenpus

on Wednesday March 29, @10:33PM 
from the no-pills-down-under  dept. 

siffty writes "Internet service providers could face huge fines  if they do not provide spam filtering or impose email sending limits under new rules set down by a communications watchdog.
The Australian Communications and Media Authority ( ACMA Media Release  ) today registered the world's first legislative code of practice for internet and email service providers.

Dealing with unsolicited email or spam costs business and home internet users millions of dollars each year in wasted time and upgrading security systems.

But under the new code, ISPs will have to offer spam filtering options to subscribers and provide a system of handling complaints.

They will also have to impose reasonable limits on the rate at which subscribers can send email." 

(
   Read More...     81  of 107  comments   it.slashdot.org  

)

Hardware : Unmanned Aerial Drones Coming Soon Above U.S. 

Posted by

samzenpus

on Wednesday March 29, @09:15PM 
from the the-eye-in-the-sky  dept. 

cnet-declan  writes "Unmanned aerial vehicles (UAVs) have been flying over Iraq and Afghanistan, but now the Bush administration wants to use them for domestic surveillance . A top Homeland Security official told Congress today, according to this CNET News.com article, that: "We need additional technology to supplement manned aircraft surveillance and current ground assets to ensure more effective monitoring of United States territory." One county in North Carolina is already using UAVs to monitor public gatherings . But what happens when lots of relatively dumb drones have to share airspace with aircraft carrying passengers? A pilot's association is worried ." 

(
   Read More...     547  of 709  comments   hardware.slashdot.org  

)

IT  : Quasars Used for Encryption    40 of 47 comments
 <&nbsp; Yesterday's News 

&nbsp;>   

Words can never express what words can never express. 

All trademarks and copyrights on this page are owned by their respective owners.  Comments are owned by the Poster.  The Rest &copy; 1997-2006 OSTG .

home    awards    contribute story    older articles    OSTG    advertise    about    terms of service    privacy    faq    rss
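
The HTMLSanitizer class tested above handles three groups of tags differently: ignored tags are dropped along with their contents, inline tags are replaced by their contents plus a space, and block tags are replaced by their contents on a line of their own. The following is a condensed, offline sketch of that pipeline, assuming the same tag lists and regular expressions. The constants, the sanitize method, and the SAMPLE string are hypothetical names for illustration, and the sketch reassigns the string instead of using gsub! as the class does.

```ruby
require 'cgi'

# Tag groups copied from the class above; the constant names are hypothetical.
IGNORE_TAGS = %w[head script frameset].freeze
INLINE_TAGS = %w[span strong i u].freeze
BLOCK_TAGS  = %w[p div ul ol].freeze

# Matches an element named +tag+ and captures its contents.
def tag_regex(tag)
  %r{<#{tag}.*?>(.*?)</#{tag}>}mi
end

# Hypothetical function-style version of HTMLSanitizer#plain_text.
def sanitize(html)
  text = html.gsub(/\s*(<.*?>)/m, '\1')                              # strip whitespace before tags
  IGNORE_TAGS.each { |tag| text = text.gsub(tag_regex(tag), '') }    # drop element and contents
  INLINE_TAGS.each { |tag| text = text.gsub(tag_regex(tag), '\1 ') } # keep contents, add a space
  BLOCK_TAGS.each  { |tag| text = text.gsub(tag_regex(tag), "\n\\1\n") } # contents on their own line
  text = text.gsub(/<br.*?>/mi, "\n")                                # line breaks become newlines
  text = text.gsub(/<.*?>/m, ' ')                                    # any other tag becomes a space
  text = text.gsub(/(\n\s*){2}/, "\n\n")                             # collapse runs of blank lines
  CGI.unescapeHTML(text)
end

# Hypothetical input passed as a string, so nothing is fetched.
SAMPLE = '<html><head><title>Skip me</title></head>' \
         '<body><div><strong>Salt</strong>&amp; pepper</div>' \
         '<script>alert(1)</script></body></html>'

puts sanitize(SAMPLE)
```

The order of the steps matters: ignored elements are removed first so that their contents never reach the later substitutions, and the catch-all tag replacement runs last.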