This page contains automated test results for code from O'Reilly's Ruby Cookbook. If this code looks interesting or useful, you might want to buy the whole book.
Converting HTML Documents From the Web into Text | ||
---|---|---|
Code | Expected | Actual |
require 'open-uri'
example = open('http://www.example.com/') |
#<StringIO:0xb7bb601c> | #<StringIO:0xb7d2ec90> |
html = example.read
plain_text = html.sub(%r{<body.*?>(.*?)</body>}mi, '\1').gsub(/<.*?>/m, ' ').
  gsub(%r{(\n\s*){2}}, "\n\n")
require 'cgi'
plain_text = CGI.unescapeHTML(plain_text)
puts plain_text |
Example Web Page You have reached this web page by typing "example.com", "example.net", or "example.org" into your web browser. These domain names are reserved for use in documentation and are not available for registration. See RFC 2606 , Section 3. |
Example Web Page You have reached this web page by typing "example.com", "example.net", or "example.org" into your web browser. These domain names are reserved for use in documentation and are not available for registration. See RFC 2606 , Section 3. |
require 'open-uri'
require 'cgi'
class HTMLSanitizer
  attr_accessor :html
  @@ignore_tags = ['head', 'script', 'frameset' ]
  @@inline_tags = ['span', 'strong', 'i', 'u' ]
  @@block_tags = ['p', 'div', 'ul', 'ol' ]
  def initialize(source='')
    begin
      @html = open(source).read
    rescue Errno::ENOENT
      # If it's not a file, assume it's an HTML string
      @html = source
    end
  end
  def plain_text
    # remove pre-existing blank spaces between tags since we will
    # be adding spaces on our own
    @plain_text = @html.gsub(/\s*(<.*?>)/m, '\1')
    handle_ignore_tags
    handle_inline_tags
    handle_block_tags
    handle_all_other_tags
    return CGI.unescapeHTML(@plain_text)
  end
  private
  def tag_regex(tag)
    %r{<#{tag}.*?>(.*?)</#{tag}>}mi
  end
  def handle_ignore_tags
    @@ignore_tags.each { |tag| @plain_text.gsub!(tag_regex(tag), '') }
  end
  def handle_inline_tags
    @@inline_tags.each { |tag| @plain_text.gsub!(tag_regex(tag), '\1 ') }
  end
  def handle_block_tags
    @@block_tags.each { |tag| @plain_text.gsub!(tag_regex(tag), "\n\\1\n") }
  end
  def handle_all_other_tags
    @plain_text.gsub!(/<br.*?>/mi, "\n")
    @plain_text.gsub!(/<.*?>/m, ' ')
    @plain_text.gsub!(/(\n\s*){2}/, "\n\n")
  end
end
puts HTMLSanitizer.new('http://slashdot.org/').plain_text |
Stories Slash Boxes Comments Slashdot News for nerds, stuff that matters Login Why Login? Why Subscribe? ... |
OSTG SourceForge ThinkGeek ITMJ Linux.com NewsForge freshmeat Newsletters Jobs Broadband Whitepapers X Stories Slash Boxes Comments Slashdot News for nerds, stuff that matters Login Why Login? Why Subscribe? Sections Main Apple AskSlashdot Books Developers Games Hardware Interviews IT Linux Politics Science YRO Vendors AMD Help FAQ Bugs Stories Old Stories Old Polls Topics Hall of Fame Submit Story About Supporters Code Services Broadband PriceGrabber Sponsor Solutions Special Offers Tech Jobs Advertisement Science Device Developed To Help Socially Challenged NASA Priorities Out of Whack? Google Accused of Bio-piracy Quasars Used for Encryption Scrutinizing a Stem Cell Trial Inside DARPA's Robot Race Want to Experience Zero G? Stay in Bed Electrical Noise Causing Physiological Stress? VR Treatment for Lazy Eye Drugs May Offer AIDS Prevention Slashdot Login Log in Nickname Password Public Terminal [ Create a new account ] Slashdot Poll Poll P=NP True False Undecidable Either way won't affect me in the slightest == CowboyNeal [ Results | Polls ] Comments: 557 | Votes: 31843 Older Stuff Wednesday March 29 Pair-Programming with a Wide Gap in Talent? (94) DRM and the Myth of the Analog Hole (277) Slashback: Vista Rewrite, Tuttle Travesty, Mac Botnets (223) Help for an MMORPG Addict? (494) iPod Update to Address Volume-Level Concerns (254) HP Lets User Take Linux for a Virtual Spin (35) UK Government Passes ID Card Bill (280) Recounting Bioware's Baldur's Gate II (83) A Web Based Solution to Replace Exchange? (64) Trustix, a Worthy Contender? (103) An Elder Scrolls Retrospective (80) DesktopBSD 1.0 Final Released (170) Bioware and Pandemic - Story So Far (10) MS Gives 60-Day Deadline to Web Devs (355) Game Site Space For $$ (40) Older Articles Yesterday's News Book Reviews Recent reviews from Slashdot readers: The Equation That Couldn't Be Solved looks at the last 150 years of math development, with a focus on group theory and its impact. 
( Joe Kauzlarich's review ) The Areas of My Expertise is a great assorted set of lists, data and other odd pieces of data. Great reading in the bathroom or plane riders ( Peter Wayner's review ) Write Portable Code does an excellent job of explaining how to write code for multiple environments and variant systems. ( Simon P. Chappell's review ) Submitting a review for consideration is easy; please first read Slashdot's book review guidelines . Updated: 20051129 by hemos IT : Anti-malware Vendors Stare Down Microsoft Threat Posted by Zonk on Thursday March 30, @02:28PM from the angry-eyes dept. Captain Rose writes "Matt Hines at eWEEK has stepped up to report the other side of the story CNET inked recently on the perceived death knell that Vista will deliver to independent anti-spyware vendors . There's definitely a fight in store (David v. Goliath), though who knows how long we'll have to wait to see it play out now that Vista's delayed yet again. Is this a bit of foreshadowing on how the new Microsoft OS will address the self-replicating, zero-day spyware threats?" From the article:"Most industry watchers concede that it will be hard for Microsoft to easily displace the enterprise security businesses of leading vendors such as Symantec, McAfee and Trend Micro, which market integrated packages of applications to companies wishing to solve long lists of problems. However, for firms that are focused on only one of those problem areas, analysts said, Vista and the other Microsoft security products could pose a significant threat." ( Read More... 57 of 69 comments it.slashdot.org ) Games : The Oblivion of Western RPGs 30 of 36 comments Science : Device Developed To Help Socially Challenged Posted by Zonk on Thursday March 30, @01:55PM from the insert-your-own-self-referential-joke-here dept. An anonymous reader writes "A device from MIT Media Labs that can pick up on people's emotions is being developed to help people with autism relate to those around them. 
It will alert its autistic user if the person they are talking to starts showing signs of getting bored or annoyed." From the article:"The 'emotional social intelligence prosthetic' device, which El Kaliouby is constructing along with MIT colleagues Rosalind Picard and Alea Teeters, consists of a camera small enough to be pinned to the side of a pair of glasses, connected to a hand-held computer running image recognition software plus software that can read the emotions these images show. If the wearer seems to be failing to engage his or her listener, the software makes the hand-held computer vibrate." ( Read More... 109 of 143 comments science.slashdot.org ) Theaters Unhappy About Faster DVD Releases Posted by Zonk on Thursday March 30, @01:25PM from the just-get-to-the-downloading-already dept. dolphinlover writes "As movie studios such as Walt Disney Co. have pushed for more rapid DVD releases of movies to combat piracy on the Internet, executives of movie theater chains such as Regal Entertainment Group and National Amusements Inc. have countered, saying that seeing a movie in the theater is a 'fuller, more entertaining experience' and that the time window between movie and DVD releases should even be extended. Their views run counter to Disney's Chief Executive Rober Iger view that DVDs ought to come out simultaneously with the theater releases of movies. Both sides say their plans would benefit consumers. Is either correct, or are both approaching the situation from the wrong angle?" ( Read More... 289 of 352 comments ) Games : Grand Theft Auto Civil Case Moves Forward 53 of 65 comments IT : Why Phishing Works Posted by Zonk on Thursday March 30, @12:47PM from the lower-your-expectations dept. h0neyp0t writes "Harvard and Berkeley have released a study that shows why phishing attacks work (pdf). When asked if a phishing site was legit or a spoof, 23% of users use only the content of the website to make the decision! 
The majority of users ignore the address and SSL indicators in the browser. Some users think that favicons and lock icons in HTML are more important indicators. The paper hints that the proposed IE7 security indicators and multi-colored address bar will also suffer a similar fate. This study is brought to you by the people who developed the security skins Firefox extension ." ( Read More... 149 of 183 comments it.slashdot.org ) Games : More Xbox Titles Added to 360 List 40 of 48 comments Science : NASA Priorities Out of Whack? Posted by Zonk on Thursday March 30, @12:11PM from the still-no-word-on-search-for-spaghetti-monster dept. amerinese writes "Just last week, we saw a story on NASA reconsidering the fate of the DAWN mission , another reminder of the space agency's budget woes. Gregg Easterbrook over at Slate.com argues not only is the budget a little short, but NASA's priorities are all wrong . From the article: 'For at least a decade, it's been clear that the space shuttle program is a clunker. Nonetheless, NASA's funding remains heavy on the shuttle and the space station, while usually slighting science. This year's proposed budget for fiscal 2007 takes the cosmic cake.' Is NASA just not thinking creatively enough?" ( Read More... 136 of 174 comments science.slashdot.org ) Ask.Com's New Look Competes Well With Google Posted by Zonk on Thursday March 30, @11:44AM from the head-to-head-face-to-face dept. Carl Bialik from WSJ writes "Ask Jeeves has been overhauled and renamed Ask.com. The Wall Street Journal's Walt Mossberg tested the new site against Google and found that Ask.com holds its own and even beats the search champ in some cases. 'It has some very nice features Google lacks, including previews of the sites it finds, an easy way to narrow or broaden your search results, and frequent top-of-the-screen answers that lead you directly to core information,' Mossberg writes." ( Read More... 
132 of 156 comments ) Games : UMD Format's Death Rattle Begins 86 of 103 comments Hardware : ILM's Datacenter Posted by CmdrTaco on Thursday March 30, @11:14AM from the something-to-read dept. kylegordon writes "CGW has inside scoop on Industrial Light and Magic's facilities after they moved from San Rafeal to San Franciscos Presidio. With 3000 disks, it can shift 170Tb to 5000 rendernodes over 10GbE and 1GbE network links. It's an impressive system, for impressive films." ( Read More... 87 of 114 comments hardware.slashdot.org ) Apple : Will Apple Disappoint on 30th Anniversary? Posted by Zonk on Thursday March 30, @10:59AM from the omg-so-old-so-old dept. An anonymous reader writes "We've seen the media get over-excited about an Apple launch before, but one CNET columnist is 'threatening suicide' if Apple don't announce something for their 30th Anniversary this Saturday. CNET is concerned at the lack of any news from Apple: 'You'd guess that Steve Jobs will at least have to walk out onto the lawn in Cupertino, light a few fireworks and make some whooping noises. It's that or risk an international incident.' Is Apple going to keep a low profile for their 30th?" ( Read More... 201 of 249 comments apple.slashdot.org ) IT : Lenovo Under U.S. Probe for Spying Posted by Zonk on Thursday March 30, @10:21AM from the seekrit-agent-man dept. BigControversy writes "The DailyTech has a report indicating that Lenovo, the giant Chinese PC manufacturer, is under a probe by the U.S.-China Economic Security Review Commission (USCC) for possible bugging. Apparently, the government has ordered 16,000 PCs from Lenovo but is now requesting that Lenovo be investigated by intelligence agencies. The fear is of foreign intelligence applying pressure to Lenovo to equip its PCs so that the U.S. can be spied on." From the article:"Despite the probe, Lenovo says that its international business, especially those that deal with the US, follow strictly laid out government regulations and rules. 
Lenovo also claims that even after purchasing IBM's PC division, its international business has not been affected negatively. Interestingly, in an interview with the BBC, Lenovo mentioned that an open investigation or probe may negatively affect the way that the company deals with future government contracts or bids." There just has to be better uses of our intelligence community's time. ( Read More... 205 of 250 comments it.slashdot.org ) Science : Google Accused of Bio-piracy Posted by Zonk on Thursday March 30, @09:56AM from the ahoy-maties-turn-over-those-ribonucleic-acids-if-you-please dept. Simon Phillips writes "ZDNet is reporting that Google has been accused of being the 'biggest threat to genetic privacy' this year for its plan to create a searchable database of genetic information. From the article: 'Google was presented with an award as part of the Captain Hook Awards for Biopiracy in Curitiba, Brazil, this week. The organisers allege that Google's collaboration with genomic research institute J. Craig Venter to create a searchable online database of all the genes on the planet is a clear example of biopiracy.'" ( Read More... 159 of 206 comments science.slashdot.org ) Games : Japan's Gaming History Now Safe Posted by Zonk on Thursday March 30, @09:10AM from the can't-keep-an-old-NES-down dept. An anonymous reader writes "The Guardian today has covered the final part of the ongoing saga regarding the Electrical Appliance and Material Safety Law in Japan. Thankfully, the law has been almost reversed allowing the continued sale of second hand electrical goods (including games consoles)." From the article:"The Japanese secondhand electrical goods market was officially estimated last year to be worth around £500m ... The government probably hoped the law would go largely unnoticed and bring a variety of benefits. 
By taking the money out of the secondhand market and injecting it into the market for new goods, regulation (of old products) and revivalisation (of the economy) would be achieved in one fell swoop. On paper, anyway. In practice it was rather different." ( Read More... 70 of 85 comments games.slashdot.org ) IT : Hotmail On Your Desktop Posted by samzenpus on Thursday March 30, @08:05AM from the mail-everywhere dept. thomas2you writes "Microsoft has just began its beta testing on a new program, made to have Microsoft's hotmail on your own desktop according to an article on CNET. it's going to be free software and your going to be able to manage multiple accounts and they are attempting to include the ability to also just control all pop3 and smtp accounts you have, including Google's gmail as well as Windows Live Mail, the successor to Hotmail. From the article, "The move is a shift for the Hotmail business, which in the past, has charged users who wanted to read their mail using desktop software, rather than a Web browser. Microsoft charged $20 and up for its paid service."" ( Read More... 160 of 203 comments it.slashdot.org ) Games : Revolution Horsepower Revealed Posted by samzenpus on Wednesday March 29, @11:15PM from the look-inside dept. Revo writes "IGN.com unveiled leaked specs for Nintendo's upcoming Revolution console today. The system really is about twice as powerful as a GameCube and a far cry from the Xbox 360 and PS3. Of course, the focus is on the innovative controller and the affordable price." ( Read More... 531 of 668 comments games.slashdot.org ) IT : Australian Rules to Crackdown on Spam Posted by samzenpus on Wednesday March 29, @10:33PM from the no-pills-down-under dept. siffty writes "Internet service providers could face huge fines if they do not provide spam filtering or impose email sending limits under new rules set down by a communications watchdog. 
The Australian Communications and Media Authority ( ACMA Media Release ) today registered the world's first legislative code of practice for internet and email service providers. Dealing with unsolicited email or spam costs business and home internet users millions of dollars each year in wasted time and upgrading security systems. But under the new code, ISPs will have to offer spam filtering options to subscribers and provide a system of handling complaints. They will also have to impose reasonable limits on the rate at which subscribers can send email." ( Read More... 81 of 107 comments it.slashdot.org ) Hardware : Unmanned Aerial Drones Coming Soon Above U.S. Posted by samzenpus on Wednesday March 29, @09:15PM from the the-eye-in-the-sky dept. cnet-declan writes "Unmanned aerial vehicles (UAVs) have been flying over Iraq and Afghanistan, but now the Bush administration wants to use them for domestic surveillance . A top Homeland Security official told Congress today, according to this CNET News.com article, that: "We need additional technology to supplement manned aircraft surveillance and current ground assets to ensure more effective monitoring of United States territory." One county in North Carolina is already using UAVs to monitor public gatherings . But what happens when lots of relatively dumb drones have to share airspace with aircraft carrying passengers? A pilot's association is worried ." ( Read More... 547 of 709 comments hardware.slashdot.org ) IT : Quasars Used for Encryption 40 of 47 comments < Yesterday's News > Words can never express what words can never express. All trademarks and copyrights on this page are owned by their respective owners. Comments are owned by the Poster. The Rest © 1997-2006 OSTG . home awards contribute story older articles OSTG advertise about terms of service privacy faq rss |
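The first conversion in the table depends on a live fetch of example.com, so its output changes whenever the page does. As a minimal sketch, the same three regular-expression steps can be tried offline on a literal string. The helper name `html_to_text` and the constant `SAMPLE_HTML` are hypothetical and introduced only for illustration; they do not appear in the book's code.

```ruby
require 'cgi'

# Hypothetical helper (not from the book): applies the same steps as the
# one-liner in the table to a string instead of a fetched page.
def html_to_text(html)
  # Replace the <body>...</body> element with just its contents.
  text = html.sub(%r{<body.*?>(.*?)</body>}mi, '\1')
  # Replace every remaining tag with a single space.
  text = text.gsub(/<.*?>/m, ' ')
  # Collapse runs of blank lines into one blank line.
  text = text.gsub(/(\n\s*){2}/, "\n\n")
  # Turn entities such as &amp; back into plain characters.
  CGI.unescapeHTML(text)
end

# Assumed sample input, chosen to exercise tags, blank lines and an entity.
SAMPLE_HTML = "<html><body><h1>Example</h1>\n\n\n<p>Fish &amp; chips</p></body></html>"

puts html_to_text(SAMPLE_HTML)
```

Because each tag becomes a space, the result keeps stray leading and trailing whitespace; calling `strip` on the returned string is one way to tidy it if that matters for the caller.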