Wanna see TL;DRzr in action in a larger app? Check out my other hack, XTractor, which extracts page content, cleans it, and summarizes it using the TL;DRzr code.

Try it out!

1. Summarize a Feed (RSS/Atom) or a URL

Defaults to the TechCrunch FeedBurner RSS URL. Hit the Summarize Feed button and you're good to go. This will fetch, parse, and summarize the text, so be a bit patient. :)

[NEW] You can also put in a web URL and it will fetch the page and try to summarize the page content.

2. Summarize Text

If cutting/pasting, try to use the "paste as plain text" option.

What's New?

  • 2013-04-11 - TL;DRzr is now open source. (Hacker News announcement). (tldrzr@github)
  • 2013-04-11 - The OpenNLP-based tokenizer is now online. Earlier, due to a bug in the code, it was always falling back to the regular-expression-based tokenizer. This improves sentence quality.
  • 2013-04-10 - Thanks to BoilerPipe, the URL passed to TL;DRzr no longer has to be a feed URL. It can now extract and parse general web page content.
  • 2013-04-09 - Summarized URLs can now be saved as links of the form /tldr/?feed_url=url_goes_here.
  • 2013-04-09 - There's a POST-based API to summarize text (up to 4 MB) via an HTTP POST to /tldr/api/summarize. The parameters are input_text (mandatory) and sentence_count (optional; defaults to 5). This is running on a single Heroku dyno, so feel free to use this API, but please be gentle. :)
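
For anyone who wants to call the API or build a saved link from code, here is a minimal Java sketch (Java 11+). The endpoint path, the feed_url query parameter, and the input_text / sentence_count parameter names come from the notes above. Everything else is an assumption: the base URL (tldrzr.example.com is a placeholder), the form-encoded request body, the URL-encoding of the feed_url value, and the class and method names, which are made up for illustration. The response format isn't documented here, so the sketch just returns the raw body.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of the TL;DRzr codebase.
class TldrzrClient {

    // Placeholder host (assumption): replace with wherever TL;DRzr is running.
    static final String BASE_URL = "https://tldrzr.example.com";

    // Builds a form-encoded body. Form encoding is an assumption about
    // how the API expects its parameters to be sent.
    static String formBody(String inputText, int sentenceCount) {
        return "input_text=" + URLEncoder.encode(inputText, StandardCharsets.UTF_8)
                + "&sentence_count=" + sentenceCount;
    }

    // Builds a saveable link of the form /tldr/?feed_url=...
    // Percent-encoding the feed URL is an assumption.
    static String savedLink(String baseUrl, String feedUrl) {
        return baseUrl + "/tldr/?feed_url="
                + URLEncoder.encode(feedUrl, StandardCharsets.UTF_8);
    }

    // Prepares the POST request without sending it.
    static HttpRequest summarizeRequest(String baseUrl, String inputText, int sentenceCount) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tldr/api/summarize"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(inputText, sentenceCount)))
                .build();
    }

    // Sends the request and returns the raw response body as a string.
    static String summarize(String baseUrl, String inputText, int sentenceCount)
            throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(summarizeRequest(baseUrl, inputText, sentenceCount),
                        HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling TldrzrClient.summarize(TldrzrClient.BASE_URL, someText, 3) would ask for a three-sentence summary, provided the placeholder base URL is swapped for a real one.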

How does it work?

TL;DRzr uses an algorithm derived from Classifier4J. I took the basic algorithm from Classifier4J, optimized it, and added some refinements.

The basic algorithm for summarization works like this. It first tokenizes the text into words and then calculates the top N most frequent words (discarding stop words and single-occurrence words). It then scans the sentences and gets the first N sentences which feature any or all of the most frequent words. The sentences are sorted based on first occurrence in the original text and concatenated to create the summary. The user has control over how long the generated summary should be in terms of sentence count.
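
To make those steps concrete, here is a rough Java sketch of the keyword-frequency idea. It is one reading of the description above, not the actual TL;DRzr code: it uses naive regex splitting in place of OpenNLP, skips stemming, uses a tiny made-up stop-word list, and picks, for each top word, the first sentence containing it that hasn't already been chosen. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; not the TL;DRzr implementation.
class KeywordSummarizer {

    // Tiny sample stop-word list (assumption); a real one would be far longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "and", "or", "of", "to", "in", "is", "it",
            "on", "for", "with", "that", "this"));

    // Naive sentence splitter: breaks after '.', '!' or '?' followed by whitespace.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {
            if (!s.trim().isEmpty()) sentences.add(s.trim());
        }
        return sentences;
    }

    // Lower-cases the text and splits it on anything that isn't a letter or digit.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }

    // Top n words by frequency, ignoring stop words and words seen only once.
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> topWords(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words(text)) {
            if (!STOP_WORDS.contains(w)) counts.merge(w, 1, Integer::sum);
        }
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) candidates.add(e.getKey());
        }
        candidates.sort((a, b) -> {
            int byCount = counts.get(b).compareTo(counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }

    // Picks up to sentenceCount sentences, then joins them in original order.
    static String summarize(String text, int sentenceCount) {
        List<String> sentences = splitSentences(text);
        List<Set<String>> sentenceWords = new ArrayList<>();
        for (String s : sentences) sentenceWords.add(new HashSet<>(words(s)));

        // TreeSet keeps the chosen sentence indexes sorted by position in the text.
        TreeSet<Integer> picked = new TreeSet<>();
        for (String word : topWords(text, sentenceCount)) {
            if (picked.size() >= sentenceCount) break;
            for (int i = 0; i < sentences.size(); i++) {
                if (!picked.contains(i) && sentenceWords.get(i).contains(word)) {
                    picked.add(i);
                    break;
                }
            }
        }

        StringBuilder summary = new StringBuilder();
        for (int i : picked) {
            if (summary.length() > 0) summary.append(' ');
            summary.append(sentences.get(i));
        }
        return summary.toString();
    }
}
```

For example, with the input "Cats chase mice. Dogs chase cats. Birds sing loudly. Mice fear cats." and a sentence count of 2, the top words are "cats" and "chase", and the sketch returns "Cats chase mice. Dogs chase cats."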

TL;DRzr is written in Java and uses Jsoup for HTML text scraping and ROME for RSS feed parsing (which depends on JDOM). Sentence parsing and word tokenization use OpenNLP. A Porter2 stemmer implementation processes the tokens emitted by the tokenizer. The new summarize-any-URL feature uses BoilerPipe.

Credits

TL;DRzr is a weekend project/quick hack demo created by Saurav Mohapatra. I wrote this as a fun weekend hack after reading about the Summly acquisition by Yahoo!. I had drunk too many Red Bulls and sleep was not too forthcoming. :) I always wished to try out Heroku and after a couple of hours of googling + coding, I put this together.

The algorithm is a keyword-density-based one. As this is my current hobby project, I shall work on improving the algorithm. I plan on open-sourcing this codebase on GitHub.