org.switchboard.util
Class WordUtil

java.lang.Object
  extended by org.switchboard.util.WordUtil

public final class WordUtil
extends Object

A collection of methods to manipulate and analyze English text.


Constructor Summary
WordUtil()
           
 
Method Summary
static float calculateEnglishness(String s)
          Calculates the percentage of words in the sentence that can be found in the dictionary.
static float calculateEnglishness(String s, Dictionary d)
          Calculates the percentage of words in the sentence that can be found in the dictionary.
static String capitalize(String word)
          Capitalizes the first letter of the string
static int commonStrings(String[] one, String[] two)
          Calculates the number of strings that two arrays share, ignoring case.
static String convertHTMLEntities(String input)
          Converts the HTML entities in a string to the ASCII characters.
static String[] getAdjectives(String in)
          Gets all of the adjectives in a given sentence.
static String[] getAdjectives(String in, Dictionary dict)
          Gets all of the adjectives in a given sentence.
static String[] getAdverbs(String in)
          Gets all of the adverbs in a given sentence.
static String[] getAdverbs(String in, Dictionary dict)
          Gets all of the adverbs in a given sentence.
static String[] getNouns(String in)
          Gets all of the nouns out of the specified string.
static String[] getNouns(String in, Dictionary dict)
          Gets all of the nouns out of the specified string.
static String[] getPronouns(String in)
          Gets all of the pronouns in a given sentence.
static String[] getPronouns(String in, Dictionary dict)
          Gets all of the pronouns in a given sentence.
static String[] getSynonyms(String word, int maxNum)
          Gets the synonyms for the provided word.
static String[] getSynonyms(String word, int maxNum, Thesaurus thes)
          Gets the synonyms for the provided word.
static String[] getTheseWords(String in, String pos, Dictionary dict)
          Gets all words of a particular part of speech from a sentence.
static String[] getVerbs(String in)
          Gets all of the verbs out of the specified string.
static String[] getVerbs(String in, Dictionary dict)
          Gets all of the verbs out of the specified string.
static boolean isCapitalized(String s)
          Returns true if the first letter of the string is a capital letter.
static boolean isEnglishWord(String s, Dictionary d)
          Returns true if the word can be found in the dictionary.
static boolean isFloat(String s)
          Determines whether the characters in a string form a valid floating point number.
static boolean isInteger(String s)
          Tells you if a string contains a number (floats allowed).
static String lastFewWords(String sentence, int num)
          Returns the last few words of a sentence.
static String literal(String s)
          Puts quotes around the string.
static float match(String sentence, String query)
          Uses the Lucene text search to match a Lucene query to a sentence.
static String sentenceMakePretty(String sentence)
          Capitalizes the first word in the string, uncapitalizes the rest of the words, strips all tabs and newlines and superfluous spaces, and adds a period at the end.
static int similarity(String word1, String word2)
          Counts the number of synonyms that the words share
static int similarity(String word1, String word2, Thesaurus thes)
          Counts the number of synonyms that the words share
static String stem(String in)
          Returns the stem of the provided word using the PorterStemmer
static String stripHtml(String s)
          Strips the HTML out of a string
static String stripNonWords(String in)
          Strips all words that don't match [A-Za-z0-9,\\.'\"’\\-]+
static String stripStopwords(String in)
          Strips stopwords from the provided sentence.
static String wordWrap(String str, int n)
          Inserts newlines after n characters, or at the last word before that.
static String[] wordWrap(String str, int n, String[] lines)
          Breaks up the string into lines n characters long
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WordUtil

public WordUtil()
Method Detail

lastFewWords

public static String lastFewWords(String sentence,
                                  int num)
Returns the last few words of a sentence.

Parameters:
sentence - The sentence to grab from.
num - The number of words to get from the end of the sentence.
Returns:
A string containing the last few words of the sentence.
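
The behaviour described above can be illustrated with a minimal sketch. It is not the WordUtil source: the class name LastFewWordsSketch is hypothetical, and it assumes that words are separated by whitespace and that asking for more words than exist returns the whole sentence.

```java
import java.util.Arrays;

final class LastFewWordsSketch {
    // Hypothetical stand-in for WordUtil.lastFewWords; the real implementation may differ.
    static String lastFewWords(String sentence, int num) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty() || num <= 0) {
            return "";
        }
        // Assumption: words are delimited by runs of whitespace.
        String[] words = trimmed.split("\\s+");
        int start = Math.max(0, words.length - num);
        return String.join(" ", Arrays.copyOfRange(words, start, words.length));
    }
}
```

With this sketch, lastFewWords("the quick brown fox", 2) yields "brown fox".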

similarity

public static int similarity(String word1,
                             String word2)
Counts the number of synonyms that the words share

Parameters:
word1 - The first word to look up and compare
word2 - The second word to look up and compare
Returns:
The number of synonyms that the words share

similarity

public static int similarity(String word1,
                             String word2,
                             Thesaurus thes)
Counts the number of synonyms that the words share

Parameters:
word1 - The first word to look up and compare
word2 - The second word to look up and compare
thes - The thesaurus to use to do the comparison
Returns:
The number of synonyms that the words share
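
One way to picture the count is as the size of the intersection of two synonym sets. The sketch below is an assumption, not the library's code: the Thesaurus type is not documented on this page, so a plain Map from a word to its set of synonyms stands in for it, and SimilaritySketch is a hypothetical name.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class SimilaritySketch {
    // The Map parameter is a stand-in for the undocumented Thesaurus type.
    static int similarity(String word1, String word2, Map<String, Set<String>> thes) {
        // Copy the first synonym set so the caller's map is not modified.
        Set<String> shared =
                new HashSet<>(thes.getOrDefault(word1, Collections.<String>emptySet()));
        // Keep only the synonyms that the second word also has.
        shared.retainAll(thes.getOrDefault(word2, Collections.<String>emptySet()));
        return shared.size();
    }
}
```

A word that is missing from the map is treated as having no synonyms, so the result is 0 in that case.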

capitalize

public static String capitalize(String word)
Capitalizes the first letter of the string

Parameters:
word - The string whose first letter is to be capitalized.
Returns:
The capitalized string

isCapitalized

public static boolean isCapitalized(String s)
Returns true if the first letter of the string is a capital letter.

Parameters:
s - The string to check
Returns:
True if the first letter in the string is a capital, false otherwise.
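
The two capitalization helpers (capitalize and isCapitalized) can be sketched together. This is an illustration under assumptions, not the WordUtil source: CapitalizeSketch is a hypothetical class, and the handling of null and empty strings is a guess, since the page does not specify it.

```java
final class CapitalizeSketch {
    // Upper-cases the first character and leaves the rest untouched.
    static String capitalize(String word) {
        if (word == null || word.isEmpty()) {
            return word; // Assumption: nothing to capitalize, return as-is.
        }
        return Character.toUpperCase(word.charAt(0)) + word.substring(1);
    }

    // True only when a first character exists and is an upper-case letter.
    static boolean isCapitalized(String s) {
        return s != null && !s.isEmpty() && Character.isUpperCase(s.charAt(0));
    }
}
```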

commonStrings

public static int commonStrings(String[] one,
                                String[] two)
Calculates the number of strings that two arrays share, ignoring case.

Parameters:
one - The first array of strings
two - The second array of strings to check
Returns:
The number of strings common to both arrays
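
A case-insensitive overlap count could look like the sketch below. It is a hedged illustration, not the library's implementation: CommonStringsSketch is a hypothetical name, and it assumes that a string repeated within one array is counted only once.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class CommonStringsSketch {
    static int commonStrings(String[] one, String[] two) {
        // Lower-case both arrays so the comparison ignores case.
        Set<String> first = new HashSet<>();
        for (String s : one) {
            first.add(s.toLowerCase(Locale.ROOT));
        }
        Set<String> second = new HashSet<>();
        for (String s : two) {
            second.add(s.toLowerCase(Locale.ROOT));
        }
        // Assumption: duplicates collapse, so each shared string counts once.
        first.retainAll(second);
        return first.size();
    }
}
```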

stripHtml

public static String stripHtml(String s)
Strips the HTML out of a string

Parameters:
s - The HTML-filled string
Returns:
the HTML-free string
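
The page does not say how tags are detected. A minimal sketch, assuming that anything between a '<' and the next '>' is a tag, is shown below. StripHtmlSketch is a hypothetical class; a regular expression like this one does not handle comments, scripts, or a '>' inside an attribute value, so it is only suitable for simple markup.

```java
final class StripHtmlSketch {
    // Removes every "<...>" run; the text between tags is kept unchanged.
    static String stripHtml(String s) {
        return s.replaceAll("<[^>]*>", "");
    }
}
```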

getSynonyms

public static String[] getSynonyms(String word,
                                   int maxNum)
Gets the synonyms for the provided word.

Parameters:
word - The word to look up
maxNum - The maximum number of synonyms to get
Returns:
An array of Strings that are synonyms for the provided word

isInteger

public static boolean isInteger(String s)
Tells you if a string contains a number (floats allowed).

Parameters:
s - The String to test
Returns:
True if the string is a valid number, false otherwise.

isFloat

public static boolean isFloat(String s)
Determines whether the characters in a string form a valid floating point number.

Parameters:
s - The string to test
Returns:
True if the string represents a valid floating point number, false otherwise.
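
A common way to implement a check like isFloat is to attempt a parse and catch the failure. The sketch below takes that approach; it is an assumption about the technique, not the WordUtil source, and NumberCheckSketch is a hypothetical name.

```java
final class NumberCheckSketch {
    // Returns true when Float.parseFloat accepts the trimmed string.
    static boolean isFloat(String s) {
        if (s == null) {
            return false;
        }
        try {
            Float.parseFloat(s.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

Note that Float.parseFloat is fairly permissive: it also accepts forms such as "1e3" and "NaN", which a stricter validator might reject.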

convertHTMLEntities

public static String convertHTMLEntities(String input)
Converts the HTML entities in a string to the ASCII characters.

Parameters:
input - The HTML Entity-filled String
Returns:
The ASCII-filled String
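
The set of entities that the method recognizes is not listed on this page. The sketch below assumes a small, hypothetical table of common named entities and replaces each one in turn; HtmlEntitiesSketch and its table are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HtmlEntitiesSketch {
    // Hypothetical entity table; the real method may cover more entities.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        ENTITIES.put("&#39;", "'");
        ENTITIES.put("&nbsp;", " ");
        // "&amp;" goes last so that "&amp;lt;" decodes to "&lt;" and not to "<".
        ENTITIES.put("&amp;", "&");
    }

    static String convertHTMLEntities(String input) {
        String out = input;
        // LinkedHashMap iterates in insertion order, which preserves the ordering above.
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }
}
```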

getSynonyms

public static String[] getSynonyms(String word,
                                   int maxNum,
                                   Thesaurus thes)
Gets the synonyms for the provided word.

Parameters:
thes - The thesaurus to use to look up the word
word - The word to look up
maxNum - The maximum number of synonyms to get
Returns:
An array of Strings that are synonyms for the provided word

stripStopwords

public static String stripStopwords(String in)
Strips stopwords from the provided sentence. Stopwords are common words such as a, an, the.

Parameters:
in - The stopword-filled sentence.
Returns:
The stopword-free string
See Also:
Stopwords
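
Stopword removal can be sketched as a filter over whitespace-separated words. The example is hypothetical: it uses only the three stopwords named above (a, an, the), whereas the real list lives in the Stopwords class and is presumably longer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopwordsSketch {
    // Assumption: a tiny stand-in list; the Stopwords class defines the real one.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the"));

    static String stripStopwords(String in) {
        List<String> kept = new ArrayList<>();
        for (String word : in.trim().split("\\s+")) {
            // Compare in lower case so "The" is treated like "the".
            if (!word.isEmpty() && !STOPWORDS.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }
}
```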

literal

public static String literal(String s)
Puts quotes around the string.

Parameters:
s - The unquoted String
Returns:
The quoted String

stripNonWords

public static String stripNonWords(String in)
Strips all words that don't match [A-Za-z0-9,\\.'\"’\\-]+

Parameters:
in - The non-word-filled String
Returns:
the non-word free String
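
The pattern in the description is written as a Java string literal, so it can be compiled directly. The sketch below assumes that a "word" is a whitespace-delimited token and that a token is dropped unless the whole token matches; StripNonWordsSketch is a hypothetical name and the real method may tokenize differently.

```java
import java.util.regex.Pattern;

final class StripNonWordsSketch {
    // Same character class as in the description; \u2019 is the right single quote.
    private static final Pattern WORD =
            Pattern.compile("[A-Za-z0-9,\\.'\"\u2019\\-]+");

    static String stripNonWords(String in) {
        StringBuilder out = new StringBuilder();
        for (String token : in.trim().split("\\s+")) {
            // Keep the token only if every character is in the allowed class.
            if (WORD.matcher(token).matches()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(token);
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, "hello @#$ world, it's 42!" becomes "hello world, it's": the token "42!" is removed because '!' is not in the allowed class.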

stem

public static String stem(String in)
Returns the stem of the provided word using the PorterStemmer

Parameters:
in - An English word
Returns:
The stem of that word

sentenceMakePretty

public static String sentenceMakePretty(String sentence)
Capitalizes the first word in the string, uncapitalizes the rest of the words, strips all tabs and newlines and superfluous spaces, and adds a period at the end.

Parameters:
sentence - the un-pretty sentence.
Returns:
The pretty sentence.
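
The four steps in the description can be sketched in order. This is an illustration, not the library's code: PrettySentenceSketch is a hypothetical class, "capitalizes the first word" is read as upper-casing its first letter, and a sentence that already ends in a period is assumed not to receive a second one.

```java
import java.util.Locale;

final class PrettySentenceSketch {
    static String sentenceMakePretty(String sentence) {
        // Collapse tabs, newlines and repeated spaces, then lower-case everything.
        String s = sentence.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
        if (s.isEmpty()) {
            return s;
        }
        // Upper-case the first letter of the first word.
        s = Character.toUpperCase(s.charAt(0)) + s.substring(1);
        // Assumption: avoid doubling an existing final period.
        return s.endsWith(".") ? s : s + ".";
    }
}
```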

getAdverbs

public static String[] getAdverbs(String in)
Gets all of the adverbs in a given sentence.

Parameters:
in - The full sentence
Returns:
An array of adverbs found in the sentence.

getAdverbs

public static String[] getAdverbs(String in,
                                  Dictionary dict)
Gets all of the adverbs in a given sentence.

Parameters:
dict - The dictionary to use to determine the part of speech
in - The full sentence
Returns:
An array of adverbs found in the sentence.

getPronouns

public static String[] getPronouns(String in)
Gets all of the pronouns in a given sentence.

Parameters:
in - The full sentence
Returns:
An array of pronouns found in the sentence.

getPronouns

public static String[] getPronouns(String in,
                                   Dictionary dict)
Gets all of the pronouns in a given sentence.

Parameters:
dict - The dictionary to use to determine the part of speech
in - The full sentence
Returns:
An array of pronouns found in the sentence.

getAdjectives

public static String[] getAdjectives(String in)
Gets all of the adjectives in a given sentence.

Parameters:
in - The full sentence
Returns:
An array of adjectives found in the sentence.

getAdjectives

public static String[] getAdjectives(String in,
                                     Dictionary dict)
Gets all of the adjectives in a given sentence.

Parameters:
dict - The dictionary to use to determine the part of speech
in - The full sentence
Returns:
An array of adjectives found in the sentence.

getNouns

public static String[] getNouns(String in)
Gets all of the nouns out of the specified string. Does not get proper nouns.

Parameters:
in - The string to get the nouns from.
Returns:
An array of strings - each of which is a noun.

getNouns

public static String[] getNouns(String in,
                                Dictionary dict)
Gets all of the nouns out of the specified string. Does not get proper nouns.

Parameters:
in - The string from which to get the nouns.
dict - The dictionary to use to do the checking.
Returns:
An array of Strings that are nouns.

getVerbs

public static String[] getVerbs(String in)
Gets all of the verbs out of the specified string.

Parameters:
in - The string to get the verbs from.
Returns:
An array of strings - each of which is a verb.

getVerbs

public static String[] getVerbs(String in,
                                Dictionary dict)
Gets all of the verbs out of the specified string.

Parameters:
in - The string from which to get the verbs.
dict - The dictionary to use to do the checking.
Returns:
An array of Strings that are verbs.

getTheseWords

public static String[] getTheseWords(String in,
                                     String pos,
                                     Dictionary dict)
Gets all words of a particular part of speech from a sentence. (prep, adv, n, a, v, pron, conj)

Parameters:
in - The sentence in which to look for words of the given part of speech
pos - The part of speech to look for.
dict - The dictionary to use to find the part of speech.
Returns:
An array of words.
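
The part-of-speech lookup can be pictured as a filter over the words of the sentence. The Dictionary type is not documented on this page, so the sketch below substitutes a Map from a lower-case word to one of the tags listed above; that map, the punctuation stripping, and the name PartOfSpeechSketch are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class PartOfSpeechSketch {
    // The Map parameter is a stand-in for the undocumented Dictionary type.
    static String[] getTheseWords(String in, String pos, Map<String, String> dict) {
        List<String> found = new ArrayList<>();
        for (String token : in.trim().split("\\s+")) {
            // Assumption: drop punctuation and ignore case before the lookup.
            String word = token.replaceAll("[^A-Za-z']", "").toLowerCase(Locale.ROOT);
            if (!word.isEmpty() && pos.equals(dict.get(word))) {
                found.add(word);
            }
        }
        return found.toArray(new String[0]);
    }
}
```

The convenience methods such as getNouns and getVerbs could then be thin wrappers that pass "n" or "v" as the pos argument, although the page does not confirm that they are implemented this way.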

calculateEnglishness

public static float calculateEnglishness(String s)
Calculates the percentage of words in the sentence that can be found in the dictionary.

Parameters:
s - The sentence to analyze
Returns:
The proportion of words in the sentence that are English, between 0 and 1.

calculateEnglishness

public static float calculateEnglishness(String s,
                                         Dictionary d)
Calculates the percentage of words in the sentence that can be found in the dictionary.

Parameters:
d - The dictionary to use to look up the word
s - The sentence to analyze
Returns:
The proportion of words in the sentence that are English, between 0 and 1.
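
The score is the number of recognized words divided by the total number of words. The sketch below shows that arithmetic under assumptions: a Set of lower-case words stands in for the undocumented Dictionary type, an empty sentence is scored as 0, and EnglishnessSketch is a hypothetical name.

```java
import java.util.Locale;
import java.util.Set;

final class EnglishnessSketch {
    // The Set parameter is a stand-in for the undocumented Dictionary type.
    static float calculateEnglishness(String s, Set<String> dictionary) {
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return 0f; // Assumption: no words means a score of 0.
        }
        String[] words = trimmed.split("\\s+");
        int known = 0;
        for (String w : words) {
            // Assumption: strip punctuation and ignore case before the lookup.
            String normalized = w.replaceAll("[^A-Za-z']", "").toLowerCase(Locale.ROOT);
            if (dictionary.contains(normalized)) {
                known++;
            }
        }
        // Cast before dividing so the result is not truncated to an integer.
        return (float) known / words.length;
    }
}
```

For example, with a dictionary containing "the", "cat" and "sat", the sentence "The cat zzyx sat" scores 3 / 4 = 0.75.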

isEnglishWord

public static boolean isEnglishWord(String s,
                                    Dictionary d)
Returns true if the word can be found in the dictionary.

Parameters:
s - The word to test
d - The dictionary to use to look up the word
Returns:
True if the word can be found in the dictionary, false otherwise.

wordWrap

public static String wordWrap(String str,
                              int n)
Inserts newlines after n characters, or at the last word before that.

Parameters:
str - The string to wrap
n - The number of chars to wrap at.
Returns:
The wrapped string.
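
A greedy wrap is one plausible reading of this description: words are added to the current line until the next word would push it past n characters. The sketch below is an assumption rather than the WordUtil source; WordWrapSketch is a hypothetical class, and a single word longer than n is left unbroken on its own line.

```java
final class WordWrapSketch {
    static String wordWrap(String str, int n) {
        StringBuilder out = new StringBuilder();
        int lineLength = 0; // Characters already on the current line.
        for (String word : str.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (lineLength > 0 && lineLength + 1 + word.length() > n) {
                // The word does not fit: break before it.
                out.append('\n');
                lineLength = 0;
            } else if (lineLength > 0) {
                // The word fits: separate it from the previous word.
                out.append(' ');
                lineLength++;
            }
            out.append(word);
            lineLength += word.length();
        }
        return out.toString();
    }
}
```

With this sketch, wordWrap("the quick brown fox", 10) produces "the quick" and "brown fox" on two lines. Splitting the result on the newline character gives an array of lines, similar in shape to what the three-argument overload returns.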

wordWrap

public static String[] wordWrap(String str,
                                int n,
                                String[] lines)
Breaks up the string into lines n characters long

Parameters:
str - The string to wrap
n - The number of chars to wrap at.
lines - The String array in which to put the broken lines
Returns:
An array containing the wrapped lines.

match

public static float match(String sentence,
                          String query)
                   throws ParseException
Uses the Lucene text search to match a Lucene query to a sentence.

Parameters:
sentence - The sentence to search
query - the Lucene query
Returns:
The closeness of fit. Between 0 and 1.
Throws:
org.apache.lucene.queryParser.ParseException - If the Lucene query is not valid