class ClassifierReborn::Bayes
Constants
- CategoryNotFoundError
Public Class Methods
The class can be created with one or more categories, each of which will be initialized and given a training method. E.g.,
b = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting', 'Spam'
Options available are:
language: 'en'            Used to select language-specific stop words
auto_categorize: false    When true, enables ability to dynamically declare a category
enable_threshold: false   When true, enables a threshold requirement for classification
threshold: 0.0            Default threshold, only used when enabled
# File lib/classifier-reborn/bayes.rb, line 20
def initialize(*args)
  @categories = Hash.new
  options = { language: 'en',
              auto_categorize: false,
              enable_threshold: false,
              threshold: 0.0
            }
  args.flatten.each { |arg|
    if arg.kind_of?(Hash)
      options.merge!(arg)
    else
      add_category(arg)
    end
  }

  @total_words = 0
  @category_counts = Hash.new(0)
  @category_word_count = Hash.new(0)

  @language = options[:language]
  @auto_categorize = options[:auto_categorize]
  @enable_threshold = options[:enable_threshold]
  @threshold = options[:threshold]
end
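The constructor accepts category names and option hashes in any order, because every Hash argument is merged into the defaults and everything else becomes a category. The following is a minimal standalone sketch of that separation; `split_args` is a hypothetical helper written for illustration and is not part of the library.

```ruby
# Hypothetical helper (not part of ClassifierReborn): mirrors how the
# constructor shown above separates category names from option hashes.
def split_args(*args)
  options = { language: 'en', auto_categorize: false,
              enable_threshold: false, threshold: 0.0 }
  categories = []
  args.flatten.each do |arg|
    if arg.is_a?(Hash)
      options.merge!(arg)   # values in a passed hash override the defaults
    else
      categories << arg     # anything else is treated as a category name
    end
  end
  [categories, options]
end

categories, options = split_args('Interesting', 'Uninteresting',
                                 { enable_threshold: true, threshold: -10.0 })
```

Under this reading of the constructor, a call such as `ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting', enable_threshold: true, threshold: -10.0` should create two categories and turn the threshold on.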
Public Instance Methods
Allows you to add categories to the classifier. For example:
b.add_category "Not spam"
WARNING: Adding categories to a trained classifier will result in an undertrained category that will tend to match more criteria than the trained, more selective categories. In short, try to declare all of your categories when you create the classifier.
# File lib/classifier-reborn/bayes.rb, line 202
def add_category(category)
  @categories[CategoryNamer.prepare_name(category)] ||= Hash.new(0)
end
Returns the scores in each category for the provided text. E.g.,
b.classifications "I hate bad words and you"
=> {"Uninteresting"=>-12.6997928013932, "Interesting"=>-18.4206807439524}
The largest of these scores (the one closest to 0) is the one picked out by classify.
# File lib/classifier-reborn/bayes.rb, line 102
def classifications(text)
  score = Hash.new
  word_hash = Hasher.word_hash(text, @language)
  training_count = @category_counts.values.reduce(:+).to_f
  @categories.each do |category, category_words|
    score[category.to_s] = 0
    total = (@category_word_count[category] || 1).to_f
    word_hash.each do |word, count|
      s = category_words.has_key?(word) ? category_words[word] : 0.1
      score[category.to_s] += Math.log(s/total)
    end
    # now add prior probability for the category
    s = @category_counts.has_key?(category) ? @category_counts[category] : 0.1
    score[category.to_s] += Math.log(s / training_count)
  end
  return score
end
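To make the arithmetic concrete, here is a simplified standalone sketch of the same scoring idea: one log term per distinct word (with 0.1 standing in for unseen words) plus the log of the category prior. The method name `log_scores` and the sample data are made up for illustration, and the per-category total is computed by summing word counts rather than read from `@category_word_count`.

```ruby
# Simplified stand-in for the scoring loop above; names and data are illustrative.
def log_scores(word_hash, categories, category_counts)
  training_count = category_counts.values.sum.to_f
  categories.each_with_object({}) do |(category, words), scores|
    total = words.values.sum.to_f
    score = 0.0
    word_hash.each_key do |word|
      seen = words.key?(word) ? words[word] : 0.1  # unseen words get a small pseudo-count
      score += Math.log(seen / total)
    end
    prior = category_counts.fetch(category, 0.1)   # how many documents trained this category
    scores[category] = score + Math.log(prior / training_count)
  end
end

categories = {
  'Spam' => { 'buy' => 3, 'now' => 1 },
  'Ham'  => { 'hello' => 2, 'now' => 2 }
}
category_counts = { 'Spam' => 2, 'Ham' => 2 }
scores = log_scores({ 'buy' => 1, 'now' => 1 }, categories, category_counts)
```

With this data the 'Spam' score is log(3/4) + log(1/4) + log(2/4), and 'Ham' scores lower because 'buy' never appeared in its training text. All scores are negative, so the winner is the one closest to 0.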
Return the classification without the score
# File lib/classifier-reborn/bayes.rb, line 129
def classify(text)
  result, score = classify_with_score(text)
  if threshold_enabled?
    result = nil if score < @threshold || score == Float::INFINITY
  end
  return result
end
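The effect of the threshold can be sketched without the library: take the highest-scoring category and discard it when thresholding is enabled and the score falls below the cutoff. `pick_category` below is a hypothetical helper, and it omits the `Float::INFINITY` check that the real method performs.

```ruby
# Hypothetical helper (not part of ClassifierReborn): returns the best
# category, or nil when the threshold is enabled and the best score is below it.
def pick_category(scores, threshold:, enabled:)
  category, score = scores.max_by { |_name, value| value }
  return nil if enabled && score < threshold
  category
end

scores = { 'Uninteresting' => -12.7, 'Interesting' => -18.4 }
loose  = pick_category(scores, threshold: -15.0, enabled: true)   # best score clears the cutoff
strict = pick_category(scores, threshold: -10.0, enabled: true)   # best score is below the cutoff
off    = pick_category(scores, threshold: -10.0, enabled: false)  # threshold ignored
```

Because the scores are log values, useful thresholds are negative numbers; the default of 0.0 would reject scores like those in this example once the threshold is enabled.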
Returns the classification of the provided text, which is one of the categories given in the initializer, along with the score. E.g.,
b.classify_with_score "I hate bad words and you"
=> ['Uninteresting', -4.852030263919617]
# File lib/classifier-reborn/bayes.rb, line 124
def classify_with_score(text)
  (classifications(text).sort_by { |a| -a[1] })[0]
end
Dynamically disable threshold for classify results
# File lib/classifier-reborn/bayes.rb, line 153
def disable_threshold
  @enable_threshold = false
end
Dynamically enable threshold for classify results
# File lib/classifier-reborn/bayes.rb, line 148
def enable_threshold
  @enable_threshold = true
end
Provides training and untraining methods for the categories specified in Bayes#new. For example:
b = ClassifierReborn::Bayes.new 'This', 'That', 'the_other'
b.train_this "This text"
b.train_that "That text"
b.untrain_that "That text"
b.train_the_other "The other text"
# File lib/classifier-reborn/bayes.rb, line 174
def method_missing(name, *args)
  cleaned_name = name.to_s.gsub(/(un)?train_([\w]+)/, '\2')
  category = CategoryNamer.prepare_name(cleaned_name)
  if @categories.has_key? category
    args.each { |text| eval("#{$1}train(category, text)") }
  elsif name.to_s =~ /(un)?train_([\w]+)/
    raise StandardError, "No such category: #{category}"
  else
    super  #raise StandardError, "No such method: #{name}"
  end
end
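The dispatch pattern can be illustrated with a small standalone class. This sketch is not the library's implementation: `TinyTrainer` is a made-up class, it routes calls with `public_send` instead of `eval`, it raises `ArgumentError` rather than `StandardError`, and it normalizes names by lowercasing them rather than through `CategoryNamer`.

```ruby
# Illustrative class (not part of ClassifierReborn) showing how
# train_<category> / untrain_<category> calls can be routed dynamically.
class TinyTrainer
  attr_reader :log

  def initialize(*categories)
    @categories = categories.map { |c| c.to_s.downcase }
    @log = []
  end

  def train(category, text)
    @log << [:train, category, text]
  end

  def untrain(category, text)
    @log << [:untrain, category, text]
  end

  def method_missing(name, *args)
    match = name.to_s.match(/\A(un)?train_(\w+)\z/)
    return super unless match            # unrelated methods fall through to NoMethodError

    category = match[2].downcase
    raise ArgumentError, "No such category: #{category}" unless @categories.include?(category)

    # match[1] is "un" or nil, so this calls either untrain or train
    args.each { |text| public_send("#{match[1]}train", category, text) }
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.match?(/\A(un)?train_\w+\z/) || super
  end
end

t = TinyTrainer.new('This', 'That')
t.train_this 'This text'
t.untrain_that 'That text'
```

Defining `respond_to_missing?` alongside `method_missing` keeps `respond_to?` consistent with the dynamic methods, which is a common Ruby convention rather than something the source above does.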
Retrieve the current threshold value
# File lib/classifier-reborn/bayes.rb, line 138
def threshold
  @threshold
end
Dynamically set the threshold value
# File lib/classifier-reborn/bayes.rb, line 143
def threshold=(a_float)
  @threshold = a_float
end
Is threshold processing disabled?
# File lib/classifier-reborn/bayes.rb, line 163
def threshold_disabled?
  !@enable_threshold
end
Is threshold processing enabled?
# File lib/classifier-reborn/bayes.rb, line 158
def threshold_enabled?
  @enable_threshold
end
Provides a general training method for all categories specified in Bayes#new. For example:
b = ClassifierReborn::Bayes.new 'This', 'That', 'the_other'
b.train :this, "This text"
b.train "that", "That text"
b.train "The other", "The other text"
# File lib/classifier-reborn/bayes.rb, line 51
def train(category, text)
  category = CategoryNamer.prepare_name(category)

  # Add the category dynamically or raise an error
  if !@categories.has_key?(category)
    if @auto_categorize
      add_category(category)
    else
      raise CategoryNotFoundError.new("Cannot train; category #{category} does not exist")
    end
  end

  @category_counts[category] += 1
  Hasher.word_hash(text, @language).each do |word, count|
    @categories[category][word] += count
    @category_word_count[category] += count
    @total_words += count
  end
end
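The difference that auto_categorize makes can be shown with a small standalone model of the bookkeeping. `TinyCounts` is a made-up class for illustration: it splits text on whitespace instead of using `Hasher.word_hash`, skips stop-word handling and name normalization, and raises `KeyError` in place of `CategoryNotFoundError`.

```ruby
# Illustrative model (not part of ClassifierReborn) of how training either
# rejects or dynamically creates an unknown category.
class TinyCounts
  attr_reader :categories, :category_counts

  def initialize(*names, auto_categorize: false)
    @categories = {}
    @category_counts = Hash.new(0)
    @auto_categorize = auto_categorize
    names.each { |name| @categories[name] = Hash.new(0) }
  end

  def train(category, text)
    unless @categories.key?(category)
      raise KeyError, "Cannot train; category #{category} does not exist" unless @auto_categorize

      @categories[category] = Hash.new(0)  # declare the category on first use
    end
    @category_counts[category] += 1        # one more training document
    text.downcase.split.each { |word| @categories[category][word] += 1 }
  end
end

strict = TinyCounts.new('Spam')
auto = TinyCounts.new('Spam', auto_categorize: true)
auto.train('Ham', 'hello hello world')
```

Here `auto` ends up with a 'Ham' category holding the counts for "hello" and "world", while calling `strict.train('Ham', ...)` would raise because the category was never declared.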
Provides an untraining method for all categories specified in Bayes#new. Be very careful with this method.
For example:
b = ClassifierReborn::Bayes.new 'This', 'That', 'the_other'
b.train :this, "This text"
b.untrain :this, "This text"
# File lib/classifier-reborn/bayes.rb, line 78
def untrain(category, text)
  category = CategoryNamer.prepare_name(category)
  @category_counts[category] -= 1
  Hasher.word_hash(text, @language).each do |word, count|
    if @total_words >= 0
      orig = @categories[category][word] || 0
      @categories[category][word] -= count
      if @categories[category][word] <= 0
        @categories[category].delete(word)
        count = orig
      end
      if @category_word_count[category] >= count
        @category_word_count[category] -= count
      end
      @total_words -= count
    end
  end
end
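The decrement-and-delete step at the heart of untraining can be sketched on a plain hash. `remove_counts` is a hypothetical helper that covers only the per-word part; it does not model `@category_counts`, `@category_word_count`, or `@total_words`.

```ruby
# Hypothetical helper (not part of ClassifierReborn): subtracts word counts
# and drops any word whose count reaches zero or below.
def remove_counts(words, removed)
  removed.each do |word, count|
    words[word] -= count
    words.delete(word) if words[word] <= 0
  end
  words
end

words = Hash.new(0)
words['cheap'] = 2
words['pills'] = 1
remaining = remove_counts(words, { 'cheap' => 1, 'pills' => 1, 'never' => 1 })
```

Note that the source above decrements `@category_counts[category]` unconditionally, so untraining text that was never trained appears able to skew the category prior. That is one plausible reason for the "be very careful" warning.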