Ferret::Analysis::MappingFilter

Summary

A MappingFilter maps strings in tokens. It is typically used to map UTF-8 characters to their ASCII equivalents, which makes searching easier and improves search recall. The mapping is compiled into a deterministic finite automaton (DFA), so matching is very fast and the filter can therefore be used when indexing very large datasets. Regular expressions are currently not supported. If you are really interested in that feature, please contact me at dbalmain@gmail.com.

Example

mapping = {
  ['à','á','â','ã','ä','å','ā','ă']                 => 'a',
  'æ'                                       => 'ae',
  ['ď','đ']                                   => 'd',
  ['ç','ć','č','ĉ','ċ']                          => 'c',
  ['è','é','ê','ë','ē','ę','ě','ĕ','ė']              => 'e',
  ['ƒ']                                      => 'f',
  ['ĝ','ğ','ġ','ģ']                             => 'g',
  ['ĥ','ħ']                                   => 'h',
  ['ì','í','î','ï','ī','ĩ','ĭ']                     => 'i',
  ['į','ı','ĳ','ĵ']                             => 'j',
  ['ķ','ĸ']                                   => 'k',
  ['ł','ľ','ĺ','ļ','ŀ']                          => 'l',
  ['ñ','ń','ň','ņ','ŉ','ŋ']                       => 'n',
  ['ò','ó','ô','õ','ö','ø','ō','ő','ŏ']               => 'o',
  ['œ']                                      => 'oe',
  ['ą']                                      => 'a',
  ['ŕ','ř','ŗ']                                => 'r',
  ['ś','š','ş','ŝ','ș']                          => 's',
  ['ť','ţ','ŧ','ț']                             => 't',
  ['ù','ú','û','ü','ū','ů','ű','ŭ','ũ','ų']           => 'u',
  ['ŵ']                                      => 'w',
  ['ý','ÿ','ŷ']                                => 'y',
  ['ž','ż','ź']                                => 'z'
}
filt = MappingFilter.new(token_stream, mapping)
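
To make the effect of such a mapping concrete, here is a minimal pure-Ruby sketch of the substitution it describes. It is not how Ferret implements the filter (Ferret compiles the mapping into a DFA in C); it uses a Regexp instead. The helper names flatten_mapping and apply_mapping are hypothetical, and preferring the longest key when several keys could match is an assumption, not something this page states.

```ruby
# Illustrative approximation only; Ferret's real filter is a compiled DFA in C.

# Expand Array keys so that every source string points at its replacement.
def flatten_mapping(mapping)
  mapping.each_with_object({}) do |(keys, replacement), table|
    Array(keys).each { |key| table[key] = replacement }
  end
end

# Replace every occurrence of a mapped string in +text+.
def apply_mapping(text, mapping)
  table = flatten_mapping(mapping)
  # Longer keys are listed first so they win over shorter prefixes
  # (assumed behaviour, see the note above).
  pattern = Regexp.union(table.keys.sort_by { |key| -key.length })
  text.gsub(pattern) { |match| table[match] }
end

sample_mapping = {
  ['à', 'á', 'â'] => 'a',
  'æ'             => 'ae',
  ['è', 'é', 'ê'] => 'e'
}

puts apply_mapping('déjà vu', sample_mapping)   # prints "deja vu"
```

In real use the substitution is applied to the text of each token produced by the wrapped TokenStream, not to a raw string.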

Public Class Methods

new(token_stream, mapping) → token_stream

Creates a MappingFilter which maps strings in tokens. It is typically used to map UTF-8 characters to their ASCII equivalents, which makes searching easier and improves search recall. The mapping is compiled into a deterministic finite automaton (DFA), so matching is very fast and the filter can therefore be used when indexing very large datasets. Regular expressions are currently not supported. If you are really interested in that feature, please contact me at dbalmain@gmail.com.

token_stream

TokenStream to be filtered

mapping

Hash of mappings to apply to tokens. Each key can be a String or an Array of Strings. Each value must be a String.
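
The shape described above can be checked before the Hash is handed to the filter. The following is a small sketch of such a check; check_mapping! is a hypothetical helper that is not part of Ferret, and the exceptions Ferret itself raises for a malformed mapping may differ.

```ruby
# Hypothetical helper, not part of Ferret: verifies that a mapping Hash has
# String (or Array-of-String) keys and String values.
def check_mapping!(mapping)
  raise ArgumentError, 'mapping must be a Hash' unless mapping.is_a?(Hash)

  mapping.each do |key, value|
    keys = key.is_a?(Array) ? key : [key]
    unless keys.all? { |k| k.is_a?(String) }
      raise ArgumentError, "key #{key.inspect} must be a String or an Array of Strings"
    end
    unless value.is_a?(String)
      raise ArgumentError, "value #{value.inspect} must be a String"
    end
  end
  mapping
end

# A well-formed mapping is returned unchanged.
check_mapping!({ 'æ' => 'ae', ['à', 'á'] => 'a' })
```

Running a check like this early gives a readable error at the point where the mapping is defined, instead of a failure while the filter is being built.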

Example

filt = MappingFilter.new(token_stream,
                         {
                           ['à','á','â','ã','ä','å']       => 'a',
                           ['è','é','ê','ë','ē','ę']       => 'e'
                         })
static VALUE
frb_mapping_filter_init(VALUE self, VALUE rsub_ts, VALUE mapping) 
{
    TokenStream *ts;
    ts = frb_get_cwrapped_rts(rsub_ts);
    ts = mapping_filter_new(ts);
    rb_hash_foreach(mapping, frb_add_mappings_i, (VALUE)ts);
    mulmap_compile(((MappingFilter *)ts)->mapper);
    object_add(&(TkFilt(ts)->sub_ts), rsub_ts);

    Frt_Wrap_Struct(self, &frb_tf_mark, &frb_tf_free, ts);
    object_add(ts, self);
    return self;
}
