class FeedNormalizer::HtmlCleaner
Various methods for cleaning up HTML and preparing it for safe public consumption.
Documents used for reference:
Constants
- DODGY_URI_SCHEMES
- HTML_ATTRS
Allowed attributes.
- HTML_ELEMENTS
Allowed HTML elements.
- HTML_URI_ATTRS
Allowed attributes that can contain URIs, so extra caution is required. NOTE: this does not list all URI attributes, only the ones that are allowed.
Public Class Methods
Adds entities where possible. Works like CGI.escapeHTML, but will not escape existing entities; i.e. &#123; will NOT become &amp;#123;
This method could be improved by adding a whitelist of html entities.
# File lib/html-cleaner.rb, line 152
def add_entities(str)
  str.to_s.gsub(/\"/n, '&quot;').gsub(/>/n, '&gt;').gsub(/</n, '&lt;').gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/nmi, '&amp;')
end
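To make the "will not escape existing entities" behaviour concrete, here is a minimal standalone sketch. It applies the same substitutions as a free function; the name `add_entities_sketch` is hypothetical, and the `/n` regex flag is dropped for simplicity, so treat it as an illustration rather than the library's own code path.

```ruby
# Standalone sketch; add_entities_sketch is a hypothetical name.
def add_entities_sketch(str)
  escaped = str.to_s.gsub('"', "&quot;").gsub(">", "&gt;").gsub("<", "&lt;")
  # Escape '&' only when it does not already start a numeric or named entity.
  escaped.gsub(/&(?!(\#\d+|\#x[0-9a-f]+|\w{2,8});)/i, "&amp;")
end

puts add_entities_sketch("AT&T <b>")          # AT&amp;T &lt;b&gt;
puts add_entities_sketch("&#123; and &amp;")  # &#123; and &amp;
```

The negative lookahead is what leaves `&#123;` and `&amp;` untouched: an ampersand is rewritten only if no entity-shaped token ending in `;` follows it.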
Does this:
- Unescape HTML
- Parse HTML into tree
- Find 'body' if present, and extract tree inside that tag, otherwise parse whole tree
- Each tag:
  - remove tag if not whitelisted
  - escape HTML tag contents
  - remove all attributes not on whitelist
  - extra-scrub URI attrs; see dodgy_uri?
- Extra (i.e. unmatched) ending tags and comments are removed.
# File lib/html-cleaner.rb, line 60
def clean(str)
  str = unescapeHTML(str)

  doc = Hpricot(str, :fixup_tags => true)
  doc = subtree(doc, :body)

  # get all the tags in the document
  # Somewhere near hpricot 0.4.92 "*" starting to return all elements,
  # including text nodes instead of just tagged elements.
  tags = (doc/"*").inject([]) { |m,e| m << e.name if(e.respond_to?(:name) && e.name =~ /^\w+$/) ; m }.uniq

  # Remove tags that aren't whitelisted.
  remove_tags!(doc, tags - HTML_ELEMENTS)
  remaining_tags = tags & HTML_ELEMENTS

  # Remove attributes that aren't on the whitelist, or are suspicious URLs.
  (doc/remaining_tags.join(",")).each do |element|
    next if element.raw_attributes.nil? || element.raw_attributes.empty?
    element.raw_attributes.reject! do |attr,val|
      !HTML_ATTRS.include?(attr) || (HTML_URI_ATTRS.include?(attr) && dodgy_uri?(val))
    end

    element.raw_attributes = element.raw_attributes.build_hash {|a,v| [a, add_entities(v)]}
  end unless remaining_tags.empty?

  doc.traverse_text do |t|
    t.swap(add_entities(t.to_html))
  end

  # Return the tree, without comments. Ugly way of removing comments,
  # but can't see a way to do this in Hpricot yet.
  doc.to_s.gsub(/<\!--.*?-->/mi, '')
end
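The attribute-filtering step can be tried in isolation on a plain Hash, without a parser. The sketch below is an assumption-laden illustration: `ALLOWED_ATTRS` and `URI_ATTRS` are hypothetical stand-ins for HTML_ATTRS and HTML_URI_ATTRS (whose real contents are not shown here), and `suspicious_uri?` is a deliberately simplified substitute for dodgy_uri?.

```ruby
# Hypothetical allow-lists, standing in for HTML_ATTRS and HTML_URI_ATTRS.
ALLOWED_ATTRS = %w[href src title alt].freeze
URI_ATTRS     = %w[href src].freeze

# Simplified stand-in for dodgy_uri?: only checks a literal scheme prefix.
def suspicious_uri?(value)
  value.to_s.strip.match?(/\A(javascript|vbscript|data):/i)
end

# Drop attributes that are not allowed, or that are URI attributes
# carrying a suspicious value. Returns a new Hash.
def scrub_attributes(attrs)
  attrs.reject do |name, value|
    !ALLOWED_ATTRS.include?(name) ||
      (URI_ATTRS.include?(name) && suspicious_uri?(value))
  end
end

attrs = { "href" => "javascript:alert(1)", "title" => "hi", "onclick" => "x()" }
puts scrub_attributes(attrs).keys.join(",")  # title
```

Here `onclick` is rejected because it is not on the allow-list, and `href` is rejected because it is a URI attribute with a script scheme; only `title` survives.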
Returns true if the given string contains a suspicious URL, such as a javascript link.
This method rejects javascript, vbscript, livescript, mocha and data URLs. It could be refined to only deny dangerous data URLs, however.
# File lib/html-cleaner.rb, line 117
def dodgy_uri?(uri)
  uri = uri.to_s

  # special case for poorly-formed entities (missing ';')
  # if these occur *anywhere* within the string, then throw it out.
  return true if (uri =~ /&\#(\d+|x[0-9a-f]+)[^;\d]/mi)

  # Try escaping as both HTML or URI encodings, and then trying
  # each scheme regexp on each
  [unescapeHTML(uri), CGI.unescape(uri)].each do |unesc_uri|
    DODGY_URI_SCHEMES.each do |scheme|

      regexp = "#{scheme}:".gsub(/./) do |char|
        "([\000-\037\177\s]*)#{char}"
      end

      # regexp looks something like
      # /\A([\000-\037\177\s]*)j([\000-\037\177\s]*)a([\000-\037\177\s]*)v([\000-\037\177\s]*)a([\000-\037\177\s]*)s([\000-\037\177\s]*)c([\000-\037\177\s]*)r([\000-\037\177\s]*)i([\000-\037\177\s]*)p([\000-\037\177\s]*)t([\000-\037\177\s]*):/mi
      return true if (unesc_uri =~ %r{\A#{regexp}}mi)
    end
  end

  nil
end
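The scheme check works by allowing control characters and whitespace before every letter of the scheme, so that obfuscated forms such as "jav&#9;ascript:" (once decoded) still match. A minimal sketch of that pattern construction follows; `FILLER` and `scheme_pattern` are hypothetical names, and the HTML/URI decoding pass that dodgy_uri? performs first is omitted.

```ruby
# Characters tolerated between scheme letters: control chars, DEL, whitespace.
FILLER = '[\x00-\x1f\x7f\s]*'

# Hypothetical helper: builds an anchored, case-insensitive pattern
# that matches one scheme with optional filler before each character.
def scheme_pattern(scheme)
  body = "#{scheme}:".each_char.map { |c| FILLER + Regexp.escape(c) }.join
  Regexp.new("\\A#{body}", Regexp::IGNORECASE)
end

js = scheme_pattern("javascript")
puts js.match?("javascript:alert(1)")          # true
puts js.match?("  Jav\tascript:alert(1)")      # true
puts js.match?("http://example.com/?q=javascript:")  # false
```

Because the pattern is anchored with `\A`, a scheme name that merely appears later in an otherwise ordinary URL does not trigger a match.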
For all other feed elements:
- Unescape HTML.
- Parse HTML into tree (taking 'body' as root, if present)
- Takes text out of each tag, and escapes HTML.
- Returns all text concatenated.
# File lib/html-cleaner.rb, line 99
def flatten(str)
  str.gsub!("\n", " ")
  str = unescapeHTML(str)

  doc = Hpricot(str, :xhtml_strict => true)
  doc = subtree(doc, :body)

  out = []
  doc.traverse_text {|t| out << add_entities(t.to_html)}

  return out.join
end
Unescapes HTML. If xml is true, also converts XML-only named entities to HTML.
# File lib/html-cleaner.rb, line 143
def unescapeHTML(str, xml = true)
  CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
end
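As a rough illustration of the XML-entity handling, the sketch below assumes the XML-only entity in question is `&apos;`, which is rewritten to the numeric entity `&#39;` before the standard library does the unescaping. The function name `unescape_html_sketch` is hypothetical.

```ruby
require "cgi"

# Hypothetical standalone version of the helper described above.
def unescape_html_sketch(str, xml = true)
  # Assumption: &apos; is the XML-only named entity being normalised.
  CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
end

puts unescape_html_sketch("it&apos;s &lt;b&gt;bold&lt;/b&gt;")  # it's <b>bold</b>
```

Using `gsub` rather than `gsub!` leaves the caller's string unmodified.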
Private Class Methods
Everything below element, or just return the doc if element is not present.
# File lib/html-cleaner.rb, line 159
def subtree(doc, element)
  doc.at("//#{element}/*") || doc
end