Class ExtractingDocumentLoader.MostlyPassthroughHtmlMapper

  • All Implemented Interfaces:
    org.apache.tika.parser.html.HtmlMapper
    Enclosing class:
    ExtractingDocumentLoader

    public static class ExtractingDocumentLoader.MostlyPassthroughHtmlMapper
    extends Object
    implements org.apache.tika.parser.html.HtmlMapper
    • Field Detail

      • INSTANCE

        public static final org.apache.tika.parser.html.HtmlMapper INSTANCE
    • Constructor Detail

      • MostlyPassthroughHtmlMapper

        public MostlyPassthroughHtmlMapper()
    • Method Detail

      • isDiscardElement

        public boolean isDiscardElement(String name)
        Keep all elements and their content.

        <SCRIPT> and <STYLE> elements are apparently blocked elsewhere.

        Specified by:
        isDiscardElement in interface org.apache.tika.parser.html.HtmlMapper
      • mapSafeAttribute

        public String mapSafeAttribute(String elementName,
                                       String attributeName)
        Lowercases the attribute name.
        Specified by:
        mapSafeAttribute in interface org.apache.tika.parser.html.HtmlMapper
      • mapSafeElement

        public String mapSafeElement(String name)
        Lowercases the element name, but returns null for <BR>, which suppresses the start-element event for <BR> tags. Also returns null for <BODY>, which suppresses those tags because they are handled internally by Tika's XHTMLContentHandler.
        Specified by:
        mapSafeElement in interface org.apache.tika.parser.html.HtmlMapper
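
The three mapping rules described above can be illustrated with a minimal sketch. This is not the Solr source: SimpleHtmlMapper is a hypothetical stand-in that mirrors the three methods listed on this page so the example compiles without Tika on the classpath, PassthroughSketch and MapperSketch are illustrative names, and the use of Locale.ROOT for lowercasing is an assumption.

```java
import java.util.Locale;

public class MapperSketch {

    /** Hypothetical stand-in mirroring the three HtmlMapper methods documented above. */
    interface SimpleHtmlMapper {
        String mapSafeElement(String name);
        boolean isDiscardElement(String name);
        String mapSafeAttribute(String elementName, String attributeName);
    }

    /** Illustrative implementation of the documented behavior (not the actual Solr class). */
    static final class PassthroughSketch implements SimpleHtmlMapper {

        // Keep all elements and their content: nothing is ever discarded here.
        @Override
        public boolean isDiscardElement(String name) {
            return false;
        }

        // Pass every attribute through, lowercased. Locale.ROOT is an assumption,
        // chosen so the result does not depend on the JVM's default locale.
        @Override
        public String mapSafeAttribute(String elementName, String attributeName) {
            return attributeName.toLowerCase(Locale.ROOT);
        }

        // Pass every element through, lowercased, except <BR> and <BODY>,
        // for which null is returned to suppress the element.
        @Override
        public String mapSafeElement(String name) {
            String lower = name.toLowerCase(Locale.ROOT);
            return (lower.equals("br") || lower.equals("body")) ? null : lower;
        }
    }

    public static void main(String[] args) {
        SimpleHtmlMapper mapper = new PassthroughSketch();
        System.out.println(mapper.mapSafeElement("DIV"));         // prints "div"
        System.out.println(mapper.mapSafeElement("BR"));          // prints "null"
        System.out.println(mapper.mapSafeAttribute("A", "HREF")); // prints "href"
        System.out.println(mapper.isDiscardElement("SCRIPT"));    // prints "false"
    }
}
```

Returning null rather than an empty string is what marks an element as unmapped in this sketch, so callers would need to null-check the result of mapSafeElement before emitting a start-element event.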