Class URLClassifyProcessor

java.lang.Object
org.apache.solr.update.processor.UpdateRequestProcessor
org.apache.solr.update.processor.URLClassifyProcessor
All Implemented Interfaces:
Closeable, AutoCloseable

public class URLClassifyProcessor extends UpdateRequestProcessor
Update processor which examines a URL and writes characteristics of that URL to various other fields, including its length, number of path levels, whether it is a top-level URL (levels==0), whether it looks like a landing/index page, a canonical representation of the URL (e.g. with index.html stripped), and the domain and path parts of the URL.

This processor is intended for use when processing web resources, and helps produce values which may be used for boosting or filtering later.

In the example configuration below, we construct a custom updateRequestProcessorChain and then instruct the /update request handler to use it for every incoming document.

 <updateRequestProcessorChain name="urlProcessor">
   <processor class="org.apache.solr.update.processor.URLClassifyProcessorFactory">
     <bool name="enabled">true</bool>
     <str name="inputField">id</str>
     <str name="domainOutputField">hostname</str>
   </processor>
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

 <requestHandler name="/update" class="solr.UpdateRequestHandler">
 <lst name="defaults">
 <str name="update.chain">urlProcessor</str>
 </lst>
 </requestHandler>
 

Then, at index time, Solr will look at the id field value and extract its domain portion into a new hostname field. By default, the following fields will also be added:

  • url_length
  • url_levels
  • url_toplevel
  • url_landingpage

For example, adding the following document

 { "id":"http://wwww.mydomain.com/subpath/document.html" }
 

will result in this document in Solr:

 {
  "id":"http://wwww.mydomain.com/subpath/document.html",
  "url_length":46,
  "url_levels":2,
  "url_toplevel":0,
  "url_landingpage":0,
  "hostname":"wwww.mydomain.com",
  "_version_":1603193062117343232
 }
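
The numeric values in the example above can be reproduced with plain java.net classes. The following is a minimal sketch, not the processor's actual implementation: the class UrlFieldsSketch and its classify and levels helpers are hypothetical names, and it assumes that the length is the character count of the full URL string and that path levels are counted as '/' characters in the path after trailing slashes are removed.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not Solr code.
class UrlFieldsSketch {

    // Assumption: a level is one '/' in the path, ignoring trailing slashes,
    // so "http://host/" has 0 levels.
    static int levels(URL url) {
        String path = url.getPath().replaceAll("/+$", "");
        int count = 0;
        for (int i = 0; i < path.length(); i++) {
            if (path.charAt(i) == '/') {
                count++;
            }
        }
        return count;
    }

    // Builds a field map resembling the document shown above
    // (url_toplevel, url_landingpage and _version_ are left out).
    static Map<String, Object> classify(String id) throws MalformedURLException {
        URL url = URI.create(id).toURL();
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("id", id);
        fields.put("url_length", url.toString().length());
        fields.put("url_levels", levels(url));
        fields.put("hostname", url.getHost());
        return fields;
    }

    public static void main(String[] args) throws MalformedURLException {
        // For the example document this yields url_length=46, url_levels=2
        // and hostname=wwww.mydomain.com.
        System.out.println(classify("http://wwww.mydomain.com/subpath/document.html"));
    }
}
```

Under these assumptions the sketch agrees with the url_length, url_levels and hostname values shown in the indexed document.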
 
  • Constructor Details

  • Method Details

    • processAdd

      public void processAdd(AddUpdateCommand command) throws IOException
      Overrides:
      processAdd in class UpdateRequestProcessor
      Throws:
      IOException
    • getCanonicalUrl

      public URL getCanonicalUrl(URL url) throws MalformedURLException
      Gets a canonical form of the URL for use as main URL
      Parameters:
      url - The input url
      Returns:
      The URL object representing the canonical URL
      Throws:
      MalformedURLException
    • length

      public int length(URL url)
      Calculates the length of the URL in characters
      Parameters:
      url - The input URL
      Returns:
      the length of the URL
    • levels

      public int levels(URL url)
      Calculates the number of path levels in the given URL
      Parameters:
      url - The input URL
      Returns:
      the number of levels, where a top-level URL is 0
    • isTopLevelPage

      public boolean isTopLevelPage(URL url)
      Calculates whether a URL is a top level page
      Parameters:
      url - The input URL
      Returns:
      true if page is a top level page
    • isLandingPage

      public boolean isLandingPage(URL url)
      Calculates whether the URL is a landing page or not
      Parameters:
      url - The input URL
      Returns:
      true if URL represents a landing page (index page)
    • getNormalizedURL

      public URL getNormalizedURL(String url) throws MalformedURLException, URISyntaxException
      Throws:
      MalformedURLException
      URISyntaxException
    • isEnabled

      public boolean isEnabled()
    • setEnabled

      public void setEnabled(boolean enabled)
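
To make the URL helper methods documented above more concrete, here is a minimal sketch of comparable logic written against java.net only. It is not the processor's implementation. The class UrlHelpersSketch, its method names, and the list of index-page file names are all assumptions made for illustration; the real processor may recognise different suffixes and treat queries or fragments differently.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Locale;

// Hypothetical illustration only; this is not Solr code.
class UrlHelpersSketch {

    // Assumed index-page names; the processor's own list may differ.
    private static final String[] INDEX_PAGES = {"index.html", "index.htm"};

    // Returns the matching index-page name, or null if the path does not end with one.
    private static String indexPage(URL url) {
        String path = url.getPath().toLowerCase(Locale.ROOT);
        for (String page : INDEX_PAGES) {
            if (path.endsWith("/" + page)) {
                return page;
            }
        }
        return null;
    }

    // Canonical form: drop a trailing index page, keeping the trailing slash.
    // URLs with a query or fragment are returned unchanged in this sketch.
    static URL canonical(URL url) throws MalformedURLException {
        String page = indexPage(url);
        if (page == null || url.getQuery() != null || url.getRef() != null) {
            return url;
        }
        String s = url.toString();
        return URI.create(s.substring(0, s.length() - page.length())).toURL();
    }

    // Top level: nothing left in the path once trailing slashes are removed, and no query.
    static boolean isTopLevelPage(URL url) {
        return url.getPath().replaceAll("/+$", "").isEmpty() && url.getQuery() == null;
    }

    // Landing page: no query, and the path ends with '/' or an assumed index page.
    static boolean isLandingPage(URL url) {
        if (url.getQuery() != null) {
            return false;
        }
        return url.getPath().endsWith("/") || indexPage(url) != null;
    }

    // Normalization via java.net.URI, which removes "." and ".." path segments.
    static URL normalize(String url) throws MalformedURLException, URISyntaxException {
        return new URI(url).normalize().toURL();
    }

    public static void main(String[] args) throws Exception {
        URL index = URI.create("http://example.com/docs/index.html").toURL();
        System.out.println(canonical(index));       // http://example.com/docs/
        System.out.println(isLandingPage(index));   // true
        System.out.println(isTopLevelPage(index));  // false
        System.out.println(normalize("http://example.com/a/./b/../c.html")); // http://example.com/a/c.html
    }
}
```

With these assumptions, the document URL from the class description (http://wwww.mydomain.com/subpath/document.html) is neither a top-level page nor a landing page, which is consistent with the 0 values shown for url_toplevel and url_landingpage.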