Package org.apache.solr.update.processor
Class URLClassifyProcessor
java.lang.Object
org.apache.solr.update.processor.UpdateRequestProcessor
org.apache.solr.update.processor.URLClassifyProcessor
- All Implemented Interfaces:
Closeable, AutoCloseable
Update processor which examines a URL and outputs to various other fields characteristics of that
URL, including length, number of path levels, whether it is a top level URL (levels==0), whether
it looks like a landing/index page, a canonical representation of the URL (e.g. stripping
index.html), the domain and path parts of the URL etc.
This processor is intended for use when processing web resources. It helps produce values that can later be used for boosting or filtering.
In the example configuration below, we construct a custom updateRequestProcessorChain
and then instruct the /update request handler to use it for every incoming
document.
<updateRequestProcessorChain name="urlProcessor">
  <processor class="org.apache.solr.update.processor.URLClassifyProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">id</str>
    <str name="domainOutputField">hostname</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">urlProcessor</str>
  </lst>
</requestHandler>
Then, at index time, Solr will look at the id field value and extract its domain
portion into a new hostname field. By default, the following fields will also be
added:
- url_length
- url_levels
- url_toplevel
- url_landingpage
For example, adding the following document
{ "id":"http://wwww.mydomain.com/subpath/document.html" }
will result in this document in Solr:
{
  "id":"http://wwww.mydomain.com/subpath/document.html",
  "url_length":46,
  "url_levels":2,
  "url_toplevel":0,
  "url_landingpage":0,
  "hostname":"wwww.mydomain.com",
  "_version_":1603193062117343232
}
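The length, level count, and hostname in the example above can be reproduced with plain java.net.URI. The following is a minimal sketch and not Solr's implementation: the class name UrlFeaturesSketch and its helper methods are hypothetical. It assumes that levels are counted as the number of "/" separators in the path after trailing slashes are removed, and it does not special-case index pages.

```java
import java.net.URI;

public class UrlFeaturesSketch {
    // Length of the URL string in characters.
    public static int length(String url) {
        return url.length();
    }

    // Host part of the URL, the value written to the configured domainOutputField.
    public static String host(String url) {
        return URI.create(url).getHost();
    }

    // Assumption: a level is one '/' separator in the path, ignoring trailing slashes.
    public static int levels(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return 0; // opaque URI without a path
        }
        path = path.replaceAll("/+$", "");
        int count = 0;
        for (int i = 0; i < path.length(); i++) {
            if (path.charAt(i) == '/') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String url = "http://wwww.mydomain.com/subpath/document.html";
        System.out.println(length(url)); // 46
        System.out.println(levels(url)); // 2
        System.out.println(host(url));   // wwww.mydomain.com
    }
}
```

Under these assumptions a URL such as http://example.com/ has an empty path after trimming and therefore zero levels, which matches the "levels==0" definition of a top level URL given in the class description.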
Field Summary
Fields inherited from class org.apache.solr.update.processor.UpdateRequestProcessor
next
Constructor Summary
Constructors
URLClassifyProcessor(org.apache.solr.common.params.SolrParams parameters, SolrQueryRequest request, SolrQueryResponse response, UpdateRequestProcessor nextProcessor)
Method Summary
Modifier and Type, Method, Description
- URL getCanonicalUrl(URL url): Gets a canonical form of the URL for use as main URL
- URL getNormalizedURL(String url)
- boolean isEnabled()
- boolean isLandingPage(URL url): Calculates whether the URL is a landing page or not
- boolean isTopLevelPage(URL url): Calculates whether a URL is a top level page
- int length(URL url): Calculates the length of the URL in characters
- int levels(URL url): Calculates the number of path levels in the given URL
- void processAdd(AddUpdateCommand command)
- void setEnabled(boolean enabled)
Methods inherited from class org.apache.solr.update.processor.UpdateRequestProcessor
close, doClose, finish, processCommit, processDelete, processMergeIndexes, processRollback
Constructor Details
URLClassifyProcessor
public URLClassifyProcessor(org.apache.solr.common.params.SolrParams parameters, SolrQueryRequest request, SolrQueryResponse response, UpdateRequestProcessor nextProcessor)
Method Details
processAdd
- Overrides:
processAdd in class UpdateRequestProcessor
- Throws:
IOException
getCanonicalUrl
Gets a canonical form of the URL for use as main URL
- Parameters:
url - The input URL
- Returns:
The URL object representing the canonical URL
- Throws:
MalformedURLException
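The class description gives stripping index.html as an example of canonicalization. The sketch below illustrates that one idea by removing a trailing index-page file name from the path. It is not the processor's code: the class name CanonicalUrlSketch and the list of index file names are hypothetical, and URLs with a query string or fragment are deliberately left unchanged to keep the example small.

```java
import java.net.URI;

public class CanonicalUrlSketch {
    // Hypothetical list of index-page names; the real processor may recognize a different set.
    private static final String[] INDEX_PAGES = {"index.html", "index.htm"};

    // Returns the URL with a trailing index-page name removed, or the input unchanged.
    public static String canonical(String url) {
        URI uri = URI.create(url);
        String path = uri.getRawPath();
        // Only handle the simple case where the URL string ends with its path.
        if (path == null || uri.getRawQuery() != null || uri.getRawFragment() != null) {
            return url;
        }
        for (String page : INDEX_PAGES) {
            if (path.endsWith("/" + page)) {
                // Keep the trailing slash, drop only the file name.
                return url.substring(0, url.length() - page.length());
            }
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(canonical("http://example.com/docs/index.html")); // http://example.com/docs/
        System.out.println(canonical("http://example.com/docs/page.html"));  // http://example.com/docs/page.html
    }
}
```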
length
Calculates the length of the URL in characters
- Parameters:
url - The input URL
- Returns:
the length of the URL
levels
Calculates the number of path levels in the given URL
- Parameters:
url - The input URL
- Returns:
the number of levels, where a top-level URL is 0
isTopLevelPage
Calculates whether a URL is a top level page
- Parameters:
url - The input URL
- Returns:
true if page is a top level page
isLandingPage
Calculates whether the URL is a landing page or not
- Parameters:
url - The input URL
- Returns:
true if URL represents a landing page (index page)
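The two boolean checks above can be approximated as follows. This is a sketch under stated assumptions rather than the actual logic of isTopLevelPage and isLandingPage: the class name PageKindSketch, the method names, and the list of index file names are hypothetical. It treats a URL as top level when its path is empty after trailing slashes are removed, treats a path ending in "/" or in a known index file name as a landing page, and ignores query strings.

```java
import java.net.URI;
import java.util.Locale;

public class PageKindSketch {
    // Hypothetical index-page names used to recognize landing pages.
    private static final String[] INDEX_PAGES = {"index.html", "index.htm"};

    // Assumption: top level means the path is empty once trailing slashes are removed.
    public static boolean isTopLevel(String url) {
        String path = URI.create(url).getRawPath();
        return path == null || path.replaceAll("/+$", "").isEmpty();
    }

    // Assumption: a landing page path ends in "/" or in one of the index-page names.
    public static boolean looksLikeLandingPage(String url) {
        String path = URI.create(url).getRawPath();
        if (path == null) {
            return false; // opaque URI without a path
        }
        String lower = path.toLowerCase(Locale.ROOT);
        if (lower.endsWith("/")) {
            return true;
        }
        for (String page : INDEX_PAGES) {
            if (lower.endsWith("/" + page)) {
                return true;
            }
        }
        return false;
    }
}
```

With these rules, http://wwww.mydomain.com/subpath/document.html is neither top level nor a landing page, which agrees with the url_toplevel and url_landingpage values of 0 in the example document.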
getNormalizedURL
isEnabled
public boolean isEnabled()
setEnabled
public void setEnabled(boolean enabled)