Package org.apache.solr.update.processor
Class URLClassifyProcessor
java.lang.Object
  org.apache.solr.update.processor.UpdateRequestProcessor
    org.apache.solr.update.processor.URLClassifyProcessor

All Implemented Interfaces:
Closeable, AutoCloseable
public class URLClassifyProcessor extends UpdateRequestProcessor
Update processor that examines a URL and writes characteristics of that URL to other fields, including its length, its number of path levels, whether it is a top-level URL (levels==0), whether it looks like a landing/index page, a canonical representation of the URL (e.g. with index.html stripped), and the domain and path parts of the URL.
This processor is intended for use when processing web resources, where it helps produce values that can later be used for boosting or filtering.
In the example configuration below, we construct a custom updateRequestProcessorChain and then instruct the /update requesthandler to use it for every incoming document.

  <updateRequestProcessorChain name="urlProcessor">
    <processor class="org.apache.solr.update.processor.URLClassifyProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="inputField">id</str>
      <str name="domainOutputField">hostname</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

  <requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">urlProcessor</str>
    </lst>
  </requestHandler>

Then, at index time, Solr will look at the id field value and extract its domain portion into a new hostname field. By default, the following fields will also be added:
- url_length
- url_levels
- url_toplevel
- url_landingpage
For example, adding the following document

  { "id":"http://wwww.mydomain.com/subpath/document.html" }

will result in this document in Solr:

  {
    "id":"http://wwww.mydomain.com/subpath/document.html",
    "url_length":46,
    "url_levels":2,
    "url_toplevel":0,
    "url_landingpage":0,
    "hostname":"wwww.mydomain.com",
    "_version_":1603193062117343232
  }
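The values in the example above can be checked by hand. The following is a minimal sketch that uses only java.net.URI and assumes that the level count is the number of slashes in the path once trailing slashes are removed. The class UrlFieldsSketch and its methods are hypothetical names for illustration; this is not the processor's actual implementation.

```java
import java.net.URI;

public class UrlFieldsSketch {
    // Length of the URL string in characters.
    public static int length(String url) {
        return url.length();
    }

    // Assumed rule: count '/' characters in the path, ignoring trailing slashes.
    public static int levels(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return 0;
        }
        path = path.replaceAll("/+$", "");
        int count = 0;
        for (int i = 0; i < path.length(); i++) {
            if (path.charAt(i) == '/') {
                count++;
            }
        }
        return count;
    }

    // A top-level URL is one with zero path levels, as stated in the class description.
    public static boolean isTopLevel(String url) {
        return levels(url) == 0;
    }

    public static void main(String[] args) {
        String url = "http://wwww.mydomain.com/subpath/document.html";
        System.out.println(length(url));                // 46
        System.out.println(levels(url));                // 2
        System.out.println(isTopLevel(url));            // false
        System.out.println(URI.create(url).getHost());  // wwww.mydomain.com
    }
}
```

Under this assumption the document file itself counts as one level, which is why /subpath/document.html yields 2 rather than 1.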
-
-
Field Summary
-
Fields inherited from class org.apache.solr.update.processor.UpdateRequestProcessor
next
-
-
Constructor Summary
Constructors
URLClassifyProcessor(org.apache.solr.common.params.SolrParams parameters, SolrQueryRequest request, SolrQueryResponse response, UpdateRequestProcessor nextProcessor)
-
Method Summary
Modifier and Type | Method | Description
URL | getCanonicalUrl(URL url) | Gets a canonical form of the URL for use as main URL
URL | getNormalizedURL(String url) |
boolean | isEnabled() |
boolean | isLandingPage(URL url) | Calculates whether the URL is a landing page or not
boolean | isTopLevelPage(URL url) | Calculates whether a URL is a top level page
int | length(URL url) | Calculates the length of the URL in characters
int | levels(URL url) | Calculates the number of path levels in the given URL
void | processAdd(AddUpdateCommand command) |
void | setEnabled(boolean enabled) |
Methods inherited from class org.apache.solr.update.processor.UpdateRequestProcessor
close, doClose, finish, processCommit, processDelete, processMergeIndexes, processRollback
-
-
-
-
Constructor Detail
-
URLClassifyProcessor
public URLClassifyProcessor(org.apache.solr.common.params.SolrParams parameters, SolrQueryRequest request, SolrQueryResponse response, UpdateRequestProcessor nextProcessor)
-
-
Method Detail
-
processAdd
public void processAdd(AddUpdateCommand command) throws IOException
Overrides:
processAdd in class UpdateRequestProcessor
Throws:
IOException
-
getCanonicalUrl
public URL getCanonicalUrl(URL url) throws MalformedURLException
Gets a canonical form of the URL for use as the main URL.
Parameters:
url - The input URL
Returns:
The URL object representing the canonical URL
Throws:
MalformedURLException
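As an illustration of what a canonical form can mean here, the sketch below strips a trailing index.html from the path, the one example the class description gives. It assumes that only the literal file name index.html is treated as an index page and that URLs with a query string are left unchanged. CanonicalUrlSketch and stripIndexPage are hypothetical names, and the real method may recognize other suffixes or apply different rules.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class CanonicalUrlSketch {
    private static final String INDEX_PAGE = "index.html"; // assumed suffix

    // Returns the URI with a trailing "/index.html" reduced to "/".
    public static URI stripIndexPage(URI uri) {
        String path = uri.getPath();
        if (uri.getQuery() != null || path == null || !path.endsWith("/" + INDEX_PAGE)) {
            return uri; // nothing to strip
        }
        String trimmed = path.substring(0, path.length() - INDEX_PAGE.length());
        try {
            return new URI(uri.getScheme(), uri.getAuthority(), trimmed, null, uri.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI in = URI.create("http://example.com/docs/index.html");
        System.out.println(stripIndexPage(in)); // http://example.com/docs/
    }
}
```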
-
length
public int length(URL url)
Calculates the length of the URL in characters.
Parameters:
url - The input URL
Returns:
the length of the URL
-
levels
public int levels(URL url)
Calculates the number of path levels in the given URL.
Parameters:
url - The input URL
Returns:
the number of levels, where a top-level URL is 0
-
isTopLevelPage
public boolean isTopLevelPage(URL url)
Calculates whether a URL is a top-level page.
Parameters:
url - The input URL
Returns:
true if the page is a top-level page
-
isLandingPage
public boolean isLandingPage(URL url)
Calculates whether the URL is a landing page or not.
Parameters:
url - The input URL
Returns:
true if the URL represents a landing page (index page)
-
getNormalizedURL
public URL getNormalizedURL(String url) throws MalformedURLException, URISyntaxException
-
isEnabled
public boolean isEnabled()
-
setEnabled
public void setEnabled(boolean enabled)
-
-