public class URLClassifyProcessor extends UpdateRequestProcessor
Update processor which examines a URL and writes characteristics of that URL to various other fields, including its length, the number of path levels, whether it is a top level URL (levels==0), whether it looks like a landing/index page, a canonical representation of the URL (e.g. stripping index.html), and the domain and path parts of the URL.
This processor is intended for use when processing web resources; it helps produce values which may later be used for boosting or filtering.
In the example configuration below, we construct a custom
updateRequestProcessorChain and then instruct the
/update request handler to use it for every incoming document.
<updateRequestProcessorChain name="urlProcessor">
  <processor class="org.apache.solr.update.processor.URLClassifyProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">id</str>
    <str name="domainOutputField">hostname</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">urlProcessor</str>
  </lst>
</requestHandler>
Then, at index time, Solr will look at the id field value and extract
its domain portion into a new hostname field. By default, the
following fields will also be added:
For example, adding the following document
{ "id":"http://wwww.mydomain.com/subpath/document.html" }
will result in this document in Solr:
{
  "id":"http://wwww.mydomain.com/subpath/document.html",
  "url_length":46,
  "url_levels":2,
  "url_toplevel":0,
  "url_landingpage":0,
  "hostname":"wwww.mydomain.com",
  "_version_":1603193062117343232
}
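The values in the example above can be reproduced with a few lines of plain Java. The following is a minimal sketch and not the processor's actual implementation: the class name UrlFeaturesSketch and its methods are hypothetical, and it assumes that a path level is a non-empty path segment and that the length is the character count of the whole URL string.

```java
import java.net.URI;

// Hypothetical illustration only; not the code used by URLClassifyProcessor.
class UrlFeaturesSketch {

    // Length of the URL in characters (assumed to be the full URL string).
    static int length(String url) {
        return url.length();
    }

    // Number of path levels, assumed here to be the count of non-empty path segments.
    static int levels(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return 0;
        }
        int count = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // A top level URL is one with zero path levels, as stated in the class description.
    static boolean isTopLevel(String url) {
        return levels(url) == 0;
    }

    // Domain part of the URL, as written to the hostname field in the example.
    static String hostname(String url) {
        return URI.create(url).getHost();
    }

    public static void main(String[] args) {
        String url = "http://wwww.mydomain.com/subpath/document.html";
        System.out.println(length(url));     // 46
        System.out.println(levels(url));     // 2
        System.out.println(isTopLevel(url)); // false
        System.out.println(hostname(url));   // wwww.mydomain.com
    }
}
```

The sketch uses java.net.URI rather than java.net.URL so that parsing does not throw a checked exception; the processor's own methods take a URL argument.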
| Constructor and Description |
|---|
| URLClassifyProcessor(SolrParams parameters, SolrQueryRequest request, SolrQueryResponse response, UpdateRequestProcessor nextProcessor) |
| Modifier and Type | Method and Description |
|---|---|
| URL | getCanonicalUrl(URL url) Gets a canonical form of the URL for use as main URL |
| URL | getNormalizedURL(String url) |
| boolean | isEnabled() |
| boolean | isLandingPage(URL url) Calculates whether the URL is a landing page or not |
| boolean | isTopLevelPage(URL url) Calculates whether a URL is a top level page |
| int | length(URL url) Calculates the length of the URL in characters |
| int | levels(URL url) Calculates the number of path levels in the given URL |
| void | processAdd(AddUpdateCommand command) |
| void | setEnabled(boolean enabled) |
Methods inherited from class UpdateRequestProcessor: close, doClose, finish, processCommit, processDelete, processMergeIndexes, processRollback

public URLClassifyProcessor(SolrParams parameters, SolrQueryRequest request, SolrQueryResponse response, UpdateRequestProcessor nextProcessor)

public void processAdd(AddUpdateCommand command) throws IOException
Overrides: processAdd in class UpdateRequestProcessor
Throws: IOException

public URL getCanonicalUrl(URL url)
Parameters: url - The input url

public int length(URL url)
Parameters: url - The input URL

public int levels(URL url)
Parameters: url - The input URL

public boolean isTopLevelPage(URL url)
Parameters: url - The input URL

public boolean isLandingPage(URL url)
Parameters: url - The input URL

public URL getNormalizedURL(String url) throws MalformedURLException, URISyntaxException

public boolean isEnabled()

public void setEnabled(boolean enabled)
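This page does not spell out the rules behind isLandingPage and getCanonicalUrl beyond "stripping index.html". The sketch below shows one plausible reading under stated assumptions. The class LandingPageSketch, its method names, and the INDEX_FILES list are all hypothetical; the sketch treats a URL as a landing page when its last path segment is a known index file name, and it builds the canonical form by dropping that file name (and any query or fragment).

```java
import java.net.URI;

// Hypothetical illustration only; the processor's real rules may differ.
class LandingPageSketch {

    // Assumed index file names; the actual list used by the processor is not given here.
    static final String[] INDEX_FILES = {"index.html", "index.htm", "index.php"};

    // Returns the last path segment, or an empty string if there is none.
    static String lastSegment(String url) {
        String path = URI.create(url).getRawPath();
        if (path == null) {
            return "";
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Assumption: a landing page is a URL whose last path segment is an index file.
    static boolean isLandingPage(String url) {
        String last = lastSegment(url);
        for (String name : INDEX_FILES) {
            if (last.equalsIgnoreCase(name)) {
                return true;
            }
        }
        return false;
    }

    // Assumption: the canonical form strips the index file name from a landing page.
    // Query string and fragment are dropped in that case; other URLs are returned unchanged.
    static String canonical(String url) {
        if (!isLandingPage(url)) {
            return url;
        }
        URI uri = URI.create(url);
        String path = uri.getRawPath();
        String stripped = path.substring(0, path.length() - lastSegment(url).length());
        return uri.getScheme() + "://" + uri.getRawAuthority() + stripped;
    }

    public static void main(String[] args) {
        System.out.println(isLandingPage("http://www.example.com/docs/index.html")); // true
        System.out.println(canonical("http://www.example.com/docs/index.html"));     // http://www.example.com/docs/
        System.out.println(isLandingPage("http://wwww.mydomain.com/subpath/document.html")); // false
    }
}
```

The last call agrees with the url_landingpage value of 0 shown in the example document, but that agreement does not confirm that the real processor applies the same rule.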
Copyright © 2000-2021 Apache Software Foundation. All Rights Reserved.