Class SimplePostTool


  • public class SimplePostTool
    extends Object
    A simple utility class for posting raw updates to a Solr server. It has a main method so it can be run from the command line. View this not as a best-practice code example, but as a standalone example built with the explicit purpose of avoiding external jar dependencies.
    • Constructor Detail

      • SimplePostTool

        public SimplePostTool​(String mode,
                              URL url,
                              boolean auto,
                              String type,
                              String format,
                              int recursive,
                              int delay,
                              String fileTypes,
                              OutputStream out,
                              boolean commit,
                              boolean optimize,
                              String[] args)
        Constructor which takes in all mandatory input for the tool to work. Also see usage() for further explanation of the params.
        Parameters:
        mode - whether to post files, web pages, params or stdin
        url - the Solr base URL to post to; should end with /update
        auto - if true, we'll guess type and add resourcename/url
        type - content-type of the data you are posting
        recursive - number of levels for file/web mode, or 0 if one file only
        delay - the wait time between posts when recursive is greater than 0
        fileTypes - a comma separated list of file-name endings to accept for file/web
        out - an OutputStream to write output to, e.g. stdout to print to console
        commit - if true, will commit at end of posting
        optimize - if true, will optimize at end of posting
        args - a String[] of arguments, varies between modes
      • SimplePostTool

        public SimplePostTool()
    • Method Detail

      • main

        public static void main​(String[] args)
        See usage() for valid command line usage
        Parameters:
        args - the params on the command line
      • execute

        public void execute()
        After initialization, call execute to start the post job. This method delegates to the correct mode method.
      • parseArgsAndInit

        protected static SimplePostTool parseArgsAndInit​(String[] args)
        Parses incoming arguments and system params and initializes the tool
        Parameters:
        args - the incoming cmd line args
        Returns:
        an instance of SimplePostTool
      • postFiles

        public int postFiles​(String[] args,
                             int startIndexInArgs,
                             OutputStream out,
                             String type)
        Post all filenames provided in args
        Parameters:
        args - array of file names
        startIndexInArgs - offset to start
        out - output stream to post data to
        type - default content-type to use when posting (may be overridden in auto mode)
        Returns:
        number of files posted
      • postFiles

        public int postFiles​(File[] files,
                             int startIndexInArgs,
                             OutputStream out,
                             String type)
        Posts all files provided in the files array
        Parameters:
        files - array of Files
        startIndexInArgs - offset to start
        out - output stream to post data to
        type - default content-type to use when posting (may be overridden in auto mode)
        Returns:
        number of files posted
      • postWebPages

        public int postWebPages​(String[] args,
                                int startIndexInArgs,
                                OutputStream out)
        Takes a list of start URL strings for crawling, converts them to URI strings, adds each one to a backlog, and then starts crawling.
        Parameters:
        args - the raw input args from main()
        startIndexInArgs - offset for where to start
        out - outputStream to write results to
        Returns:
        the number of web pages posted
      • normalizeUrlEnding

        protected static String normalizeUrlEnding​(String link)
        Normalizes a URL string by removing anchor part and trailing slash
        Returns:
        the normalized URL string
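The normalization described for normalizeUrlEnding can be illustrated with a minimal sketch. This is a hypothetical re-implementation written from the description above, not the Solr source; the class name NormalizeUrlSketch is invented, and the real method may strip more than is shown here.

```java
class NormalizeUrlSketch {
    // Hypothetical re-implementation based only on the documented behavior:
    // remove the anchor part, then remove a trailing slash.
    static String normalizeUrlEnding(String link) {
        int hash = link.indexOf('#');
        if (hash >= 0) {
            link = link.substring(0, hash);               // drop "#anchor"
        }
        if (link.endsWith("/")) {
            link = link.substring(0, link.length() - 1);  // drop one trailing slash
        }
        return link;
    }
}
```

Under these assumptions, "http://example.com/docs/#intro" and "http://example.com/docs" normalize to the same string, which is what lets a crawler treat them as one page.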
      • webCrawl

        protected int webCrawl​(int level,
                               OutputStream out)
        A very simple crawler that pulls URLs to fetch from a backlog and recurses N levels deep if recursive>0. Links are parsed from HTML by first getting an XHTML version using SolrCell with extractOnly, and are followed if they are local. The crawler pauses for a default delay of 10 seconds between fetches; this can be configured in the delay variable. This is only meant for test purposes, as it does not respect robots exclusion rules or any other crawling conventions.
        Parameters:
        level - which level to crawl
        out - output stream to write to
        Returns:
        number of pages crawled on this level and below
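The backlog-and-levels idea behind webCrawl can be sketched without any network access. The class below is a hypothetical illustration, not the Solr implementation: it replaces real fetching and SolrCell link extraction with an in-memory map from page URL to outgoing links, and it omits the delay between fetches. The names CrawlSketch and pages are assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlSketch {
    private final Map<String, List<String>> pages;   // hypothetical "web": URL -> links on that page
    private final int recursive;                     // maximum number of levels to follow
    private final Set<String> visited = new HashSet<>();
    private final List<Set<String>> backlog = new ArrayList<>(); // backlog.get(level) = URLs to fetch

    CrawlSketch(Map<String, List<String>> pages, int recursive, String startUrl) {
        this.pages = pages;
        this.recursive = recursive;
        Set<String> level0 = new LinkedHashSet<>();
        level0.add(startUrl);
        backlog.add(level0);
    }

    // Returns the number of pages crawled on this level and below.
    int webCrawl(int level) {
        int numPages = 0;
        Set<String> nextLevel = new LinkedHashSet<>();
        for (String url : backlog.get(level)) {
            if (!visited.add(url)) continue;          // already crawled
            List<String> links = pages.get(url);
            if (links == null) continue;              // stands in for a failed fetch
            numPages++;
            if (level < recursive) {                  // only collect links if we may go deeper
                for (String link : links) {
                    if (!visited.contains(link)) nextLevel.add(link);
                }
            }
        }
        if (!nextLevel.isEmpty()) {
            backlog.add(nextLevel);                   // lands at index level + 1
            numPages += webCrawl(level + 1);
        }
        return numPages;
    }
}
```

With a start page that links to two others and recursive set to 1, the sketch counts three pages; with recursive set to 0 it counts only the start page.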
      • inputStreamToByteArray

        public static ByteBuffer inputStreamToByteArray​(InputStream is,
                                                        long maxSize)
                                                 throws IOException
        Reads an input stream into a byte buffer
        Parameters:
        is - the input stream
        Returns:
        the bytes read, as a ByteBuffer
        Throws:
        IOException - If there is a low-level I/O error.
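A bounded read of this kind might look like the sketch below. The behavior when maxSize is exceeded is not documented above, so throwing an IOException here is purely an assumption, as is the class name StreamToBufferSketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

class StreamToBufferSketch {
    // Hypothetical sketch: reads the whole stream, refusing to buffer more than maxSize bytes.
    static ByteBuffer inputStreamToByteArray(InputStream is, long maxSize) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        long total = 0;
        int n;
        while ((n = is.read(chunk)) != -1) {
            total += n;
            if (total > maxSize) {
                // Assumption: the real method may signal this differently.
                throw new IOException("Stream exceeds maxSize of " + maxSize + " bytes");
            }
            bos.write(chunk, 0, n);
        }
        return ByteBuffer.wrap(bos.toByteArray());
    }
}
```

Capping the size protects a crawler from buffering an unexpectedly large response in memory.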
      • computeFullUrl

        protected String computeFullUrl​(URL baseUrl,
                                        String link)
        Computes the full URL based on a base url and a possibly relative link found in the href param of an HTML anchor.
        Parameters:
        baseUrl - the base url from where the link was found
        link - the absolute or relative link
        Returns:
        the string version of the full URL
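Resolving a possibly relative link against a base can be sketched with java.net.URI. This is an illustration only: the documented method takes a URL as its base, while this hypothetical version takes a string to stay free of checked exceptions, and returning null for an empty link is an assumption.

```java
import java.net.URI;

class FullUrlSketch {
    // Hypothetical sketch: resolves an absolute or relative link against the page it was found on.
    static String computeFullUrl(String baseUrl, String link) {
        if (link == null || link.isEmpty()) {
            return null;   // assumption: nothing to resolve
        }
        // URI.resolve returns the link itself when it is already absolute.
        return URI.create(baseUrl).resolve(link).toString();
    }
}
```

For a base of "http://example.com/docs/index.html", the link "page2.html" resolves to "http://example.com/docs/page2.html" and "/about" resolves to "http://example.com/about".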
      • typeSupported

        protected boolean typeSupported​(String type)
        Uses the mime-type map to reverse lookup whether the file ending for our type is supported by the fileTypes option
        Parameters:
        type - what content-type to lookup
        Returns:
        true if this is a supported content type
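The reverse lookup described for typeSupported can be sketched as follows. The suffix-to-type table is a small hypothetical stand-in for the tool's mime-type map, and fileTypes is passed explicitly here, whereas the documented method takes only the type.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TypeSupportedSketch {
    // Hypothetical stand-in for the mime-type map: file ending -> content type.
    private static final Map<String, String> MIME_MAP = new HashMap<>();
    static {
        MIME_MAP.put("pdf", "application/pdf");
        MIME_MAP.put("html", "text/html");
        MIME_MAP.put("txt", "text/plain");
    }

    // Finds the file endings that map to the given type and checks them against fileTypes.
    static boolean typeSupported(String type, String fileTypes) {
        Set<String> accepted = new HashSet<>(Arrays.asList(fileTypes.split(",")));
        for (Map.Entry<String, String> e : MIME_MAP.entrySet()) {
            if (e.getValue().equals(type) && accepted.contains(e.getKey())) {
                return true;
            }
        }
        return false;
    }
}
```

With fileTypes set to "pdf,txt", the sketch accepts "application/pdf" and rejects "text/html".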
      • isOn

        protected static boolean isOn​(String property)
        Tests if a string is either "true", "on", "yes" or "1"
        Parameters:
        property - the string to test
        Returns:
        true if the string is one of the "on" values
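A check like isOn can be sketched as an exact match against the four listed values. Whether the real method is case-sensitive or tolerant of null is not stated above, so the choices below are assumptions.

```java
import java.util.Arrays;
import java.util.List;

class IsOnSketch {
    private static final List<String> ON_VALUES = Arrays.asList("true", "on", "yes", "1");

    // Hypothetical sketch: exact, case-sensitive match; null counts as "off".
    static boolean isOn(String property) {
        return property != null && ON_VALUES.contains(property);
    }
}
```

A helper like this lets boolean options be written in several common spellings.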
      • commit

        public void commit()
        Does a simple commit operation
      • optimize

        public void optimize()
        Does a simple optimize operation
      • appendParam

        public static String appendParam​(String url,
                                         String param)
        Appends a URL query parameter to a URL
        Parameters:
        url - the original URL
        param - the parameter(s) to append, separated by "&"
        Returns:
        the string version of the resulting URL
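Appending one or more "&"-separated parameters can be sketched as below. This is a hypothetical version written from the description; URL-encoding the values and skipping empty segments are assumptions about reasonable behavior, not confirmed details of the real method.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class AppendParamSketch {
    // Hypothetical sketch: appends each key=value pair using '?' for the first
    // query parameter and '&' for the rest.
    static String appendParam(String url, String param) {
        StringBuilder sb = new StringBuilder(url);
        for (String pair : param.split("&")) {
            if (pair.isEmpty()) continue;
            String[] kv = pair.split("=", 2);
            sb.append(sb.indexOf("?") >= 0 ? '&' : '?');
            sb.append(kv[0]);
            if (kv.length == 2) {
                sb.append('=').append(encode(kv[1]));
            }
        }
        return sb.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

For example, appending "commit=true&wt=json" to "http://localhost:8983/solr/update" yields "http://localhost:8983/solr/update?commit=true&wt=json" in this sketch.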
      • postFile

        public void postFile​(File file,
                             OutputStream output,
                             String type)
        Opens the file and posts its contents to the solrUrl, then writes the response to output.
      • guessType

        protected static String guessType​(File file)
        Guesses the type of a file based on its file name suffix. Returns "application/octet-stream" if the mimeMap has no corresponding type.
        Parameters:
        file - the file
        Returns:
        the content-type guessed
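Suffix-based type guessing can be sketched with a small lookup table. The entries below are a hypothetical subset chosen for illustration and are not the tool's actual mimeMap; lower-casing the suffix is also an assumption.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class GuessTypeSketch {
    // Hypothetical subset of a suffix -> content-type map.
    private static final Map<String, String> MIME_MAP = new HashMap<>();
    static {
        MIME_MAP.put("xml", "application/xml");
        MIME_MAP.put("json", "application/json");
        MIME_MAP.put("pdf", "application/pdf");
        MIME_MAP.put("txt", "text/plain");
    }

    static String guessType(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        String suffix = dot >= 0 ? name.substring(dot + 1).toLowerCase(Locale.ROOT) : "";
        String type = MIME_MAP.get(suffix);
        // Fall back to the generic binary type, as documented.
        return type != null ? type : "application/octet-stream";
    }
}
```

The fallback means a file with an unknown or missing suffix is still posted, just with a generic content type.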
      • doGet

        public void doGet​(String url)
        Performs a simple get on the given URL
      • doGet

        public void doGet​(URL url)
        Performs a simple get on the given URL
      • postData

        public boolean postData​(InputStream data,
                                Long length,
                                OutputStream output,
                                String type,
                                URL url)
        Reads data from the data stream and posts it to Solr, then writes the response to output
        Returns:
        true if success
      • stringToStream

        public static InputStream stringToStream​(String s)
        Converts a string to an input stream
        Parameters:
        s - the string
        Returns:
        the input stream
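Converting a string to a stream is typically a one-liner over the string's bytes. The sketch below assumes UTF-8 encoding, which the description above does not specify.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class StringToStreamSketch {
    // Hypothetical sketch: wraps the UTF-8 bytes of the string in an in-memory stream.
    static InputStream stringToStream(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }
}
```

This makes it possible to send a short literal command such as "<commit/>" through the same code path that posts file contents.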
      • getFileFilterFromFileTypes

        public FileFilter getFileFilterFromFileTypes​(String fileTypes)
      • getXP

        public static String getXP​(Node n,
                                   String xpath,
                                   boolean concatAll)
                            throws XPathExpressionException
        Gets the string content of the nodes matching an XPath expression
        Parameters:
        n - the node (or doc)
        xpath - the xpath string
        concatAll - if true, text from all matching nodes will be concatenated, else only the first returned
        Throws:
        XPathExpressionException
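The concatAll behavior of getXP can be illustrated with the standard javax.xml.xpath API. This is a hypothetical sketch: joining multiple matches with a single space and returning an empty string when nothing matches are assumptions, and the parse helper exists only to make the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {
    static String getXP(Node n, String xpath, boolean concatAll) throws XPathExpressionException {
        if (n == null) return "";
        XPath xp = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xp.evaluate(xpath, n, XPathConstants.NODESET);
        if (nodes.getLength() == 0) return "";
        if (!concatAll) {
            return nodes.item(0).getTextContent();    // only the first match
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (i > 0) sb.append(' ');                // assumption: space-separated
            sb.append(nodes.item(i).getTextContent());
        }
        return sb.toString();
    }

    // Illustration-only helper that parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Evaluating "//a/@href" with concatAll set to true is one way a crawler could collect every link on an XHTML page in a single call.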