I must be able to run these two projects on Eclipse. The first project must be similar to the screenshot that I have provided.
Building a Search Engine, Part I: Governance, Workflow, and UI
(This is the first project in this series.)
You are going to design, build, and test a scaled-down version of “Google Search”. Rather than searching the Internet’s files, you will only search local files added to your search engine’s index. Your search engine will allow an administrator to add, update, and remove files from the index. Users will be able to enter search terms, and select between Boolean AND, OR, or PHRASE search. The matching file names (if any) are then displayed in a list.
You also need to design the system architecture (the high-level design), so you can plan each part.
Search Engine Project Proposal:
Build a search engine with a simple GUI that can do AND, OR, and PHRASE Boolean searches on a small set of text files. The user should be able to specify the type of search to do and enter some search terms. The results should be a list of file pathnames that match the search. This should be a stand-alone application.
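As a rough illustration of the three search types, AND and OR can be treated as set intersection and set union over the files that contain each search term. The sketch below assumes the per-term matches are already available as sets of pathnames; the class and method names (BooleanSearchSketch, combine) are hypothetical, and PHRASE is left unimplemented because it needs word positions.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical names throughout; a sketch, not the required design.
class BooleanSearchSketch {
    enum SearchType { AND, OR, PHRASE }

    /**
     * Combines per-term result sets. Each element of perTermMatches holds the
     * pathnames of the files that contain one search term.
     */
    static Set<String> combine(SearchType type, List<Set<String>> perTermMatches) {
        if (type == SearchType.PHRASE) {
            // PHRASE search needs word positions, which this sketch does not model.
            throw new UnsupportedOperationException("PHRASE is not sketched here");
        }
        if (perTermMatches.isEmpty()) {
            return Collections.emptySet();
        }
        Set<String> result = new TreeSet<>(perTermMatches.get(0));
        for (Set<String> matches : perTermMatches.subList(1, perTermMatches.size())) {
            if (type == SearchType.AND) {
                result.retainAll(matches);   // intersection: files containing every term
            } else {
                result.addAll(matches);      // union: files containing any term
            }
        }
        return result;
    }
}
```

A TreeSet is used only so that the result list comes out in a stable, sorted order for display.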
In addition to the main user interface (for doing searching), you will need a separate administrator or maintenance interface to manage your application. It should be easy to add and remove files (from the set of indexed files), and to regenerate the index anytime. When starting, your application should check if any of the files have been changed or deleted since the application last saved the index. If so, the administrator should be able to have the index updated with the modified file(s).
Note that with HTML, Word, or other types of documents, you would need to extract a plain text version before indexing. That isn’t hard, but the search engine is complex enough already. For these projects, limit your search engine to only plain text files (including .txt, .html, and other text files).
The index must be stored on disk, so next time your application starts it can reload its data. The index, list of files, and other data, can be stored in one or more file(s) or in a database. The saved data should be read whenever your application starts. The saved data should be updated (or recreated) when you add, update, or remove documents from your set (of indexed documents), or perhaps just when your application exits. If you use files, the file formats are up to you; have a format that is fast and simple to load and store.
To keep things as simple as possible, in this project you can assume that only a small set of documents will be indexed, and thus the whole index can be kept in memory at once. (That’s probably not the case for Google’s data!) All you need to do is be able to read the index data from disk at startup into memory, and write it back either when updating the index, or when your application shuts down. Note, the names (pathnames) of the added files as well as their last modification time must be stored in addition to the index.
If using an XML file, you can define an XML schema for it and have some tool such as Notepad++ validate your file format for you. XML may have other benefits, but it isn’t as simple as using plain text files. JSON might be the easiest format for storing and reading the index data. In any case, don’t forget to include the list of file pathnames and other data you decide is needed, along with the index itself.
In this project, we will follow the model-view-controller design pattern for the project organization. This allows one to develop each part mostly independently from the other parts.
Develop Stub User Interfaces:
In this part of the project, you must implement a non-functional (that means it looks good but doesn’t do a thing) graphical user interface for the application. (The “view”.) The main (default) user interface must support searching and displaying results. It should have various other features, such as an “About…” menu or button, a way to quit the application (if a stand-alone application; if your group creates a web application, there is no need to quit), and a way to get to the administrator/maintenance view.
The maintenance/administrator view must allow the user to perform various administration operations: view the list of indexed file names, add files to the index, remove files from the index, and update the index (when files have been modified since they were indexed).
The user interface should be complete, but none of the functionality needs to be implemented at this time. You should implement stub methods for the functionality not yet implemented, and invoke them from your event handlers. The stub methods can either return “canned” (fake but realistic) data, or throw an OperationNotSupported exception. The only button that needs to do anything is the one used to switch to the maintenance view.
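The two kinds of stub described above might look like the following sketch. All names here (SearchModelStub, search, addFileToIndex) and the canned pathnames are hypothetical. The sketch throws java.lang.UnsupportedOperationException, the standard unchecked Java exception for unimplemented operations, on the assumption that this is the exception the text has in mind.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stub model for the non-functional UI; all names are assumptions.
class SearchModelStub {
    /** Returns canned (fake but realistic) results so the results list can be laid out. */
    static List<String> search(String terms, String searchType) {
        return Arrays.asList("/home/user/docs/notes.txt", "/home/user/docs/todo.txt");
    }

    /** Not implemented yet; an event handler can call this and report the failure. */
    static void addFileToIndex(String pathname) {
        throw new UnsupportedOperationException("addFileToIndex is not implemented yet");
    }
}
```

An event handler would call these methods exactly as it will call the real ones later, so only the method bodies need to change in the later parts.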
Since the user interfaces don’t do anything, there is nothing to test yet. However, you must create a test class with at least one test method (it can just return success if you wish). I suggest you agree to use JUnit 4 style tests for now.
Building a Search Engine, Part II: Persistent Data
Please read the background information and full project description from Search Engine Project, Part I. In this project, you will implement the persistent data (the “model”) part of the project: the saving of data and the loading of data at the next start. The persistent data contains the list of files used in the index, and the index itself.
First discuss which persistence solution you will use: text files, XML or JSON files, or a database. If you use a database, choose between embedded (my suggestion) and server, and choose between the JDBC and JPA database APIs (I suggest JPA). You can make this decision before knowing the details of the data structures used.
Before working on actual code, you need to decide on the data structures to be used for the file list and the inverted index. Try to read the Java collections material before deciding.
It should be easy to add and remove files (from the set of indexed files). When starting, your application should check if any of the files used have been changed or deleted since the application last saved the index. If so, the “admin” user should be able to have the inverted index file(s) updated, from the maintenance interface.
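The startup check could be sketched as below, assuming the modification time saved with the index is the millisecond value returned by java.io.File.lastModified(). The class name, the enum, and the method name are hypothetical.

```java
import java.io.File;

// Hypothetical helper; names are assumptions, not part of the assignment.
class FileStatusCheck {
    enum Status { UNCHANGED, MODIFIED, DELETED }

    /**
     * Compares a file on disk with the modification time recorded when it was indexed.
     * recordedMillis is the File.lastModified() value that was saved with the index.
     */
    static Status check(String pathname, long recordedMillis) {
        File file = new File(pathname);
        if (!file.isFile()) {
            return Status.DELETED;   // missing, or no longer a regular file
        }
        return file.lastModified() == recordedMillis ? Status.UNCHANGED : Status.MODIFIED;
    }
}
```

Running this over every saved pathname at startup gives the maintenance view a list of files the admin may want to re-index or remove.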
(Note that with HTML or Word documents, you would need to extract a plain text version before indexing.) In this project, all the “indexable” files are plain text. You are free to assume the system-default text file encoding, or assume UTF-8 encoding, for all files.
The inverted index can be stored in one or more file(s), which should be read whenever your application starts. The file(s) should be updated (or recreated) when you add, update, or remove documents from your set (of indexed documents). The file format is up to you, but it should be fast and simple to search. However, to keep things simpler, in this project you can assume that only a small set of documents will be indexed, and thus the whole index can be kept in memory. All you need to do is be able to read the index data from a file at startup into memory, and write it back when updating the index. Don’t forget that the names (pathnames) of the files, as well as their last modification times, must also be stored. It is your choice to use a single file or multiple files, in plain text, JSON, XML, or any format your group chooses, to hold the persistent data. If you want, you can use any DBMS. (In that case, I suggest using the JavaDB included with the JDK, as an embedded database.) In any case, your file format(s) or database schema must be documented completely, so that someone else, without access to your source code, could use your file(s) or database correctly.
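One possible single-file plain text format is sketched below. It is an assumption for illustration, not the model solution's format. Each indexed file gets an "F" line (tab, last-modified time in milliseconds, tab, pathname), and each word gets a "W" line (tab, word, then one tab-separated fileId:position pair per occurrence). The file id is the zero-based order of the "F" lines. The class and method names are hypothetical, and pathnames containing tab characters are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical text format and names; a sketch under the assumptions stated above.
class IndexTextFormat {
    /** Produces the "F" lines (one per file) followed by the "W" lines (one per word). */
    static List<String> toLines(List<String> paths, List<Long> modTimes,
                                Map<String, List<int[]>> index) {
        List<String> lines = new ArrayList<>();
        for (int id = 0; id < paths.size(); id++) {
            lines.add("F\t" + modTimes.get(id) + "\t" + paths.get(id));
        }
        for (Map.Entry<String, List<int[]>> entry : index.entrySet()) {
            StringBuilder sb = new StringBuilder("W\t").append(entry.getKey());
            for (int[] posting : entry.getValue()) {   // posting = {fileId, position}
                sb.append('\t').append(posting[0]).append(':').append(posting[1]);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    /** Rebuilds the word-to-postings map from the "W" lines; other lines are skipped. */
    static Map<String, List<int[]>> indexFromLines(List<String> lines) {
        Map<String, List<int[]>> index = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (!fields[0].equals("W")) {
                continue;
            }
            List<int[]> postings = new ArrayList<>();
            for (int i = 2; i < fields.length; i++) {
                String[] pair = fields[i].split(":");
                postings.add(new int[] { Integer.parseInt(pair[0]), Integer.parseInt(pair[1]) });
            }
            index.put(fields[1], postings);
        }
        return index;
    }
}
```

The lines can be written with java.nio.file.Files.write(path, lines, StandardCharsets.UTF_8) and read back with Files.readAllLines(path, StandardCharsets.UTF_8). Reading the "F" lines back into the file list follows the same pattern as indexFromLines.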
If using XML format, you can define an XML schema for your file and have some tool such as Notepad++ validate your file format for you. XML may have other benefits, but it isn’t as simple as plain text files or even JSON files. In any case, don’t forget to include the list of file (path) names, along with the index itself, in your persistent data store.
Part II Requirements:
In this part, you must implement the file operations of your search engine application (the model). That includes reading and updating your persistent data (that is, the inverted index as well as any other information you need to store between runs of your application, such as the list of files (their pathnames) that have been indexed). The main file operation is reading each file to be indexed a “word” at a time; you also need to check whether the previously indexed files still exist or have been modified since they were last indexed.
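Reading a file a “word” at a time can be done with java.util.Scanner, as in the sketch below. What counts as a word is up to you; this example assumes a word is a run of letters and digits, folded to lower case. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

// Hypothetical helper; the definition of a "word" here is an assumption.
class WordReader {
    /** Returns the words of the source in order; a word's list index is its position. */
    static List<String> words(Readable source) {
        List<String> words = new ArrayList<>();
        try (Scanner in = new Scanner(source)) {
            // Any run of characters that are not letters or digits separates words.
            in.useDelimiter("[^\\p{L}\\p{N}]+");
            while (in.hasNext()) {
                words.add(in.next().toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }
}
```

For a real file, pass in the reader returned by java.nio.file.Files.newBufferedReader(path, StandardCharsets.UTF_8), which throws an IOException if the file does not exist or cannot be read.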
The maintenance part of the user interface should allow users to select files for indexing, and to keep track of which files have been added to the index. For each file, you need to keep the full pathname of the file as well as the file’s last modification time. Your code should correctly handle the user entering non-existent files and unreadable files. How you handle such errors is up to you.
You can download a Search Engine model solution, to play with it and inspect its user interface. My solution keeps all persistent data in a single text file in the user’s home directory, but you can certainly use a different persistence solution.
Keep your code simple.
Possible Data Structures You Can Use:
In Part III, you will implement the index operations, including Boolean searching, adding to the index, and removing files from the index. (The index is a complex collection of collections.) Because the format of the index and file list will affect the code used to read and write them to and from storage, you must decide early on the in-memory data structures to be used. In the model solution, I used a List of FileItem objects for the list of indexed files; each FileItem contained a file’s pathname and the date it was read for the index. The index data itself is stored in a Map, using the indexed words as keys and a Set of IndexData objects as the values. Each IndexData object holds the id of the file containing the word and the position of the word in that document. (The classes FileItem and IndexData were trivial to write.)
This is NOT the only, or the best, way to represent the index or file list! (For example, a List of int arrays might be simpler than a Set of IndexData objects.) You should decide on the types of collections used. Only then can you implement the methods to read and write the data.
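For reference, the model-solution layout described above (a List of FileItem objects plus a Map from words to Sets of IndexData objects) might be sketched as follows. Only the class names FileItem and IndexData come from the description. The field names, the InvertedIndexSketch class, the addFile method, and the use of a file's list position as its id are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Field names and the id scheme are assumptions, not the model solution's code.
class FileItem {
    final String pathname;
    final long timestamp;   // time recorded for the file, in milliseconds

    FileItem(String pathname, long timestamp) {
        this.pathname = pathname;
        this.timestamp = timestamp;
    }
}

class IndexData {
    final int fileId;     // id of the file containing the word
    final int position;   // position of the word within that file

    IndexData(int fileId, int position) {
        this.fileId = fileId;
        this.position = position;
    }

    // equals and hashCode are needed so that a Set treats equal postings as one.
    @Override public boolean equals(Object o) {
        return o instanceof IndexData
                && ((IndexData) o).fileId == fileId
                && ((IndexData) o).position == position;
    }

    @Override public int hashCode() {
        return Objects.hash(fileId, position);
    }
}

class InvertedIndexSketch {
    final List<FileItem> files = new ArrayList<>();
    final Map<String, Set<IndexData>> index = new HashMap<>();

    /** Registers a file and indexes its words; returns the id given to the file. */
    int addFile(String pathname, long timestamp, List<String> words) {
        int fileId = files.size();   // assumption: id is the position in the file list
        files.add(new FileItem(pathname, timestamp));
        for (int pos = 0; pos < words.size(); pos++) {
            index.computeIfAbsent(words.get(pos), k -> new LinkedHashSet<>())
                 .add(new IndexData(fileId, pos));
        }
        return fileId;
    }
}
```

Using the list position as the id keeps the sketch short, but removing a file from the list would shift the later ids, so a real design needs either stable ids or a rebuild step on removal.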