A free-text indexer is a piece of software that extracts the textual content of a document containing additional visual formatting (in this case, HTML tags) and stores it in a meaningful, easily retrievable format. The aim of this project was to develop a simple free-text indexer for HTML-based documents, capable of returning search results for one or more sites of between 50 and 200 pages each.
The primary objective of a free-text indexer for HTML documents is to retrieve the pure textual content of one or more HTML documents and store it in a database in a form that can be easily retrieved. The free-text indexer developed in this project parses only HTML documents with a .htm or .html extension. Additional objectives of this project include performance testing of the developed algorithm, in terms of both content indexing time and content retrieval time.
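The extract-and-store step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual implementation: the names `TextExtractor`, `extract_text`, `is_indexable` and `store_page`, the `pages` table layout, and the choice of SQLite as the database are all hypothetical. It uses the standard-library `html.parser` and `sqlite3` modules.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the plain text of an HTML string with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def is_indexable(path):
    """Only .htm and .html files are parsed, as stated in the text."""
    return path.lower().endswith((".htm", ".html"))


def store_page(conn, path, html):
    """Store the extracted text in a hypothetical `pages` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (path TEXT PRIMARY KEY, body TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO pages (path, body) VALUES (?, ?)",
        (path, extract_text(html)),
    )
    conn.commit()
```

A caller would typically walk a site's files, pass each path through `is_indexable`, and call `store_page` on the ones that qualify. Timing that loop gives a rough measure of content indexing time.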
The key to developing an effective indexing algorithm lies in accounting for all the content in a page while keeping the space used to store it to a bare minimum. The indexing algorithm developed in this project is quite primitive compared to grandiose schemes such as Google's. However, the purpose of this project is to serve as the foundation for a more effective scheme aimed at improving content retrieval time. Content indexing time, though important, is not critical to the ‘user experience’, which should take a higher priority. Furthermore, algorithms that improve content indexing time at the cost of content retrieval time also increase the real-time load on the web servers that service content retrieval requests.
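One common way to favour retrieval time over indexing time is an inverted index, which maps each term to the set of pages containing it. The source does not say which data structure the project uses, so the sketch below is an assumption for illustration only; `tokenize`, `build_index` and `search` are hypothetical names. More work is done once at indexing time so that each query reduces to a few dictionary lookups and set intersections.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build term -> set of page paths from a dict of path -> plain text."""
    index = defaultdict(set)
    for path, text in pages.items():
        for term in tokenize(text):
            index[term].add(path)
    return index


def search(index, query):
    """Return the pages that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Storing each term once, with a set of page references, also keeps the space cost lower than storing the full text per query term, at the price of losing word order and position.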
Download the final report.