A Parkinson’s Search Engine using an Intelligent Solution
Frederick Wythe Dabney
The Idea
• Search engines of the past (and present) have relied on flat data; in other words, there is no intelligent element to them.
• Google, for instance, takes only the search query provided by the user and returns results based solely on that input.
• To make a search more relevant to a user, I created a primitive search engine that takes user data entered as a profile and uses it, in conjunction with an ontological backbone, to alter the given query and return a more powerful search.
The Big Picture
• What I tried to create was both a specific search based on a limited data set (a fixed number of web pages) and a general framework that could be easily adapted to any illness or disease simply by editing the base ontology.
• I set up the base structure for running a customizable search and provided an example using Parkinson’s Disease as my base ontology.
The Base Ontology
• The base ontology is a simple structure of what a typical disease profile might look like without user data filled in. The next slide is an illustration of the base ontology for Parkinson’s Disease.
The Implementation – Ontologies
• To create the base ontology, I first used Protégé to set up my class structure (although that step wasn’t strictly necessary). I then took the OWL/RDF document produced and opened it with Jena in the Java/Eclipse environment. Using Jena, I added annotation properties to the leaf classes; these served as keywords that were injected into the search query with a given weight. For instance, I would have a class “Tremor” and use Jena to insert the annotation properties “shaking 3” and “tremor 4”. The first token is the keyword that will be added to the query; the second is the weight used with Lucene’s setBoost method. I will show later how this was used (a sketch of the annotation step follows below).
• Clearly this could have been done a number of ways; the annotation properties could have been subclasses linked with something like a “hasKeyword” property, but this was simply how I chose to implement it.
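A minimal sketch of the annotation step, assuming Jena’s OntModel API (the file name, namespace, and “keyword” property URI here are illustrative, not the project’s actual code):

import org.apache.jena.ontology.AnnotationProperty;  // older Jena releases use com.hp.hpl.jena.*
import org.apache.jena.ontology.OntClass;
import org.apache.jena.ontology.OntModel;
import org.apache.jena.rdf.model.ModelFactory;
import java.io.FileOutputStream;

public class KeywordAnnotator {
    static final String NS = "http://example.org/parkinsons#";  // hypothetical namespace

    public static void main(String[] args) throws Exception {
        OntModel model = ModelFactory.createOntologyModel();
        model.read("file:parkinsons.owl");  // OWL/RDF document exported from Protege

        // Custom annotation property; each value is "<keyword> <weight>"
        AnnotationProperty keyword = model.createAnnotationProperty(NS + "keyword");
        OntClass tremor = model.getOntClass(NS + "Tremor");
        tremor.addProperty(keyword, "shaking 3");
        tremor.addProperty(keyword, "tremor 4");

        model.write(new FileOutputStream("parkinsons-annotated.owl"));
    }
}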
The Implementation – Ontologies Cont.
• In order to use these classes, some individuals need to be created. This is where the base ontology is used to create a profile. Using Java Swing as a GUI frontend, I have the user enter profile data. This is a one-time step, and the user can log back in after filling in their info. Each radio button the user checks corresponds to a specific class.
• When such a class is recognized, I use Jena to create an individual named True or False (a sketch follows below). Again, I could have gone about this a number of different ways. However, I thought that if this were scaled up or adapted, the data set used could easily be converted into a bit array, which is extremely fast and efficient: instead of a long list of strings and string comparisons, you have a single bitmap that takes very little storage and is easily traversed. I will get into how the True/False individuals are used later.
• Tools used for ontology creation: Protégé and Jena.
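A rough sketch of the individual-creation step for one radio button, again assuming the Jena OntModel API (the namespace and naming scheme are illustrative):

import org.apache.jena.ontology.Individual;
import org.apache.jena.ontology.OntClass;
import org.apache.jena.ontology.OntModel;

public class ProfileBuilder {
    static final String NS = "http://example.org/parkinsons#";  // hypothetical namespace

    // Record a radio-button answer by creating a True or False individual of its class.
    static Individual recordAnswer(OntModel model, String className, boolean checked) {
        OntClass cls = model.getOntClass(NS + className);
        String value = checked ? "True" : "False";
        // The search code later injects keywords only for classes whose individual is True.
        return cls.createIndividual(NS + className + "_" + value);
    }
}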
The Implementation – Crawling and Indexing
• Crawled several seed URLs, created several indexes, and merged them to form one index, which became my Beginner index.
• Used the PubMed web service to return several thousand results for the query “Parkinson’s Disease”. Took these results, added each as a seed URL, did a crawl of depth one on each, and combined these into a large index that served as the Advanced or Expert index.
• Merged both indexes to create the “All” index, which searches everything; the option is available on the main search page (a sketch of the merge step follows this list).
• I also wrote my own PDF parser, because the crawler provided with the book wasn’t fully functional. I integrated it into the book’s web crawler and indexer with the help of PDFBox.
• Tools used were the “Algorithms of the Intelligent Web” web crawler, Lucene for indexing and merging the indices, and PDFBox for parsing the PDF documents. All other parsing was provided by the crawl implementation. I used PubMed’s web service to obtain the seed URLs for the Expert/Advanced crawl.
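The merge step might look roughly like this with a Lucene 3.x-era API (the directory names are assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import java.io.File;

public class IndexMerger {
    public static void main(String[] args) throws Exception {
        Directory beginner = FSDirectory.open(new File("index-beginner"));
        Directory expert   = FSDirectory.open(new File("index-expert"));
        Directory all      = FSDirectory.open(new File("index-all"));

        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(all, cfg);
        writer.addIndexes(beginner, expert);  // combine Beginner and Expert into the "All" index
        writer.close();
    }
}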
GUI Frontend
• This will be shown when I go over the code, but essentially I created a number of GUI windows; when data is “submitted”, the current window closes and the next one opens, simulating a postback like a normal website would, again for easy adaptability (a minimal sketch follows below).
• First page – Login or Create New Profile
• Second page – Enter Username, Password, and Basic Info
• Third page – Enter Profile Details (specifics)
• Fourth page – Search Page (can also be reached from the First Page)
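A minimal sketch of the window-to-window flow (the JFrame instances are hypothetical; this only shows the dispose-and-open pattern used to simulate a postback):

import javax.swing.JFrame;

public class PageFlow {
    // On "submit", close the current window and open the next one.
    static void goTo(JFrame current, JFrame next) {
        current.dispose();
        next.setVisible(true);
    }
}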
Features
• PDF parser
• Indexing, ranking, and index merging
• Search with a “Did you mean…” feature using the Lucene SpellChecker and a dictionary index (sketched below)
• PubMed web service integration
• Weighted search using an injected query, non-weighted search using an injected query, search on a class subset, and a normal search that uses neither the profile nor query weighting
• Adding hyperlinks to the results list, with the ability to open a hyperlink in the default browser
• A customized set of query results relevant to the user profile given
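A sketch of the “Did you mean…” lookup against a previously built spell index, assuming the older Lucene SpellChecker API (the index path and misspelled query are examples only):

import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.FSDirectory;
import java.io.File;

public class DidYouMean {
    public static void main(String[] args) throws Exception {
        // Spell index built beforehand from the dictionary index of crawled terms
        SpellChecker spell = new SpellChecker(FSDirectory.open(new File("spell-index")));
        String[] suggestions = spell.suggestSimilar("parkensons", 5);
        for (String s : suggestions) {
            System.out.println("Did you mean: " + s + "?");
        }
        spell.close();
    }
}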
Problems
• One major problem was URL parsing. If I had to do this over, I would use a more professional web crawler, even though this one was very good as a learning experience. There were many instances where the crawler could not recognize that certain pages were actually the same (e.g. http://www.michaeljfox.org/living_additionalResources_glossary.cfm#b and http://www.michaeljfox.org/living_additionalResources_glossary.cfm#c). I didn’t realize this was occurring until too late, but by that point I was too far along anyway. (A small normalization sketch follows this list.)
• I also had problems with forums. A thread on a forum can span many pages, but it would not be recognized as one logical unit.
• Another issue (not quite a problem) that is still a mystery to me: when indexing the PubMed seed URLs, about 90% of them kept failing. I tinkered with it a bit and found that if I kept the batch size (the number of pages processed at a time by the crawler and Lucene) at 7, I didn’t have any problems.
• I also think the Lucene indexer itself might have some minor issues that need to be worked out. When I was originally testing, I used an American Heart Association web page for my crawl. Every time I searched that index, no matter what the query was, it returned three specific pages containing a Flash animation and very little text as the most relevant results. I don’t know why it kept returning these pages; I checked what was actually indexed from them, and it still confuses me, because they are certainly not the most relevant. That experience made me question Lucene itself, although it is more likely the web crawler again, probably parsing incorrectly.
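One way to normalize such URLs would be to strip the fragment before comparing them, e.g. with java.net.URI (a sketch of a possible fix, not something the crawler actually did):

import java.net.URI;

public class UrlNormalizer {
    // Strip the #fragment so .../glossary.cfm#b and .../glossary.cfm#c compare equal.
    static String stripFragment(String url) throws Exception {
        URI u = new URI(url);
        return new URI(u.getScheme(), u.getSchemeSpecificPart(), null).toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(stripFragment(
            "http://www.michaeljfox.org/living_additionalResources_glossary.cfm#b"));
        // prints http://www.michaeljfox.org/living_additionalResources_glossary.cfm
    }
}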
Results and Conclusions
• In the end, I was able to produce the results I was after. The weighted search was by far the most important element of the whole project, allowing query injection while maintaining the importance of the actual query given by the user at search time (what was entered in the search box); a sketch follows below.
• In almost all cases, the search given at runtime returned results that pertained to the individual’s symptoms or areas of research interest.
• The results were not perfect, but I attribute that in part to the crawler used.
• If I were to take this project to a practical level, I would integrate it into a website environment and use a professional crawler that updates automatically. I would also add a feed, displayed on the side of the search page upon logging in, showing new news articles that pertain to the user’s research interests. I was going to do this but didn’t have time.
• All in all, I believe the project was a success as a proof of concept, and it would be genuinely worthwhile in real-world use as a specialized website.
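Roughly, the weighted search combines the query typed in the search box (kept at the highest boost) with keywords injected from the profile, using Lucene’s setBoost (a sketch assuming the Lucene 3.x query API; the field name and boost values are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.Version;

public class WeightedSearch {
    static Query buildQuery(String userQuery) throws Exception {
        BooleanQuery combined = new BooleanQuery();

        // The user's own query keeps the highest weight
        QueryParser parser = new QueryParser(
            Version.LUCENE_36, "contents", new StandardAnalyzer(Version.LUCENE_36));
        Query typed = parser.parse(userQuery);
        typed.setBoost(10f);
        combined.add(typed, BooleanClause.Occur.SHOULD);

        // Keywords injected from the profile's True classes, e.g. "shaking 3", "tremor 4"
        combined.add(boosted("shaking", 3f), BooleanClause.Occur.SHOULD);
        combined.add(boosted("tremor", 4f), BooleanClause.Occur.SHOULD);
        return combined;
    }

    static Query boosted(String keyword, float weight) {
        TermQuery q = new TermQuery(new Term("contents", keyword));
        q.setBoost(weight);
        return q;
    }
}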