March 2015
This blog is intended for an expert audience that has to deal with (document) search in technical domains. It should reflect major challenges in search assisting tasks as well as in technical speedups using RDF techniques to implement search assistance.
When an expert person (student, professor, scientist) searches in her domain for some special keys or issues – she/he expects mostly to find documents containing the given keywords and reflecting exactly her/his need of information. A further possible motivation in searching is to see the relevance of her issues counting for example the number of result documents containing them or to know “who wrote this and that”. I will here focus here on the first use case, where the search expert types some keywords and gets results from the search engine containing information on those given keywords reflecting her needs. What can happen here?
1) Too many result documents containing the issue – what should I read first?
2) Too few result documents on that issue – are there such few documents about that?
Case 1) needs a reduction of results (“search reduction”), while case 2) requires possibly an expansion of results by considering adding opportune synonyms or further key words to the original search query text (“search expansion”).
As in /rdgi2010/, /esctl2012/ experienced there is a strong evidence for the need of search assistance by means of technical reference documents like (SKOS or OWL) ontologies which are “focused” / “positioned” on relevant key words out of the query text during search generating a further (meta) dimension giving hints and suggestions on related information while the person is searching for her keywords or issues.
Figure 1: Traditional search for „Cell growth inhibitory effect“ with a portion of result documents
Ontologies reflect special (meta) knowledge on technical issues on a given domain. Having one or several ontologies opened at concepts relevant to a query text help seeing more entities semantically related with the query than the query itself might express. This process can be thought of having a group of domain experts which focus on each relevant term inside your query and giving a complementing advice on what you have possibly not yet thought of formulating your query. These domain experts report to one main expert, which in turn filters and passes the suggestions to the searcher. Figure 2 shows an example of ontology empowered suggestions as additional informational layer besides the „naked“ result documents of figure 1.
Figure 2: Ontology Assisted Search for „Cell growth inhibitory effect“ with some result documents, semantic relations and semantic facets
An ontology based computer system should nowadays first linguistically analyze the query term, then extract all possible compound term combinations, link each of these terms with matched entities inside the active ontologies and produce an opportune compact representation of the so gained specific related expert information to the searcher: hints and suggestions coming from the used ontologies. Summarizing, there are here at least
1) Linguistics challenges – understanding the query text
2) Semantics challenges – retrieving opportunely semantic related information
3) Usability challenges – representing compact information as a useful suggestions
4) Completeness challenges – are these really all suggestions I can get?
In order to keep this blog short, I’d like to add some notes on the latter two challenges. Often even having domain experts opening up to your query terms and presenting suggestions we still experience too much (expert) data (3) which fortunately is still much less than the amount of retrievable result documents but still need to be “pruned“ in some way. If we prune that information – are we doing a “favor” to the searcher? A better practice addressing (4) would be to inform/signal the searcher that the suggestions she is considering were “pruned” and we must give her the possibility to exploit all the computed suggestions once the searcher decided to look at all of them. How can suggestions be automatically pruned? Although there might be some heuristics approximating a good job – e.g. calculating some textual or semantic distances from an “intended” meaning of the query and related terms – Personally I momentarily discourage system developers from doing this, since to my opinion the only actor which is nowadays able to successfully prune that (meta) knowledge should be again a person.
RDF spaces are large Linked Data spaces where knowledge is represented through so-called “knowledge-Graphs” and it is stored as RDF triples /RDF/. RDF spaces offer the great advantage of containing fluid, highly configurable and fusible entities descriptions, which can be easily taken to build compact (meta) information (i.e. suggestions) while still presenting a high degree of detail regarding entities. A SKOS thesaurus or an OWL ontology will be always representable by RDF Triples. They can be informationally „fused“ by means of SPARQL queries allowing for compacted (meta) information to be presented as ontology assistance (suggestions) during the searching process. SKOS and OWL are de-facto models to represent technical domain knowledge. They contain specific semantic relations, which – together with the query text – can be used to gather related (meta) information – so called in /rdfse2013/ “semantic facets” effectively usable to reduce ore expand the search results.
A further benefit of RDF spaces is the automatic exploitation of inferred ontology statements: current best RDF stores infer from implicit ontology facts all possible deducible facts materializing a so called transitive closure of facts which can be used as a valuable extension to semantic facets in Search Assistance (this inferencing is called „forward reasoning“).
In this section some pitfalls I discovered while realizing the demonstrator shown in /OASd2015/ are skeched.
1) Linguistic pitfalls – Searching for a natural language text – even if it is a shortened / abbreviated or full detailed query text must consider all abstractions and tricks used in that language (beeing that English or German and so on …). Curious pitfalls arose concerning matching of plural/singular entity names in conjunction with query compound terms. Using stop words lists might be inconvenient while trying to match complex compound terms containing them. Calculating compound terms should be done extensively – not stopping at the first most complex matched compound term but continuing matching each possible minor compound term.
2) RDF semantic retrieving (SPARQL) – The cleanness/completeness of relation data is crucial for SPARQL retrieving: before use prepare each ontology to be used in Search Assistance by filtering/repairing/completing each of its entities and relation descriptions before you use the ontology. Refer also to current validators like http://www.w3.org/2004/02/skos/validation, http://mowl-power.cs.man.ac.uk:8080/validator/ or http://www.w3.org/RDF/Validator/ and keep a parallel instance of your RDF store running for test queries in order to see what is still missing from your ontology data and add the missing RDF material to the same ontology.
3) Autocomplete retrieving – The autocomplete function (empowered by ontologies) help users in cases of misspelling terms helping them reaching the most relevant and more contextualised information, working as an effective cognitive „prothesis“. Here RDF stores offer no real help since every RDF triple is stored and (in some of them) indexed as a whole object – retrieving possible spell variants implies having properly indexed every single sub-token of each single triple object (text) value. I therefore advice to index in addition to your RDF store content every ontology describing triple into a powerful (but standard) search engine like SOLR or Elastic Search in order to effectively help spelling/completing the query text using the ontology content.
Automatically focusing ontologies on terms involved or gathered from a query text involves the use of NLP (natural language processing) tools for token analysis in a sentence and SPARQL retrieving for gathering the completing (meta) information for the search suggestions. But what if the searching person had omitted in her query text “response-inhibiting” from “immune response-inhibiting cell surface receptor signaling pathway”, thus expressing a query text like “immune cell surface receptor signaling pathway” ? The computer system using pure SPARQL queries will not match the corresponding class “immune response-inhibiting cell surface receptor signaling pathway”! That was the reason I decided to use in alternative to my own developed SPARQL (exactly) retrieving engine a text search based approximating matching, which is present in every current search engine. On the basis of that method we have every RDF Triple – from each ontology – indexed in that search engine. Using a “Spell-Check-Match” on the objects of that triples will match also a concept containing a superset of sub-token like the one before. On each textually retrieved triple an associated SPARQL query retrieves then quickly all the related semantic information concerning the entity partially described by that triple.
The results in retrieving using textual approximation are ranked by the search engine itself in accordance to some textual distance from the query text. Should the default ranking not be in every case satisfactorily, additional rankings can be programmed to help seeing “most complex” or Levensthein-closed concepts firsts. I would not advice using search engine boost technologies, since these could even here politically be used to influence the way information is presented.
The demonstrator shown in /OASd2015/ presents several positive aspects of retrieving (meta) information usable for reducing or expanding search on the basis of several (inferred and non inferred) ontologies like /GENEO/, /MEsH/, /EUROVOC/. Despite of a good concept matching using both Textual and SPARQL concept retrieving some usability issues (still quite complicated to use) seem to be addressable. It is still questionable whether a semantic autocomplete should offer all the ontology information as a spell. Thanks to the adopted RDF technology it is easily thinkable to foresee the use of remote ontologies from the outer linked data space over SPARQL endpoints for a specific personal and domain focused use of this Search Assistance method. I recognized some speedups in compiling (aggregating) RDF information from RDF Stores into Text Search Engines. In a previous prototype realized in /rdfse2013/ the information of all the triples describing an entity of one ontology were aggregated and stored as one and only search document indexed inside the search engine. With one retrieve step the whole semantic information of an entity (not only of its single constituting RDF triple) was retrieved and directly used for displaying semantic facets in a search portal. The only drawback of this method was the missing aggregation “on the spot” of semantic facets coming from other ontologies.
In a nutshell: I believe that pure RDF/SPARQL technologies do not always lead to the target, especially when dealing with possibly misspelled natural search query text. This lack of flexibility is on the contrary widely and easily filled by text search engines coupled with RDF Stores as e.g. shown in /OASd2015/.
/esctl2012/ | An E-Science Tool for Managing Information in the Web of Documents and the Web of Knowledge (en) [J. Belmonte, E. Blumer, F. Ricci, R. Schneider] – IMCW 2012, Ankara, (T) |
/EUROVOC/ | Eurovoc – http://eurovoc.europa.eu/drupal |
/GENEO/ | The GENE Ontology – http://geneontology.org |
/MEsH/ | Medical Subject Headings – http://bioportal.bioontology.org/ontologies/MESH |
/OASd2015/ | Ontology Assisted Search Demonstrator – http://semweb.ch/leistungen/rdfservices/en-semsearch |
/onto2013/ | Trialogo De Ontologia (de) [M. Grilli, F. Ricci, R. Schneider] – GMS MBI, Wien (A) – Only in German. |
/RDF/ | Resource Description Framework: http://www.w3.org/RDF |
/rdfse2013/ | Linking Search Results, Bibliographical Ontologies and Linked Open Data Resources (en) [F. Ricci, Javier Belmonte, Eliane Blumer, René Schneider] – MTSR, Thessaloniki (G) November 2013 |
/rdgi2010/ | Verwendung von SKOS-Daten zur semantischen Suchfragen- Erweiterung im Kontext des individualisierbaren Informationsportals RODIN [R. Schneider, F. Ricci] – DGI 2010, Frankfurt am Main – Only in German. |
/rodin2010/ | A medium-weight portal for the aggregation and mashing of heterogeneous data sources (en) [F. Ricci, R. Schneider] – WEBIST 2010, Valencia (E) |