User:Jnweiger/Wiki search
Written July 19, 2010, in the hope that it will soon be outdated.
Google search prototype
Matt Ehle established http://en.opensuse.org/MediaWiki:GoogleSearch. This does exactly what I'd expect from a decent search engine. The site is marked as a non-profit, so the advertising is gone now.
The Google search is more of a temporary solution until we can get Lucene into play. There are some technical barriers to getting Lucene up and running. (Matt is the driving force here again. Thanks!) See BNC#625677.
Why the default wiki search sucks
There are a number of complaints on our mailing lists:
- http://lists.opensuse.org/archive/opensuse-wiki/2010-07/msg00196.html
- http://lists.opensuse.org/opensuse-wiki/2010-07/msg00217.html
- http://lists.opensuse.org/opensuse-project/2010-07/msg00269.html
Examples of issues
- Try to find the e-mail address of an openSUSE user. You can, e.g., search for a Board member like Alan Clark, who is definitely not hiding himself. Not even the SUSE internal tel* tools can find his e-mail address.
- Search for 'SUSE_ASNEEDED': nothing is found, because the one page, /Packaging/Fixing, has not been migrated. The Google search also includes the old-en wiki and locates the page.
--> Issues:- A user would not know that there is an old wiki with lots of valuable content.
- SUSE_ASNEEDED is more often the answer than the question. I'd like to get some pointers in the right direction when searching for 'linker library order symbol lookup "undefined reference to"'.
- Search for 'video'. It falsely claims, e.g.:
No page title matches
There is no page titled "video". You can create this page.
--> Issues:- The wording suggests this is a general truth. There is no hint that the list of selected namespaces makes any difference.
- "I know that a page titled "video" exists, please tell me in which namespace it is." This type of question cannot be asked. (Answer in this case: 'HCL:Video').
- Search for "nvidia" and note that the first article shown is something about compiz, with the proper one nowhere in sight.
--> Issues:- User might give up browsing the results too early.
- User might get distracted with a result that appears 'good enough'.
- I looked for 'Ambassadors' with no result. The small notification that only some namespaces are searched is especially tricky, all the more so for an ordinary user sitting in front of the wiki.
--> Issues:- A user might not know "What's a name space?"
- A user might not know "what's the difference between Main and openSUSE?"
- "etc."
- The need to teach this to our users is an entry barrier.
- The official workaround is to create a portal page containing all the words that a user might want to search for.
--> Issues:- It is hard to predict what a user might want to search for. We could try to use the past (the list of failed searches) as a predictor for the future, but this is tedious and never-ending work.
- It will cause a proliferation of portal pages once this concept is well known, thus spoiling the effect of a reduced hit count in the long run. (Okay, I am a pessimist here.)
- Searches for page titles need to be exact. 'bugreport', 'bug report', and 'report a bug' all fail to match a page title; only 'report a Bug' currently succeeds.
--> Issues:- The user is guided to a random page content match and may believe we have nothing better.
- The user may learn over time that the search engine 'randomly fails', without learning how to avoid this situation.
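The title problems above ('video' hidden in another namespace, 'bugreport' failing on exact match) could be avoided with a more forgiving lookup. The following is only a minimal sketch of the idea, not how MediaWiki works: the page list is made up for illustration (only 'HCL:Video' is taken from the example above; the other namespaces are assumptions), and `normalize` and `find_titles` are hypothetical helper names. It ignores case, spacing and namespace, and falls back to a similarity measure from Python's standard `difflib` module.

```python
import difflib

# Hypothetical sample of (namespace, title) pairs. Only HCL:Video comes from
# the text above; the other namespaces are assumed for illustration.
PAGES = [
    ("HCL", "Video"),
    ("openSUSE", "Report a Bug"),
    ("Main", "NVIDIA"),
    ("openSUSE", "Ambassadors"),
]


def normalize(title):
    """Lower-case a title and drop everything except letters and digits."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def find_titles(query, pages=PAGES, cutoff=0.6):
    """Return 'Namespace:Title' strings matching the query, best match first.

    Matching ignores case, spacing and namespace. Exact normalized matches
    rank first, then substring matches, then titles whose difflib similarity
    ratio reaches the cutoff (an arbitrary, illustrative threshold).
    """
    q = normalize(query)
    if not q:
        return []
    scored = []
    for namespace, title in pages:
        t = normalize(title)
        if q == t:
            score = 3.0
        elif q in t or t in q:
            score = 2.0
        else:
            ratio = difflib.SequenceMatcher(None, q, t).ratio()
            if ratio < cutoff:
                continue
            score = ratio
        scored.append((score, f"{namespace}:{title}"))
    scored.sort(key=lambda item: -item[0])
    return [name for _, name in scored]
```

With this sample data, `find_titles("video")` returns `["HCL:Video"]`, and 'bugreport', 'bug report' and 'report a bug' all return `["openSUSE:Report a Bug"]`.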
Suggestions
- MWSearch and GoogleSiteSearch (or alternatively the Google Custom Search Engine) do a much, much, much better job at producing useful results.
- http://www.mediawiki.org/wiki/Extension:MWSearch combined with http://www.mediawiki.org/wiki/Extension:Lucene-search
- Work on page title matches.
- Allow page title matches in all namespaces by default.
- Review the code to see whether we can easily allow substring and case-insensitive matches, or even similarity-based matches.
- Work on feeding relevance data into the search engine.
- Associate each namespace with a default relevance.
- Some metric of hit count (this month and all time).
- Some metric of edit frequency.
- Number of inbound and outbound links.
- Number of different contributors.
- Number of stars in some rating.
- Work on search algorithms.
- Allow for typos and run-together/split words with a small penalty in relevance. (Ouch, that is hard.)
- Work on presenting the results sorted by relevance.
- http://lists.opensuse.org/archive/opensuse-wiki/2010-07/msg00221.html
- Matches with the words in the same order first.
- Matches with the words close together first.
- AND combinations before OR.
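To make the ordering ideas above concrete, here is a rough sketch of a sort key that combines them with a per-namespace default relevance. Everything in it is an assumption for illustration: the namespace weights are placeholder values, `rank_key` and `rank` are hypothetical names, and the priority between the criteria (AND before OR first, then word order, then closeness, then namespace weight) is one possible choice among several.

```python
import re

# Assumed per-namespace default relevance; the values are placeholders.
NAMESPACE_WEIGHT = {"Main": 1.0, "Portal": 0.9, "HCL": 0.8, "User": 0.3}


def tokenize(text):
    """Split text into lower-case alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_key(query, namespace, text):
    """Build a sort key for one page; tuples that compare smaller rank higher.

    Returns None when no query word occurs in the text at all.
    """
    terms = tokenize(query)
    words = tokenize(text)
    positions = {t: [i for i, w in enumerate(words) if w == t] for t in terms}
    present = [t for t in terms if positions[t]]
    if not present:
        return None

    all_present = len(present) == len(terms)  # AND match rather than OR match
    firsts = [positions[t][0] for t in present]
    in_order = firsts == sorted(firsts)       # first occurrences follow query order
    span = max(firsts) - min(firsts)          # crude measure of closeness
    weight = NAMESPACE_WEIGHT.get(namespace, 0.5)
    return (not all_present, not in_order, span, -weight)


def rank(query, pages):
    """Sort (namespace, title, text) tuples by rank_key, dropping non-hits."""
    keyed = []
    for namespace, title, text in pages:
        key = rank_key(query, namespace, text)
        if key is not None:
            keyed.append((key, f"{namespace}:{title}"))
    keyed.sort()
    return [name for _, name in keyed]
```

Calling `rank("nvidia driver", pages)` with a list of `(namespace, title, text)` tuples puts pages containing both words ahead of pages containing only one, and among those prefers pages where the words appear in query order and close together. The closeness measure only looks at the first occurrence of each word, which is a deliberate simplification.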