http://datamining.typepad.com/data_mining/2005/07/evaluating_blog.html
Firstly, what are blogs? A year or two ago, it
would have been relatively easy to answer that
question. However, the definition has been
getting more and more muddied over time. Problems
include:
* The location of the blog. If a blog is
defined by a URL and I place the blog on the home
page of my corporate web site, then are links to
that home page links to the blog, or links to the
site?
* RSS gave blogs their big break, and the use
of ping servers and the aggregation of updates
that they provide were a win for everyone - sort
of. Nothing in the structure of these
systems indicates whether the pinger is a blog
or not. Consequently, if you listen to a ping
service, you will also get non-blog content, e.g.
discussion threads on Flickr.
Any definition of blogs will also include models
of link structure (trackbacks and citations),
comments, blogrolls and so on, all of which
affect the expectations of the user (are they
searching for posts with comments, or for
comments on a topic?).
Secondly, what is search? Or rather, what is the
intention of the searcher? For blogs, a large
proportion of search is focused on figuring out
who is talking about the searcher: so-called
vanity searches. Due to the long tail, this type
of search is very sensitive to coverage issues,
which is not the case if you are looking for
information. In addition, because some types of
blog search are variations on those of major
search engines and others are entirely novel in
form, the semantics of the search terms presented
to blog search engines are still evolving. What does it
mean to enter a URL into a search engine? Are you
looking for citations of that URL? Are you
looking for posts from that blog?
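To make the ambiguity concrete, here is a minimal sketch of an engine that declines to guess and returns both readings of a URL query. The `Post` record, its field names and the host-based matching are all assumptions made for illustration; as noted above, a blog is not always cleanly identified by its host.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Post:
    # Hypothetical minimal post record; field names are illustrative only.
    permalink: str
    links: list = field(default_factory=list)


def _host(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def interpret_url_query(query_url, posts):
    """Split posts into those published on the queried host and those citing it."""
    target = _host(query_url)
    from_blog = [p for p in posts if _host(p.permalink) == target]
    citing = [
        p for p in posts
        if _host(p.permalink) != target
        and any(_host(link) == target for link in p.links)
    ]
    return {"posts_from": from_blog, "citations_of": citing}
```

A real engine would probably pick one reading as the default and offer the other as a refinement; which one to pick is exactly the open question.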
Thirdly, what are the results? We have all been
led to expect the type of results that major
search engines deliver - a ranked list of web
pages. However, there are other factors that
matter for ranking, as well as other ways in which
blog data can very naturally be delivered. One
obvious issue is the distinction between a search
for a blog (find me a good blog on this topic)
and the search for a post (find me posts that
match my criteria). This underlines the fact that
blog results are often at the granularity of
sub-page documents. In addition, blogs are far more
timely than web data in general. Consequently,
time is an important factor (witness the several
blog search engines that now provide trend graphs
as search results).
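As a rough illustration of a trend graph as a search result, the sketch below counts matching posts per day. It assumes posts arrive as hypothetical `(date, text)` pairs and uses a plain substring match in place of a real query engine.

```python
from collections import Counter
from datetime import date


def daily_trend(posts, term):
    """Count posts per day whose text contains the term (case-insensitive)."""
    needle = term.lower()
    counts = Counter(
        posted_on for posted_on, text in posts if needle in text.lower()
    )
    # Sorted (date, count) pairs are ready to plot as a time series.
    return sorted(counts.items())
```

Note that the output is only as good as the post dates going in, which is why accurate post times show up as a quality issue in their own right.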
Fourthly, what are the quality issues for search
results? These must include (at least):
* Segmentation - the separation of blog posts
from the blog template and peripheral data.
* Deduplication - the filtering out of
multiple copies of the same post (something you
will never see on a web search engine).
* Spam filtering - the removal of spam blog
data.
* Time - the accurate representation of the
time of a post.
* Relevance - the boosting of results that
are more relevant than others.
* Speed of query execution - how fast the
results come back.
* Comprehensiveness - how complete the coverage
is.
* Time to index - how long it takes for a
post to become part of the search engine's index.
* Repeatability - if I issue the same query
again (immediately), do I get the same results?
* Result count estimation.
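Two of these issues are easy to sketch in code. The example below is only one possible approach, not a description of what any particular engine does: it assumes exact-duplicate detection by hashing whitespace- and case-normalized text, and it measures repeatability as the Jaccard overlap of two result sets, which ignores ranking order. All function names are made up for illustration.

```python
import hashlib
import re


def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(posts):
    """Keep the first post seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for post in posts:
        key = fingerprint(post)
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


def repeatability(first_run, second_run):
    """Jaccard overlap of two result lists for one query (1.0 = same set)."""
    a, b = set(first_run), set(second_run)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Near-duplicates (the same post with a different template wrapped around it) would slip past an exact hash, which is one reason segmentation has to come first.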
In addition, the search engine must be judged on
a number of service issues such as:
* Ability to request the inclusion of a blog.
* Ability to remove a blog and all posts from
an index.
So how do we go about testing blog search
engines? Ultimately, the quality of a search
engine can only really be measured by the success
or failure of a user to achieve some task, a task
in which the search engine is a tool and the
search results are not the final goal of the
task. This means that anecdotal tests (in which
one determines that one engine is better than
another because it returns more hits) are
completely out. It also means that a
representative set of tasks needs to be captured
and translated into realistic queries. Note that,
interestingly, the willingness of the market to
provide richer tools than the monolithic,
one-dimensional list of ranked results means that
those who innovate with tools that are accessible
to users (like trend mining, etc.) have an
advantage.
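A task-based comparison could be tallied along the following lines. This is a sketch under stated assumptions: the `(engine, task_id, succeeded)` tuple layout is hypothetical, and deciding whether a user actually succeeded at a task is the hard part that the code takes as given.

```python
from collections import defaultdict


def task_success_rates(outcomes):
    """Return the fraction of successful tasks per engine.

    outcomes: iterable of (engine, task_id, succeeded) tuples.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for engine, _task_id, succeeded in outcomes:
        totals[engine] += 1
        if succeeded:
            wins[engine] += 1
    return {engine: wins[engine] / totals[engine] for engine in totals}
```

The point of scoring tasks rather than hit counts is that an engine returning fewer but better results is not penalized.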