Researching Usability


As the project embarks on usability testing using mobile devices, it was important to evaluate mobile-specific research methods and understand the important differences between desktop usability testing and testing on mobile devices. The most important thing to be aware of when designing for and testing on mobile devices is that it IS different from traditional testing on desktop computers. Additional differences are provided below:

  • You may spend hours seated in front of the same computer, but mobile context is ever-changing. This impacts (amongst other things) the users’ locations, their attention, their access to stable connectivity, and the orientation of their devices.
  • Desktop computers are ideal for consumption of lengthy content and completion of complex interactions. Mobile interactions and content should be simple, focused, and should (where possible) take advantage of unique and useful device capabilities.
  • Mobile devices are personal, often carrying a wealth of photos, private data, and treasured memories. This creates unique opportunities, but privacy is also a real concern.
  • There are many mobile platforms, each with its own patterns and constraints. The more you understand each platform, the better you can design for it.
  • And then there are tablets. As you may have noticed, they’re larger than your average mobile device. We’re also told they’re ideal for reading.
  • The desktop is about broadband, big displays, full attention, a mouse, keyboard and comfortable seating. Mobile is about poor connections, small screens, one-handed use, glancing, interruptions, and (lately), touch screens.

~ It’s About People Not Devices by Stephanie Rieger and Bryan Rieger (UX Booth, 8th February 2011)

Field or Laboratory Testing?

As our interaction with mobile devices happens in a different way to desktop computers, it seems a logical conclusion that the context of use is important in order to observe realistic behaviour. Brian Fling states in his book that you should “go to the user, don’t have them come to you” (Fling, 2009). However, testing users in the field has its own problems, especially when trying to record everything going on during tests (facial expressions, screen capture and hand movements). While contextual enquiries using diary studies are beneficial, they also have drawbacks: they rely on the participant to provide an accurate account of their behaviour, which is not always easy to achieve, even with the best intentions. Carrying out research in a coffee shop, for example, provides the real-world environment which maximises external validity (Madrigal & McClain, Usability for Mobile Devices). However, for those for whom field studies are impractical for one reason or another, simulating a real-world environment within a testing lab has been adopted. Researchers believe such simulations can also help to provide external validity which traditional lab testing cannot (Madrigal & McClain, 2011). Researchers have attempted a variety of techniques to do this, listed below:

participant on a treadmill

Image from Kjeldskov & Stage (2004)

  • Playing music or videos in the background while a participant carries out tasks
  • Periodically inserting people into the test environment to interact with the participant, acting as a temporary distraction
  • Distraction tasks including asking participants to stop what they are doing, perform a prescribed task and then return to what they’re doing (e.g. Whenever you hear the bell ring, stop what you are doing and write down what time it is in this notebook.) (Madrigal & McClain, 2010)
  • Having participants walk on a treadmill while carrying out tasks (continuous speed and varying speed)
  • Having participants walk at a continuous speed on a course that is constantly changing (such as a hallway with fixed obstructions)
  • Having participants walk at varying speeds on a course that is constantly changing (Kjeldskov & Stage, 2004)

Although realism and context of use would appear important to the validity of research findings, previous research has challenged this assumption. A comparison of the usability findings of a field test and a realistic laboratory test (where the lab was set up to recreate a realistic setting, such as a hospital ward) found that there was little added value in taking the evaluation into the field (Kjeldskov et al., 2004). The research revealed that lab participants on average experienced 18.8% of the usability problems, compared to field participants, who experienced 11.8%. In addition, 65 man-hours were spent on the field evaluation compared to 34 man-hours for the lab evaluation, almost half the time.

Subsequent research has provided additional evidence to suggest that lab environments are as effective in uncovering usability issues (Kaikkonen et al., 2005). In this study, researchers did not attempt to recreate a realistic mobile environment, instead comparing their field study with a traditional usability test laboratory set-up. They found that the same issues were uncovered in both environments, although laboratory tests found more cosmetic or low-priority issues than the field, and the frequency of findings in general varied (Kjeldskov & Stage, 2004). The research did find benefits of conducting a mobile evaluation in the field. It was able to inadvertently evaluate the difficulty of tasks by observing participant behaviour: participants would stop, often look for a quieter spot, and ignore outside distractions in order to complete the task. This is something that would be much more difficult to capture in a laboratory setting. The research also found that the field study provided a more relaxed setting, which influenced how much verbal feedback the participant provided; however, this is contradicted by other studies which found the opposite to be true (Kjeldskov & Stage, 2004).

Both studies concluded that the laboratory tests provided sufficient information to improve the user experience, in one case without trying to recreate a realistic environment. Both found field studies to be more time-consuming; unsurprisingly, this also means the field studies are more expensive and require more resources to carry out. It’s fair to say that running a mobile test in the lab will provide results similar to running the evaluation in the field. If time, money and/or access to equipment is an issue, it certainly won’t be a limitation to test in a lab or empty room with appropriate recording equipment. Many user experience practitioners will agree that any testing is better than none at all. However, there will always be exceptions where field testing will be more appropriate. For example, a geo-based mobile application will be easier to evaluate in the field than in the laboratory.

Capturing data

Deciding how to capture data is something UX2 is currently thinking about. Finding the best way to capture all relevant information is trickier on mobile devices than on desktop computers. Various strategies have been adopted by researchers, a popular one being the use of a sled which the participant can hold comfortably, with a camera positioned above to capture the screen. In addition, it is possible to capture the mobile screen using specialised software specific to each platform. If you are lucky enough to have access to Morae usability recording software, it has a specific setting for testing mobile devices which allows you to record from two cameras simultaneously: one to capture the mobile device and the other to capture body language. Other configurations include a lamp-cam, which clips to a table with the camera positioned in front of the light. This set-up does not cater for an additional camera to capture body language and would require a separate camera set up on a tripod. A more expensive solution is the ELMO-cam, specifically their document camera, which is stationary and requires the mobile device to remain static on the table. This piece of kit is more likely to be found in specialised research laboratories, which can be hired for the purpose of testing.

lamp-cam configurations

Lamp-cam, image courtesy of Barbara Ballard


Based on the findings from previous research, the limitations of the project and its current mobile service development stage, it seems appropriate for the UX2 project to conduct initial mobile testing in a laboratory. Adapting a meeting room with additional cameras and using participants’ own mobile devices (except where a specific device is required) will provide the best solution and uncover as many usability issues as if the testing took place in the field. A subsequent blog will provide more details of our own test methods, with reflections on their success.


Fling, B. (2009). Mobile Design and Development. O’Reilly, Sebastopol, CA, USA.

Kaikkonen, A., Kallio, T., Kekäläinen, A., Kankainen, A. and Cankar, M. (2005). Usability Testing of Mobile Applications: A Comparison between Laboratory and Field Testing. Journal of Usability Studies, 1(1).

Kjeldskov, J. and Stage, J. (2004). New techniques for usability evaluation of mobile systems. International Journal of Human-Computer Studies, 60.

Kjeldskov, J., Skov, M.B., Als, B.S. and Høegh, R.T. (2004). Is It Worth the Hassle? Exploring the Added Value of Evaluating the Usability of Context-Aware Mobile Systems in the Field. In Proceedings of the 5th International Mobile HCI 2004 Conference, Udine, Italy. Springer-Verlag.

Roto, V., Oulasvirta, A., Haikarainen, T., Kuorelahti, J., Lehmuskallio, H. and Nyyssönen, T. (2004). Examining Mobile Phone Use in the Wild with Quasi-Experimentation. Helsinki Institute for Information Technology Technical Report.

Tamminen, S., Oulasvirta, A., Toiskallio, K. and Kankainen, A. (2004). Understanding mobile contexts. Journal of Personal and Ubiquitous Computing, 8.

Last month a usability study was carried out on the UX2 digital library prototype. The study involved 10 student participants who tried to complete a variety of tasks using the prototype. The report is now available to read in full and can be accessed via the library.

The prototype is based on Blacklight, an open-source Ruby on Rails discovery interface, which has been further developed for the project to provide additional features. Existing component services have been ‘mashed up’ to generate the UX2.0 digital library. The prototype currently indexes the catalogues provided by the National e-Science Centre at The University of Edinburgh (NeSC) and CERN, the European Organisation for Nuclear Research. The report presents the findings of the usability testing (work package 2 – WP2.3) of the prototype, which was conducted with representative users at the university. The study reveals a range of issues uncovered by observing participants as they tried to complete tasks using the prototype. The findings outlined in the report provide a number of recommendations for changes to the prototype in order to improve the user’s experience.
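For readers unfamiliar with Blacklight, the behaviour described in this post is largely driven by configuration in the Rails CatalogController rather than bespoke code. A minimal sketch of the sort of set-up involved (the Solr field names here are illustrative, not the project’s actual schema):

```ruby
# app/controllers/catalog_controller.rb -- illustrative sketch only
class CatalogController < ApplicationController
  include Blacklight::Catalog

  configure_blacklight do |config|
    # Facets offered in the sidebar for narrowing results
    config.add_facet_field 'subject_facet', label: 'Subject'
    config.add_facet_field 'year_facet',    label: 'Year'
    config.add_facet_field 'format_facet',  label: 'Format'

    # Fields shown for each item in the results list
    config.add_index_field 'title_display',  label: 'Title'
    config.add_index_field 'author_display', label: 'Author'
  end
end
```

Blacklight turns each `add_facet_field` declaration into a clickable facet whose values come straight from the Solr index, which is why much of the faceted-navigation behaviour observed during testing is a matter of configuration.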

In order to identify and fully explain the technology responsible for each issue in the report, supplementary blogs will be published on the project website in stages, as and when developmental changes are made (follow the ux2 Twitter account for announcements). It is hoped that this supplemental development documentation will make it more accessible to other digital library developers and the wider JISC community. Some of the main findings from the report are summarised below.

The study highlighted a number of positive aspects of the prototype:

  • Allowing users to narrow results using the faceted navigation links was useful.
  • Providing users with details of the item content including full text preview, video and presentation slides was informative.
  • Allowing users to bookmark items and add notes was considered a useful feature.
  • Overall the layout and design was clean, simple and useful.

These positive findings are reflected in the word cloud generated from the questionnaire participants were asked to complete:

UX2 word cloud

However there were some usability issues with the prototype:

  • It was not obvious when the system saved previously selected facets in the Scope, which often confounded participants’ expectations.
  • The external ‘Other’ links were not relevant to participants and were often mistrusted or considered a distraction.
  • It was not clear when an item had a full text preview feature.
  • Links to information resources were not prominent and were often missed by participants.
  • The format of text within the item details page made it difficult to read, and consequently participants often ignored it.

There were also a few lessons learned from the user study which I thought would be useful to share:

  1. Recruiting participants via telephone does not necessarily guarantee attendance. Two participants did not show up to their slot after arranging the appointment by phone and receiving an email confirmation. However, this could also have been affected by the time of year: it transpired that many students had coursework deadlines the same week, and the students in question said they forgot because they had a heavy workload.
  2. User generated tasks are not easy to replicate using prototypes. This was not unexpected but something which was tried anyway. As suspected, it was difficult to create a task which could generate results when using such a specialised and relatively small database. However, when it was successful it did return some useful findings.
  3. It’s difficult to facilitate usability tests and log details using Morae. Any usability practitioner will tell you that it’s important to concentrate on observing the participant and interacting with them, and to avoid breaking the flow by stopping to take detailed notes. I found it impossible to observe a participant, engage with what they were doing and log behaviour in Morae, so I would recommend you recruit a note-taker if detailed logging is important for your usability study.

When carrying out usability studies on search interfaces, it’s often better to favour interview-based tasks over pre-defined ‘scavenger-hunt’ tasks. In this post I’ll explain why this is the case and why you may have to sacrifice capturing metrics in order to achieve this realism.

In 2006, Jared Spool of User Interface Engineering wrote an article entitled Interview-Based Tasks: Learning from Leonardo DiCaprio. In it he explains that it often isn’t enough to create test tasks that ask participants to find a specific item on a website. He calls such a task a Scavenger-Hunt task. Instead he introduces the idea of interview-based tasks.

When testing the search interface for a library catalogue, a Scavenger Hunt task might read:

You are studying Russian Literature and you will be reading Leo Tolstoy soon. Find the English version of Tolstoy’s ‘War and Peace’ in the library catalogue.

I’ll refer to this as the Tolstoy Task in this post. Most of your participants (if they’re university students) should have no trouble understanding the task. But it probably won’t feel real to any of them. Most of them will simply type ‘war and peace’ into the search and see what happens.

Red routes

The Tolstoy Task is not useless; you’ll probably still witness things of interest. So it’s better than having no testing at all.

But it answers only one question: when users know the title of the book and its author, and how to spell them both correctly, how easy is it to find the English version of Leo Tolstoy’s War and Peace?

A very specific question like this can still be useful for many websites. For example, a car insurance company could ask: when users have all of their vehicle documents in front of them, how easy is it to get a quote from our website?

Answering this question would give them a pretty good idea of how well their website was working. This is because it’s probably the most important journey on the site. Most websites have what Dr David Travis calls Red Routes – the key journeys on a website. When you measure the usability of a website’s red routes you effectively measure the usability of the site.

However, many search interfaces, such as that of a university library catalogue, don’t have one or two specific tasks that are more important than any others. It’s possible to categorise tasks, but difficult to introduce them into a usability test without sacrificing a lot of realism.

Interview-based tasks

The interview-based task is Spool’s answer to the shortfalls of the Scavenger Hunt task. This is where you create a task with the input of the participant and agree what successful completion of the task will mean before they begin.

When using search interfaces, people often develop search tactics based upon the results they are being shown. As a result they can change tactics several times. They can change their view of the problem based upon the feedback they are getting.

Whilst testing the Aquabrowser catalogue for the University of Edinburgh, participants helped me to create tasks that I’d never have been able to create on my own. Had we not done this, I wouldn’t have been able to observe their true behaviour.

One participant used the search interface to decide her approach to an essay question. Together we created a task scenario where she was given an essay to write on National identity in the work of Robert Louis Stevenson.

She had decided that the architecture in Jekyll and Hyde, whilst the novel is set in London, reminded her more of Edinburgh. She searched for sources that referred to Edinburgh’s architecture in Scottish literature, opinion on architecture in Stevenson’s work, and opinion on architecture in national identity.

The level of engagement she had in the task allowed me to observe behaviour that a pre-written task would never have elicited.

It also made no assumptions about how she uses the interface. In the Tolstoy task, I’d be assuming that people arrive at the interface with a set amount of knowledge. In an interview-based task I can establish how much knowledge they would have about a specific task before they use the interface. I simply ask them.

Realism versus measurement

The downside to using such personalised tasks is that it’s very difficult to report useful measurements. When you pre-define tasks you know that each participant will perform the same task. So you can measure the performance of that task. By doing this you can ask “How usable is this interface?” and provide an answer.
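To make the contrast concrete, here is a minimal sketch (in Ruby, with made-up data) of the kind of summary measures a fixed, pre-defined task allows you to report:

```ruby
# Hypothetical per-participant results for one pre-defined task:
# whether the participant completed it, and their time in seconds.
results = [
  { completed: true,  seconds: 74 },
  { completed: true,  seconds: 102 },
  { completed: false, seconds: 180 },
  { completed: true,  seconds: 95 },
]

successes       = results.select { |r| r[:completed] }
completion_rate = successes.size.fdiv(results.size)
mean_time       = successes.sum { |r| r[:seconds] }.fdiv(successes.size)

puts format('Completion rate: %.0f%%', completion_rate * 100)
puts format('Mean time on task (successes only): %.1f s', mean_time)
```

With interview-based tasks, no two participants attempt the same task, so aggregating numbers like these across participants would be comparing unlike things.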

With interview-based tasks this is often impossible because the tasks vary in subject and complexity. It’s then often inappropriate to use them to provide an overall measure of usability.

Exposing issues

I believe that usability testing is more reliable as a method for exposing issues than it is at providing a measure of usability. This is why I favour using interview-based tasks in most cases.

It’s difficult to say how true to life the experience you’re watching is. If they were sitting at home attempting a task then there’d be nobody watching them and taking notes. Nobody would be asking them to think aloud and showing interest in what they were doing. So if they fail a task in a lab, can you be sure they’d fail it at home?

But for observing issues I feel it’s more reliable. If participants misunderstand something about the interface in a test, you can be fairly sure that someone at home will misunderstand it in the same way.

And it can never hurt to make something more obvious.

As stated in the previous blog post, the second phase of Guerilla testing was conducted on the 30th June. It was hoped that in the second phase we would have the opportunity to test with more people, as the course finished before lunch. This did appear to be the case, as I managed to test 4 individuals who were happy to stay for up to 30 minutes in some cases to take part.

Before the testing took place, some design changes were made to the prototype based on the feedback from the first phase. A number of recommended changes had been identified (see Recommendations on the project Wiki page), and within the timescale it was possible to implement the following before phase two testing:

  • Ensure that categories within the Year facet are presented in reverse chronological order, most recent first.
  • Provide an additional facet to include Author.
  • Allow users to de-select facets within the facet navigation and not just using the breadcrumb system at the top.

The test plan remained the same, and again I alternated the websites between each participant to reduce bias. The most startling finding in phase 2 was the users’ preferred site: three of the four participants preferred the prototype in one way or another to the NeSC library. With such small numbers it’s difficult to call this a trend, but it is an improvement on the first round of testing. It would be interesting to see if this pattern continues during phase 3 in August, and indeed with the focus group planned in September.

There was more evidence to support a faceted navigation with expanded facet values. Some of the users commented that they liked aspects of the NeSC navigation because the values were visible. However, there was also a feeling that this could be overwhelming at times, particularly on the homepage before the user had begun their search. This suggests that a middle ground between the collapsed (or accordion) facet navigation of the prototype and the fully expanded navigation of NeSC may be a realistic compromise. Further discussion on the pros and cons of facet navigation design can be read in the excellent blog post by James Kalbach.

Some users commented that they did not notice the facet navigation in the prototype, either because it did not immediately look like a faceted navigation system or because its design and position meant that users were more attracted to the results in the centre of the page. Currently the facets are closed by default and are styled to look like ordinary links. Although this accordion design can work, it requires additional features to communicate its purpose to the user. An arrow next to each facet is a common device used to indicate that it can be toggled to reveal the facet values. Additionally, expanding the first two facets and leaving the rest closed is another strategy used to demonstrate how the system works. It seems clear from the feedback during both phases of testing that additional design features, or a different approach, are required to make it easier for users to understand and successfully use the facets.

World Digital Library expanded facet example

However, if the prototype facet is open by default then the same issues may arise as were reported in the NeSC library. A compromise could be to limit the number of category values in each facet and provide a ‘Choose more…’ link so that users can expand the list if required. The University of Edinburgh Aquabrowser catalogue and the World Digital Library are both examples of digital libraries using this feature in their faceted navigation systems; however, each library implements the feature in quite different ways. Aquabrowser’s system is more user-friendly because it provides the full value list on a separate page and gives the user control over its presentation (by relevance or alphabetically). The World Digital Library expands the values within the facet, often with wordy facet values which are organised by relevance only (image). Such a list could clearly be difficult for users to navigate quickly and consequently may not enhance the feature. Indeed, this has already been witnessed while testing the NeSC digital library.
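It is worth noting that Blacklight supports this truncated-facet pattern out of the box: giving a facet a `limit` shortens the sidebar list and generates a ‘more’ link that opens the full, pageable value list, sortable by count or alphabetically. A hedged sketch, with illustrative field names:

```ruby
# Inside configure_blacklight do |config| ... end
# Show only the first five values of each facet in the sidebar; the
# generated "more" link exposes the full value list on a separate view.
config.add_facet_field 'year_facet',   label: 'Year',   limit: 5
config.add_facet_field 'author_facet', label: 'Author', limit: 5
```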

Another finding from the testing concerned the design and implementation of the combined facet navigation and breadcrumb system. I intend to discuss the feedback surrounding this in my next blog post.

Workpackage 2 (WP2) within the UX2 project plan focuses on usability and UX research through a variety of methods. Part of the WP2 plan includes a usability inspection of the National e-Science Centre (NeSC) digital library. In addition to the NeSC library interface, developments within WP4 (Concentration UX Enhancement) mean that it has been possible to test a prototype system alongside the NeSC library and make some comparisons. The differences between the systems essentially come down to the version of Blacklight they are using: NeSC is a developed system based on an older version of Blacklight, while the prototype uses the latest version, although it has not yet undergone any additional development.

An opportunity arose to conduct rapid user testing on both systems by tagging onto an existing usability training course run regularly at the university. Attendees were asked to participate for 15 minutes at the end of the course. This ambush-style Guerilla testing lent itself well to the current state of the prototype: giving participants one task that fitted the current capabilities of the system meant that longer usability testing was not yet appropriate. It also meant that feedback could be received relatively quickly and fed back into the user-centred design process, to be tested on the next group of participants two weeks later. To learn more about Guerilla testing (also referred to as discount usability), please check out this excellent resource:

  • The Least You Can Do by Steve Krug (video presentation)

As stated, the ambush-style Guerilla testing was planned to coincide with the university’s usability training course. The first available courses were organised to take place on the 15th and 30th June. The course itself is provided for those with little or no experience or knowledge of usability testing and typically has around 20 attendees. Recruiting participants in this way made it much easier to organise testing at short notice. However, one downside to this method is that you don’t know anything about who you are testing, and time limitations make it difficult to capture any demographic or profile information. Although the prototypes are not guaranteed to be tested on representative users, this method does highlight issues during the early design stages of development, prior to testing with representative users. This makes it a valuable exercise even if it is not exhaustive or scientific.

This blog post intends to lay out the test plan created for the first round of testing, which took place on the 15th June, and provide a snapshot of some initial findings. The intention is to run the Guerilla tests again during the next training course on the 30th June, using an updated version of the prototype.

Task scenarios

The task scenarios were constructed based on the data set used in each system. The full test plan was piloted on a member of staff before conducting the Guerilla tests.

Task 1a:

“As part of your work you need to read a selection of current presentations on particle physics. Using the prototype, can you find a suitable presentation published in the last 2 years?”

Task 1b:

“A lecturer has asked you to find the most recent presentation on Grid computing for an event coming up. Can you find this information using the prototype?”

Tasks 1a and 1b were alternated between each participant to ensure one system did not benefit from familiarisation in the previous task. However, in total 3 participants completed the tasks, leaving an odd number. This was due in part to a delay at the start of the training course, which meant it finished later than expected; people are understandably less likely to remain past the time when they would normally leave work. Consequently, time restrictions made it difficult to recruit additional participants. However, the next training course is a morning session, so there will likely be a better chance of recruiting more participants.
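Alternating the task order in this way is a simple counterbalancing scheme. A minimal sketch of how the assignment works (system names from the study; everything else is illustrative):

```ruby
# Alternate which system each participant sees first, so that neither
# system systematically benefits from familiarisation gained on the other.
SYSTEMS = ['NeSC library', 'prototype'].freeze

def task_order(participant_index)
  participant_index.even? ? SYSTEMS : SYSTEMS.reverse
end

(0...4).each do |i|
  puts "Participant #{i + 1}: #{task_order(i).join(', then ')}"
end
```

With an odd number of participants, as happened here, one ordering inevitably occurs once more than the other, which is why an even turnout is preferable.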


  • All three participants preferred to use the NeSC library as opposed to the prototype.
  • Participants used the facet navigation much less in the prototype than in the NeSC library.
  • The facet navigation in the prototype did not behave as a participant expected. They also did not realise what it was at first.
  • A participant was surprised when the NeSC library placed selections within the breadcrumb trail in a different order to that in which they had been selected.
  • Facets in the prototype don’t look like expandable sections or links.
  • Facets did not always cope with presenting results when selected in a different order, e.g. participants expected to find relevant presentations within the most recent date when narrowing by year first. However, this could be symptomatic of the small data set used for testing purposes and would need to be tested on a larger data set.
  • Facets with a large number of categories, e.g. Subject, were difficult to scan in the NeSC library. For example, keywords such as Grid Computing featured in more than one subject category, making it difficult for the user to know which category to select.
  • The facet navigation moves from the left-hand side to the right-hand side once a search has been conducted in the NeSC library. This was unexpected and confusing to one of the participants.
  • Participants were often unable to find results from the most recent year (2009) because it was not in the top ten values presented in the facet of the prototype. The absence of a ‘more’ button made it difficult to search all years or the most recent.
  • Participants often sought and expected years to be presented in chronological order, and often requested that this be provided to make searching easier.
