Sticky notes with testing results.

Creating a Testing Process

In December 2018, a minimum viable version of Cambridge One was released by the Press. This online learning environment had a simple goal – to create one place in which teachers and students could access all of their English Language Teaching (ELT) material.

With the priority for 2019 being to add more features to this MVP, a worrying trend became clear – designs were being built and released without being tested beforehand. This led me to develop and implement a new testing process.

So what did I do? How? And most importantly… why?

For confidentiality, some information (including insights into user behaviour and motivations) may be hidden or omitted. Information correct as of January 2020. All screenshots can be found on Cambridge One.
Cambridge One on different-sized devices.

Prologue

When I joined the Press, usability testing was rarely carried out.

While a small amount had been done (with key changes made as a result), it wasn’t included in the workflow of a user story. The priority was closing tickets – which meant that testing seemed to be seen as an “optional extra”, or something that actively blocked delivery.

The irony was that while a design could be released without testing it first, it would have been unthinkable to do this for code.

This began to change in August 2019, when a gap between sprints allowed the Design Team to carry out a week of usability testing with colleagues who were former English teachers and learners.

The visibility that this gave testing, as well as the findings we presented, helped persuade some stakeholders (and eventually the business) that this needed to be done on a more regular basis.

The challenge

Image of the Cambridge Learning Management System (CLMS).
Cambridge One was being built as a replacement for the Cambridge Learning Management System (CLMS), a collaboration between the Press and Cambridge Assessment.

Having been challenged to “own” testing, my ultimate goal was to build a system that wasn’t completely dependent on one person – reliance on single individuals seemed to be a big problem across the overall project.

Given the amount of testing that needed to be done, any process needed to be easily used by other designers, ideally with no supervision.

Through talking to key members of the Design Team (including our Lead and Director), I started the task of gaining a deeper understanding of the problems testing could solve.

These conversations opened my eyes to a delicate, behind-the-scenes situation, with three key questions:

How could the frequency of testing be increased?

With six scrum teams and plans to introduce a seventh, any process would have to be scalable and sustainable.

How could test results be published quicker?

With one person covering the testing of six teams, tests would have to be organised, run and analysed quickly to avoid delaying delivery.

How could the visibility of testing be increased?

Testing was a chance to raise the profile of the Design Team, but would have to be carried out in a way that was impossible to ignore.

How were these questions answered? Let’s fast forward to 2020…

The solution

… where a lot has been achieved.

Flowchart showing testing process for Cambridge One.

The bulk of our testing is now done online, with a different platform being used depending on the type of test. We currently have the ability to run usability, A/B and user flow tests, as well as interviews, surveys and questionnaires.

This current system allows me to spend the first week of a sprint gathering requirements, creating hypotheses and writing tasks to test them. The tests are run over the weekend, producing a series of recordings of people using our designs that are ready to review on Monday.

These videos are then watched, before a traffic light system is used to grade the success of each task. With key observations collected, a summary is then placed both on an online Miro board and in our testing channel in Teams. This has links to the full results and any assets, as well as the video playlist (on our private YouTube channel).

Language teacher mid lesson.
A lesson observation in Turkey. With frequent technical faults making our interactive whiteboard (IWB) software virtually unusable, the teacher decided to teach his lesson on a normal whiteboard.

Research has also been carried out in the field. Recent examples of this include trips to language schools to understand how students would use speech recognition software to supplement their learning (see note), and a series of lesson observations in Turkey to develop insights into how IWB software is used in the classroom.

So, all things said, a pretty big change. But how was it done?

NOTE: A key finding from this trip was that the use of speech recognition software depended on the level of the learner. Advanced learners (B2 and above) tended to want to use it to perfect their pronunciation (which they were insecure about), while lower level learners (A1 - B1) were more likely to use it to learn words and phrases.

Increasing testing frequency

With an NN/g seminar predicting that the key tasks in a simple test would take over seven hours per user, five participants meant around 35 hours of work – so the best-case scenario was one test per week.

However, the requirement was to run one test per team per sprint. With six teams to cover, it was obvious that any low-value work had to be cut down to create time for higher value activities.

The time taken to recruit participants was, by far, the biggest obstacle to running regular testing.
Image showing statistics of testing panel.
My original idea was to create a testing panel, something I worked with data protection, compliance and marketing to set up. Work is ongoing to see how this can be integrated with existing Press software.

With internal restrictions (e.g. payment lead times) making it unrealistic to recruit participants ourselves, the focus moved towards finding an external company that could both recruit and pay users. This was complicated by the instruction to test with our target audience (teachers, learners etc.).

Outsourcing this could have cut the time needed to conduct a test by around 25%. However, this still wasn’t enough for the volume required, and so I proposed the idea of running unmoderated tests.

A trial of this method in December 2019 (run after I finally convinced my manager) generated a more detailed set of results in less time. With this common-sense argument impossible to ignore, unmoderated testing became standard.

In my opinion, the move to unmoderated testing was the key to increasing testing frequency.

Publishing results quicker

External deadlines meant that teams needed to fix any issues raised by testing ASAP. Therefore, ways to publish results quicker needed to be found.

This was resolved in two ways.

Firstly, detailed reports were no longer created due to the belief that they were not being read. Testing summaries (or “headlines”) were published in their place on a Miro board, along with links to all related assets.

However, a deeper analysis of the results could be requested if required. This involved getting the teams together to watch, discuss and turn observations into insights.

It made no sense to produce a detailed report when it was unlikely that it would be fully read.

Secondly, because infrequent but serious errors were being ignored, I created a spreadsheet that moved the main focus of results away from observations and onto task impact. This was done using a combination of “traffic light” and task criticality scoring.

Traffic light scoring

Green: Task completed without difficulty.
Orange: Task completed with difficulty.
Red: Task not completed.

Task criticality scoring

5: This is a critical task for users.
3: This is an important task for users.
1: This is a generic task for users.

These methods allowed a task impact summary to be created, which in turn generated recommendations based on pre-defined criteria. Further information was given on a second page, which kept the old results format – ranking observations based on frequency.

Results spreadsheet used since Jan 2020.
The new results format considered difficulty and criticality, something the old one didn’t. The need to report both became clear to me when I conducted an onboarding test that ended up with one user locking themselves out of their account with no warning – a critical failure.
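As a rough sketch of how the two scores can be combined (the weights, thresholds and function names below are illustrative placeholders, not the spreadsheet’s exact criteria), each task’s traffic light grades are converted to a severity total and then weighted by the task’s criticality:

```python
# Hypothetical sketch of combining traffic light grades with task criticality.
# The weights and thresholds are illustrative placeholders, not the exact
# criteria used in the results spreadsheet.

TRAFFIC_LIGHT = {"green": 0, "orange": 1, "red": 2}          # severity per grade (assumed)
CRITICALITY = {"critical": 5, "important": 3, "generic": 1}  # as listed above


def task_impact(grades: list[str], criticality: str) -> int:
    """Sum each participant's severity, then weight by how critical the task is."""
    severity = sum(TRAFFIC_LIGHT[g] for g in grades)
    return severity * CRITICALITY[criticality]


def recommendation(impact: int) -> str:
    """Map an impact score onto a recommendation (placeholder thresholds)."""
    if impact >= 20:
        return "Fix before release"
    if impact >= 8:
        return "Schedule a fix"
    return "Monitor"


# Example: a critical onboarding task where one of five participants
# failed outright (e.g. got locked out of their account).
grades = ["green", "green", "orange", "orange", "red"]
impact = task_impact(grades, "critical")
print(impact, recommendation(impact))  # 20 Fix before release
```

The point of the criticality weighting is exactly the lockout case described above: an error seen only once can still rise to the top of the summary if the task it breaks is critical.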

Increasing testing visibility

The final piece of the puzzle was figuring out a way to get testing noticed more. This had two goals: to encourage more testing requests from scrum teams, and to make the Design Team look as good as possible.

Creating a “single source of truth” was critical to this. In the past, finding older tests was difficult because they were scattered across numerous locations (our server, hard drives, desktops and online).

Centralising all this in one location meant that finding any required info became quicker and easier than before. It also made it easier to keep all scrum teams up to date – which was great, as we were becoming increasingly vocal.

Due to a combination of emails, posts on Teams and bi-weekly presentations, testing updates are now both more frequent and visible.
Testing results wall in Miro.
Each card on the board represents a test, and has a summary as well as links to all related assets (scripts, videos etc.). This board was originally combined with our research findings, but has since been separated to minimise confusion.

All testing (and related communication) carried out by the Design Team was done under the label “Design Testing”, with any software needed accessed via a shared inbox set up under that name.

By doing this, any designer in the Team was able to sign in and independently set up testing if needed. We were also able to save money on these programs by not needing to buy additional seats.

Image of a television with a person pointing.
A 75-inch television was installed in our department, with recent work from the Team shown on a loop to communicate what we were up to.

Epilogue

Going into 2020, we are definitely a lot closer to having a comprehensive testing process that could be used by other designers without supervision. While not yet complete, the current solution is effective enough for testing responsibility to be moved back to designers within scrum teams.

However, there are still issues to be ironed out, and a lot more to be done. In my opinion, our biggest challenges going forwards are:

Testing with users of primary and secondary school age

With products being designed exclusively for people in these age groups, it is becoming more and more important to test with them. As our current platforms only allow us to recruit adults, a new approach might be needed.

Testing with teachers on an interactive whiteboard

With new programs being designed for this, it is essential to test on the type of device that will actually be used in a classroom. This is likely to result in us doing even more studies and usability tests in the field.

While change is still ongoing, testing is in a much better place today than it was three months ago.
