Level Up Your Automated Tests
I presented a new talk at GOTO Chicago 2015 about how to change a team's attitude towards writing automated tests. The talk covers the same case study as Groovy vs Java for Testing (adopting Spock at MongoDB), but from a process/agile/people perspective rather than a technical look at the merits of one language over another.

Slides are available below. As always, the slides are not super-useful out of context, but they do contain my conclusions (also note that, due to a technology fail, my hand-drawn style is even more hand-drawn than usual).

Questions

I sadly did not have a lot of time for questions during the presentation, but thanks to the wonders of modern technology I have a list of unanswered questions, which I will attempt to address here.

Is testing to find out your system works? Or is it so you know when your system is broken?

Excellent question. I would expect that if you have a system that's in production (which is probably true of the large majority of the projects we work on), we can assume the system is working, for some definition of working. Automated testing is particularly good at catching when your system stops doing the things you thought it was doing when you wrote the tests (which may, or may not, mean the system is genuinely "broken"). Regression testing is there to find out when your system is no longer doing what you expect, and automated tests are really good for this.

But testing can also make sure you implement code that behaves the way you expect, especially if you write the tests first. Automated tests can be used to determine that your code is complete, according to some pre-agreed specification (in this case, the automated tests you wrote up front).

So I guess what I'm trying to say is: when you first write the tests, you have tests that, when they pass, prove the system works (assuming your tests are testing the right things and not giving you false positives). Subsequent passes show that you haven't broken anything.

At what level do "tests documenting code" actually become useful? And who should the documentation be targeted at?

In the presentation, my case study is the MongoDB Java Driver. Our users were Java programmers who were going to be coding against our driver, so in this example it makes a lot of sense to document the code using a language that our users understood. We started with Java, and ended up using Groovy because it was also understandable to our users and a bit more succinct.

On a previous project we had different types of tests. The unit and system tests documented the expected behaviour at the class or module level, and were aimed at developers in the team. The acceptance tests were written in Java, but in a friendly DSL-style way. These were usually written by a triad of tester, business analyst and developer, and documented for all three of those roles what the top-level behaviour should be. Our audience here was fairly technical, so there was no need to go to the extent of trying to write English-language-style tests; they were readable enough for a reasonably techy (but non-programmer) audience. These were not designed to be read by "the business" - we developers might use them to answer questions about the behaviour of the system, but they didn't document it in a way that just anyone could understand.

These are two different approaches for two different-sized teams/organisations, with different users, so I guess in summary the answer is "it depends". But at the very least, developers on your own team should be able to read your tests and understand what the expected behaviour of the code is.
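To give a flavour of the DSL style I mentioned above, a test might read something like the sketch below. This is a made-up example with invented names, not code from either project; the tiny helper class is inlined here only so the sketch compiles and runs on its own.

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class TransferAcceptanceTest {

    @Test
    public void shouldRejectATransferThatExceedsTheDailyLimit() {
        // The test method reads as a specification of behaviour...
        customer().withDailyLimitOf(500)
                  .attemptsToTransfer(750)
                  .isToldThat("daily limit exceeded");
    }

    // ...and the plumbing lives in a small helper, which would normally sit in its own file.
    private static CustomerDsl customer() {
        return new CustomerDsl();
    }

    private static class CustomerDsl {
        private int dailyLimit;
        private String outcome;

        CustomerDsl withDailyLimitOf(int limit) {
            this.dailyLimit = limit;
            return this;
        }

        CustomerDsl attemptsToTransfer(int amount) {
            // In a real suite this would drive the production code; it is stubbed
            // here so the example is self-contained.
            this.outcome = amount > dailyLimit ? "daily limit exceeded" : "transfer accepted";
            return this;
        }

        void isToldThat(String expectedOutcome) {
            assertEquals(expectedOutcome, outcome);
        }
    }
}
```

The point isn't this particular fluent style - it's that a tester or business analyst who doesn't write Java can still read the test method and agree (or disagree) with the behaviour it describes.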
How do you become a team champion? I.e. how do you get the authority and acceptance so that people listen to you?

In my case it was just by accident - I happened to care about the tests being green and also being useful, so I moaned at people until it happened. But it's not just about nagging; you get more buy-in if other people see you doing the right things the right way, and it's not too painful for them to follow your example.

There are going to be things that you care about that you'll never get other people to care about, and this will be different from team to team. You have two choices here - if you care that much, and it bothers you that much, you have to do it yourself (often on your own time, especially if your boss doesn't buy into it). Or you have to let it go - when it comes to quality, there are so many things you could care about that it might be more beneficial to drop one cause and pick another that you can get people to care about.

For example, I wanted us to use assertThat instead of assertFalse (or true, or equals, or whatever). I tried to demo the advantages (as I saw them) of my approach to the team, and tried to push this in code reviews, but in the end the other developers weren't sold on the benefits, and from my point of view the benefits weren't big enough to force the issue. Those of us who cared used assertThat; for the rest, I was just happy people were writing and maintaining tests. So, pick your battles.

You'll be surprised at how many people do get on board with things. I thought implementing Checkstyle and setting draconian formatting standards was going to be a tough battle, but in the end people were just happy to have any standards, especially when they were enforced by the build.

Do you report test, style, coverage, etc. failures separately? Why?

We didn't fail on coverage. Enforcing a coverage percentage is a really good way to end up with crappy tests, like tests for getters/setters and constructors (by the way, if there's enough logic in your constructor that it needs a test, You're Doing It Wrong).

Generally, different types of failures are found by different tools, so for this reason alone they will be reported separately - for example, Checkstyle will fail the build if the code doesn't conform to our style standards, CodeNarc fails it for Groovy style violations, and Gradle runs the tests in a different task to these two.

What's actually important, though, is time-to-failure. Checkstyle, for example, will fail on something silly like curly braces in the wrong place. You want this to fail within seconds, so you can fix the silly mistake quickly. Ideally you'd have IntelliJ (perhaps) run your checks before the code even makes it into your CI environment. Compiler errors should, of course, fail things before you run a test, and short-running tests should fail before long-running tests. Basically, the easier it is to fix the problem, the sooner you want to know about it.

Our build was relatively small and not too complex, so we actually ran all our types of tests (integration and unit, both Groovy and Java) in a single task, because this turned out to be much quicker in Gradle (in our case) than splitting things up into a simple pipeline. You might have a reason to report stuff separately, but for me it's much more important to understand how fast I need to be aware of a particular type of failure.
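If you do want the short-running tests to run (and fail) before the long-running ones, one lightweight option in JUnit 4 is categories. This is a hypothetical sketch (the marker interface and test names are invented); the build would then be configured to run or exclude the tagged tests separately, for example with Gradle's test.useJUnit { excludeCategories ... }.

```java
import static org.junit.Assert.assertTrue;

import org.junit.Test;
import org.junit.experimental.categories.Category;

public class ConnectionStringTest {

    // Marker interface used purely as a label; it would normally live in its own file.
    public interface SlowTest {
    }

    @Test
    public void shouldRecogniseTheStandardPrefix() {
        // Fast, in-memory check - you want failures like this within seconds.
        assertTrue("mongodb://localhost".startsWith("mongodb://"));
    }

    @Category(SlowTest.class)
    @Test
    public void shouldConnectToARealServer() {
        // Slow, needs real infrastructure - the build can defer or exclude
        // anything tagged SlowTest so the cheap failures surface first.
    }
}
```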
Sometimes I find myself modifying code design and architecture to enable testing. How can I avoid damaging the design?

This is a great question, and a common one too. The short answer is: in general, writing code that's easier to test leads to a cleaner design anyway (for example, dependency injection in the appropriate places). If you find you need to rip your design apart to test it, there's a smell there somewhere - either your design isn't following SOLID principles, or you're trying to test the wrong things.

Of course, the common example here is testing private methods[1] - how do you test these without exposing secrets? For me, if it's important enough to be tested it's important enough to be exposed in some way - it might belong in some sort of util or helper (right now I'm not going to go into whether utils or helpers are, in themselves, a smell), in a smaller class that only provides this sort of functionality, or simply in a protected method. Or, if you're testing with Groovy, you can access private methods anyway, so this becomes a moot point (i.e. your testing framework may be limiting you).

In another story from LMAX, we found we had created methods just for testing. It seemed a bit wrong to have these methods only available for testing, but later on down the line we needed access to many of these methods In Real Life (well, from our admin app), so our testing had "found" a missing feature. When we came to implement it, it was pretty easy, as we'd already done most of the work for testing.

My co-workers often point to a lack of end-to-end testing as the reason why a lot of bugs get out to production, even though they don't have many unit tests or integration tests either. What, in your experience, is a good balance between unit tests, integration tests and end-to-end testing?

Hmm, sounds to me like "lack of tests" is your problem! This is a big (and contentious!) topic. Martin Fowler has written about it, Google wrote something I completely disagree with (so I'm not even going to link to it, but you'll find references in the links in this paragraph), and my ex-colleague Adrian talks about what we, at LMAX, meant by end-to-end tests. I hope that's enough to get you started; there's plenty more out there too.

How did you go about getting buy-in from the team to use Spock?

I cover this in my other presentation on the topic. The short version is: I did a week-long spike to investigate whether Spock would make testing easier for us, showed the pros and cons to the whole team, and then led by example, writing tests that (I thought) were more readable than what we had before and, probably most importantly, much easier to write than what we were previously doing. I basically got buy-in by showing how much easier it was for us to use the tool than even JUnit (which we were all familiar with).

It did help that we were already using Gradle, so we already had a development dependency on Groovy. It also helped that adding Spock made no changes to the dependencies of the final Jar, which was very important. Over time, further buy-in (certainly from management) came when the new tests started catching more errors - usually regressions in our code or regressions in the server's overnight builds. I don't think it was Spock specifically that caught more problems - I think it was writing more tests, and better tests, that caught the issues.

Can we do data-driven style tests in frameworks like JUnit or Cucumber?

I don't think you can in JUnit (although maybe there's something out there). I believe someone told me you can do it in TestNG.
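For anyone curious, TestNG's data providers look roughly like the sketch below (a hypothetical example, not code from our project). The closest thing in JUnit 4 is the Parameterized runner, which does exist but is noticeably more verbose than either this or Spock's where: blocks.

```java
import static org.testng.Assert.assertEquals;

import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

public class AdditionTest {

    // Each row of this table becomes one invocation of the test method below.
    @DataProvider(name = "additions")
    public Object[][] additions() {
        return new Object[][]{
                {1, 2, 3},
                {5, 7, 12},
                {-1, 1, 0},
        };
    }

    @Test(dataProvider = "additions")
    public void shouldAddTwoNumbers(int a, int b, int expectedSum) {
        // TestNG's assertEquals takes (actual, expected)
        assertEquals(a + b, expectedSum);
    }
}
```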
Are there drawbacks to having tests that only run in CI? E.g. I have Java 8 on my machine, but the test requires Java 7.

Yes, definitely - the drawback is time. You have to commit your code to a branch that is being checked by CI and wait for CI to finish before you find the error. In practice we found very little that was different between Java 7 and 8, for example, but this is a valid concern (otherwise you wouldn't be testing a complex matrix of dependencies at all). In our case, our Java 6 driver used Netty for async capabilities, as the stuff we were using from Java 7 wasn't available; this was clearly a different code path that wasn't tested by us locally, as we were all running Java 8.

Probably more important for us was that we were testing against at least three different major versions of the server, which all supported different features (and had different APIs). I would often find I'd broken the tests for version 2.2 because I'd only been running them on 2.6, and had either forgotten to turn off the new tests for the old server versions or hadn't realised the new functionality wouldn't work there. So the main drawback is time - it takes a lot longer to find out about these errors. There are a few ways to get around this:

- Commit often!! And to a branch that's actually going to be run by CI.
- Make your build as fast as possible, so you get failures fast (you should be doing this anyway).
- You could set up virtual machines locally or somewhere cloudy to run these configurations before committing, but that sounds kinda painful (and to my mind defeats a lot of the point of CI).
- I set up Travis on my fork of the project, so I could have that running a different version of Java and MongoDB when I committed to my own fork - I'd be able to see some errors before they made it into the "real" project.
- If you can, you probably want these specific tests run first so they can fail fast. E.g. if you're running a Java 6 & MongoDB 2.2 configuration on CI, run the tests that only work in that environment first. This would probably need some Gradle magic, and/or might need you to separate these tests into a different set of folders. The advantage of this approach, though, is that if you set up some aliases on your local machine you can sanity-check just these special cases before checking in. For example, I had aliases to start MongoDB versions/configurations from a single command, and to set JAVA_HOME to whichever version I wanted.

Do you have any tips for unit tests that pass on dev machines but not on Jenkins, because it's not as powerful as our own machines? E.g. synchronous calls time out on the Jenkins builds intermittently.

Erk! Yes, not uncommon - and no, not really. We had our timeouts set longer than I would have liked to prevent these sorts of errors, and they still intermittently failed. You can also set some sort of retry on the test, and get your build system to re-run those that fail to see if they pass later (there's a rough sketch of this below) - it's kinda nasty though. At LMAX they were able to take testing seriously enough to really invest in their testing architecture, and, of course, this is The Correct Answer. It's just often very difficult to sell.
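On the retry idea: a minimal sketch of what a retry rule might look like in JUnit 4 (the class is invented for illustration). You'd use it by adding @Rule public RetryRule retry = new RetryRule(3); to the flaky test class - but treat it as a band-aid while you hunt down the real cause.

```java
import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

/** Re-runs a failing test a few times before giving up. Assumes maxAttempts >= 1. */
public class RetryRule implements TestRule {
    private final int maxAttempts;

    public RetryRule(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    @Override
    public Statement apply(final Statement base, final Description description) {
        return new Statement() {
            @Override
            public void evaluate() throws Throwable {
                Throwable lastFailure = null;
                for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                    try {
                        base.evaluate();
                        return;                 // the test passed, we're done
                    } catch (Throwable failure) {
                        lastFailure = failure;  // remember it and try again
                    }
                }
                throw lastFailure;              // still failing after all attempts
            }
        };
    }
}
```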
If you ask where the tests are, and the dev asks "is the code correct?", and you say yes, and then the dev asks why you're delaying shipping value - how do you manage that?

These are my opinions:

- Your code is not complete without tests that show me it's complete.
- Your code might do what you think it's supposed to do right now, but given Shared Code Ownership, anyone can come in and change it at any time; you want tests in place to make sure they don't change it to break what you thought it did. The tests are not so much to show it works right now - the tests are to show it continues to work in the future.
- Having automated tests will speed you up in future. You can refactor more safely, you can fix bugs and know almost immediately if you broke something, and you can read from the test what the author of the code thought the code should do, getting you up to speed faster.
- You don't know you're shipping value without tests - you're only shipping code (to be honest, you never know if you're shipping value until much later on, when you also analyse whether people are even using the feature).
- Testing almost never slows you down in the long run. Show me the bits of your code base which are poorly tested, and I bet I can show you the bits of your code base that frequently have bugs (either because the code is not really doing what the author thinks, or because subsequent changes break things in subtle ways).

If you say code is hard to understand, and the dev asks if you seriously don't understand the code, how do you explain that you mean "easy to understand without thinking" rather than "can I compile this in my head"?

I have zero problem with saying "I'm too stupid to understand this code, and I expect you're much smarter than me for writing it. Can you please write it in a way so that a less smart person like myself won't trample all over your beautiful code at a later date through lack of understanding?"

By definition, code should be easy to understand by someone who's not the author. If someone who is not the author says the code is hard to understand, then the code is hard to understand. This is not negotiable. This is what code reviews or pair programming should address.

What is effective nagging like? (Whether or not you get what you want.)

Mmm, good question. Off the top of my head:

- Don't make the people who are the target of the nagging feel stupid - they'll get defensive. If necessary, take the burden of "stupidity" on yourself. E.g. "I'm just not smart enough to be able to tell if this test is failing because the test is bad or because the code is bad. Can you walk me through it and help me fix it?"
- Do at least your fair share of the work, if not more. When I wanted to get the code to a state where we could fail the build on style errors, I fixed 99% of the problems and delegated the handful of remaining ones that I just didn't have the context to fix. In the face of three errors each to fix, the team could hardly say "no" after I'd fixed over 6,000.
- Explain why things need to be done. Developers are adults and don't want to be treated like children. Give them a good reason and they'll follow the rules. The few times I didn't have good reasons, I could not get the team to do what I wanted.
- Find carrots and sticks that work. At LMAX, a short e-mail at the start of the day summarising the errors that had happened overnight, who seemed to be responsible, and whether they looked like real errors or intermittencies was enough to get people to fix their problems[2] - they didn't like to look bad, but they also had enough information to get right on it without wading through all the build info. On occasion, when people were ignoring this, I'd turn up to work with bags of chocolate that I'd bought with my own money, offering chocolate bars to anyone who fixed up the tests. I was random with my carrot offerings so people didn't game the system.
- Give up if it's not working. If you've tried to phrase the "why" in a number of ways, if you've tried to show examples of the benefits, if you've tried to work the results you want into a process, and it's still not getting done, just accept the fact that this isn't working for the team. Move on to something else, or find a new angle.

[1] I had a colleague at LMAX who was working with a hypothesis that All Private Methods Were Evil - they're only shareable within a single class, so they provide no reuse elsewhere, and if you have the same bit of code being called multiple times from within the same class (but it's not valuable elsewhere) then maybe your design is wrong. I'm still pondering this specific hypothesis 4 years on, and I admit I see its pros and cons.

[2] This worked so well that the process was automated by one of the guys and turned into a tool called AutoTrish, which as far as I know is still used at LMAX. Dave Farley talks about it in some of his Continuous Delivery talks.

Resources

- My talk that specifically looks at the advantages of Spock over JUnit, plus some Spock-specific resources.
- I love Jay Fields' book Working Effectively With Unit Tests - if I could have made the whole team read this before moving to Spock, we might have stuck with JUnit.
- Go read everything Adrian Sutton has written about testing at LMAX. If not everything, then definitely Abstraction by DSL and Making End-to-End Tests Work.
- If you can't make it all the way through Dave Farley and Jez Humble's excellent Continuous Delivery book, do take a look at one of Dave's presentations on the subject, for example The Rationale for Continuous Delivery or The Process, Technology and Practice of Continuous Delivery. My own talk was about testing, but I'm working off the assumption that you're at least running some sort of Continuous Integration, if not Continuous Delivery.
- Martin Fowler has loads of interesting and useful articles on testing.

Abstract

What can you do to help developers a) write tests, b) write meaningful tests and c) write readable tests? Trisha will talk about her experiences of working in a team that wanted to build quality into their new software version without a painful overhead - without a QA/testing team, without putting in place any formal processes, without slavishly improving the coverage percentage.

The team had been writing automated tests and running them in a continuous integration environment, but they were simply writing tests as another box to tick, there to verify the developer had done what the developer had aimed to do. The team needed to move to a model where tests provided more than this. The tests needed to:

- Demonstrate that the library code was meeting the requirements
- Document, in a readable fashion, what those requirements were and what should happen in non-happy-path situations
- Provide enough coverage that a developer could confidently refactor the code

This talk will cover how the team selected a new testing framework (Spock, a framework written in Groovy that can be used to test JVM code) to aid with this effort, and how they evaluated whether this tool would meet the team's needs. And now, two years after starting to use Spock, Trisha can talk about how both the tool and the shift in the purpose of the tests have affected the quality of the code. And, interestingly, the happiness of the developers.
June 29, 2015
by Trisha Gee