This is a topic that's been brought up in the past (so, no, I'm not
referring to the current lack of any tests running on Linux treeherder):
we have a few key pieces of our infrastructure that are untested, indeed
untestable without major developer investment.
To wit:
LDAP: Unlike the other major mailnews protocols, LDAP is based not
on a more-or-less textual protocol but on a complex ASN.1 binary
protocol. This means that writing a fakeserver is by no means trivial. A
better solution is to run a real LDAP server with some real data, but
that's something we can't do in our current frameworks. And setting up
OpenLDAP is painful if you're a novice at system administration or LDAP
itself. Getting LDAP working in automated tests requires infrastructure
work: probably the easiest way to spin up an LDAP server is to create it
in a Docker container (which should be more or less possible on every
major desktop OS, although some OSes may involve transparently starting
VMs), something our infrastructure doesn't allow. Another option might
be to find a toy LDAP server implementation and import it into the
tree to run as a fakeserver, but that doesn't look to me like it would
involve substantially less effort than the previous approach.
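To make the Docker idea concrete, here's a rough sketch of what a test
harness could do. The osixia/openldap image and its environment
variables are assumptions on my part, and the details (port numbers,
timeouts) are purely illustrative:

  # Hypothetical harness: spin up a disposable LDAP server in Docker for
  # the duration of a test run. Assumes the Docker CLI and the
  # osixia/openldap image; all names and values are illustrative.
  import socket
  import subprocess
  import time

  def start_ldap_container(port=3389):
      container_id = subprocess.check_output([
          "docker", "run", "--rm", "-d",
          "-p", "%d:389" % port,
          "-e", "LDAP_ORGANISATION=Test Org",
          "-e", "LDAP_DOMAIN=test.invalid",
          "-e", "LDAP_ADMIN_PASSWORD=secret",
          "osixia/openldap",
      ]).strip().decode()
      # Poll until the server accepts connections, or give up.
      for _ in range(30):
          try:
              socket.create_connection(("127.0.0.1", port),
                                       timeout=1).close()
              return container_id
          except OSError:
              time.sleep(1)
      subprocess.call(["docker", "stop", container_id])
      raise RuntimeError("LDAP server never came up")

  def stop_ldap_container(container_id):
      subprocess.call(["docker", "stop", container_id])

A test could then point an LDAP address book at 127.0.0.1:3389 for the
duration of the run.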
MAPI: In this case, I'm referring to the ability to use the MAPI C
interface to call MAPISendMessage (or MAPISendMessageW, which we don't
implement yet) to send a message via Thunderbird. It's actually fairly
easy in principle to test this: the MAPI library in question can be
LoadLibrary'd and its entry points invoked directly to trigger the
calls.
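To sketch what that could look like without writing a C harness
(Python's ctypes standing in for LoadLibrary; the structure layouts
follow the Simple MAPI documentation, and I'm using the MAPISendMail
entry point here, so treat the details as illustrative):

  # Hypothetical test driver: load the registered Simple MAPI library
  # and ask it to send a message, the way an external app would.
  # Windows-only; layouts follow the Simple MAPI MapiMessage docs.
  import ctypes
  from ctypes import wintypes

  class MapiRecipDesc(ctypes.Structure):
      _fields_ = [
          ("ulReserved", wintypes.ULONG),
          ("ulRecipClass", wintypes.ULONG),   # 1 = MAPI_TO
          ("lpszName", ctypes.c_char_p),
          ("lpszAddress", ctypes.c_char_p),
          ("ulEIDSize", wintypes.ULONG),
          ("lpEntryID", ctypes.c_void_p),
      ]

  class MapiMessage(ctypes.Structure):
      _fields_ = [
          ("ulReserved", wintypes.ULONG),
          ("lpszSubject", ctypes.c_char_p),
          ("lpszNoteText", ctypes.c_char_p),
          ("lpszMessageType", ctypes.c_char_p),
          ("lpszDateReceived", ctypes.c_char_p),
          ("lpszConversationID", ctypes.c_char_p),
          ("flFlags", wintypes.ULONG),
          ("lpOriginator", ctypes.POINTER(MapiRecipDesc)),
          ("nRecipCount", wintypes.ULONG),
          ("lpRecips", ctypes.POINTER(MapiRecipDesc)),
          ("nFileCount", wintypes.ULONG),
          ("lpFiles", ctypes.c_void_p),
      ]

  def send_test_message():
      mapi = ctypes.WinDLL("mapi32")  # the registered MAPI client
      mapi.MAPISendMail.argtypes = [
          ctypes.c_void_p, ctypes.c_void_p,
          ctypes.POINTER(MapiMessage), wintypes.ULONG, wintypes.ULONG]
      mapi.MAPISendMail.restype = wintypes.ULONG
      recip = MapiRecipDesc(ulRecipClass=1,  # MAPI_TO
                            lpszName=b"Test Recipient",
                            lpszAddress=b"SMTP:test@example.invalid")
      msg = MapiMessage(lpszSubject=b"MAPI test",
                        lpszNoteText=b"Sent via Simple MAPI.",
                        nRecipCount=1,
                        lpRecips=ctypes.pointer(recip))
      # MAPISendMail(session, uiParam, message, flags, reserved)
      status = mapi.MAPISendMail(0, 0, ctypes.byref(msg), 0, 0)
      assert status == 0  # SUCCESS_SUCCESS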
System address books and system files for import: When it comes to
system integration, it's pretty hard to mock the environments in full
enough detail for tests, particularly Windows' Extended MAPI interface.
This is probably an example where we're not looking to accommodate these
tests in our current suites but rather add some new suites that can test
these details. System integration is very difficult to test, and for
most system integration components, the value-add of the tests is not
worth the cost of maintaining the test infrastructure. However, these
pieces, particularly the conversion from MAPI to message/rfc822 during
import, are complex enough that I think they are worth testing in some
fashion. As for how to do it, that is more difficult. The only real
solution is VMs (or VM-like solutions such as Docker, which only
provides a system integration story for Linux and maybe Windows), but
that is not a sufficient answer for how to make the tests runnable by
contributors, although it is as far as my system administration
knowledge goes.
S/MIME, and certificates in general: I'm dropping these two things
into the same category because they both run into the same major
issue--certificate handling is the major problem for getting these tests
running. I've chosen to use a recent spurt of free time to start
reformulating our S/MIME policy. On the one hand, I've managed to fix some
long-standing issues (such as treating encrypt-and-sign the same as
sign-and-encrypt when they are not in fact the same [1]). On the other
hand, having had to spend so long mangling the code just to get to that
point has made it starkly clear that this is code that's very old,
very complex, has very critical side effects, and is completely
untested. It's also turned up a few things that I'd swear are
regressions (albeit over the past 5 years or so, so good luck tracking
them down :-(). Generating test messages for S/MIME requires generating
certificates and importing them, which has historically been somewhat
difficult to pull off, since it tends to spawn UI. Another problem is
that test certificates have to be regenerated continually to keep up
with validity periods and newly required algorithm changes.
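One way off that treadmill would be to mint fresh, short-lived
certificates at test-run time instead of committing them. A rough
sketch, shelling out to OpenSSL (the -addext flag needs a reasonably
recent OpenSSL; everything here is illustrative):

  # Hypothetical: mint a short-lived, self-signed S/MIME certificate at
  # test-run time, so nothing committed to the tree can expire on us.
  import subprocess

  def make_smime_cert(email, key_path, cert_path, days=2):
      subprocess.check_call([
          "openssl", "req", "-x509", "-newkey", "rsa:2048", "-nodes",
          "-keyout", key_path, "-out", cert_path,
          "-days", str(days),
          "-subj", "/CN=Test User/emailAddress=" + email,
          # Mark the certificate as usable for S/MIME.
          "-addext", "extendedKeyUsage=emailProtection",
      ])

  make_smime_cert("alice@example.invalid", "alice.key", "alice.pem")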
Testing against real servers: Related to our lack of LDAP tests, it's
useful to test against real servers. Indeed, when I first started
writing the fakeserver for tests, I did so in part because I thought I
could somewhat mimic the salient characteristics of real servers (oh, to
be young and naive again). Actually setting up servers can be annoying,
particularly for people with weak system administration skills. I
started working on a project to build some real servers and package them
in Docker containers, together with some sample testing data that might
be used for performance tests in the future (better to test performance
against a real implementation than a hacked-up variant, IMHO). The main
remaining work is fleshing out the implementations of the other servers
and collecting more representative data.
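As a sketch of the kind of probe such containers would enable (a
Dovecot container on localhost:10143 with a test account is an
assumption here, as are all the numbers):

  # Hypothetical smoke/performance probe against a Dockerized IMAP
  # server; seeds a folder and times a bulk fetch.
  import imaplib
  import time

  def time_bulk_fetch(host="localhost", port=10143,
                      user="testuser", password="password"):
      imap = imaplib.IMAP4(host, port)
      imap.login(user, password)
      imap.select("INBOX")
      # Seed the folder with a pile of identical messages.
      msg = b"Subject: perf seed\r\n\r\n" + b"x" * 10000 + b"\r\n"
      for _ in range(200):
          imap.append("INBOX", None, None, msg)
      # Time fetching every message body in one go.
      start = time.monotonic()
      imap.fetch("1:*", "(RFC822)")
      elapsed = time.monotonic() - start
      imap.logout()
      return elapsed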
This isn't an exhaustive list of every untested line of code in our
codebase. We almost certainly have lousy coverage of I/O failures, for
example. But it does cover the major features that deserve tests,
particularly if anyone is going to touch them in the future. Some of the
other features without tests (GSSAPI, compression in IMAP, etc.) just
aren't as important to test, and the value of a test could well be
outweighed by the pain of having to maintain it. As I once saw in a
blog, the main value of a test is that it fails: if the test never fails
when you make changes, then it's just a waste of time.
Right now, I'm looking again at getting the S/MIME test infrastructure
ready, since I'm interested in doing a more thorough overhaul of our
S/MIME policy, and tests are invaluable for that. This requires
generating some valid and invalid certificates, as well as triggering
some things like OCSP failures (which is outside my knowledge level,
and probably that of anyone else on the project, alas). It also requires
tickling CMS blobs (the actual cryptographic structures used in S/MIME)
in more specific ways than our NSS infrastructure allows. What this will
probably require is building a tool that can generate the CMS blobs and
package them in an email template, likely using Python libraries in the
vein of pycert.py and pykey.py for Firefox's certificate tests (since we
already have pyasn1 in the tree and I don't want to write DER-encoding
myself). However, since we need to do actual cryptography, we'll likely
end up committing test files that are generated manually, by running a
command that calls OpenSSL to produce the CMS blobs. I don't see any
easy way around that.
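To illustrate the shape of such a tool (the OpenSSL invocation is the
unavoidable part; the wrapper and the template are invented for the
example):

  # Hypothetical generator: sign a body with OpenSSL's CMS support and
  # wrap the resulting blob in an S/MIME message template.
  import base64
  import subprocess

  def make_signed_message(body_path, cert_path, key_path):
      # Produce a DER-encoded CMS SignedData blob, content embedded.
      der = subprocess.check_output([
          "openssl", "cms", "-sign", "-nodetach",
          "-in", body_path,
          "-signer", cert_path, "-inkey", key_path,
          "-outform", "DER",
      ])
      b64 = base64.encodebytes(der).decode("ascii")
      return (
          "From: alice@example.invalid\r\n"
          "To: bob@example.invalid\r\n"
          "Subject: S/MIME test\r\n"
          "MIME-Version: 1.0\r\n"
          "Content-Type: application/pkcs7-mime; "
          "smime-type=signed-data;\r\n"
          '\tname="smime.p7m"\r\n'
          "Content-Transfer-Encoding: base64\r\n"
          "\r\n" + b64
      )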
Thoughts/comments/questions/concerns?
[1] Long aside: these two formulations are not equivalent, and that's
because encrypting and signing have very important limitations about
what they cryptographically guarantee. Encrypting first means that the
signature covers only knowledge of the ciphertext, not the plaintext,
and doesn't guarantee that the signer actually knows what the message
says. The most trivial way to exploit this would be for Eve to
intercept the message, strip Alice's signature, and attach her own to
the end, thus purporting to have originated the message.
Sign-and-encrypt has some problems of its own (which basically boil down
to "security is hard even for experts"), but it's far safer than
encrypt-and-sign.
--
Joshua Cranmer
Thunderbird module owner
DXR coauthor
Joshua Cranmer wrote on 06.09.17 18:17:
Thoughts/comments/questions/concerns?
IMHO, the primary thing we're lacking is integration tests.
As you mentioned and found out, and as I did myself with the fakeserver,
correctly testing everything would mean implementing the entire
(externally visible) feature set of every server we support, including
all bugs. Obviously, that's a hopeless endeavor.
More importantly, even if we manage to do that, it doesn't do us any
good to have covered past GMail quirks once Google introduces another
one and breaks our users overnight. (More likely in reality, it's
somebody like Verizon, actually.) Once that happens, we'd want to know
immediately. To our users, it doesn't matter whether we broke it or the
ISP broke it; all they want is for it to work. Integration tests test
exactly that.
So, I propose to have a test suite where we use test accounts on
important ISPs, and let Thunderbird execute some common functions
(login, mail fetch, mail polling, send email to another test account,
etc.) and check that they work.
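As a sketch of what one such check might look like (hostnames and
credentials are placeholders, and a real suite would drive Thunderbird
itself rather than Python's protocol libraries):

  # Hypothetical end-to-end check: send a uniquely tagged message from
  # one test account and verify that it shows up in another.
  import imaplib
  import smtplib
  import time
  import uuid
  from email.message import EmailMessage

  def roundtrip(smtp_host, imap_host, sender, sender_pw, rcpt, rcpt_pw):
      tag = uuid.uuid4().hex
      msg = EmailMessage()
      msg["From"], msg["To"] = sender, rcpt
      msg["Subject"] = "probe " + tag
      msg.set_content("integration test probe")

      with smtplib.SMTP_SSL(smtp_host) as smtp:
          smtp.login(sender, sender_pw)
          smtp.send_message(msg)

      # Poll the receiving account for up to two minutes.
      deadline = time.time() + 120
      while time.time() < deadline:
          imap = imaplib.IMAP4_SSL(imap_host)
          try:
              imap.login(rcpt, rcpt_pw)
              imap.select("INBOX")
              status, hits = imap.search(None, "SUBJECT",
                                         '"probe %s"' % tag)
              if status == "OK" and hits[0]:
                  return True
          finally:
              imap.logout()
          time.sleep(10)
      return False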
If one of the tests breaks, we'd have something to look into. These
tests would have a higher false-positive ratio, i.e. they'd sometimes be
orange without good reason (e.g. a test account might be blocked, or a
sent mail might be caught in a spam filter), but they would give us an
indication to look into it. Most importantly, we'd immediately notice
when an ISP stops working, be it because they changed configurations,
or introduced a bug, or whatever.
Personally, I'd find that very helpful.
Ben
On 06/09/2017 18:17, Joshua Cranmer wrote:
MAPI: In this case, I'm referring to the ability to use the MAPI C interface to call MAPISendMessage (or MAPISendMessageW, which we don't implement yet) to send a message via Thunderbird. It's actually fairly easy in principle to test this: the MAPI library in question can be LoadLibrary'd and its entry points invoked directly to trigger the calls.
Related to https://bugzilla.mozilla.org/show_bug.cgi?id=547027 ?
On 9/6/2017 1:08 PM, Jörg Knobloch wrote:
On 06/09/2017 18:17, Joshua Cranmer wrote:
MAPI: In this case, I'm referring to the ability to use the MAPI C
interface to call MAPISendMessage (or MAPISendMessageW, which we
don't implement yet) to send a message via Thunderbird. It's actually
fairly easy in principle to test this: the MAPI library in question
can be LoadLibrary'd and its entry points invoked directly to trigger
the calls.
Related to https://bugzilla.mozilla.org/show_bug.cgi?id=547027 ?
Yes.
--
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist
On 9/6/2017 12:29 PM, Ben Bucksch wrote:
Joshua Cranmer wrote on 06.09.17 18:17:
Thoughts/comments/questions/concerns?
IMHO, the primary thing we're lacking is integration tests.
As you mentioned and found out, and as I did myself with the
fakeserver, correctly testing everything would mean implementing the
entire (externally visible) feature set of every server we support,
including all bugs. Obviously, that's a hopeless endeavor.
More importantly, even if we manage to do that, it doesn't do us any
good to have covered past GMail quirks once Google introduces another
one and breaks our users overnight. (More likely in reality, it's
somebody like Verizon, actually.) Once that happens, we'd want to know
immediately. To our users, it doesn't matter whether we broke it or
the ISP broke it; all they want is for it to work. Integration tests
test exactly that.
So, I propose to have a test suite where we use test accounts on
important ISPs, and let Thunderbird execute some common functions
(login, mail fetch, mail polling, send email to another test account,
etc.) and check that they work.
I disagree here. For starters, that means our infrastructure is going to
have to rely on the correctness of others' code, and will fail through
no fault of our own. Our developers, or should I say Jörg specifically,
already spend too much time tracking down others' failures, so adding
more is not a good thing. In addition, what's the point? Most of our
nightly testers are likely to be using one of the larger ISP
configurations anyways. If GMail broke something, how much advance
notice would we really have in our test infrastructure versus someone
filing a bug saying "help, this doesn't work with GMail"? It's also
supremely useless for doing preemptive testing--if I'm making some major
changes locally, I want to run tests to make sure I'm not breaking anything.
That's why I'm specifically proposing that, instead of using major ISPs,
we use the major server implementations, and put them in containers that
can be run on test machines or on local developers' machines. This also
means we can leverage them to build performance tests (you ABSOLUTELY do
not want external services impinging on your performance tests), which
is another thing we are sorely lacking.
--
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist
Joshua Cranmer 🐧 wrote on 06.09.17 19:52:
our infrastructure is going to have to rely on the correctness of
others' code, and will fail through no fault of our own.
Right. But if the ISP makes an incorrect change that breaks us, our
application and our users will fail at the same time. So, it is our problem.
Our developers, or should I say Jörg specifically, spend too much time
trying to track down the others' failures, so adding more is not a
good thing.
I didn't say that it would have to be our developers. That's generally a
job for QA, not dev, and may be done by the community or some other
employee.
In addition, what's the point? Most of our nightly testers are likely
to be using one of the larger ISP configurations anyways.
My experience with ISPDB tells me otherwise.
And even if they do, how do they know that it's not an individual
problem? How do we know? By the time we find out and take the problem
seriously, days or weeks will have passed. That's too slow. If users
can't get their email, they change mail clients after 2 days.
If GMail broke something
We might know quickly about Gmail, but what about GMX?
It's also supremely useless for doing preemptive testing--if I'm
making some major changes locally, I want to run tests to make sure
I'm not breaking anything.
Yes, and you would be able to do that with the integration tests,
because you'd be able to trigger try builds which run the tests. Right
now, you push the code change, wait, and break users if you overlooked
something.
That's why I'm specifically proposing that, instead of using major
ISPs, we use the major server implementations, and put them in
containers that can be run on test machines or on local developers'
machines.
You cannot run Gmail's servers, nor those of GMX, nor WEB.DE, nor Yahoo,
nor Verizon.
Ben
It sounds like you have a different idea of what you want these kinds of
tests to be than I do. And since you're saying that developers won't
need to worry about those test results, and I'm speaking with my
developer hat on, I'm not going to continue the discussion down this
path. Moving back to an earlier point that you didn't respond to, and
whose answer I am curious about:
One value in having containerized server implementations is as the basis
of performance tests, which is an entire test framework that we are
sorely missing.
--
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist
On 9/6/2017 11:04 AM, Ben Bucksch wrote:
Joshua Cranmer 🐧 wrote on 06.09.17 19:52:
our infrastructure is going to have to rely on the correctness of
others' code, and will fail through no fault of our own.
Right. But if the ISP makes an incorrect change that breaks us, our
application and our users will fail at the same time. So, it is our
problem.
Our developers, or should I say Jörg specifically, spend too much
time trying to track down the others' failures, so adding more is not
a good thing.
I didn't say that it would have to be our developers. That's generally
a job for QA, not dev, and may be done by the community or some other
employee.
While I believe that integration tests would be valuable (and I have a
lot of experience with running automated tests against real servers with
ExQuilla), I think this is going to be a matter of trading off resource
demands. I really doubt that this is the sort of thing that community
volunteers are likely to step forward and lead for the long term. Given
our likely available resources, this is not an area we are likely to be
able to invest in. Better to allocate available resources to developers
who can respond quickly to reports of issues.
:rkent
R Kent James wrote on 06.09.17 23:05:
While I believe that integration tests would be valuable (and I have a
lot of experience with running automated tests against real servers
with ExQuilla), I think this is going to be a matter of trading off
resource demands. I really doubt that this is the sort of thing that
community volunteers are likely to step forward and lead for the long
term. Given our likely available resources, this is not an area we are
likely to be able to invest in. Better to allocate available resources
to developers who can respond quickly to reports of issues.
Are you concerned about server costs or implementation time?
I think both are relatively minor, compared to alternatives.
Server costs: The tests would run only 4 times a day, and on request by
a developer.
Implementation time: Would be fairly fast. We only need to implement
4-5 steps (create account, fetch mail, poll mail, send email,
check that it arrived), and run that for a few dozen accounts, to get
basic test coverage. Nonetheless, because we run these tests end-to-end
and on various servers, we get a lot of test coverage.
Ben
On 9/6/2017 5:19 PM, Ben Bucksch wrote:
Implementation time: Would be fairly fast. We only need to implement
4-5 steps (create account, fetch mail, poll mail, send
email, check that it arrived), and run that for a few dozen accounts,
to get basic test coverage. Nonetheless, because we run these tests
end-to-end and on various servers, we get a lot of test coverage.
It takes a significant amount of time to get anything running on
automated infrastructure, based on my prior experience with Mozilla
automation and automation elsewhere. You're also not accounting for the
time it will take to track down automation failures; automation tends to
have a lot of teething problems, particularly early in the process.
A more serious problem is your belief that repeating very basic tests on
lots of servers is in any way good test coverage. The basic tests are so
shallow that the diversity of servers is illusory: servers really don't
act differently in basic scenarios; they act differently when things get
hard. When you put message/rfc822 attachments in messages and
base64-encode them. When you have two simultaneous connections to the
same folder in IMAP and start deleting things with one connection and
adding them in the other. When massive messages get packet boundaries
that routinely sit in the middle of the CRLF line endings. Quite
frankly, for the tests you've suggested, by the second real-world
server, running more of them isn't going to tell me anything interesting
about implementation diversity. The only thing they tell you is "is the
ISPDB accurate?", which isn't really testing Thunderbird at all.
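For contrast, the kind of scenario I'd actually want covered looks more
like this rough sketch (server details and account setup are
assumptions; two connections racing on one folder):

  # Hypothetical "hard mode" test: two simultaneous connections to the
  # same IMAP folder, one deleting while the other appends. Assumes a
  # local test server with at least one message already in INBOX.
  import imaplib

  HOST, USER, PASSWORD = "localhost", "testuser", "password"

  a = imaplib.IMAP4(HOST, 143)
  b = imaplib.IMAP4(HOST, 143)
  for conn in (a, b):
      conn.login(USER, PASSWORD)
      conn.select("INBOX")

  # Connection A deletes message 1 while connection B appends one.
  a.store("1", "+FLAGS", r"(\Deleted)")
  b.append("INBOX", None, None,
           b"From: x@example.invalid\r\n\r\nracing append\r\n")
  a.expunge()
  b.noop()  # let B learn about A's expunge

  # Both connections should now agree on the folder's contents.
  assert a.search(None, "ALL") == b.search(None, "ALL")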
--
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist