No more Patreon for me…

Going on with it would mean that from now on, I would need to put some serious time into it, creating a bunch of new material. I have plenty of ideas, but if I create this for only ten people, then why bother? That set me to think: why did this not work? I can think of a bunch of reasons:

  1. It wasn’t as good, fun or original as I thought it was. I tried to come up with angles to the topics that you don’t often see in the other online material, but who knows that this is what people would sign up for?
  2. The audience I was trying to reach was too much of a niche. I aimed at people who already knew some data science, but still were surprised by some of the topics I presented.
  3. I didn’t do enough marketing or advertisement. In the first weeks I spammed a bunch of platforms with it, and basically all of my membership came from that. After that, I sometimes posted teasers every here and there, but when this doesn’t seem fruitful, I get demotivated quickly. (If someone knows how to do this properly, I’d be all ears!)

When I asked my members what they would want to see treated in the episodes, I got requests for two types of things, that I wasn’t going to deliver:

  1. Examples from real life. I have done these things in the wild, but obviously can’t get too specific about use cases, and certainly can’t share the data I have worked with in the past. These are typically sensitive…
  2. Data engineering. Building a model is easy, deploying it isn’t, right? That’s probably right (although I think there is much to gain still in the sciency part, rather than engineering part of data science), but I just don’t enjoy the engineering part that much, so in an experiment for fun, I am not going to treat stuff I don’t like. Stubborn me.

Either way, the journey ends here. The site will go down, the Twitter account will be deleted and I don’t need to worry about finishing content on time anymore. It was fun, I learned more than my members did and at some point in the future I’ll just post all the stuff that I created out in the open on Github. The money my patrons were charged (or what is left of it) will go to open source projects. On to the next adventure!

New laptop, new Linux distro!

I’m a Linux user. Perhaps that should have read “I’m a Linux fan”. At work (for the last 8 years) I was forced to have a Windows workstation, but VMs and my personal laptops have been running Linux since 2001, and so have many of my work computers. Next month, in my new job, I’ll have a Linux machine again! I like it better than Windows, which I have used often for work, since 2013, and some versions of MacOS, which I have used for work while in the US (2010-2013). I’m not quite a distro-hopper, but I do try to switch every once in a while. Getting a new laptop is a great incentive to try something new (I have to install something anyway!) and I happened to get a new private laptop three weeks ago. It’s a Lenovo IdeaPad Gaming 3i. I’m not a gamer, but GPUs have more than one potential use (some specs in the image below). To use it properly, I need to get rid of Windows asap, obviously.

Before, I have used Fedora and RHEL, and a whole bunch of Ubuntu based systems: Ubuntu (until they started Unity), Mint, Kubuntu and Xubuntu. I thought I never was a big fan of GNOME. After my recent purchase, I was struck by the enthusiasm for Pop!_OS that I saw on the web. A distro that is basically just another Ubuntu fork, with some tweaks to the GNOME desktop environment. I was a little bit skeptical, but I was also looking to see what distro I would in September put on my work laptop (and, during work hours, I don’t want to be bothered with stuff that doesn’t work, so I’d rather try it out first). I gave it a try and was planning to try OpenSUSE or some Arch-based distro like Manjaro as well (not that I feel the need to feel very hardcore, but some people are fans, right?). Pop!_OS 21.04 was just out, so I created the bootable USB stick and went ahead. Installation was ultra-easy.

I’m not moving back, nor away! I like the look and feel of Pop!_OS very much and became a fan of a tiling window manager in just a few hours. Only marginal tweaking of default settings was needed for me, and not a lot of bloat-ware was installed (but I did remove vim, HA!). The fact that their desktop environment is called “Cosmic” and that they have a lot of desktop art in that general theme appeals to me as well. One of the really nice things is that they have an iso for computers with NVIDIA graphics, that makes the gpu work basically out of the box. Moreover, you can set the computer to use the internal graphical card (intel) only, the NVIDIA only, a hybrid scheme (internal only doesn’t let you use a second monitor…) or… a setting in which the NVIDIA card is only available for computation. Great for battery life. Also, the seamless integration of apt and flatpaks is quite nice. I feel confident to throw this on the cute Dell XPS 15 that I will get next month and I will likely be up and running in just a few hours. (And then I hope the web-based version of the whole 365 suite will not let me down…because yes, also this employer is quite MS-based in their tooling).

Terminal art, and some hints of what the desktop might look like.

So… Who’s gonna convince me of another distro to try? Suggestions, preferably with some arguments, are welcome!

The bumpy road to academia’s side entrance

I have indicated already earlier on this blog that I miss academia, and that I wouldn’t mind moving back into an academic job. I have made some attempts recently and want to reflect on the process here. Spoiler alert: I’ll start a job at the University of Amsterdam very soon!

In my journey back into academic life I have also applied less successfully twice, and for reflection on that, it is probably useful to understand my boundary conditions:

  • I left my academic career in astrophysics now roughly 8 years ago and have not done any pure science since (at least not visibly).
  • I have few, too few papers from three years of postdoc. I left my postdoc position with no intention to go back, and therefore have just dropped all three first-author papers that were in the making on the spot. They were never and will never be published.
  • I am strongly geographically bound. I can commute, but I can not move. Hence, I am bound to local options.

I have spent these last 8 years on data science and gained a fair amount of experience in that field. All that experience is in applied work. I have not done any fundamental research on data science methodology, As an aside, I have of course learned a lot about software development and team work in companies of different sizes. I have seen the process of going from a Proof-of-Concept study to building actual products in a scalable, maintainable production environment (often in the cloud) up close, very close. Much of that experience could be very useful for academia. If I (and/or my collaborators) back then had worked with standards even remotely resembling what is common in industry, science would progress faster, it would suffer much less from reproducibility issues and it would be much easier to build and use science products for a large community of collaborators.

But I digress…. The first application for an assistant professorship connected closely to some of the work I have done in my first data science job. I spent 5,5 years at a healthcare insurance provider, where some projects were about the healthcare side of things, as opposed to the insurance business. The position was shared between a university hospital and the computer science institute. I applied and got shortlisted, to my surprise. After the first interview, I was still in the race, with only one other candidate left. I was asked to prepare a proposal for research on “Data Science in Population Health” and discussed the proposal with a panel. It needed to be interesting for both the hospital as well as for the computer scientists, so that was an interesting combination of people to please. It was a lot of fun to do, actually, and I was proud of what I presented. The committee said they were impressed and the choice was difficult, but the other candidate was chosen. The main reason was supposedly my lack of a recent scientific track record.

What to think of that? The lack of track record is very apparent. It is also, I think, understandable. I have a full time job next to my private/family life, so there is very little time to build a scientific track record. I have gained very relevant experience in industry, which in fact could help academic research groups as well, but you can’t expect people to build experience in a non-academic job and build a scientific track record on the side, in my humble opinion. I was offered to compete for a prestigious postdoc-like fellowship at the hospital for which I could fine-tune my proposal. I respectfully declined, as that was guaranteed to be short-term, after which I would be without a position again. In fact, I was proud to end with the silver medal here, but also slightly frustrated about the main reason for not getting gold. If this is a general pattern, things would look a little hopeless.

As part of my job, and as a freelancer, I have spent a lot of time and effort on educational projects. I developed training material and gave trainings, workshops and masterclasses on a large variety of data science-related topics, to a large variety of audiences. Some of those were soft skill trainings, some were hard skill. Most were of the executive education type, but some were more ‘academic’ as well. When at the astronomical institute at biking distance a job opening with the title “Teaching assistant professor” appeared I was more than interested. It seemed to be aimed at Early Career Scientists, with a very heavy focus on education and education management. Contrary to far most of the job openings I have seen at astronomical institutes, I did not have to write a research statement, nor did they ask for any scientific accomplishment (at least not literally in the ad, perhaps this was assumed to go without saying). They asked for a teaching portfolio, which I could fill with an amount of teaching that must have been at least on par with successful candidates (I would guess the equivalent of 6 ECTS per year, for 3 years on end, and some smaller, but in topic more relevant stuff before that) and with excellent evaluations all across. Whatever was left of the two pages was open for a vision on teaching, which I gladly filled up as well. Another ingredient that would increase my chances was that this role was for Dutch speaking applicants and that knowledge of the Dutch educational system was considered a plus. Score and score. That should have significantly narrowed the pool of competitors. In my letter, I highlighted some of the other relevant experience I gained, that I would gladly bring into the institute’s research groups.

Right about at the promised date (I was plenty impressed!), the email from the selection committee came in! “I am sorry that we have to inform you that your application was not shortlisted.” Without any explanation given, I am left to guess what was the main issue with my application here. I wouldn’t have been overly surprised if I wasn’t offered the job, but I had good hopes of at least a shortlist, giving me the opportunity to explain in person why I was so motivated, and in my view qualified. So, were they in fact looking for a currently practicing astronomer? Was research more important than the job ad made it seem? Is my teaching experience too far from relevant, or actually not (good) enough? Dare I even question whether even this job ad was actually aiming for top-tier researchers rather than for people with just a heart (and perhaps even talent) for teaching? It’s hard to guess what the main reason was, and I shouldn’t try. One thing I am reluctantly concluding from this application is that a job in professional astronomy is hard to get for somebody who has long left the field. I think this vacancy asked for experience and skills that match my profile very well, so not even being shortlisted says a lot to me. Perhaps that’s not grounded, but that’s how it goes with sentiment, I guess. Perhaps a dedicated data science job in astronomy is still feasible, who knows.

In September, I’ll join the University of Amsterdam.

But alas, as said, I have also applied successfully. Yay! The University of Amsterdam (UvA) had an opening for a lead data scientist in the department of policy and strategy. Working for, rather than in higher education was something that previously didn’t really occur to me, but this really sounds like an opportunity to do what I like to do and do well, in the field where my heart is. The UvA is putting emphasis on data literacy in education as well as (inter-disciplinary) research. Big part of the job will be to build and maintain a network inside and outside of the university with data science communities. The Amsterdam Data Science Center fosters research that uses data science methods and meets around the corner. I will strive to take a central, or at least very visible role in that Center and be very close to academic interdisciplinary research! I’m excited! In due time, I’ll report on my experience.

Proper statistics is Bayesian

I was already convinced years ago, by videos and a series of blog posts by Jake VanderPlas (links below), that if you do inference, you do it the bayesian way. What am I talking about? Here is a simple example: we have a data set with x and y values and are interested in their linear relationship: y = a x + b. The numbers a and b will tell us something we’re interested in, so we want to learn their values from our data set. This is called inference. This linear relationship is a model, often a simplification of reality.

The way to do this you learn in (most) school(s) is to make a Maximum Likelihood estimate, for example through Ordinary Least Squares. That is a frequentist method, which means that probabilities are interpreted as the limit of the relative frequencies for very many trials. Confidence intervals are the intervals that describe where the data would lie after many repetitions of the experiment, given your model.

In practice, your data is what it is, and you’re unlikely to repeat your experiment many times. What you are after is an interpretation of probability that describes the belief you have in your model and its parameters. That is exactly what Bayesian statistics gives you: the probability of your model, given your data.

For basically all situations in science and data science, the Bayesian interpretation of probability is what suits your needs.

What, in practice, is the difference between the two? All of this is based on Bayes’ theorem, which is a pretty basic theorem following from conditional probability. Without all the interpretation business, it is a very basic result, that no one doubts and about which there is no discussion. Here it is:

Bayes' Theorem
Bayes’ theorem reads as: The probability of A, given B (the “|B” means “given that B is true”) equals the probability of B, given A times the probability of A, divided by the probability of B. Multiplying both sides with P(B) results in the probability of A and B on both sides of the equation.

Now replace A by “my model (parameters)” and B by “the data” and the difference between inference in the frequentist vs Bayesian way shines through. The Maximum Likelihood estimates P(data | model), while the bayesian posterior (the left side of the equation) estimates P (model | data), which arguably is what you should be after.

As is obvious, the term that makes the difference is P(A) / P(B), which is where the magic happens. The probability of the model is hat is called the prior: stuff we knew about our model (parameters) before our data came in. In a sense, we knew something about the model parameters (even if we knew very very little) and update that knowledge with our data, using the likelihood P(B|A). The normalizing factor P(data) is a constant for your data set and for now we will just say that it normalizes the likelihood and the prior to result in a proper probability density function for the posterior.

The posterior describes the belief in our model, after updating prior knowledge with new data.

The likelihood is often a fairly tractable analytical function and finding it’s maximum can either be done analytically or numerically. The posterior, being a combination of the likelihood and the prior distribution functions can quickly become a very complicated function that you don’t even know the functional form of. Therefore, we need to rely on numerical sampling methods to fully specify the posterior probability density (i.e. to get a description of how well we can constrain our model parameters with our data): We sample from the priors, calculate the likelihood for those model parameters and as such estimate the posterior for that set of model parameters. Then we pick a new point in the priors for our model parameters, calculate the likelihood and then the posterior again and see if that gets us in the right direction (higher posterior probability). We might discard the new choice as well. We keep on sampling, in what is called Markov Chain Monte Carlo process in order to fully sample the posterior probability density function.

On my github, there is a notebook showing how to do a simple linear regression the Bayesian way, using PyMC3. I also compare it to the frequentist methods with stasmodels and scikit-learn. Even in this simple example, a nice advantage of getting the full posterior probability density shines: there is an anti-correlation between the values for slope and intercept that makes perfect sense, but only shows up when you actually have the joint posterior probability density function:

Admittedly, there is extra work to do, for a simple example like this. Things get better, though, for examples that are not easy to calculate. With the exact same machinery, we can attack virtually any inference problem! Let’s extend the example slightly, like in my second bayesian fit notebook: An ever so slightly more tricky example, even with plain vanilla PyMC3. With the tiniest hack it works like a charm, even though the likelihood we use isn’t even incorporated in PyMC3 (which makes it quite advanced already, actually), we can very easily construct it. Once you get through the overhead that seems killing for simple examples, generalizing to much harder problems becomes almost trivial!

I hope you’re convinced, too. I hope you found the notebooks instructive and helpful. If you are giving it a try and get stuck: reach out! Good luck and enjoy!

A great set of references in “literature”, at least that I like are:

An old love: magnetic fields on the Sun

Dutch Open Telescope on La Palma, credits: astronomie.nl and Rob Hammerschlag.

Late 2003 and early 2004 I was a third year BSc student. I was lucky enough to be taken to La Palma by my professor Rob Rutten, presumably to use the Dutch Open Telescope, a fancy looking half alien with a 45 cm reflective telescope on top, that makes some awesome images (and movies!) of the solar atmosphere. It turned out to be cloudy for two weeks (on the mountaintop, lovely weather on the rest of beautiful island, not too bad….), so my “project” became more of a literature research and playing with some old data than an observational one.

A tiny part of the solar surface
A small part of the surface of the Sun. The lighter areas are the top of convective cells where hot gas boils up. Cooling gas flows down in the darker lanes in between. The bright white points are places where the magnetic field of the Sun is strong and sticks through the surface.

I started looking into “bright points” in “intergranular lanes”. The granules, see left, are the top of convection cells, pretty much the upper layer of the boiling sun. In the darker lanes in between the cooling gas falls back down to be heated and boiled up and again later. Inside these darker lanes you sometimes see very bright spots. These are places where bundles of magnetic field stick up through the surface. These contain some less gas, so you look deeper into the Sun where it is hotter, hence the brightness of these spots.

With different filters you can choose at which height in the solar atmosphere you look (more or less; I guess you understand I take few shortcuts in my explanation here…), so we can see how such regions look higher up and see what the gas and these magnetic bundles do. Looking at a bundle from the top isn’t necessarily very informative, but when you look near the limb of the sun, you are basically getting a side view of these things:

Images near the limb of the Sun in three different filters taken simulateously. On the left you see a side view of the image above (on another part of the Sun at another time, though). Moving to the right are filters that typically see a little bit higher up in the atmosphere (the so-called chromosphere). For the interested reader: the filters are G-band, Ca II H and H-alpha. Image taken from Rutten et al. 2012.

I spent about 3 months discussing with the solar physics group in Utrecht, which still existed back in the day, trying to figure out how these magnetic fluxtubes actually work. I wrote a (admittedly mediocre quality) Bachelor’s thesis about this work that ends with some speculations. I fondly remember how my supervisor did not like my magnetic field structure based interpretation and believed it was more related to effects of the temperature structure and corresponding effect on how the radiation travels through these bits of atmosphere.

I suddenly was drawn back to this topic two weeks ago, when I read the Dutch popular astronomy magazine Zenit. It was a special about Minnaert’s famous “Natuurkunde van ‘t vrije veld” and featured an an article by the designer of the DOT, Rob Hammerschlag, and talked about observations of the Sun, including these magnetic structures. I followed some links from the article and found myself browsing the professional solar physics literature from the past decade or so, on the look for the current verdict on these “straws” in the middle image above.

A detailed simulation of the solar surface and its magnetic field, by Chitta and Cameron from the MPS in Germany.

Obviously, the answer is slightly complicated and not as simple as mine (or my professor’s!) interpretation from 2004. While I was suggesting twisted magnetic field lines to be important, it turned out likely that the relevant phenomenon that is important in these straws are torsoidal waves through the tube (waves that you can make by grabbing it, and twisting like you twist the throttle of a motorcycle). Great fun to be taken back on a journey into the solar atmosphere, just by reading a simple popular article!

I regret quitting astrophysics

In 2013 I decided to quit my career in astrophysics, move back “home” and become a data scientist. The blog post I wrote about my decision was probably my best read publication as a professional astronomer and it was moving to read all the reactions from people who were struggling with similar decisions. I meant every word in that blog post and I still agree with most of what I said. Now, 7 years after the fact, it is time to confess: I deeply regret quitting.

This post is meant to give my point of view. Many people who left academia are very happy that they did. Here I present some arguments why one might not want to leave, which I hope will be of help for people facing decisions like these.

I miss being motivated. In the first few years after jumping ship many people asked me why I would ever wanted to not be a professional astronomer. I have always said that my day-to-day work wasn’t too different, except that what I did with data was about financial services or some other business I was in, rather than about galaxies and the Universe, but that the “core activities” of work were quite similar. That is kind of true. On an hour by hour basis, often I’m just writing (Python) code to figure things out or build a useful software product. The motivation to do what you do, though, is very very different. The duty cycle and technical depth of projects are short and shallow and the emphasis of projects is much more on getting working products than on understanding. I am doing quite well (in my own humble opinion), but it is hard to get satisfaction out of my current job.

I miss academic research. The seeds of astronomy were planted at very young age (8, if I remember correctly). The fascination for the wonders of the cosmos has changed somewhat in nature while growing up but hasn’t faded. Being at the forefront of figuring things out about the workings of the Universe is amazing, and unparalleled in any business setting. Having the freedom to pick up new techniques that may be useful for your research is something that happened to me only sporadically after the academic years. The freedom to learn and explore are valuable for creative and investigative minds and it doesn’t fit as well in most business settings that I have seen.

I miss working at academic institutions. The vibe of being at a large research institute, surrounded by people who are intrinsically motivated to do what they do was of great value to me. Having visitors over from around the globe with interesting, perhaps related work was a big motivator. That journal clubs, coffee discussions, lunch talks, colloquiums etc. are all “part of the job” is something that even most scientists don’t always seem to fully appreciate. Teaching, at the depth of university level classes, as a part of the job is greatly rewarding (I do teach nowadays!).

I miss passion and being proud of what I do. The internet says I have ”the sexiest job of the 21st century”, but I think my previous job was more enjoyable to brag about at birthday parties. I can do astro as a hobby, but that simply doesn’t give you enough time to do something substantial enough.

I don’t miss … Indeed, the academic career also had its downsides. There is strong competition and people typically experience quite some pressure to achieve. The culture wasn’t always very healthy and diversity and equality are in bad shape in academia. Success criteria of your projects and of you as a person are typically better motivated in business. The obligatory nomadic lifestyle that you are bound to have as an early career scientist were a very enjoyable and educational experience, but it can easily become a burden on your personal life. The drawbacks and benefits of any career path will balance out differently for everybody. If you get to such a point, don’t take the decision lightly.

The people who questioned my decision to become an extronomer were right. I was wrong. It seems too late to get back in. I think I have gained skills and experience that can be very valuable to the astronomical community, but I know that that is simply not what candidates for academic positions are selected on. On top of that, being geographically bound doesn’t help. At least I will try to stay close to the field and who knows what might once cross my path.

Using a graphics card for neural networks

Just a short blurb of text today, because I ran into an issue that ever so slightly annoyed me last weekend. I had bought an old second-hand desktop machine lately, because I wanted to try out a couple of things that are not particularly useful on a laptop (having a server to log on to/ssh into, host my own Jupyter Hub, doing some I/O-heavy things with images from the internet, etc.). One more thing I wanted to try was using TensorFlow with a GPU in your machine. The idea was to check out whether this indeed is as simple as just installing tensorflow-gpu (note that in versions newer than 1.15 of tensorflow, there is no need to install CPU and GPU capabilities separately) and the software would figure out itself whether or not to use the GPU.

Then TensorFlow logo, taken from the Wikipedia site linked above.

“TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. It is used for both research and production at Google.‍” (source: Wikipedia)

I use it mostly with the lovely Keras API, which allows easy set-up of neural networks in TensorFlow. I have used it, for example (disclaimer: old code…) in a blog post titled “Beyond the recognition of handwritten digits”.

It turned out that TensorFlow will indeed do this for you, but not with just any graphics card. I have an NVIDIA GeForce GT 620. This is quite an old model indeed, but it still has 96 CUDA cores and a gigabyte of memory. Sounded good to me. After loading tensorflow from a Python session you can have it detect the graphics card, so you know it can be used for the calculations:

This is a screenshot of a console in Jupyter Lab.

As you can see, it did not detect a graphics card…. There is a hint in a terminal session that is behind my jupyter session though:

Aplogies for the sub-optimal reading experience here….

Apparently, the cuda compute capability of my graphics card is lower than it should be. Only cards with cuda capability of 3.5 or higher will work (I know the terminal says 3.0, but the documentation says 3.5). On this NVIDIA page, there is a list of their cards and the corresponding cuda capabilities. Pleas check that out before trying to run TensorFlow on it (and before getting frustrated). And when you consider buying a (second-hand?) graphics card to play with, definitely make sure you buy a useful one!

Test for COVID-19 in groups and be done much quicker!

In these times of a pandemic the world is changing. On large scales, but also for people in their every day lives. The same holds for me, so I figured I could post something on the blog again, to break with the habit of not doing so. Disclaimer: besides the exercises below being of almost trivial over-simplicity, I’m a data scientist and not an epidemiologist. Please believe specialists and not me!

Inspired by this blog post (in Dutch) I decided to look at simple versions of testing strategies for infection tests (a popular conversation topic nowadays), in a rather quick-and-dirty way. The idea is that if being infected (i.e. testing positive) is rare, you could start out by testing a large group as a whole. As it were, the cotton swabs of many people are put into one tube with testing fluid. If there’s no infection in that whole group you’re done with one test for all of them! If there, on the other hand, is an infection, you can cut the group in two and do the same for both halves. You can continue this process until you have isolated the few individuals that are infected.

It is clear, though, that many people get tested more than once and especially the infected people are getting tested quite a number of times. Therefore, this is only going to help if a relatively low number of people is infected. Here, I look at the numbers with very simple “simulations” (of the Monte Carlo type). Note that these are not very realistic, they are just meant to be an over-simplified example of how group testing strategy can work.

A graphical example of why this can work is given below (image courtesy of Bureau WO):

Above the line you see the current strategy displayed: everyobody gets one test. Below, the group is tested and after we found an infection, the group is cut in halves. Those halves are tested again and those halves with an infection gradually get cut up in pieces again. This leads, in the end, to the identification of infected people. In the mean time, parts of the data without infection are not split up further and everyone in those sections is declared healthy.

Normally, by testing people one by one, you would need as many tests as people to identify all infected people. To investigate the gain by group testing, I divide the number of tests the simulation needs by this total number. The number of people in a very large population that can be tested is a factor gain higher, given a maximum number of tests, like we have available in the Netherlands.

In this notebook, that you don’t need to follow the story, but that you can check out to play with code, I create batches of people. Randomly, some fraction gets assigned “infected”, the rest is “healthy”. Then I start the test, which I assume to be perfect (i.e. every infected person gets detected and there are no false positives). For different infection rates (true percentage overall that is infected), and for different original batch sizes (the size of the group that initially gets tested) I study how many tests are needed to isolate every single infected person.

In a simple example, where I use batches of 256 people (note that this conveniently is a power of 2, but that is not necessary for this to work), I assume a overall infected fraction of 1%. This is lower than the current test results in the Netherlands, but that is likely due to only testing very high risk groups. This results in a factor 8 gain, which means that with the number of tests we have available per day, we could test 8 times more people than we do now, if the 1% is a reasonable guess of the overall infection rate.

To get a sense of these numbers for other infection rates and other batch sizes I did many runs, the results of which are summarized below:

As can be seen, going through the hassle of group testing is not worth it if the true infected fraction is well above a percent. If it is below, the gain can be high, and ideal batch sizes are around 50 to 100 people or so. If we are lucky, and significantly less than a percent of people is infected, gains can be more than an order of magnitude, which would be awesome.

Obviously, group testing comes at a price as well. First of all, people need to be tested more than once in many cases (which requires test results to come in relatively quickly). Also, there’s administrative overhead, as we need to keep track of which batch you were in to see if further testing is necessary. Last, but certainly not least, it needs to be possible to test many people at once without them infecting each other. In the current standard setup, this is tricky, but given that testing is basically getting some cotton swab in a fluid, I’m confident that we could make that work if we want!

If we are unlucky, and far more than a percent of people are infected, different strategies are needed to combine several people in a test. As always, wikipedia is a great source of information for these.

And the real caveat… realistic tests aren’t perfect… I’m a data scientist, and not an epidemiologist. Please believe specialists and not me!

Stay safe and stay healthy!

Beyond the recognition of handwritten digits

There are many tutorials on neural networks and deep learning, that use the handwritten digits data set (MNIST) to show what deep learning can give you. It generally trains a model to recognize the digits and show it’s better than a logistic regression. Many such tutorials also say something about auto-encoders and how they should be able to pre-process images, for example to get rid of noise and improve recognition on noisy (and hence more realistic?) images. Rarely, though, is that worked out into any amount of detail.

This blog post is a short summary, and a full version with code and output is available at my github. Lazy me has taken some shortcuts: I have pretty much always used default values of all models I trained, I have not investigated how they can be improved by tweaking (hyper-)parameters and I have only used simple dense networks (while convolutional neural nets might be a very good choice for improvement in this application). In some sense, the model can only get better in a realistic setting. I have compared to simple but popular other models, and sometimes that comparison isn’t very fair: the training time of the (deep) neural network models often is much longer. It is nevertheless not very easy to define a “fair comparison”. That said, going just that little bit beyond the recognition of the handwritten set can be done as follows.

The usual first steps

To start off, the data set of MNIST has a whole bunch of handwritten digits, a random sample of which looks like this:

The images are 28×28 pixels and every pixel has a value ranging from 0 to 255. All these 784 pixel values can be thought of as features of every image, and the corresponding labels of the images are the digits 0-9. As such machine learning models can be trained, to categorize the images into 10 categories, corresponding to the labels, based on 784 input variables. The labels in the data set are given.

Logistic regression can do this fairly well, and get roughly 92% of the labels right on images it has not seen while training the model (an independent test set), after being trained on about 50k such images. A neural network with one hidden layer (of 100 neurons, the default of the multi-layer perceptron model in scikit-learn) will get about  97.5% right, a significant improvement to the simple logistic regression. The same model with 3 hidden layers of 500, 200 and 50 neurons respectively will further improve that to 98%. When similar models are implemented in Tensorflow/Keras, with proper activation functions, will get about the same score. So far, so good, and this stuff is in basically every tutorial you have seen.

Let’s get beyond that!

Auto-encoders and bottleneck networks

Auto-encoders are neural networks that typically use many input features, then go through a narrower hidden layer and then as output reproduce the input features. Graphically, it would look like this, with the output being identical to the input:

This means that, if the network performs well (i.e. if the input is reasonably well reproduced), all information about the images is stored into a smaller number of features (equal to the number of neurons in the narrowest hidden layer), which can be used as compression technique for example. It turns out that this also works very well to recover images from noisy variants of the images (the idea being that the network figures out the important bits, i.e. the actual image, from the unimportant, i.e. the noise pixels).

I created a set of noisy MNIST images looking, for 10 random examples, like this:

A simple auto-encoder, with hidden layers of 784, 256, 128, 256 and 784 neurons (note the symmetry around the bottleneck layer!), respectively does a fair job at reproducing noise-free images:

It’s not perfect, but it is clear that the “3” is much cleaner than it was before “de-noising”. A little comparison of recognizing the digits on noisy versus de-noised images shows that it pays to do such a pre-processing step before: the model trained on the clean images only recovers 89% of the correct labels on the noisy images, but 94% after de-noising (a factor 2 reduction in the number of wrongly identified labels). Note that all of this is done without any optimization of the model. The Kaggle competition page on this data set shows many optimized models and their amazing performance!

Dimension reduction and visualization

To investigate whether any structure exists in the data set of 784 pixels per image, people often, for good reasons, resort to manifold learning algorithms like t-SNE. Such algorithm go from 784 dimensions to, for example, 2, thereby keeping local structure intact as much as possible. A full explanation of such algorithms goes beyond the scope of this post, I will just show the result of it here. The 784 pixels are reduced to two dimensions and in this figure I plot those two dimensions against each other, color coded by the image label (so every dot is one image in the data set):

The labels seem very well separated, suggesting that reduction of dimensions can go as far as down to two dimensions, still keeping all the information to recover the labels to reasonable precision. This inspired me to try the same with a simple deep neural network. I go through a bottleneck layers of 2 neurons, surrounded by two layers of 100 neurons. Between the input and that there is another layer of 500 neurons and the output layer obvioulsy has 10 neurons. Note that this is not an auto-encoder: the output are the 10 labels, not the 784 input pixels. That network recovers more than 96% of the labels correctly! The output of the bottleneck layer can be visualized in much the same way as the t-SNE output:

It looks very different from the t-SNE results, but upon careful examination, there are similarities in terms if which digits are more closely together than others, for example. Again, this distinction is good enough to recover 96% of the labels! All that needs to be stored about images is 2 numbers, obtained from the encoding part of the bottleneck network, and using the decoding bit of the network, the labels can be recovered very well. Amazing, isn’t it?

Outlook

I hope I have shown you that there are a few simple steps beyond most tutorials that suddenly make these whole deep neural network exercises seem just a little bit more useful and practical. Are there applications of such networks that you would like to see worked out in some detail as well? Let me know, and I just might expand the notebook on github!

Hacking for a future data flood

Astronomy has always been a “big data science”. Astronomy is an observational science: we just have to wait, watch, see and interpret what happens somewhere on the sky. We can’t control it, we can’t plan it, we can just observe in any kind of radiation imaginable and hope that we understand enough of the physics that governs the celestial objects to make sense of it. In recent years, more and more tools that are so very common in the world of data science have also penetrated the field of astrophysics. Where observational astronomy has largely been a hypothesis driven field, data driven “serendipitous” discoveries have become more commonplace in the last decade, and in fact entire surveys and instruments are now designed to be mostly effective through statistics, rather than through technology (even though it is still stat of the art!).

In order to illustrate how astronomy is leading the revolutions in data streams, this infographic was used by the organizers of a hackathon I went to nearing the end of April:
Streams and volumes of data!

The Square Kilometer Array will be a gigantic radio telescope that is going to result in a humongous 160 TB/s rate of data coming out of antennas. This needs to be managed and analysed on the fly somehow. At ASTRON a hackathon was organized to bring together a few dozen people from academia and industry to work on projects that can prepare astronomers for the immense data rates they will face in just a few years.

As usual, and for the better, smaller working groups split up and started working on different projects. Very different projects, in fact. Here, I will focus on the one I have worked on, but by searching for the right hash tag on twitter, I’m sure you can find info on many more of them!

ZFOURGE

We jumped on two large public data sets on galaxies and AGN (Active Galactic Nuclei: galaxies with a supermassive black hole in the center that is actively growing). One of them was a very large data set with millions of galaxies, but not very many properties of every galaxy (from SDSS), the other, out of which the coolest result (in my own, not very humble opinion) was distilled was from the ZFOURGE survey. In that data set, there are “only” just under 400k galaxies, but there were very many properties known, such as brightnesses through 39 filters, derived properties such as the total mass in stars in them, the rate at which stars were formed, as well as an indicator whether or not the galaxies have an active nucleus, as determined from their properties in X-rays, radio, or infrared.

I decided to try something simple and take the full photometric set of columns, so the brightness of the objects in many many wavelengths as well as a measure of their distance to us into account and do some unsupervised machine learning on that data set. The data set had 45 dimensions, so an obvious first choice was to do some dimensionality reduction on it. I played with PCA and my favorite bit of magic: t-SNE. A dimensionality reduction algorithm like that is supposed to reveal if any substructure in the data is present. In short, it tends to conserve local structure and screw up global structure just enough to give a rather clear representation of any clumping in the original high dimensional data set, in two dimensions (or more, if you want, but two is easiest to visualize). I made this plot without putting in any knowledge about which galaxies are AGN, but colored the AGNs and made them a bit bigger, just to see where they would end up:
t-SNE representation of galaxy data from ZFOURGE

To me, it was absolutely astonishing to see how that simple first try came up with something that seems too good to be true. The AGN cluster within clumps that were identified without any knowledge of the galaxies having an active nucleus or not. Many galaxies in there are not classified as AGN. Is that because they were simply not observed at the right wavelengths? Or are they observed but would their flux be just below detectable levels? Are the few AGN far away from the rest possible mis-classifications? Enough questions to follow up!

On the fly, we needed to solve some pretty nasty problems in order to get to this point, and that’s exactly what makes these projects so much fun to do. In the data set, there were a lot of null values, no observed flux in some filters. This could either mean that the observatory that was supposed to measure that flux didn’t point in the direction of the objects (yet), or that there was no detected flux above the noise. Working with cells that have no number at all or only upper limits on the brightness in some of the features that were fed to the machine learning algorithm is something most ML models are not very good at. We made some simple approximations and informed guesses about what numbers to impute into the data set. Did that have any influence on the results? Likely! Hard to test though… For me, this has sprung a new investigation on how to deal with ML on data with upper or lower limits on some of the features. I might report on that some time in the future!

The hackathon was a huge success. It is a lot of fun to gather with people with a lot of different backgrounds to just sit together for two days and in fact get to useful results, and interesting questions for follow-up. Many of the projects had either some semi-finished product, or leads into interesting further investigation that wouldn’t fit in two days. All the data is available online and all code is uploaded to github. Open science for the win!