Archiv für den Monat Juni 2019

Steve Rymell Head of Technology, Airbus CyberSecurity answers What Should Frighten us about AI-Based Malware?


Of all the cybersecurity industry’s problems, one of the most striking is the way attackers are often able to stay one step ahead of defenders without working terribly hard. It’s an issue whose root causes are mostly technical: the prime example are software vulnerabilities which cyber-criminals have a habit of finding out about before vendors and their customers, leading to the almost undefendable zero-day phenomenon which has propelled many famous cyber-attacks.

A second is that organizations struggling with the complexity of unfamiliar and new technologies make mistakes, inadvertently leaving vulnerable ports and services exposed. Starkest of all, perhaps, is the way techniques, tools, and infrastructure set up to help organizations defend themselves (Shodan, for example but also numerous pen-test tools) are now just as likely to be turned against businesses by attackers who tear into networks with the aggression of red teams gone rogue.

Add to this the polymorphic nature of modern malware, and attackers can appear so conceptually unstoppable that it’s no wonder security vendors increasingly emphasize the need not to block attacks but instead respond to them as quickly as possible.

The AI fightback
Some years back, a list of mostly US-based start-ups started a bit of a counter-attack against the doom and gloom with a brave new idea – AI machine learning (ML) security powered by algorithms. In an age of big data, this makes complete sense and the idea has since been taken up by all manner of systems used to for anti-spam, malware detection, threat analysis and intelligence, and Security Operations Centre (SoC) automation where it has been proposed to help patch skills shortages.

I’d rate these as useful advances, but there’s no getting away from the controversial nature of the theory, which has been branded by some as the ultimate example of technology as a ‘black box’ nobody really understands. How do we know that machine learning is able to detect new and unknown types of attack that conventional systems fail to spot? In some cases, it could be because the product brochure says so.

Then the even bigger gotcha hits you – what’s stopping attackers from outfoxing defensive ML with even better ML of their own? If this were possible, even some of the time, the industry would find itself back at square one.

This is pure speculation, of course, because to date nobody has detected AI being used in a cyber-attack, which is why our understanding of how it might work remains largely based around academic research such as IBM’s proof-of-concept DeepLocker malware project.

What might malicious ML look like?
It would be unwise to ignore the potential for trouble. One of the biggest hurdles faced by attackers is quickly understanding what works, for example when sending spam, phishing and, increasingly, political disinformation.

It’s not hard to imagine that big data techniques allied to ML could hugely improve the efficiency of these threats by analyzing how targets react to and share them in real time. This implies the possibility that such campaigns might one day evolve in a matter of hours or minutes; a timescale defender would struggle to counter using today’s technologies.

A second scenario is one that defenders would even see: that cyber-criminals might simulate the defenses of a target using their own ML to gauge the success of different attacks (a technique already routinely used to evade anti-virus). Once again, this exploits the advantage that attackers always have sight of the target, while defenders must rely on good guesses.

Or perhaps ML could simply be used to crank out vast quantities of new and unique malware than is possible today. Whichever of these approaches is taken – and this is only a sample of the possibilities – it jumps out at you how awkward it would be to defend against even relatively simple ML-based attacks. About the only consolation is that if ML-based AI really is a black box that nobody understands then, logically, the attackers won’t understand it either and will waste time experimenting.

Unintended consequences
If we should fear anything it’s precisely this black box effect. There are two parts to this, the biggest of which is the potential for ML-based malware to cause something unintended to happen, especially when targeting critical infrastructure.

This phenomenon has already come to pass with non-AI malware – Stuxnet in 2010 and NotPetya in 2017 are the obvious examples – both of which infected thousands of organizations not on their original target list after unexpectedly ‘escaping’ into the wild.

When it comes to powerful malware exploiting multiple zero days there’s no such thing as a reliably contained attack. Once released, this kind of malware remains pathogenically dangerous until every system it can infect is patched or taken offline, which might be years or decades down the line.

Another anxiety is that because the expertise to understand ML is still thin on the ground, there’s a danger that engineers could come to rely on it without fully understanding its limitations, both for defense and by over-estimating its usefulness in attack. The mistake, then, might be that too many over-invest in it based on marketing promises that end up consuming resources better deployed elsewhere.  Once a more realistic assessment takes hold, ML could end up as just another tool that is good at solving certain very specific problems.

My contradictory-sounding conclusion is that perhaps ML and AI makes no fundamental difference at all. It’s just another stop on a journey computer security has been making since the beginning of digital time. The problem is overcoming our preconceptions about what it is and what it means. Chiefly, we must overcome the tendency to think of ML and AI as mysteriously ‘other’ because we don’t understand it and therefore find it difficult to process the concept of machines making complex decisions.

It’s not as if attackers aren’t breaching networks already with today’s pre-ML technology or that well-prepared defenders aren’t regularly stopping them using the same technology. What AI reminds us is that the real difference is how organizations are defended, not whether they or their attackers use ML and AI or not. That has always been what separates survivors from victims. Cybersecurity remains a working demonstration of how the devil takes the hindmost.



Do you know who your iPhone is talking to?

Yet these days, we spend more time in apps. Apple is strict about requiring apps to get permission to access certain parts of the iPhone, including your camera, microphone, location, health information, photos and contacts. (You can check and change those permissions under privacy settings.) But Apple turns more of a blind eye to what apps do with data we provide them or they generate about us — witness the sorts of tracking I found by looking under the covers for a few days.

“For the data and services that apps create on their own, our App Store Guidelines require developers to have clearly posted privacy policies and to ask users for permission to collect data before doing so. When we learn that apps have not followed our Guidelines in these areas, we either make apps change their practice or keep those apps from being on the store,” Apple says.

Yet very few apps I found using third-party trackers disclosed the names of those companies or how they protect my data. And what good is burying this information in privacy policies, anyway? What we need is accountability.

Getting more deeply involved in app data practices is complicated for Apple. Today’s technology frequently is built on third-party services, so Apple couldn’t simply ban all connections to outside servers. And some companies are so big they don’t even need the help of outsiders to track us.

The result shouldn’t be to increase Apple’s power. “I would like to make sure they’re not stifling innovation,” says Andrés Arrieta, the director of consumer privacy engineering at the Electronic Frontier Foundation. If Apple becomes the Internet’s privacy police, it could shut down rivals.

Jackson suggests Apple could also add controls into iOS like the ones built into Privacy Pro to give everyone more visibility.

Or perhaps Apple could require apps to label when they’re using third-party trackers. If I opened the DoorDash app and saw nine tracker notices, it might make me think twice about using it.

I don’t mind letting your trackers see my private data as long as I get something useful in exchange.

Forget privacy: you’re terrible at targeting anyway

I don’t mind letting your programs see my private data as long as I get something useful in exchange. But that’s not what happens.

A former co-worker told me once: „Everyone loves collecting data, but nobody loves analyzing it later.“ This claim is almost shocking, but people who have been involved in data collection and analysis have all seen it. It starts with a brilliant idea: we’ll collect information about every click someone makes on every page in our app! And we’ll track how long they hesitate over a particular choice! And how often they use the back button! How many seconds they watch our intro video before they abort! How many times they reshare our social media post!

And then they do track all that. Tracking it all is easy. Add some log events, dump them into a database, off we go.

But then what? Well, after that, we have to analyze it. And as someone who has analyzed a lot of data about various things, let me tell you: being a data analyst is difficult and mostly unrewarding (except financially).

See, the problem is there’s almost no way to know if you’re right. (It’s also not clear what the definition of „right“ is, which I’ll get to in a bit.) There are almost never any easy conclusions, just hard ones, and the hard ones are error prone. What analysts don’t talk about is how many incorrect charts (and therefore conclusions) get made on the way to making correct ones. Or ones we think are correct. A good chart is so incredibly persuasive that it almost doesn’t even matter if it’s right, as long as what you want is to persuade someone… which is probably why newpapers, magazines, and lobbyists publish so many misleading charts.

But let’s leave errors aside for the moment. Let’s assume, very unrealistically, that we as a profession are good at analyzing things. What then?

Well, then, let’s get rich on targeted ads and personalized recommendation algorithms. It’s what everyone else does!

Or do they?

The state of personalized recommendations is surprisingly terrible. At this point, the top recommendation is always a clickbait rage-creating article about movie stars or whatever Trump did or didn’t do in the last 6 hours. Or if not an article, then a video or documentary. That’s not what I want to read or to watch, but I sometimes get sucked in anyway, and then it’s recommendation apocalypse time, because the algorithm now thinks I like reading about Trump, and now everything is Trump. Never give positive feedback to an AI.

This is, by the way, the dirty secret of the machine learning movement: almost everything produced by ML could have been produced, more cheaply, using a very dumb heuristic you coded up by hand, because mostly the ML is trained by feeding it examples of what humans did while following a very dumb heuristic. There’s no magic here. If you use ML to teach a computer how to sort through resumes, it will recommend you interview people with male, white-sounding names, because it turns out that’s what your HR department already does. If you ask it what video a person like you wants to see next, it will recommend some political propaganda crap, because 50% of the time 90% of the people do watch that next, because they can’t help themselves, and that’s a pretty good success rate.

(Side note: there really are some excellent uses of ML out there, for things traditional algorithms are bad at, like image processing or winning at strategy games. That’s wonderful, but chances are good that your pet ML application is an expensive replacement for a dumb heuristic.)

Someone who works on web search once told me that they already have an algorithm that guarantees the maximum click-through rate for any web search: just return a page full of porn links. (Someone else said you can reverse this to make a porn detector: any link which has a high click-through rate, regardless of which query it’s answering, is probably porn.)

Now, the thing is, legitimate-seeming businesses can’t just give you porn links all the time, because that’s Not Safe For Work, so the job of most modern recommendation algorithms is to return the closest thing to porn that is still Safe For Work. In other words, celebrities (ideally attractive ones, or at least controversial ones), or politics, or both. They walk that line as closely as they can, because that’s the local maximum for their profitability. Sometimes they accidentally cross that line, and then have to apologize or pay a token fine, and then go back to what they were doing.

This makes me sad, but okay, it’s just math. And maybe human nature. And maybe capitalism. Whatever. I might not like it, but I understand it.

My complaint is that none of the above had anything to do with hoarding my personal information.

The hottest recommendations have nothing to do with me

Let’s be clear: the best targeted ads I will ever see are the ones I get from a search engine when it serves an ad for exactly the thing I was searching for. Everybody wins: I find what I wanted, the vendor helps me buy their thing, and the search engine gets paid for connecting us. I don’t know anybody who complains about this sort of ad. It’s a good ad.

And it, too, had nothing to do with my personal information!

Google was serving targeted search ads decades ago, before it ever occurred to them to ask me to log in. Even today you can still use every search engine web site without logging in. They all still serve ads targeted to your search keyword. It’s an excellent business.

There’s another kind of ad that works well on me. I play video games sometimes, and I use Steam, and sometimes I browse through games on Steam and star the ones I’m considering buying. Later, when those games go on sale, Steam emails me to tell me they are on sale, and sometimes then I buy them. Again, everybody wins: I got a game I wanted (at a discount!), the game maker gets paid, and Steam gets paid for connecting us. And I can disable the emails if I want, but I don’t want, because they are good ads.

But nobody had to profile me to make that happen! Steam has my account, and I told it what games I wanted and then it sold me those games. That’s not profiling, that’s just remembering a list that I explicitly handed to you.

Amazon shows a box that suggests I might want to re-buy certain kinds of consumable products that I’ve bought in the past. This is useful too, and requires no profiling other than remembering the transactions we’ve had with each other in the past, which they kinda have to do anyway. And again, everybody wins.

Now, Amazon also recommends products like the ones I’ve bought before, or looked at before. That’s, say, 20% useful. If I just bought a computer monitor, and you know I did because I bought it from you, then you might as well stop selling them to me. But for a few days after I buy any electronics they also keep offering to sell me USB cables, and they’re probably right. So okay, 20% useful targeting is better than 0% useful. I give Amazon some credit for building a useful profile of me, although it’s specifically a profile of stuff I did on their site and which they keep to themselves. That doesn’t seem too invasive. Nobody is surprised that Amazon remembers what I bought or browsed on their site.

Worse is when (non-Amazon) vendors get the idea that I might want something. (They get this idea because I visited their web site and looked at it.) So their advertising partner chases me around the web trying to sell me the same thing. They do that, even if I already bought it. Ironically, this is because of a half-hearted attempt to protect my privacy. The vendor doesn’t give information about me or my transactions to their advertising partner (because there’s an excellent chance it would land them in legal trouble eventually), so the advertising partner doesn’t know that I bought it. All they know (because of the advertising partner’s tracker gadget on the vendor’s web site) is that I looked at it, so they keep advertising it to me just in case.

But okay, now we’re starting to get somewhere interesting. The advertiser has a tracker that it places on multiple sites and tracks me around. So it doesn’t know what I bought, but it does know what I looked at, probably over a long period of time, across many sites.

Using this information, its painstakingly trained AI makes conclusions about which other things I might want to look at, based on…

…well, based on what? People similar to me? Things my Facebook friends like to look at? Some complicated matrix-driven formula humans can’t possibly comprehend, but which is 10% better?

Probably not. Probably what it does is infer my gender, age, income level, and marital status. After that, it sells me cars and gadgets if I’m a guy, and fashion if I’m a woman. Not because all guys like cars and gadgets, but because some very uncreative human got into the loop and said „please sell my car mostly to men“ and „please sell my fashion items mostly to women.“ Maybe the AI infers the wrong demographic information (I know Google has mine wrong) but it doesn’t really matter, because it’s usually mostly right, which is better than 0% right, and advertisers get some mostly demographically targeted ads, which is better than 0% targeted ads.

You know this is how it works, right? It has to be. You can infer it from how bad the ads are. Anyone can, in a few seconds, think of some stuff they really want to buy which The Algorithm has failed to offer them, all while Outbrain makes zillions of dollars sending links about car insurance to non-car-owning Manhattanites. It might as well be a 1990s late-night TV infomercial, where all they knew for sure about my demographic profile is that I was still awake.

You tracked me everywhere I go, logging it forever, begging for someone to steal your database, desperately fearing that some new EU privacy regulation might destroy your business… for this?

Statistical Astrology

Of course, it’s not really as simple as that. There is not just one advertising company tracking me across every web site I visit. There are… many advertising companies tracking me across every web site I visit. Some of them don’t even do advertising, they just do tracking, and they sell that tracking data to advertisers who supposedly use it to do better targeting.

This whole ecosystem is amazing. Let’s look at online news web sites. Why do they load so slowly nowadays? Trackers. No, not ads – trackers. They only have a few ads, which mostly don’t take that long to load. But they have a lot of trackers, because each tracker will pay them a tiny bit of money to be allowed to track each page view. If you’re a giant publisher teetering on the edge of bankruptcy and you have 25 trackers on your web site already, but tracker company #26 calls you and says they’ll pay you $50k a year if you add their tracker too, are you going to say no? Your page runs like sludge already, so making it 1/25th more sludgy won’t change anything, but that $50k might.

(„Ad blockers“ remove annoying ads, but they also speed up the web, mostly because they remove trackers. Embarrassingly, the trackers themselves don’t even need to cause a slowdown, but they always do, because their developers are invariably idiots who each need to load thousands of lines of javascript to do what could be done in two. But that’s another story.)

Then the ad sellers, and ad networks, buy the tracking data from all the trackers. The more tracking data they have, the better they can target ads, right? I guess.

The brilliant bit here is that each of the trackers has a bit of data about you, but not all of it, because not every tracker is on every web site. But on the other hand, cross-referencing individuals between trackers is kinda hard, because none of them wants to give away their secret sauce. So each ad seller tries their best to cross-reference the data from all the tracker data they buy, but it mostly doesn’t work. Let’s say there are 25 trackers each tracking a million users, probably with a ton of overlap. In a sane world we’d guess that there are, at most, a few million distinct users. But in an insane world where you can’t prove if there’s an overlap, it could be as many as 25 million distinct users! The more tracker data your ad network buys, the more information you have! Probably! And that means better targeting! Maybe! And so you should buy ads from our network instead of the other network with less data! I guess!

None of this works. They are still trying to sell me car insurance for my subway ride.

It’s not just ads

That’s a lot about profiling for ad targeting, which obviously doesn’t work, if anyone would just stop and look at it. But there are way too many people incentivized to believe otherwise. Meanwhile, if you care about your privacy, all that matters is they’re still collecting your personal information whether it works or not.

What about content recommendation algorithms though? Do those work?

Obviously not. I mean, have you tried them. Seriously.

That’s not quite fair. There are a few things that work. Pandora’s music recommendations are surprisingly good, but they are doing it in a very non-obvious way. The obvious way is to take the playlist of all the songs your users listen to, blast it all into an ML training dataset, and then use that to produce a new playlist for new users based on… uh… their… profile? Well, they don’t have a profile yet because they just joined. Perhaps based on the first few songs they select manually? Maybe, but they probably started with either a really popular song, which tells you nothing, or a really obscure song to test the thoroughness of your library, which tells you less than nothing.

(I’m pretty sure this is how Mixcloud works. After each mix, it tries to find the „most similar“ mix to continue with. Usually this is someone else’s upload of the exact same mix. Then the „most similar“ mix to that one is the first one, so it does that. Great job, machine learning, keep it up.)

That leads us to the „random song followed by thumbs up/down“ system that everyone uses. But everyone sucks, except Pandora. Why? Apparently because Pandora spent a lot of time hand-coding a bunch of music characteristics and writing a „real algorithm“ (as opposed to ML) that tries to generate playlists based on the right combinations of those characteristics.

In that sense, Pandora isn’t pure ML. It often converges on a playlist you’ll like within one or two thumbs up/down operations, because you’re navigating through a multidimensional interconnected network of songs that people encoded the hard way, not a massive matrix of mediocre playlists scraped from average people who put no effort into generating those playlists in the first place. Pandora is bad at a lot of things (especially „availability in Canada“) but their music recommendations are top notch.

Just one catch. If Pandora can figure out a good playlist based on a starter song and one or two thumbs up/down clicks, then… I guess it’s not profiling you. They didn’t need your personal information either.


While we’re here, I just want to rant about Netflix, which is an odd case of starting off with a really good recommendation algorithm and then making it worse on purpose.

Once upon a time, there was the Netflix prize, which granted $1 million to the best team that could predict people’s movie ratings, based on their past ratings, with better accuracy than Netflix could themselves. (This not-so-shockingly resulted in a privacy fiasco when it turned out you could de-anonymize the data set that they publicly released, oops. Well, that’s what you get when you long-term store people’s personal information in a database.)

Netflix believed their business depended on a good recommendation algorithm. It was already pretty good: I remember using Netflix around 10 years ago and getting several recommendations for things I would never have discovered, but which I turned out to like. That hasn’t happened to me on Netflix in a long, long time.

As the story goes, once upon a time Netflix was a DVD-by-mail service. DVD-by-mail is really slow, so it was absolutely essential that at least one of this week’s DVDs was good enough to entertain you for your Friday night movie. Too many Fridays with only bad movies, and you’d surely unsubscribe. A good recommendation system was key. (I guess there was also some interesting math around trying to make sure to rent out as much of the inventory as possible each week, since having a zillion copies of the most recent blockbuster, which would be popular this month and then die out next month, was not really viable.)

Eventually though, Netflix moved online, and the cost of a bad recommendation was much less: just stop watching and switch to a new movie. Moreover, it was perfectly fine if everyone watched the same blockbuster. In fact, it was better, because they could cache it at your ISP and caches always work better if people are boring and average.

Worse, as the story goes, Netflix noticed a pattern: the more hours people watch, the less likely they are to cancel. (This makes sense: the more hours you spend on Netflix, the more you feel like you „need“ it.) And with new people trying the service at a fixed or proportional rate, higher retention translates directly to faster growth.

When I heard this was also when I learned the word „satisficing,“ which essentially means searching through sludge not for the best option, but for a good enough option. Nowadays Netflix isn’t about finding the best movie, it’s about satisficing. If it has the choice between an award-winning movie that you 80% might like or 20% might hate, and a mainstream movie that’s 0% special but you 99% won’t hate, it will recommend the second one every time. Outliers are bad for business.

The thing is, you don’t need a risky, privacy-invading profile to recommend a mainstream movie. Mainstream movies are specially designed to be inoffensive to just about everyone. My Netflix recommendations screen is no longer „Recommended for you,“ it’s „New Releases,“ and then „Trending Now,“ and „Watch it again.“

As promised, Netflix paid out their $1 million prize to buy the winning recommendation algorithm, which was even better than their old one. But they didn’t use it, they threw it away.

Some very expensive A/B testers determined that this is what makes me watch the most hours of mindless TV. Their revenues keep going up. And they don’t even need to invade my privacy to do it.

Who am I to say they’re wrong?

45 Techniques Used by Data Scientists

These techniques cover most of what data scientists and related practitioners are using in their daily activities, whether they use solutions offered by a vendor, or whether they design proprietary tools. When you click on any of the 45 links below, you will find a selection of articles related to the entry in question. Most of these articles are hard to find with a Google search, so in some ways this gives you access to the hidden literature on data science, machine learning, and statistical science. Many of these articles are fundamental to understanding the technique in question, and come with further references and source code.

Starred techniques (marked with a *) belong to what I call deep data science, a branch of data science that has little if any overlap with closely related fields such as machine learning, computer science, operations research, mathematics, or statistics. Even classical machine learning and statistical techniques such as clustering, density estimation,  or tests of hypotheses, have model-free, data-driven, robust versions designed for automated processing (as in machine-to-machine communications), and thus also belong to deep data science. However, these techniques are not starred here, as the standard versions of these techniques are more well known (and unfortunately more used) than the deep data science equivalent.

To learn more about deep data science,  click here. Note that unlike deep learning, deep data science is not the intersection of data science and artificial intelligence; however, the analogy between deep data science and deep learning is not completely meaningless, in the sense that both deal with automation.

Also, to discover in which contexts and applications the 40 techniques below are used, I invite you to read the following articles:

Finally, when using a technique, you need to test its performance. Read this article about 11 Important Model Evaluation Techniques Everyone Should Know.

The 40 data science techniques

  1. Linear Regression
  2. Logistic Regression
  3. Jackknife Regression *
  4. Density Estimation
  5. Confidence Interval
  6. Test of Hypotheses
  7. Pattern Recognition
  8. Clustering – (aka Unsupervised Learning)
  9. Supervised Learning
  10. Time Series
  11. Decision Trees
  12. Random Numbers
  13. Monte-Carlo Simulation
  14. Bayesian Statistics
  15. Naive Bayes
  16. Principal Component Analysis – (PCA)
  17. Ensembles
  18. Neural Networks
  19. Support Vector Machine – (SVM)
  20. Nearest Neighbors – (k-NN)
  21. Feature Selection – (aka Variable Reduction)
  22. Indexation / Cataloguing *
  23. (Geo-) Spatial Modeling
  24. Recommendation Engine *
  25. Search Engine *
  26. Attribution Modeling *
  27. Collaborative Filtering *
  28. Rule System
  29. Linkage Analysis
  30. Association Rules
  31. Scoring Engine
  32. Segmentation
  33. Predictive Modeling
  34. Graphs
  35. Deep Learning
  36. Game Theory
  37. Imputation
  38. Survival Analysis
  39. Arbitrage
  40. Lift Modeling
  41. Yield Optimization
  42. Cross-Validation
  43. Model Fitting
  44. Relevancy Algorithm *
  45. Experimental Design