Want to protect yourself from social engineering scams? Don’t just train yourself. Analyze Big Data.
Ever wonder how the Las Vegas casinos brought down the MIT card counting team? They consulted Big Data.
How do Wall Street traders figure that every millisecond they shave off their trading algorithms makes them $100 million more a year? Two words. Big Data.
In this exclusive interview, sponsored by IBM Big Data, their chief scientist, fellow and blogger Jeff Jonas (@jeffjonas) goes deep on the risks and rewards of Big Data analytics.
[audio: http://ontherecordpodcast.com/pr/otro/electronic/Big_Data_Risks_and_Rewards.mp3]
Eric: Netvibes CEO Freddy Mini says, “Context is the comparison of metrics.” Can you explain in simple layman’s terms the notion of context and why it’s so important in the area of big data analysis?
Jeff: I have a completely different definition of context. The definition that I use, as it relates to my work, is this:
Better understanding something by taking into account the things around it.
When you see the word bad in a sentence, you look at the words around it to know what kind of bad it is.
If I reached into my pocket right now and pulled out a puzzle piece that had flames on it, and that’s all I had, a puzzle piece with flames on it, and I handed it to you and asked, “Good news or bad news?”
How would you know?
Only by taking that puzzle piece back to the puzzle and seeing the surrounding pieces would you realize it’s in a fireplace near a glass of wine, which is good news, or that it’s in the carpet near the kids’ bedroom, which is bad news.
That’s context. Context, to me, is weaving together a diverse observation space that brings more fidelity, a more complete picture.
When you do that, you’re making higher quality predictions for opportunity or risk.
Eric: How do you decide what to compare to what?
Jeff: The way you compare one piece of data to another is through the features exposed on it. It’s the features on the data.
I was actually thinking about this process some years ago and I thought about two facts.
If you had one fact that said fish are dying and you had another fact that said Jeff Jonas likes to wear black, how would you compare them?
I actually played with that in my head.
I realized that the way you associate data to other data is through features, the edges of the data that touch. Between those two facts, there’s no data that touches.
The fish are dying. Jeff Jonas likes to wear black. Those would be like two puzzle pieces in a puzzle that are nowhere near each other. On the puzzle board, you just have to have features that connect.
If you have a Twitter handle in one data source and you have another data source that has no Twitter handles, how will you compare them? You wouldn’t, because there are no shared features.
The question is, what features can you use that allow you to connect data? A feature would be something like a name or an address. A car has a VIN, a make, a model, and a color.
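To make the “features as touching edges” idea concrete, here’s a minimal sketch, with entirely hypothetical records, of connecting data sources only where they expose a comparable feature:

```python
# Minimal sketch of feature-based matching: two records can only be
# compared on features they both expose. All field names and sample
# records here are hypothetical.

def shared_features(rec_a: dict, rec_b: dict) -> set:
    """The features (keys) present in both records: the edges that touch."""
    return set(rec_a) & set(rec_b)

def connects(rec_a: dict, rec_b: dict) -> bool:
    """Two records connect only if some shared feature carries the same value."""
    return any(rec_a[f] == rec_b[f] for f in shared_features(rec_a, rec_b))

dmv_record   = {"name": "J. Jonas", "vin": "1HGCM82633A004352", "color": "black"}
parking_cite = {"vin": "1HGCM82633A004352", "lot": "7B"}
twitter_item = {"twitter": "@jeffjonas", "text": "fish are dying"}

print(connects(dmv_record, parking_cite))  # True: the VIN is the touching edge
print(connects(dmv_record, twitter_item))  # False: no shared features at all
```

Like the two puzzle pieces nowhere near each other, the tweet and the DMV record simply have no edge to compare on.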
Eric: You built this system that made it harder for the MIT card counting team to have a free run at the casinos. What were they doing, and what features were you able to grab onto to defeat their system?
Jeff: Before the team method that the MIT group used, card counting was an individual activity.
They would watch the deck as it played out. They would determine when the deck had a lot of 10’s in it because they’re counting the 10’s. When the deck has a lot of 10’s, it has a slight benefit to the player. In normal circumstances, a card counter is making small bets like $25, $25, $25.
Then when the deck has a lot of 10’s, they jump to 500, 500 or 1,000, 1,000. It’s just bet variation. Well, it’s very easy to see in the surveillance room. They’re watching somebody bet a smaller amount, and suddenly they’re betting a higher amount. They’ll just back the tape up, replay it, and count the cards. They’ll go, “I’ll bet he’s going to raise his bet now.”
When somebody does that, they just flat bet them. They’ll say, whatever bet you start the deck with, you have to make that bet all the way through. You can’t vary your bet. Card counters have to vary their bet. With the MIT card counting team, they split the shooter from the sensor. They had one person at the table counting, and they would have somebody else who would just show up and make only big bets. They separated the signal.
Eric: Was it humans that were figuring out what was going on or was there some computer system behind the scenes that was tracking this whole thing?
Jeff: It starts with, if you’re playing blackjack and you suddenly make a big bet, the dealer has an obligation to yell out to the floor supervisor. They say “checks play” or “money plays.” It just means you made a big bet.
The floor supervisor looks over and goes, yeah, they’re making a bigger bet. Then they have to determine whether or not they want the play evaluated.
If they want the play evaluated, they call the surveillance room and say, “BJ 7‑4,” meaning blackjack, table seven, seat four, evaluate the play. What they do is just back up the surveillance tape and watch that player playing. They watch the cards that are coming out and determine what method that person is using, if any. They evaluate their play.
Eric: What if I just happened to fall into that trap? I wasn’t counting cards, but I just decided I wanted to double down at that point.
Jeff: Well, they are so cautious about tapping someone on the shoulder and affecting their play. In fact, this is one of the things that I learned in Vegas building these kinds of systems. Their tolerance for false positives is so low. They do not want to tap your shoulder and say, “Hey, we don’t like the way you’re playing,” when you’re just a good guy.
They would watch for the trend where you’re moving your bet with the count of the cards. As soon as you’re moving your bet every time the card count moves, going up and down, following that exact flow, the probability that you’re card counting is very high. For the system I built, they had a watch list of people they were trying to keep an eye on, and we automated that so they could search it more quickly.
We implemented facial recognition in that in 1996. The MIT card counting team thought it was the facial recognition, but it really wasn’t the facial recognition that we were using to catch them. The surveillance/intelligence community had some data the card counters didn’t know they had. It’s one of the ways you catch bad guys in data.
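As a rough illustration of the bet-variation tell described above, here’s a sketch that correlates a player’s bets with a running card count. It uses the public Hi‑Lo counting system and a made-up shoe purely for illustration; the interview doesn’t describe the casino systems’ internals:

```python
# A rough sketch of the tell the surveillance room looks for: bets that
# move with the card count. The public Hi-Lo count is used purely for
# illustration; the actual casino systems are not described publicly.
from statistics import correlation  # Python 3.10+

HI_LO = {**dict.fromkeys("23456", +1),                     # low cards seen: count rises
         **dict.fromkeys("789", 0),                        # neutral
         **dict.fromkeys(["10", "J", "Q", "K", "A"], -1)}  # tens and aces: count falls

def running_count(cards_seen):
    count, history = 0, []
    for card in cards_seen:
        count += HI_LO[card]
        history.append(count)
    return history

# Hypothetical shoe: lots of small cards come out, so the count climbs
# and the remaining deck is rich in tens -- exactly when a counter bets big.
cards = ["2", "5", "K", "3", "6", "4", "9", "2", "5", "3"]
bets  = [25, 25, 25, 25, 500, 500, 25, 500, 500, 500]

r = correlation(running_count(cards), bets)
print(f"count/bet correlation: r = {r:+.2f}")  # strongly positive flags a likely counter
```

A flat bettor produces no such correlation, which is exactly why the countermeasure against a lone counter is to flat bet them.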
Eric: When the Syrian Electronic Army hacked the Twitter account of the Associated Press with a simple social engineering scam, they sent out a bogus tweet about an explosion at the White House. The event triggered electronic securities trading systems, prompting the sale of $134 billion worth of stock.
Rich Brown, the former head of Elektron Analytics, which is a Thomson‑Reuters unit that sells news feeds that computers can read, said that the words “explosions” or “Obama” alone wouldn’t have triggered the selling, but add “White House,” and it’s a combination even the slowest computer couldn’t miss. How might the use of contextual analysis have prevented this from happening?
Jeff: Well, one of the challenges… There are two parts to this. One part is, when you put algorithms on triggers, in other words, there’s no human in the middle, you get this phenomenon where sometimes the algorithms start running away on you. That’s evidence of that.
The question is, what secondary data would you have to have to see that it was a false positive?
That was the only evidence, right?
If that one source were the only source, part of that is trying to find a lie. If you said something and it’s a lie, how do you prove it’s a lie?
The way you prove it with data is by looking for a second piece of data that disagrees with it. The question is, was there a second piece of data that would have disagreed with that tweet?
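As a toy illustration of that test, here’s a hypothetical gate that holds fire on a single-source claim until an independent source corroborates it, and stands down the moment a second source disagrees. The sources, claims, and two-source threshold are all invented:

```python
# Toy sketch of "prove a lie with a second piece of data": don't act on a
# single-source claim; look for independent corroboration or contradiction.
# Sources, claims, and the two-source threshold are invented for the example.

def decide(claim: str, reports_by_source: dict[str, set[str]]) -> str:
    confirms = [s for s, seen in reports_by_source.items() if claim in seen]
    denies   = [s for s, seen in reports_by_source.items() if "NOT " + claim in seen]
    if denies:
        return f"stand down: contradicted by {denies}"
    if len(confirms) >= 2:
        return f"act: corroborated by {confirms}"
    return "hold: single-source claim, wait for secondary data"

reports = {
    "@AP_twitter":  {"explosion at White House"},      # the hijacked account
    "wire_service": set(),                             # silent so far
    "pool_report":  {"NOT explosion at White House"},  # on-site source disagrees
}
print(decide("explosion at White House", reports))
```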
Eric: It’s not uncommon for the Associated Press to break news. If you apply that type of knowledge to the stock market, where fortunes are made and lost in a matter of seconds, you don’t really have the time to be able to fact check.
Jeff: I’d be pretty concerned running automated algorithms with their fingers on the triggers that fast, but that’s the thing.
Businesses are trying to cut down latencies too. At Goldman Sachs in 2007, I heard it said that every millisecond they can speed up their algorithm makes them $100 million a year.
There’s no time to put a human in all that.
Eric: What if you put Watson in there instead?
Jeff: I was touring around the labs and there were two technologies that really caught my interest that were related to my own, and one of those was Watson.
One of the things that I liked about Watson, and this was before we beat the game show Jeopardy!, is this:
When it’s analyzing the data and starts to find things that agree with each other, it actually starts looking for places of disagreement. It actually looks for the disagreement.
It looks for the contradictory data. I remember looking at that and thinking, “That’s real interesting. That’s probably a very smart way to do that.” I think that’s one of the many things that make Watson unique.
Eric: When you think of the type of questions Watson can answer, they’re usually questions that can be answered with historical information. But when you are making decisions like trading stocks, you’re trying to predict the future. And you can’t necessarily predict the future based on the past.
Now, I know you are not a digital or social media monitoring guy. I know that’s not your thing, but it is something my listeners are interested in.
When Google announced that they would be sunsetting Google Reader, which was really the only free media and social media monitoring service that also had basic analytics and the ability to search within the specific feeds you’re monitoring, it really did create a bit of a vacuum in the marketplace.
In terms of free alternatives, there’s a service called Feedly, which I know you are not familiar with. They basically have a simple UI and native mobile apps, and that was enough for them to rush in and fill the void.
They were able to inherit a lot of the users that were left high and dry when Google sunsetted Reader. I actually wrote about another contender, called Netvibes, as a good alternative.
Jeff: With these kinds of services, do you give them a few keywords for things you want to watch, and they basically monitor all of Google’s content, or do they monitor just Twitter? What’s the scope?
Eric: Well, you can choose. Basically, you can monitor different sources, based on Boolean queries. You can monitor Twitter, you can monitor Facebook accounts that don’t have their privacy settings set to prohibit you from doing so, you could monitor news, you could monitor Google News, Yahoo News, you can monitor blogs.
Jeff: So you could say, “money‑laundering investigations,” and it would grab everything said anywhere in the world that includes those words?
Eric: That’s correct. Anything that’s public, they would find it and they would deliver it.
Jeff: That includes blogs, news services, and documents posted on Google?
Eric: That’s correct. Then they would give some ability to analyze the data and compare it to other results.
Well, that’s what Netvibes does. Feedly, on the other hand, just gives you the ability to monitor bigger, better-known sources.
You can’t hack an RSS feed and put it into Feedly. They won’t support that any longer. But in Netvibes you could.
In Netvibes you could also bring in other data. Say you wanted to compare discussions on Twitter to sales volume: you could import that sales data into Netvibes and compare the two.
It’s the type of tool that someone non‑technical like me could use. I wouldn’t need a team of computer scientists to get it up and running.
Jeff: You tell me, but I would think that it would still be too much information and not precise enough.
Someone might say it’s all false positives, but the way I would think about it, I would consider it a decent tier‑one triage.
I don’t need my starting kit to be the whole universe. I want a portion of the universe, and I’m going to use that and coordinate with the other data that I have internally to then find the real nuggets.
So I wouldn’t think of it as an actual nugget‑finder. It would find too many things. Is that how it really is?
Eric: Is it possible to find the nuggets through Boolean query?
Jeff: I think the goal is where should I be focusing my attention? You sign up for a feed like that, it gives you a thousand things a day. It’s better than having to look at a trillion things a day, but it’s still too many things to look at a day.
Eric: Clearly. Why couldn’t you then sort it again with more keywords?
Jeff: You can do that to some degree. I did a blog post about this called “Data Beats Math.” There’s a point where just doing more queries against the same data only takes you so far. I think you have to look at secondary data.
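Here’s a tiny sketch of that point: a stricter Boolean query over the same feed still leaves a pile, while a join against secondary data you already hold surfaces the nuggets. The feed items and internal list are invented:

```python
# Sketch of "data beats math": a sharper query over the same feed still
# leaves a pile, while a join against secondary data you already hold
# surfaces the nuggets. Feed items and internal data are invented.

feed = [
    "money-laundering investigations expand across EU banks",
    "opinion: are money-laundering investigations overblown?",
    "Acme Corp named in money-laundering investigations",
    "money-laundering investigations: a history, part 12",
]

# More math on the same data: a stricter keyword filter. Still noisy.
hits = [item for item in feed if "money-laundering investigations" in item]

# Secondary data: a hypothetical internal list of our own counterparties.
counterparties = {"Acme Corp"}

nuggets = [h for h in hits if any(name in h for name in counterparties)]
print(f"{len(hits)} query hits -> {len(nuggets)} nugget(s): {nuggets}")
```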
Eric: What a lot of the systems offer — and most of us are quite dubious of their accuracy — is sentiment analysis.
Jeff: I’m the beneficiary and the victim of sentiment analysis systems.
It’s not my own area but if somebody says, “Hey, we want to make a claim about what the sentiment is,” up or down, I’d say, “Great, tell me. What is it? Is it up or down?”
The question is do you want to make buying, selling… What kind of decisions are you going to make on that? If they’re marketing decisions, it’s safer than if you’re making decisions about what to investigate next.
Eric: So you’re not going to make key business decisions based on sentiment analysis.
Jeff: If you’re trying to figure out what ad to buy, that’s a fine business decision. If you’re trying to figure out whether you’re going to give someone employment or credit, you should be really concerned, because we have something called the Fair Credit Reporting Act.
There’s a lot of work on this in the privacy community.
The Fair Credit Reporting Act is one of the better privacy laws in the world.
It applies if you’re going to use any kind of data that decreases somebody’s access to credit or a job or insurance. You have to tell them what the derogatory data was and give them a chance to dispute it.
If you’re going to use that kind of information to deny somebody credit or charge them more than their credit score warrants, you should be way more cautious.
Eric: There are some limits to the Fair Credit Reporting Act. Like, if I was going to get a job in the casino cage, that would be fair game, right? It would be okay for a prospective employer to look at all my data if they’re considering me for that job.
Jeff: When you go to work in some places, they’ll do a background check on you. When they do a background check, they have to disclose to you that they would like access to your credit report, and you have to authorize that.
If they see your credit report and take any negative action against you, like not giving you the job because they saw something derogatory, they have to tell you what it was so you have a chance to dispute it. It’s a great thing. It gives transparency to the process and protects people affected by bad data.
Eric: There’s this conference that I’ve been chairing for the last few years called The Digital Impact. This last year, one of our keynotes was an independent designer who is engaged by manufacturers to figure out ways to connect their products to the Internet.
He gave this presentation about his experiences. He said he was working for an office chair manufacturer. The office chair manufacturer had him monitor how people use chairs.
He said that if you monitor how someone uses a chair, you see that everyone has a unique signature in terms of how they use it.
Then they started thinking, well, if everyone’s got a unique signature in the chair and you can monitor how they use it, what could you do to improve their productivity through the chair? One of the things they came up with was, if someone’s idle in the chair for a long period of time, you could vibrate it.
Jeff: Shock them. You could shock them.
Eric: My eyebrows kind of go up and I’m thinking, “Oh my God.” Right?
Then he goes into this presentation about this pilot project that he did for a health care provider.
Apparently a lot of older patients were calling 9‑1‑1 and getting an ambulance to bring them in. They would get a battery of tests, the doctors wouldn’t find anything, and they would be sent home.
They realized these people were just lonely. So they got these teddy bears, and the teddy bears had lights in them, and they vibrated, and they said things.
What they would do is buzz the teddy bear once in a while, and it would say something and the lights would go on, just to show the person that they’re not alone, and hopefully make them feel less lonely.
It hit me at that point.
I thought, “Oh my God.”
This is the future. Right?
They’re going to buzz my chair, and then when I go crazy they’re going to give me a teddy bear. And you look at how profit motives exploit technology first, because, really, they’re the only ones that can afford it, and it’s frightening.
Based on your experience and your knowledge, what scares you about the Internet of things?
Jeff: I think it’s that sensors are getting smaller. You won’t know where the sensors are.
Right now, you walk down the street, and there are surveillance cameras and you can see them. I think a lot of people would be a little creeped out knowing there was a sensor in their chair that was imperceptible, or that their flower pot was actually sensing voices.
The question is do you know what data is being collected? The Internet of things means sensors will be everywhere.
Eric: Does it keep you up at night?
Jeff: No. I think the world is becoming a safer place. I think you’re going to live longer today than at any time in the history of mankind. You’re going to live longer and be healthier.
I think there’s lots of goodness there. As for the saturation of sensors, we’re actually opting into them. People are creating irresistible services. On my phone, when I go to turn the geolocation off, it goes, “Are you sure? If you turn it off, we won’t be able to find your phone.”
Eric: Right.
Jeff: That’s the journey we’re on. It’s all about irresistible services. It’s causing people to just sign up for everything and let there be lots of surveillance.
Eric: You created this system that lets businesses and organizations analyze personally identifiable information without violating privacy rights. How did you do something like that? How’s that even possible?
Jeff: Well, any time you want to link data together from different piles, you’re going to have to bring it together. Every time you bring it together, you’re copying it. Every time you copy it, you’ve increased the risk that it’s going to run away on you.
One of the things that I invented was a technique that allows you to anonymize the personally identifiable information like your name and address and phone, then analyze it robustly after it’s been anonymized.
I came up with the technique to anonymize it first and still be able to analyze it. I wouldn’t say it protects privacy, but it certainly improves privacy. It’s privacy and analytics in sync.
What it does is reduce the risk that your Social Security number or date of birth runs away from you.
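The interview doesn’t spell out Jonas’s actual technique, but as a minimal sketch of the general shape of anonymize-first analytics, assuming the parties share a secret key agreed out of band: normalize the PII, replace it with a keyed one-way hash, and match records on the digests alone:

```python
# Minimal sketch of anonymize-first matching (not Jonas's actual method):
# normalize the PII, replace it with a keyed one-way hash, and match on
# digests so raw values never leave either system. Key and records invented.
import hashlib
import hmac

SHARED_KEY = b"agreed-out-of-band"  # hypothetical key both parties hold

def anonymize(value: str) -> str:
    normalized = " ".join(value.lower().split())  # crude normalization step
    return hmac.new(SHARED_KEY, normalized.encode(), hashlib.sha256).hexdigest()

bank_accounts = {anonymize("Jeff  Jonas"): "acct-1234"}  # party A's pile
watch_list    = {anonymize("jeff jonas"): "case-99"}     # party B's pile

# The comparison happens on digests alone; no one exchanges the raw name.
for digest, account in bank_accounts.items():
    if digest in watch_list:
        print(f"match: {account} <-> {watch_list[digest]}")
```

The keyed hash matters: without a secret key, anyone could hash candidate names and reverse the “anonymized” values by brute force.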
Eric: One of the things you do is help organizations make sense of big data, of information in real time. I’m curious to know what precedes being ready to do something like that? What do you need to have in place first?
Jeff: You have to have enough data to get the business outcomes that you want. I did a blog post about this called “Fantasy Analytics,” because at over half the organizations I go see, I ask them, “What do you want to accomplish?”
They go, “We want to do X.”
I go, “What data do you have to do that?” When I look at the data they want to use, not even a room full of divine beings could use that data to get that outcome. They lack what I call a sufficient observation space. They just don’t have enough data.
You sit down with somebody and you say, “What do you do?” They say, “Our job’s to protect the supply chain.”
I’m like, “OK. What are your goals?”
They say “I want to find bombs.”
I’m like, “I love projects like that. What do you got?”
They’re like, “We’ve got the sender, the receiver, we have who drives the boat to move it around, and we have what’s in the manifest.”
I’m like, “What else?”
They go, “That’s it.”
I’m like, “You’ll never find a bomb. No one writes ‘bomb’ on manifests. Are you smoking crack? That’s crazy.”
What you have to do is widen the observation space. You have to add more data. That’s the first part: do you have sufficient data to get the outcome that you want? Another thing about big data: just because you have a pile of big data doesn’t mean there’s gold in the hills. It doesn’t always mean that.
So I ask them, “What’s the low hanging fruit? What can you really create an important business outcome from?”
Then the question is, “Do you have sufficient data to actually do that?” Over half the time, I have to help them think about what a sufficient amount of data to do that would be.
Eric: Obviously, a lot of people are really concerned about the state of personal privacy. The fear is, with so much personally identifiable information in one place, what could the government take away from false positives, or how could all that information in the hands of those with money and guns undermine democracy?
Knowing what you know about this space, what are the benefits and drawbacks of a database with that much personally identifiable information in it?
Jeff: You know, I was a really late bloomer to privacy. Really late. I didn’t even know what the word meant until maybe just a decade ago. I built a lot of systems with a lot of data about people in them.
In 2003, a former executive at the CIA said to me, and this is after 9/11, “If the terrorists blow up our buildings and kill our people, we don’t lose. But the day we have to change our Constitution to respond, we’ve lost.”
I thought that was a really insightful comment. It almost sent a little chill up my spine. It was one of the many little lessons I’ve had on my journey about responsible innovation, about what data should be collected under which laws to protect the country.
I’m working on a blog post right now to actually cover this in some detail. It’s such a nuanced conversation, but the challenge that the privacy community has is that our Fourth Amendment says we should be free from unreasonable searches.
Searches and seizures need to be reasonable and particular. The question is, if you’re a government collecting all the data on everybody, is it reasonable and particular?
Before the change in the laws, you would have to say, “I want a record about Billy the Kid. We’re investigating Billy the Kid. Do you have any records on Billy the Kid?” We’ve had a number of laws, including Section 215 of the US Patriot Act and amendments to what’s called FISA, the Foreign Intelligence Surveillance Act, that make it possible for the government to collect more than that.
This is where the debate’s going to be. We have these new laws that make it possible for the government to do this, and the privacy community would ask, how do you reconcile those laws with the Fourth Amendment? This goes for any organization. The question is, what are they collecting and how are they using it? That’s where the debate is going to be.
Eric: Do you think we wind up in an environment where, if someone accesses my record, I get a notification that says, “Hey, this agency has accessed your record in the PRISM database. If you want more information under the Freedom of Information Act, you can send a request to wherever”?
Jeff: That would make it more like the Fair Credit Reporting Act. One of the cool things about the Fair Credit Reporting Act is that there’s an inquiry line at the bottom of the credit report that tells you who’s looked. You can go look at your credit report and see who’s seen it.
Now, the thing is you can’t do that on everybody. If you’re doing a secret covert investigation on somebody, you can’t tell them you’re investigating them.
I think there’s going to be changes coming. I don’t know what they are. This is what lawyers and policy people are working on. This will be an interesting, and an important debate for our country.
Eric: Do secrets have a future, or are we moving into an inevitable world where everyone will have access to everything?
Jeff: It’s going to be really hard to have a secret in the future. Secrets are going to get really hard to keep. The world is going to become more transparent. It’s very fascinating. There are two things I worry about on that. One is, if you knew everything was going to be known, would you decide to change your behavior?
There are two different kinds of futures there, right? Will everybody try to look more normal and fight their way to the center of the bell curve, or will the world become more tolerant of everybody’s differences and uniquenesses? I’m hoping for the latter.
Eric: I think as young people and digital natives become more aware of the fact that this information is being collected and could ultimately be triangulated and used against them someday, then rather than express realities about themselves on social media, they’ll share things that are aspirational.
We’re seeing that already. We’re in this environment now where people have this sort of…reflex to share every moment of their life on a social network, but really only those moments that they want to be remembered. I don’t think there are a lot of people sharing pictures of themselves at the family planning center, or checking in on Foursquare for kidney dialysis. At the same time, if it’s little Johnny covering second base for the first time, that gets shared.
Jeff: Where the other data is leaking out is, it turns out, your friends give you up. It’s the other pictures that your friends take that they want to have remembered, where there’s a lot of leakage about what you would consider personal to you. One of the survey questions I’ll ask audiences I speak to is, how many of you are not on Facebook? I don’t know what the number is; let’s say 10 percent raise their hand.
I go, “Well, that’s just a lie. You’re on Facebook. You just haven’t claimed your territory.” As soon as your name and contact information is in anybody else’s address book and they’ve uploaded it, trust me, there is an entire folder there on…probably everybody on Earth. You’re already in there. Your friends are giving you up, and this is going to continue to happen. It’s going to cause a lot of release of data, and it’ll make it much harder to have secrets.
Is it bad? I think we’re going there.
The question is, I think, whether organizations are going to figure out how to harness that responsibly as often as possible. In theory, it’ll produce better services and we’ll use energy systems more efficiently, and I’m somewhat of an optimist.
Eric: Jeff, you already told us that you’re a latecomer to privacy, and you’re the father of three adult children; when they were young, digital probably wasn’t that big of a deal.
Now, obviously, we’re in this environment where everything’s recorded. Imagine you were fathering young children growing up now, with what you read about cyber‑bullying and false positives, like people accusing the wrong suspects in the Boston Marathon bombings.
Knowing what you know about business intelligence and analytics, what advice do you have for parents of young children today, who are teaching future citizens about safe and unsafe ways to use digital technology for communication, pleasure, and entertainment?
Jeff: Man, that’s a lot of responsibility to put on me. I’ll tell you how I did it. I don’t know that I would recommend it for everybody; it’s just the way I did it. I’m not recommending it.
Eric: Fair enough.
Jeff: I raised my kids so they know so much about the world and so much about me that nothing could ever come out about anything I’ve ever done that would surprise them.
I gave them the most candid view of who I am, and by doing that, I’d like to think I insulated them from whatever they would encounter.
If I’m running around trying to control what they can observe, when the world’s making it harder and harder to have secrets, it’s just a house of cards.
I raised my kids to know who I really am and what’s really going on in the world, and I’d like to think it gave them a level of judgment and as true a calibration as I could help with.
The jury’s still out, but they’re doing okay. So far, so good.
Eric: My interpretation of that is, that you would not set the parental controls on the home Apple computer.
Jeff: You know what, I would not want them to be under 13, lying about their age, and having access to things that are restricted. You know that there’s a law, right? I would be careful about that, so it depends what age you’re talking about. Past that age, the platforms that know users’ ages are trying to exercise some due diligence to make sure things aren’t being misused.
I give my kids tons of freedom, over 13. Tons and tons of freedom.
Eric: But under 13, because…
Jeff: They’ll just be at their friends’ house and see it anyway. It just says, “Daddy is trying to control me.”
Eric: My eight‑year‑old uses YouTube for everything. He’s learned to play tennis on YouTube. He’s learned to juggle on YouTube. He learns how to use Adobe products on YouTube. Anything he wants to learn, he just goes to YouTube and searches it. He finds a little video made by some other eight‑year‑old who tells him how to use the cropping tool in some application, or he watches Rafa Nadal’s stroke, and he goes outside and practices it.
I’m fearful that he’s going to search something that’s going to wind him up on some page where he’s going to see some pornography, or some violence, or something like hate…whatever. Do I put on the parental controls, or do I just teach him? First of all, he’s the type of kid who, if he did see that, would go away from it. I think.
Jeff: My communication has been open with my kids. We talk about everything. If they bump into something like that in the world… and by the way, just because you’re controlling it at your house, as soon as they go to a friend’s house where it’s not controlled, you’re right back there. You’re controlling it, somebody else isn’t, and they still bump into it. I would just make them ready for it and have good communication.
Eric: Education, rather than surveillance.
Jeff: That’s interesting, because it goes right back to a lot of things I learned about privacy in Vegas. There are a lot of sensors in Vegas, but they’re very rarely sensing everybody individually. They throw the video away; they only keep it if there’s a bad incident.
The joke is, what happens in Vegas stays in Vegas, and it stays on video. The truth is, they just throw the video away. When they find flaws in games, they don’t go instrument more surveillance and do more background checks on their customers. They fix the flaw in the game.
Eric: What do they do to insulate their employees and individuals from making mistakes that could wreak havoc?
Jeff: They use training. They use training, and change process. They fix process, and then they train the employees to follow the process. That’s what they do.
I’ll give you a scam. Somebody watched the outcome of a roulette game, every number the ball fell on, and wrote down every number for weeks. They realized the wheel had a bias; it wasn’t perfectly balanced. They played the bias, and over a few weeks they won something like $5 million. They had essentially turned the game into an ATM machine. When the casino finally figured out how they were winning, they closed the wheel down. Did they surveil more people? Did they change… how did they fix that?
They just implemented a simple process change: the frequency with which they test the wheels for balance. They put no more surveillance on any people to prevent that from happening again. They fixed the process.
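That process fix is just routine statistics. Here’s a sketch of the kind of balance test a casino could run on logged spins, comparing observed pocket frequencies to the uniform expectation with a chi-square statistic; the data, bias, and threshold are illustrative only:

```python
# Sketch of a routine wheel-balance check: compare logged spins against the
# uniform expectation with a chi-square statistic. The spin data, bias, and
# critical value shown are illustrative, not a casino's actual procedure.
import random
from collections import Counter

POCKETS = 38  # double-zero wheel

def chi_square(spins):
    expected = len(spins) / POCKETS
    observed = Counter(spins)
    return sum((observed.get(p, 0) - expected) ** 2 / expected
               for p in range(POCKETS))

rng = random.Random(7)
fair = [rng.randrange(POCKETS) for _ in range(38_000)]
# Biased wheel: pocket 17 hits roughly 50% more often than it should.
biased = [17 if rng.random() < 0.013 else rng.randrange(POCKETS)
          for _ in range(38_000)]

# For 37 degrees of freedom, a statistic above ~52.2 is suspicious at the
# 5% level; the biased wheel blows far past it.
for label, spins in (("fair", fair), ("biased", biased)):
    print(f"{label}: chi2 = {chi_square(spins):.1f}")
```

Run the test on a schedule and a drifting wheel gets caught long before anyone logs weeks of spins, with no additional surveillance of players.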