Week-4.1 Privacy and Pictures on Online Social Media

Welcome back to the course. I hope you are enjoying the course and studying some

new concepts, new ideas, and new solutions. This is week 4 of the course Privacy and

Security in Online Social Media. What I will do now is continue the topic of privacy that

we were talking about last time.

Just to let you know, we are on the topic of privacy for now; we just covered

trust and credibility, and I assume by now you are all well versed with a little

bit of Linux and a little bit of Python: how to collect data from Twitter, how to store the

data, what kind of MySQL queries you should write, and all that.

Last week we saw how Westin categorized US citizens into 3 categories: Fundamentalists,

Pragmatists and Unconcerned. Fundamentalists are about 25 percent, Pragmatists about

60 percent and Unconcerned about 15 percent. Fundamentalists are the people who

do not give away any personal information. Pragmatists make decisions about privacy keeping

the situation in mind. The Unconcerned are the people who readily give away personal information;

they are about 15 percent in the US.

I asked you a couple of questions last time about some data that was collected from a

large set of the population in India. One of the questions that I asked was:

what do you feel about the privacy of your personal information on your online social network,

which here is Facebook? The most common answer, from about 42 percent, was

that having specified their privacy settings, their data is secure from a privacy breach.

Another question that I asked you was: if you receive a friendship request

on your most frequently used online social network, which is Facebook in this case, which

of the following people will you add as friends? The highest was actually a person of the opposite

gender. I am pretty sure that, after the last couple of weeks of going through this

class on social networks, even your own behavior may be changing. You may

be looking at some of these requests more closely, and you may be devising your own mechanism

for deciding, for any friend request that you get, how you are going to accept it or how you

are going to deny it.

Now, the data is publicly available please feel free to actually play around with the


Last time I left you with the question: what kinds of privacy issues do you

have on Facebook and Twitter? How do you define privacy? It is nice to see some of

you posting information about various Facebook privacy issues, or your own questions about

Facebook privacy issues, on the forum. We should make the forum more active. Also,

there are some questions that come up repeatedly. We are trying to answer as much

as possible, but when questions are repeated we may not answer them again. I strongly

recommend that you check the forum before posting a question.

So, let us look at what privacy is a little bit, and then go into a little detail about some

research that was done on analyzing the state of privacy on Facebook. One of the

definitions given earlier about privacy was that “Privacy is a value so complex,

so entangled in competing and contradictory dimensions, so engorged with various and distinct

meanings, that I sometimes despair whether it can be usefully addressed at all.” So

that was Robert Post talking about privacy in his “Three Concepts of Privacy.”

But I think privacy is actually tough to define. I mean, if you were to look

at what privacy is for you while you are sitting and listening to this lecture, versus privacy

in your school, privacy at home, privacy at work, it is very different. It is very hard to

define what privacy is for a particular individual across various situations, and that is what this

definition is trying to capture: privacy is entangled in

competing and contradictory dimensions.

Fundamentally, privacy has always been talked about as control over information. Here are two

definitions that Alan Westin gave in his book “Privacy and Freedom”

in 1967. “Privacy is the claim of individuals, groups or institutions to determine for themselves

when, how and to what extent information about them is communicated to others.”

So it is basically about determining for myself how much of my information I

share with others. “Each individual is continually engaged in a personal adjustment

process in which he balances the desire for privacy with the desire for disclosure and

communication.” How much do I want to reveal about myself, and how much do I want to

keep anonymous? That balance is the way

the word privacy is defined, and it is the way you control the information

that you are spreading. So, I am sure you get the idea:

privacy is very hard to define, and it is also very difficult to come up with

a list of privacy expectations for any individual in all given contexts. These definitions convey that

privacy is about control over information. Sometimes it could be a group's information

also. Given that India is more of a collective society, we generally talk about the privacy

of a group instead of individual privacy, whereas in an individualistic

society the private information of individuals is more protected than the private

information of the group.

Some forms of privacy that people have come up with are information privacy, communication

privacy, territorial privacy and bodily privacy. Most of the time, when we talk about privacy,

particularly in courses like this one, we are referring to information privacy, and particularly

internet privacy. There is also communication privacy, which covers

telephones and other forms of communication. Territorial privacy is about my living space,

my home, my city, my country, and the topics around that. Bodily privacy is about the self:

information about my own physical presence is also discussed under the concept

of privacy. For example, a CCTV camera is one example where bodily privacy can be actually


Now let us look at some specific studies that have been done on analyzing privacy

in online social networks. Here is a study that I will walk you through; the reference

to the study is at the end of the slides. I will walk you through what they

did, what they found, how revealing the information is, how good the study was, and how privacy

is being studied in the context of Facebook, social networks and publicly

available information. First, some background about pictures uploaded

on social networks. In the year 2000, 100 billion photos were shot worldwide. In

2010, 2.5 billion photos per month were uploaded by Facebook users alone. And if you remember

lecture 1, where I showed you an infographic about how much information

is uploaded on social networks in 1 minute, we saw that 1.8 billion photos were

uploaded every day on Facebook, Instagram, Flickr, Snapchat and WhatsApp together. So

there is a lot of information, a lot of pictures, being uploaded on social networks.

Companies like Facebook, Microsoft, Google and Apple have acquired a lot of face

recognition companies in the last few years to study, understand and use these technologies

to identify faces in pictures that are being uploaded on social networks or online

services. It has become very important to apply techniques like machine

learning, deep learning and related concepts to these images to study what is happening

on online social networks. I recently wrote a blog post about the importance of

images on online social networks. I will share it on the forum just after this lecture.


If you look at what is going on currently with these uploaded pictures

and the privacy of individuals, public self-disclosure through online social

networks is increasing. For example, I take a selfie standing near one of the

important spots in Delhi and upload it, and you know that I am in Delhi;

or I take a picture next to the Taj Mahal and upload it to my Facebook account, and you know

that I am traveling to the Taj Mahal now.

There used to be a site called please rob me dot com; I do not think the website

is active now. What this website did was the following.

Say I have a Twitter account that I created

from Delhi, and I am posting about the weather in Chennai or Hyderabad or California. They would

pick up this tweet and post it on please rob me dot com, saying that this account was

originally created from Delhi whereas this post is talking about the weather

in California, so you are probably not at home and your home is probably locked up.

The site got a lot of flak, but I think it was an interesting idea: they

made use of the information that users of social networks are disclosing

themselves about their location. This is self-disclosure through online social networks, and there are

many issues arising because of self-disclosure of information

on Twitter, Facebook, Instagram and other networks.
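
To make the location-mismatch idea concrete, here is a minimal Python sketch. It is not how the actual site worked; it only assumes that we know the place an account was created from and that we can spot place names in a post with a simple word lookup. The `Post` class, the `KNOWN_PLACES` list and the sample posts are all invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical place list; a real system would need proper place-name detection.
KNOWN_PLACES = {"delhi", "chennai", "hyderabad", "california"}


@dataclass
class Post:
    account: str
    home_place: str  # where the account was created (assumed to be known)
    text: str


def places_mentioned(text: str) -> set:
    """Return the known place names that appear as words in the text."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return words & KNOWN_PLACES


def possibly_away(post: Post) -> bool:
    """Flag a post that mentions some known place but not the home place."""
    mentioned = places_mentioned(post.text)
    return bool(mentioned) and post.home_place.lower() not in mentioned


# Invented sample posts.
posts = [
    Post("user_a", "Delhi", "Lovely weather in California today!"),
    Post("user_b", "Delhi", "Raining again in Delhi."),
    Post("user_c", "Chennai", "Just finished lunch."),
]

flagged = [p.account for p in posts if possibly_away(p)]
print(flagged)
```

Even this crude check shows why self-disclosed location is sensitive: nothing beyond the public post and the account's home place is needed to guess that someone is away.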

So on one side this increase in public information is going on. In parallel, there

is also an increase in face recognition accuracy. Earlier the accuracy was lower; now the

techniques and technologies have improved. In particular, on networks

like Facebook the accuracy is pretty high, because the search space in which they have

to look for a particular face in the picture that you are uploading is only your

friends. Most of the time you are going to be taking pictures with friends to

whom you are already connected, or with people who are probably one, one and a half

or two hours away from you. So, that is happening on one side. And then

there is the whole idea of the cloud: storing information on the cloud makes it easy to compute, and the computing

cost of doing any of these analyses is becoming lower and lower. The fourth dimension

is that identification of these users, who they are and what kind of information they

are revealing, is also getting better. Concepts like k-anonymity came in 15 or

20 years ago, but many further and more advanced techniques have since been developed

to identify users, to identify faces, to identify information about users, and to re-identify people

across social networks and other networks. So those are four different things that are coming together:

increasing self-disclosure, improving accuracy of face recognition techniques,

the whole idea of cloud and ubiquitous computing, and techniques for re-identification of

users getting better and better.

One important and interesting question that people could ask is: can one

combine publicly available online social network data with off-the-shelf face recognition

technology, which is something that is already available, and be able to re-identify individuals

and find potentially sensitive information? That is the question that we will be talking

about in the next deck of slides: can we take some publicly available information,

which is the things that I have uploaded on Facebook and the things that I have uploaded

on Twitter, and connect it with off-the-shelf

face recognition technology, meaning tools like TensorFlow that I will

also mention later in the slides? Can we use these techniques on just faces and be able

to re-identify the person, and also find sensitive information about

the users themselves? That is the question that we will be talking about right now.

Here is the goal. The goal is to take un-identified sources, which is any website that you can

think of: match dot com, shaadi dot com, photos from Flickr, CCTV feeds and things like that,

where it is impossible or very hard to identify the users because the users themselves are not disclosing

who they are on these websites. They may have pseudonyms or names that

you cannot tie back to a particular person. Can we take these

sources, shaadi dot com and pictures from Flickr, and connect them to identified

sources? On Facebook, I would actually reveal that I am so and so.

On LinkedIn I will say that I am so and so, and likewise on government websites and other services

that are available. So there are un-identified sources like shaadi dot com, and identified sources

where I am disclosing that I am so and so, where I upload a picture and my account

is actually ponnurangam.kumaraguru. Can we put these two together to get some

sensitive information about the individual? For example, gender orientation, or a Social

Security Number, or an Aadhaar card number, and information like that. It can

be pretty nasty if you can put these together and get some personal information.

So that is what we will be studying in the next slides.

Just to give you a broad view of some older, phenomenal work that was done on

this topic: Latanya Sweeney, who did the work called k-anonymity, picked

up medical data and connected it to the voter list, which is publicly available.

The medical data had ethnicity, visit date, diagnosis, procedure, medication and

the total charges paid by the patient. The voter list had name, address, date registered, party affiliation and

date last voted. Taking the voter list and the medical data

and putting them together, she found that zip code, birth date and gender were

common to both of them. If you gave the system

that she built a zip code, birth date and gender, it was

able to re-identify a lot of US citizens uniquely.

That is the idea that was built on to create something called k-anonymity. The problem

she highlighted was that by bringing together two different, independent sets of data,

medical data and voter data, you could

re-identify users uniquely.
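
The linking idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sweeney's actual system: the records below are invented, the field names (`zip`, `birth_date`, `gender`, `diagnosis`, `name`) are hypothetical, and a match is reported only when exactly one voter shares the three common fields. The helper `k_of` reports the size of the smallest group sharing those fields, which is the k in k-anonymity.

```python
from collections import Counter, defaultdict

# The three fields common to both data sets in the lecture's description.
QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")


def qi_key(record):
    """Tuple of quasi-identifier values used to join the two data sets."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(medical, voters):
    """Return (name, diagnosis) pairs where exactly one voter matches."""
    voters_by_key = defaultdict(list)
    for voter in voters:
        voters_by_key[qi_key(voter)].append(voter)
    linked = []
    for row in medical:
        candidates = voters_by_key.get(qi_key(row), [])
        if len(candidates) == 1:  # unique match means re-identification
            linked.append((candidates[0]["name"], row["diagnosis"]))
    return linked


def k_of(records):
    """Size of the smallest group sharing the same quasi-identifiers."""
    return min(Counter(qi_key(r) for r in records).values())


# Invented records, for illustration only.
voters = [
    {"name": "Asha Rao", "zip": "02139", "birth_date": "1970-01-01", "gender": "F"},
    {"name": "Ravi Kumar", "zip": "02139", "birth_date": "1980-05-05", "gender": "M"},
    {"name": "Vikram Singh", "zip": "02139", "birth_date": "1980-05-05", "gender": "M"},
]
medical = [
    {"zip": "02139", "birth_date": "1970-01-01", "gender": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1980-05-05", "gender": "M", "diagnosis": "diabetes"},
]

print(link(medical, voters))
print(k_of(voters))
```

In this toy data the first medical row links to a single voter, while the second is ambiguous because two voters share the same three fields. A data set is k-anonymous over these fields only if every such group has at least k members.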

In experiment one, they connected

online data to online data. They mined publicly available images from Facebook

and went on to re-identify profiles on one of the most popular dating sites in

the US. They used a tool called pittpatt dot com, which was a face recognition tool.

After the study was done the tool was acquired by Google; it does face

detection and face recognition. You could use TensorFlow now. TensorFlow is

an open source library for machine learning. Please consider exploring TensorFlow a

little bit and how it works and what are the libraries that are available inside tensor


The data that they used: first, as I said, they took the identified data. They downloaded

Facebook profiles from one city in the US, which is possible; from what you know

about Facebook data collection now, you could collect data from a particular city.

The profiles that they collected were about 270,000, and the images collected were around 274,000.

The faces detected were about 110,000. This is the data that they had for

the identified data set, which is where you can say: these are the names, these

are the profiles that are connected to these pictures.

For the un-identified data, they downloaded pictures from one of the popular dating websites.

To step back: the first was the identified data, and now we are talking about

un-identified data, which is like CCTV footage, publicly available information, or data

from match dot com or shaadi dot com. They downloaded the profiles. On such sites,

to protect identities, the names are of course not revealed, and the accounts

may have pseudonyms. The photos that were downloaded from these

websites were used to identify the profiles. To make the comparison appropriate

they used the same city for the search: they downloaded data from Facebook and

from this un-identified data set for the same city. The profiles collected here were close

to 6000 and the faces detected were close to 5000. So that is the identified

and that is the un-identified data.

The approach was: un-identified data from the dating website, identified data from Facebook

profiles, and re-identification to be done between them. More than 500 million pairs were

compared, because each picture in one data

set was compared with each picture in the other data set, from the un-identified

to the identified and the reverse. They used only the best

matching pair for each dating site picture. PittPatt, and I am sure TensorFlow-based tools also,

produce match scores in some range, and they used the

best score that they could get when

comparing two pictures.
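
Two things in this step can be checked with a short Python sketch: the number of pairs, and the "keep only the best match per dating-site picture" rule. The face counts are the rounded figures from the lecture (about 5000 and about 110,000). The match scores, the picture identifiers and the assumption that a higher score means a better match are all invented for illustration; PittPatt's real score range is not given in the lecture.

```python
def count_pairs(n_unidentified: int, n_identified: int) -> int:
    """Every un-identified face is compared with every identified face."""
    return n_unidentified * n_identified


def best_matches(scores: dict) -> dict:
    """Keep, for each dating-site picture, the Facebook picture with the top score.

    `scores` maps (dating_id, facebook_id) to a similarity score, where a
    higher score is assumed to mean a better match.
    """
    best = {}
    for (dating_id, facebook_id), score in scores.items():
        if dating_id not in best or score > best[dating_id][1]:
            best[dating_id] = (facebook_id, score)
    return best


# Rounded counts from the lecture: ~5000 dating faces, ~110,000 Facebook faces.
print(count_pairs(5_000, 110_000))

# Invented scores for two dating-site pictures.
scores = {
    ("d1", "fb7"): 0.41,
    ("d1", "fb9"): 0.87,
    ("d2", "fb7"): 0.12,
    ("d2", "fb3"): 0.09,
}
print(best_matches(scores))
```

With the rounded counts this gives 550 million comparisons, consistent with the "more than 500 million pairs" figure. Note that the best match for a picture can still be a poor match, which is why the human check in the next step is needed.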

To confirm, to get ground truth on whether the two pictures show the same person: machine learning

techniques are not going to be foolproof

and they are not going to make 100 percent right predictions. Therefore, they

showed these pictures to MTurkers, the users who are part of Mechanical Turk, which is a

crowdsourcing mechanism where you can put up a small task, like identifying

whether these two pictures show the same person, and pay people a small amount of money for

doing the task. The Turkers were asked to rate each pair of pictures

on a Likert scale of 1 to 7, with at least 5 Turkers for each pair. Again, please try and

look up what Mechanical Turk is; it is a crowdsourcing mechanism. For example,

if I were to do a task of identifying whether a given email is phishing or not, I would

create the task on Mechanical Turk and get users to

look at the image and say whether it is phishing or not. Or they could look at a profile on Twitter to

say whether it is fake or not: they would click

on the link, go to the profile on Twitter,

look at the profile and then make a judgment

whether it is legitimate or not. It is very popular, and there are many

services like this. CrowdFlower, which is a mechanism in which many of these services

come together, is also very popular; CrowdFlower, c r o w d f l o w e r, is

one of the popular services like this. Mechanical Turk, which is from Amazon, is also very popular.

They took these two pictures and showed them to Mechanical Turkers, asking them to compare

the images and make the decision. There were at least 5 Turkers for each pair because that gives

more confidence: the more people who look at an image

and say that this is a chair, the higher the confidence that it is a chair.
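
Here is a small Python sketch of how several ratings for one pair could be combined. The lecture only says that each pair got at least 5 ratings on a 1 to 7 Likert scale; taking the median, and the cut-offs used below for "highly likely" and "likely", are assumptions made for illustration and are not taken from the study.

```python
from statistics import median

MIN_RATERS = 5  # from the lecture: at least 5 Turkers per pair


def aggregate(ratings, min_raters=MIN_RATERS):
    """Median of the Likert ratings, or None if there are too few raters."""
    if len(ratings) < min_raters:
        return None
    return median(ratings)


def label(score, highly_likely=6, likely=5):
    """Map an aggregated score to a label. The thresholds are assumptions."""
    if score is None:
        return "insufficient ratings"
    if score >= highly_likely:
        return "highly likely"
    if score >= likely:
        return "likely"
    return "unlikely"


# Invented ratings for three candidate pairs.
pairs = {
    "pair_1": [7, 6, 7, 5, 6],
    "pair_2": [5, 5, 4, 6, 5],
    "pair_3": [7, 7],
}

for name, ratings in pairs.items():
    print(name, label(aggregate(ratings)))
```

The median is used here because a single careless rater cannot move it much; an average or a majority vote would be equally reasonable choices.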

What they found was that highly

likely matches on the Likert scale were about 6.3 percent of the images that they took from the un-identified

and identified sets, compared

using the PittPatt tool and showed to the

Mechanical Turkers. Highly likely matches were about 6.3 percent, and highly

likely plus likely matches were about 10.5 percent. This basically says that 1 in 10

people from the dating site can be identified. The dating site is the un-identified data

set, whereas Facebook is the identified one. So for one out of every

10 pictures that I see, I will be able to identify who this

person is, because I have the Facebook data, collected from the same city, and therefore

the match is probably correct, and the Mechanical Turkers confirmed that. So, you can

see that 10 percent of the time the users can be identified.

One question to you, and I hope there will be some discussion on the forum

about it: what could you do better if you were the attacker? If you were to make use

of this information to increase the hit rate, or use this information

to do something against the user, what kind of things would you do? As an attacker

you want this percentage to be higher, because at 10 percent you are getting

a hit rate of only 1 in 10

pictures. With a better

attack mechanism you could increase this percentage

so that more and more pictures are

re-identified, and the information can therefore be

used maliciously.

As I said, there are 3 experiments. In experiment 2, they connected

offline and online data. In the first one they compared online versus online, which is the

dating website and Facebook; now they did offline versus online.

Pictures from one Facebook

college network were collected to identify

students on campus, and these were compared to offline pictures, which

were taken when the students were participating in the study. So this is

experiment number 2, again addressing the same question: can we take

images from social networks like Facebook and re-identify people in

data where users are not identified, such as a CCTV source?

So, what they did was put a booth in the university and take 3 pictures of

each participant. They were basically standing there and collecting data from the college students

in this university, 3 pictures per participant, collected over 3 days. From Facebook, they collected

about 25,000 profiles; the images were about 26,262 and the faces detected were about

114,000 for that university. So, the data collected from Facebook,

which is the online data, was about 25,000 profiles, about 26,000 pictures and

about 114,000 faces.

To summarize the whole experimental setup: pictures

were taken of individuals walking on campus, and they were asked to fill in a survey. On the next slide I also

have an image to show you the process of the study. Pictures

were taken of individuals walking on the

campus, and they were asked to fill in an online

survey. The pictures were matched in the cloud while they were filling in the survey. The researchers

asked whether you want to participate in the study; if so, they took 3 pictures of you,

and after taking the pictures they asked you to fill in an online survey.

While participants were filling in the online survey, the system that the researchers

built would compare the pictures just taken with the Facebook pictures

already collected from the university, bring back the comparison and show it

to them. The last page of the survey had the candidate pictures, so by the time participants

finished the survey they were shown the pictures, with the message: this is the picture

that we got from Facebook, do you agree that it is you? They were asked to select the pictures,

produced by the recognizer, which matched closely. So, that is the process of the study. Please

understand how the study was done: pictures were taken of individuals walking on

campus, they were asked to fill in the survey, while they filled in the survey the system

was comparing the pictures with those on Facebook, and the matching pictures were brought back into the survey and shown to

the user, asking whether these pictures are really of them.

The same thing is captured here as a process diagram. Pictures of the user are taken and uploaded,

which is 1, and then a response comes from

the server. Start survey is 3, and 4 is generating a survey token, so that through

this survey token the system can tie the image comparison back to the right survey.

5 is the custom survey token being sent to the user, who can then fill in

the survey. While that is happening, 6 is the face recognition results being

produced, and then the survey results together with the matched images are given

in 7. So that is the process of the study. It is not a

very difficult or complicated study, but it is collecting some very interesting


This is the result they got from the data collection. The pictures are shown here

just for the purpose of illustrating re-identification of the user.

The picture on the left is the picture that they took while the user was participating

in the study. So when the user came in, they took the picture on the left.

Using that picture they were able to identify the picture on the right, which is

the picture from Facebook in which this user was identified. So that is the output,

so to say: the input is the picture taken with the survey and the output is the image from Facebook

in which this particular person is re-identified. Such comparisons against Facebook pictures

can be pretty revealing.

There were about 98 participants in the study, all

students, since the data was collected in a university setup,

and they all had Facebook accounts. The result was that 38 percent of participants were

matched with the correct Facebook profile. That is, 38 percent

of the people whose pictures were taken in the

study were exactly matched with their Facebook

profile and account, and their information was brought back to confirm

it with the user. Interestingly, there was also a participant

who mentioned that he did not have a picture of himself on Facebook, and yet information about that

particular participant was also brought back. And of course, it was

taking very little time to do this comparison. I hope the study is making sense:

38 percent of the time, the users whose pictures were taken on the university campus

were identified from their Facebook profile.

Experiment 3 is interesting because they tried using the understanding

from experiments 1 and 2 to get at personally identifiable information like the Social

Security Number. In experiment 3 they wanted to predict the Social Security Number from

public data. So, they used the faces and the Facebook data collected in

experiments 1 and 2, together with public data, to predict the Social Security Number. For 27 percent

of subjects, the first 5 digits of the Social Security Number were identified within four attempts.

What this means is that, taking a face from the

database, they were able to identify the first 5 digits of the Social Security Number 27

percent of the time. That is revealing, and it is not a very good sign that for 27 percent of

the subjects they were able to find five digits of the SSN. So that is the third experiment.

I am keeping the third experiment a little light. In total, the interesting

things were pictures, an un-identified data set, an identified data set, and at the end they were

able to connect these to the Social Security Number also.
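
A figure like "27 percent within four attempts" is a hit rate over a limited number of guesses per subject. The Python sketch below shows how such a rate could be computed. It does not show how the guesses were produced, which the lecture does not cover, and the subject names, guesses and "true" five-digit prefixes are all made up.

```python
def hit_rate_within(guesses_by_subject, truth_by_subject, attempts=4):
    """Fraction of subjects whose true value is among their first `attempts` guesses."""
    hits = sum(
        1
        for subject, truth in truth_by_subject.items()
        if truth in guesses_by_subject.get(subject, [])[:attempts]
    )
    return hits / len(truth_by_subject)


# Invented five-digit prefixes, for illustration only.
truth = {"s1": "12345", "s2": "54321", "s3": "11111", "s4": "22222"}
guesses = {
    "s1": ["12340", "12345", "12350", "12355"],           # hit on the 2nd attempt
    "s2": ["54300", "54310", "54315", "54320", "54321"],  # right answer is the 5th guess
    "s3": ["11100", "11110"],                             # miss
    # s4 has no guesses at all
}

print(hit_rate_within(guesses, truth))
print(hit_rate_within(guesses, truth, attempts=5))
```

In this toy data one of four subjects is a hit within four attempts, and allowing a fifth attempt adds a second hit. That is why the number of attempts has to be stated alongside any such percentage.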

I am sure you could also think about how these kinds of techniques can be

applied to identifying Aadhaar numbers in India, and other personal details.

The study was done in the US, and therefore if you were to repeat this study and find

out Aadhaar numbers or other details of Indian citizens, it would be interesting to

look at that. If there are any ideas or any questions that you have about

how the study could be performed in India, it will be interesting to talk about them in the


Here are the pointers to the study that I just discussed.

With this I will wrap up week 4.1. I hope you understood what we were

talking about: privacy issues in online social networks, particularly

focused on collecting images and identifying users using the faces in the pictures

that are uploaded on social networks.
