Drexel CCI MS in Information Webinar: "Saving My Web Things for Me and Others but Not All"

Good afternoon, everybody. My name is Alex Poole and I'm an assistant professor at the College of Computing and Informatics at Drexel University. Today I'm honored to host the fourth in our MSI webinar series, which will feature Dr. Mat Kelly.

Dr. Kelly is an assistant professor at Drexel University's College of Computing and Informatics. He earned his bachelor's degree in computer science at the University of Florida; subsequently, he earned his master's and his PhD in that same subject at Old Dominion University. Dr. Kelly's research areas include web archiving, privacy, and information visualization. What is more, his teaching areas include web programming and systems and architecture. He brings a technical archival perspective to Drexel CCI's new MSI degree, particularly the digital content management major. He will teach courses in enterprise content management and data curation, among others. Dr. Kelly has received multiple awards for his research: for instance, he has received an Innovation Award from the Library of Congress and the National Digital Stewardship Alliance, a fellowship from the NASA Virginia Space Grant Consortium, and a number of best paper awards and nominations from conferences in the digital libraries field. His presentation this afternoon will provide a high-level summary of outstanding issues surfaced in the field of web archiving that he explored in his recently completed dissertation. It is an honor and privilege to introduce to you today Dr. Mat Kelly.

Thank you, Alex, for the introduction. I'm sorry for the technical difficulties, but hopefully, if you're hearing this, you're on the right channel. Today I'm going to be giving my presentation, "Saving My Web Things for Me and Others but Not All." It is sort of an extension of my dissertation and of where I'm going with my research. But first, a little bit about our degree program.

Drexel's DCM major is part of the master's in information science degree we are starting here at Drexel. To obtain the degree, you take three foundation courses, five core courses, six electives, and a capstone. Some of these courses are, for example, Information Visualization, Enterprise Content Management (which I'll be teaching this spring), and Applied Ontology. Hopefully this presentation will encourage you to put in your application for the master's degree at Drexel. So, without further ado, I'll get on to the presentation.

As you're probably familiar with the web, you see sites like the Drexel homepage on the left here, your Facebook page, or photos of your child online. We put information on the web and sort of expect it to be there; we go back the next day and assume it will be there. But sometimes these services go under or aren't accessible, and we no longer have access to that information. If we believe that information has value, then we want to preserve it in some way. For example, if we talk to our cousins on Facebook and want to revisit those conversations, but we no longer have access to Facebook, then we cannot access that information anymore. So there are efforts to preserve the web. The Internet Archive, for instance, is one such web archive that has efforts to preserve content, and you'd see things like the Facebook homepage on the right there, which may not be exactly what you would expect from the Internet Archive: they preserve things like CNN and publicly accessible pages. What you would expect to be preserved is your own content, your own Facebook; you'd want to revisit the conversations with your cousin, and without someone preserving that somewhere, the efforts of the Internet Archive don't really help you in that regard. That's one of the focuses of my research: preservation of content that the institutional web archives may not necessarily preserve.

Beyond the Internet Archive, there are other efforts with different scopes. For instance, the UK Web Archive only preserves things within the UK domain, or other countries' interests; as you see in the red icon in the middle, other national archives are in a similar situation. There are also services online that allow you to submit a URI, like archive.today or WebCitation, where you tell the service what you want to archive and it will do a one-off preservation of it. But a lot of these services still only get things that are publicly accessible, so things like your Facebook feed, your bank accounts, your baby photos, or corporate intranet sites (anything, really, that isn't on the surface web and is not publicly accessible) are not preserved for the future unless the individual does it. But it's probably a good thing that they don't have that content: you wouldn't necessarily want others to see your own photos, your Facebook feed of your conversations, your bank ledgers, or any content that has ramifications if it were to be preserved and subsequently made publicly accessible.

To talk a little bit about what web archives are with respect to the live web, we have this context: all four of the images on the right are captures of some part of the web, really Drexel in the past, but we don't necessarily have context of what they are or when they were captured. We want a way for these websites to say "this is what I am, this is when I am," so we can relate them back to the present web, what we see here. If we want to experience the past, we need to know what of the past is the basis of our experience. We want to be able to say what drexel.edu looked like in the past, and, for example, according to the Internet Archive, this is what it looked like in 2019, this is what it looked like in 2012, and so on, as far back as they have captures.

The Memento framework is a standard way to express the syntax and semantics by which these individual captures, or what we call mementos, can say what they are of. In the three different figures, there are captures of the CCI homepage in the past, in March of 2016, February of 2020, and March of 2014, whereas the one on the bottom right is the Drexel homepage in August of 2007. So we have a way to associate the three images of the Drexel CCI homepage with each other in the past, and then we have a completely separate timeline of the Drexel homepage in the past. Having this sort of association gives us context as to what we're looking at relative to what we can currently see in the present; Memento supplies that.

Web archiving has at least two components that we will talk about here: a preservation aspect and an access aspect. Preservation mainly entails taking what you see on the live web at this time, the now, and being able to experience it in the future; to do so, you basically have to capture it and ensure that you can access it subsequently. The access component of web archiving is a completely different animal, where you need to be able to associate what you captured with what it was, and though we have the syntax and semantics of Memento, the dynamics of doing so are still pretty complicated.

With regard to preservation, you basically have to capture what you see and make it re-experienceable. You see an HTML page at a URI, and we need context of what it was: in that HTML page there are many things like embedded images and JavaScript files that incorporate behavior into the web page, and within those files there may be recursively embedded resources. For example, if a JavaScript file includes other embedded images or another JavaScript file, then you have to trace it all down until you get everything, so that you can assure yourself that in the future you'll have everything within the web page and be able to experience it exactly as it was.

All these different resources on the web are preserved into a format called WARC. It's an ISO standard that basically allows you to take a trace of everything that you experienced and store it in one file. On the right side we have some metadata about what this actually is: this record represents a certain URI on the live web, and it also records the HTTP headers that aren't usually exposed to you when you're viewing the web with a browser. That makes it so that, when re-experiencing the same page, the browsers of today can experience it as it was. For example, we know that we got an HTTP 200 OK or a redirect, what the content was, what the content type was, the language, that sort of thing; by recording that information, we can experience it exactly as it was. And of course, at the bottom we have the payload, in this case the HTML of a web page, which has references to other resources, like the stylesheet (the CSS file there) or another image, that you can then trace down for preservation. It's important to recognize that this WARC format is essentially a concatenated record of records of all the different resources that are required to re-experience a webpage in whole.
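To make the record layout above concrete, here is a minimal sketch in Python of a WARC-style response record. The target URI, date, and payload are illustrative, and required fields such as WARC-Record-ID are omitted for brevity; this is a simplified model of the format, not a full implementation.

```python
# A simplified sketch of one WARC "response" record: WARC headers, a blank
# line, the HTTP payload, then the record terminator. Values are made up.

def build_warc_response(target_uri: str, http_payload: bytes,
                        capture_date: str = "2020-02-27T17:00:00Z") -> bytes:
    """Concatenate WARC headers, the payload, and the trailing CRLFs."""
    headers = [
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {capture_date}",
        f"Content-Length: {len(http_payload)}",
        "Content-Type: application/http; msgtype=response",
    ]
    return ("\r\n".join(headers) + "\r\n\r\n").encode() + http_payload + b"\r\n\r\n"

# The payload is the full HTTP response, status line and headers included,
# so a replay tool can reproduce the 200 OK and Content-Type exactly.
payload = (b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
           b"<html><img src='logo.png'></html>")
record = build_warc_response("http://example.com/", payload)
```

Recording the live HTTP exchange verbatim, rather than just the HTML, is what lets future replay reproduce status codes and content types as they were.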

The other component is access. After you have this kind of capture, you need to replay it, to be able to rebuild the web page. Your browser has to be able to say, "I got this HTML, and I need to look back into the archive, instead of going to the live web, to see the images or the JavaScript embedded in the page." We don't want web pages of the past to reach into the live web to get the resources to rebuild the experience; otherwise it's not an accurate record of what the past was but rather a sort of hybrid, and that's not what we want for the sake of preservation.

accessing these are usually accomplished through web browsers so for instance we

have two different web archives here the internet cards wayback machine which

many are familiar with and archive dot today which is represented at many

different domains so here it's archived at pH but it's also archived that is in

many other different hosts that allow it to be low more resilient in time as

their as their domains expire or are seized and so it's important to realize

also hear that you are eyes are opaque so if you look at the top example of an

archive you may try to infer that this capture was done on march 19th of 2013

but from the second example here with archived at pH you'll see that those

sort of semantics can't be inferred and they rightfully can't be inferred

because your eyes are opaque you aren't supposed to try to figure out what

something represents based on the URI alone so if it's something jpg it may

not necessarily be be a jpg when you dereference it and this is sort of one

of the axioms of of the web is you need to dereference the actual URI itself to

see what sort of content it is there so momento mitigates this by introducing

some headers which we won't necessarily go into the

here but it makes it where you can express this in a very semantic way you

can say this capture at this URI represents this live web URI at this

specific time so kind of backtracking unis the topic of this talk is to save
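As a sketch of what those headers express, here is a small example; the header values below are illustrative rather than taken from a real archive's response.

```python
import re

# A sketch of the kind of headers Memento (RFC 7089) adds to a capture's
# HTTP response. The datetime and original URI below are illustrative.
memento_response_headers = {
    "Memento-Datetime": "Tue, 19 Mar 2013 01:28:28 GMT",
    "Link": '<http://cnn.com/>; rel="original"',
}

def describe(headers: dict) -> str:
    """Express 'this capture represents this live URI at this time'."""
    m = re.search(r'<([^>]+)>;\s*rel="original"', headers["Link"])
    uri_r = m.group(1) if m else "unknown"
    return f'capture of {uri_r} at {headers["Memento-Datetime"]}'
```

With these two headers, the semantics that the URI alone cannot carry (what is this a capture of, and when) become explicit and machine-readable.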

So, kind of backtracking, the topic of this talk is "Save My Web Things for Me and Others but Not All." As you see in the examples at the top, though web archives will preserve things at bankofamerica.com, what you would get are simply the login pages; you wouldn't get the content that you really care about. So you've got to ask yourself: is it really preserved to the extent you want it to be? We want to capture this content behind authentication; we want to be able to re-experience the part of the web that we saw on the live web through representations like those on the bottom: our Facebook feed, our bank accounts, our actual photos, and not login pages. Some of our initial work strove to overcome this issue, and part of this comes down to preservation by reference versus by value. When you're preserving content on the live web by reference, you say "preserve it from this place," but you end up getting login pages, so preservation by reference is insufficient to get the representation that we would expect from preservation tools. So we have this concept of preservation by value: preserving what we see in the web browser itself, so we can capture that and re-experience it, because that's the part of the web that we actually see. Part of this work was funded by the NEH "Archive What I See Now" project, and, as you'll see at the bottom, what you would experience would be more of what's on the right, what you see within your own captures, rather than expecting to get the right and ending up getting the left.

One of the downsides of this is that if you're capturing content behind authentication, the same ability to authenticate into sites no longer exists, so you can't go to a capture you saw in the past and expect it to hit the Facebook authentication mechanism to replay all that has gone on; you're preserving the representation as it was, and that's it. And there's a bit of an issue with that, because if you do so and you want to share those captures, and they're available at a publicly available URI despite your preserving them from behind authentication, then you might have information leakage of your personal or sensitive information: things you want to preserve for the future but may not want others to experience in the past. That's the concept of "save my things for me and others but not all": we wouldn't want everyone to be able to see it; we want to be able to save it and share it with whom we like, but not necessarily with everyone.

We'll dig into that a little more in this conversation. For example, if Alice says, "check out my capture at my archive," Carol here may say, "hey, that contains some information that, first, you may not own": information about Carol's personal life that she doesn't want shared, but also sensitive information about Alice that Alice may not want the public to know about but still wants to see for herself.

You also have this concept in Memento of a TimeMap. A TimeMap is a way to associate how pages of the past existed and how they do and will in the future. We have different URIs in the past: for example, CNN may have existed at the HTTPS scheme, or with www, or with ports, or at index.php, or any of these different variants. We have this concept of canonicalization that allows these different URI variants to be coalesced, so when we query archives, we are able to get all these references, coalesce them together, and get one picture of the past despite the different variations of the URI.
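A minimal sketch of this canonicalization idea follows. Real archives use richer schemes such as SURT; the rules below (strip www, ports, default index pages) are simplified assumptions for illustration.

```python
from urllib.parse import urlsplit

# A simplified canonicalizer: many surface variants of the same page
# collapse to one lookup key. The specific rules here are illustrative.
def canonicalize(uri: str) -> str:
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    host = parts.hostname or ""           # .hostname drops the port
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    if path.endswith("/index.php"):       # treat index.php as the root
        path = path[: -len("index.php")]
    return host + path

variants = ["https://cnn.com/", "http://www.cnn.com:80/",
            "cnn.com/index.php", "http://cnn.com"]
# All four variants coalesce to the same key, "cnn.com/".
```

Coalescing on a canonical key is what lets an archive return one combined timeline for a page regardless of which variant was crawled.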

This is important to note when you're combining the past and the present: if you go into the past and see login pages, and then in the present you see your own content, there's a differential between the past and the present that doesn't really parallel between the two. So it would be useful to know all these different permutations and be able to associate them together, and TimeMaps allow that.

As an example of a Memento TimeMap here, we have the context of the original URI, which Memento calls the URI-R, and then we have multiple different archives that are aggregated together, with their identifiers shown. You see one from the Internet Archive's Wayback Machine, one from the Portuguese web archive, one from the Icelandic archive, one from archive.today, and there could be many others. We get the context of where we need to go to access each capture of a live web URI in the past; what that live web URI is, the URI-R, is expressed at the top, and then there's a temporal indicator, a date, of what each capture actually represents. All of this is provided by the Memento framework, which introduces a concept of time on the web.
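As a sketch of what consuming such a TimeMap might look like, here is a small parser for the application/link-format serialization; the sample TimeMap text is fabricated for illustration.

```python
import re

# A fabricated TimeMap in application/link-format: the URI-R plus two
# mementos (URI-Ms) with their datetimes. Real TimeMaps carry more fields.
TIMEMAP = """\
<http://cnn.com/>; rel="original",
<https://web.archive.org/web/20130319/http://cnn.com/>; rel="memento"; datetime="Tue, 19 Mar 2013 00:00:00 GMT",
<https://archive.ph/abc12>; rel="memento"; datetime="Mon, 03 Feb 2020 12:00:00 GMT"
"""

def parse_timemap(text):
    """Return (URI-R, list of (URI-M, datetime string) mementos)."""
    original, mementos = None, []
    for m in re.finditer(
            r'<([^>]+)>;\s*rel="([^"]+)"(?:;\s*datetime="([^"]+)")?', text):
        uri, rel, dt = m.groups()
        if rel == "original":
            original = uri
        elif rel == "memento":
            mementos.append((uri, dt))
    return original, mementos

original, mementos = parse_timemap(TIMEMAP)
```

The `rel` values are what tie an opaque URI-M back to the URI-R and a datetime, exactly the association the talk describes.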

This concept of Memento aggregation is essentially a software tool that allows you to query an endpoint on the web and say, "give me everything you have for this URI," and it will do the task of querying all these different web archives, aggregating the results together, temporally sorting them, and providing them back to the user. This concept of Memento aggregation is important because, whether a page changes very rapidly or very slowly over time, the accuracy of your picture of the past of that URI will be reflected by the granularity of how many captures there were. The more archives you have, the less likely it is that you'll have temporal holes, and the more likely you'll get the whole picture of a site in the past.
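The aggregation step itself can be sketched as a merge-and-sort over per-archive memento lists; the archive names and datetimes below are made up for illustration.

```python
from datetime import datetime

# A sketch of the core of an aggregator: merge mementos from several
# archives and sort them temporally into one timeline.
def aggregate(*archives):
    """Each archive is a list of (URI-M, ISO datetime string) tuples."""
    merged = [m for archive in archives for m in archive]
    return sorted(merged, key=lambda m: datetime.fromisoformat(m[1]))

ia = [("https://web.archive.org/web/2013/http://cnn.com/", "2013-03-19T00:00:00")]
pt = [("https://arquivo.pt/wayback/2012/http://cnn.com/", "2012-06-01T12:00:00")]
timeline = aggregate(ia, pt)
# The Portuguese capture sorts before the Internet Archive one.
```

A real aggregator also has to query each archive over HTTP and canonicalize the URI-R first; this sketch shows only the merge that fills temporal holes.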

So you've got to ask: who runs these endpoints, and who controls the narrative of the story of what the page was in the past? Normally, up until a few years ago, the Memento aggregator was hosted at Los Alamos National Laboratory. You would send a query to this address, as you see at the top here, and you would get back results from the set of archives that they aggregate together. This set of archives isn't changeable: what you get, you get, and you're glad that you've got it. But if you want to include an additional archive in that set, or a new one comes about, or one has been decommissioned, you don't have any control, from the perspective of a user, over what archives are used as the basis for that story of the past. In order for new archives to be included in this aggregator, they first have to be made Memento-compliant, which requires someone to reach out and say, "hey, make your archive Memento-compliant." All the archives that you've seen so far are compliant, and they have come about because of the cooperation of the Memento team; but someone also has to manually add each archive into the configuration of the aggregator, which is hosted on a server somewhere. So this makes it very hard to include new sources into the story, or to decommission old sources, from the perspective of a user: you're basically at the mercy of someone that's running a web service. And if you have captures of the past of your own Facebook content, or simply public captures of the live web, your captures will never be included in that story, so things that are very niche to your interests may not be represented in web archives, because they were confined to your machine and never got into the story of the past.

Breaking apart some of these TimeMaps: we have the original URI, which we saw before, as well as references to other TimeMaps, which are simply listings of captures. We have the context of the URI-M, the brown part here, that says this is where you go to get the capture, and then the temporal datestamp, as well as a way to do temporal negotiation, which we won't discuss in detail but Memento allows you to do. If you want a capture from, say, 2014, and the archives only have captures from 2013, it will do its best, through temporal negotiation, to resolve to the capture that is closest to your requested time. That's beyond the scope of this talk, and you can explore more through the Memento standard.
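Temporal negotiation can be sketched as choosing the capture nearest the requested datetime; the URI-Ms and datetimes below are illustrative stand-ins, not real archive holdings.

```python
from datetime import datetime

# A sketch of a TimeGate's core decision: given a requested datetime,
# resolve to the closest available capture.
captures = {
    "https://web.archive.org/web/2013/http://cnn.com/": datetime(2013, 6, 1),
    "https://web.archive.org/web/2015/http://cnn.com/": datetime(2015, 1, 15),
}

def negotiate(accept_datetime, captures):
    """Return the URI-M whose capture datetime is nearest the request."""
    return min(captures, key=lambda uri_m: abs(captures[uri_m] - accept_datetime))

best = negotiate(datetime(2014, 1, 1), captures)
# A request for early 2014 resolves to the mid-2013 capture, the nearest one.
```

In the real protocol the request carries an Accept-Datetime header and the TimeGate answers with a redirect to the chosen URI-M; this sketch shows only the selection logic.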

This issue of having a remote aggregator that the user can't control is something that wasn't necessarily a good thing from the open-source perspective. We want to be able to include our own captures, tell our own narrative, and include the websites of the past that we care about, or those we have captured behind authentication. Some of the work of Alam and Nelson at ODU (Old Dominion University) produced their own aggregator, MemGator, that allows individuals to deploy their own service. It is self-contained, doesn't require any configuration on your machine, is compiled for many different platforms, and is open source, so you can inspect the logic. The power here is that, as someone running an aggregator, you choose which sources are aggregated into the picture, so if you want to include or decommission some archive in the set of archives used for the storytelling, you can do so fairly easily. You're still setting up a web service, but it is one that you control; it could be on your own machine, and you can query it all the same. And if you don't want to specify a custom set, but the aggregator at Los Alamos happens to go down and you still want this functionality, you have the ability through a service running on your own machine.

Through further work, we sought to give a little more power to the clients in this regard for Memento aggregation. In some of my previous work, a framework for aggregating private and public web archives, we extended the open-source aggregator in a manner that allows you to more systematically aggregate captures from private web archives, and allowed not just the person running the web archive but the clients themselves, those querying the aggregator, to set which set of archives is aggregated. That's really powerful, because if you have a specific set of archives and you learn about a new one, you can form an aggregator that has this other set, share that set, and then collectively aggregate the different sources for the story of the past you want to show. We also introduced a way to regulate authentication and access through a separate entity that we call the private web archive adapter, which we'll talk about a little more, and then a way to do archival negotiation in dimensions beyond time, which is a really powerful component that's still kind of a loose end in this research that we're going to explore some more.

I'm going to dig into these a little bit and talk about how they're relevant to the point of this presentation. First, the Memento meta-aggregator was an extension of MemGator (there's a citation there that I should have added) that allows you to deploy aggregators on your own machine and empowers a client to specify not just their own set of archives but also, through our extensions in my previous work, things like precedence. For example, if you want to look to certain sets of archives, your own archives, or really any sort of parameter for a query, you can do so through the syntax introduced in the framework. As on the previous slide, you also have the ability, through the extension of a Memento meta-aggregator, to specify which set of archives is used for the aggregation process, and many other things that are covered in that paper, which we don't have time to go into here; if you're interested, these slides will be posted and this recording made available for you to reference and look into further.

So, a practical example, assuming we have all these tools. This is Alice here, and she has been preserving things that she sees, because she sees a story evolving and thinks it's important, or whatever site she feels may be rapidly changing and is important to preserve; and then she has her own captures that she doesn't necessarily want to share with everyone, but she wants to ensure that she has them in the future. So Alice says, "I see something on CNN, or I see something on my local newspaper, that I don't think the archives are actually getting, so I want to see my capture temporally in line with the captures from the Internet Archive," and only the Internet Archive, not necessarily the many other archives out there, because she may not necessarily trust the others. She wants her captures and the Internet Archive's captures, and that's it. The idea is to have all her captures, including her one unique capture, in line, to be able to see the part of the story that's evolved. She's able, through this framework, to spin up her own meta-aggregator and see that sort of thing: she can configure this aggregator to say "look at my captures and also the Internet Archive's captures," and the aggregator itself will do the legwork of querying the different archives, here just those two, then temporally sorting the results and making them viewable and shareable to anyone that wants to see the combination of Alice's and the Internet Archive's views of CNN in the past.

Part of this power is also that she can say to Carol, "Carol, do you want to see what I see of CNN plus what the Internet Archive sees? Because I think I have a unique capture of the past." Carol can query it all the same, because it's essentially a web service, and she gets back the same thing that Alice would get. But then Carol has her own idea: she wants to supplement those captures from Alice's archives and include her own in the temporal picture as well, because she feels that she's getting a different perspective. If you've ever visited a website from different locations, you may see that you get a different representation based on where your location is, or, for example, if your country restricts what you see; if you want to record what that actually is, that is an important perspective on the web as it was, as a true representation. So here Carol wants to say, "I see a representation of the same site that you're aggregating together with your two captures." She can very easily spin up her own meta-aggregator and say, "I want to include my capture and everything that comes from Alice's meta-aggregator." She may not even know it comes from Alice's meta-aggregator, but she can say, "I want to take everything that comes from there and supplement it with what I have in my own archive." And when she queries it for CNN, she gets her own captures, the separate individual captures that are unique to her own archive, in her aggregation.

But this isn't necessarily applicable to sites where Alice, for example, wants to protect her captures. If Malcolm here, as we're calling the guy on the right, says "get me all the Bank of America captures from just Alice's archive," since we're enabling that pattern, then Alice may be a little worried that she's exposed. So the problem isn't solved by simply allowing individuals to specify what the archival sources are.

From this, we introduced what we call the private web archive adapter (PWAA), in which we used conventional web standards like OAuth2 and the concept of having an entity separate from the archive itself that allows for a standard authentication mechanism and a tokenization mechanism, which we'll see here. That not only decouples the archive from the role of managing authentication for private web archives, in this case your own machine, but allows the archives to be functionally cohesive, allows the authentication mechanism to be functionally cohesive, and makes it so these different tools and concepts we're developing are interoperable with other standards and applicable elsewhere. In the same regard, if evil Bob here or Malcolm were to query Alice's meta-aggregator, instead of it simply giving access because they know the URI of the capture itself, the archive will say, "you need to first authenticate through this other mechanism before we're going to give you access." So where you would normally see what differentiates whether you're logged in or not, there is a similar procedure that uses OAuth2 on the back end: if you go there and you don't have login credentials, it says "provide us login credentials," and otherwise you'll get a different representation. It's the same sort of live web mechanism, but applied to the archived web.

To explain a little more, this is the concept of standard authentication mechanisms: you have tokenization, where after you log in you get a key, sort of a hash as you see right here, that you use for subsequent logins, so you don't have to provide those credentials each and every time. Through reusing this key, you defer to the authentication mechanism to check whether the key is still valid, without having to supply your credentials to many different services. If you've ever used your identity on a blog to comment, you would log in through Facebook, and Facebook would return a key back to the blog to say "use this to associate the identity with the user that's posting," and every time you post on the blog it would use that same key. But then, if your credentials got hacked, that key could be revoked, and you wouldn't have to reset your credentials; or, for example, if you know someone else is using your key, you can revoke it, and then another key would have to be registered, and you would have to provide the original credentials to obtain another key. Applying these concepts from live web authentication is useful and standardized, and makes this framework a little more interoperable with existing tools.
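The token lifecycle just described (issue on login, reuse on later requests, revoke if compromised) can be sketched as follows; the credential check and in-memory token store are illustrative stand-ins, not how a real OAuth2 server is implemented.

```python
import secrets

# A sketch of OAuth2-style tokenization. The password and dict-based
# store below are stand-ins for a real credential check and token store.
_tokens = {}  # token -> username

def issue_token(username: str, password: str) -> str:
    """Exchange credentials once for a reusable key (a random hash-like string)."""
    if password != "correct horse":      # stand-in credential check
        raise PermissionError("bad credentials")
    token = secrets.token_hex(16)
    _tokens[token] = username
    return token

def validate(token: str) -> bool:
    """Later requests present only the key, never the credentials."""
    return token in _tokens

def revoke(token: str) -> None:
    """If the key leaks, disavow it; the underlying credentials stay unchanged."""
    _tokens.pop(token, None)
```

The point of the indirection is exactly what the talk describes: a leaked key can be revoked and re-issued without ever resetting the original credentials.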

In the same regard, once you've obtained that key through the standard OAuth2 process, we can also share these tokens. For example, if Alice logs in and has her key, she can share that key with her friend Carol and say, "use this for access in the future." And if Bob wants to log in but should only have a certain level of access, OAuth2 has this concept of scoping, where Alice can specify: if you are logging in under this different role, then you will get a different sort of key that signifies that you have access to different things. For example, if Alice wanted to enable access to all her CNN captures but not to her Bank of America captures, she would be able to do so through the standard mechanism. And then, if someone like Malcolm here on the right just provides a random key, he won't get access, because he hasn't gone through the process. So this concept of OAuth2 and tokenization allows keys to be shared between Alice and Carol, as we see; keys to be disavowed if credentials are compromised; and scope to be limited, as we see here, if you want to regulate what is enabled. All these keys would then be subsequently passed, and access enabled or not through the subsequent validation of the keys that are provided.
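The scoping idea can be sketched as a lookup from key to permitted scopes; the key names and scope strings below are hypothetical, and a real deployment would follow the OAuth2 scope parameter rather than this simplified table.

```python
# A sketch of scope-limited keys: Alice's own key carries every scope,
# while the key she shared with Carol omits the bank captures.
token_scopes = {
    "alices-key": {"cnn-captures", "bank-captures"},
    "carols-key": {"cnn-captures"},   # shared key: CNN only
}

def allowed(token: str, required_scope: str) -> bool:
    """Access is enabled only if the presented key carries the scope."""
    return required_scope in token_scopes.get(token, set())
```

An unknown key (like Malcolm's random guess) maps to the empty scope set, so every check fails without any special casing.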

extension this work is to be able to if we're going to aggregate private public web

archives we want to be able to do negotiation in dimensions beyond time so

While Memento introduces the concept of time, where we can associate what a capture from the past is and what time in the past it represents (another citation I should have included there), we want to be able to negotiate in dimensions like whether an archive is public or private. If Alice has two different archives on her own machine, we want to be able to disambiguate which captures are allowed and which aren't, as well as negotiate other dimensions, like capture quality, which varies across many different archives. If you've ever used web archives before, depending on who is doing the web archiving, which institution, the quality is vastly different; we've done numerous studies on this showing that you may only want to look to certain archives. But if you're doing it from the aggregation perspective, you have to manually go through and check what the source is to clear out the ones you may not initially trust.
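A source-level paring step of the kind just described might look like the following; the archive hostnames and URI-Ms are illustrative, not real aggregator data.

```python
# Sketch: pare aggregated mementos down to sources you trust.
# Hostnames and URI-Ms here are illustrative, not real archive data.
from urllib.parse import urlparse

TRUSTED = {"archive.example.org", "webarchive.example.net"}

def from_trusted_sources(urims):
    """Keep only mementos served by archives on the trust list."""
    return [u for u in urims if urlparse(u).hostname in TRUSTED]

urims = [
    "https://archive.example.org/web/20190101/https://example.com/",
    "https://sketchy.example.io/web/20190202/https://example.com/",
    "https://webarchive.example.net/web/20190303/https://example.com/",
]
assert len(from_trusted_sources(urims)) == 2  # the sketchy source is dropped
```

This is exactly the manual clearing-out the talk describes; the systematic negotiation discussed next moves this kind of filtering into the protocol itself.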

Having a more systematic way to do that is powerful if you are going to negotiate on the dimension of quality or anything else that requires you to actually look at the capture itself to evaluate it. And, as we saw in one of our recent studies, which unfortunately is not cited here, many of the archival captures of some sites, especially bigger sites, are redirects. Through a very lengthy tech report we found that about eighty percent of the captures of the past are actually redirects. So if we're trying to evaluate how well archived a URI is in the past simply through the quantity of entries in these TimeMaps, we can't really evaluate it; we have to actually look at the capture itself to see whether it's a redirect. Hence the value of surfacing traits that are inherent in the capture itself without having to do analysis on it: by just dereferencing the URI-M for the past, we can say, oh, this is a redirect, so it's likely not useful to us, and from there we can pare things down to what we want. Having these different dimensions that we can do negotiation on beyond simply time is very powerful. We started investigating this through my dissertation work, and there's still a lot of open ends here, so we

introduced a third concept here: StarGates. If you think of a gate as the means to negotiate, Memento introduces the TimeGate to negotiate in time; a StarGate, the star being an asterisk, a wildcard, enables negotiation in dimensions beyond time. One of the powerful aspects of this is that it allows you to filter the content that's returned to you on the

server side. If you are able to express that you only want the captures that meet certain criteria, like a certain quality or a certain language, or that aren't redirects, you can do so through this additional Mementity, or what we call Mementity-ness, which is explained a little more in that paper. And if you want to specify that only certain sources be used, so even though our Memento meta-aggregator is defined to use a certain set of sources, you only want a subset of those sources, and you want to include your own private captures, as we saw before, a StarGate kind of allows you to do this sort of thing. One of the powerful things here is also having a standard, through an RFC, specifying a standard way to express this sort of thing using the Prefer HTTP header, which we won't go into here since it's a little technical. The aggregator itself need not comply with these

requests: you're specifying a preference, and whether the web service complies with it is up to how the service implements it. But it allows the client itself, and not the person running the web service, to express preferences. Normally you can spin up your own instance of a meta-aggregator, but if you were to share it with people without them spinning up their own, they are still at the mercy of someone else deciding which sources are used and what the implementation is. So enabling client-side preferences is really powerful here, because for the most part you're just glad you got what you got, and you don't have much control over specifying anything beyond that. And so we introduced these concepts of doing archival negotiation beyond time through the StarGate concept, detailed a little more in the papers. As a quick

example here, say Bob wants to say: give me only captures from the past, among whatever you have, that are private, so we say "private only" here, and there are standard ways to express this through the syntax of the HTTP header itself. Bob specifies this, and the aggregator, in the second step, says: here are all the archives I have; first filter these and tell me which ones are private. Once it does that, the StarGate can say: okay, these two are probably it, and only then will it query those two archives. For instance, if you are searching for something that you don't want the archives to initially know you're searching for in the past, this is powerful. You can say: I only want to search within the domestic archives that we know of. Or, if you know that someone is manipulating the story in some way, then you only want to specify the sources you consider reliable, so you don't get misinformation. You may also not want to expose that you're looking at these archives, because you don't know that they are actually relevant to the story. And that brings us to the second part here, which we talked about before.
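A client-side preference like Bob's "private only" request could be carried in an HTTP Prefer header (RFC 7240). The preference token below is hypothetical; the actual StarGate syntax is defined in the papers, and, as noted, a service is free to ignore the preference.

```python
# Sketch: serializing and parsing client preferences as a Prefer
# header value (RFC 7240 token=value syntax). The preference name
# "accessibility" is hypothetical, not the published StarGate syntax.

def build_prefer(prefs):
    """Serialize a dict of preferences into a Prefer header value."""
    return ", ".join(f"{k}={v}" for k, v in prefs.items())

def parse_prefer(value):
    """Parse a Prefer header value back into a dict (server side)."""
    out = {}
    for part in value.split(","):
        k, _, v = part.strip().partition("=")
        out[k] = v
    return out

header = build_prefer({"accessibility": "private"})
assert header == "accessibility=private"
assert parse_prefer("accessibility=private, quality=high") == {
    "accessibility": "private", "quality": "high"}
```

Per RFC 7240, a complying server can echo what it honored in a Preference-Applied response header, which is how the client learns whether the filter was actually applied.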

If you have a lot of redundant captures, as with the example here on the right of apple.com in the past: if you look at all the captures of apple.com over time, you see a lot of redundancy, and when a new product is introduced, because the pages aren't exactly identical, the web archives treat these as unique captures. So if you were to generate screenshots of all of these from the past, you would see a very large level of redundancy. Being able to pare this down to get a good summary of how the page existed in the past is very useful; it lets you evaluate beyond simply the number of captures you have, which we saw before is fallacious. One way to do this, which is a completely

different piece of work but relevant here, is that you're able to look at the HTML and generate what we call a SimHash. A SimHash is sort of a signature that, unlike conventional hashing mechanisms, varies only slightly when the inputs are similar. Normal hashing mechanisms, when inputs are even slightly dissimilar, produce hashes that vary greatly and are very distinct from one another; with SimHash, when things vary slightly, the hash itself is only slightly dissimilar. As we see between the second and third examples here, there is only what is called a Hamming distance of one, and because the change is so subtle, it wouldn't be significant enough to call this a unique capture. But when we see the one on the bottom, we know that something drastic changed on the page, and we should probably capture that.
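The SimHash behavior just described can be sketched as follows; hashing whitespace-separated words with MD5 is a simplification of real SimHash feature extraction, used here only to show the bitwise-vote construction and Hamming-distance comparison.

```python
# Sketch of SimHash: similar inputs yield hashes that differ in only
# a few bit positions, unlike conventional hashes. Word tokens hashed
# with MD5 are a simplification of real feature extraction.
import hashlib

def simhash(text, bits=64):
    """Sum per-bit votes (+1/-1) across token hashes; positive bits set."""
    v = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bit positions between two hashes."""
    return bin(a ^ b).count("1")

a = simhash("apple introduces the new iphone today")
b = simhash("apple introduces the new iphone today!")  # near-duplicate
c = simhash("completely different breaking news page")
assert hamming(a, a) == 0
assert hamming(a, b) < hamming(a, c)  # near-duplicates stay close
```

Because only one token differs between the first two inputs, most per-bit vote totals are unchanged, so their hashes land close together, while the unrelated page lands far away.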

That threshold was studied in 2014, I believe, in work that defined how different a page actually has to be before it's significant enough to be included in a summary of the past. The threshold here, as we saw, is what we call a Hamming distance of four: if a capture meets this threshold, it's included in the summarization. This work has been published and is available. We compute this similarity through analysis of the HTML page itself and not any other subsequent resources, so it's not hugely expensive to first define what we want included in the summary and only then go through the expensive procedure of generating screenshots if we want a visual summary.
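One way such a Hamming-distance-of-four threshold might drive summary selection is sketched below, assuming SimHash values have already been computed for each capture; the hash values and the keep-if-far-from-last-kept policy are illustrative.

```python
# Sketch: select "unique enough" captures for a summary by keeping a
# capture only when its SimHash is at least THRESHOLD bits (Hamming
# distance 4, per the study mentioned) from the last kept capture.
# The hash values below are illustrative, not real page hashes.

THRESHOLD = 4

def hamming(a, b):
    """Number of differing bit positions between two hashes."""
    return bin(a ^ b).count("1")

def summarize(hashes):
    """Return indices of captures to include in the summary."""
    kept = []
    for i, h in enumerate(hashes):
        if not kept or hamming(h, hashes[kept[-1]]) >= THRESHOLD:
            kept.append(i)
    return kept

# Capture 1 barely differs from 0; capture 2 changes drastically.
hashes = [0b10110010, 0b10110011, 0b01001101, 0b01001100]
assert summarize(hashes) == [0, 2]
```

Only the cheap HTML-derived hashes are examined here; screenshots would then be generated only for the kept indices, matching the two-phase cost argument above.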

Doing this negotiation in the dimension of similarity, we can say: okay, we want only the captures that are under a certain damage threshold, so here we have the damage measure, MD, and that are unique captures. We want to be able to specify this, but unlike before there isn't a simple filtering procedure; we have to look at the captures themselves before we can do it. So first we query all these archives, we get all their HTML, and we supply the different hashes we've generated from them to a StarGate. The StarGate can then say: okay, these are the ones that meet the threshold; go ahead and use these as the basis for what's included in the aggregation, meeting the criteria the user actually specified of being under the damage threshold and being unique. So we have a lot of power here, with both content-based attributes, things that require you to look at the capture itself and do some analysis on it, which may be a long-running process, and derived attributes. That is, attributes that require processing versus attributes inherent to the content, for example whether a capture has an HTTP 300-class or 200 status, handy if you want to exclude

the redirects. So, in summation, this talk was about saving your web things for you, preserving content behind authentication, which we covered, and for others through collaboration. We're able to do this by enabling individuals to deploy their own aggregators, doing so through a query mechanism that lets the client query those aggregators, and then regulating access through access control; we covered a couple of other relevant things as well, and there's still a lot of open-ended work here. Some of the open research efforts we didn't discuss

include doing the querying very efficiently: it is temporally, computationally, and spatially expensive to do this analysis across many different archives, which are for the most part monotonically increasing in size, and it will get even more expensive in the future, so being able to do so efficiently is definitely a consideration of ours.

Another is making the software easy to use, self-hosted, and native: those who are preserving the web aren't necessarily tech-savvy people, so we want to make sure they can use these tools without having to go to the command line to interact with them. Also, if you're

saving all these things on your own machine, we want to enable you to distribute them in a very safe manner to encourage the resilience of the archive: if these things are actually important and you're saving them, you don't want your hard drive dying to render all your efforts moot. We did similar previous work with InterPlanetary Wayback, an open-source package that we developed during my dissertation, to encourage this sort of thing. And then, in our initial work we simply went through OAuth2; there are likely other, more systematic mechanisms to regulate access to web archives, so there's more research to be done there to make it a little more systematic and easier for individuals to understand. And so that is

the end of this talk, and I'll take any questions; you can type questions if you want, or you can say them here. Okay.

So, yeah, we did have one question, which you said was answered here, but I can go into a little more detail: what's the granularity of access control? Can a contributor specify access by capture, or by site, or by set of captures? Our initial work of simply introducing access control to private web archives was novel in and of itself, and there's still a lot of open-endedness with this. The way we did it, the level of granularity is set up by you: what determines whether an archive is private or public is how you specify it. When you're setting up these captures, you have multiple archives, or multiple collections, or multiple other qualitative attributes you want associated with them. Say you have a collection that's only about a certain topic; you can include that as a basis for what is of that topic or not. Here we specified the two classes of public and private, but the label that you associate with those two classes is really up to you in this initial work.


Um, this is Justin. You touched a little bit on the challenges of archiving things behind login walls; can you talk a little bit about that and how you address it with some of your tools? Sure, Justin, okay. This actually bridges off of some work by Justin, a previous collaborator of mine, that was separate from mine; we talked a little bit about the considerations of content that is preserved behind authentication, which in his instance was on a corporate intranet where the information couldn't be exposed, and he basically had to wipe out all the captures because they contained sensitive information. Some of the tools I developed use the representation itself rather than the URI. As we saw earlier in this presentation, conventional web archives preserve content by URI; instead, these tools, for example WARCreate, a browser-based tool we developed, take the representation that you're seeing in your browser and preserve it to the standard WARC format to make it

interoperable. As for the other tools: if you have these captures, you have the actual representation of what you saw. The Archiving What I See Now project was about taking what you see in your browser and saving that for the future, rather than saying: go to this location, see what's there, and hopefully it is what I see now. So hopefully that answers the question, Justin. That works for me, thanks.
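To illustrate the representation-based approach from that answer, here is a minimal, hypothetical sketch of wrapping a captured payload (e.g., the DOM a browser is displaying) in a WARC "resource" record; this is not WARCreate's actual implementation, and the URI and payload are illustrative.

```python
# Sketch: serialize a captured representation into a minimal WARC
# resource record (the interoperable format WARCreate targets).
# Illustrative only; real tools handle many more record fields.
import uuid
from datetime import datetime, timezone

def warc_resource_record(uri, payload: bytes, content_type="text/html"):
    """Build one WARC/1.0 resource record: headers, blank line, payload."""
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"WARC-Date: {date}\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(payload)}\r\n\r\n"
    )
    # Records are separated by two CRLFs after the content block.
    return headers.encode() + payload + b"\r\n\r\n"

record = warc_resource_record("https://example.com/", b"<html>what I saw</html>")
assert record.startswith(b"WARC/1.0")
```

A "resource" record stores the representation directly, which is what lets these tools save what you actually saw rather than re-requesting the URI.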

Any other questions?

Okay, well, if you have any questions in the future, this is being recorded, and my contact information and website are here for more information. If anything here is of interest to you, there are a lot of open ends on this, and I'm open to discussing any of them in the future, because it is research I'm currently pursuing. So thank you for your time and for joining us here for this talk, and I hope to hear from you in the future.
