September 3, 2019

IARPA BETTER Proposers’ Day

cool well welcome everyone today’s the
proposers’ day for BETTER Better Extraction from Text Towards Enhanced
retrieval thanks everyone for coming in flying from around the country or you
know commuting on the Metro much appreciated so the point of today is to
kind of give you an overview of the BETTER program what the goals on the IARPA end
are and then hear some questions and feedback from the audience and then in
the afternoon provide some opportunities for you to talk to each other so a
couple initial notes all these slides will be available on the web
after today so don’t feel the need to you know take pictures or furiously
scribble notes this will all be available
we’re also filming so our friendly cameraman in the back that’s moderately
terrifying for me but just stick with me and we’ll get through it together so the
video the presentation will be available as well so hopefully you know we can get
as much information to you as possible and you can feel fully informed when you
kind of write up your proposals if you decide to write a proposal so just a
couple disclaimers to cover me for today this is just solely for information
purposes this is not a formal solicitation right now if and when
the BAA comes out that is a formal solicitation nothing that I say today or
anyone else says constitutes a solicitation the BAA is the final word on kind of what the
program is gonna look like and the BAA supersedes anything that I say or
anyone else says about the program so if you know there’s any questions or you
know I say something that contradicts the BAA the BAA is the final word I will
mention that there is a draft of the BAA up right now on FedBizOpps it’s been up
since March 13th I believe so you know if you want to furiously google that
while I’m talking because basically what today is gonna be is laying out what’s
in that draft BAA again it is a draft I think we have the date set as April 6 for questions and comments in response to that draft BAA
we don’t promise that we’ll respond to any or all of those but we do take that
information into account when we’re finalizing the BAA so
again already said all this already running ahead of the slides but the goal today
is to familiarize everyone with the BETTER program and what we’re trying to
get out of the program and how we’re thinking we’re gonna do test
and evaluation this is your chance to kind of directly one-on-one interact
with me there will be a question and answer session later today there will be
microphones and then again foster teaming there’s a lot of people in the
room there might be someone here that has a capability that you don’t have and
you might have a capability that they don’t have so you know that is one of
the large goals of today is to allow you to speak with each other to figure out
kind of the best combination to write the strongest proposal that you can so
again there will be a question and answer session in the packets that you
received there should be note cards so if you could
write your questions on those note cards we will collect them and then I will run
through those note cards and answer questions directly and then there will
be an additional opportunity to step up to the microphones and ask questions in
a Q&A session when you write the note cards you can either give them to the
registration desk out front or Brooke who’s sitting up here up front and she’s
waving at the crowd now and then you know we’ll sort through those and answer
them as best as possible right now and then just as a note once the BAA is
released we can’t respond to questions directly we collect all those and then
answer them in bunches so those will be posted to FedBizOpps so
that everyone can receive the same information at the same time so we’ve
already started that process so if you’ve emailed me and wondered why I
haven’t responded directly to your email that’s why we’re gonna be collecting all
that feedback and responding to it kind of formally via FBO so just I think I
went a little bit quicker but just to give the rundown of what we’re gonna be
doing today you have this agenda in your packets but you know we’re in the
welcome and logistics part of the day
after I shut up Dr. Paul Lehner our chief of T&E is gonna get up here and
give a little bit of an overview about IARPA and what it is we do and what
our goals are as an organization and then kind of the meaty part of the day
after that is where I’m going to stand up here and actually talk about BETTER
and the program goals the program objectives and how we’re gonna test and
evaluate the things that we receive then we have a break and then someone from
our acquisitions team will be up here to talk about how to do business with
IARPA I know there are some startups in the room there are some people that
might not have had a relationship with IARPA previously so hopefully we’ll
outline a little bit for how you can interact with us as an organization from
a business perspective and then we’ll do the Q&A session and then at noon there’s
a lunch the government is not providing you with lunch I have to leave at that
point and since this question has come up I will mention in the afternoon
session it says no government that means no government not just no IARPA so
there will be no government personnel in the room at that point and that really
is the session again for all of you to speak to each other and facilitate
teaming so with that I think I roll over to Paul
and now Paul will talk a little bit about what we do as an organization well welcome everybody
I am Paul Lehner I am the IARPA chief for test and evaluation just a little
bit about how we’re organized so IARPA has a director a deputy director
underneath the deputy director direct reports go to the program
managers like John and as part of the leadership team there are three chief
scientist type roles one is our general chief scientist he shepherds the
research health of the research program itself there’s a chief for technology
transition and I’m a chief for test and evaluation so I shepherd the independent
test and evaluation part of it okay so I am representing IARPA as a
whole and hopefully if we have a little bit of time because I don’t think I’m
going to take a half hour I might be able to answer some of your questions
about IARPA as a whole okay next slide please oh I have a clicker
all right I’m doing a good job of representing IARPA as a whole let’s
see all right let me begin by pointing out that if you’re not familiar with
the intelligence community we’re a large and diverse community there are sixteen
separate organizations across the community that are considered part of
the intelligence community and as you can see ranging from CIA DIA all the
services have their intelligence components to homeland security etc etc
so IARPA’s mission is to do research that benefits the intelligence community
across the community okay so all of these people are our customers IARPA
looks to invest in high risk high payoff research you may note I’m pretty sure
when you look at the details of the BETTER program some of the high risk
part because you will be seeing goals that are set and if somebody in the room
doesn’t say that’s ridiculous it’s not high risk enough ideally if we achieve
the high risks or even if we don’t get to the goals
it’s high payoff research and our job is once again to service those support
those 16 intelligence organizations giving them what we like to say is an
overwhelming intelligence advantage you will notice not only in the BETTER
program but generally that our programs are complex multidisciplinary often you
need to put together a team that has multidisciplinary elements from
multiple organizations corporations universities etc we do emphasize
technical excellence and we emphasize the technical truth a portion of that
technical truth is kind of my job the chief for test and evaluation we spend
a lot of time resources and effort on independent testing independent
evaluation of everything so yeah you do a good job we’re gonna confirm you’re
doing a good job let’s put it that way IARPA’s approach you know like any
research funding organization we want to get the best minds working on our
problems that means that despite the fact that we are part of the
intelligence community we do as much as we possibly can in full and open
competitions and the question will come up can foreign organizations apply
and the answer is generally yes I assume it’s true in BETTER as well okay
generally yes we don’t restrict that unless we have to
and at the same time like a lot of organizations like us DARPA there are
similar ARPA organizations out there we work on a rotating basis we try to bring
in new people with new ideas all the time and that includes new people on the
leadership team I’m gone in January I will have been here for four years let’s
see our programs old eyes I’m afraid our programs every program will have a set of
technical goals we try to make
if they’re not clear then you don’t know what you’re trying to achieve and we
don’t know how to measure it if they are clear then we want to make sure they’re
measurable I will tell you that one of the most common reasons that a program
idea is rejected even if we really want to invest in the topic area is we are
not convinced that the objectives are indeed measurable that we can’t do a
satisfactory test and evaluation if we can’t measure it we don’t wanna
buy it they’re ambitious you can judge for
yourself under the BETTER program as an example
and we’d like them to be at least a bit credible not everybody in this room
should say it’s ridiculous like I said we employ independent rigorous test and
evaluation that comes in many forms sometimes we will do replication testing
of your experiments other times an independent test and
evaluation team will actually run your software against the test problems so
you develop you do the research develop the software you develop the
capabilities you test that on your own in your own environment but
when it comes to the final test and evaluation it’s out of your hands and so
it’s a completely independent team that does it every program has its own unique
approach to dealing with independent test and evaluation we very much endeavor to
get some of those 16 I’m pointing to something that’s not there anymore those
16 IC partners involved in the program to begin with we want to make sure
that there’s somebody interested in taking the results of the program
that’s not a 100% rule because occasionally we invest in something
where the IC people on the operational side look at you like what are you
talking about and you know it takes us a couple years to get some results and then
we say this is what we were talking about and then they often buy it I’m
involved in one of those right now that’s why I can actually tell you
exactly that has happened a typical program runs for three to five
years yeah it’s rare that it’s outside that
that area if we have a research area that’s a longer term research area for
example we do a lot of stuff in quantum computing there’s no way we’re
getting a quantum computer in five years but what we do is we set specific goals
to advance the technology over the next three to five years that program is
finished and then we start the next program as a follow-on but each program
itself is looking at a three to five year technical goal we encourage you to
publish your results as much as possible we think that’s a great thing and as a
matter of fact results and data and even the stuff you don’t publish I’m not sure
exactly how BETTER is going to be structured but we are pushing more and
more to make sure that the data you generate supporting your reports even if
not published are generally available after the end of the program okay and we
do and actually have a pretty good track record of transitioning capabilities
into the intelligence community partners in some cases that transition is actual
software but you know since we’re a research organization often we prove
that it can be done we demonstrate that it can be done I’m a scientist I hate that
word proof we demonstrate and provide evidence that it can be done and you
know once there’s sufficient demonstration our various partners will
work with us to build an operational capability I’ve been involved in cases
where they have in one case we demonstrated provided strong evidence
that a capability can work but we also explained what it would take to
make it work and that involved you know an ongoing commitment of millions of
dollars a year of investment and they had to make the decision as to whether
or not they were going to do it and actually do okay but so just because we
make it work in the lab and we make it work on this program doesn’t necessarily
mean it’s easy to do our job was to explain to them what it would take
to do it okay so we have broadly characterized our research programs
in four core areas analysis I’ll go through each one of these in a bit
anticipatory intelligence which we separate out from analysis I’ll explain
that in a bit collection you know we want to be clever
about getting information in clandestine places and computing
computing infrastructure let’s start with analysis so very much an
intelligence community you know we do research to support analysts and help
them do their job better certainly en vogue today is finding new
better clever ways to process large volumes of data and we’re very much into
that kind of research BETTER is certainly associated with that as well
as the second category which is basically natural language large texts
large linguistic processing capabilities and then we also do research that
directly focuses on improving analytic processes so this isn’t data this is
helping analysts to think through it better sometimes in an individual case
sometimes in a group case sometimes it’s cognitive methods we even for example have
a program where we’re doing transcranial electrical stimulation to see if we can
make them smarter okay while they’re doing their job so you know we’re doing
the whole gamut okay now anticipatory intelligence you
would often consider that to be part of analysis but let me just point out that
if you look at the research in judgment decision-making forecasting that
broad area of psychological research most techniques that work and are
effective for problem solving current situation evaluation current situation
discovery are ineffective for forecasting and most things that work
for forecasting are ineffective for current problem discovery understanding
the simple example is that you might hear the term wisdom of crowds okay if
you’re doing if you’re working on a problem
solving a crossword puzzle okay as a group it’s a good idea as soon as
somebody has an answer that fits into one of the slots
everybody says yeah that looks good and group meetings are great for that the
research in forecasting in terms of how to aggregate expert judgment the
wisdom of the crowd stuff you may have heard of the cardinal rule is don’t let them
meet okay and that’s just an instance of what we find over and over again that
which works for forecasting does not work that well for analysis and that
which works for analysis does not work that well for forecasting so we’ve just
separated those two out because they actually represent very different
research threads okay within the umbrella of anticipatory intelligence we
do a lot of work trying to detect and anticipate the emergence of new
technical capabilities much of that is data-driven
trying to infer out of the ongoing stream of technical reports journal
publications anything we can get a hold of what’s coming or what what seems to
be emerging indications and warning we ran a program for example Open Source
Indicators that tried to forecast political unrest events from ongoing
data Twitter feeds whatever news sources etc this is actually the program I
mentioned before it worked it worked pretty well about 70-80 percent of the
things that analysts would like to have had forecasted yeah they were
alerted to and 70-80% of the stuff they were alerted to they were glad they were
alerted to but to make it work you had to keep up and continually
update the machine learning the forecasting algorithms and that’s what
was very expensive strategic forecasting and this is actually related forecasting
major political trends and we’re also interested in doing research on rare
events not all of the forecasting geopolitical forecasting is data-driven
we do a fair amount of work on cognitive methods analytic methods for forecasting
if any of you have come across Phil Tetlock and his book on
superforecasters that’s from us we funded that stuff okay that was actually a
byproduct of what we funded which was crowd wisdom approaches that were kind
of on steroids for us if you’re familiar with this literature prediction markets
were state of the practice that program substantially does better than
prediction markets in terms of forecasting accuracy but you get an idea
the kind of stuff we’re doing collection as you might imagine we’re looking for
clever ways to get data and stuff that’s in hard-to-reach places leave it at
that certainly interested in asset validation biometrics etc we want to
make sure that you are who you say you are when you come into any facility location tracking this is often a very
important thing for tactical intelligence being able to track a
vehicle or whatever okay that’s much more in the technology side and less on
the I can just talk about it side and then computing infrastructure I
mentioned we do work in quantum computing actually do a lot of it that’s
an example of the work in that area trustworthy components we have done work
and you know if you are building a physical computing capability you may
have some parts from overseas you may have some parts locally how to partition
out the engineering and the parts so that we can still get stuff that’s
effective and reasonably priced etc but trustworthy okay there’s active research
in those areas and safe and secure computing of course we’re interested in
cyber threats and cyber security okay I think is the last actual slide which I
can get through with one more sip of coffee okay so how to engage with us I
mean you’re here so thank you very much we’re delighted to see you here you are
here for proposers day in a research program to the right on this
slide and research programs are how we spend the bulk of our money
just as a practical thing if you have one large prime doing research
managing multiple subcontractors whether they’re universities or whatever
one large contract is about as much work to manage for us as a little contract so
if we can bundle things it’s just as a practical matter it makes our life so
much easier but in addition to the research programs themselves we do have
seedling programs if you look at the website
there is an IARPA-wide BAA under which you can submit interesting ideas not
part of a program something you think we should be interested in something you
think may be related to a program what I will recommend that you do
because a seedling is not going to get funded unless you have a program
manager who is in fact interested in it or let me just say it’s very unlikely to
get funded nothing is for sure we encourage you to go to that website look at the list of program managers their areas of interest are
described and if you have an idea that you think would be of interest to that
program manager give that program manager a call send him or her an email
and start an informal discussion that program manager will know the limits of
what they can and can’t say to you but if they’re not interested you will know
it and if they are interested they will tell you so that much they can certainly
say and that’s the best way to do it because don’t waste your time if you
don’t think there’s somebody on our end who is in fact interested and we
encourage you not to waste your time but send us ideas we also run prize
challenges I can’t remember is there a prize challenge associated with BETTER
not yet okay and I think the answer to that was I have no knowledge of
that okay but we often run
prize challenges and that basically encourages the general population to
compete with their technology for whatever it is we’re trying to test them
on and if you look at the iarpa.gov website you will see those okay and
finally we do send out RFI’s requests for information we run workshops just
keep an eye open for those things because we’re always interested in
learning more and that’s pretty much how to work with IARPA we are an open
organization the email for myself and everybody in IARPA is on the website
our phone numbers are on the website please don’t take too much advantage
of us as you can imagine we get some interesting phone calls the aliens under
the in Texas that was being described to me is one of my more fun ones but yeah
please please do contact us if you have a question or you’re interested in doing
something and I think that’s it good on time okay any questions awesome thank
you alright thanks a ton Paul all of our
kind of chiefs are very busy people so it’s always much appreciated when people
like Paul stop in to kind of give the high-level overview of what it is we do
so now I guess we’ll step into what I hope is at least 50% of the reason why
you’re here to actually hear about what BETTER is about I promise to not say
BETTER a thousand times through the course of this presentation but if I do
forgive me the acronym is my gift and my curse
so just want to stress again that there is a draft BAA up and a lot of what this
presentation is gonna be is recounting what’s already in that draft BAA and
providing some context and some examples of what the BAA is specifically talking
about but if you’re you know if I say something and you’re really curious
about more detail it’s probably in the BAA feel free to ask questions about it
but just you know the BAA has a lot more detail than what I can present in
this kind of 45 minute time slot and then I’ll also mention that that draft
is in fact a draft I think we’re probably 99 to 95 percent of the way
there in terms of what’s gonna be in the final BAA but I do very much welcome
feedback from everyone in this room and people not in this room you know the
people that would be watching this on YouTube later
you know like Paul said we want to make these programs challenging and high-risk
high-reward but you know if you email me and say
this thing you’re trying to do doesn’t make any sense you know I’ll probably
listen to that so feel free to shoot those across at me again probably won’t
respond directly but we will incorporate all that feedback so just again to the
nuts and bolts BETTER is like Paul said anticipated to be a multi-year R&D program
we’ll get a little bit into the timeline in a second but at a high level it’s
anticipated to be 3.5 years our punchy takeaway the lowest punchy takeaway is that
the program aims to develop enhanced methods for personalized multilingual
semantic extraction and retrieval from text
hopefully that makes sense to all the people in this room sometimes it doesn’t
make sense to people in other rooms so what we’re really trying to do is smash
together information extraction information retrieval and active
learning to enable this downstream use of extraction and retrieval with fine-grained
personalized knowledge so kind of the key takeaways there are
multilingual and fine-grained personalized knowledge so we’re doing
things across languages and we’re doing things for an individual rather than
these kind of overarching one ontology to rule them all approaches so hopefully
I don’t have hopefully if you’re in this room I don’t have to explain to you why
human language technology and natural language processing is valuable
hopefully you’ve bought that already but so these things will be probably
familiar to everyone in this room there’s constantly more text that’s
generated there’s too much for any one individual person to read or probably
any team of people to read on a given day a lot of times this stuff comes
across in multiple languages you know think about if you’re trying to look at
all the news that was published across the world yesterday multiple languages too
much and then this concept of finding orthogonal information that’s you know
when you’re trying to discover something is kind of like looking for your keys on
a dark night under the streetlight you’re kind of looking in this one place
but really they’re probably somewhere else so you know you know what you’re
looking for but we want to expand that search a little bit to things that you
might not think about a priori to look at so this gives us an opportunity to
develop better methods one to extract complex semantic information
from documents and then to use this extracted semantic information to do
basically document triage I’ll kind of you know hashtag spoiler alert we’re
re-ranking document lists that’s going to be the information retrieval portion of
this is you know the metric if you’re looking at the BAA is average precision
so we’re looking at re-ranking a list of documents based on relevancy and again
punchy takeaway fine-grained knowledge I’ll say it twice so that you really get the
message so just to provide some I’m sorry I’m shifting back and forth I have
to stay for the camera but I tend to like to wander so apologies
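
For readers less familiar with the average precision metric mentioned above, here is a minimal sketch in Python of how it scores a ranked document list. It assumes binary relevance judgments and uses made-up document IDs; the function name and the choice to normalize by the total number of relevant documents are illustrative assumptions, and the BAA is the authority on how the program actually scores systems.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked list under binary relevance.

    ranked_doc_ids: document IDs in the order the system ranked them.
    relevant_doc_ids: IDs judged relevant (hypothetical judgments here).
    """
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    # Normalize by all relevant docs, so missing one lowers the score.
    return precision_sum / len(relevant)


# Hypothetical example: the same four documents before and after re-ranking.
relevant = {"d1", "d2"}
baseline = ["d3", "d1", "d4", "d2"]
reranked = ["d1", "d2", "d3", "d4"]

print(average_precision(baseline, relevant))
print(average_precision(reranked, relevant))
```

The re-ranked list scores higher because both relevant documents moved to the top, which is the behavior a document triage system is rewarded for under this kind of metric.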
so to provide a little context to why we’re doing this and what kind of overall goals that we’re trying to accomplish so imagine you’re an
analyst right and you have a job and every day your job is to look at things
like political events like civil unrest and maybe you did your PhD on civil
unrest you’re the world expert on civil unrest so through the course of her work
flow this analyst develops a corpus of documents that captures the things that
she’s interested in a lot of times analysts don’t think of it that way but
if you know think about if you’re writing a literature review or something
as you go through all these documents you’re making relevancy determinations
by citing these things in your literature review so you know within
those documents in the case of this political event that involves events
entities and linkages and their relevancy to a problem area so the you
know how we envision the technology in this program helping is to provide
automated suggestions of some of those things provide suggestions
of most relevant documents you can think of this in kind of a recommender
system you know you read this maybe you want to read this and then automatically
tag information according to that individual analyst’s knowledge so again
you know we know what this analyst is looking at every day we know what she’s
trying to accomplish so we can provide automated support to this knowledge
discovery process and knowledge extraction process based on what we know
this person is looking at everyday so this is trying to kind of compress that
exploration and foraging part of the day so in the second scenario you can kind
of think of this as an analyst that does maybe some geographic area right they’re
not expert in things like protests but a protest happens in their area that
they’re looking at so they don’t have a ton of time to really beef up on you
know what are the causes of protests what are the outcomes of protests what’s
the most likely thing to happen after this protest but fortunately there’s
this other analyst and it is her job to look at protests every day so the
goal is to use what the second analyst might be able to do you know this
analyst can find five New York Times or Washington
Post stories that kind of encapsulate the thing that they care about and then
the hope is to leverage other existing knowledge to get at this fine-grained
notion that the second analyst is interested in so just to kind of drive
that point home the second analyst
shouldn’t have to develop a massive corpus of annotated examples to extract
things that they care about right in general it’s a no-go to say hey if you
provide me with ten thousand labeled examples of the thing you care about I
promise you I can build you a ML model that doesn’t really fly for most people
so how can we leverage this knowledge of this thing that we know is a protest
over here from analyst 1 and apply it to what analyst 2 cares about and then in
this case of analyst 2 you can think of someone doing a search for protest right
there’s lots of other ways to describe a protest other than just the word protest
riot demonstration so on and so forth so how can we broaden the search
criteria for analyst two based on the knowledge that we know from analyst one
when analyst two might only be looking at a very narrow subset of kind of the
semantic qualities of protest so really what we’re trying to do is compress this
discovery cycle so that people can quickly get up to speed on new areas
that might not be familiar to them and leverage things that we already know
across kind of this wide user base maybe you don’t buy that story but that’s the
story so some of the things that we’re really
trying to address in this program is complex information extraction so we’re
talking about events here multiple slots given the people in this room I know I
have to be careful describing what an event is but please just stick with me
as I peg it to one notion of event and forgive me if I mix my terminology up a
little bit I promise it’ll be clear in the BAA so in this case it’s fuzzy events
and actors right we’re doing things where kind of they’re not clean
named entities they might not be organizations or people or things like
that they might be phrases like the farmers and you know kind of this gets
at the concept of what is a protest these are complex things that we’re trying to
extract trying to do this for personalized extraction so how do we do
this when we don’t know the large a priori one ontology to rule them all
when we don’t know the complete knowledge space that we’re playing in and
how can we quickly develop new things based on information that’s coming in
this is better extraction towards enhanced retrieval so we’re smashing
together information extraction and information retrieval again this is a
document triage document ranking test so how do we use this extracted semantic
information to do document triage the thing that I like to drop here is you
know thinking about it in kind of a learn to rank context if that’s a concept
that’s familiar to people here and then do multilingual extraction and this is
the big one that I’d like to stress we’re talking about one-to-many models
we’re not talking about hand jamming a one-to-one model to move from English to
Spanish we’re talking about how can you use a model to do English do Spanish
French Russian Chinese Arabic I will say because I know this question comes up a
lot we’re not doing low resource languages we’re doing high resource we’ll
probably be tackling things like top ten top five world languages so you
can probably guess what we’re going to be looking at but more than likely we
won’t be telling you at the beginning of the program what the exact languages
are but no low resource I promise there are already low resource programs
out there and then at the end of the slide just
want to stress that anything that I say in this presentation should not be
construed as restricting the space that you can play in and kinds of methods
that you might want to propose to solve these things have a certain frame of
reference for how I organize things but the point of this whole program is that
I am not the one solving this you are so we’re posing the problem feel free to
propose anything that you think might be a valid way to solve this so whenever I
say learn to rank or something like that if you don’t think that’s the way to go
great whenever I kind of have some mock-ups
for how I think the system will look that’s not to constrain your creativity
so just want to make that clear in general this is kind of how I foresee
the program organization this is one of the things that might shift before the
final BAA but in general this is what we’re looking at so the program has
three phases again spread across three and a half years as you can see you’re
going to be extracting multiple languages so in phase one and two right
now we have that pegged as one language and then in Phase three we move to two
new languages and then the domains will shift in this presentation I’ll be
using political events as kind of a running example but you can think of
events from biomedical cyber crime lots of other things that we could lots
of other domains that we could play in and we will play in so again it’s going
to be multiple domains multiple languages and the goal there is you can’t
build a system that is really good at political events and not good at
something else we’re really trying to make this a general system to apply
across languages and across domains a lot of this won’t make sense right now
but I’m putting it up to give an idea of how this program is structured so again three and a half year program
three phases the first phase is gonna be 18 months so we have a little bit of
burn in time the last two will be twelve months and there will be steady data
releases over the course of the program and then a couple checkpoints in
each phase and then one final test and evaluation at the end of each phase and
that’s the kind of thing that we’ll grade you on for the program and for IARPA
metrics I’m gonna get into a lot more detail about what these things actually mean
but just want to have this in the back of your head whenever I’m talking about
how we’re gonna step through step one two three this is what that’s going to
look like stepwise across the life of the program so what are we testing you
on again the ability to extract complex events talking about a sentence or
paragraph level classification of event type we’re not looking at event triggers
that might make some of you happy and some of you sad but we don’t really care
about the exact thing that triggers this event again it’s a sentence or document
level classification task and the extraction of I call them actors you can
call them agent and patient you can call them whatever you want I will call them
agent actors and it’s a span ID so this is the span that this actor’s contained
within and a role ID and in terms of political events we use things often
like the source actor and target actor of this event so you have to identify
the span of the actor and the role again the ability to perform cross lingual
extraction we’ll get a lot more into how that is structured in a second the
ability to apply this extracted information towards a semantic retrieval
task and this active learning component that I mentioned up at the beginning how
well can you incorporate human feedback into what you’re doing so this is I
joked at the beginning that this is my test and evaluation plan turns out I had
to flesh this out a lot more Jason our director didn’t really go for this as
test and evaluation but you can imagine kind of playing in this space right so
if you have human input if you have information extraction you have
information retrieval you might be better on some than on others so it’s
kind of this multi-dimensional optimization problem that you have to
solve so I put this up there to say if you’re
a kind of world expert on information retrieval and you need to do some
information extraction stuff this is where the teaming comes in but also
understanding that different approaches might excel at different parts of this
kind of multi-dimensional space I have put that up there because yeah so actual
example of what you’re gonna have to do so again running example is political
events so teams are tasked with extracting political events from a
corpus great for this example we’ll use this high-level thing that we call a
quad class ontology you can imagine that political events are either said or
they’re done that’s material or verbal and they’re conflictual or cooperative
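A minimal sketch of the quad class grid just described, assuming the two axes are modeled as independent enums. The names `Mode`, `Tone`, and `quad_class` are hypothetical and do not come from the program; the label strings follow the "material conflict" wording used in the talk.

```python
from enum import Enum
from itertools import product


class Mode(Enum):
    """Whether the event was said or done (hypothetical name)."""
    VERBAL = "verbal"      # said
    MATERIAL = "material"  # done


class Tone(Enum):
    """Whether the event is cooperative or conflictual (hypothetical name)."""
    COOPERATION = "cooperation"
    CONFLICT = "conflict"


def quad_class(mode: Mode, tone: Tone) -> str:
    """Combine the two axes into one coarse label, e.g. 'material conflict'."""
    return f"{mode.value} {tone.value}"


# All four coarse-grained labels, one per cell of the 2x2 grid.
QUAD_CLASSES = [quad_class(mode, tone) for mode, tone in product(Mode, Tone)]

print(QUAD_CLASSES)
```

Keeping the two axes separate makes it easy to attach finer grained event types under a single cell later without changing the coarse labels.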
I think that was pretty self-explanatory so what you’re gonna have to do is
extract something from this coarse grain kind of high-level quad class followed
by a fine grain event so if something’s material conflict a more fine grain
aspect of material conflict might be an attack and then you finally have to
discover an even finer grained aspect of that like a military action is different
than police attacking someone or two civilian rebel groups attacking each
other so again now you might be seeing how that phase fills in that
we go coarse grain fine grain finer grain across the life of the program so
this is what the quad class looks like just a visual representation again
conflict cooperative said and done and then we will label things at the
sentence or paragraph level to this high level quad class ontology so then the
way this is set up is we label all these things say this is this sentence is
material conflict here are the actors and then you have to extract from a test
corpus that’s in the target language caveat is you don’t have any annotated
training data for the test language you have annotated train data in English no
annotated data in the test language we’ll talk a little bit about what
resources you have available in terms of multilingual and cross lingual but we
are not annotating these events in the target language so you do this right
this might be an example of what some of those material conflict events look like
you know rebels attacking a town protesters marching in opposition to the
government red is source actor and blue is target actor so this is the kind of
annotation that we’re envisioning and again extracting from a target language
corpus and then now we’re doing this fine grain example right so we go back
to this already annotated corpus and material conflict events and we say cool
that one’s an attack so to reiterate we’re reaching back into the already
annotated examples we’re not providing you with new unseen examples we’re
reusing the things from step one and providing a finer grained notion to step
one so now this rebels thing is no longer a material conflict event it’s an
attack and all you know is that that protest
looking thing that’s just a material conflict event and then we’re reprising
this reaching out to the target language corpus and doing an extraction in
the foreign language this is the exact same setup as step two just with a finer
grain more detailed annotation all right and then the final note is that it
says up there we’ll be re-annotating n of these material conflict events and
that n might be variable across the life of the program so phase one you
might get 10,000 phase two you might get five phase three you might get two
probably not two but you get the point but if you can do two that’d be cool
and then for step three now is when we’re combining the semantics the
information extraction the information retrieval tasks and note that I’ve been
talking about sentence or paragraph level annotations now we’re moving to
document to provide the motivation for this if you remember at the beginning
the presentation I talked about an analyst being able to find five or ten
documents that they care about that analyst probably won’t go sentence by
sentence or paragraph by paragraph and highlight sentences for you so this is
trying to capture that notion we’ve already had some questions about what
document level annotations will look like and how that will combine
with sentence or paragraph level annotations I don’t know as I’m sure
everyone in this room knows annotation in and of itself is a hard task so we’re
still trying to figure out the annotation task today March 29th we’re
still trying to figure out what the annotation will look like and a lot of
this will be enabled a lot of what the annotation is able to accomplish will
determine what this final thing looks like I do
promise everyone in this room that it will be very clear at kickoff what all
this stuff will look like we’re not gonna kind of drop this on you you know
twelve months into the program and shift things around on you it will be clear at
kickoff time but I can’t provide a ton of clarity on that right now just
because we don’t know so again document level annotations but
again this is thinking along the lines of you’re playing an attack now this
document captures a military action event you do some information extraction
you have to hit the test language corpus again again this is extracting the finer
grain event like military action from this target language corpus and there’s
this information retrieval task so you’re reaching we have a large corpus
of documents some relevant some irrelevant some more relevant than others
and you have to re-rank and this is another point I’ll emphasize this is
kind of the schematic that I have in my brain but if you want to do information
retrieval then extraction extraction then retrieval so on so forth that’s up
to you feel free to propose whatever you think will solve this best this is just
kind of some simplifying assumption so I can hopefully get the point across to
y’all today and then this is another place where this kind of schematic is
notional then you’re gonna have to incorporate some human interactions
again this is another place where we’ve had questions about what human
interactions will look like in general they will be yes or no annotation so you
can say hey I think this is relevant yes or no hey I think this document’s more
relevant than this document yes or no hey I think this extraction is correct yes
or no what you’re not I can tell you right now what you’re not gonna be able
to do is say hey analysts can you annotate me 5,000 examples that’s not
gonna happen or hey analysts can you highlight the exact span in this
paragraph that you think is the source actor that’s not gonna happen we’re
gonna provide you those annotations but that’s not what this human revision is
getting at to provide the motivating example it’s like if you’re sitting down
you’re searching for something you’re probably willing to do kind of a
thumbs-up thumbs-down you’re probably not willing to dive into a document and
highlight and annotate things in the course of your day-to-day job
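One way to picture the yes or no interactions described above is as a small budgeted question interface. This is only a sketch under stated assumptions: the class names, the `max_questions` cap, and the `answer_fn` callback standing in for the analyst are all hypothetical, since the actual form of the human interaction is still to be defined.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple


class QuestionKind(Enum):
    """The three kinds of yes/no questions mentioned in the talk."""
    DOCUMENT_RELEVANT = "is this document relevant"
    PAIRWISE_PREFERENCE = "is document A more relevant than document B"
    EXTRACTION_CORRECT = "is this extraction correct"


@dataclass(frozen=True)
class YesNoQuestion:
    kind: QuestionKind
    subject: str  # free-text pointer to the document(s) or extraction asked about


@dataclass
class FeedbackSession:
    """Tracks yes/no judgments and refuses questions past the budget."""
    max_questions: int                          # hypothetical cap on questions
    answer_fn: Callable[[YesNoQuestion], bool]  # stand-in for the human analyst
    history: List[Tuple[YesNoQuestion, bool]] = field(default_factory=list)

    def ask(self, question: YesNoQuestion) -> bool:
        if len(self.history) >= self.max_questions:
            raise RuntimeError("question budget exhausted")
        answer = bool(self.answer_fn(question))
        self.history.append((question, answer))
        return answer


# Usage: a simulated analyst who says yes only to relevance questions.
session = FeedbackSession(
    max_questions=2,
    answer_fn=lambda q: q.kind is QuestionKind.DOCUMENT_RELEVANT,
)
print(session.ask(YesNoQuestion(QuestionKind.DOCUMENT_RELEVANT, "doc-17")))
print(session.ask(YesNoQuestion(QuestionKind.EXTRACTION_CORRECT, "doc-17 event 3")))
```

Note that nothing here lets the system request span highlighting or bulk annotation, which matches the restriction that the human only gives thumbs-up or thumbs-down answers.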
so again this notional schematic is that we feed in the seed document
there’s information extraction there’s information retrieval there’s some
feedback from a human TBD what that looks like and then you
reach out to all the documents and the goal is to get this kind of curve that
you see in the bottom where hopefully as the number of revised or annotated
documents goes up your performance on the various tasks also goes up and we want
to see what the shape of that curve looks like and we’ll get into this in a
second but the number of revisions or annotations that you get over the life
of the program will be variable so what we’re really looking for is you know if you
can ask one question what’s the one most high value question that you can ask
a person if you can ask ten what are the ten if you can ask 50 what’s that look like so
that’s kind of what we’re trying to get at here is again what’s the shape of that
curve just to recap we’re doing three things primarily a coarse grain fine
grain finer grain plus retrieval for task 1 and 2 we’re looking at sentence
or paragraph level annotations plus this span and role ID and for the third task
we’re looking at that but with document level annotations and the information
retrieval task again it’s a document re-ranking task and then including
human-in-the-loop input in some way shape or form so this is the structure of the program and
now if we go back to this hopefully this makes a little bit more sense so we’re
gonna be staggering data releases over the life of each phase and then every
six months or so you’ll have these kind of checkpoints where you do the coarse
grain extraction tasks the fine grain extraction tasks and then a month before
each phase ends this kind of overall information extraction and retrieval
tasks so we’re trying to stagger things that you don’t have to drink from the
fire hose metrics everybody’s favorite discussion for information extraction is
f1 score this is one place where we’re more than happy to get feedback on
whether or not you think these make sense for what we’re trying to do right
now we have it as multiplying so you get an f1 score for the event extraction so
the sentence or paragraph level classification and
an f1 score for the span and role ID we multiply those two together to give us
one nice you might debate whether or not it’s nice nice metric for the kind of
overall program goals yes that does mean that these two things are decoupled
because we want to see your ability on event extraction and actor extraction
yes that does provide for some unique things that you can get some things
right other things wrong but that’s pretty much what we’re looking at right
now again more than happy to entertain feedback on that if it makes sense to
you as the people who’ll actually be judged on this task hopefully I don’t have to
define what f1 score is but it’s up there and then for the IR task we’re
doing an average precision metric so again this is just looking at
the re-ranking of documents and the relevancy of those documents we
will provide annotations of relevancy we will provide a kind of codebook of what
we mean by relevancy we’re hopefully defining relevancy in a fairly complex
way that this can’t be solved by kind of ad hoc retrieval methods and in a
complex way that actually reflects what a real life analyst would do so this isn’t
kind of a contrived task but again punting that to program kickoff at
program kickoff it will be very clear what we mean by relevancy but right now
you just need to know that you will be re-ranking documents
according to some notion of relevancy that we will define and provide
annotations for process if you see a kind of recurring theme through all this
we’re providing you with lots of annotated data that’s really what we’re
giving you here and then kind of structuring this test and evaluation to
get the things out of it that we want this table didn’t really come out the
way I wanted to but each phase we have milestones the values in these cells
as this bottom note points out are the percent reduction in error from a
baseline so we’re going to define baseline models a priori there’s a
couple in the BAA right now the first one for the information extraction is using
the political event extraction from BBN those numbers that are published
relating to the ICEWS project that’s on Dataverse and then for the average
precision it’s 0.25 which comes from our
friends at NIST so again these values in the cells are percent reduction in error
from the baseline model we will probably develop other baselines across the life
of the program some will be harder some will be easier and so the point is that
you won’t be able to toss some easy solutions over the fence you have to
kind of get past an initial bar that we think is good enough but again we’re not
doing this just we’re not saying you know you have to get 85 f1 score at phase
one we are taking into account the realism that kind of moving from six to
seven is easier than moving from eight to nine and then I don’t have all the
tables up here they are in the BAA but these metrics do get harder over the
life of the program so I think in phase 2 the percent reduction in error is like
sixty percent or something like that and then phase three is like seventy five
percent so it does get harder over the life of the program so we’re hoping to
shrink that gap more and more as the life of the program goes on I didn’t
put all those up here because I walked through them but the idea is the same
and then we don’t really call these metrics for the program it’s not
things we’re judging you on it’s things we’re providing you with
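The scoring pieces described above (the product of two F1 scores, average precision for re-ranking, and milestones stated as percent reduction in error from a baseline) can be sketched as follows. This is an illustration, not the official scorer: it assumes error means one minus the score, and that the ranked list contains every relevant document. All function names are hypothetical.

```python
from typing import Sequence


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def combined_extraction_score(event_f1: float, actor_f1: float) -> float:
    """Multiply the event-classification F1 by the span-and-role F1."""
    return event_f1 * actor_f1


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average of precision@k over the ranks k that hold a relevant document.

    Assumes every relevant document appears somewhere in the ranked list.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def error_reduction(baseline: float, score: float) -> float:
    """Fraction of the baseline's error removed, with error = 1 - score."""
    return ((1 - baseline) - (1 - score)) / (1 - baseline)


def target_score(baseline: float, reduction: float) -> float:
    """Score needed to cut the baseline's error by the given fraction."""
    return 1 - (1 - baseline) * (1 - reduction)


# Usage with the 0.25 average precision baseline mentioned in the talk and an
# illustrative 60 percent error reduction (the exact milestones are in the BAA).
print(round(target_score(0.25, 0.60), 4))
print(round(average_precision([True, False, True, False]), 4))
print(round(combined_extraction_score(0.8, 0.5), 4))
```

Framing milestones as error reduction is what makes the bar scale with the baseline: the same percentage is a bigger absolute jump from a weak baseline than from a strong one.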
so again revisions the number of human judgments allowed for each performer
system again these are those yes/no up/down votes that you get from a
human and then the document annotations which is the number of these document
level things that we provide you for this information extraction or retrieval
task so these will be variable over the life of the program you can see
notionally right now we have phase 1 it’s 50 revisions and about a thousand
document level annotations and then those numbers get smaller over the
course of the program so at phase 3 you’re looking at being able to ask
maybe 20 questions of an analyst and you might have maybe 10 document level
annotations again this is another excuse me another place where we would welcome
some feedback so if you think that is just impossible for you to build a
system at the end of this program that only has 10 document level annotations
feel free to tell us I might not listen to you because I might disagree but
feel free to tell us and that’s useful information for us to have because it
helps us set our expectations as we design some of these metrics data no
social media so only news like data long-form ish people always ask well
what do you mean by news is a blog news we don’t know if you want to pin this to
an idea in your head we’re looking at something like the common crawl news
scrape so it’s a subset of common crawl that has only news websites so hopefully
that sets some expectations for you so it’s not just clean New York Times but
we’re not getting to someone’s kind of micro conspiracy blog or something like
that and no tweets no tweets you’re welcome so the government team will
provide annotated data sets of this news data again long-form news data and we will
provide enough training data examples to enable a wide range of techniques we’re
not going to annotate a thousand examples and say vaya con Dios
you will have enough data to actually work with data regimes I mentioned this
earlier when we talk about a multilingual extraction again going
across languages there’s gonna be two data regimes there’s an unconstrained
and a constrained regime in the constrained regime we’re gonna point out
the resources that you’re allowed to use we will likely not be creating
multilingual resources ourselves you can think of this as pointing to resources
that already exist in the community things like LDC corpora WMT tasks things
like that again we’re playing high resource languages so there’s a lot of
stuff that already exists and this isn’t talking about the annotated data for the
events themselves this is talking about when you’re
building a multilingual model to move from English to Spanish you can use a
specific set of multilingual resources that already
exist and we’ll tell you what those are and the goal of that is to say you know
basically what kind of performance do you get when you use only freely
available open resources that you know we the government have easy access to in
the unconstrained regime it’s the Wild West whatever you want to use
we’re not gonna be paying for you to develop new resources asterisk maybe
but probably not paying for you to develop new resources unless you make just a
super compelling case for why that’s the case but you can’t use proprietary and
you can’t do anything illegal and you have to provide us with an
accounting of the things that you use but other than that it’s whatever you
want to use if it’s some lexical resources if it’s some multilingual
resource whatever completely up to you subject to the constraints that I just
mentioned the question was for constrained can they use ConceptNet is
it a proprietary data set then sure yeah so again if it’s open source and freely
available and not proprietary that we would have to buy a license for you’re
free to use it that’s the main takeaway here because what I want to see in this
case is y’all are all super creative that’s why you’re in the room that’s
why I’m doing this program all of you creative people will be able to figure
out more things than I can figure out as a human being that might be useful to
use in this program so we want to see the absolute best that you can do when
we don’t put constraints on you that’s the point of this deliverables Paul
mentioned in terms of test and evaluation sometimes you kind of create results and
we score them and sometimes we run your software
ourselves against the test data that’s what we’re doing in BETTER so what
you’re gonna have to do is you’re going to develop your models your approaches
your algorithms whatever you want to call them and you got to put them in a
docker container docker containerization technology it’s gonna be docker
containers the models must be capable of interacting with an API right now let’s
go ahead and peg that and say it has to consume from a restful api and it has to
emit to some json schema that we will provide to you at program kick-off okay
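To make the "emit to a JSON schema" requirement concrete, here is a sketch of what one emitted extraction record might look like. The real API and schema are to be provided at program kickoff, so every field name below (`doc_id`, `event_type`, `actors`, `role`, `start`, `end`) is a hypothetical placeholder, and the helper names are made up for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ActorSpan:
    role: str   # e.g. "source" or "target"
    start: int  # character offset where the span begins (inclusive)
    end: int    # character offset where the span ends (exclusive)
    text: str


@dataclass
class EventExtraction:
    doc_id: str
    event_type: str  # e.g. a quad class label such as "material conflict"
    actors: List[ActorSpan]


def find_span(sentence: str, phrase: str, role: str) -> ActorSpan:
    """Locate the first occurrence of phrase and record its offsets and role."""
    start = sentence.index(phrase)
    return ActorSpan(role=role, start=start, end=start + len(phrase), text=phrase)


def to_json(extraction: EventExtraction) -> str:
    """Serialize one extraction record to a JSON string."""
    return json.dumps(asdict(extraction), ensure_ascii=False)


sentence = "rebels attacked the town"
example = EventExtraction(
    doc_id="doc-0001",  # hypothetical identifier
    event_type="material conflict",
    actors=[
        find_span(sentence, "rebels", "source"),
        find_span(sentence, "the town", "target"),
    ],
)
print(to_json(example))
```

Inside the container, records like this would be written in response to inputs pulled from the evaluation API; that wiring is omitted here because the endpoints are not yet specified.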
so what that means is that maybe you don’t have the expertise on your team to
develop a dockerized approach the toolchain for docker has
really advanced in the past year or two so it’s not that hard but if you don’t
have that skill set you don’t think that you’ll be able to do that this is going
to be a program requirement it’s in the draft baa right now we will define the
exact API and schema at program kickoff but just know that this is how it’s
gonna be you will provide your models to us we will run them score them and we’ll
tell you how you did cool maybe not cool that’s it for me I think
I’m probably ahead of time but there’s my contact information again we have
right now we have the BAA wide email address that’s open so just send all
your questions concerns comments to that right now
again you have the note cards in your folders feel free to write questions and
I will go through those and provide answers in the Q&A session there will be
an opportunity to just do that via the microphones if something else comes up
but I won’t take questions right now probably just to keep things a little
organized so that we can do those cards and help me out a little bit so that’s
it I think we’re on a break now

just making their way in good morning my name
is Katie Cole I’m the chief acquisition officer for IARPA and I know that today
you’re primarily here to get the technical content being presented by Dr.
Beieler but I do just want to spend a few minutes just quickly going over IARPA’s
business process I hope that this just kind of provides some framework for
the responses that you’ll be submitting as well as just giving you some
resources that may be useful to you or others in your organization so today I’m
just gonna kind of quickly walk through some reoccurring questions that we get
related to the actual submission in response to our BAAs all of these items
will be addressed in the final BAA that’ll be the source that you’re going
to want to turn to at the end of the day but this should at least just kind of
give you that framework so when the BAA is released there will be a specific
period stated within the BAA for questions to be submitted the content of
those questions can be technical or really be related to any other section
within the BAA all of the responses to those questions
will be posted to FBO so you’ll have access to all of those we ask that you
do not include any proprietary information for your company or mark
your questions as proprietary or confidential and again prior to
submitting those questions just take a quick read through that BAA a
lot of the information is there pay attention to section 4 that’s where it’s
going to talk about what you’re going to be submitting so what the
structure of the proposal is as well as the submission process and in addition
on IARPA’s website and this is already available to you up in the top
right-hand corner there’s a link to frequently asked questions I would take
a guess that the majority of the things that you might want to get some
clarification on can be found there and again that’s available to you at any
time so as far as eligible applicants like we obviously want you guys to be
collaborative to team together you’ll have an opportunity this afternoon in a
non-government setting to talk and see what makes the
most sense for you but just be aware that this is the responsibility of the
proposer the government has no input into this and we will not direct it in
any way foreign organizations and individuals may certainly participate
subject to the non-disclosure agreements security regulations etc and again this
will all be outlined in the BAA so ineligible organizations this is also
found on our frequently asked questions area of the IARPA website but essentially
any organization that has access to privileged information on behalf of the
government cannot participate as an offeror or a subcontractor so this would
include FFRDCs UARCs and other government organizations intellectual
property is typically a big question too this again you’ll find detailed
specifically within the broad agency announcement and if selected for award
this is something that would be negotiated within the terms and
conditions of the contract but in general the government does
not seek to own your intellectual property or technical data what we’re
seeking is the right to use that information so again this stuff is
available to you for your own commercial use now if there’s anything that’s um
first conceived under this activity IARPA will ask for non-exclusive rights so essentially like a paid-up license
to use that information as part of the BAA there will be a section you’re
required to fill out that you can state and assert any restrictions to data
rights so that will provide all of those
instructions for you pre-publication so we encourage
publication of peer-reviewed unclassified research we always get a
question about well what does that mean so typically because this is
unclassified research we don’t have a pre-publication review however we will
ask for a courtesy copy five days or so prior to the publication being released
so preparing for the proposal submission follow the detailed instructions as I
said in Section four IARPA uses a tool called IDEAS for the ingest of all our
proposal responses that information is also provided in the BAA
we encourage that you register on that site early it is the offeror’s
responsibility to ensure that they have the ability to get in there so you can
register we recommend up to a week early make sure that there’s no issues there’s
information on if you run into any technical issues how to contact us we
don’t anticipate any classified proposals being submitted for better but
obviously that would not be done through the system and we would provide you
further instruction check FBO so FedBizOpps often as well as IARPA’s
website will have a link to all of that material but that is where you’re going
to see the responses to the Q&As if as a result of Q&As we need to amend
or clarify anything with the BAA that’s where it’s all going to be posted for
you and under Section 5 of the BAA that’s where you’re going to
see our evaluation criteria as well as the method for evaluation and selection
this too is probably pretty important for you so we recommend that you read
that carefully so organizational conflicts of interest if there is any
perceived conflict of interest you’re not sure the BAA will provide
instructions on how to get that information to us so we can help you
make a determination we just recommend as soon as you know if you think that
there’s something going on that you again just make sure that you follow the
instructions through the BAA get that information to us and we’ll respond
promptly to it so IARPA does applied research we don’t have the ability to
waive export or international control regulations we’re not subject to DoD
funding restrictions for R&D this is not DoD work but you’ll have to work
within your business organization if there’s any concern about import-export
and all of that there’s people within your organization
who can help walk you through that finally I just want to address there we
typically get some standard questions as part of this day so I’m just gonna go
ahead and kind of run through a few of them for you this program better does
have a budget we wouldn’t be here otherwise no we’re not going to tell you
what it is what we’re asking for you guys to do based on the information
that’s provided is to provide the government back what you think the best
solution is for this piece of research we don’t have a predetermined
number of awards according to the BAA we’ll go through our evaluation criteria
and the selections will be made in accordance with that from today we don’t
have a set number of days until this BAA will be posted to FBO I can tell you
that typically from today it takes a couple of months before it’s there but
just you know we just recommend that you just continue to monitor our website and
FBO so if there are any questions for me I’d be happy to answer them for you okay well thank you all very much and
I’m gonna hand this back over to John to respond to your questions

all right thank you Katie so hopefully
you’re all now clear on how you propose on an IARPA program right great
awesome so we are in the Q&A session you are a
very curious bunch or I did not explain that’s a lot so I’m gonna drink a lot of
water through this I’m gonna answer the questions as best as I can
there are microphones in the aisle if I answer your question but don’t really
answer your question feel free to throw something at me and ask again I might
give you the same answer but you know never hurts to try twice but I will say
thank you to everyone that wrote these questions a lot of this is invaluable
feedback again as Katie mentioned the draft BAA is not finalized and there will
be some lead time on our end before we finalize the BAA so all the questions
that are written right now are really great feedback as we’re going through
the process of finalizing the BAA and firming up some of the metrics and
design of the test and evaluation so I’m also gonna look real quick in the back
and can you hear me because I was told earlier that I wasn’t speaking into the
microphone enough cool so I tried to organize these by theme I didn’t succeed
so I’ll try not to repeat every question over and over again and try to go
through them relatively quickly okay question should we include machine
translation in proposals for multilingual documents machine
translation approaches are within scope there are lots of different approaches
that you could take to dealing with multilingual documents other than
machine translation this goes back to the point I was trying to make of I
don’t want to constrain your creativity we’re not funding pure MT
research so keep that in mind if that’s your proposal that probably won’t be a
great competitive proposal but MT work in and of itself is definitely within
scope of the program probably going to get a question on that one again you
probably want to weight recall more than precision in the F score since
false negatives can be disastrous while false positives are just less efficient
for analysts noted this is one where we’re I’ll bring this back and think
about and talk with the test and evaluation team as we’re designing the
metrics and this is kind of what I’m talking about when I say I might not
have a great answer to all the feedback right now but this is definitely
something that we will take back and think carefully about as we’re designing
the T&E setup so thank you for this who are the evaluators NIST etc NIST is
on the T&E team helping design metrics for information extraction and primarily
information retrieval they’re doing a lot of work on the relevancy
determinations as I mentioned we are taking your models and we’re running
them everyone knows we’re in the neural revolution so that will probably require
GPUs and infrastructure to run some of the
models that people will submit so NIST won’t be doing that there will be
another T&E partner that will develop the infrastructure to run your models and
actually score them but NIST will be involved in the evaluation the other
T&E partner who will be building the infrastructure is TBD we don’t have
a firm final answer on that right now we’re still in the process of finishing things
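On the earlier comment about weighting recall more than precision in the F score, one standard way to express that is the F-beta score, where beta greater than 1 counts recall more heavily. The sketch below is only an illustration of that general formula; the function name `f_beta` and the example counts are assumptions, and nothing here reflects the metrics the program will actually adopt.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta from raw counts; beta > 1 weights recall above precision."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical systems with mirrored error profiles.
high_precision = dict(tp=8, fp=2, fn=8)  # misses many true events
high_recall = dict(tp=8, fp=8, fn=2)     # returns more false alarms

for name, counts in (("high precision", high_precision),
                     ("high recall", high_recall)):
    print(name,
          round(f_beta(**counts, beta=1.0), 3),
          round(f_beta(**counts, beta=2.0), 3))
```

With beta set to 1 the two hypothetical systems tie, while beta set to 2 ranks the high-recall system above the high-precision one, which is the behavior the questioner was asking for.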
does the classification runtime and training need to be in docker or is it
just the runtime just the inference just the runtime you don’t have to train
within the docker container what I’m expecting right now is that you
will hand us the trained model in the docker container and we’ll hit the
inference API our existing work should meet the goals but we are not in docker
how long before it would need to be so one the program’s not going to kick off
for a while two within the program I don’t want to flip back through the slides but
the first T&E point isn’t until eight or nine months into phase 1 of the
program so you’ll have a minimum of 6 months before you have to deliver
something to us and I think it’s actually more like 8 or 9 so hopefully
that’s enough time to dockerize things within the scope of the program when
will the final BAA be released when is the proposal deadline Katie just addressed
that I will say that the questions and comments on
the draft BAA are due April 6th and we’re hoping to turn it around fairly quickly
on our end revising the technical portion based on the feedback received
today and in those questions and comments but yeah there is a little bit
of lead time that we have to have to finish the BAA and make sure everyone’s
cool with it and get it posted to FBO so probably a month or two past today
or past the deadline for asking questions yeah and then the proposal
deadline will be something like 45 or 60 days past the posting of the BAA but that
will all be in the BAA itself for personalized search will the analyst’s
interest be defined in advance or learned on the fly only so again we’re
providing annotated data for relevancy and that’s what we’re defining as the
analyst’s interests we’re defining that in the relevancy determinations
themselves so you won’t be learning on the fly I know that’s going to conflict
with some stuff that I’m gonna talk about with human interactions in a
second but you are being provided annotated relevancy data for the
information retrieval task is identifying conflicting or inaccurate
information within the scope of this effort nope we’re not doing epistemology
we’re not figuring out what’s true and what’s fake what’s real and what’s not
you’re extracting things based on the annotated training data I’ll skip ahead
because there was another question that asked I just lost my train
of thought but yeah no epistemology you’re extracting everything
in the conceptualization of BETTER there would be someone
existing downstream that would figure out what’s real and what’s not
you’re just extracting information from text summarization how central is
summarization to BETTER and there are other questions relating to
summarization summarization is not in scope you’re not summarizing documents
you’re just providing the extracted information and re-ranking a document
list no summarization integration platform how advantageous is it for
proposers to have an existing framework for integrating IR IE algorithms with
USG-facing computing environments as for example micro-services in containers
you’re putting your models in docker and the docker containers consume from
an API and publish JSON that’s it that’s all we need the only
deliverable we’re getting is docker containers so some other thing that you
might have isn’t particularly advantageous as long as you can put
things in docker containers do you expect each team to develop
end-to-end systems that cover all key
aspects IE IR multilingual human-in-the-loop or can a team just focus on
one aspect e.g. information extraction no the teams have to respond to all of the
task areas in the BAA it says that in the draft BAA your team has to
be able to do information extraction information retrieval and incorporate
active learning or human-in-the-loop that’s why we’re here today if you’re an
IE expert and you don’t know how to do IR or don’t want to do IR there are
people in the room who probably want to do IR but not IE so that’s why
you’re here to team and figure those things out but your proposal has to
cover all three and again that is in the BAA is sentiment analysis considered
a beneficial component of the program I wrote no we’re not doing sentiment
analysis I’m trying to think of any applications where sentiment analysis
would be directly relevant to information extraction I am not thinking
of any please feel free to correct me if I’m wrong on that but
my initial reaction is no sentiment analysis is not in scope for global
organisations slash companies intending to submit a proposal can part of the
performing team be researchers from other countries for example Israel or
Switzerland as Katie said yes there are no restrictions on foreign performers on
BETTER it’s all open research it’s all unclassified research and we encourage you to
publish the results that you develop in BETTER
EMBERS was awarded to multiple teams and run as a competition will BETTER be
run the same way where underperforming teams are eliminated we are anticipating
multiple awards we do have set metrics and there is the possibility of down
selects within better depending on whether or not
the team can make progress towards those metrics that we’re defining again we want
to set the bar high and then keep funding people who can achieve the
results that we’re looking for expecting to fund multiple teams yes does the team
have to propose against all tasks yes again they have to propose against all
three and it says that in the BAA how much funding is available some what
is your overall budget it’s got dollars in it so the government
team who is the evaluator who is the data provider are these the government
team I like number three I like that twist so as I mentioned NIST is aiding
in the development of the metrics themselves and they will aid in the
evaluation but they will not be running the models that you deliver to us so
again that’s just making sure that we get the best results possible from you
and we score you fairly and don’t kind of do something fly-by-night and make
sure that we’re fully capturing the stuff that you’re providing us and
giving you the best shot to be successful so again there’s still some
things up in the air for exactly who’s gonna be an additional T&E partner
T&E teams can change over the course of the program depending on languages
and things like that but right now NIST and someone else who will be developing some
infrastructure are the primary groups I guess I can just mention it’s MITRE who
will likely be building the infrastructure what is your attitude
towards open source software to what extent will potential contributions to
open source influence decisions on funding to what extent will you make
sure groups who promise to open source their work actually do so ok this one’s
fun so there are companies in the room right and there are academics in the
room and there are startups in the room so there’s a lot of different tolerances
for open source software and we don’t want to particularly step on anyone’s
toes again so we don’t want to kind of limit who can propose on this based on
the intellectual property rights that people might assert I
personally like open source the BAA will make very clear
kind of what the expectations for open source are I still have to figure some
of that out on my end but I will say that in terms of the data and things that
we develop in the course of the program that will all be released at the end of
the program and made completely open source and I like open source but
the BAA will make very clear what the expectations are as to
intellectual property and open sourcing software we want to be nice to our
corporate friends here is an approach of MT from a language into English
and then IE in English acceptable or should the IR and IE be native to each
language with cross-lingual transfer training etc again it’s up to you if you
think the best way is to do IE and IR in native language and go from there or
if you think the best way is to MT and then do IE and IR that’s completely up
to you you know write a strong proposal you see the metrics that you have to
achieve if you think you can achieve those metrics with either one of the
things listed in here feel free to do so we’re not constraining
your creativity whatsoever in that regard currently the draft BAA says only
one language for phases 1 & 2 if so how will multilingual / cross-lingual and
language general methods be tested and validated would IARPA reconsider
putting a second language earlier in the program yes that is why I said that
table that I put up with the phases was notional as of like yesterday we will
probably add in more languages to ensure that people aren’t building kind of
language specific program so we probably will broaden the number of languages
included in the program so thank you for that feedback no surprise so no we’re
again we’re playing in high resource languages top 5 top 10 world languages
languages that have large existing resources but we’re not telling you
exactly what languages are hitting at exactly what phase because what we don’t
want is people to go out and build kind of specific systems for a specific
language because we really are trying to push the generalizability across
languages so you’ll know at phase kickoff not program kickoff at phase
kickoff what the language or languages for each phase will be
so we’ll declare that to make sure we’re also not being mean and telling
you two hours before you have to do this what language you’re going to be doing
it’s at the beginning of the phase so you do have some lead time but I’m not
telling you at the beginning all the languages that you have to extract for
the constrained test can we use ConceptNet Creative Commons license yes you can
use whatever data you want as long as it’s not proprietary as long as we
the US government don’t have to buy a license for it that’s really the thing
that I’m driving towards with the constrained and unconstrained regime oh wait
sorry I misunderstood that for the constrained tests you will use the
resources that we point out so we will provide you with a list of three or four
corpora or linguistic resources that you can use it’s in the unconstrained regime
that you can use anything as long as it’s open source and freely available
everybody with me on that I’m sorry for misspeaking on that constrained we’ll
tell you the list of things you can use unconstrained you can use whatever as
long as it’s open source and freely available the list of constrained resources we
will determine later so for the unconstrained regime is proprietary
data okay if it is derived from open data and is proprietary data okay if the
government has been granted unlimited use rights to those data so the first one is
proprietary data okay if it’s derived from open data is it open source if you’re kind of
asking yourself these questions what it all goes back to is can I go on
the internet and download this data if no the answer is almost certainly
no the second one is proprietary data okay if the government is granted unlimited use
rights to these data my gut says sure right now but I need to go roll that
around in my head a little bit and figure out what that implies for
the broader program but the BAA will be very clear on that point I promise
that much so for now TBD in the BAA to be clear we are never receiving
annotated data in a non-english language for training correct it’s all English
annotated relevancy information extraction information retrieval nothing
is annotated in a target language everything is annotated in English and
you have to extract or search in target language could you further discuss the
sort of document level annotations you envision as well as the sort of semantic
extraction that’s in quotes anticipated based on these document
level annotations understand you don’t know one yet two question mark what I’m
envisioning for the document level annotations is
something akin to a topic label a document level topic label and again to
reiterate when you’re looking at the step through step 1 step 2 step 3 we’re
revisiting things that have already been annotated so if we label a document as
in the example a military action it will have annotated events for material
conflict and attack right so we’re revisiting these things that we’ve
already annotated and applying another label to things that have already been
annotated in the data hopefully that makes sense if it doesn’t please feel
free to ask again in the QA and I’ll try to explain further but kind of document
level label on things that have already been labeled at the sentence or paragraph
level and the semantic extraction is similar so extracting particular events
of military action that would probably that will be a subset of the attack
events so military action is a subset of attack we’re providing a document level
label of military action that’s probably not clear and we can chat more about
that will there be annotated train data for information retrieval yes we’re
annotating relevancy in English you mentioned in phase three only about
ten labeled examples would be provided is that ten per event type ten total or
something else so again with this we’re talking about document level annotations
of things that have already been annotated so what we’re saying is we
will label ten documents for this fine-grain event notion that we’re
trying to get at so we started out with maybe a thousand in phase one and we
end up with maybe ten of those labels in Phase three
so again document level labels Dr. Beeler described the goal that no one
ontology would serve it was not clear from the scenario how the user would
provide information about the type of events that are of interest is it
expected that this will occur through limited feedback please describe the
envisioned flow so this might be getting into the weeds but the
annotations are the ontology right so if you remember back to the analyst one and
two scenarios that I discussed at the very beginning of the presentation the goal
here is to recreate something that looks like a flow of an analyst citing things
and hoarding things throughout their daily workflow so by doing those things
they’re developing an ontology based on what it is they do in their day-to-day
job so again the annotations and these kind of determinations of relevancy and
things like that are the ontology I know that gets tricky
because we have to test and evaluate things and we have to define the scope
of what is a protest or what is an attack so for us the T&E team the
government team I have an ontology in my head that I will annotate things to but
you will not know what that complete ontology looks like because again the
annotations are the ontology in describing the program the PM focused
on events as the trigger for information extraction will we need to extract
documents not related to a specific event for example extract opinion pieces
on a trend biographical profiles retrospective analyses found in news
like sources for information retrieval probably again this gets back to that
concept of orthogonal information so even if it’s not kind of a newswire
piece describing an attack that happened in country X maybe someone would want to
read the bio of a general or something like that so that is a possibility in
terms of information extraction also may be but there will have to be some
determination of does this document or sentence or paragraph contain an event
that actually happened some of the examples like this are the
retrospectives of pearl harbor right so you probably don’t want to extract that
as an event that happened on a date that is today or something like that so there
will need to be some concept of this is an event that happened this is an event
that did not happen and again this gets back to the point that the annotations are
the ontology so hopefully and this might not pan out but my assumption
is that the annotations will help pan that out somehow so the things that we
annotate and things that we say we want extracted will kind of fall out in the
learning of what is an event that happened all right so this is a cluster about
the human-in-the-loop feedback so I’ll say at the top of this that I now with this
stack of cards fully recognize that it’s something that hasn’t been fleshed
out enough and will be more fully fleshed out in the BAA what those
interactions look like what we expect from you and things like that so I
promise that will be clarified in the BAA I’ll provide answers to these
questions as best I can now but just stick with me on that and we’ll provide
a lot more guidance in the BAA so you can write good proposals that will actually
capture the things that we want everyone cool on that awesome I saw like two
heads nod so I’m gonna assume I’m good is more complex feedback from analysts
in scope for example rating ranking scale of one-to-five graphical
interaction basically can we solicit non yes or no feedback so I wrote a
note on this that said maybe scalar and then I started thinking about it and this
is kind of what spurred my thinking that we need to clarify this a lot more
because I originally was thinking about yes or no annotations again to avoid
becoming Clippy basically right hey I see you’re trying to write a report you
want some help with that thinking that analysts can do yes-or-no votes but we’ll
give a lot more thought to what those annotations and revisions look like I
like the idea of scalars as opposed to binary I don’t know exactly what
graphical interaction would look like but I’ll think about it and again provide a
lot more clarity in the BAA human-in-the-loop interaction is restricted to yes or no
clarifying questions in the draft BAA while this permits classic active
learning interactions there is little room for innovation and finding more
creative methods for utilizing human-in-the-loop might there be more flexibility in
the eventual solicitation yes so that’s kind of what I was getting at with
clarifying in the BAA what we want and perhaps opening up the scope and letting
that be another area that people can provide innovative solutions to so I’ll
consider that and see what’s in what would be in scope of the program and
what’s possible and what we can test and evaluate and all those things so again
thank you for all this feedback early into this these are a lot of great points
in the human-in-the-loop stage how many humans
per retrieved document is incentivization or gamification in scope unclear on the
number of humans so I know people would probably want multiple annotators to get
some inter-annotator agreement numbers I’ll think about that and then
incentivization or gamification no not in scope so that is one thing that I can
say now we don’t really want to dive down the rabbit hole of gamifying
annotations or interfaces with this program I need a drink of water
with regard to human interaction will
the performer teams be able to structure and determine the questions posed to the
humans or will the government define the scope of allowable questions and force a
common set to be used across all performers
will the acceptable question scope be at the level of event classification or at
the level of document relevancy so I’ll answer the second one first both
information extraction and information retrieval will be able to have some
human-in-the-loop interaction and then that first question about structuring and
determining the questions again that goes back to the running theme with this
group of questions I’m now thinking more and more that we will open
it up to be you know tell us what you think we should do for human-in-the-loop
interactions and open that up again there are some constraints because we do
have to test and evaluate this thing at some point so we can’t open it up
completely but we will define that a lot better in the BAA I think that’s three
times I’ve said better today can you expand on human-in-the-loop functionalities or
benefits for this proposal benefits okay so functionalities we’ll clarify in
the BAA this goes to everything else benefits if I’m understanding the
question correctly the context is that it’s been my experience
that machines do better when we have humans on the loop or in the loop and
nothing that we do right now is quite there in terms of doing everything that
a potential analyst would want to do in an IC context so we’re trying to
integrate knowledge from a human that you know we’re not at singularity or
something so humans still have some knowledge that the Machine won’t have so
we’re trying to incorporate some of that knowledge that’s in the human brain to
make these algorithms better and more applicable to an individual’s context
and with that microphones more questions feel free to ask wave at me I will call
on you as you ask well yeah just feel free to line up if you can if you can’t
line up at the microphones jump or wave in terms of the
human-in-the-loop would there be any interest in modeling human annotators
of varying abilities so you might have the expert who’s the most reliable and
somebody who’s less reliable and using them with some kind of combination not
within the scope of BETTER that is of general interest to me
so again Paul mentioned that we do other things like seedlings and stuff like
that so would more than happily entertain ideas like that but not within
the scope of better we do have to kind of narrow this down at some point let’s
just go back and forth alternating so in your use cases you have two
scenarios with two analysts where the relevancy annotations that come on the
labeled data that is not part of the human in the loop feedback you’re just
giving that to us does it represent a single analyst perspective or does it
represent the individual analyst perspective of those two scenarios in
reality what will actually come out of our annotation for test evaluation it
will likely be something that resembles a single analyst determination of
relevancy whether you would allow the performers to have access to the
evaluation data while the evaluation was going on so we could compare our results
to what the evaluation team is producing I hadn’t thought of that
but that seems completely feasible as long as we’re releasing it after you’ve
already delivered all your stuff I’d be happy to entertain that thought as the
number of annotated cases gets smaller and smaller is it do I understand
correctly those will also be more and more finely grained yes yes yes so I do
recognize that there’s like diverging signals there right so you’re getting
more in depth things that you have to annotate and you’re getting less input
and again that’s to capture this notion that analysts tend to care about very
specific things but they’re also not going to annotate you 10,000 examples of
that really specific thing that you care about you emphasized multi domain
in addition but all the examples were I guess by necessity focused on a
particular protest political unrest how I guess how wide is a definition of
multi-domain in your operator domain or is it always in the political security
yeah so again spoilers
if anyone googled me I do political events so political events will likely
be a major domain in the program but there’s two more and as I mentioned at
the beginning that could be cyber security events biomedical events things
like genes interacting with genes to mutate a protein that’s an event right
so the domains that we’re considering right now are fairly broad but you’d be
pretty safe to assume that as the language in the BAA and at this proposers’
day has indicated political events will be within scope of the program so
there will be a lot but probably political events as well the HMI
component is the focus of the active learning to improve the ability of the
system to help the person in the future or to improve the speed at which you
achieve a certain task ie is it sort of a training oriented
active learning or a test situation active learning yeah yeah I know
I knew the trap I was walking into probably test it’s not gonna be
both probably and to make it better for the analyst in the future is
probably what we’re gonna be looking at in terms of the test and evaluation but
I’ll make a note on the back of someone’s card this this card to make
sure that that’s clear in the BAA you mentioned the constrained and
unconstrained regimes is that a division of proposers or will every
proposer actually have to solve something in both unconstrained and constrained
well you don’t have to solve yeah you can solve so basically what we want to
know is what’s your score if you can only use these resources if you choose
to what’s your score if you use anything so in theory using anything could make
it worse it can make it better it could keep it the same so you don’t have to
solve both we’re just providing that option to solve both yes
you will have to solve constrained you will have to use the resources we point
out but you don’t have to forage for new resources can we assume that a GPU
will be available if so have you given any thought to releasing a common NVIDIA
docker image you mentioned a 70% limit of performance and I was wondering if you
were thinking about inter-annotator agreement as a way to cap the performance you
expect to achieve rather than trying to get 100% use what two humans competing
against each other would achieve in biomedical document annotation 65% is kind
of the good number and it’s very hard to break it for political event annotation
80% is the number you see everyone gets 80 percent inter-annotator agreement it’s
spooky I had thought of that but that might be
another baseline model so if people are familiar with SQuAD the Stanford
Question Answering Dataset they have the human baseline for what a human
could reasonably do so as I
mentioned there will probably be
multiple baseline models provided within the program so that could be one I like
that idea and that is another potential baseline that we can include how
well could a human do on this task thank you is there any consideration given to
iterating with the single analyst over multiple tasks sorry I wanted to write something else
down before I forgot ideally yes in terms of practical budget time and actually
getting people on the thing that’s probably not gonna be possible within
the scope of the test and evaluation that we have to do for this program yeah
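On the inter-annotator agreement figures raised a few questions back (roughly 65% for biomedical annotation and 80% for political events), the audience member did not say which agreement measure those numbers refer to. As a rough sketch only, the two most common candidates are raw percent agreement and Cohen's kappa, which corrects for chance agreement. The function names and the toy labels below are assumptions made for illustration and do not come from the program.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical event-type labels from two annotators on five sentences.
annotator_1 = ["attack", "protest", "attack", "none", "protest"]
annotator_2 = ["attack", "protest", "none", "none", "attack"]

print(percent_agreement(annotator_1, annotator_2))
print(cohens_kappa(annotator_1, annotator_2))
```

A human baseline of the kind discussed would score a second annotator against the first in the same way a system is scored, so a system's number can be read against what two people achieve on the same task.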
we still have plenty of time so I’m not trying to herd people out of here but if
there are no more questions you can be free to roam and going once twice
awesome oh one more not so technical but for someone who’s never done this before
are there guidelines or limits on the scope of a consortium how big the teams of
organizations can be one to five you know and also I’m still not clear about the
budget are we just proposing a budget and are there any guidelines for that so
on in terms of size of the team no we don’t have any a priori you can only
have two members of a team or anything like that
that’s free for you to decide yeah yeah organizational members we in terms of
the budget you do propose a budget I mean we do evaluate on kind of resource
realism so if you tell us that your team of a hundred people will do this for ten
dollars it’s not believable also if you tell us it’ll take five hundred million
dollars also not believable so we do evaluate on that but we do not tell you
upfront here’s our budget and you know you propose against the proportion of
that so you tell us what you think your realistic budget is and as part of our
evaluation since we have time I presume the training data will be provided to
participants or teams at the beginning of each phase what about the
test data is annotated test data provided no though that earlier
question did ask if after things are delivered we could then release the
test data and that is something I’m open to but before you’ve delivered your
software we won’t release that and I’m about to flip back a lot so bear with me
on this I just want to clarify so you see those red stars that’s when we drop
data on you so the data will be dropped at multiple times within each phase so
again the data that’s released at the beginning of each phase will be the data
for the rest of the phase but we will provide new annotations on that data
over the life of the program or over the life of the phase if that makes sense
yeah yeah and we won’t release that
if we release it at all it wouldn’t be released until the final
phase T&E period at the end because again we don’t want to kind of show you
our cards because we want to do this true blind evaluation of how well you
can do on unseen documents all right now I think we’re good so just a reminder
there should be restaurants in your packet that are kind of in this area
that might help you navigate those aren’t recommendations they’re just a
list in the afternoon session I will not be here no people from the government
will be here again I want to reiterate no US government people will be in the
room for this that is not only IARPA that is anyone so that’s really for the
performer community to potentially team and talk things out on their own just
want to reiterate again
