September 1, 2019

Handwriting Recognition: A Perspective on Two Decades of Innovations

>>…us today for the CBIIT speaker series. I’m Mervi Heiskanen, program manager, supporting the Informatics Technology for Cancer Research, ITCR program here at the CBIIT. As a reminder, today’s presentation is being recorded, and will be available via the speaker series playlist on YouTube. You can find information about the future speakers on that site, and by following us on Twitter, @NCI_NCIP. Today, we are happy to welcome Dr. Venu Govindaraju, who is the vice-president for research and economic development at the State University of New York, distinguished professor of computer science and engineering at University at Buffalo, and founding director of the Center for Unified Biometrics and Sensors. The title of his presentation is “Handwriting Recognition: A Perspective on Two Decades of Innovations.” Please go ahead.
>>All right. Thank you so much. Thank you for this opportunity. I am happy to make this presentation on handwriting recognition. We at the University at Buffalo have been working in this area for more than 25 years, and we have made many innovations. Today, there is so much popular science buzz with artificial intelligence, right? I mean, there are articles every day in newspapers, talking about how artificial intelligence is making big strides, and a future with more and more of artificial intelligence is in store for us. Well, I would like to say that almost two decades back, artificial intelligence was not this popular. It was the period of the so-called winter of artificial intelligence, when government agencies were walking away from artificial intelligence research projects and funding, because they thought that it didn’t have as bright a future, or it was not realistic, or it was not working with real-world problems. You know? Most of the AI in those days was based on toy problems. And so, therefore, there was this notion of the winter of artificial intelligence. 
And now, not too many people actually are aware of this, but during the ’90s, we worked on handwriting recognition and actually delivered [inaudible], [inaudible] in the real world. And I will describe that as we move through this presentation. So, when I talk about handwriting recognition, I would like all of us to keep in mind that this is actually a fundamental task that the human beings do, and eventually, it’s reflective of human intelligence. As kids, we start to read and write at an early age, and being able to get a computer to do the same is really, you know, within the purview of AI. So, let me first give credit to, you know, my PhD students. You know, some of their pictures are over here, on this slide. I have actually graduated about 35 PhD students in the last two decades. And many of them have worked on dissertation topics related to pattern recognition. So, pattern recognition is sometimes the older name for machine learning and artificial intelligence. And today, you know, it’s definitely under the umbrella of artificial intelligence. I had a whole bunch of students graduate with dissertations in what we call document analysis. So, this is where handwriting recognition plays a central role. You know, doing OCR, optical character recognition. Reading what is on a piece of paper, a scanned paper, and look at the digital image, and interpret what is written on that image. That’s one way to go. Or, even if it is electronic documents, or if it is graffiti on walls. And, you know, all that will fall under document analysis. And I had many students complete their PhD in that area. And then there are the related topic of biometrics. You know, recently, the biometrics have gained popularity, especially in the iPhone, you know, where you can put your fingerprint in order to authenticate. And then you can use Apple Pay and so on. So, fingerprint recognition. Face recognition. They’re all part of biometrics. 
But even before we did authentication by these modes, banks were looking at signatures for authentication. Signature recognition is part of document analysis. So, these are all areas which are, you know, connected. So, document analysis, pattern recognition, machine learning, with the applications in biometrics. Handwriting is one of the many modalities in biometrics. When you talk about biometrics, it’s essentially about identifying people. So you could identify people based on fingerprints, on voice, on [inaudible], on iris. And, in limited domain applications, based on handwriting, right? I mean, the forensics folks look at handwriting and decide whether the ransom note was written by a particular person, or not. So, handwriting recognition actually straddles some of the biometric [inaudible]. So, this shows you the 37 PhDs that I have graduated, and, you know, the titles of their dissertations. And you can see that handwriting recognition looms large among the topics that they have taken. Okay. So, as I said in the title of the presentation, we want to look at some key innovations. And this is my perspective on what constitutes a key innovation. In our lab, students have published more than 400 papers. And I have just gone through all of those, and decided that the key innovations, over the last two decades, essentially span four primary areas. And then we’ll also look at future applications. The four areas where we have made innovations are these. First, the way we use lexicons, and I’ll describe that further. Second, fusion, which is combining different approaches. Third, indexing and retrieval, which gained popularity, you know, once the World Wide Web became so much in use for searching for documents and retrieving documents. And then, recently, security, you know, captchas. You know, how do you make sure that it is not a bot that is accessing [inaudible] on the Web. 
And these four innovations actually fall nicely into periods, and this is deliberate on my part. The first five years were about lexicons, the next five on fusion methods, and then in the next five, the thrust was indexing and retrieval and security. And going forward, I see huge opportunities for handwriting recognition in personal archives and [inaudible] classrooms. So, let me go over some of these innovations. First, talking about lexicons. So, this idea of lexicons and the use of lexicons actually came about with, in some ways, some serendipity. Because we were given a task of postal automation. This was a large postal project that we received in Buffalo, in the 90s. And the idea was, can you automate the reading of addresses? So, the state of the art at that time, in the early 90s, was such that for machine-typed addresses on envelopes, there was technology available which could read and do a fairly good job, you know. Perhaps it would read 75 to 80% of all machine-typed addresses, and no human intervention was needed. When it came to handwritten addresses, well, not the millennials, but those of us who were around at that time remember writing addresses on envelopes all the time, whether it was for paying a bill, or sending a greeting card, or something. Handwritten addresses were quite common. And I’m told that about 10 to 15% of the mailstream in those days was handwritten. So, the postal service came and said, “Can you develop something for the handwritten addresses?” And we got a contract. And we at the university research lab, with a whole bunch of students and post-docs, put together a system that would read handwritten addresses. And with handwritten addresses, it is not just the ZIP code. It is the entire address block. So, you read the ZIP code. You read the house number. You read the street name. And then the program automatically generates a bar code, which corresponds to that particular address. So this was the task. 
And we were actually successful. And I will play a video here, we’ll see if it works. So bear with me for a second. [ Machine whirring ]
>>So, what that shows us is this: what looks like a tape going through at a very high speed is actually mail pieces flying by in front of a mounted camera. 13 mail pieces per second. So, that is the speed at which the mail is going by. And the camera is taking pictures. And the address images are being sent to this handwritten address interpretation program, to read and interpret that address. So, you can imagine, this is in the 90s. So the computing power is limited. And we used to have some SPARC stations, at that time. And storage was expensive. And so, the algorithms, essentially, were different, you know, at that time. And I’ll draw that comparison later, in some of the view graphs. So, here is how the algorithm worked, and I’ll tie it back to lexicons in just a minute. So, look at this example address. It has a personal name, which we ignore, because at the same address, you could have multiple people receiving the mail. Then you have the house number, and the name of the street. Then you have the city, state and ZIP code. Also notice that it is written without any guidelines. Right? I mean, there are no ruled lines. There are no boxes to fill in characters. So, in some sense, it is freestyle, writing in the wild. Granted, there is some structure to the address itself: the ZIP code is always going to be on the last line, and so on. So, we had programs which would locate the address, which can be non-trivial if you are looking at magazine covers and other colored backgrounds. And it would generate that bar code that you see at the bottom. That bar code corresponds to an 11-digit code, which uniquely identifies this address. So, you have the ZIP code, and the house number. If you recognize those two entities, then together they can give you the list of possible street names. 
You read the street name, and then you can come up with an 11 digit bar code that corresponds to this address. So, here take notice that the ZIP code and the house number, those are numerals. And you can do numeral recognition. There are only ten classes, zero to nine. So sometimes a small class pattern recognition method could work over here. You recognize the digit, and you’ve got the ZIP code and the house number. Now, how does one recognize the street name? What we read as Cedar Run Drive, on the address, [inaudible] in a pixel map, right? So, how does the program figure out that it is C-E-D-A-R R-U-N D-R-I-V-E? Notice that the characters are written in a cursive fashion. So the first C and E could very well be a U, together. And the D could very well be an A. And the E could very well be an L. And the R could also be an L. And so on. So, those characters are written in a cursive fashion, in a natural way. And the task of the automation system is to read it. Okay, so how do we do this? So, this is where we actually do what is called dynamic lexicons, okay? And this is how the concept works. And this is a key innovation which made postal automation for handwritten addresses possible. Okay? So, look at this new address. So, I recognize the ZIP code, 14213. I recognize the house number, 74. Then I look up the postal directory. And it turns out that in that ZIP code, there are only about ten streets which have the house number 74. So, this gives me a lexicon. And that lexicon has about ten different street names. And the task of handwriting recognition becomes one of essentially choosing between these lexicon entries. So, I have generated this lexicon in a dynamic fashion. Because if you change the ZIP code, I would get a different set of streets. And it may not be 12 streets. It could be 20. It could be five, depending on the ZIP code, depending on the house numbers, you get a new list of possible streets. 
So now, handwriting recognition is just a task of pattern matching. I look at all the pixels in the snippet to the right side of the house number, and I try to match it against each of these lexicon choices. So, we match the pixels with Bradley Street, Colonial Circle, Dewitt Street, and so on, and figure out which is the best match. And if I do it correctly, the correct match will be Livingston Street, and the postal directory tells me that that will be an add-on code of 1653. So, my 11-digit code for this address would be 14213, that’s the ZIP code. 74, that’s the house number. And 1653, that’s the add-on. Together these 11 digits make up the bar code that uniquely identifies this address. So, this is where the innovation lies. This handwritten address could have been destined to any of the almost 300 million destination points in the United States. Those are the 300 million points where potentially, a mail piece could be sent. Now, we take that task of choosing between 300 million classes, and reduce it down to a task of just 10 to 12 classes. Because we have already recognized the ZIP code and the house number, it’s in some sense a divide and conquer. And we use the postal directory intelligently. And we come up with a smaller list. This is the notion of generating the lexicon dynamically. That means the lexicon is not fixed. It is not a static lexicon, but keeps changing. But it is, most of the time, a small number. So handwriting recognition is reframed, reformulated as a pattern classification problem. So, this is essentially a key innovation. It turns out, 30% of the ZIP codes contain fewer than 100 street names to begin with. And the maximum is about 3000. So, you have reduced a 300 million-plus problem into a problem of choosing from a lexicon of no more than about 3000 streets. And very often it is, you know, in single digits. 
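The dynamic-lexicon idea described here can be illustrated with a short Python sketch. It is a toy under stated assumptions, not the deployed postal system: the `POSTAL_DIRECTORY` table is hypothetical (only the ZIP code 14213, house number 74, and Livingston Street's add-on 1653 come from the talk; the other add-ons are made up), the "image" is stood in for by a noisy text reading, and the matcher is plain string similarity from the standard library rather than pixel matching. The 11 digits are concatenated in the order the talk lists them.

```python
from difflib import SequenceMatcher

# Hypothetical toy directory: (ZIP, house number) -> {street: 4-digit add-on}.
# Only the Livingston Street add-on (1653) is taken from the talk.
POSTAL_DIRECTORY = {
    ("14213", "74"): {
        "BRADLEY ST": "1201",
        "COLONIAL CIR": "1415",
        "DEWITT ST": "1530",
        "LIVINGSTON ST": "1653",
    },
}

def dynamic_lexicon(zip_code, house_number):
    """Streets that have this house number in this ZIP code."""
    return POSTAL_DIRECTORY.get((zip_code, house_number), {})

def match_score(noisy_reading, candidate):
    """Stand-in for image-to-word matching: string similarity in [0, 1]."""
    return SequenceMatcher(None, noisy_reading, candidate).ratio()

def interpret_address(zip_code, house_number, noisy_street, reject_below=0.5):
    """Return the 11-digit code, or None to send the piece to a human keyer."""
    lexicon = dynamic_lexicon(zip_code, house_number)
    if not lexicon:
        return None
    best = max(lexicon, key=lambda street: match_score(noisy_street, street))
    if match_score(noisy_street, best) < reject_below:
        return None  # "none of the above"
    # ZIP (5 digits) + house number (2) + add-on (4), as listed in the talk.
    return zip_code + house_number + lexicon[best]

print(interpret_address("14213", "74", "LIVNGSTCN ST"))
```

The point of the sketch is the shape of the computation: the recognizer never chooses among all streets in the country, only among the handful that the already-recognized ZIP code and house number allow.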
So, I would like to contrast that innovation, to drive home the simple but very powerful innovation to the task of bank check recognition. And many of us remember them. Today, you can actually have a handwritten check like this, and you can go to the ATM machine, and the machine very often will read the numerals and it will just want you to confirm that the amount it has read is correct or not. Okay? So, even in those days, you know, like in the 90s, the bank check recognition was done by reading the numerals, and then validating it against the legal amount, which is written in the line below. This task is very different from what I just described to you as the postal innovation. Because in the bank check case, those possible words that can appear are limited to 40. And the list is shown over there. No matter what the amount is, and how high the amount is, it is usually these, a combination of these words that you will see, on the line written below. So, this is a static lexicon. And so, the techniques that are for bank check recognition are very different from what we did for postal automation. In fact, if we had tried to use this method of bank check recognition that is used with static lexicons, the problem of postal automation would be intractable. Because now you are talking about millions and millions of streets that are possible in the United States. And to consider all of them at the same time. So, a similar idea. You know in handwriting, we have used census forms recognition, in prescription forms for medicine. And I’ll show you some examples. I know you’ll be very interested in seeing how the recognition methods perform on doctor’s handwriting, which is notoriously sloppy, as it goes. And also other application. For example, registry of land ownerships, and so on. So, there are many applications where you can bring a lexicon to bear and do recognition, as long as the lexicon is not too large. You can have improved pattern matching. 
In the next two view graphs, I will show you how the dynamic limited lexicon [inaudible] approach, which is very different from what you’d expect in a static lexicon case. Here, you have an example cursive word, written. And let me just tell you, it is the word W-O-R-D. Okay? And you can see, there is so much ambiguity. And what is done is, you segment the word at potential segmentation points. That means, where do you believe that a character ends in this word? So, potentially the curvatures tell you where you could have a break or a transition between two different characters. So you come up with lines at potential points, and the idea then is that you find a path through these nine points, or the nine nodes, in such a way that you account for all the pixels in the word image. So, I could say that the first character is just between one and two. It could be between one and three. It could be one and four, and so on. And each of these groups of segments is then sent to a character recognizer which is capable of recognizing the alphabet a to z, and gives a score. So now, you can see what is the best path. And your best path could be, let me group segments one to four, and that, the character recognizer calls a W. Then, let me group four to six, and six to seven, seven to nine, and so on. There are many different possible paths. I could have just said one to two, and called it the letter C, for example. Now, the dynamic lexicon helps us as follows. Let us assume that the lexicon is on the left side. And it has five words. Notice that none of those words starts with the letter C. And therefore, I am not even interested in the segment from one to two being called a C. So, my question to the character recognizer is not “Tell me which character from A to Z it is.” My question is simply, I look at the lexicon. I notice that all of the words in the lexicon start with either a W or an H or an S. 
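The lexicon-driven search over segmentation points can be sketched as a small dynamic program. This is an illustrative Python sketch, not the lab's implementation: `TOY_SCORES` is a hypothetical stand-in for a real character recognizer, the five lexicon words are made up (the talk only says they begin with W, H, or S), and averaging the per-letter scores is an assumed way to compare words. The `queried` set records which letters the recognizer was ever asked about.

```python
from functools import lru_cache

# Hypothetical recognizer output: confidence that the strokes between two
# segmentation points form a given letter. Anything unlisted scores 0.05.
TOY_SCORES = {
    (1, 2): {"c": 0.8},
    (1, 4): {"w": 0.9, "h": 0.2, "s": 0.1},
    (4, 6): {"o": 0.8, "a": 0.5},
    (6, 7): {"r": 0.7, "l": 0.4},
    (7, 9): {"d": 0.9, "k": 0.3},
}
queried = set()  # letters the recognizer was actually asked about

def char_score(start, end, letter):
    """Ask: how good a `letter` is the span between points start and end?"""
    queried.add(letter)
    return TOY_SCORES.get((start, end), {}).get(letter, 0.05)

def fit_word(word, first=1, last=9, max_span=4):
    """Best average letter score over all paths that use every segment."""
    @lru_cache(maxsize=None)
    def go(point, idx):
        if idx == len(word):
            # A valid path must account for all pixels, i.e. end at `last`.
            return 0.0 if point == last else float("-inf")
        best = float("-inf")
        for end in range(point + 1, min(point + max_span, last) + 1):
            best = max(best, char_score(point, end, word[idx]) + go(end, idx + 1))
        return best
    return go(first, 0) / len(word)

def recognize(lexicon):
    """Pick the lexicon entry that best explains the word image."""
    return max(lexicon, key=fit_word)

LEXICON = ["word", "ward", "work", "home", "some"]  # hypothetical entries
print(recognize(LEXICON), "c" in queried)
```

Because the search is driven by the lexicon, the span from point one to point two is never tested as a C, even though the toy recognizer would have scored it highly. A different lexicon over the same image would lead to a different set of questions.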
So, my question is simply, how good a W is this? How good an H is this? How good an S is this? And I do the same thing as I move along. And I try to fit those words of the lexicon, and see how they can be accounted for by the pixels in the image. So, if you gave me a different lexicon, which is dynamically generated, but gave me the same word image, then the questions I ask would be different. And that is where the innovation lies. Based on the dynamically generated lexicon, we are able to adapt the question that is being asked of the character recognizer. So, that is the first innovation. Another innovation with lexicons was, let us look at some interactive features. How do human beings choose between a set of choices? After all, a lexicon is nothing but a set of choices. So, say I have a word image, and the only ability I have in my image processing routines is to tell whether there is an ascender or a descender. An ascender is, as you can see in the letter b, a stroke which goes up. Or in the letter q, the stroke comes down. So, coming down is a descender, going up is an ascender. So, we all know in the alphabet, there are a few letters which have ascenders and descenders. Others don’t. And the letter f might have both an ascender and a descender. So, what I could do is, I could simply extract the ascenders and descenders, so I know there is an ascender in the beginning, and a descender somewhere in the second half of the word image. And if you now give me choices, I can tell what that word is. Okay? So, if you give me a city, like, let’s say, Hyderabad. Then I know it’s ascender, descender, ascender, and so on. Now, if I gave you just the feature. Say that there are a whole bunch of ascenders. Can you tell me which city name it can be? It’s going to be impossible, right? If I had only a static lexicon of every possible city in the world, there is no way you can tell. 
However, if I have a dynamically generated, limited lexicon, and I told you that it is one of these four cities, then you can do a matching, and you can say, ah-ha. The way the descenders are distributed here, it is the city of Kolkata. Now, if I have a different feature map, I could say, well, this is the city of Patna. Again, you know, a different one. Maybe it is Dehradun, and so on. Now here is a question that, with a live audience, I always ask. If this is the profile of a city name in the United States, can you tell me which city name it is? And depending on where I am giving this talk, I see a few hands go up. But if I’m giving this talk in Buffalo, all the hands go up. Because it is Buffalo. Now, if I actually gave you the choices, many of you in the audience would also recognize this as Buffalo. Because you would see, well, if it is Amherst, well, Amherst does not have a descender. What about Buffalo? Yes, the Fs can be descenders. Boston? No descenders. So, given the choice A, B or C, it must be Buffalo. So, this is the power of the lexicon. What you have done is, you have interactively gone back and forth, between the lexicon and the features you have extracted, eliminated some of the choices, and figured out the right choice. See how different it is from recognizing every letter in the word that is written. That is not needed when you have a small lexicon. So, what we did in the postal automation task was, we reduced the problem of recognizing street names into a small set of choices. And because it has to be able to read the handwriting of pretty much anyone, we do not rely on reading every letter in the street name. We simply pick up a few features, go back and forth with the lexicon, and recognize the correct choice. And as in all multiple choice questions, and students can vouch for this, guesswork is a good strategy when you go to an exam where it’s all multiple choice questions. 
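The ascender and descender matching described for Amherst, Buffalo, and Boston can be sketched in a few lines of Python. The letter classes below are assumptions made for illustration (capitals are treated as ascenders, and a cursive f as having both strokes, as the talk suggests), and the observed profile is supplied by hand rather than extracted from an image.

```python
ASCENDERS = set("bdhklt")   # assumed letter classes for illustration
DESCENDERS = set("gjpqy")
BOTH = set("f")             # a cursive f can rise and descend

def profile(word):
    """Coarse shape of a word: A = ascender, D = descender, B = both."""
    out = []
    for ch in word:
        if not ch.isalpha():
            continue
        if ch.isupper() or ch in ASCENDERS:
            out.append("A")
        elif ch in DESCENDERS:
            out.append("D")
        elif ch in BOTH:
            out.append("B")
    return "".join(out)

def choose(observed_profile, lexicon):
    """Return the single lexicon entry with this profile, else None."""
    matches = [w for w in lexicon if profile(w) == observed_profile]
    return matches[0] if len(matches) == 1 else None  # None = "none of the above"

cities = ["Amherst", "Buffalo", "Boston"]
print({city: profile(city) for city in cities})
print(choose("ABBA", cities))
```

With a three-word lexicon, a profile alone is enough to separate the candidates. Against a static lexicon of every city name, the same profile would match many entries and `choose` would return `None`, which is the behavior the talk describes.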
But when a choice of “none of the above” is also inserted, it becomes tricky, right? Because then it opens up other possibilities. And the same thing happens in postal automation. We should be able to tell “none of the above” and reject it, so that a human or a manual keyer can look at those difficult cases.
>>Here is how I explain this whole method of interactive features. Let us now look at the digit recognition problem. So, let me say that I’m not interested in all the 10 digits, zero to nine. Depending on the application, maybe it’s the [inaudible] application, maybe it’s the date field. Maybe it’s a social security number. Some other contextual intelligence has said that it is only one of these three choices. Okay? It’s a three, or a five, or a seven. What I could do then is, I could take that digit image. If I am discriminating between a three and a five, notice that the lower part of the image is exactly the same. If I’m discriminating between a five and a seven, notice that the very top part is exactly the same. So, depending on which digits are being discriminated, the features that I should be seeking have to be different. And that’s where we come to these multi-resolution features. You tell me your context, and I will then tell you which part of the image you should focus on. So, here is how it works. You know, in a very simplified description. I’ll take the digit image. Take its profile, the contour. Identify the points where the contour changes direction sharply. Connect them with line segments. Put a four-by-four grid on top of it. Actually, a two-by-two grid on top of it. And then have also a center piece, so that I have five different areas of that digit image that I can extract features from. And now, depending on what you tell me [inaudible], I will extract features from that particular zone of the digit. And this will be done in an automated way, where I extract the features from different boxes. 
And if I cannot make a decision, I divide that smaller box further into four more areas, and so on. And then I [inaudible] any of the features from any different zones. And without going into the details of the algorithm, I have a lookup table for all the [inaudible] in all the different zones, what kind of features I can expect. So, I can quickly say it is a three. And this is something like the A* (A-star) algorithm in artificial intelligence. And I can tell you what the digit is. So, to recap. Interactive lexicons. You can have very simple features. Like, what is the length of the word? Does it have two parts, like New Delhi has two parts, Patna has one part. How many ascenders? How many descenders? Do they have loops in them? And so on. And just based on that, I can do the recognition. So, this had serious impact on the whole handwriting recognition task. Notice that almost two decades ago, with small lexicons, we were already getting very high accuracies. When the lexicon size increases, as you can see from the description I gave you of the algorithm, the performance is going to fall. But, if you can bring context into play, and keep the lexicon small, we can get high performance. Today, this is the impact that we have had from the dynamic lexicons. Today, the recognition rate in the postal domain is more than 95%. That means 95% of all mail, whether it is machine typed or handwritten, is completely read by a program. No human eyes are laid on that image. The interpretation is done completely automatically, and the bar code is generated. And that is a huge artificial intelligence success story. It started about two decades back. I have a video over here. Let me see if I can play this.
>>And the innovations coming out of this research have been shared, and need to be shared, all along. In the late 90s, the mid-to-late 90s, the team at SUNY- Buffalo, we started working on this dream of looking at letters, handwritten letters addressed in the wild. 
So we actually detect them and save hundreds of millions of dollars, which indeed, they sowed the spark of doing back then. And now, as we speak, 25 billion letters per year are scanned in machines in the US postal service. And we know how much that’s saving, in terms of efficiency and dollars.
>>So, this is a video clip presented by Eric Horvitz, from Microsoft, at the conference for the scientific committee in Washington, DC last year. And this was shown as one of the exhibits of the success of artificial intelligence, almost two decades ago. Okay. So, now, that was the first innovation. Now, let me move on to the next one. And this is about fusion. And once again, you can see, just like in the dynamic lexicon case, I’m sure many of you can think of a whole bunch of applications and areas where that strategy could work. Fusion is essentially about taking different expert opinions and coming up with a single, final decision. So, going back to the lexicon example, I could have a case where, if I have N words in my lexicon, my first recognizer reduces it to maybe a small percentage of what I started with. And then that smaller lexicon can be fed to other recognizers, and reduced further. So the questions that can be asked are, what is the best reduction each recognizer should make? Because clearly, as the lexicon size reduces, our accuracy will increase. However, we also know that if the [inaudible] recognizers are looking at the same features, then just reducing the size of the lexicon is not going to help, because the subsequent recognizers are going to face a denser lexicon. Lexicon entries which look very similar. So, there are a whole bunch of research questions that need to be addressed. For example, I could have two lexicons of the same size. Both have five words in them. In the first case, they all are two letters wide. And this is a made-up example. In the second case, you know, they have varying length. 
So, if word length is one of the features, then clearly lexicon one is very, very dense. And lexicon two, the word length can be used as a feature to discriminate. So, there is a connection between the density of the lexicon, the features that the classifier or the recognizer is using, and we have looked at various methods to come up with the best strategy in combining recognizers. So, it’s possible that different groups develop different recognizers. Some of the features that they use overlap. Others are different. And so, every case is, is, has to be dealt with separately over here, where you figure out which recognizer should go first in the pipeline, which should go later. How should the reductions take place? And so on. And we have published in that area. Now, also what we did — and this goes to the medical domain — and I am sure you can relate to this. This is from a medical form. I have extracted word snippets. And all of these are examples of the word “chest pain”. And this is taken from a medical lexicon. Now, let us say I don’t have a lexicon. And I have a form with medical terms in it. What can I do? So, we came up with a method of bootstrapping, where we said, okay. Just try to recognize some letters in the word. So, you can see, in the right side, the blue shows the correctly recognized, and the red shows incorrectly recognized. And then, we can bootstrap. What we — we can estimate the position of that letter in the word. And, based on the context, go and look up a medical dictionary. So, if this is the group of [inaudible] cardio-vascular system, I can then look up words and figure out what is written. So, we were able to do this with a whole bunch of medical categories. And I’m not sure if you can see those categories over here. And we were able to come up with lexicons for all of them. And then, using this bootstrapping, we were able to get about 8 to 10% jump in our recognition rate. This was an application for medical forms. 
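The bootstrapping step described here (recognize a few letters, estimate where they sit in the word, then consult a category lexicon) can be sketched as a filter. This is a simplified Python illustration, not the published method: the category name, the lexicon entries other than "chest pain", the tolerance, and the representation of a recognized letter as a `(relative_position, letter)` pair are all assumptions.

```python
# Hypothetical category lexicon; only "chest pain" appears in the talk.
MEDICAL_LEXICONS = {
    "cardiovascular": ["chest pain", "palpitations", "hypertension",
                       "chest tightness"],
}

def supports(word, anchors, tolerance=0.15):
    """True if every confidently recognized letter occurs in `word` near
    its estimated relative position (0.0 = word start, 1.0 = word end)."""
    if len(word) < 2:
        return False
    for rel_pos, letter in anchors:
        positions = [i / (len(word) - 1)
                     for i, ch in enumerate(word) if ch == letter]
        if not any(abs(p - rel_pos) <= tolerance for p in positions):
            return False
    return True

def bootstrap_lexicon(category, anchors):
    """Shrink the category lexicon to entries consistent with the anchors."""
    return [w for w in MEDICAL_LEXICONS.get(category, [])
            if supports(w, anchors)]

# Three letters read with confidence: 'c' at the start, 't' about a third
# of the way in, 'n' at the very end.
anchors = [(0.0, "c"), (0.33, "t"), (1.0, "n")]
print(bootstrap_lexicon("cardiovascular", anchors))
```

The reduced list then plays the same role as the dynamically generated street lexicon in the postal case: the full recognizer only has to choose among the surviving entries.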
And here are some examples. This is a fax of a medical prescription. If I just had to read it, without knowing that this is a medical term, or the name of a medicine, I’m not sure I could read any of these. But notice that we have actually recognized them. On the right side, the results are shown. Using the bootstrapping, and dynamically generating lexicons, we were able to recognize the first word as glyburide, and in the second case, fosinopril, and in the last case, metoprolol. And this was possible. And look at the power of the lexicon over here. That not so carefully written medical term, the name of a medicine, has been recognized accurately. So, that opens up a whole bunch of applications for us in the medical domain as well. Now, if this was a more engineering-based technical audience, I would have discussed the approach we take for combining different recognizers. So, if I have lexicons, and [inaudible] listed one to N, and I have recognition methods C1 to Cn, each recognition method independently sorts the lexicon based on how well it matches an image. And it returns a score of the confidence it has in that particular match. So, every recognizer or classifier comes back with a vector of the lexicon, rearranged in the way the entries have been scored. The task of fusion then is to derive the function F, which in the best possible way takes these nine score column vectors, sorry, the N score column vectors, combines them somehow, and returns a single vector. Okay? So, that essentially is the task of fusion. So, a learning-based approach would be to learn that function F. Previously, [inaudible] approaches have been used, where they would say, simply take an average, or just add them up, or take the highest, and so on. But none of them are optimal. Clearly, you need to come up with a function that can be learned, and we have developed techniques of learning that function. Now, in this classification task, I’m sure you must have figured this out. 
There are two ways in which you can approach the problem. One is, I ask you to verify a particular answer. So, I give you a word image. And you just ask the question “Is this this particular entry?” Okay? That’s called a verification task. Or, you do not do any of that. You simply ask me to recognize what is written. And that’s the identification task. Now, here is how the two tasks play out. We are doing verification all the time, when we are logging in on our computer, right? We give a username. There is a password that is already stored with that username. So, when I type my password, it matches only against what is stored against that username. This is verification. On the other hand, if you say, I’m not going to give you a username. I simply am going to give you a code, and you tell me who it is. That is the identification task. Without going into the math, let me just say here that we have come up with the optimal function F for the verification task, which has been shown to be the likelihood ratio. And for the identification task, we have said that a closed-form solution does not exist. And what you need is an iterative method. I’ll skip the details. They are clearly laid out in the papers that show up on the view graph. In the verification task, as I said, the likelihood ratio works out to be very, very good. And in other cases, it is an iterative method. We also show that the same approach of likelihood ratio in fact can hurt the fusion, if you use it in the identification task. Again, I don’t want to get into the details, but it could even perform worse than the single recognizer. So, here in the table below, you can see. C1 and C2, on their own, can do 55% and 77%. And the likelihood ratio does 69%. So, it is smaller than C2. So, this can happen. And so, that’s what our paper is about. And we’ve had, we believe, tremendous impact in talking about what method of fusion should be used when combining different recognizers. 
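To make the fusion function F concrete, here is a minimal Python sketch contrasting a fixed rule with a likelihood-ratio rule for the verification task. It rests on assumptions that are not from the talk: each recognizer's genuine and impostor scores are modeled as Gaussians fitted to a handful of made-up training scores, and the recognizers are treated as independent so that their log-likelihood ratios simply add. The learned functions in the published work are more general than this.

```python
import math
from statistics import NormalDist

def sum_rule(score_vectors):
    """Fixed-rule fusion: add the recognizers' scores entry by entry."""
    return [sum(column) for column in zip(*score_vectors)]

def fit_llr(genuine_scores, impostor_scores):
    """Log-likelihood ratio for one recognizer, assuming Gaussian scores."""
    genuine = NormalDist.from_samples(genuine_scores)
    impostor = NormalDist.from_samples(impostor_scores)
    return lambda s: math.log(genuine.pdf(s)) - math.log(impostor.pdf(s))

def verify(scores, llrs, threshold=0.0):
    """Accept the claimed entry if the combined log-likelihood ratio is high.
    Summing the per-recognizer terms assumes the recognizers are independent."""
    return sum(llr(s) for llr, s in zip(llrs, scores)) > threshold

# Made-up training scores for two hypothetical recognizers, C1 and C2.
llr_c1 = fit_llr([0.80, 0.70, 0.90, 0.75], [0.30, 0.40, 0.20, 0.35])
llr_c2 = fit_llr([0.60, 0.70, 0.65, 0.80], [0.40, 0.30, 0.50, 0.35])

fused = sum_rule([[0.2, 0.9, 0.1], [0.3, 0.6, 0.4]])
print(fused.index(max(fused)))                    # top lexicon entry by sum rule
print(verify([0.85, 0.70], [llr_c1, llr_c2]))     # scores typical of a true match
print(verify([0.25, 0.35], [llr_c1, llr_c2]))     # scores typical of a mismatch
```

The sketch covers only verification, where one claimed entry is accepted or rejected. As the talk notes, applying the same likelihood ratio to identification can do worse than a single recognizer, so this should not be read as a recipe for ranking a whole lexicon.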
Once again, I’ll leave the details to the paper, and I’ll move on. These are very theoretical results on how you take the scores from all the classifiers and combine them. Let me move now to the third innovation. And I can see that I have about ten minutes left. So, the third innovation is on indexing and retrieval. And here, what we did was, we looked at historical documents. And I’m sure that, in the medical domain, there are many documents which have degraded. The legacy documents have degraded, and need some enhancement. Even if I don’t recognize them automatically, maybe I can enhance them, so a human being can read what is there. So, we developed this in the case of historical documents. We also came up with a method of transcript mapping. That means, in historical documents, some historian has actually transcribed the document. Now the task is, can I map different pieces of the image to the transcribed words? And this can be interesting now, because today we have the World Wide Web, and we have hyperlinks. So, if I move my mouse or cursor over different parts of the image, you can actually tell me what that word is, and let me link that to some other, richer information. We also looked at word spotting, where you have a handwritten document, and I can automatically tell you which word appears on which pages, okay? That could be of interest. Maybe you are looking for the name of a disease. Maybe you’re looking for the name of a medicine. You’re looking at the profile of a particular patient over the past 20, 25 years. There are a whole bunch of handwritten medical documents, and you simply want to figure out where a particular word appears. And we have developed approaches where you can do this automatically. Okay? And our impact in this area has been on looking at writing from multiple writers. The same patient could have been seen by multiple doctors, with different handwriting, over a long period. How do you do this? 
It’s something very challenging and interesting. Here, I’m just showing you Isaac Newton’s handwritten notes. So, this is something that we worked on: figuring out, you know, what is being written, or even spotting different words in the handwritten document. And we use what has been commonly called the Rule of 70. That means, if we can recognize 70% of the words (only 70, not 100), we can actually give a very satisfying experience to the consumer. Because you have two things going for you. One is, the same word might [inaudible] in the document, and therefore as long as I got one of them, I can pull that page out. Great. The second thing that is going for us is, I’m going to show you the links, right? Just as Google does. You know, you type something in the search box, and it gives you ten or twenty different links on the first page. As long as the link you are looking for is somewhere on that first page, you are quite satisfied. You don’t say, “Well, I did not get it on the very first link.” That’s not an issue. As long as I can eyeball and see that it is somewhere there, I’m happy. So, given these two factors going for us, if we can do 70% recognition, we will be fine in terms of indexing and retrieval. So, in the table below, you can see that for medical documents, if I use a lexicon size of 4000, my recognition rate is only 20%. So, I have a long way to go to reach 70. In [inaudible] domain, bank check recognition domain, we are already there. We can do it all. But historical documents, medical documents, you know, it’s still a challenge. There are other controlled applications where you are getting close to the 70, and you can have a good experience in indexing and retrieval. 
The next innovation is, as I mentioned, in security. And we are all familiar with captchas. You know, these are these squiggly words that you type in to get different web services. And a few years back it was noticed that these words were getting more and more complicated. Because the idea was to keep the software bots out. And those software bots were getting increasingly smarter. And so, the words were getting squished, in order to keep them out and let only humans, you know, get the service. But then it was getting challenging for the humans. Because, you know, you need to really squint to read some of these words. So, how do we find the sweet spot, where humans can read without much effort, and machines or software bots cannot? So, a whole bunch of approaches have been proposed. We proposed a handwriting-based approach, where we said, humans are good at handwriting recognition. Why not make captchas such as this? You write handwritten sentences, a question and answer. So, here it is. She teaches us English. What does she teach us? And in order to prove that you are not a robot, you simply have to answer that question. This involves reading handwriting. It involves understanding natural language. And hopefully it is a better approach. >>So, we came up with methods of generating handwriting automatically for the security purpose. So, you can see, we can model the movement of a hand, and recognize handwriting. We also looked at what it means to have an accent in handwriting. Just like in speech. If I grew up in India, I carry some of the accent from India into the English that I speak. Similarly, if I learned to write, let us say, in an Indian script, you know, when I was growing up, does that influence the way I write English today? If I started by learning to write Chinese, which is mostly very, very clear strokes of vertical and horizontal lines, stick-like and not cursive, you can see how it influences the way English is written. So, we have developed methods for all of these, being able to tell whether somebody is a native speaker or not, a native writer or not. And so on. 
And you can also control the quality of the handwriting, based on the motor hand movement models. So, that’s one area; we have also used it in other captcha and security examples. But let me conclude here by saying, what is the future plan for us? So, four key innovations in two decades. Let’s look ahead. Here, we are looking at some personal diaries, such as these. And I’m sure you can see the connections. This could very well be a medical document with a doctor’s handwriting. The different figures drawn, maybe the heart is drawn, and so remark. And so on. So, if I know for a fact that it is all written by the same person, the same doctor or the same historian, then I can use some of the characteristics of that particular person: how they draw, where they draw, how they write. Do they write in margins? [inaudible] sentences? I can actually use an artificial intelligence, machine learning method to learn not just the way we write the different alphabet characters, but the way the handwritten document is structured. And that is something that we are working on right now. We are taking it to other languages and scripts. You know, this is Arabic. You know, even segmenting can be a challenge. And we have done this successfully. [inaudible] we are able to tell whether it is handwritten or machine printed. So, in the same page, maybe there are certain machine-typed things, and handwritten things. How do you figure those out? Again, because we don’t have too much time, I’m going to skip these slides. And then, we are all talking about flipped classrooms, right? And then we’re talking about, can we look at video lectures, and can we index them? How do I recognize [inaudible]? Look at how challenging these are. Wouldn’t it be nice if a student could search for different videos, not just based on the author of the video, the presenter of the video, the name of the professor, but by a concept? And the concept is not spelled out in English words. 
It’s perhaps based on a chemical formula. Or an equation. Or a diagram. How do we do this? Very challenging. And interesting for us to move forward. With that, I will [inaudible]. I’ll just leave a photo in here. This, well, you know, that you can read. Which talks about handwriting, how recognizing it has so many different facets to it. It has the emotional facet, the behavioral facet. The identity facet. And so on. So, with that, let me conclude here. I’m sure I can take a few questions, if that is possible. But thank you, and I’m ready to take questions and conclude here. >>Interesting presentation. So, at this point, we are basically out of time, so we do not have time for questions. But we hope you can join us for our next presentation on Wednesday, November 11, when Dr. Hugo Aerts from the Dana Farber Cancer Center and Harvard Medical School will present in the speaker series. Thanks to all who have joined us today, and let’s give special thanks to our guest speaker for sharing his time and expertise. [ Applause ] >>I’ll be happy to take questions later on, if somebody wants to e-mail me a question. I’ll be happy to, you know, respond. Thank you. >>Sure. Thank you very much.
