November 4, 2019

Digitizing Early Arabic Printed Books: A Workshop – Session 1



OK. Well, thank you. Good morning, everybody. Thank you all for being here. My name is Elias Muhanna and
I’m an assistant professor in Comparative Literature at
Brown University and a faculty fellow here at the Watson
Institute for International and Public Affairs. I’m glad to welcome you to
the fourth annual scholarly gathering convened by the
Digital Islamic Humanities project, one of the signature
initiatives of Middle East Studies at Brown University. The purpose of
this initiative is to explore the state of the
art in digital scholarship pertaining to the study
of the Islamic world. Since 2013 we have held an
October conference or workshop aimed at bringing
researchers together to showcase their experiments in
digital humanistic scholarship within their respective
fields and to discuss the opportunities and challenges
engendered by this changing scholarly ecosystem. If you don’t
believe me, here are some of the posters
from previous events. 2013 was our
inaugural conference. 2014 was actually a two-day–
what did we call it? It was a workshop
on textual corpora and the Digital
Islamic Humanities, which was convened by–
we had several sessions held over two days. We had about 35 people here
from all over the world learning how to use different
kinds of tools and resources for their own research. And it was very introductory. We had Elli Mylonas and
Maxim Romanov running that. It was a really wonderful event. Last year, we had a more
traditional conference or a symposium called distant
reading in the Islamic archive. And over the course of
these past gatherings, some of the questions we have
been interested in exploring include, where are the most
important digitisation projects of historical sources in
Arabic, Persian, Turkish, and other languages taking
place around the world? And that is a
question that is very much at the center of what we’re
going to be exploring today. What kinds of digital
technologies and methodologies have proven most
useful for scholars in different disciplines? For example, data mining,
pattern recognition, social network analysis, etc. How are existing
technologies challenged by the manipulation of data
in non-western languages, and what are the
most significant technological desiderata
for researchers? And finally, as
digital tools and media become more widespread,
what ethical issues relating to privacy and human consent
must be carefully considered, particularly in projects
involving contemporary political and social issues? This year, we’re very pleased
to be hosting this workshop together with an institutional
partner, Gale Publishers, which recently released an
exciting new digital text archive entitled Early Arabic
Printed Books from the British Library, which brings
together approximately 2.5 million pages– maybe the
number has gone up since I wrote this– from historic
books on diverse genres, including literature, law,
mathematics, medicine, geography, and other topics. In the first half
of the program, we’ll be hearing about
this particular archive from Julia de Mowbray,
a publisher at Gale who has developed several
groundbreaking digital archives since 2006, as well
as Bret Costain who is the director of
new product strategy and development for Gale. Following Bret and
Julia’s presentations, we’ll have an open discussion. And at noon, when everybody
arrives– right now the room’s looking
a little bit empty, but when people
show up for lunch– we’ll take a very
short break for people to grab some lunch
outside in the lobby and come back inside
with their lunch please to hear a lecture by Doctor
Katherine Schwartz from Harvard University entitled Towards a
New Book History of the Modern Middle East. Katherine’s talk
will be followed by a response from
Professor Cynthia Brokaw, former chair of the Brown
department of history and a specialist in
Chinese book history among many other things. I’d like to thank all
of these individuals for agreeing to participate
in this year’s event, and I’d also like to express
my gratitude to Beshara Doumani, the Director of Middle East Studies, for his wonderful
guidance and friendship for the past three years. We wouldn’t have been able to
put this initiative together without his help nor without
the wonderful support of Sarah Tobin, the Associate
Director of Middle East Studies, Barbara Oberkoetter,
our fearless program manager, and Rachel
Easterbrook, her worthy deputy. Thanks also to Jason–
is it “bye-berr”–? Jason [? Biber ?] from Gale
for his logistical support at the last second
and to Ian Straughn our wonderful Middle East
librarian at the Brown Library. Here’s some final
procedural details. We’re webcasting this
live via a video feed, and these events tend to be
viewed several hundred times. I say that, just
because if you are not wanting to be on
camera, you should talk to Rachel or to Barbara. We will also have
a photographer here at some point taking pictures. So, and this is important, if
you are a Brown University ID holder, you may
access the database that we’re going to be
discussing on your laptops during the event through
the library’s website. If you do not have
a Brown ID, you may still access
the database thanks to Jason via an instant
trial, and you’ll find links to the
database on the sheets that are around the room. And you’ll also find a
link to the webcast which you should absolutely tweet
out and post on Facebook so that people can follow us. Here is relevant
information for people who are watching on the webcast. You will presumably be
non-Brown users most of you. And that is the link,
the relevant link. It’s a tiny URL link,
“.com/GaleArabic2”. GaleArabic1 is for Brown users. And if you are on Twitter
and wanting to tweet things, the hashtag that I randomly
selected is #DigitalArabic. Julia de Mowbray has been
publisher at Gale since 2006. She has developed several
groundbreaking projects such as State Papers
Online, the digitisation of the British state papers
from the reign of Henry VIII to George III, the 16th
through the 18th centuries, as an online environment for studying historical documents; and the Chatham House Online Archive, a digitisation of the publications, meetings, and speeches of the top international
affairs think tank. I thought that the whole
point of Chatham House was that it was supposed
to be offline, as it were. That’s a very
interesting achievement. This includes audio recordings of speeches of major global players in IR, presidents, prime ministers, foreign and defense
ministers, et cetera, and now an Arabic program
of primary sources in Arabic for
research and teaching. 2017 will also see the launch of
the first major online archive of the Royal Archives: the Stuart and Cumberland Papers from the Royal Archives, Windsor, the papers of the
Stuarts in exile and those of the
Duke of Cumberland. Julia’s academic background
is in art and art history, particularly the early history
of photography and 14th century French sculpture. Julia de Mowbray. Good morning. Is this mic OK? Just that first. I’ve got a quiet voice. For my husband, it’s
probably quite nice. So I’m Julia de Mowbray. I’m publisher at
Gale based in the UK. I’m going to be introducing
you to early Arabic printed books from the British Library. So I’ll be first talking about
the content, what’s in there. Then I’m going to be going
through the functionality. Then I’m going to be taking you
behind the scenes a little bit. And then after that, I’ll be
handing over to my colleague Bret Costain, who’s
going to be talking about our digital humanities
initiatives and the sandbox and all sorts of whizzy
things like that. So early Arabic printed books
from the British Library advances research in Islamic
and Middle East Studies in two major ways. It provides 24/7 access to
the early Arabic printed books of one of the
world’s major collections. And users can search not just detailed metadata but the Arabic text of the
books themselves. The British Library’s
printed book collection is particularly strong on
early Arabic printing and books printed in Europe, as well as
from Egypt, Lebanon, Syria, Jordan, Iraq, and India. This product is a digitization
of all the works listed in Ellis’ catalog
of Arabic printed books in the British Museum. That is, all Arabic texts in the British Museum by 1890, when that catalog was printed. The collection
covers a wide range of subjects presenting
the heritage of Arabic works in Arabic. Supplemented with European
and Asian translations of Arabic works across
the same period, and Arabic translations
of Christian works, it therefore records the long
exchange of ideas and learning between Europe and the Arabic-speaking world over 400 years. We are publishing the collection in four parts, beginning with Islamic religion and law and its supplement, European translations and studies of the Quran. There are over 50 Arabic Qurans printed between 1694 and 1890 in Europe, the Middle East,
and pre-partition India, which show the interest in the Quran
in its original Arabic text in countries outside the
Arabic speaking world. There is also a larger
group of translations into a wide range of European,
Indian, and Asian languages. The early translations
within the Islamic world were into Persian and Turkish. Outside the Islamic world, the
earliest printed translation was that into Latin
in 1543, which you can see on the
screen, the second along. Of the commentaries on
the Quran– I apologize. Of the commentaries
on the Quran, there are 87 in this collection
by all the major writers of tafsir, such as
al-Zamakhshari, al-Razi, ibn Kathir and al-Suyuti. Islamic law is the other major
component of this module. And the collection
includes works of the four schools of
jurisprudence in Sunni Islam, Hanafi, Maliki, Shafi’i,
and Hanbali law, as well as Ibadi law and Shi’i law. The number and breadth
of these titles enables scholars to
undertake detailed research within this one product. And the subject limiter
can limit the search to just one school of law. The Christianity module includes
complete bibles as well as different books within it
as separate publications. The Bibles include
Le Jay’s 1645 six language polyglot printed in
Paris and Walton’s 1657 London nine language version
seen on the left. In the center of the screen
is the complete Bible solely in Arabic. That was arranged by Sergius
Risius, Maronite archbishop of Damascus, and printed
in Rome in 1671. In addition to bibles there
are liturgies, biographies of saints, histories
of the Eastern Church, liturgical and devotional
works for the Greek Orthodox, Syrian Orthodox, Coptic,
or Maronite churches, and works printed by the
missionary presses, all in Arabic or bilingual. Among the Jewish works there
are 23 works and editions of Maimonides. Looking at the science,
history, and geography module, you can see from
the subjects listed that it includes a
wide range of subjects within this broad area. Islamic research and
discovery of science based on the work
of Greek authors is one of the major aspects of
Arabic learning to influence European scholarship. The collection includes works by
all the great Arabic scientists such as al-Kinda, al-Razi, Ibn
Rushd, al-Haytham, al-Farabi, al-Battani, Ibn Sina. On the screen, you can
see al-Rashidi’s work on the diseases of children and
Alfraganus’ work on astronomy. Using medicine, a field in
which Arabic physicians excelled as an example, there
are over 300 works on medicine and physiognomy. The earliest medical
works in Arabic were translations of
the ancient Greek works. Hunayn ibn Ishaq al-Ibadi
was one of the most important translators. His introduction or
Isagoge to his collection of medical tracts, the Ars Medicinae or Articella, was so important that it was adopted as the basis for the medical curriculum in European universities
until about the 18th century. The edition on the
screen, the second item, is in Latin and printed
in Venice around 1515. There are 24 works by
or translated by him in the module. al-Razi is a major figure in
Arabic philosophy and medicine. He wrote numerous medical
works, some of which were translated into Latin. His most famous
and original work was on smallpox and measles. His comprehensive
manual on medicine, the al-hawi fi al-tibb,
a posthumous compilation of his working
notebooks, earned him the title of the greatest
medical doctor of the Middle Ages. Then Ibn Sina or
Avicenna has been described as the most clear-thinking physician-philosopher. His study of symptoms
was a great advance. He was the first
physician to describe meningitis or meningismus. And he also noted
the interdependence of the mental and
physical factors. His most important work is
the al-Qanun fi al-Tibb, or Canon of Medicine, completed in 1025, which became
the standard medical textbook in the Islamic world and
Europe until the 18th century. History was one of the
most prolific fields of Arabic writing. The collection includes around
400 different historical works as well as 200 histories
in biographical form. They include
chronicles and annals, histories of rulers
and dynasties, local, cultural, and
political histories, and historical encyclopediae. The examples I’ve selected
here are one of the earliest and most important
biographical histories, the Kitab Al-tabaquat
al-kabir of ibn Sa’d. This collection
contains an account of the messages sent
by the Prophet Mohammad to different kings and princes
inviting them to embrace Islam, followed by a series of
notices of the conversion of various Arab tribes and
families and a short account of the author. Al-Tabari’s Tarikh
al-Rusul wa al-Muluk, the History of the
Prophets and Kings, shows how he followed
the traditionalist method of collecting pieces of evidence
or documentation on each event before writing and creating a
monumental corpus of history of the Abbasid period
between the second and the eighth century. This edition is
bilingual, German and Arabic, by the 19th-century German orientalist Johann Gottfried Ludwig Kosegarten. There are 168 works and translations
of geography, topography, and cosmology, and 58 on
travel in this collection. They include factual
and memorabilia type works, as well as travelogues. An important factual
work is the Kitab Rudjar by Abu Mohammad
al-Idrisi, who was a Muslim geographer,
cartographer, and Egyptologist who
lived in Palermo at the court of Roger II of Sicily. The main focus for
him was the detail of presenting the
world as a sphere. It was accompanied by
70 longitudinal section maps and a descriptive and
valuable physical, cultural, and political description
of each region, for which he gathered information from
contemporary merchants, explorers, and others. The 1592 Latin edition here
is the first translation. The next text here is
al-Mas’udi’s, in English, The meadows of Gold
and Mines of Gems. He was born in Baghdad and
travelled widely through Persia, Arabia, India, Africa,
and the East and apparently sailed the seas. This work is both
geographical and historical and has been copied, translated,
and printed extensively. The final work here is by
Mohammad ibn Ahmad al-Biruni. In English, the title of this text is al-Biruni’s India. al-Biruni was a scientist,
mathematician, and astronomer. But he also wrote widely,
such as this account of the religion, philosophy,
literature, chronology, astronomy, customs, laws, and
astrology of India around 1030. Turning to the literature,
language, encyclopedia, and periodicals
module, which is going to be
published in December. So the rest I’ve shown
you is now available. But this is coming
out in December. Poetry was the most
favored literary form. And this is reflected in over
400 works of poetry, including many Diwans by different
authors and many other works and collections in
Arabic and translations into other languages. The author here is Abu
Mohamed al-Qasim ibn Ali, known as al-Hariri, who was a
12th century poet, philologist, man of letters, and
official from Basra. He wrote works on
Arabic grammar. But he’s most famous
for his book of Maqamat, which he modeled
on those of Badi az-Zaman al-Hamadhani
of the previous century, who was probably the
originator of the genre. al-Hariri’s version
enjoyed great success, changed the status of the
work, and became the standard. The first work
here is an edition printed in Paris in 1819. And then, showing the work’s transmission, the next three examples: the second one along
is a bilingual Arabic Latin edition printed
in the Netherlands in the 18th century, 1731. The following one
is an edition with a Persian interlinear
translation and marginal notes by [INAUDIBLE]
Muhammad Shams al-Din, printed in Lucknow in 1857. And the final one along
there is a Latin edition by the German Orientalist Karl
Rudolf Samuel Pfeiffer in 1832, just showing how
all these works were transmitted through Europe,
through the 18th and 19th centuries. Looking at some other
examples, the first here is a collection of
Diwan by five classic poets, including al-Farazdaq. The next is a volume of poems
by an early ascetic poet, Abu al-Atahiya, then a
Diwan by Abu Nuwas, followed by al-Mutanabbi
edited by Ahmad ibn Mohamed al-Yamani [INAUDIBLE]
just to show you a sample of some of the
text in this literature module. Internationally, of course,
the most famous literary work was the Alf layla
wa-layla, or 1,001 Nights, of which there are nine
editions in Arabic, including the Calcutta 1814-1818 edition. There are 28 in
English translation. Most of those translations
are made from the Galland French edition, but some were translated directly from the Arabic. And you can see Galland’s
on the screen on the right. Two famous translations are
Lane’s early sanitized edition on the left, and Sir Richard
Burton’s 1885 to ’88 16 volume edition, very much
unsanitized, translated from the Arabic directly. Other editions include 13 French,
six German, and 39 selections in various other languages. Of the 35 or so 19th century
periodicals included, the majority were printed
in Beirut, Cairo, London, and Paris. They ranged from
general journal weeklies covering arts, culture,
politics, commerce, and science, to
literary monthlies and specialist medical
magazines, Jesuit or American missionary
publications, orientalist serials, and the
Egyptian daily El-Qahira. Together they
provide insight into the cultural, intellectual,
and social lives of people in the Middle East,
perhaps not evident in the monographic works. On screen you have the Diya
al-khafiqan, a monthly printed in London on
politics, literature, and sciences, which is a common
mix for these periodicals. The example is an article about
Alexander III of Russia, the penultimate Russian czar. He was known as a man of peace. And on the other
side is al-Tabib, a medical monthly
printed in Beirut. So you will find all
the major Arabic authors in this collection, the number
of their works and editions, reflecting their use
and perceived importance over the period. This collection as
a whole provides researchers with a picture of the printing, translation, and transmission of Arabic texts through print from the 15th century to the end
of the 19th century, offering material for endless
research projects. So I have taken you
through the content. And we can talk about any questions you have on that in the discussion. But now I want to
take you through what we have done to facilitate
your research of these texts. This product was designed for
researchers in the Middle East, as well as throughout academia
in North America, Europe, and beyond. So users can change the
language of the interface into Arabic to find
consistent Arabic script and right to left
labeling throughout. The collection contains
Arabic text as well as translations of Arabic
text into other languages, as I’ve explained. So users can search the English,
French, German, Latin, Spanish, Dutch, Italian texts
as they would normally. For early Arabic printed text,
search is now also possible. And an integrated pop-up
keyboard for Arabic script is provided at every search box. Works in languages which
are not yet text searchable– such as those in Persian, Hebrew, Indian languages, and Ottoman Turkish– will be discoverable
through their metadata. And yes, I have been asked when
we’re going to make Persian and Ottoman Turkish
especially searchable. So no news on that yet. I’m sorry. So here on screen, you
can see both versions of the home page, the Arabic
script version and the English. The yellow box indicates where
you change from one language to another. And you can change
the English version, as on all Gale databases. You can change the English
version to French, Dutch, Spanish. There are 32 languages
you can change it into. Looking at this new
search on Arabic text, here we are starting a search
on hikmah, which means wisdom. I have entered the search term
using the pop up keyboard. Below the keyboard,
you can see the search assist, or type
ahead– whichever you call it– in action
providing suggestions. There’s a search assist type ahead for the keyword search on the home page, on the box in the banner at the top. And also on the advanced search, it’s available for author and title. We were very keen that we had this, because of the many similarities
between author names and titles, so that you have that to select. And also on the subject
index, because we have Library of Congress
subjects to every text. So in that case, we are
doing our search here. And here’s an example of
the results of this search. There are hits on
the title page, as you can see in
the image viewer. And to the right
of the image here you can see listed all the
pages in this work, which have similar hits. And I have pasted examples
of the hits on the right, showing how the search picks up
the word in different locations and letter forms from the
other pages listed there. This is another search on
al-talaq, which means divorce. It has found numerous
hits as you can see. This work actually is
a treatise on divorce. So the word occurs
right through the book. And it was obviously the
first work in the results list sorted by relevance. You can see that, what you will
see when you do your search and find your hits is the green
highlighted box around where your search term is. I’ve just added the red box
around that to actually show you precisely where that word is
to make it a bit easier to see. But you can also
see, to the side of the image viewer, all
the huge number of pages where there are hits. And then this one, final
search on al-tawhid, or unity of the deity, you can see
there’s a hit on the title page again. And then I’ve shown you
the pages where I found hits. But it’s also come up
with hits on the side. So as well as searching
by word like that, proximity operators also work. And I’ve just done one example here on hikmah and [INAUDIBLE] to find results where these two words are within 10 words of each other. So n10 means within 10 words of each other. You can change the 10 to five or three or whatever. I don’t know how many people use proximity operators now, but they’re very useful for certain types of searching.
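To make the operator concrete, here is a minimal Python sketch of what a within-N-words condition checks over a tokenized page; the tokens and function are illustrative only, not Gale's actual search engine.

```python
# Minimal sketch (not Gale's search engine): what a proximity condition
# like "w1 n10 w2" checks over a tokenized page. Tokens are illustrative.
def within_n_words(tokens, w1, w2, n=10):
    """Return True if w1 and w2 occur within n token positions of each other."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    return any(abs(i - j) <= n for i in pos1 for j in pos2)

page = "kitab al-hikmah wa al-din".split()   # placeholder transliterated tokens
print(within_n_words(page, "al-hikmah", "al-din", n=10))  # True
```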
Here’s an example of a typical
text of a Quran and commentary. This is an Arabic text with
an interlinear Hindustani translation. And on the side, at an angle, the Persian commentary. In this text, only the Arabic is searchable. As I said, we don’t yet have search for Hindustani or Persian. But just so that you’re aware, you will be able to search on the
Arabic and then view the text. Turning to the metadata,
we created MARC records for each work from Ellis’
catalog of early Arabic printed books from the British
Museum, according to specifications agreed
with the British Library. The entries were then
verified and corrected by viewing the images
of the books themselves as they came in. And this data also
forms the metadata for each work in the product. One of the things we were very keen on with this product was actually having very strong metadata. Here, probably of more interest to the librarians, are the standards and standardizations that we agreed: author names following NACO; if there’s no NACO, then VIAF or Library of Congress. And then transliteration is
Library of Congress form. The British Library did have
some records– not many records for these– but some records
with a different form of transliteration. And they wanted everything
in Library of Congress form. So we redid all that for them. We have Library of
Congress subjects. So you can do subject
index searching. And then for place
names, as you’ll see in subsequent
slides, in limiters, we’ve got two forms
of place name, in Arabic script and
the English version. And then for titles
of works which are in one of the non-searchable
languages like Persian or Syriac or Ottoman
Turkish, we’re actually supplying an
English translation for that. I was actually asked
yesterday if we could have a transliterated
version as well. So I’ll look into that. But just to make those
more discoverable. So here you can see
the document view page. At the top is the
brief citation. You can see on this
side, that actually comes underneath
the image viewer, the full citation
with your metadata, as well as notes which
are from Ellis’ entry. In Ellis’ catalog,
he quite often gave some notes on
what the book was or what was in it or
something about it. And this information
we’ve retained. And you have that in the notes. And that’s all searchable. The document itself, if
you look at the image view, can be navigated with
page navigation determined by whether the book
has been identified as right-to-left or left-to-right read. And there’s a hyperlinked table
of contents on the right side to select a particular part
of the work or chapter. The advanced search
indexes and limiters are extensive, offering users
options for their search. And then a type ahead
and search assist is available for the author,
title, subject searches to assist users
with similarities in author and title names. The user can also limit
their results by date, language of text,
place of publication, Library of Congress subject, and
original language of the work. On the screen, you can
see part of a results list from a search on
the right, together with the limiters available
to narrow down your results. So here you’ve got the
limiters for language and Library of Congress subject. And then under that is
publication date, authors, and then place of publication. In Gale primary
source collections, we have a term
frequency feature. And that is also in here. And here I’ve done an example
using al-tawhid and al-ghayb in two graphs. The top one shows the
frequency, the number of documents per year. And the second one shows the
popularity, the percentage of documents per year. And what you can do, you
can do 10 words or more. But what you’re
doing is comparing the appearance of different
words across all the texts. And you can limit
it as you wish. It’s quite useful for maybe
some of the search topics you might be doing.
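As an illustration of the difference between the two graphs, here is a minimal sketch, with invented (year, text) records, of how frequency (documents per year) and popularity (percentage of that year's documents) could be computed; it is not the Gale feature itself.

```python
# Minimal sketch of the two views: frequency (documents per year with the term)
# and popularity (percentage of that year's documents). Records are invented.
from collections import Counter

docs = [(1820, "... al-tawhid ..."), (1820, "..."), (1855, "... al-tawhid ...")]

docs_per_year = Counter(year for year, _ in docs)
hits_per_year = Counter(year for year, text in docs if "al-tawhid" in text)

for year in sorted(docs_per_year):
    frequency = hits_per_year[year]                          # documents per year
    popularity = 100.0 * frequency / docs_per_year[year]     # percentage per year
    print(year, frequency, f"{popularity:.1f}%")
```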
So that’s some of the functionality we’ve put in for enabling you to access these texts. So now I want to,
maybe, take you a little bit behind the scenes. Gale is a global company
with offices and customers on each continent. And we have excelled
in the development of online archives
of primary sources for research,
teaching, and learning. We are known for
our digitization of major and
historic newspapers, 18th century collections online,
the database of 18th century printed books in English,
which is full text searchable, and a host of other archives
covering major themes of humanities research, such as slavery, international affairs, 19th century studies,
Asian studies. And all the printed materials
are full text searchable. Just to show you this one
example, Making of the Modern World it’s called, it’s an
amazing collection of works on trade, exploration,
fine art, botany, anthropology from
the 1460s to 1914. Here is a search on the earliest
work, a Latin text by Johannes Neider of 1468 on the
contracts of merchants, which was one of the first books
of economics to be printed. So that is searchable. This is another work
on the 15th century edition of [INAUDIBLE] Virgil. For this example, I’ve
searched on [INAUDIBLE] in this finished work,
later finished work. So we’ve got a lot of
history in creating search for early texts. However, our customers
outside the Anglophone world have long been requesting
research archives in the native languages. And this is now a strategic
aim of our company. So the product I’ve
been talking about, the early Arabic printed
books in the British Library, launches our whole program
of Arabic resources for research and teaching. The investment in the
development of a program to search across early
Arabic printed script was considerable. It took a lot of
hoops to jump through to get the company
to agree to it. And the challenges are actually also numerous. Here are just some of the
tasks we have to cover, locating vendors or research
teams to develop the OCR program. Actually we started
off– it took a long time to find people. And then we started
off with four teams. And two very
quickly dropped out, because they couldn’t
manage the– they weren’t getting anywhere
with creating an OCR at all. Fortunately we
managed to proceed with two who actually
are producing really, really good results
to such an extent that we now see it’s possible
that we might be able to create search for manuscripts. Second challenge was
creating specifications for the OCR development
and standards. From the very
beginning, not knowing where we were going
to be able to get to, so it’s quite difficult creating
specifications for something you’re not sure what
it’s going to be. And then defining the
specifications for the image capture, data capture, XML
creation, page capture details, and then specification
for the quality assurance, the verification of all that. Because that all has
to be documented. And then training the
staff to QA all that work. So back in the
UK, we have a team of about four Arabic
speaking people. We’re recruiting people
coming over from Syria to work on this project, which is nice. We’ve got a really
nice team there. And then training
the actual teams, the two teams we
have working, the OCR teams, to follow specifications
and maintain consistency. So we have to work to
a certain project time span and keep everybody
going to a schedule. And that takes
quite a lot of work. And then we had to modify
Gale’s content systems to enable them to handle Arabic
script and transliteration. And then define
the specifications for functionality to handle
the Arabic script and display text as right to left read. So everything I’ve
shown you, we had to do a lot of
thinking and talking to people about what they
needed and working out how to do all this. And then to the search
indexes, limiters, and display of the metadata in the product. Looking specifically at
the development of search, I have some examples. This is the first book
entirely in Arabic script printed with movable type. It is the Kitab Salat al-sawai,
All Services for the Canonical Hours, printed in 1514
in Fano, Italy, according to the text, but quite
possibly printed in Venice. There are 13 known copies. Two of them are actually
in the British Library. And this is the only one
with a dedicatory letter by Gregorio de Gregori. And that’s the one
we have imaged. The typeface is clear but
presents several misspellings and misprints. It has three series of
typefaces, Arabic, Roman, and Greek. This is an example of a page
and it’s OCR Arabic script. The word accuracy for this book,
because when the books come in, the teams who do the OCR
keep a log of their accuracy rates et cetera. But when it comes
into our office, we also check a standard page
in every work and log that. And this book has
word accuracy at 80%. This example is
printing done in Cairo on printing presses of the
French national press in 1800. And here you can see the
OCR created from that and another page here. I think this one
was 80% as well. But quite a few of
the texts were not printed by movable type. They were printed by
lithographic method, which was obviously favored in the
Middle East for its manuscript quality and its ability
to convey detail and finer scripts. There are around 970 text
printed lithographically. Initially these
were found not to be successful with an OCR program. So most of these have
been treated differently by marking up and
keying keywords, as you can see in the
image on the right. It’s very dark on
my screen here. I hope you can see it
on the right there. The red boxes show
where the words have been identified for keying. And underneath that box,
I’ve put the actual OCR which has come from that page. I’ve highlighted in
yellow alternate words so that you can more clearly
see the actual individual words there. Looking at the success rate
of OCR and the other methods, you will see from these
figures that 36% of the texts have a word accuracy
between 60% and 79%. And 60% word accuracy
was the minimum we set at the very beginning,
not knowing where we were going to be able to get to. So if any texts
didn’t meet the 60, then the team would
key up to that to make sure it actually reached that level or beyond. Usually if they keyed up, it went into the 90s. And then 64% of the titles have word accuracy between 80% and 100%, or 99%, 99 point whatever, which I’m quite proud that we’ve managed to achieve. Because coming from where we started, I think it’s really, really quite successful.
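For readers unfamiliar with the metric, here is a minimal sketch of one way word accuracy can be computed, as the share of ground-truth words that the OCR output reproduces; the project's exact definition is not specified here, so treat this as an assumption.

```python
# Minimal sketch of one possible word-accuracy measure: the share of
# ground-truth words that the OCR output reproduces (order ignored).
# The project's exact definition may differ; this is an assumption.
from collections import Counter

def word_accuracy(ground_truth, ocr_output):
    truth = Counter(ground_truth.split())
    ocr = Counter(ocr_output.split())
    matched = sum(min(count, ocr[word]) for word, count in truth.items())
    return matched / max(1, sum(truth.values()))

print(word_accuracy("kitab salat al-sawai", "kitab salat al-swai"))  # ~0.67
```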
And obviously, now we’ve developed the OCR. We continue to develop it. And hopefully we’ll use
this on future projects. I just wanted to show
you our QA system, just to show you what
goes on behind the scenes. We have this database where
we pull in the images and XML. And this is where the
QA humans check the XML. You have the XML on
the right of the image. So they can verify
that quite easily. They can also call
up all the metadata. So they can check the metadata
as well to the same book. And if it needs to be
verified against the text, they’ve got it all
in the same screen. This is the same area. That image there, it would
be one pulled up from one of the images at the bottom. And the list of
headings down here are the chapter headings. And those are the ones you’ll
see as the table of contents in the product. And this is the bottom of
that, showing you the images. So the QA team can actually just
call up any particular image at a time and check it. QA stands for? Quality assurance. It’s where we, whenever
you commission a work, you always have to
have it checked. So it’s just called quality
assurance or QA for short. There are actually
companies that just
do that sort of work, which is quite interesting. So now that most of
the metadata is in, we can also engage in
analysis of the corpus. As an example, here is
the place of publication showing which towns were
most active in printing. And you can see,
Cairo was obviously– most texts come from Cairo,
and then Beirut, Paris, Lucknow, and then Istanbul. Here I’ve just mapped
them to countries. And then I’ve mapped them
to regions, which gives you the results you’d expect: that the majority are from the Middle East, followed by Europe. So one could next plot the location of printing across the 400-year time period as well and start understanding more about the nature of printing, what was being printed where at what time.
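The roll-up from towns to countries and regions described here can be illustrated with a small sketch; the mappings and counts below are invented for illustration, not the actual corpus figures.

```python
# Minimal sketch of rolling places of publication up to countries and regions.
# City list and mappings are invented for illustration, not the corpus figures.
from collections import Counter

CITY_TO_COUNTRY = {"Cairo": "Egypt", "Beirut": "Lebanon", "Paris": "France",
                   "Lucknow": "India", "Istanbul": "Turkey"}
COUNTRY_TO_REGION = {"Egypt": "Middle East", "Lebanon": "Middle East",
                     "Turkey": "Middle East", "France": "Europe", "India": "South Asia"}

places = ["Cairo", "Cairo", "Beirut", "Paris", "Lucknow", "Istanbul", "Cairo"]

by_city = Counter(places)
by_country = Counter(CITY_TO_COUNTRY[p] for p in places)
by_region = Counter(COUNTRY_TO_REGION[c] for c in by_country.elements())
print(by_city, by_country, by_region, sep="\n")
```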
So I hope that gives you some insight into the development of this product. To conclude, Early Arabic Printed Books from the British Library is a major collection
of Arabic books, which is now accessible through
more extensive metadata and functionality developed
for academic research, with a variety of search limiters,
and crucially, text search on Arabic printed script. So we hope we’ve also created
a research environment here to further research. And we’ll be interested to
hear how it advances research possibilities and new projects. So thank you very much. And I’ll hand over to Bret. [APPLAUSE] Good morning. Hi, my name Bret Costain. I’m leading our digital
humanities effort at Gale. About a year ago, we
took a look at what’s going on in text data
mining, the requests coming from the research community,
and thought about what tools we might be able to develop to
further that area of research. For about two years now,
we’ve been getting requests for access to the text
files underlying databases, similar to what
Julia just showed. And this is across all
different languages. We’ve been doing digital
archives for about 15 years, starting with 18th century
collections online. We now have about 200
million pages of content. So what we’re finding
is some universities with large budgets–
say a Stanford who has a natural language
processing center– can take those files
and build things out. Brown is another example,
who can just work from files, build their own tools. But what we’re seeing is the
majority of our client base doesn’t have the infrastructure
to do text mining. It’s very complicated. It involves terminal windows,
coding in Python and R, different versions of Java. What we want to do
is create a cloud based research environment
where clean content ready for analysis is available
right next to the tools. Some of the tools we’ll
develop, some simple clustering, but we’ll also link
to third party tools, like NLTK, which is the Natural
Language Toolkit, Stanford NLP. We’re also linking to Google
TensorFlow, as well as IBM Watson. So there’s all these
other tools that are out there that we
can access through APIs. So we want to bring those
tools into this environment to query Gale archives. But also we’re going
to open this up to bring in third party content. So in our proof
of concept, which I’ll show you in a second,
we’re linking to Google Books as an example of a
third party content. But we expect to be linking
to local repositories that universities might
have, after a bit of mapping, so that if they’ve already
done some digitisation of collections, it
might be open source. They can analyze that
through this tool as well. So access to tools and to
other third party content is an important element of this. And we also want to be
open with how we explain how the algorithms are working. We don’t want this to
be a black box the way that some other tools work. So we want to expose
the numbers that are underlying the algorithms. Let users tweak
these algorithms. So back in April, we released
this proof of concept. And we’ve been reviewing
it around the world for about six months. And we have a new set of
requirements for the next round of development. We’ll have a beta production
version available in December. And Brown will be
working with us on evaluating that, along
with other universities. But I want to give you a
quick run through of this. Now for this proof
of concept, we just took one of our archives. It’s primarily English language, and Google Books. That’s for the proof of concept. Eventually all of our archives, including Arabic, will be in this. Now we can enable this
technically for Arabic. But it’s still a question
of what tools are out there to evaluate Arabic. I can use Watson for sentiment
analysis in English text. But they haven’t quite
built up the capability yet in other languages. That will probably come. And we hope tools
like this will spur some of that further research. So what I’ve done here,
this is a very simple search that’s going across one of our
19th century collections online and Google Books. It’s created a 3D term cluster. And it’s displaying
some results here. There are also other ways
the metadata is presented. This is publication location
for this particular collection, much different
than Arabic books. You can see this is primarily
London and New York. We do use some simple word
frequency, a word cloud. And again, here’s this cluster. It’s a 3D cluster. It lets you zoom in. Now the default for
this clustering, we’re using a k-means algorithm. In a future version,
we’re going to let you pick which
algorithm you want and let you decide how
to tune that algorithm. In this case, you can change
the number of clusters. We’ll have many more options
for flexibility down the road.
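As a rough illustration of the default clustering described here, this is a minimal scikit-learn sketch of k-means over TF-IDF vectors; the documents are placeholders and the sandbox's own pipeline may differ.

```python
# Minimal sketch (not the sandbox's pipeline): k-means term clustering over
# TF-IDF vectors with scikit-learn. Documents are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["memory and recollection in the mind",
        "steam engines and railways",
        "memory palaces and the art of recollection"]

vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # one cluster id per document; n_clusters is the tunable parameter
```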
In this search, I just searched the keyword term memory. And these were the results. So displayed over here, we’re looking at all the text that we have from
the Gale document. And as much text is
available from Google Books. So this first result
is from Google. You’ll see their metadata,
which isn’t nearly as extensive as the
metadata we have in a lot of the Gale products. You would also have the
ability to link through to the actual document. So this is taking
me to the source. So this is going out to Google
Books to show the document. We’ll have that same
ability to link back to the archives at Gale. And we’re also exposing
the particular text that is being queried
for the analysis. Other things that
we’re doing here, there’s a concordance view. So this is my search term in
the middle, plus or minus about 50 characters on each side.
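A concordance (keyword-in-context) view of this kind can be sketched in a few lines; the file name below is a placeholder and the character window is the approximate figure mentioned above.

```python
# Minimal sketch of a concordance (keyword-in-context) listing: the term with
# roughly 50 characters of context on each side. "document.txt" is a placeholder.
def concordance(text, term, width=50):
    windows, start = [], 0
    while (i := text.find(term, start)) != -1:
        windows.append(text[max(0, i - width): i + len(term) + width])
        start = i + len(term)
    return windows

for line in concordance(open("document.txt", encoding="utf-8").read(), "memory"):
    print(line)
```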
And as I said, we want to expose how these algorithms are working. So this is the data that’s
as this algorithm runs, there’s different checkpoints. And we want to
expose more of that in future versions of this. We really want to
focus on the workflow so that we can help researchers
get to their humanities research question and not get
stuck in the computer science, data science side of things. So providing clean data
sets to start with, and having the tools
reside in one environment. This is also going to be a
network environment where researchers can collaborate. So three people could be
using the same content set, the same tools,
all set up the same way and collaborate, store
nodes, communicate all through the network. So this is just a
preview of what we’ve done to analyze collections. The development partner
we’re working with has a lot of experience
in data science. They’ve built a platform that’s
an ecosystem for research. And we had them throw
in some other tools that are used in other industries
away from academia. This is a network tool. Because I have
over 3,000 results, this isn’t particularly
useful right now. But what this is doing
is showing the relation between title,
author, and keywords and displaying them
graphically with a network and showing different
connections between things.
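For a sense of what such a network view involves, here is a minimal networkx sketch linking titles, authors, and keywords; the records are invented and this is not the vendor's tool.

```python
# Minimal sketch (illustrative only, not the vendor's tool): a graph linking
# titles, authors, and keywords, built with networkx. Records are invented.
import networkx as nx

records = [{"title": "On Memory", "author": "Smith", "keywords": ["memory", "mind"]},
           {"title": "The Mind", "author": "Jones", "keywords": ["mind"]}]

G = nx.Graph()
for r in records:
    G.add_edge(r["title"], r["author"])
    for kw in r["keywords"]:
        G.add_edge(r["title"], kw)

print(G.number_of_nodes(), G.number_of_edges())
print(list(nx.connected_components(G)))  # a shared keyword links the two titles
```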
This is the type of tool that we want
without really an understanding of how it’s going to be used. But let people start
experimenting with it. We also expect that
some researchers are going to be interested
in building their own tools. You may have a Python script
that we could then wrap and make that tool available. We’ve already been approached
by some researchers who are finishing up their projects. And they’d like their
software applications to live on going forward. We’re talking to
Voyant, for example. We may have a version of
Voyant that’s available here if that’s how people
want to look at the text. So the reason I show you
this development environment is this is what’s possible. This is what’s
underlying the sandbox. There’s all these different
modules with different code associated with them. But it’s a fairly
easy environment to start wiring things together. Certainly requires
some technical skill. But we hope that some
subset of researchers will be interested in developing
applications for the sandbox. One other piece that
we developed as well is to analyze single documents: we have this tool where we pasted
in text from one document and were able to extract
from it concepts. So what this is doing is taking
this block of about five pages. This is an article
on robber barons. And it’s sending
this out to IBM Watson. And it’s returning the concepts that are in the document. These aren’t just the terms; they are the actual concepts that Watson has identified could be in it. It also does a word cloud and overlays sentiment. So these are the terms that appear in the document: red is negative sentiment, green is positive, though there’s not much of that in this example, and gray is neutral.
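The colour overlay described here can be approximated with a tiny sentiment lexicon; this sketch is purely illustrative and does not call the Watson service.

```python
# Minimal sketch of the colour overlay: score word-cloud terms against a tiny
# made-up lexicon. This does not call the Watson service; it only illustrates
# the red/green/gray idea described above.
from collections import Counter

POSITIVE = {"prosperity", "growth"}
NEGATIVE = {"monopoly", "corruption"}

def colour(word):
    if word in POSITIVE:
        return "green"
    if word in NEGATIVE:
        return "red"
    return "gray"   # neutral

text = "monopoly corruption growth growth prosperity railways"
for word, count in Counter(text.split()).most_common():
    print(word, count, colour(word))
```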
We’ve used this in some other applications, too, looking at more
contemporary content like political speeches
or song lyrics. And it’s one way to
generate interest among students in
the tools themselves, before maybe diving into
18th century collections or the Ellis collection. It’s also possible to extract people
with sentiment and the places that are in the document. And then one last thing I want
to show you is to help users have more context
on their search– when anyone logs into our
tools, it’s a blank slate. We don’t know who the user is
or what they’re searching for. So something we’ve
been experimenting with is using this same tool but
to analyze, say, a syllabus, to then extract those
concepts and those terms and then feed those back
into the collection explorer. So you could find
things that are more relevant to that context. So in this case I’ve put in
a– it’s a course design document for psychology. And IBM Watson has
come back and told me all of the terms
associated with it. And what I can do, then,
is go back into this tool. I just drag and drop
that context object here. And now when I do the
same keyword search that I had before,
it’s in the context of that other set of 100 terms. And so what we
envision is people might save some of
these context objects. You can share them. It’s fully transparent. But it might help you get
more relevant results. So in this case,
if I’m searching memory in a psychology
context, I’ll get things about brain memory
and that field of study. If I had another object that
was built around, say, computer science, the results would
be totally different, as it would reflect
computer memory. So that’s one of the
directions this is going. As I said earlier, we’ve been
focusing on English initially, because it’s a proof of concept. But we’ll be moving forward
into other languages. And I would expect
in the coming year, we’ll have a
version of this that will work with Arabic text. That’s it. [APPLAUSE] Questions? Well we have about 50
minutes, 45 minutes or so for just discussion. And Julia, do you
want to come up? Maybe if you both– Should we go to the table? Take a seat at the table that
might be the best way to– Another chair? Most efficient way. I’m official furniture mover. I have a question to
both of you, actually. And maybe I’ll just
ask it from here, so that can be captured
by the microphone. Two questions, one
for Julia, I guess the question is, what is
the threshold at which OCR is useful per word? You hear different things. Some people say that OCR
is only useful above 90%. But then you see rates
that are lower than that. And I seem to
recall, when I first started test
driving the archive, I was having trouble
finding anything. But then a couple of
months later I tried again and I was having no
trouble finding anything. It was totally fine. So are you re-running
the algorithm across the collection? Or is there a way to, over
time, essentially improve the results over the same data? Well there are two
points at which search is created. One is through the
OCR program, which is run on the images, which
creates the XML, which then becomes static. And then there’s a search
engine within the product, which searches on
that static XML. We did find that there were
some problems in that search. I don’t know if this was causing
the difference between the two in that instance. But at first, this search was
picking up every other word with the same stem. And that was causing
various problems. Fortunately, a couple
of people contacted us and alerted us to it. So we looked into that. And we fixed that. So we’ve undone
that [INAUDIBLE]. So it could have been
something like that? But what about the
generation of the XML? We haven’t rerun the script
on the earliest texts. It’s something I’m
quite keen that we do. And I think we would
wait until we’ve finished the initial
corpus and then look at the rates of the
first texts which were done. And I personally would
like us to rerun those. To look at the very interesting
question you raised about at what rate of accuracy it is useful for research, we
never attempted, we never aimed to create a database which
was going to be 100% full text. If you want to do
that, you could re key it all, which was going
to be enormously expensive. So it would take the product
out of the price range of most libraries doing that. But if that was the only
way that this data was ever going to be usable, then we
would do that, but probably do smaller corpuses with fewer
texts to get that accuracy. But what we did
think about was what is a researcher
wanting from the text? What are they likely
to want to search on? Because some academics
will probably just want to find all the texts by a
particular author and then just read them. And they probably may not
do any searching at all. But if they do want to
search, what are they likely to want to search on? And it’s likely to be
names, and the other would be phrases probably. And the database may
be weaker on actually getting actual phrases. But I showed you earlier,
using proximity operators. So if you have a phrase, which
has several keywords in it, it maybe if you do the search
using two of the keywords within so many
words of each other, that would get you that far. We didn’t want to not
move ahead to make this content available by
putting too many barriers up. Over time, I’ve seen
many academics projects attempting to improve
OCR, particularly on early [INAUDIBLE]. And the aim is set
really high that it has to be– the XML you get has
to be at totally [INAUDIBLE] of the original. And I thought that was
just going to be putting up too many barriers. But I really would welcome
feedback on this and comments. Well just to follow up on that. It’s helpful to keep in mind
that we put a lot of emphasis on text and text
mining and stuff. But as you say, some people
don’t want to mine anything. They just wanted
to read the texts. And a lot of people are
interested in material culture. And I should just
say, as a user– and I’m not being paid by
Gale to say this– just as someone who’s tried out
the product, because Brown purchased it. I’m struck by the imaging. The quality of the
images is wonderful. And the metadata is very
useful for navigating it. So I was looking for
images just to use in the course of the publicity
materials for this event. So I went on the– I said,
well I may as well use an image from the actual archive. And you can do a search
looking only for images. And then the images that you
can search for can be diagrams. Or they can be various kinds
of things, multi chrome. And it comes up very quickly. You can very quickly
navigate and find thumbnails that pop up
immediately within the book that you hover over. So you can quickly
scan to see if that’s the kind of image you want. I suspect most people will
engage with it just for that, in that kind of vein, people who
are interested in book history. And I think too, with
the text analysis, if you’re doing something like
word frequency or clustering, it works OK even
if OCR’s not great. But if you try to get into
the real closer reading stuff, sentiment analysis, which
relies on linguistic parsing and part of speech tagging, that
falls apart as soon as there are
expect is some people might then pick documents,
clean them up manually to run that more
detailed analysis. Because you will have the XML. If you purchased a product,
you have the whole XML. Yes, you could run it through. But you might find some
of these algorithms really aren’t useful. Then people will
find new solutions, hopefully working with the
sandbox to experiment with it, and maybe just manually
repairing certain documents or not running whole documents
but the higher quality parts of it. And so that’ll be
interesting too to see, particularly in
Arabic, how that develops. Because all these tools have
been developed around English, like in Watson. But there hasn’t been a
lot of text mining tools for sentiment analysis
in Arabic yet. But we expect that
research to come. So Katherine– Thank you so much for sharing
your really beautiful website. I have a ton of
questions for you. But I wanted to ask first on
this question of Arabic OCR, the slide that you had
shown of typefaces, those were European typefaces of Arabic. And I wanted to know
if you’re finding any differences in
accuracy when you were looking at typefaces that
were produced within the Middle East. And on this question of
images within these texts whether or not
you’re also typing or somehow making available
data on manuscript inscriptions or ownership
inscriptions or seals that you’re finding
within these books so that scholars can
find them that way? And just lastly, more generally
on this topic of just wanting to read text that are
presented beautifully, do you have any ambition for
putting up British Library manuscripts? Because it’s
interesting, in one way, to only show print,
specifically when you’re looking at the
Islamic world, which is overwhelmingly
producing manuscripts during this particular age? Thanks. Thank you. No, we haven’t done any
comparison of the success rates on European typefaces
and Middle East ones. We can probably do that now. The project’s still in progress. So every now and
then, I dive in. Get the latest data as for
the OCR rates I showed you. But that would be a really
interesting thing to do. So I’ll certainly let
you know once I do that and share that with you. In terms of the annotations
and that in the works, we haven’t been capturing
that separately, I’m afraid. It would amount to quite a
lot of work on so many books. You have to pick
it up on each page. Sometimes there’s a note in the
cataloging about annotations but not frequently, I think. So it would mean that you would
need to go through the text to pick that up. In terms of manuscripts, yes. There is another collection
at the British Library, which we’re talking to
them about, too, and they’re keen
that we digitize. It’s actually a collection
of Arabic, Persian, and Urdu manuscripts. So we will be doing
some of those. Yes? Hi, thank you very much for– It’s not doing it,
just for the webcast. OK, there it is. I’m going to sing. Watch out. Paul Barron, from SHARIAsource
at the Harvard Law School. Thank you very much
for your presentation. That was really interesting. I’m wondering, with
your for profit, not for profit
partnership model, do you give a copy
of what you do then back to the library in
terms of the British Museum, in terms of we’ve spent
the money to digitize this. Now you’re getting a return as people
are using your stuff. But I know also a
lot of organizations that deal with this kind of
partnerships between industries will then provide, as part of
that contract, a permanent copy that goes with the museum
so they can develop it as they like. And I wonder if you could
talk just a little bit about what you’re finding
as you’re developing these. I noted in your talk that
you have for profit, not for profit, academic, tech
related firms working together. It’s a really interesting
constellation. And I’m just wondering if
you could talk a little bit about how you’ve knitted
those together to come up with this really
engaging product. Thank you. It is totally a
commercial product. It is a Gale project. And we have a contract license
with the British Library to use the image of their books. And most of our agreements
are very collaborative. In each contract, we try and
have a win-win situation, where as much as we
can tick the boxes that the library
wants, we will do so. Generally contracts
are confidential. But we generally– yes, if
we’re paying for the imaging, then we will give the set of
the images to the library. We usually offer on site access
within the institution as well. So anybody can go into
the British library and just access the product. That’s standard. With this project,
they were very keen on having sets of
records, MARC records, because these works were
largely undiscoverable at the British library,
because they weren’t in their online catalogs. So they will be uploading
the MARC records into their cataloging. And so that will
become available. And obviously, in
the British Library and also any other library that
has the MARC with the product, we have the URL in
the MARC records. So you can go directly
from the catalog entry to the actual book. But in terms of profit, non for
But in terms of for profit and not for profit, we have been running and managing the project. We’ve worked with other teams. Obviously, we’ve paid everybody,
every team for their work and their contribution to it. So there hasn’t really
been a conflict. It’s been relatively
clear cut in that sense. Does that answer your question? Yeah, I was initially looking
for a [INAUDIBLE] issue, letting us know how you
knitted these groups together, because indeed, you’re talking
about very different ideas of business models, work
processes, short, fast, slow movement, decision making. Universities, as we know, have a different rate of speed in terms of proceeding with things than other institutions do. That’s one of the challenges. When you’ve been working
with particular vendors for various things on
English text for a long time, it’s a very smooth relationship. This project we had to
set up new relationships, and work out this dance of different [INAUDIBLE] in Arabic into English. Although we had Arabic
speakers in this, communication was a
challenge, I have to say. So some of the
processes may have taken a bit longer than we
were initially expecting. But generally, it’s
been pretty successful. And we hope to keep
everybody together so we can just move straight
on to another project and really benefit from that. And on the sandbox, as we’re
trying to link together content and tools through multiple APIs, each one of those has its own license agreement. The legal component of building
out the sandbox is significant. I have a growing list
of license agreements that need to be reviewed. There are some standards. We’re still working
out exactly what the business model
for the sandbox should be, to support
the community. So it’s part of, I guess, some
of the value of the sandbox. We’ll sort those out. And then some other
things that have happened in other collections,
like 18th century, communities have formed–
not for profit– that are improving the OCR
manually and submitting to us some improvements. I don’t think that’s more
than 1% of the collection. But the work goes on. And so we work with
them to try to load that content in as well. And I would expect
that sort of community might develop around
Arabic as well. Can I just add one thing? As well as the images, and
access within the library, we also pay the library
a royalty on sales. So they get an income from
it, which helps them as well. That opens a lot of doors. It’s interesting. You talk about the
contract agreements. Because at SHARIAsource,
we’re involved with globally sourcing
primary source documents on contemporary
Islamic law issues. And so each country’s
copyright issues are certainly important and ones you
have to really navigate. And we’re finding that a lot of the documentation and a lot of the source material we’re looking for falls into a lot of gray areas about what is and what is not included. And so that’s interesting. And some scholars, like the folks who do distant reading, actually make a point of
not reading the document. It’s a point of pride. You’ll see papers like, how
not to read a Victorian novel. Because they’re just
trying to look at millions of pages at a very high level. There are models like HathiTrust, where they go read the book and return to you a word frequency list. You’ve never actually
violated copyright, because you never saw it. So some of those models, I
think, will evolve as well. For certain kinds of
research, that works. Not so much if you really
want to dive in and do the close reading.
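[Editor’s note: a minimal sketch in Python of the kind of non-consumptive “derived data” being described: a word-frequency list computed from page text, which can be shared even when the text itself cannot be redistributed. The sample sentence is invented; real input would be OCR output read from files or an API:

from collections import Counter
import re

# Placeholder OCR text; real input would be page text from files or an API.
page_text = "the book of kings and the book of the prophets"

tokens = re.findall(r"\w+", page_text.lower())  # crude word tokenization
frequencies = Counter(tokens)

# Only the counts leave the collection; the original text does not.
for word, count in frequencies.most_common(5):
    print(word, count)

End of editor’s note.]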
Yeah, and an ethical treatment of the whole thing, yeah, absolutely. Anybody else? Hi, thank you for that. Michael Pregill. I’m from Boston
University Mizan Project, which is a new digital
scholarship initiative there. I have two comments that
are maybe questions. I don’t know, one
more of a comment. I was really, really
happy to see the inclusion of the lithographed material. And it seems to me
that, even– and this is speaking to Elias’
thing about accuracy in text recognition. Being a manuscript
person myself– I don’t know– I’ve
had the experience that so many even well-known
collections, the manuscripts are miscataloged,
misidentified, or have not been examined since the
19th century or before. And it seems to me that if there
was a digitisation initiative to try to get at least
some of those online and to even do OCR at
a very basic level, with a very low level of
return, you would still be able to start capturing
what was actually in those texts, which is a huge
improvement over the situation where an old hand list
completely mis-identifies a text. You know what I mean? So it seems that in that
bridge from printed, to lithographed manuscript,
we are potentially capturing a lot more material
and improving our knowledge of the archive, as we can see
what’s actually in collections, some of which have not
been touched in a long time, including the British Library,
which is very strange. Because it’s such a
well-known collection. More of a question as to scope, and Bret, you had mentioned
high level analysis. I found myself thinking
about the corpus. So we’ve got early Arabic
printed texts in the British Library collection. The natural thing
to think about would be moving to other languages in the larger Islamic society of the time. Which seems to me, if you want
to do higher level analysis in terms of what’s
actually in the discourse, you cannot limit
yourself to Arabic, because you have to see
what is also being published and written in other Islamic
languages of the time, especially important from the 13th and 14th centuries onwards. And obviously, for printing, it would be later. But you see what I’m saying,
text produced in Persian, Urdu, and so forth, Ottoman
in that later phase. On the other hand, in
terms of the corpus– I don’t know– are
there possibly plans to think about
adding to the corpus from things that are outside
of the British Library’s collection? I have no idea how comprehensive
that collection is. But it can’t have everything. There must be some
stuff out there that hasn’t been represented. And especially in terms of doing
this higher level analysis, we would want a broader pool. You would still have to
impose caveats to say, oh well I’ve looked at all the Tafsir
that were produced in print. And I can say this. Well if you’re missing–
there’s a question of how many are
actually represented and how broad the generalization
could be, based on the sample that you could get there,
which is, of course, a massive improvement
over what we have now. But one wonders what could be
done in, say, 10 or 15 years. So what are you doing? Technically yes that’s possible. You can cross search across
multiple archives now. So you might start
with Early Arabic. And then you go look at Google
Books or some other source that you trust. And you find other content
that meets your needs. So you have more representation. You don’t have to do a research project only on Gale content, either. So it’s possible, technically. Exactly what the quality
is, I mean you already see it in our sandbox. You’ll see Google Books
gives you 10 metadata fields. We have up to 75. Sometimes you only get a snippet
or just a title from Google. You don’t have the full text. And do you filter out
things just because you didn’t have the full text? Then you have a streetlight effect problem, where you’re just going after the high-quality stuff. And that continues to get– and then, which stuff is getting digitized in the first place. Or even more fundamentally,
what was ever printed in– what was
even authored or collected to then be digitized? There’s all those other steps. And I think you’re going
to see varying quality when you do the research. And you’ve got to decide at what level you want to dive in. To answer your question about
[INAUDIBLE] even in the British Library, there are more
texts of this period, which we haven’t done yet. One of the first
challenges was actually getting an inventory of
works to actually start digitizing and imaging. Because they didn’t
have anything, which is why I decided
to use Ellis’ catalog, which was printed at the
end of the 19th century, because you could create a
list from that to start with. So we could continue with
the rest of the books that they have of that
same period up to 1900. We could also find books
in other libraries, which weren’t included. Whether we do so or not
largely depends on the success of the first project. So if it’s very
successful and there’s a huge demand for more of it,
like a part two in a sense, then of course we would do it. It would be wonderful to do so. Hi, I’m Rich Nielsen. I’m a professor at MIT. So I had a question
for both of you, building perhaps on some of
the questions that have already gone. So first I’m curious for
Julia, how items make it into the collection, which
gets to this issue of thinking about doing analysis
within this corpus. What could you and
could you not conclude about frequencies of
terms as used by authors throughout the Middle East
or throughout Europe writing in Arabic from this corpus? Then for the sandbox, what I’m
still struggling to understand is precisely under what terms
the raw text of documents can be exposed or not and
how much access researchers have to those. Because the type of
work I tend to do involves almost full
access to the text, because I develop new
statistical models to deal with precisely the problems
you don’t have answers to. So I deal with Arabic problems
by developing new Arabic tools. It’s possible that
that could slot into the workflow in the sandbox
if that were the necessary way to do it. But that’s not my
preferred workflow, because I have my own
workflow that works. So I’m curious whether
that’s possible under any circumstances,
what type of arrangements you might have for that? Or if it’s impossible? If it needs to be through
the interface in some way to retain intellectual
property over the documents? For text before the sandbox came
along, our API, if you will, was copying the
database to a hard drive and FedExing it to clients. So that is still available. That is actually my
speed in some way. Free shipping, though. So that model’s still
out there, because there are those researchers,
like yourself, that are really pushing this. We did some interesting
work with Compute Canada. They analyzed our
images and found they have some ideas on how to
improve the OCR on our older collections. Because you can imagine stuff
that we scanned and OCR’d 15 years ago, the technology’s
improved significantly, particularly in
image recognition. So we’re keen to
work with researchers that can improve that,
because it helps the product. Someone else was asking
earlier about images. We haven’t focused
on that just yet. We’re focused on text. But there are a lot of really interesting things we can do with our images to
develop more metadata to make them more searchable. We have historical maps. We actually commissioned the GPS metadata for them, all four corners. So we have 14,000
historical maps out there that could be overlaid and
used in different ways.
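[Editor’s note: a minimal sketch in Python of how four-corner GPS metadata might be used: turning the corner coordinates of a scanned map into the bounding box that web-mapping overlays generally expect. The coordinates are invented, not taken from the collection:

# (latitude, longitude) of the four corners of one scanned historical map
corners = [
    (30.10, 31.20), (30.10, 31.35),
    (29.95, 31.20), (29.95, 31.35),
]

lats = [lat for lat, _ in corners]
lons = [lon for _, lon in corners]

# South-west and north-east corners of the bounding box, the form most
# web-mapping overlay APIs expect for placing an image on a modern map.
bounds = [(min(lats), min(lons)), (max(lats), max(lons))]
print(bounds)  # [(29.95, 31.2), (30.1, 31.35)]

End of editor’s note.]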
So there’s not one fixed way in which we work with researchers. There are a lot of one-off agreements that are different from,
yeah, here’s a database. Purchase it. So happy to talk to you
about what you might want. Just to address your question about defining what you’re actually researching on, what the corpus is: all you can say is that, because we included every book which is in the catalogue by Ellis [INAUDIBLE], you’re researching on every book that was in the British Library by about 1890. So how useful that
is, I don’t know. That’s useful. But that is– That’s a process– I was going to say on the
topic of that, Steven Roman has a book on the history of the building of the collections. And the British Library,
formerly the British Museum, is featured in that, to
give some sort of context for fleshing it out. And related to that,
I’m also working– because we have 300 archives to
put in the sandbox right now. I could give you an
alphabetical list. And a lot of libraries organize
their database the same way. I’m trying to come up
with metadata standards at the collection level. So you then know what collection
you want to start working on. One of the things
we’re looking at doing is extracting all
the index terms for every document
in a collection, to have a descriptive way of telling whether you even want to go into that room or not. So I had another question. So stop me if I’m wrong, but I
assume that subscribers could then subscribe to one of
the potential research areas that you all
have designated, or perhaps two or three or four? Yes. We’re publishing
it in four modules. So you can buy all of
them, get the whole corpus. Or [INAUDIBLE], get
the whole corpus. Or just one or two or whatever. So then in which case,
am I able to search– if I’m only using one, am I able
to search for hits upon others, so that I can see, for example, what would be returned. I’m thinking specifically of
printers, whom I research. And, of course, they
produce all sorts of texts. So if I want to study
them, I don’t necessarily want to see just what they contributed to literature but also to religion, the sciences,
and to periodicals, et cetera. And to have some sort
of a function there would be super helpful. If you wanted to do
that, and find out what the printers did
across all subjects, you would need to
have access to them. The four– Yeah. Could I ask the uncomfortable
question that Paul, I think, started to ask? But it’s actually something
that I’m really interested in. And I’ve been interested
in this question since we started thinking
about putting this together. And that brings us
back to the theme of profit, nonprofit,
public-private partnerships. Because I think
there’s a lot of fuzzy-headedness about this question
in the academic community, especially in digital humanities
communities, where people get very angry or very
worked up about the notion of a corporation
getting into the domain that academics populate
and starting to figure out ways to do things that
academics should do and charging money for it. But many people also don’t feel
that bad about when Google does it, because we’re not paying
for access to Google Books and Google Scholar and so on. So a lot of the things
that have become in a way infrastructural, that we
just take for granted, we don’t spend that much
time thinking about it. But when it comes to whether
it’s Elsevier or Gale or whatever, people,
I think, do raise– it’s worth raising the
question of profits and how we should
think about all this. So I’m curious
about how you think about it from a publisher’s
perspective, a publisher that is producing, let’s be
honest, some incredible tools and some incredible archives
and is really at the bleeding edge of this sort of work and
that we should, as scholars, be engaged in a dialogue with. How do you approach
these questions? I’m actually really
pleased you’ve raised this. I was just up at the East Coast universities this week, and we were just discussing this the other day. I find that, from some librarians’ perspective, there is a slightly strange view of commercial publishers. If you set up a project to do
digitization of collections here, for example, Harvard,
you would probably get funding from a government body, maybe
a private foundation, but maybe a government body, which is
actually taxpayers’ money. And so your taxpayers’ money
would be funding your project. Whereas what we do is
that we get the money from each university
individually who actually subscribes or purchases. So I don’t see it as clean versus dirty money, in a sense, if you want to put it that crudely, because basically both are being funded from the same money, just via different means. Does that make sense? We are a commercial company, in
that we have to make a profit. And some of that profit goes into investing
in new products. And some of it goes into
shareholders, et cetera. So there is that element. I’m going to continue to play
devil’s advocate, though. I used to work in publishing. So I am sympathetic
to the perspective. And also your description
of how many labor hours must have gone into producing
something like this and how many steps had to
go into just the OCR angle– forget about all the text
mining and everything that comes into that–
it’s a huge endeavor. And there has to be
some kind of motive. We’re motivated, as academics,
by things like tenure. And we’re not immune to the
idea of incentivizing things. And profit is an
important incentive. But people do like to
raise the question of what is going to happen
to the academy– or what is happening
to the academy– when entire wings of the academy are being shifted into the more for-profit realm, whether it’s journal subscriptions and what libraries have to
pay for maintaining access to periodicals. And presumably
also we should say, I don’t know how it
compares with the cost of your other archives. But maybe it would
be useful to let people know how much
a subscription to, or purchase of, this particular archive would cost. And which universities
are able to afford that, and how that whole
economy ends up working, and what its, maybe,
unintended consequences might be for places
that can’t afford it? I don’t expect you to
answer all these questions. But these are the themes that
are brought up over and again in places like the Chronicle
of Higher Education, where there is a lot of angst. Let’s be honest. There’s a lot of angst about what will happen to the academy, or what is happening to the academy,
through these public-private partnerships. Basically it is a
series of partnerships. And I hope publishers
which are involved in serious academic projects
like these will become closer to the academic community. We’ve always worked very closely
with the academic community and the libraries,
because we actually make the library collections
accessible globally, which they can’t do
themselves quite often. And we have the ability
to source material from several libraries to
reconstruct archives which have been separated, for example. It’s not to say that a library
couldn’t do the same thing. But I think it’s partly attitude
and an approach to what we want from the end product. What we’re aiming for is
a research environment, the tools for
academics to best do the research on this material. And to achieve that, we talked
to the librarians and academics to see what they want. And then we can modify that. Where we sometimes
have a better position is that we develop relations. Because we do this
all the time, we develop relations with
particular vendors who are skilled in various spheres. So quite often we can do things more cheaply than you can within the academic community. But that doesn’t need to be like that. You were asking about hours. At times, we’ve had
1,000 people working on a collection, not this
particular one, but others. Millions of hours have
gone into collectively creating what we’ve done. I was at the DH
Summer Institute, asking, well, how does your scanning get done in your local projects? We just get undergrads to do it. That’s not sustainable. And that’s not an
academic endeavor. You scan one or two pages. You get it. So there needs to be some way
to create more digital content to enable all these tools. As director of new
strategy, we’re looking at a lot of
different models. So now stepping aside
from publishing, looking at software: software as a service, infrastructure as a service. I think Harvard Business School identified 27 different revenue models you could have. Right now we do collections
or subscriptions. That’s what we do. Some of our competitors
offer some software services. And you look at
the libraries, they have collection budgets, subscription budgets, tech and ops budgets. Then the university
has different pockets. So there’s no standard
way anyone’s organized. And it doesn’t need to be. Now that things
are digital, there are other ways to do things. But the mindset is
collection or subscription. And that’s it. I think that will evolve. I don’t have an answer
where it’s going. But there is certainly
a lot more flexibility in the digital world. Just to pick up the question on pricing you asked: we have pricing models in which the pricing reflects the size of the institution. I think we’re actually refining
the banding for pricing at the moment to try
to make it a bit more accurate to the actual
size and the importance of the institutions. So it means that a small
institution doesn’t necessarily pay the same price
as a major one. So we are quite
aware that you can’t expect– we aim to
make these products available to all institutions. So when I do the business
case for each project, I look at all the costs. And then I look at
what institutions are likely to be able
to afford for this. If the costs mean that the
price is going to be too high, then we maybe modify it
so that it’s available. And one of the other
models we’ve looked at, too, is digitisation
as a service, subject to us having
the right capacity. But libraries have
approached us to say, look, I’ve got these 1,000 books. Could you onboard them and make them available in your interface, do
all this quality control? Because you probably can do it
per page cheaper than we can. We haven’t figured out
how to make that work, because we want to have
a significant pipeline to build a business around it. Right now it’s
one-off inquiries. But it’s something we
look at, if that’s a way. And then that would be, we
would deliver the project. And then it’s open from there,
subject to maybe some hosting. So it’s a different model. What we do now is
underwrite all the costs and recover it over years. There’s a lot of
different ways that we can model the economics. We have time for
one more question before we go to lunch and then
back in for Katherine’s talk. And then we’ll have
time after that for a more general discussion. One more question? I think Rich had a– Yeah, but I don’t
want to double dip. Anybody else? It’s OK to do the last one. So my question actually
builds on this. So one thing, I’m so
excited [INAUDIBLE], although I’m trying to
walk in between somehow. My discipline is very concerned
about replicability right now, which is a
real challenge when using non-open-source material. In much of my work, I acquire all of the texts that I’m using myself. But I still can’t
legally release it. It is often in a gray
zone of the internet, where it’s like, things on
the internet, to what extent can I release them or not. It’s usually sufficient
to release things like summaries of
the text that are intermediate steps
between the raw text and some statistical
analysis that could be wrong. They don’t often allow things– they would allow single-word searching back on the text, but they wouldn’t allow user-reconstructed texts. I’m curious whether this sort
of thing is on your radar, whether this fits into
models for thinking about what of your data
can and can’t be released. But it is somewhat difficult for me to work with a source where, when some other researchers would like to verify that I didn’t just make everything up– and there are other good reasons, I think, why this replication norm is growing– what I have to tell them is, well, you just need
to find a subscription to this. You better hope
that your university is going to get that, in
order for you to check. Is there any kind of
intermediate thing that is possible to release
that does not violate your types of agreements? In the context of the
sandbox, we’ve looked at that. My starting point is
the research process and the full lifecycle, all
the way through peer review and replicating results. So one of the things– It seemed like you were
thinking along those lines. One of the things we want to
do is support that workflow. You might have ten work sets, ten algorithms, ten parameters. You’ve got all these results. You’re taking notes. You’re writing them up. Well somebody might want to
go back and look at that. We would like it– this is
still subject to legal review, of course– but we would
like the peer review group to be able to come
into the sandbox and go look at everything. That way it’s contained, and we’re not sending hard drives to all of your peer review group, who maybe can’t even run your software. So we think having the same
software versioning and all that, we might be able to
help that process somewhat. But it’ll still take some
doing on the legal side. But effectively, if
the peer review group could work in a matter
of weeks, months– I don’t know what
that process is like and it probably
varies by project– but they could come into the
environment for a peer review process even if they weren’t
a member institution. That’s our thinking so far. On using the early Arabic material, it depends on what
exactly you’re wanting to display on the open
web, whether it would infringe or not. So we certainly want
to enable things to happen. We’re creating
these products so that you can do this research. We certainly try
as hard as we can to actually agree to things. Thanks. Thanks both of you. Thanks to you both. This has been hugely– [APPLAUSE]
