PowerPoint Presentation

Scene 1 (0s)

The management and cataloguing of documentary heritage has been a subject of study in Europe since the 1700s. Today there is new demand for systems and procedures for managing and sharing cultural heritage, including in supranational and multi-literate contexts. Digital Maktaba (DM; in Arabic "maktaba" means "library", the "place where books are found" and "where you write") is a project that aims to build a digital library to preserve cultural heritage. It was born from a collaboration between computer scientists, historians, librarians, engineers and linguists at the University of Modena and Reggio Emilia (UniMoRe), the Foundation for Religious Sciences (FSCIRE) and the mim.fscire start-up.

Scene 2 (26s)

Digital Maktaba software overview. The software is designed to take documents as input and then: (1) process and digitize each document; (2) use the extracted data to assist the user in the task of cataloguing.
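A minimal Python sketch of this two-stage flow is given below. All of the names (extract_text, suggest_fields, process_document) and the returned fields are illustrative placeholders, not the project's actual API.

    # Illustrative two-stage pipeline: (1) digitize, (2) suggest catalogue data.
    # Names and fields are placeholders, not the Digital Maktaba implementation.
    def extract_text(pdf_path):
        # Stage 1: digitization, e.g. by handing the file to an OCR engine.
        return "extracted text of " + pdf_path  # placeholder result

    def suggest_fields(text):
        # Stage 2: turn the extracted text into suggested catalogue metadata
        # that the librarian can review and correct.
        return {"title": text.splitlines()[0][:80], "language": "und"}

    def process_document(pdf_path):
        text = extract_text(pdf_path)
        return {"file": pdf_path, "text": text, "suggested": suggest_fields(text)}

    print(process_document("sample.pdf"))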

Scene 3 (1m 0s)

Conversion. Arabic: 157 PDF documents; Persian: 109 PDF documents; Azerbaijani Turkish: 37 PDF documents; Latin: 12 PDF documents.

Scene 4 (1m 30s)

New OCR tests. Of the systems evaluated, the best are: Google Vision AI, which is a paid service; Google Docs, which performs slightly worse than Google Vision AI but is unusable for the purposes of the project (due to the type of service and output it offers); and EasyOCR, a free service whose output is the closest and most complete with respect to the best system.
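As an illustration, the sketch below shows how EasyOCR could be applied to a scanned Arabic-script page; the file name is a placeholder and the language list would be adapted to the document.

    # Minimal EasyOCR usage sketch (pip install easyocr); "page.png" is a placeholder.
    import easyocr

    reader = easyocr.Reader(["ar"], gpu=False)       # load the Arabic recognition model
    results = reader.readtext("page.png", detail=1)  # list of (bounding box, text, confidence)

    for bbox, text, confidence in results:
        print(f"{confidence:.2f}  {text}")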

Scene 5 (1m 53s)

Document annotation. Annotation plays a pivotal role in training Machine Learning (ML) models by providing labelled data or ground truth. ML models learn from annotated data and make predictions based on that learning.
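As a hedged illustration, annotated ground truth for this task could take a shape like the following; the field names and label values are invented examples, not the project's actual annotation schema.

    # Illustrative ground-truth records: each pairs a document's extracted text
    # with the labels a human assigned. Fields and labels are invented examples.
    ground_truth = [
        {"doc_id": "ar_0001", "text": "... extracted page text ...", "labels": ["history", "biography"]},
        {"doc_id": "ar_0002", "text": "... extracted page text ...", "labels": ["law"]},
    ]
    # A supervised model is fitted on these (text, labels) pairs and then
    # predicts labels for documents it has never seen.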

Scene 6 (2m 12s)

Manual annotation of all the documents is not feasible. Time-consuming: manual annotation requires human effort to read through each document and analyse its content. Expensive: hiring a team of annotators or experts to manually annotate documents can be costly. Scalability: as the volume of documents increases, manual annotation becomes even more challenging. Digital Maktaba is working to eliminate unnecessary human labour that does not enhance its work. An interface has been created to allow the team to collect data from user interaction, and the team has begun implementing machine learning techniques to automate the annotation process.

Scene 7 (2m 40s)

HD1 (expansion) – Biblioteca La Pira, 2019. Properties: 136,778 files, 5,548 folders, 830 GB total size. Format: mostly PDF. PDF status: digitized / non-digitized. Production: printed / manuscript (less frequent). Alphabets: Arabic script (Arabic, Persian, Azerbaijani). Arabic character styles: Nasḫ, Nastaꜥlīq, Taꜥlīq, Kūfī, Ruqꜥah, Diwānī, Ṯuluṯ, Muḥaqqaq, Tawqīꜥ, “artistic” (designed for a particular frontispiece).
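A short sketch of how such collection statistics (file and folder counts, total size, format breakdown) could be recomputed for a mounted drive is given below; the root path is a placeholder.

    # Sketch: walk a drive and summarize its contents; "/mnt/HD1" is a placeholder path.
    import os
    from collections import Counter

    root = "/mnt/HD1"
    n_files = n_dirs = total_bytes = 0
    formats = Counter()

    for dirpath, dirnames, filenames in os.walk(root):
        n_dirs += len(dirnames)
        for name in filenames:
            n_files += 1
            formats[os.path.splitext(name)[1].lower()] += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))

    print(f"{n_files} files, {n_dirs} folders, {total_bytes / 1e9:.0f} GB total")
    print(formats.most_common(5))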

Scene 9 (3m 52s)

Selected texts from HD2. [image: screenshot of a document]

Scene 10 (4m 32s)

[image containing text]

Scene 11 (5m 17s)

Dataset description: a total of 2,300 Arabic documents; 12 possible labels (of which only 4 were used to fit the model); each document can be multi-labelled. Dataset preparation: undersampling as a balancing method; features extracted from the document texts with term frequency-inverse document frequency (tf-idf). Adopted strategies: k-fold cross-validation as the validation strategy; weighted-average F1 score as the model evaluation metric. Best classifier in terms of average F1 score: SVM.
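A minimal scikit-learn sketch of the described setup (tf-idf features, a linear SVM wrapped for multi-label prediction, k-fold validation, weighted F1) is shown below. The texts and labels are synthetic placeholders standing in for the real corpus, and the class balancing by undersampling is not reproduced here.

    # Sketch: tf-idf + multi-label linear SVM, evaluated with k-fold and weighted F1.
    # "texts" and "labels" are synthetic placeholders for the real annotated corpus.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    texts = [f"placeholder document text {i}" for i in range(40)]
    labels = [["label_a"], ["label_b"], ["label_a", "label_b"], ["label_b"]] * 10

    y = MultiLabelBinarizer().fit_transform(labels)            # binary indicator matrix
    model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))

    scores = cross_val_score(model, texts, y, scoring="f1_weighted",
                             cv=KFold(n_splits=5, shuffle=True, random_state=0))
    print("mean weighted F1:", scores.mean())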

Scene 15 (6m 47s)

The workshop allowed the team to show an early cataloguing system in which the importance of first manual and then automatic (intelligent) annotation can be appreciated. It enabled the participants to better understand cataloguing in the sense of collecting data in order to develop a meaningful model that reproduces the librarian's work. The workshop demonstrated the complexity of the languages and textual genres involved compared with documents in English or the Latin alphabet, and it showed the functioning of the cataloguing interface with ML features that assist the librarian's work. Feedback from participants was collected, highlighting the “human in the loop” paradigm.

Scene 16 (7m 23s)

Creation of a tagging interface containing a sample of about 100 documents ready to be catalogued by summer school participants. Pre-experiment form to collect useful information on participants: cataloguing knowledge, knowledge of Islamic studies, Arabic language knowledge. Explanation of the interface functionality to participants. Cataloguing session in which the participants were free to use any method that might help them. Post-experiment form to collect users’ feedback about the interface: mini-tasks were given to be completed through the interface, difficulties encountered while performing these tasks were recorded, and final impressions of the tool and useful tips on how to improve it were collected.

Scene 17 (7m 52s)

[image: screenshot]

Scene 18 (8m 5s)

[image: screenshot]

Scene 19 (8m 15s)

Due to the limited number of sample documents and the limited number of active participants in the UNIMORE Summer School workshop, we aim to expand both the number of documents catalogued and the number of participants, in order to gather valuable insight into the documents and feedback on the interface. Your assistance in cataloguing a portion of our initial samples would greatly contribute to understanding the most effective direction for the next steps of our project.

Scene 20 (8m 48s)

Papers, articles, conferences - WP5 (Digital Maktaba).

Scene 21 (9m 55s)

Thank you for your attention. As your assistance is essential to our goal of developing a useful tool for librarians, we kindly ask you to join us in our document tagging session. Your experience and knowledge can significantly help us get closer to our goal. Where: in person or online. When: tomorrow (if possible) or on a date to be defined. What you need: a PC, email, and an internet connection.