
Hello,
my name is Simon Hengchen, and this is my resume.

About Me

I am a PhD Candidate at the Université libre de Bruxelles, where I have been for nearly three years. I research information extraction in multilingual, unstructured, OCRed, historical textual data and specialise in topic modelling (LDA). Before focusing on LDA, I also worked on named-entity recognition.

A PDF version is available here.

Education

  • 1988

    I was born in Belgium

    ... at a very young age, to Mr and Mrs Hengchen-Dubois.

  • 2010

    MA in Germanic languages

The BA was quickly followed by an MA, for which I did an internship at Volontariat, an NGO in South Eastern India.

  • 2012

    MSc in Information Science and Technologies, specialising in Natural Language Processing

    Wanting to be prepared for a life in the 21st century, I chose to tackle this challenge and enrolled in the MaSTIC program. Courses included Algorithmics, Programming (C++), Natural Language Processing, Databases, and Library Science.

  • 2017

    PhD in Information Science and Technologies, specialising in Natural Language Processing

    I research information extraction in multilingual, unstructured, OCRed, historical textual data and specialise in topic modelling (LDA).

Experience

Trinity College Dublin
2015
CENDARI Fellow
In the context of the CENDARI project, I was invited to research a 3.1-million-page dataset pertaining to daily life in the city of Ypres, Belgium. This research was carried out at the Long Room Hub over three months.
Université libre de Bruxelles
2013-2017
PhD Candidate
I research information extraction in multilingual, unstructured, OCRed, historical textual data and specialise in topic modelling (LDA). Before focusing on LDA, I also worked on named-entity recognition. I am the beneficiary of a BELSPO grant and work on the TIC Belgium project, which aims to develop a virtual research environment helping historians delve into millions of pages of historical textual data, with a focus on transnational intellectual cooperation.
As a representative of the scientific community, I also take part in various scientific commissions and am a full member of the Faculty Council.
GDF Suez (now ENGIE)
2013
Young Knowledge Officer
As part of the Department of Strategic Watch and Analysis, my tasks were, in a nutshell, to monitor various sources of information and dispatch relevant data to other departments. Owing to an NDA, any further queries should be addressed to my then-supervisor, Yohann Delzant.

Publications

Proceedings of the 15th International Symposium of Information Science (ISI 2017)
2017
Text Mining for User Query Analysis
A 5-Step Method for Cultural Heritage Institutions
This paper explores a five-step, text-mining methodology that can help automate the analysis of large volumes of log files. The methodology is illustrated by a case study from the State Archives of Belgium. The paper was presented at the Everything Changes, Everything Stays the Same? Understanding Information Spaces conference.
The first author of this paper is Anne Chardonnens, with help provided by Raphaël Hubain.
Proceedings of the 2016 IEEE Conference on Big Data
2016
Exploring archives with probabilistic models
Topic Modelling for the valorisation of digitised archives of the European Commission
This paper presents a proof of concept on the use of Latent Dirichlet Allocation (LDA) to semi-automatically create content metadata for multilingual, historical, OCRed archives. It also tackles the reconciliation of the generated metadata with an existing controlled vocabulary, enabling the institution to effortlessly integrate the results into a production system.
Presented at the 2016 IEEE International Conference on Big Data workshop on Computational Archival Science.
Co-authors: Mathias Coeckelbergs, Seth van Hooland, Ruben Verborgh and Thomas Steiner.
Proceedings of JADH2016
2016
Comparing Topic Model Stability across Language and Size
This paper presents a benchmarking study of the use of Latent Dirichlet Allocation (LDA) on parallel corpora (English and French). It furthermore tackles the problem of representativeness (how much data is enough data?) by reducing a large, DBpedia-based corpus and applying LDA to smaller versions of it.
Presented at the 2016 conference of the Japanese Association for Digital Humanities, JADH2016.
Co-authors: Alexander O'Connor, Gary Munnelly and Jennifer Edmond.
Preprint
2016
How hot is .brussels?
Analysis of the uptake of the .brussels top-level domain name extension
This paper presents an analysis of the uptake of the .brussels domain name extension. A quantitative analysis of the dataset determines several characteristics of the gTLD, such as the names and countries of registrants of .brussels domains, or the number of redirections versus domains used as such. A more qualitative analysis, based on a representative sample, indicates the language of the .brussels websites, the commercial sectors that use them, and whether there is a direct link to the city of Brussels.
Code and preprint available on howhotis.brussels.
Co-authors: Margot Waty, Seth van Hooland, Mathias Coeckelbergs and Max De Wilde.
De Boeck Université
2016
Introduction aux humanités numériques
Méthodes et pratiques numériques en sciences humaines et sociales
This book, co-written with Seth van Hooland, Max De Wilde and Florence Gillet, introduces digital methods to humanities students. The book tackles information searching, data modeling, digitisation best practices and data analysis.
Digital Humanities Quarterly
Accepted for publication
Semantic Enrichment of a Multilingual Archive with Linked Open Data
This paper, co-written with Max De Wilde, presents MERCKX, a novel tool to semi-automatically enrich a multilingual archive with Linked Open Data. Using a 3.1-million-page dataset focusing on the city of Ypres, Belgium, we introduce a robust language-independent system that beats state-of-the-art solutions.
I2D
2015
L'extraction des entités nommées : une opportunité pour le secteur culturel ?
This paper, co-written with Seth van Hooland, Ruben Verborgh and Max De Wilde, evaluates different NER services on a historical, French-language corpus. By doing so, we demonstrate that it is possible for libraries, archives and museums (LAMs) and, by extension, most cultural heritage institutions, to easily enrich their datasets with Linked Data URIs in a low-cost way. PDF available on CAIRN.info.

Skills

Natural Language Processing & Text Mining
Topic Modelling
Named-Entity Recognition
Semantic Web
Linked Data
Linux
Python

Conferences, Workshops, Talks

IEEE Big Data
2016, Washington
Exploring archives with probabilistic models: Topic modelling for the European Commission Archives
First Workshop on Computational Archival Science.
Slides.

JADH
2016, Tokyo
Comparing Topic Model Stability across Language and Size
Japanese Association for Digital Humanities.
Slides.

UCSB
2016, Santa Barbara
Topic modelling in the Library
University of California: Santa Barbara libraries.
Slides.
DHBenelux
2015, Antwerp
Semantic Enrichment of a Multilingual Archive with Linked Open Data
DHBenelux
2014, The Hague
NER as a gateway drug to the Linked Data cloud: Application of Named-Entity Recognition on cultural heritage metadata
Digital Humanities FNRS
2014, UCLouvain
Named-Entity Recognition et Linked Data: quelle valeur ajoutée pour les archives ?

Member of boards and committees

Reviewer
Frontiers of Information Technology & Electronic Engineering (ISSN: 2095-9230)
Founding Member
2013 -
FNRS contact group for Digital Humanities
Member
2013 -
TIC Belgium Technical Committee
Member
Programme Committee
DHBenelux 2016
Member
Programme Committee
DHBenelux 2017
Member
Scientific Committee
Digital Approaches towards 18th and 19th century serial publications (September 2017)

Languages

French
English
Dutch
Norwegian Bokmål
Modern Hebrew
Tamil

Hobbies

Boxing
Running
Weird languages
Robotics and automation
Electronics
Norse mythology