Unlocking Textual Data: A Beginner's Journey Through Python, NLTK, and spaCy

Table of Links

Abstract and 1 Introduction

2 Related Work

3 A Virtual Learning Experience

3.1 The Team and 3.2 Course Overview

3.3 Pilot 1

3.4 Pilot 2

4 Feedback

4.1 Relentless Feedback

4.2 Detailed Student Feedback

5 Lessons Learned

6 Summary and Future Work, Acknowledgements, and References

A. Appendix: Three Stars and a Wish

3 A Virtual Learning Experience

3.1 The Team

Our team is made up of three early career academics at the University of Edinburgh. Two teaching fellows have a background in Natural Language Processing with PhDs in Computational Linguistics. The third teaching fellow has a PhD in Computer Science and frequently teaches programming to different types of audiences, including business students as well as students outside of higher education. The author list of this paper also includes a fourth (last) author who participated in our first pilot, is a lecturer herself, and has provided us with useful feedback for future iterations of this course (see Section 4.2).

3.2 Course Overview

In our data-driven society, it is increasingly essential for people throughout the private, public and third sectors to know how to analyse the wealth of information society creates each day. Our TDM course gives participants who have little or no coding experience the tools they need to interrogate data. The course is designed to teach non-coders how to analyse textual data using Python as the main programming language. It takes them through the steps needed to analyse and visualise information in large collections of text documents, or corpora.

The course takes place over three three-hour sessions and each session introduces participants to a new topic through a short lecture. The topics build on the previous sessions and at the end of each session there is time for discussion and feedback. In the first session we start with Python for reading in and processing text and teach how individual documents are loaded and tokenised. We work with plain text files but do raise the issue that textual data can be stored in different formats. However, to keep things simple we do not cover other formats in detail in the practical sessions.
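The first-session steps of loading a plain text document and tokenising it can be sketched as follows. This is a minimal illustration, not the course's own notebook code: it uses only the Python standard library so that it runs without extra downloads, and the file name `sample.txt`, the sample sentence and the helper names `read_text` and `tokenise` are assumptions made for the example.

```python
import re
import tempfile
from pathlib import Path


def read_text(path):
    """Read a plain text file into a single string."""
    return Path(path).read_text(encoding="utf-8")


def tokenise(text):
    """Split text into word tokens and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Write a tiny sample document to a temporary folder, then load and tokenise it.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"  # hypothetical file name
    sample_path.write_text("Text mining is fun. Try it!", encoding="utf-8")
    tokens = tokenise(read_text(sample_path))

print(tokens)
```

A regular expression keeps the sketch short. Tokenisers shipped with NLP libraries handle contractions, abbreviations and similar cases more carefully.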

In the second session we show how the same processing is applied to much larger sets of text and add in visualisations. We use two data sets as examples: the Medical History of British India (National Library of Scotland, 2019), made available by the National Library of Scotland[4], and the inaugural addresses of all American Presidents from 1789 to 2017. We show how participants can create concordance lists, token frequency distributions in a corpus and over time, and lexical dispersion plots, and how they can perform regular expression searches using Python. In this session we also explain that textual data can be messy and that a lot of time can be spent on cleaning and preparing data in a way that is most useful for further analysis. For example, we point students to stop words and punctuation in the results and explain how to filter them out when creating frequency-based visualisations.
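The filtering, frequency counting, concordance and regular expression steps described for the second session might look roughly like this. It is a sketch under stated assumptions: it uses the standard library rather than the course's NLTK code, the stop word set is a tiny hand-written stand-in for a full list, and the sample sentence and the helper names `content_tokens` and `concordance` are invented for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop word set (an assumption; real lists are much longer).
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is"}


def content_tokens(text):
    """Lowercase and tokenise text, dropping punctuation and stop words."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [t for t in tokens if t.isalnum() and t not in STOP_WORDS]


def concordance(text, keyword, width=20):
    """Return every occurrence of keyword with `width` characters of context."""
    pattern = rf"\b{re.escape(keyword)}\b"
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        lines.append(text[start:end])
    return lines


sample = "The people of the nation and the government of the people."

# Token frequency distribution after filtering.
frequencies = Counter(content_tokens(sample))
print(frequencies.most_common(2))

# Keyword-in-context lines for one search term.
print(concordance(sample, "people", width=10))

# Regular expression search: every word starting with "gov".
print(re.findall(r"\bgov\w*", sample))
```

Without the filter, "the" and the full stop would dominate the counts, which is the effect the session asks students to notice. NLTK's `FreqDist` is built on the same counter idea and adds plotting helpers.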

During the third session we cover POS-tagging and named entity recognition. This last session concludes with a lesson on visualisations of text and derived data by means of text highlighting, frequency graphs, word clouds and networks (see some examples in Figure 1). The underlying NLP tools used for this course are NLTK 3 and spaCy, which are widely used for NLP research and development. This is also where we put some of the course material in the context of our own research to show how it can be applied in practice in a real project. For example, we mention our previous work on collecting topic-specific Twitter datasets for further analysis (Llewellyn et al., 2015), on geoparsing historical and literary text (Clifford et al., 2016; Alex et al., 2019a) and on named entity recognition for radiology reports (Alex et al., 2019b; Gorinski et al., 2019).
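POS-tagging and named entity recognition with spaCy can be sketched as below. The helper names `load_pipeline` and `tags_and_entities` are hypothetical, and the model name `en_core_web_sm` is an assumption, since the text does not say which model the course uses. Running it requires spaCy and that model to be installed (`pip install spacy`, then `python -m spacy download en_core_web_sm`).

```python
def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline. The default model name is an assumption."""
    import spacy  # imported here so the helper below can be read without spaCy installed

    return spacy.load(model_name)


def tags_and_entities(nlp, text):
    """Return (token, POS tag) pairs and (entity text, entity label) pairs."""
    doc = nlp(text)
    pos_tags = [(token.text, token.pos_) for token in doc]
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return pos_tags, entities
```

A typical call would be `nlp = load_pipeline()` followed by `pos_tags, entities = tags_and_entities(nlp, "The National Library of Scotland is in Edinburgh.")`. The exact tags and entities returned depend on the model version, so no output is shown here.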

Figure 1: Visualisations of text explorations created by the students.

In the two pilots, we ran this course over three afternoon sessions on Monday, Wednesday and Friday, with an office hour on the days in between to sort out any technical issues and answer questions. The main learning outcome is that by the end of the course the participants will have acquired initial TDM skills which they can use in their own research and build on by taking more advanced NLP courses or tutorials. A main goal of this course is to teach the material in a clear step-by-step way, so all Python code and examples are specific to each task and do not go in depth into complicated programming concepts, which we believe would confuse complete novices.

Authors:

(1) Amador Durán, SCORE Lab, I3US Institute, Universidad de Sevilla, Sevilla, Spain;

(2) Pablo Fernández, SCORE Lab, I3US Institute, Universidad de Sevilla, Sevilla, Spain;

(3) Beatriz Bernárdez, I3US Institute, Universidad de Sevilla, Sevilla, Spain;

(4) Nathaniel Weinman, Computer Science Division, University of California, Berkeley, Berkeley, CA, USA;

(5) Aslı Akalın, Computer Science Division, University of California, Berkeley, Berkeley, CA, USA;

(6) Armando Fox, Computer Science Division, University of California, Berkeley, Berkeley, CA, USA.

[4] https://data.nls.uk/data/digitised-collections/a-medical-history-of-british-india/
