Cloud-based textual analysis as a basis for document classification

Weir, George and Owoeye, Kolade and Oberacker, Alice and Alshahrani, Haya (2018) Cloud-based textual analysis as a basis for document classification. In: 2018 International Conference on High Performance Computing & Simulation (HPCS). IEEE, Piscataway, New Jersey, pp. 629-633. ISBN 9781538678787

Accepted Author Manuscript

Abstract

Growing trends in data mining and developments in machine learning have encouraged interest in analytical techniques that can contribute insights on data characteristics. The present paper describes an approach to textual analysis that generates extensive quantitative data on target documents, with output including frequency data on tokens, types, parts-of-speech and word n-grams. These analytical results enrich the available source data and have proven useful in several contexts as a basis for automating manual classification tasks. In the following, we introduce the Posit textual analysis toolset and detail its use in data enrichment as input to supervised learning tasks, including automating the identification of extremist Web content. Next, we describe the extension of this approach to the Arabic language. Thereafter, we recount the move of these analytical facilities from local operation to a Cloud-based service. This transition affords easy remote access for other researchers seeking to explore the application of such data enrichment to their own text-based data sets.
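
As a rough illustration of the kind of quantitative profile the abstract describes, the following minimal Python sketch counts tokens, types and word n-grams for a single document. It is not the Posit toolset and makes no claim about how Posit is implemented: the function names (`tokenize`, `ngrams`, `text_profile`), the regular-expression tokenizer and the `max_n` parameter are all assumptions made for this example, and part-of-speech frequencies are omitted because they would require a separate tagger.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a simple regex tokenizer; a real toolset may tokenize differently.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def text_profile(text, max_n=3):
    """Build a dictionary of frequency data for one document (hypothetical layout)."""
    tokens = tokenize(text)
    type_counts = Counter(tokens)  # one entry per distinct word form (type)
    profile = {
        "total_tokens": len(tokens),
        "total_types": len(type_counts),
        "type_token_ratio": len(type_counts) / len(tokens) if tokens else 0.0,
        "type_frequencies": type_counts,
    }
    # Word n-gram frequencies for n = 2 .. max_n
    for n in range(2, max_n + 1):
        profile[f"{n}gram_frequencies"] = Counter(ngrams(tokens, n))
    return profile


profile = text_profile("the cat sat on the mat and the cat slept")
print(profile["total_tokens"], profile["total_types"])
print(profile["2gram_frequencies"].most_common(1))
```

Numeric fields such as these could then be flattened into a feature vector per document and passed to a supervised classifier, which is the general sense in which the abstract speaks of data enrichment.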