Connecting firm's web scraped textual content to body of science : utilizing Microsoft Academic Graph hierarchical topic modeling
Hajikhani, Arash and Pukelis, Lukas and Suominen, Arho and Ashouri, Sajad and Schubert, Torben and Notten, Ad and Cunningham, Scott W. (2022) Connecting firm's web scraped textual content to body of science : utilizing Microsoft Academic Graph hierarchical topic modeling. MethodsX, 9. 101650. ISSN 2215-0161 (https://doi.org/10.1016/j.mex.2022.101650)
Filename: Hajikhani_etal_MethodsX_2022_Connecting_firms_web_scraped_textual_content_to_body_of_science.pdf
Final Published Version (571kB)
Abstract
This paper demonstrates a method to transform and link textual information scraped from companies' websites to the scientific body of knowledge. The method illustrates the benefit of Natural Language Processing (NLP) in linking established economic classification systems with the novel and agile constructs that new data sources enable. To that end, we experimented with the European classification of economic activities (known as NACE) at the sectoral and company levels, and established a connection to the Microsoft Academic Graph hierarchical topic modeling based on companies' website content. Central to the operationalization of our method are a web scraping process, NLP, and a data transformation/linkage procedure. The method contains three main steps: data source identification, raw data retrieval, and data preparation and transformation. These steps are applied to two distinct data sources.
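The preparation and linkage steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it swaps in plain keyword matching for the NLP stage, and the topic hierarchy, company name, page text, and sector code are all hypothetical placeholders rather than data from the paper or from Microsoft Academic Graph.

```python
import re
from collections import Counter

# Hypothetical two-level topic hierarchy: child topic -> parent field.
# Illustrative labels only, not taken from Microsoft Academic Graph.
TOPIC_PARENT = {
    "machine learning": "computer science",
    "battery": "materials science",
    "polymer": "materials science",
}


def clean_text(raw_html: str) -> str:
    """Strip tags, collapse whitespace, lowercase (stand-in for data preparation)."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", text).strip().lower()


def link_topics(text: str) -> Counter:
    """Count whole-phrase topic mentions and roll them up to parent fields."""
    counts = Counter()
    for topic, parent in TOPIC_PARENT.items():
        hits = len(re.findall(r"\b" + re.escape(topic) + r"\b", text))
        if hits:
            counts[topic] += hits
            counts[parent] += hits
    return counts


def sector_profile(company_counts: dict, company_sector: dict) -> dict:
    """Aggregate company-level topic counts to sector level (e.g. a NACE code)."""
    profile = {}
    for company, counts in company_counts.items():
        sector = company_sector[company]
        profile.setdefault(sector, Counter()).update(counts)
    return profile


# Usage with made-up inputs: one company page and an assumed sector code.
pages = {"acme": "<p>We build <b>battery</b> packs using machine learning.</p>"}
sectors = {"acme": "C27"}
company_counts = {name: link_topics(clean_text(html)) for name, html in pages.items()}
profiles = sector_profile(company_counts, sectors)
```

Keeping company-level counts separate from the sector roll-up mirrors the two levels of analysis the abstract mentions. A real implementation would replace the keyword matcher with a proper topic model.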
ORCID iDs
Hajikhani, Arash, Pukelis, Lukas, Suominen, Arho, Ashouri, Sajad, Schubert, Torben, Notten, Ad and Cunningham, Scott W. ORCID: https://orcid.org/0000-0001-7140-916X
Item type: Article
ID code: 79890
Dates:
10 March 2022: Published
10 March 2022: Published Online
22 February 2022: Accepted
Notes: Accepted manuscript available online 27 February 2022, Version of Record 10 March 2022.
Subjects: Political Science > Political science (General); Science > Mathematics > Computer software
Department: Faculty of Humanities and Social Sciences (HaSS) > Government and Public Policy > Politics
Depositing user: Pure Administrator
Date deposited: 15 Mar 2022 14:34
Last modified: 11 Nov 2024 13:25
URI: https://strathprints.strath.ac.uk/id/eprint/79890