Predicting Budget from Transportation Research Grant Description: An Exploratory Analysis of Text Mining and Machine Learning Techniques

Document Type: Regular Article

Authors

1 R&D, Contata Solutions, LLC, Minneapolis, Minnesota, USA

2 Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208, USA

3 Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA

Abstract

Funding agencies such as the U.S. National Science Foundation (NSF), the U.S. National Institutes of Health (NIH), and the Transportation Research Board (TRB) of The National Academies make their online grant databases publicly available, documenting a variety of information on grants funded over the past few decades. In this paper, based on a quantitative analysis of the TRB’s Research In Progress (RIP) online database, we explore the feasibility of automatically estimating the appropriate funding level for a transportation research project given its textual description. We use statistical Text Mining (TM) and Machine Learning (ML) techniques to build this model from the more than 14,000 records in the TRB’s RIP research grant database. Natural Language Processing (NLP) based text representation models, namely Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI), and Doc2Vec, are used to vectorize the project descriptions and generate semantic vectors. Each of these representations is then used to train supervised regression models such as Random Forest (RF) regression. Of the three latent feature generation models, we found that LDA gives the lowest Mean Absolute Error (MAE). However, based on the correlation coefficients, we found that it is not very feasible to accurately predict the funding level directly from the unstructured project abstract, given the large variations in source agencies, subject areas, and funding levels. When separate prediction models were used for different types of funding agencies, funding levels were better correlated with the project abstract.


1. Introduction

A key responsibility of doctoral students, researchers, and university faculty members is to actively write proposals to secure sponsored research grants. In targeting potential funding sources, proposal writers often look to recent funding levels and patterns over time for a given topic of research. In fact, it is often customary to work backwards from the budget through the abstract during final proposal efforts. It would therefore be advantageous to estimate the appropriate funding level for a given research topic submitted to a funding agency based on historical trends and patterns in research funding levels. This could help researchers roughly determine the level of funding they can expect for their research topic of interest and adjust the scope of their research plan accordingly. From the agency’s perspective, it can be beneficial in planning budget allocation levels across the research portfolio. Fortunately, funding agencies such as the U.S. National Science Foundation (NSF), U.S. National Institutes of Health (NIH), and the Transportation Research Board (TRB) of The National Academies make their online grant databases publicly available, documenting a variety of information on grants funded over the past few decades. In this study, based on a quantitative analysis of the TRB’s Research In Progress (RIP) online database, we present a novel framework to explore the feasibility of automatically estimating the appropriate funding level given the textual description of a transportation research project. We investigate this approach to help the scientific and transportation research community quickly gauge the estimated funding level for a research topic of interest.

The TRB’s Research in Progress (RIP) database is a public online repository that contains information on more than 14,000 (as of January 2017) current or recently completed transportation projects funded mostly by the U.S. Department of Transportation (DOT), State DOTs, and U.S. DOT funded university transportation research centers [1,2]. Users of the RIP website can search the entire database by various fields (keywords, title, etc.), browse records by subject category (administration and management, aviation, bridges and structures, construction, data and information technology, etc.), and download the records. These records contain a wealth of information from which knowledge can be extracted using appropriate Data Science (DS) and statistical Text Mining (TM) techniques.

In the reported literature, text mining features have been used to estimate various metrics such as budget, age, and cost. Foster et al. [3] used text mining features to estimate the price of a real estate property: their study built a regression model to predict the sale value of real estate from the property description text in its listing. Similarly, ‘authorship profiling’ from anonymous text, based on the application of Machine Learning (ML) to text categorization, is a growing field owing to its forensics and security applications [4]. In the authorship profiling problem, profile dimensions such as the author’s gender, age, native language, and personality are extracted from a given text of unknown authorship using style-based TM features [5,6]. For instance, Nguyen et al. [7] developed a linear regression model to predict an author’s age from text. Text mining features have also been useful in predicting movie revenues from movie reviews [8]. However, to the best of our knowledge, no existing study in the reported literature has developed a text mining based approach to predict the estimated total budget of scientific research projects, especially transportation research grants.

2. Dataset

The dataset used in this study comes from the TRB’s RIP database, a public online repository that contains information on more than 14,000 current or recently completed transportation projects funded mostly by government funding organizations [2]. In addition to funded transportation projects in the US, the RIP database also contains curated records from the International Transportation Research Documentation Database and the Canadian Surface Transportation Research Database. As of March 1, 2017, the TRB RIP database consisted of 14,184 research project records. However, not all project entries contain abstracts. For the current work, we used only those project records that had an abstract, and only those abstracts with a word count of at least 20 words; word counts were obtained by a space-delimited tokenization of the abstracts. In addition to text-based filtering, we also filtered on funding amount. During the data cleaning process (discussed elsewhere), discrepancies related to the funding amount were identified for some project records. Therefore, we restricted our analysis to projects whose funding amounts range from USD 10,000 to USD 5,000,000. Filtering out projects that do not meet the above criteria reduced the dataset to 10,255 project descriptions (abstracts) with their corresponding funding amounts. Before proceeding with our research approach, we first study the interactions between the various source agencies and the subject areas from 2012 to 2016 to better understand the funding invested per subject area and the number of projects funded per subject area by the source agencies.
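A minimal sketch of these filtering steps in Python is given below, assuming the RIP records have been exported to a CSV file with hypothetical column names 'abstract' and 'funding_usd':

    import pandas as pd

    rip = pd.read_csv("trb_rip_records.csv")   # 14,184 records as of March 1, 2017
    rip = rip.dropna(subset=["abstract"])      # keep only records with an abstract
    # space-delimited tokenization; keep abstracts of at least 20 words
    rip = rip[rip["abstract"].str.split().str.len() >= 20]
    # funding amount filter: USD 10,000 to USD 5,000,000
    rip = rip[rip["funding_usd"].between(10_000, 5_000_000)]
    # roughly 10,255 abstracts with corresponding funding amounts remain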

3. Interactions Between Source Agencies and Subject Areas

In this section, we discuss the results of our experiments studying agency and subject interactions from 2012 to 2016. The interactions are studied in terms of: (1) the amount of funding invested per subject area, and (2) the number of projects funded per subject area by the source agencies. In the interest of readability, we present results only for interactions of type (1).

Figure 1 and Figure 2 show how the top-10 funding agencies in 2012 and 2016 invested across the 37 subject areas. Similar charts were generated for 2013, 2014, and 2015 but are not included for the sake of brevity. In Figure 1 and Figure 2, we present the interaction matrix for only the top-10 funding agencies (ranked by total funding allocated). We have not included 2017 in this analysis because data for the current year were incomplete. Agency names in the figures are automatically abbreviated using the abbreviate() function in R.

 

Fig. 1. A heatmap representation of the interactions between top-10 source agencies and subject areas in 2012

 

Fig. 2. A heatmap representation of the interactions between top-10 source agencies and subject areas in 2016

Each cell in the interaction matrix denotes the total funding amount for that year between the agency and the subject area, with darker shades representing higher funding allocations. We study these interactions from 2012 to 2016 to observe any striking patterns. In this analysis, we leverage an unsupervised co-clustering algorithm to group the rows and columns for organized information visualization. We used the d3heatmap package of R, a D3.js based heatmap htmlwidget, to produce the heatmaps and perform the co-clustering. The co-clustering uses a hierarchical clustering algorithm, and the resulting groupings of cells are shown along the right and top sides of each map.
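The heatmaps in this paper were produced with R's d3heatmap; the sketch below is a rough Python equivalent using seaborn's clustermap, where 'matrix' is an assumed DataFrame of total funding with agencies as rows and subject areas as columns for a single year:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # hierarchically co-cluster rows (agencies) and columns (subject areas)
    g = sns.clustermap(matrix, method="average", cmap="Blues",
                       row_cluster=True, col_cluster=True)
    g.fig.suptitle("Agency vs. subject-area funding, 2012")
    plt.show()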

Table 1. Auto-generated abbreviations for source agencies in the RIP database [1]

    S.no  Abbreviation  Full name
    1     FDoT          Florida Department of Transportation
    2     MsDoT         Mississippi Department of Transportation
    3     MnDoT         Minnesota Department of Transportation
    4     MUTC          Mid-Atlantic Universities Transportation Center
    5     NHTSA         National Highway Traffic Safety Administration
    6     UDoTRaITA     U.S. Department of Transportation Research and Innovative Technology Administration
    7     FdHA          Federal Highway Administration
    8     AAoSH&TO      American Association of State Highway & Transportation Officials
    9     NCHRP         National Cooperative Highway Research Program
    10    WDoT          Wisconsin Department of Transportation
    11    IDoT          Illinois Department of Transportation
    12    SCDoT         South Carolina Department of Transportation
    13    ODoT          Ohio Department of Transportation
    14    RaITA         Research and Innovative Technology Administration
    15    FdAA          Federal Aviation Administration
    16    IlDoT         Illinois Department of Transportation
    17    TDoT          Texas Department of Transportation
    18    MDoT          Michigan Department of Transportation
    19    IwDoT         Iowa Department of Transportation
    20    NYSDoT        New York State Department of Transportation
    21    CDoT          California Department of Transportation
    22    ADoT          Arizona Department of Transportation
    23    GDoT          Georgia Department of Transportation
    24    NCDoT         North Carolina Department of Transportation

 

As shown in Figure 1, the Federal Aviation Administration (FdAA) invested the largest amount in the Aviation subject area in 2012; its other dominant funding allocations were in Vehicles and Equipment and Data and Information Technology, followed by Safety and Human Factors and Traffic Management. From this interaction matrix, we find that federal agencies (such as FdAA, RaITA, and FdHA) invested commonly in the four areas shown toward the right of the axis. On the other hand, state agencies such as FDoT, MsDoT, and MnDoT cluster in the areas shown toward the left of the figure (Pavements, Bridges, Maintenance, and Construction and Design). Highways is a common area in which both federal and state agencies invested significantly more than in other areas.

Between 2012 and 2016, the funding patterns changed significantly. In 2016 (Figure 2), nine of the top-10 funding agencies are state DoTs, and the Texas DoT made the highest investment, in Highways research projects. While most state DoTs concentrated their funds in Highways and Safety and Human Factors, FdHA, as a federal agency, spread its funds across various areas. One observation from this preliminary analysis of the interactions between source agencies and subject areas is that funding allocations tend to vary across subject areas over the years depending on geo-political factors (e.g., the passage of a highway bill) and policies. Between 2012 and 2016, Highways is one subject area that received a reasonably consistent level of funding from various state and federal agencies.

4. Research Approach

In this section, we discuss the proposed research approach, shown in Figure 3. It consists of three main steps: (1) a text-to-vector conversion module, (2) training supervised regression models, and (3) predicting the funding amount using the trained regression models. These steps are described in the following sections.

 

Fig. 3. A schematic of the proposed research approach.

4.1. Text to Vector conversion

The abstracts in the TRB’s RIP database summarize the project information in the form of unstructured text. We assume that the text of the project abstract captures the essential ideas of the project. However, to use this unstructured text to train a regression model and to estimate the funding level or total budget of a new project, the text documents must first be converted to numerical features describing the document.

We use vector space models to convert a text document into a variety of vector representations. In vector space models, a document is represented by a fixed-dimensional vector where each dimension represents a feature of the document [9,10]. In this work, we explore the use of latent vector space models, which aim to identify latent features of the document instead of generic word counts (as in the Bag-of-Words model) or weighted word frequencies (as in Term Frequency-Inverse Document Frequency, or TF-IDF). For a very large corpus, the Bag-of-Words and TF-IDF models produce very high-dimensional vector representations, which are ill-suited to a regression task. Therefore, we leverage latent vector space models, which reduce very high-dimensional document vectors to significantly fewer dimensions. The latent vector space models used in this study are described below; a code sketch illustrating all three follows the list.

  1. Latent Dirichlet Allocation (LDA) [11]: A generative probabilistic model that represents the documents in a corpus as finite mixtures over an underlying set of topics. Under this model, each document is a mixture of various topics, where each topic is characterized by a distribution over words.

Given a document text D from corpus C, we represent it with a k-dimensional vector obtained from LDA modeling, where k << |C| and |C| is the number of unique text tokens in the corpus.

  2. Latent Semantic Indexing (LSI) [12]: A mathematical model used to determine relationships between terms and “hidden” concepts in content. Starting from a TF-IDF or bag-of-words representation, LSI transforms the term-document matrix into term-concept and concept-document matrices using Singular Value Decomposition (SVD).

Given a document text D from corpus C, we represent it with an l-dimensional vector obtained from LSI modeling. In this case l << |C|.

  3. Doc2Vec [13]: An unsupervised framework that learns continuous distributed vector representations for pieces of text of variable length, ranging from sentences to documents. The name Paragraph Vector (the method is popularly known as Doc2Vec) emphasizes that it can be applied to variable-length pieces of text, anything from a phrase or sentence to a large document.

The approach for learning paragraph vectors (document vectors) is inspired by the methods for learning word vectors, in which the word vectors are asked to contribute to a prediction task about the next word in the sentence. Even though the word vectors are initialized randomly, they eventually capture semantics as an indirect result of the prediction task.

In the Doc2Vec framework, every document (paragraph) is mapped to a unique vector, represented by a column in matrix D, and every word is likewise mapped to a unique vector, represented by a column in matrix W. The document vector and word vectors are averaged or concatenated to predict the next word in a context.

Paragraph vectors also address some key weaknesses of bag-of-words models. First, they inherit an important property of word vectors: the semantics of the words. In this space, “powerful”, for instance, is closer to “strong” than to “Paris”. Second, paragraph vectors take word order into consideration, at least within a small context, in the same way that an n-gram model with a large n would.

Given a document text D from corpus C, we represent it with an m-dimensional vector obtained from Doc2Vec modeling. In this case m << |C|.
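The following is a minimal sketch, not our exact pipeline, of producing all three representations with the gensim library (v4); the two example abstracts and the dimension settings (k = l = m = 100) are illustrative assumptions:

    from gensim import corpora, models
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    abstracts = [
        "this project develops a new pavement design method for low volume roads",
        "the study evaluates bridge deck maintenance and rehabilitation strategies",
    ]
    tokenized = [a.lower().split() for a in abstracts]

    # bag-of-words corpus shared by LDA and LSI
    dictionary = corpora.Dictionary(tokenized)
    bow = [dictionary.doc2bow(doc) for doc in tokenized]

    # (1) LDA: each document becomes a k-dimensional topic-mixture vector
    lda = models.LdaModel(bow, id2word=dictionary, num_topics=100)
    lda_vecs = [lda.get_document_topics(d, minimum_probability=0.0) for d in bow]

    # (2) LSI: SVD of the TF-IDF matrix yields l-dimensional concept vectors
    tfidf = models.TfidfModel(bow)
    lsi = models.LsiModel(tfidf[bow], id2word=dictionary, num_topics=100)
    lsi_vecs = [lsi[tfidf[d]] for d in bow]

    # (3) Doc2Vec: each document is mapped to an m-dimensional paragraph vector
    tagged = [TaggedDocument(words, [i]) for i, words in enumerate(tokenized)]
    d2v = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=40)
    d2v_vecs = [d2v.dv[i] for i in range(len(tagged))]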

4.2. Predictive Modeling using Regression

As shown in Figure 3, the vector representations of the documents are used as features in a regression model. A regression model learns the relationship between a dependent variable (denoted by Y, or Y1, Y2, Y3, etc.) and a series of independent variables. In the proposed approach, the budget (the funds allotted to a transportation research project) is the target or dependent variable. Various regression models are used to learn the relationship between a project’s total budget and its text description (represented in vector form as described in the previous step).

Predictive modeling using regression analysis is accomplished in two phases: training and testing. In the training phase, we use a set of transportation research project records and their corresponding budgets to train the model on the relationship between the text features and the budget. In the testing phase, given a research project description represented by text features, its budget is estimated by the trained regression model.

5. Experiments

The experiments are carried out using a combination of Python, R, and WEKA [14]. The Waikato Environment for Knowledge Analysis (WEKA) is a suite of open-source ML software written in Java and developed at the University of Waikato, New Zealand.

For all experiments, we use a 10-fold cross-validation approach to validate the performance of the proposed approach. In k-fold cross-validation, the entire dataset is randomly partitioned into k equal-sized subsamples; a single subsample is used as validation data for testing the model, and the remaining k-1 subsamples are used for training. This process is repeated k times, ensuring that each of the k subsamples is used exactly once as validation data. k-fold cross-validation enforces the separation of training and testing sets and reduces the bias introduced by a single split of the data into training and test sets.

In this work, we used three popular regression models, namely, Random Forest (RF) [15], Decision Stump (DS) [16], and Linear Regression (LR) [17]. The default parameter settings available in the WEKA machine learning suite were employed for each of these regression models.

The models’ predictive performances are evaluated quantitatively by how closely their predictions match the actual outputs. A multi-criteria assessment with two goodness-of-fit statistics, computed between the actual and predicted values over all data vectors, is used to test the accuracy of the trained models.

We use two metrics to report and compare the performance of various regression models in the next section:

  • Mean Absolute Error (MAE): the mean of the absolute differences between the predicted and actual values of the dependent variable, which in the present case is the budget of the project.
  • Root Mean Squared Error (RMSE): the square root of the mean of the squared differences between the predicted and actual values of the dependent variable.
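An illustrative sketch of this evaluation protocol follows (our experiments used WEKA with default settings; scikit-learn defaults are assumed here): 10-fold cross-validated budget prediction from the latent document vectors, where X (documents by feature dimensions) and y (budgets in USD) are assumed to come from the vectorization step of Section 4.1:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_predict
    from sklearn.tree import DecisionTreeRegressor

    regressors = [("RF", RandomForestRegressor()),
                  ("DS", DecisionTreeRegressor(max_depth=1)),  # a decision stump
                  ("LR", LinearRegression())]
    for name, model in regressors:
        pred = cross_val_predict(model, X, y, cv=10)   # 10-fold cross-validation
        mae = np.mean(np.abs(pred - y))
        rmse = np.sqrt(np.mean((pred - y) ** 2))
        print(f"{name}: MAE = {mae:,.0f}  RMSE = {rmse:,.0f}")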

6. Results and discussion

In this section, we discuss the results of the performance evaluation of the three text mining feature generation approaches. We compare their performances in the following ways:

  1. Comparing feature length: In this analysis, we compare the performance of our regression approach while varying the length of the text feature vector. As shown in Figure 4(a-f), the length is varied from 10 to 300 latent feature dimensions, and the figures report the two metrics MAE and RMSE. As shown in Figure 4(a-b), for RF regression, the Doc2Vec feature generation approach has the lowest error at 10 dimensions, and the error increases as the dimensionality increases; for LDA, both MAE and RMSE decrease as the number of dimensions increases from 10 to 300. For DS regression, the effect of increasing the dimensionality fluctuates (Figure 4(c-d)). With the LR model, the errors (MAE and RMSE) decrease as the number of dimensions increases for LDA and LSI (Figure 4(e-f)).

 

Fig. 4. Comparing the text mining feature generation models by varying the feature length: (a) MAE and (b) RMSE for RF regression; (c) MAE and (d) RMSE for DS regression; (e) MAE and (f) RMSE for LR.

  2. Comparing regression models: In this section, we compare the regression models (Random Forest, Decision Stump, and Linear Regression) using the MAE metric. As shown in Figure 5, for Doc2Vec features, Random Forest regression gives the highest errors, whereas LR consistently gives lower errors.

 

Fig. 5. Box plot showing MAE for Doc2Vec Features for all Dimensions

For LDA features, we find that DS and LR consistently give higher errors than RF (Figure 6), the opposite of what we observed with the Doc2Vec features (discussed above).

 

Fig. 6. Box plot showing MAE for LDA Features for all Dimensions

For LSI, as shown in Figure 7, the variance of the error is large for all the regression models. However, the errors are lower for the LR model than for RF and DS regression.

 

Fig. 7. Box plot showing MAE for LSI Features for all Dimensions

 

  3. Comparing feature generation models (Doc2Vec, LDA, LSI): Finally, we compare the three feature generation approaches (Doc2Vec, LDA, and LSI) on their best performances in project budget prediction. Table 2 shows, for each feature generation model, the feature dimension and the regression model that give the lowest MAE and RMSE.

Table 2. Comparison of the best performances of Doc2Vec, LDA, and LSI.

                Doc2Vec                 LDA                     LSI
                Value     Dim   Model   Value     Dim   Model   Value     Dim   Model
    Min MAE     252,110   300   DS      237,884   300   RF      246,500   100   LR
    Min RMSE    471,199   300   LR      461,678   300   RF      455,152   300   LR

 

  4. Using pre-trained models with transfer learning: We also examine the effect of using pre-trained Doc2Vec models (models trained on external corpora), a method known as transfer learning. The pre-trained models have learned document embeddings from corpora that include words and concepts similar to those in our transportation document corpus. In this experiment, we used two pre-trained models, one based on Wikipedia and one on AP-News; both represent a document in a 300-dimensional space only. The performance of the pre-trained Doc2Vec prediction models is shown in Table 3. As the table shows, the RMSE is lowered by using the pre-trained models with the linear regression estimator. A code sketch of the inference step follows the table.

 

Table 3. Performance evaluation of Doc2Vec using pre-trained models with transfer learning.

                Wikipedia                     AP-News
                RF        DS        LR        RF        DS        LR
    MAE         267,234   256,765   255,303   269,009   254,996   254,642
    RMSE        471,915   483,100   468,097   472,428   482,271   468,895
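A hedged sketch of this transfer learning step: a Doc2Vec model pre-trained on an external corpus (the file name is hypothetical) infers a 300-dimensional vector for a new abstract, which a previously fitted regressor then consumes:

    from gensim.models.doc2vec import Doc2Vec

    pretrained = Doc2Vec.load("doc2vec_wikipedia_300d.model")  # assumed file name
    tokens = new_abstract.lower().split()
    vec = pretrained.infer_vector(tokens)       # 300 dimensions, as in the paper
    budget_estimate = lr_model.predict([vec])   # lr_model: a fitted LinearRegression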

 

  5. Budget estimation using a filtered dataset: The previous experiments showed that it is not very feasible to accurately predict the funding level directly from the unstructured project abstract, given the large variations in source agencies, subject areas, and funding levels (USD 10,000 to USD 5,000,000). In this experiment, we therefore conducted budget estimation for a filtered set of funding agencies: the Utah DoT and the Oklahoma DoT, together with all the projects they funded, a total of 200 projects. Additional filters were imposed: funding levels limited to between USD 10,000 and USD 5,000,000, and an abstract word count greater than 10. Table 4 summarizes the performance of the LSI-based budget estimation models on this filtered dataset. As shown in Table 4, the LSI model using LR gives a correlation (R) value of 0.635, while the MAE and RMSE are lowest for LSI using RF. The other text mining feature generation models (LDA and Doc2Vec) did not perform well in this experiment, so their results are not included. In comparison to the results of the previous experiments, we find a significant improvement in the prediction performance of the LSI model. Thus, using a separate model for each type of funding agency appears to be the better approach for budget estimation. A sketch of the correlation computation follows Table 4.

Table 4. Performance of LSI-based budget estimation models using the filtered dataset

    Model   Correlation (R)   MAE       RMSE
    RF      0.5831            338,159   626,732
    LR      0.635             382,606   691,207
    DS      0.5033            388,325   660,978
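A sketch of the correlation check on the filtered per-agency dataset, assuming X_f and y_f hold the LSI vectors and budgets of the roughly 200 filtered projects: Pearson R is computed between cross-validated predictions and actual budgets:

    from scipy.stats import pearsonr
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_predict

    pred = cross_val_predict(LinearRegression(), X_f, y_f, cv=10)
    r, _ = pearsonr(pred, y_f)
    print(f"Correlation (R) = {r:.3f}")   # ~0.635 reported for LSI with LR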

7. Conclusions

We proposed a fully automated machine learning approach to predict the approximate budget of a research project given a short description of the project. To this end, we investigated latent feature generation models such as Doc2Vec, LDA, and LSI to convert a project’s text description into numeric features, which are then used by machine learning regression models to predict the project budget. We tested the proposed approach on the TRB RIP database containing approximately 14,000 project entries and their approved budget information, with budgets ranging from USD 10,000 to USD 5,000,000. The proposed approach was quantitatively validated through various experiments, with the following significant findings:

  • Of the three latent feature generation models, LDA gives the lowest MAE, using 300 feature dimensions and the Random Forest regression model.
  • LSI gives the lowest RMSE, using 300 feature dimensions and the Linear Regression model.
  • However, based on the correlation coefficients, it is not very feasible to accurately predict the funding level directly from the unstructured project abstract, given the large variations in source agencies, subject areas, and funding levels.
  • By using separate prediction models for different types of funding agencies, funding levels were better correlated with the project abstract.
  • The proposed approach provides a framework that can be built into a tool to help the research community estimate the budget or funding level for their research projects.

[1] Daly, J. TRB Webinar: Learning About and Using the Research in Progress (RiP) Database. Available from internet: <http://www.trb.org/ElectronicSessions/Blurbs/174599.aspx>; 2016.

[2] Gopalakrishnan, K., Khaitan, S. K. Text Mining Transportation Research Grant Big Data: Knowledge Extraction and Predictive Modeling Using Fast Neural Nets. International Journal for Traffic and Transport Engineering, Vol. 7, No. 3, pp. 354-367, 2017.

[3] Foster, DP, Liberman, M, and Stine, RA. Featurizing Text: Converting Text into Predictors for Regression Analysis. The Wharton School of the University of Pennsylvania, Philadelphia, PA; 2013.

[4] Argamon, S., Koppel, M., Pennebaker, JW, Schler, J. Automatically profiling the author of an anonymous text. Commun ACM 52, 2009, 119–123.

[5] Schwartz, HA, Eichstaedt, JC, Kern, ML, Dziurzynski, L, Ramones, SM, Agrawal, M, et al. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLoS ONE 8(9), 2013, e73791. 

[6] Rosenthal, S. and McKeown, K. Age prediction in blogs: A study of style, content, and online behavior in pre-and post-social media generations. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics; 2011, pp. 763-772.

[7] Nguyen, D, Smith, NA and Rosé, CP. Author age prediction from text using linear regression. Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Association for Computational Linguistics; 2011, pp. 115-123.

[8] Joshi, M, Das, D, Gimpel, K and Smith, NA. Movie reviews and revenues: An experiment in text regression. Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics; 2010, pp. 293-296.

[9] Singhal, Ayush, Ravindra Kasturi, and Jaideep Srivastava. Automating document annotation using open source knowledge. In Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)-Volume 01, pp. 199-204. IEEE Computer Society, 2013.

[10] Singhal, Ayush, and Jaideep Srivastava. Research dataset discovery from research publications using web context. In Web Intelligence, vol. 15, no. 2, pp. 81-99. IOS Press, 2017.

[11] Blei, DM, Ng, AY, and Jordan, MI. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 2003, pp. 993-1022.

[12] Landauer, T. K. Latent semantic analysis. John Wiley & Sons, Ltd., Hoboken, NJ; 2006.

[13] Le, Q, and Mikolov, T. Distributed representations of sentences and documents. Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1188-1196.

[14] Frank, E, Hall, MA, and Witten, IH. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques", Morgan Kaufmann, Fourth Edition, 2016.

[15] Breiman, L. Random forests. Machine learning, 45(1), 2001, pp. 5–32.

[16] Holte, RC. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11, 1993, pp. 63-91.

[17] Lai, TL, Robbins, H., and Wei, CZ. Strong consistency of least squares estimates in multiple regression II. Journal of Multivariate Analysis, 9, 1979, pp. 343-361.