Document Type : Regular Article
Authors
1 R&D, Contata Solutions, LLC, Minneapolis, Minnesota, USA
2 Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208, USA
3 Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA
A key responsibility of doctoral students, researchers, and university faculty members is to actively write proposals to secure sponsored research grants. In targeting potential funding sources, research proposal writers often look to recent funding levels and patterns over time for a given research topic. In fact, it is often customary to work backwards from the budget through the abstract during final proposal efforts. Thus, it would be advantageous to estimate the appropriate funding level for a given research topic submitted to a funding agency based on historical trends and patterns in research funding levels. This could help researchers roughly determine the level of funding they can expect for their research topic of interest and adjust the scope of their research plan accordingly. From the agency's perspective, this can be beneficial in planning budget allocation levels across its research portfolio. Fortunately, funding agencies such as the U.S. National Science Foundation (NSF), the U.S. National Institutes of Health (NIH), and the Transportation Research Board (TRB) of The National Academies make their online grant databases publicly available, documenting a variety of information on grants funded over the past few decades. In this study, we present a novel framework to explore the feasibility of automatically estimating the appropriate funding level, given the textual description of a transportation research project, based on a quantitative analysis of the TRB's Research in Progress (RIP) online database. We investigate this approach to help the scientific and transportation research community quickly gauge the estimated funding level for a research topic they are interested in exploring.
The TRB’s Research in Progress (RIP) database is a public online repository that contains information on more than 14,000 (as of January 2017) current or recently completed transportation projects funded mostly by the U.S. Department of Transportation (DOT), state DOTs, and U.S. DOT-funded university transportation research centers [1,2]. Users of the RIP website can search the entire RIP database by various fields (keywords, title, etc.), browse records by subject category (administration and management, aviation, bridges and structures, construction, data and information technology, etc.), and download the records. These records contain a wealth of useful information from which knowledge can be extracted using appropriate Data Science (DS) and statistical Text Mining (TM) techniques.
In the reported literature, text mining features have been used to estimate various important quantities such as budget, age, and cost. Foster et al. [3] used text mining features to estimate the price of a real estate property. Their study focused on building a regression model to predict the price of a property from its listing; i.e., the property description text was used to obtain information about its sale value. Similarly, ‘authorship profiling’ from anonymous text, based on the application of Machine Learning (ML) to text categorization, is a growing field of importance owing to its forensics and security applications [4]. In the authorship profiling problem, profile dimensions such as author gender, age, native language, and personality are extracted from a given text of unknown authorship using style-based TM features [5,6]. For instance, Nguyen et al. [7] developed a linear regression model to predict an author's age from unknown text. Text mining features have also been useful in predicting movie revenues from movie reviews [8]. However, to the best of our knowledge, there is no existing study in the reported literature focusing on developing a text-mining-based approach to predict the estimated total budget of scientific research projects, especially transportation research grants.
The dataset used in this study comes from the TRB’s RIP database, a public online repository that contains information on more than 14,000 current or recently completed transportation projects funded mostly by government funding organizations [2]. In addition to funded transportation projects in the US, the RIP database also contains curated records from the International Transportation Research Documentation Database and the Canadian Surface Transportation Research Database. As of March 1, 2017, the TRB RIP database consists of 14,184 research project records. However, not all project entries contain abstracts. For the current work, we used only those project records that had an abstract. Further, only abstracts with a word count of at least 20 words were included; the word count was obtained by a space-delimited tokenization of the abstract. In addition to text-based filtering, we also performed funding-amount-based filtering. During the data cleaning process discussed elsewhere, some discrepancies related to the funding amount were identified for some project records. Therefore, we restrict our analysis to projects whose funding amount ranges from USD 10,000 to 5,000,000. Filtering out projects that do not meet the above-mentioned criteria reduced the dataset to 10,255 project descriptions (abstracts) with their corresponding funding amounts. Before proceeding with our research approach, we first study the interactions between the various source agencies and the subject areas from 2012 to 2016 to better understand the funding invested per subject area and the number of projects funded per subject area by the source agencies.
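The filtering steps above can be sketched in a few lines of Python (a minimal sketch; the record field names `abstract` and `funding` are hypothetical, not the actual RIP export schema):

```python
def filter_records(records, min_words=20, min_fund=10_000, max_fund=5_000_000):
    """Keep only records with an abstract of at least `min_words`
    space-delimited tokens and a funding amount inside the allowed range."""
    kept = []
    for rec in records:
        abstract = rec.get("abstract") or ""
        tokens = abstract.split()  # space-delimited tokenization
        if len(tokens) < min_words:
            continue
        fund = rec.get("funding")
        if fund is None or not (min_fund <= fund <= max_fund):
            continue
        kept.append(rec)
    return kept
```

Applied to the 14,184 raw records, filters of this kind yield the 10,255 abstracts used in the study.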
In this section, we discuss the results of our experiments conducted to study agency and subject interactions from 2012 to 2016. The interactions are studied in terms of: (1) the amount of funding invested per subject area, and (2) the number of projects funded per subject area by the source agencies. In the interest of readability of the manuscript, we choose to present our results in this manuscript only for the interactions of type (1).
Figure 1 and Figure 2 show how the top-10 funding agencies in the years 2012 and 2016 invested across the 37 subject areas. Although similar charts were generated for the years 2013, 2014, and 2015, they are not included for the sake of brevity. In Figure 1 and Figure 2, we present the interaction matrix for only the top-10 funding agencies (ranked by total funding allocated). We have not included the year 2017 in this analysis because its data are incomplete. Agencies in the figures are automatically abbreviated using the ‘abbreviate’ package in R.
Fig. 1. A heatmap representation of the interactions between top-10 source agencies and subject areas in 2012
Fig. 2. A heatmap representation of the interactions between top-10 source agencies and subject areas in 2016
Each cell in the interaction matrix denotes the total funding amount for that year between the agency and the subject area. Darker shades represent higher funding allocations. We study these interactions from year 2012 to 2016 to observe any striking patterns from this analysis. In the current analysis, we leverage an unsupervised co-clustering algorithm to cluster the rows and columns into groups for organized information visualization. We used the d3heatmap package of R for producing the heatmaps and performing the co-clustering. It is a D3.js based heatmap htmlwidget for R. The co-clustering was done using the hierarchical clustering algorithm. The clustering (grouping of various cells) is shown along the right and top sides of the map.
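The aggregation behind each heatmap cell can be sketched as follows (field names are illustrative, not the actual RIP export schema):

```python
from collections import defaultdict

def interaction_matrix(records, year):
    """Sum the funding per (agency, subject) pair for one year; each
    resulting entry corresponds to one cell of the heatmap."""
    matrix = defaultdict(float)
    for rec in records:
        if rec["year"] != year:
            continue
        matrix[(rec["agency"], rec["subject"])] += rec["funding"]
    return matrix
```

The resulting matrix is then row- and column-clustered (hierarchical clustering) before rendering.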
Table 1. Auto-generated abbreviations for source agencies in the RIP database [1]

S.no | Abbreviation | Full name | S.no | Abbreviation | Full name
1 | FDoT | Florida Dept. of Transportation | 13 | ODoT | Ohio Department of Transportation
2 | MsDoT | Mississippi Dept. of Transportation | 14 | RaITA | Research and Innovative Technology Administration
3 | MnDoT | Minnesota Dept. of Transportation | 15 | FdAA | Federal Aviation Administration
4 | MUTC | Mid-Atlantic Universities Transportation Center | 16 | IlDoT | Illinois Department of Transportation
5 | NHTSA | National Highway Traffic Safety Administration | 17 | TDoT | Texas Department of Transportation
6 | UDoTRaITA | U.S. Department of Transportation Research and Innovative Technology Administration | 18 | MDoT | Michigan Department of Transportation
7 | FdHA | Federal Highway Administration | 19 | IwDoT | Iowa Department of Transportation
8 | AAoSH&TO | American Association of State Highway & Transportation Officials | 20 | NYSDoT | New York State Department of Transportation
9 | NCHRP | National Cooperative Highway Research Program | 21 | CDoT | California Department of Transportation
10 | WDoT | Wisconsin Department of Transportation | 22 | ADoT | Arizona Department of Transportation
11 | IDoT | Illinois Department of Transportation | 23 | GDoT | Georgia Department of Transportation
12 | SCDoT | South Carolina Department of Transportation | 24 | NCDoT | North Carolina Department of Transportation
As shown in Figure 1, the Federal Aviation Administration (FdAA) invested the maximum amount in the Aviation subject area in 2012, and its other dominant funding allocations have been in the areas of Vehicles and Equipment and Data and Information Technology, followed by Safety and Human Factors and Traffic Management. From this interaction matrix, we find that federal agencies (such as FdAA, RaITA, and FdHA) have commonly invested in the four areas shown toward the right of the axis. On the other hand, state agencies such as FDoT, MsDoT, and MnDoT show a clustering in the areas toward the left of the figure (Pavements, Bridges, Maintenance, and Construction and Design). Finally, Highways is a common area in which both federal and state agencies have invested significantly more than in other areas.
Between 2012 and 2016, the funding patterns seem to have changed significantly. In 2016 (Figure 2), 9 of the top-10 funding agencies are state DoTs. Texas DoT made the highest investment in Highways research projects in 2016. While most state DoTs concentrated their funds in the areas of Highways and Safety and Human Factors, FdHA, being a federal agency, spread its funds across various areas. One observation from this preliminary analysis of interactions between source agencies and subject areas is that funding allocations tend to vary with subject areas over the years depending on geopolitical factors (e.g., the passing of a highway bill) and policies. Between 2012 and 2016, Highways is one subject area that has received a reasonably consistent level of funding from various state and federal agencies.
In this section, we discuss the proposed research approach, shown in Figure 3. The proposed approach consists of three main steps: (1) a text-to-vector conversion module, (2) training supervised regression models, and (3) predicting the funding amount using the trained regression models. These steps are described in the following sections.
Fig. 3. A schematic of the proposed research approach.
The abstracts in the TRB’s RIP database summarize the project information in the form of unstructured text. We assume that the text in the project abstract captures the essential ideas of the project. However, to use this unstructured text for training a regression model and for estimating the funding level or total budget of a new project, the text documents must first be converted to numerical features describing each document.
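A classical way to do this conversion is bag-of-words or TF-IDF vectorization. The stdlib-only sketch below illustrates the idea, and in particular why each document vector then has one dimension per unique corpus token:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors); each vector has one
    dimension per unique token in the corpus, so the dimension
    grows with the vocabulary size."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # document frequency: number of documents containing each token
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors
```

For a corpus of 10,255 abstracts the vocabulary easily reaches tens of thousands of tokens, which is why the latent models discussed next reduce the dimensionality before regression.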
We use various vector space models to convert a text document into a variety of vector representations. In vector space models, a document is represented by a fixed-dimensional vector where each dimension represents a feature of the document [9,10]. In this work, we explore the use of latent vector space models, which aim to identify latent features of the document instead of raw word counts (as in the Bag-of-Words model) or weighted word frequencies (as in Term Frequency-Inverse Document Frequency, or TF-IDF). For a very large corpus, the Bag-of-Words and TF-IDF models produce very high-dimensional vector representations of documents. Such high-dimensional, sparse features are poorly suited to a regression task, since the number of features can far exceed the number of training examples. Therefore, we leverage latent vector space models, which reduce very high-dimensional document vectors to significantly fewer dimensions. The latent vector space models used in this study are described below:
Given a document text D from corpus C, we represent it with a k-dimensional vector obtained from LDA modeling. In this case, k << |C|, where |C| is the number of unique text tokens in the corpus.
Given a document text D from corpus C, we represent it with an l-dimensional vector obtained from LSI modeling. In this case, l << |C|.
The approach for learning paragraph vectors (document vectors) is inspired by the methods for learning word vectors. The intuition is that the word vectors are asked to contribute to a prediction task about the next word in the sentence; so, even though the word vectors are initialized randomly, they eventually capture semantics as an indirect result of the prediction task.
In the Doc2Vec framework, every document text (paragraph) is mapped to a unique vector, represented by a column in matrix D, and every word is also mapped to a unique vector, represented by a column in matrix W. The document vector and word vectors are averaged or concatenated to predict the next word in a context.
Paragraph vectors also address some of the key weaknesses of bag-of-words models. First, they inherit an important property of the word vectors: the semantics of the words. In this space, “powerful”, for instance, is closer to “strong” than to “Paris.” The second advantage of the paragraph vectors is that they take into consideration the word order, at least in a small context, in the same way that an n-gram model with a large n would do.
Given a document text D from corpus C, we represent it with an m-dimensional vector obtained from Doc2Vec modeling. In this case, m << |C|.
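As a toy illustration (not the actual training procedure), the averaging step described above combines the paragraph vector with the context word vectors like this:

```python
def doc2vec_context(doc_vec, word_vecs):
    """Average the paragraph vector with the context word vectors;
    in Doc2Vec, the result is fed to the classifier that predicts
    the next word (the training loop itself is omitted here)."""
    vecs = [doc_vec] + word_vecs
    dim = len(doc_vec)
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

Because the paragraph vector participates in every context of its document, it ends up encoding document-level semantics after training.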
As shown in Figure 3, the vector representations of the documents are used as features in a regression model. A regression model learns the relationship between a dependent variable (denoted by Y, or Y1, Y2, Y3, etc., when there are several) and a set of independent variables. In the proposed approach, the budget, i.e., the funds allotted to a transportation research project, is the target (dependent) variable. Various regression models are used to learn the relationship between a project's total budget and the associated text description (represented in vector form as described in the previous step).
Predictive modeling using regression analysis is accomplished in two phases: training and testing. In the training phase, we use a set of transportation research project records and their corresponding budget to train the model about the relationship between the text features and the budget. In the testing phase, given a research project description represented by text features, its budget is estimated as the prediction of the regression model.
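The two phases can be sketched with a one-feature least-squares regressor; in the actual approach, the features are the latent document vectors and the target is the project budget:

```python
def fit_ols(xs, ys):
    """Training phase: ordinary least squares for a single feature.
    Returns the fitted (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def predict(model, x):
    """Testing phase: estimate the target for a new feature value."""
    slope, intercept = model
    return slope * x + intercept
```

The same fit/predict split carries over directly to the multivariate regressors used in the experiments.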
The experiments are carried out using a combination of Python, R, and WEKA [14] frameworks. The Waikato Environment for Knowledge Analysis (WEKA) is a suite of open-source ML software developed at the University of Waikato, New Zealand, and written in Java.
For all the experiments, we use a 10-fold cross-validation approach to validate the performance of the proposed approach. In k-fold cross-validation, the entire dataset is randomly partitioned into k equal-sized subsamples; a single subsample is used as the validation data for testing the model, and the remaining k-1 subsamples are used for training. This process is repeated k times so that each of the k subsamples is used exactly once as validation data. k-fold cross-validation ensures the separation of training and testing sets and removes the bias introduced by a single train/test split.
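The k-fold split generation can be sketched as:

```python
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k (near-)equal folds.
    Yields (train, validation) index lists; each fold serves exactly
    once as the validation set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, validation
```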
In this work, we used three popular regression models, namely Random Forest (RF) [15], Decision Stump (DS) [16], and Linear Regression (LR) [17]. The default parameter settings available in the WEKA machine learning suite were employed for each of these regression models.
Quantitative assessments of how closely the models' predictions match the actual outputs are used to evaluate the models' predictive performance. A multi-criteria assessment with two goodness-of-fit statistics was performed using all the data vectors to test the accuracy of the trained models. The criteria employed are the Mean Absolute Error (MAE) and the Root-Mean-Squared Error (RMSE) between the actual and predicted values.
We use two metrics to report and compare the performance of various regression models in the next section:
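A minimal implementation of the two metrics:

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average magnitude of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-Mean-Squared Error: penalizes large errors more heavily than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```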
In this section, we discuss the results of the performance evaluation of the three text mining feature generation approaches. We compare their performances in the following ways:
Fig. 4. Comparing various text mining feature generation models by varying the feature length.
Fig. 5. Box plot showing MAE for Doc2Vec Features for all Dimensions
For LDA features, we find that DS and LR consistently give higher error compared to RF (Figure 6). We observed an opposite phenomenon with the Doc2Vec features (discussed above).
Fig. 6. Box plot showing MAE for LDA Features for all Dimensions
For LSI, as shown in Figure 7, the variance of error is large for all the regression models. However, the errors are lower for the LR model compared to RF and DS regression.
Fig. 7. Box plot showing MAE for LSI Features for all Dimensions
Table 2. Comparing the best performances of Doc2Vec, LDA and LSI.

         | Doc2Vec               | LDA                   | LSI
         | Value   | Dim | Model | Value   | Dim | Model | Value   | Dim | Model
Min MAE  | 252,110 | 300 | DS    | 237,884 | 300 | RF    | 246,500 | 100 | LR
Min RMSE | 471,199 | 300 | LR    | 461,678 | 300 | RF    | 455,152 | 300 | LR
Table 3. Performance evaluation of Doc2Vec using pre-trained models with transfer learning.

     | Wikipedia                   | AP-news
     | RF      | DS      | LR      | RF      | DS      | LR
MAE  | 267,234 | 256,765 | 255,303 | 269,009 | 254,996 | 254,642
RMSE | 471,915 | 483,100 | 468,097 | 472,428 | 482,271 | 468,895
Table 4. Performance of LSI-based budget estimation models using the filtered dataset

Model | Correlation (R) | MAE     | RMSE
RF    | 0.5831          | 338,159 | 626,732
LR    | 0.6350          | 382,606 | 691,207
DS    | 0.5033          | 388,325 | 660,978
We proposed a fully automated machine learning approach to predict the approximate budget of a research project, given a short description of the project. To this end, we investigated latent feature generation models such as Doc2Vec, LDA, and LSI to convert a project's text description into numeric features, which are then used in machine learning regression models to predict the project budget. We tested the proposed approach on the TRB database containing approximately 14,000 project entries and their approved budget information, ranging from USD 10,000 to 5,000,000. Finally, the proposed approach was quantitatively validated using various experiments. The following are the significant findings: