This section presents the framework for data quality assessment. The framework is structured as a sequence of steps for calculating the data quality measures and assessing the quality (Figure 3-1). The framework takes into account the fact that there are different types of traffic data and different customers and users. The data quality assessment approach is determined by the type of application and the type or source of traffic data. The framework identifies three main types of traffic data for which to calculate data quality:
The framework also recognizes that traffic data is used for different applications. As such, the needs and quality requirements are different for the different data customers and applications. Table 3-1 shows the range of data consumers, types of data, and possible applications.
| Data Consumers or Users | Types of Data | Applications or Uses |
|---|---|---|
| Traffic operators (of all stripes) | Original source data; archived source data | Traffic management; incident management |
| Archived data administrators | Original source data | Database administration |
| Archived data users (planners and others) | Original source data; archived source data; archived processed data | Analysis; planning; modeling (development and calibration) |
| Traffic data collectors | Original source data; archived source data | Traffic monitoring; equipment calibration; data collection planning |
| Information service providers | Original source data (real time) | Dissemination of traveler information |
| Travelers | Traveler information | Pre-trip planning |
Figure 3-1. Structure of Framework
The following sections present descriptions of the various components of the data quality assessment framework shown in Figure 3-1.
The first step in assessing the quality of data is to determine the type of application or data consumer for which the data is intended. This is important because the type of application or data consumer determines the type of data and thus the methods of calculating the quality measures and the thresholds for evaluating the quality of data. Therefore, each agency measuring data quality will have to know their customers. The following are typical primary data consumers or customers whose perspectives should be represented in calculating data quality measures. The terms data consumers and customers are used interchangeably throughout this document.
As part of the Traffic Data Quality Workshop project, a white paper titled "Defining and Measuring Traffic Data Quality"2 was developed. This paper reviews current data quality measurement practices in traffic data collection and monitoring; introduces data quality approaches and measures from other disciplines; and recommends approaches to define and measure traffic data quality. The six recommended fundamental measures of traffic data quality are defined below:
These six data quality measures constitute reasonable "categories," but the actual definition or calculation of the measures could vary by application or data user. It is acceptable (and even desirable) to have slightly different measure calculation procedures for different applications or groups of users, as the original source traffic data will likely undergo numerous transformations or other changes as it goes from field data collection equipment to data/information consumer. Thus, the original source data changes as it is collected, transformed, and disseminated, and consequently the data quality is also likely to change on its way to the end consumer.
The next step is to set the threshold values for the data quality measures of interest. It is expected that there will be different threshold values for the same measure depending on the application or the data consumer. These thresholds should reflect the acceptable quality based on the data user's needs and applications. Depending on the user and application, data quality measures falling outside the thresholds could be unacceptable for the intended application or an indication that the data ought to be used with caution.
This section presents the methods for calculating the six data quality measures.
Accuracy is defined as "the measure or degree of agreement between a data value or set of values and a source assumed to be correct." Accuracy can be expressed using one of the following three error quantities. Note that in each of these error formulations, the error is the difference between the observed value(s) and the reference (i.e., ground truth) value, and percent error is the error divided by the reference value.
Equation 1. Mean absolute percent error (MAPE):

    MAPE = (100 / n) × Σ |xi − xreference| / xreference

Equation 2. Mean signed percent error:

    Signed % error = (100 / n) × Σ (xi − xreference) / xreference

Equation 3. Root mean squared error (RMSE):

    RMSE = √[(1/n) × Σ (xi − xreference)²]

where, in each equation:

xi = the observed data value
xreference = the reference value
n = the total number of observed data values

and the summation is taken over all n observed values.
The RMSE can also be expressed as a percentage value (e.g., % RMSE). When so specified, the % RMSE is the RMSE divided by the average of all reference data values.
These different error formulations are all valid measures of accuracy but support slightly different interpretations. The mean absolute percent error (MAPE) and signed error are expressed as percentages; thus, these formulations may be used to compare the relative accuracy of different attributes (e.g., traffic volume count and speed measurement accuracy). Because the signed error does not use absolute error values (as MAPE does), the signed error formulation may reveal whether there is a consistent bias in measurements. The root mean squared error (RMSE) is commonly available in many statistical software applications.
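The three error formulations above can be sketched in Python. This is a minimal illustration; the function names and the sample volume counts are illustrative, not from the report:

```python
import math

def mape(observed, reference):
    """Mean absolute percent error (Equation 1), as a percentage."""
    n = len(observed)
    return 100.0 / n * sum(abs(o - r) / r for o, r in zip(observed, reference))

def signed_percent_error(observed, reference):
    """Mean signed percent error (Equation 2); a nonzero result suggests bias."""
    n = len(observed)
    return 100.0 / n * sum((o - r) / r for o, r in zip(observed, reference))

def rmse(observed, reference):
    """Root mean squared error (Equation 3), in the units of the data."""
    n = len(observed)
    return math.sqrt(sum((o - r) ** 2 for o, r in zip(observed, reference)) / n)

def percent_rmse(observed, reference):
    """% RMSE: RMSE divided by the mean of the reference values."""
    mean_ref = sum(reference) / len(reference)
    return 100.0 * rmse(observed, reference) / mean_ref

# Hourly volume counts vs. manual ground-truth counts (hypothetical values)
obs = [1020, 980, 1100, 950]
ref = [1000, 1000, 1000, 1000]
print(round(mape(obs, ref), 2))                  # 4.75 (mean of 2, 2, 10, 5)
print(round(signed_percent_error(obs, ref), 2))  # 1.25 (2 - 2 + 10 - 5, over 4)
```

Note how the signed error (1.25 percent) is much smaller than the MAPE (4.75 percent) here: the individual errors are sizable but largely cancel, which is exactly the bias-versus-scatter distinction described above.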
As its definition indicates, accuracy requires "…a source (of data) assumed to be correct." This correct source of data is typically referred to as ground truth, reference, or baseline measurements. Ground truth data can be collected in several different ways for each traffic data element. In many cases, ground truth data are collected from specialized equipment and reduced in a rigorous manner that minimizes error. For example, consider the case of collecting ground truth data for traffic volume counts from inductance loop detectors. For the ground truth data, one could record video of the same traffic flows measured by the loop detectors, and then have two different human observers count the number of vehicles during the specified test period. If the ground truth vehicle counts from both human observers are within a specified tolerance (e.g., 1% to 3%), one could assume that the average of these two manual counts represents the ground truth vehicle count.
Another common method for establishing ground truth is to perform rigorous and routine calibration of data collection equipment, and then assume that the data from the calibrated equipment represents ground truth. For example, one might calibrate an inductance loop detector on a weekly basis, and then use this loop detector data as ground truth to evaluate other non-intrusive detection devices. However, it should be noted that calibration is specific to the type and model of the equipment. Comparison across different types (such as microwave radar detectors versus loops, or microloops versus loops) can distort results.
The following reports document obtaining ground truth or reference measurements for traffic data:
Accuracy tests should be performed on usable data from working sensors. In addition to the suggested accuracy measures, quick response or qualitative measures are also needed by data consumers such as TMCs to monitor the performance of detectors. These quick response methods could be graphs showing performance of the detector over time which would indicate any systematic data biases and suggest a need for calibration.
Completeness is defined as "the degree to which data values are present in the attributes that require them." Completeness can be expressed using a percentage (see Equation 4). The equation expresses the available number of data values as a percent of the number of total expected data values.
    % complete = (navailable values / ntotal expected) × 100     (Equation 4)

where:

navailable values = the number of records or rows with available values present
ntotal expected = the total number of records or rows expected
The number of data records expected is a function of the application. For example, state DOTs need at least two weeks' worth of data in a month to calculate AADT from automatic traffic recorders (ATRs). The same DOT might require 30 days of data from ATRs for seasonal adjustment factor calculations. From a TMC standpoint, however, while some data losses can be acceptable, a whole day's worth of incomplete data can be problematic.
The percent complete statistic is defined to include all "values present". In this respect, completeness is defined as including both valid and invalid data values (validity is discussed in Section 3.4.3), as long as both types of data values are present in the version of data being evaluated. However, if a particular data process removes invalid data values from a database instead of flagging them as invalid and permanently storing them, then these purged invalid data values would not be included in the completeness statistic because they are not "present".
The quantities in the percent complete equation can be further specified beyond the example shown here. For example, consider that data analysts may wish to know the percent of data that has actually been measured versus the percent of data that has been estimated. In such a case, one could specify two separate completeness measures: percent complete as defined in Equation 4, and a modified percent complete that counts only directly measured data in the numerator. For example, consider that a particular dataset is 80 percent complete, but only 20 percent complete when counting only directly measured data. These statistics would indicate that 60 percent (80 percent complete minus 20 percent measured data) of the expected dataset contains estimated values.
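The percent complete statistic (Equation 4) and the measured-only variant discussed above can be sketched as follows; the record counts are hypothetical:

```python
def percent_complete(n_available, n_expected):
    """Available records (valid or invalid) as a percent of expected records."""
    return 100.0 * n_available / n_expected

# An ATR expected to report 2,880 one-minute records in a day delivers 2,304,
# of which 576 were directly measured (the rest estimated). Hypothetical values.
complete = percent_complete(2304, 2880)   # 80.0 percent complete
measured = percent_complete(576, 2880)    # 20.0 percent directly measured
estimated_share = complete - measured     # 60.0 percent of expected values
                                          # are estimates
print(complete, measured, estimated_share)
```

This reproduces the worked example in the text: 80 percent complete minus 20 percent measured leaves 60 percent of the expected dataset as estimated values.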
Validity is defined as "the degree to which data values satisfy acceptance requirements of the validation criteria or fall within the respective domain of acceptable values." Validity can be expressed as the percentage of data passing validity criteria (see Equation 5 below).
    % valid = (nvalid / ntotal) × 100     (Equation 5)

where:

nvalid = the number of records or rows with values meeting validity criteria
ntotal = the total number of records or rows subjected to validity criteria
Validity criteria (also referred to as business rules or validation checks) are defined in many data management applications and can range from a single simple rule to several levels of complex rules. A simple rule might specify that traffic volume counts cannot exceed a maximum value associated with road capacity (such as 2,600 vehicles per hour per lane) or that traffic speeds cannot exceed a reasonable threshold (such as 100 mph). Other validity criteria for traffic data could include the following:
Validity criteria are often based on "expert opinion" and are generally viewed as "rules of thumb," although some validity criteria may be based on established theory (e.g., road capacity) or scientific fact (e.g., cannot record a zero volume and non-zero speed). The specific validity criteria will likely vary from place to place, as each traffic data collector or manager brings experience with certain roadway locations, traffic data collection equipment, or collection hardware and software.
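A sketch of the percent valid statistic (Equation 5), using two simple rules of the kind described above; the thresholds and record values are illustrative assumptions:

```python
MAX_VOLUME_PER_HOUR = 2600  # vehicles per hour per lane (assumed capacity rule)
MAX_SPEED_MPH = 100         # assumed reasonableness threshold

def is_valid(record):
    """Apply simple validity rules to a (volume, speed) record."""
    volume, speed = record
    if volume < 0 or volume > MAX_VOLUME_PER_HOUR:
        return False
    if speed < 0 or speed > MAX_SPEED_MPH:
        return False
    if volume == 0 and speed > 0:  # cannot record zero volume with nonzero speed
        return False
    return True

def percent_valid(records):
    """Records passing validity criteria as a percent of records checked."""
    n_valid = sum(1 for r in records if is_valid(r))
    return 100.0 * n_valid / len(records)

records = [(1200, 55), (0, 62), (3000, 48), (900, 61)]  # (volume, speed)
print(percent_valid(records))  # 2 of 4 records pass -> 50.0
```

In practice the rule set would be far richer and tuned to local experience, as the text notes; the point here is only the structure: each record is checked against all rules, and the statistic is the passing share.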
The difference between completeness and validity is best illustrated in Figure 3-2. As seen in this figure, the pie represents the total amount of data expected to be collected (based on the data collection plan or data polling rates). The percent complete statistic includes both valid (slice #3) and invalid (slice #2) values, divided by the total expected number of values (the entire pie). The percent valid is the valid values (slice #3) divided by the total values checked (slices #2 and #3).
Figure 3-2. Illustration of Completeness and Validity Measures
Timeliness is defined as "the degree to which data values or a set of values are provided at the time required or specified." Timeliness can be expressed as one of these measures (see Equations 6 and 7 below):
    % on-time = (non-time / ntotal) × 100     (Equation 6)

where:

non-time = the number of data messages or packets received within acceptable time limits
ntotal = the total number of data messages or packets received
Eqn. 6 applies both to device-to-TMC communications and to TMC-to-end-user applications. The percent timely data indicates the share of submissions or reports delivered on time.
    Average lateness = Σ (tlate − texpected) / nlate     (Equation 7)

where the summation is taken over the late messages and:

nlate = the number of data messages or packets received outside acceptable time limits
tlate = the actual arrival time of a late data message or packet
texpected = the expected arrival time of a late data message or packet
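The two timeliness measures (Equations 6 and 7) can be sketched as follows. Here delays are expressed in seconds past the expected arrival time (i.e., each delay is tlate − texpected); the tolerance and sample values are assumptions for illustration:

```python
def percent_on_time(delays, tolerance):
    """Equation 6: share of messages arriving within the acceptable limit."""
    on_time = sum(1 for d in delays if d <= tolerance)
    return 100.0 * on_time / len(delays)

def average_lateness(delays, tolerance):
    """Equation 7: mean lateness over the late messages only."""
    late = [d for d in delays if d > tolerance]
    return sum(late) / len(late) if late else 0.0

delays = [5, 12, 45, 90, 8]          # seconds late, relative to expected arrival
print(percent_on_time(delays, 30))   # 3 of 5 within 30 s -> 60.0
print(average_lateness(delays, 30))  # (45 + 90) / 2 = 67.5
```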
Coverage is defined as "the degree to which data values in a sample accurately represent the whole of that which is to be measured." Coverage can be expressed as the percent of roadways (or the transportation system) represented by traffic data. Separate coverage statistics should be calculated for different functional classes of roadway.
The definition of coverage leaves several quantities open for interpretation. For example, how much of a sample is needed to "…accurately represent the whole…"? Or in the case of traffic detectors that are placed at several points along a roadway, what spacing between detectors is necessary to "…accurately represent the whole…"? In addition, coverage can also vary with time as detectors are taken off-line or new detectors are added. Ultimately these issues of interpretation are left to those who calculate the coverage statistics. However, additional information should be provided with coverage statistics to indicate the total sample size or nominal/average detector spacing during a particular time period.
Data quality reports should include the coverage measure because it helps to interpret the other data quality measures. The percent coverage statistic essentially tells analysts what portion of the system is being measured, and it can explain fluctuations in other data quality measures. For example, suppose completeness drops for a particular month. If the sensor coverage remained constant, then clearly some problem has arisen in the existing sensor system. If the sensor coverage recently increased and the completeness has dropped, a likely cause is new sensors that are not providing data as complete as that from the previous sensors.
Accessibility is defined as "the relative ease with which data can be retrieved and manipulated by data consumers to meet their needs." Of the six recommended data quality measures, accessibility is the only measure that is best described in both qualitative and quantitative terms:
Some analysts may wish to have a "composite data quality score" that represents two or more data quality attributes in a single number. For example, suppose that a single data quality score is desired that captures the results calculated for each of the six data quality attributes. A composite score could be calculated by assigning a grading scale (say, 1 to 10) to the range of expected results for each data quality attribute. For example, an 85 percent value of completeness is graded as 8.5, and an accuracy value of 6 percent error is graded as 9.2 (according to a grading scale). These two data quality attributes could then be combined into a composite score of 8.85 by averaging (or weighting by importance) the "grades" of 8.5 and 9.2. A composite data quality score in this sense would be most useful in relative comparisons or rankings; as a dimensionless number, the composite score can be difficult to interpret, and it provides no insight into possible causes or solutions.
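The grading-scale approach described above can be sketched as follows. The grading functions are assumptions chosen to reproduce the worked example in the text (85 percent complete maps to 8.5; 6 percent error maps to 9.2); they are not a published scale:

```python
def grade_completeness(pct_complete):
    """Map percent complete (0-100) onto a 0-10 grade. Assumed linear scale."""
    return pct_complete / 10.0

def grade_accuracy(pct_error):
    """Map percent error onto a 0-10 grade (lower error -> higher grade).
    Slope chosen so that 6% error grades as 9.2, per the text's example."""
    return max(0.0, 10.0 - pct_error * (2.0 / 15.0))

def composite(grades, weights=None):
    """Weighted average of per-attribute grades; equal weights by default."""
    if weights is None:
        weights = [1.0] * len(grades)
    return sum(g * w for g, w in zip(grades, weights)) / sum(weights)

grades = [grade_completeness(85.0), grade_accuracy(6.0)]  # 8.5 and 9.2
print(round(composite(grades), 2))  # 8.85, matching the worked example
```

The `weights` parameter supports the "weighting by importance" variant mentioned in the text; the defensible part of any such score is the documented grading scale, not the final number.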
A composite data quality score that may be useful for performance monitoring applications combines the completeness and the coverage attributes to create a "composite system completeness" or "composite system availability" measure. For example, the coverage measure is typically used to represent the portion of the total road network represented by collected traffic data. The completeness measure represents the amount of available data on this subset of "covered" roads, and the validity measure captures the percent of all available data values that are valid. In calculating the amount of valid and usable data as a sample percentage of the entire road system in an urban area, the composite system completeness is the percent coverage multiplied by the percent complete and percent valid (see Equation 8):

    Composite system completeness = % coverage × % complete × % valid     (Equation 8)
For example, assume that the coverage for freeway operations sensors is 90 percent (i.e., 90 percent of the urban area freeway road mileage is represented by the collected traffic data). Further assume that the data archive from these sensors is 75 percent complete for the year 2003, and the validity is 80 percent. The composite system completeness for 2003 is 54 percent (i.e., 90 percent × 75 percent × 80 percent). This means that the traffic data archive represents 54 percent of the total data that could possibly be collected for the areawide freeway network. This example is further illustrated in Table 3-2.
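Equation 8 can be sketched directly; the function name is illustrative, and the inputs below are the freeway example values from the text:

```python
def composite_system_completeness(pct_coverage, pct_complete, pct_valid):
    """Equation 8: product of three percentages (0-100), as a percentage."""
    return pct_coverage * pct_complete * pct_valid / 10000.0

# Freeway example above: 90% coverage, 75% complete, 80% valid
print(composite_system_completeness(90.0, 75.0, 80.0))  # 54.0
```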
Each of the six data quality measures can be calculated at many different levels of detail, from a statistic for a single traffic sensor location for a short time period, to a traffic data archive system that spans multiple years. Certainly, several of the data quality measures are most meaningful at certain levels of detail. For example, accessibility is best used at a system level, whereas accuracy may typically be measured at a few locations that represent the system.
| | System | Instrumented (expected data) | Data available | Valid data |
|---|---|---|---|---|
| Data values (sites) | 200 | 180 | 135 | 108 |
| Individual measures | | Coverage: (180/200) × 100 = 90% | Completeness: (135/180) × 100 = 75% | Validity: (108/135) × 100 = 80% |
| Composite measures | | Coverage completeness: (0.90 × 0.75) × 100 = 67.5% | Valid completeness: (0.75 × 0.80) × 100 = 60% | Complete valid coverage: (0.90 × 0.75 × 0.80) × 100 = 54% |
It should be recognized that different data consumers will want data quality information available at different levels of detail. For example, a maintenance technician will probably require significantly more detail than an information systems manager. The technician needs detailed data quality information to diagnose and solve problems, whereas the manager may wish to track data quality at a system level to assign resources when needed.
Information systems that report data quality should have the capability to do so at different levels to support the different users of data quality information. "Drilldown" capabilities, which are common in many data warehousing tools, support this presentation and analysis of data at aggregate and disaggregate levels. For example, consider a computer interface that shows a single value for percent complete for the entire data collection system for the entire year. By clicking on the single completeness value, users can "drilldown" to the next level of detail to see completeness values by freeway corridor. Clicking on a completeness value for a freeway corridor "drills down" to a single sensor location, then clicking on that sensor location could provide a day-by-day summary for that location. As such, an information system with this "drilldown" capability easily supports a wide variety of data consumers that would like data quality information at different levels of detail.
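The "drilldown" idea can be sketched as a roll-up of per-sensor completeness to corridor and system levels. The data layout, corridor names, and counts below are hypothetical:

```python
# Per-sensor records: available vs. expected counts for one day (hypothetical)
sensors = [
    {"corridor": "I-35",   "sensor": "A1", "available": 260, "expected": 288},
    {"corridor": "I-35",   "sensor": "A2", "available": 288, "expected": 288},
    {"corridor": "US-183", "sensor": "B1", "available": 144, "expected": 288},
]

def pct(avail, exp):
    """Percent complete (Equation 4)."""
    return 100.0 * avail / exp

def rollup(rows, key):
    """Aggregate available/expected counts by a grouping key, then take
    the completeness of each group's totals."""
    totals = {}
    for r in rows:
        k = r[key]
        a, e = totals.get(k, (0, 0))
        totals[k] = (a + r["available"], e + r["expected"])
    return {k: pct(a, e) for k, (a, e) in totals.items()}

by_corridor = rollup(sensors, "corridor")     # corridor-level completeness
system = pct(sum(r["available"] for r in sensors),
             sum(r["expected"] for r in sensors))
print(by_corridor)
print(round(system, 1))  # 80.1 (system-level completeness)
```

A data warehouse would store the counts at the finest level and compute each display level on demand, so that a click from the system value to a corridor, sensor, or day is just a change of grouping key, as in `rollup` above.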
This section describes how to identify data quality deficiencies by comparing the calculated data quality results to the targets. Based on the results of these comparisons, agencies can identify and program resources to improve data quality or, where resource constraints dictate, lower the targets. Table 3-3 shows the structure of the data quality statistics.
| Data Quality Measures | Data Consumer: Original Source Data (e.g., …) | Data Consumer: Archive Database (e.g., archived data users) | Data Consumer: Traveler Information (e.g., travelers) |
|---|---|---|---|
| Accuracy | X (T) | X (T) | X (T) |
| Completeness | X (T) | X (T) | X (T) |
| Validity | X (T) | X (T) | X (T) |
| Timeliness | X (T) | X (T) | X (T) |
| Coverage | X (T) | X (T) | X (T) |
| Accessibility | X (T) | X (T) | X (T) |

X – calculated value
(T) – threshold value
To facilitate reporting, it is important to document data quality. ASTM Committee E17.54 is currently developing metadata standards for archiving ITS-generated data. Once the ASTM standard is approved, it should be used for documenting traffic data quality.
The next step in the data quality assessment framework is to assign data quality responsibilities to data steward(s), who would ensure that data problems are fixed at the root cause rather than simply "scrapped and reworked." Once responsibility has been assigned, data quality reporting can be automated. Data stewards could be anybody charged with the responsibility for collecting, accessing, and retrieving data and reporting such data to users both within and outside the organization. These could be archived database administrators or heads of traffic collection and monitoring programs. Data stewards would generate and use data quality measures to track system performance and address problems as they occur through policy, institutional, or technological decisions.
Data quality reporting includes metadata. Several existing standards provide a framework for using metadata to document data quality. For example, FGDC-STD-001-1998³ is an existing American standard for digital geospatial data. The FGDC standard is used by numerous public agencies and private software companies in the U.S. and supports the reporting of data quality measures. It is noted, however, that the metadata standards community in the U.S. is beginning to move toward eventual adoption of ISO 19115, an international metadata standard maintained by the International Organization for Standardization (ISO).
The final step in the data quality assessment framework is to periodically reassess the data consumers, how they use the data, and the quality targets for their applications. The results of this periodic assessment should guide revisions to data collection protocols, including data collection equipment selection, calibration, and maintenance, as well as review of acceptance targets. This information would also be useful in reviewing the cost implications of data quality assessments and the impacts of decisions based on such data.
The following three case studies demonstrate how the data quality measures can be calculated for three different primary groups of data consumers:
| Case Study | Basis |
|---|---|
| Austin, Texas | A single day of data collected by the Texas Department of Transportation (TxDOT) |
| Pittsburgh, Pennsylvania | Data from Mobility Technologies, Inc. |
| Ohio Department of Transportation | Data collected by traditional methods at various locations in Ohio |
The application of the framework is illustrated with these case studies, presented in Appendices A through C. The case studies are intended only to illustrate the application of the methodologies in evaluating traffic data quality. They are not intended to, and do not, represent a review of the quality of data of the agencies that provided data for these case studies. Note that while most of the data used in these case studies were provided by agencies, some hypothetical data are also used in the illustration.