This case study is based on data from Mobility Technologies, Inc., which operates and maintains a traffic information system as part of an Integrated Surveillance and Data Management Infrastructure (ISDMI). The case study presents procedures for calculating the data quality measures in a specific setting: traffic data collection, dissemination, and archiving by Mobility Technologies, Inc. in Pittsburgh, Pennsylvania. The Pennsylvania Department of Transportation also collects and archives data in Pittsburgh; however, we focus on the data collected by Mobility Technologies, Inc. to keep the example uncomplicated. The same principles applied to this data source can be applied to others. Readers should note that some of the information and details in this case study are accurate representations of actual data measures, while other details and results are hypothetical and have been embellished or simplified for the purposes of the example. The embellishments are for illustration only and are not intended as criticisms of the data quality or as suggested requirements for future data quality measures.
Figure B.1 illustrates the data flows involved in traffic data collection, dissemination, and archiving, showing details specific to the context of Pittsburgh traffic data. In this example, there are three different types of data whose quality should be represented in the data quality measures:

- original source data collected by the ISP operator;
- archived data stored in the data warehouse; and
- traveler information (e.g., route speeds and travel times) disseminated to the public.
For the Pittsburgh case study, we consider one month of data (December 2002) collected by Mobility Technologies, Inc. as an example. Note that data quality could also be reported at other time scales, such as hourly, weekly, quarterly, or annually. For this example month, there were 103 unique stations (where a "station" measures traffic data for a logical grouping of lanes, typically all functionally similar lanes in one direction) configured to report traffic data (i.e., volume, occupancy, speed, and two vehicle classes) at 1-minute intervals. Each 1-minute traffic reading from each station represents one record. The following sections describe specific calculation procedures for the six data quality measures for the three types of data described above.
Figure B.1. Data Flows and Data Consumers in Pittsburgh Case Study
In this example, consider that we wish to compare the accuracy of traffic volume values from an operations-based sensor to a nearby automatic traffic recorder (ATR). One of the many data products available through the data warehouse is hourly traffic volumes; therefore, the reference measurements are also summed to match the exact date and time of the hourly traffic volumes in the data warehouse.
Note that in many cases, it may be difficult to get ground truth or extremely accurate traffic measurements for an extended period of time. In many cases, an acceptable (or the only) substitute is traffic data from another trusted or familiar source. For planning groups, this is most commonly their continuous vehicle counts from ATR stations. If ATR stations are used as a comparison, one should recognize the limitations of such an approach.
For visual reference, a chart is created that shows the percent difference between the daily volume counts from the ATR and the corresponding data warehouse values (Figure B.2). The mean absolute percent error (MAPE) is calculated by averaging all of the percentage values in Figure B.2. Thus, the MAPE for this comparison was calculated as 4 percent using Equation 1, and the root mean squared error (RMSE) was calculated as 1,280 vehicles using Equation 2.
Figure B.2. Accuracy of Hourly Traffic Volumes in Archive Database
The comparison described above is actually from the Evaluation of the Integrated Surveillance and Data Management Infrastructure (ISDMI) Program in Pittsburgh and Philadelphia, Pennsylvania (prepared by Battelle, September 5, 2002). That evaluation found that the daily traffic volume counts at a particular location varied from 0.4 to 6 percent from a nearby ATR station. This accuracy level is considered reasonably good agreement. However, the possibility does exist that both the ISDMI and the ATR data could be under- or over-counting true traffic volumes. Only extensive calibration of a reference sensor will yield a "ground truth" measurement that has a high probability of being very accurate and useful for comparisons.
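As an illustration, the MAPE (Equation 1) and RMSE (Equation 2) computations can be sketched in Python. The hourly volume pairs below are invented for the example; they are not the actual Pittsburgh or ATR measurements.

```python
import math

def mape(reference, measured):
    """Mean absolute percent error (Equation 1) across paired observations."""
    return 100.0 * sum(abs(m - r) / r for r, m in zip(reference, measured)) / len(reference)

def rmse(reference, measured):
    """Root mean squared error (Equation 2) across paired observations."""
    return math.sqrt(sum((m - r) ** 2 for r, m in zip(reference, measured)) / len(reference))

# Invented hourly volumes: ATR reference values vs. data warehouse values.
atr_volumes = [1200, 1350, 980, 1500]
warehouse_volumes = [1240, 1300, 1000, 1460]

print(f"MAPE: {mape(atr_volumes, warehouse_volumes):.1f}%")
print(f"RMSE: {rmse(atr_volumes, warehouse_volumes):.0f} vehicles")
```

In practice the two series must first be matched by date and hour, as described above, before the error statistics are computed.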
In this example, the ISP operator provides route-based speed and travel time reports on its website and through other media outlets. The route speeds and travel times are updated every 5 minutes. Assume for the sake of example that the ISP operator arranges for reference travel time measurements to be obtained along selected Pittsburgh routes for various times of the day. The ISP travel times are visually compared to the reference travel times using similar charts (see Figure B.3). The mean absolute percent error was calculated using Equation 1, and the root mean squared error was calculated using Equation 2.
Figure B.3. Accuracy of Route Travel Time Values in Traveler Information
In the Pittsburgh example, we calculate data completeness for the three different versions of data: original source data, data warehouse, and traveler information. In this particular example, the data process includes the flagging and eventual purging of invalid data values. Therefore, the completeness statistics will only include valid data values. The potential contribution of invalid data values to the completeness measure can be determined by combining the completeness and validity statistics.
In Pittsburgh, there are 103 on-line stations that should report a data record every minute for the entire day. Thus, we expect to have 4,597,920 valid volume and occupancy records for the month (103 stations × 1,440 records per day × 31 days in December 2002). The Pittsburgh field computers perform basic validity tests on 1-minute data and remove invalid values; because invalid data are removed, they are counted as missing. Table B.1 contains the completeness statistics and data used in the calculations.
Table B.1. Completeness of Original Source Data

| Number of records | Volume, Occupancy, Speed, and Vehicle Classification Data |
|---|---|
| Number of records with non-missing values | 4,352,000 |
| Number of records that require non-missing values | 4,597,920 (103 stations × 1,440 records per day × 31 days in December 2002) |
| Percent complete | 95% |
The completeness statistic in Table B.1 indicates that the original source data in Pittsburgh is 95 percent complete, with 5 percent of the data being incomplete (i.e., missing or invalid). Incomplete data can be caused by 1) large amounts of invalid data; or 2) missing data due to communication, hardware, or software failures. Note that the completeness statistics must be viewed in combination with validity statistics to pinpoint the most likely cause of missing data.
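The completeness arithmetic above can be sketched as follows, using the record counts reported in Table B.1:

```python
def completeness(non_missing, expected):
    """Percent of expected records that contain non-missing values."""
    return 100.0 * non_missing / expected

# Expected records for December 2002 (Table B.1).
stations, records_per_day, days = 103, 1440, 31
expected = stations * records_per_day * days   # 4,597,920 expected records

print(f"{completeness(4_352_000, expected):.0f}% complete")  # prints "95% complete"
```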
Data quality reports should fully specify or disclose the amount of expected data (the denominator of percent complete), especially for the completeness measure. Malfunctioning detectors should not be excluded from expected data counts simply because device owners are aware of the malfunction but have not yet been able to repair the devices. The practice of listing malfunctioning detectors as "off-line" is not recommended, as it obscures the true device failure rate and data quality results.
As shown in Figure B.1, the archived data administrator retrieves the original source data from the ISP operator. Note that in this example, both functions are performed by Mobility Technologies. The archive administrator performs several data processing steps in preparation for loading into the data warehouse, including aggregating the 1-minute records into 5-minute subtotals and applying additional validation rules.
After the archive administrator has performed these processing steps, completeness statistics are computed by counting the valid data values in the data archive. With 5-minute subtotals, the data archive should have 288 records per day for each station. Thus, there should be 919,584 records with valid volume and occupancy values (103 stations × 288 records per day × 31 days in December 2002). Note that missing or null data values are not counted as valid data values for the purposes of the following completeness statistics. Table B.2 contains the completeness statistics and data used in the calculations.
Table B.2. Completeness of Archived Data

| Number of records | Volume | Occupancy | Speed |
|---|---|---|---|
| Number of records with valid values | 867,606 | 868,898 | 835,855 |
| Number of records that require valid values | 919,584 | 919,584 | 919,584 |
| Percent complete | 94% | 94% | 91% |

(919,584 expected records = 103 stations × 288 records per day × 31 days in December 2002)
Table B.2 indicates that the archive database is still fairly complete. In this example, the completeness of the archived data is slightly less than that of the original source data because the data have undergone additional validation criteria before being stored in the data warehouse. Also note that different traffic attributes (e.g., volume, occupancy, speed) have different completeness statistics because several of the validation checks removed invalid data for only a particular attribute. For example, a high speed value (greater than 100 mph, for example) may have been found to be invalid, but the corresponding volume and occupancy values were kept as valid values.
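The aggregation of 1-minute volume records into the 5-minute subtotals stored in the archive might be sketched as below; the readings are invented for illustration (and note that speed, unlike volume, would be averaged rather than summed):

```python
# Ten invented 1-minute volume readings from one station.
minute_volumes = [12, 15, 9, 14, 11, 13, 10, 16, 12, 8]

# Sum each consecutive group of five 1-minute readings into a 5-minute subtotal.
five_min_subtotals = [
    sum(minute_volumes[i:i + 5]) for i in range(0, len(minute_volumes), 5)
]
print(five_min_subtotals)  # prints [61, 59]
```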
In this example, the ISP operator provides route-based speed and travel time reports as traveler information on their website. The route speeds and travel times are updated every 5 minutes on the website. There are a total of 10 key routes being reported; thus, one would expect a total of 89,280 reported travel times during the month (10 key routes × 288 updates per day × 31 days in December 2002). If data is not available for a key route, it is the ISP operator's policy to not provide an estimate of travel time.
Assume for this example that the ISP operator has automated a quality control process that monitors the availability of key route travel times on their website at all times throughout the day. For this example, consider that an intermittent communications failure interrupted data transmittals along one of the ten key routes for 12 days. Thus, 3,456 travel time updates for one route were not available (1 route × 12 updates per hour × 24 hours per day × 12 days of downtime). Table B.3 contains the completeness statistics for the traveler information.
Table B.3 indicates that the completeness or availability of the traveler information was relatively high for the key route travel times. In this example, the ISP operator does not estimate missing travel time values; thus, the availability may also reflect missing values in the original source data. Where an ISP does estimate values when original source data are missing, the availability of traveler information is affected more by hardware or software failures associated with ISP operations.
Table B.3. Completeness of Traveler Information

| Number of records | Key Route Travel Times on Public Website |
|---|---|
| Number of records with valid values | 85,824 |
| Number of records that require valid values | 89,280 (10 routes × 288 updates per day × 31 days in December 2002) |
| Percent complete | 96% |
For the Pittsburgh example, we calculate data validity for the three different datasets: original source data, data warehouse, and traveler information.
In Pittsburgh, the field computers perform some very basic validity checks on the original source data before it is sent to the ISP operator. Then assume that the field computers remove invalid data and replace it with an error code that distinguishes invalid data from missing data. Having different error codes for different data problems helps to diagnose the root cause of missing data.
To calculate validity of the original source data, we simply count the number of 1-minute data values that have been marked as valid values (i.e., those without "invalid" error codes), and then divide by the total number of data values subjected to the validity criteria. Table B.4 contains the validity statistics and data used in the calculations.
Table B.4. Validity of Original Source Data

| Number of records | Volume | Occupancy | Speed |
|---|---|---|---|
| Number of records meeting validity criteria | 4,337,983 | 4,343,876 | 4,287,885 |
| Number of records subjected to validity criteria | 4,352,000 | 4,352,000 | 4,352,000 |
| Percent valid | 99.7% | 99.8% | 98.5% |
Table B.4 indicates that the validity of the original source data was very high, as less than 2 percent of all data failed the validity checks. The speed data had slightly lower validity; this could have been due to an improperly calibrated sensor that was reporting speeds outside an acceptance threshold.
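A minimal sketch of range-based validity screening in the spirit of the checks described above follows. The thresholds (e.g., the 100 mph speed cap) and the sample records are illustrative assumptions, not Mobility Technologies' actual criteria:

```python
def is_valid(volume, occupancy, speed):
    """Basic range checks on a 1-minute traffic record (illustrative thresholds)."""
    if not 0 <= volume <= 250:       # vehicles per station per minute, assumed cap
        return False
    if not 0 <= occupancy <= 100:    # occupancy is a percentage
        return False
    if not 0 <= speed <= 100:        # flag implausibly high speeds (mph)
        return False
    return True

# Invented (volume, occupancy, speed) records; the second fails the speed check.
records = [(25, 12.0, 58.0), (30, 8.5, 130.0), (0, 0.0, 0.0)]
valid = [r for r in records if is_valid(*r)]
validity_pct = 100.0 * len(valid) / len(records)
print(f"{validity_pct:.1f}% of records passed the validity checks")
```

The validity statistic is then the count of records passing the checks divided by the count of records subjected to them, as in Table B.4.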
The archive administrator uses several other validation rules once the original source data arrives at the data warehouse. In most real-time data processing (as on field computers), validation criteria are kept simple because processing time must be minimized. In a data warehouse environment, there is less time restriction and more advanced validation criteria can be applied.
Note that these additional validation rules are applied to the original source data before it is aggregated into 5-minute periods. In some cases, validation rules may be applied at several different points in the data flow between original source data and the archive database.
Table B.5 contains the validity statistics and data used in the calculations.
Table B.5. Validity of Archived Data

| Number of records | Volume | Occupancy | Speed |
|---|---|---|---|
| Number of records meeting validity criteria | 855,603 | 853,862 | 849,510 |
| Number of records subjected to validity criteria | 870,400 | 870,400 | 870,400 |
| Percent valid | 98.3% | 98.1% | 97.6% |
Table B.5 indicates that the validity of the archive database is still quite high, as less than 3 percent of the data failed the additional validity checks. Because of the number of additional validation checks, we can be reasonably assured that there are no major data validity problems with either the original source data or the archive database.
In this example, consider that the ISP operator and the other media outlets do not apply any additional validity criteria to the route travel times beyond what is applied to the original source data. Because no additional criteria are applied, all reported route travel time values are valid (as there are no criteria by which to reject a route travel time as invalid). By using this practice, the ISP operator is assuming that all invalid data is being addressed in an "upstream" data process (i.e., a data process that occurs before route travel times are computed). Table B.6 contains the validity statistics for the ISP route travel times and data used in the calculations.
Table B.6. Validity of Traveler Information

| Number of records | Route Travel Times |
|---|---|
| Number of records meeting validity criteria | 85,824 |
| Number of records subjected to validity criteria | 85,824 |
| Percent valid | 100% |
In measuring the timeliness of the original source data, we examine the data flow between the field computers and the traffic management center. There are four field computers that are expected to supply the traffic management center computer with data messages every minute, where a data message consists of the volume, occupancy and speed values for the previous minute. By examining the timestamps of the data messages, we can calculate the timeliness of this data flow. Note that in this example, the timestamps represent the time the data messages arrived at the traffic management center, not the time the data messages departed the field computers. This data timestamp convention should be confirmed when calculating timeliness, as it could dramatically affect the results.
The ISP operator has decided that data messages received up to 30 seconds later than expected are acceptable. In analyzing the timestamps on the 1-minute data messages, we find that 4,347,648 of the 4,352,000 data messages were received at the traffic management center within 90 seconds of the previous message. Therefore, timeliness is calculated as:

Timeliness = 4,347,648 ÷ 4,352,000 = 99.9 percent
By further analyzing the timestamps, we calculate that the average delay for the 4,352 late messages is 48 seconds. This means that, when a data message was received late, on average it was received 48 seconds later than expected.
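A sketch of computing timeliness from arrival timestamps follows, using the 90-second rule described above (60-second reporting interval plus 30 seconds of allowed delay). The timestamps are invented for illustration:

```python
from datetime import datetime, timedelta

# A message is on time if it arrives within 90 seconds of the previous one.
THRESHOLD = timedelta(seconds=90)

# Invented arrival timestamps at the traffic management center.
arrivals = [
    datetime(2002, 12, 1, 8, 0, 5),
    datetime(2002, 12, 1, 8, 1, 10),   # 65 s after previous: on time
    datetime(2002, 12, 1, 8, 3, 0),    # 110 s after previous: late
    datetime(2002, 12, 1, 8, 4, 0),    # 60 s after previous: on time
]

gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
on_time = sum(1 for g in gaps if g <= THRESHOLD)
timeliness = 100.0 * on_time / len(gaps)
print(f"{timeliness:.1f}% of messages on time")
```

Average delay for late messages can be computed similarly, by averaging the amount by which each late gap exceeds the expected 60-second interval.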
Immediately after collection, the original source data is replicated and copied to a staging area in the data warehouse. The data then go through an automated validation and loading process at a scheduled time during off-peak hours. The goal of the ISP operator is to have the previous day of archived data available through the data warehouse by 8 a.m. the next day. Thus, any data not loaded by 8 a.m. of the following day is considered late.
Assume for this example that a problem in the data warehouse software caused the data to be loaded later than expected on two separate days. Assume that, on each day, the problem was diagnosed and the data were loaded by 6 p.m. that day. For all other days in December 2002 (29 of 31 days), the data were loaded into the data warehouse and available by 8 a.m. Thus, the timeliness is as follows:

Timeliness = 29 ÷ 31 days = 93.5 percent
In this example, average delay is 10 hours, which is the average amount of time between when the data was expected and when it actually became available.
In this example, consider that the ISP operator would like to evaluate the timeliness of the updates to the key route travel times on their website. For this example, assume that the ISP operator has a goal of providing condition updates every 5 minutes. Now consider hypothetically that the ISP operator's web servers have a series of crashes and problems that prevent users from accessing the website for a period of 24 hours. Thus, the timeliness is as follows:

Timeliness = (8,928 − 288) ÷ 8,928 update cycles = 96.8 percent

where 8,928 is the number of 5-minute update cycles in December 2002 (12 per hour × 24 hours × 31 days), of which 288 cycles were missed during the 24-hour outage.
In this example, average delay for late data is not calculated because the travel time updates on the website were not available at all during the 24-hour outage.
The ISP operator has focused their real-time flow data collection on the freeways in the Pittsburgh area. (Note that Mobility Technologies collects real-time incident and event data on all roads.) Their goal is to monitor the most important freeway routes in the Pittsburgh area with real-time traffic data, and they have chosen to focus initial deployments on the most congested parts of the freeway network. Because the ISP operator is using the data primarily for traveler information (and not for traffic management or incident detection), they have installed sensors on the freeway main lanes only, with an average sensor spacing of about 1.5 miles. Therefore, they consider this sample to adequately represent the freeway locations between point detectors for the purposes of traveler information.
Because of their emphasis, the ISP operator only considers flow data on the functional class of freeways. In the Pittsburgh metropolitan planning area, there are a total of 284 centerline-miles of freeway. The ISP operator has installed traffic detectors along 78 freeway centerline-miles. Therefore, the percent of freeway coverage is 78/284 = 27 percent, with an average detector spacing of 1.5 miles.
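The coverage arithmetic can be expressed directly:

```python
# Freeway coverage in the Pittsburgh metropolitan planning area.
instrumented_miles = 78     # centerline-miles with traffic detectors
total_freeway_miles = 284   # total freeway centerline-miles

coverage_pct = 100.0 * instrumented_miles / total_freeway_miles
print(f"{coverage_pct:.0f}% of freeway centerline-miles covered")  # prints "27% ..."
```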
The archive administrator has also chosen to focus the flow coverage statistics on the freeway network. The coverage statistics in the archive database are therefore exactly the same as in the original source data: 27 percent of freeway centerline-miles covered (78/284), with an average detector spacing of 1.5 miles.
The traveler information flow data is also focused on the freeway network only. The coverage statistics for traveler information are therefore the same as for the archive database and the original source data: 27 percent of freeway centerline-miles covered (78/284), with an average detector spacing of 1.5 miles.
In this example, we will describe the accessibility of traffic data using only qualitative terms.
The accessibility of the original source data is described in these qualitative terms:

- Real-time data are accessible on a per-sensor basis through a secure website.
- Historical data are accessible to public agency stakeholders through a secure website.
The accessibility of the archive database will be of interest to archived data users, who wish to retrieve and manipulate data products from the archive. The accessibility is described in qualitative terms as follows:

- Archived data products are accessible to public agency stakeholders through a secure website.
The accessibility of traveler information will be of interest to travelers, who wish to make more informed travel decisions. The accessibility of the traveler information is as follows:

- Traveler information is accessible through a public website by registration, and through other public media outlets (e.g., web, TV, radio).
In most cases, data accessibility may not be as dynamic as the other data quality measures. The most appropriate time(s) to measure data accessibility is after major system interface or design changes. Measuring accessibility or usability at this time will allow system designers to see whether their interface or design changes have improved accessibility to data consumers.
The data quality statistics for the Pittsburgh case study are summarized in Table B.7.
Table B.7. Summary of Data Quality Statistics for the Pittsburgh Case Study

| Data Quality Measures | Original Source Data | Archived Database | Traveler Information |
|---|---|---|---|
| Accuracy (MAPE) | | Hourly volumes: 4% | Travel times: 7.5% |
| Completeness | All data: 95% | Volume: 94%; Occupancy: 94%; Speed: 91% | Website key route travel times: 96% |
| Validity | Volume: 99.7%; Occupancy: 99.8%; Speed: 98.5% | Volume: 98.3%; Occupancy: 98.1%; Speed: 97.6% | Key route travel times: 100% |
| Timeliness | 1-minute data messages: 99.9% | Daily data loading: 93.5% | Website updates: 96.8% |
| Coverage | Freeways: 27% with 1.5-mile spacing | Freeways: 27% with 1.5-mile spacing | Freeways: 27% with 1.5-mile spacing |
| Accessibility | Accessible in real time on a per-sensor basis through secure website; historical data accessible to public agency stakeholders through secure website | Accessible to public agency stakeholders through secure website | Accessible through public website by registration; also accessible through other public media outlets (e.g., web, TV, radio) |