Chapter 1 defined key characteristics of patient registries for evaluating patient outcomes. They include specific and consistent data definitions for collecting data elements in a uniform manner for every patient. As in randomized controlled trials, the case report form (CRF) is the paradigm for the data structure of the registry. A CRF is a formatted listing of data elements that can be presented in paper or electronic formats. Those data elements and data entry options in a CRF are represented in the database schema of the registry by patient-level variables. Defining the registry CRFs and corresponding database schema are the first steps in data collection for a registry. Chapter 4 describes the selection of data elements for a registry.

Two related documents should also be considered part of the database specification: the data dictionary (including data definitions and parameters) and the data validation rules, also known as queries or edit checks. The data dictionary and definitions describe both the data elements and how those data elements are interpreted. The data dictionary contains a detailed description of each variable used by the registry, including the source of the variable, coding information if used, and normal ranges if relevant. For example, the definition of “current smoker” should specify whether “smoker” refers to tobacco or other substances and whether “current” means actively smoking or having smoked within a recent time period. Several cardiovascular registries, such as the Get With The Guidelines® Coronary Artery Disease1 program, define “current smoker” as someone who smoked tobacco within the last year.
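As an illustration only, a data dictionary entry could be represented as a small structured record. The sketch below is in Python; the class name `DictionaryEntry`, its field names, the listed source, and the 0/1 coding are hypothetical and are not taken from any particular registry.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class DictionaryEntry:
    """One variable in a (hypothetical) registry data dictionary."""
    name: str                                   # variable name in the database schema
    definition: str                             # how the element is to be interpreted
    source: str                                 # where the value comes from
    codes: Dict[str, str] = field(default_factory=dict)   # coding information, if used
    normal_range: Optional[Tuple[float, float]] = None    # normal range, if relevant
    unit: Optional[str] = None


# Example entry; the source and the 0/1 coding are assumptions for illustration.
current_smoker = DictionaryEntry(
    name="current_smoker",
    definition="Smoked tobacco within the last year",
    source="patient interview or medical record",
    codes={"0": "No", "1": "Yes"},
)
```

Keeping the definition, source, and coding together in one record makes it easier to publish the same information in the data collection manual and to reuse it when writing validation rules.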

Data validation rules refer to the logical checks on data entered into the database against predefined rules for either value ranges (e.g., systolic blood pressure less than 300 mmHg) or logical consistency with respect to other data fields for the same patient; these are described more fully in Section 2.5, “Cleaning Data,” below. While neither registry database structures nor database requirements are standardized, the Clinical Data Interchange Standards Consortium2 is actively working on representative models of data interchange and portability using standardized concepts and formats. Chapter 4 further discusses these models, which are applicable to registries as well as clinical trials.
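The two kinds of check described above, a value range and logical consistency across fields, might be sketched as follows. This is a minimal Python illustration; the field names (`systolic_bp`, `diastolic_bp`), the `Rule` structure, and the systolic-versus-diastolic consistency check are assumptions made for the example, not rules prescribed by this chapter.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the record passes
    message: str


# Hypothetical rules; missing values pass here and would be handled separately.
RULES = [
    Rule(
        "sbp_range",
        lambda r: r.get("systolic_bp") is None or 0 < r["systolic_bp"] < 300,
        "Systolic blood pressure must be less than 300 mmHg",
    ),
    Rule(
        "sbp_gt_dbp",
        lambda r: None in (r.get("systolic_bp"), r.get("diastolic_bp"))
        or r["systolic_bp"] > r["diastolic_bp"],
        "Systolic pressure should exceed diastolic pressure",
    ),
]


def validate(record: dict) -> List[str]:
    """Return the message of every rule that the record fails."""
    return [rule.message for rule in RULES if not rule.check(record)]
```

For instance, `validate({"systolic_bp": 320, "diastolic_bp": 80})` returns only the range message, which could then be raised as a query to the site.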

Data collection procedures need to be carefully considered in planning the operations of a registry. Successful registries depend on a sustainable workflow model that can be integrated into the day-to-day clinical practice of active physicians, nurses, pharmacists, and patients, with minimal disruption. (See Chapter 10.) Programs can benefit tremendously from preliminary input from the health care workers or study coordinators who are likely to be participants.

One method of gathering input from likely participants before the full launch of a registry is pilot testing. Whereas feasibility testing, which is discussed in Chapter 2, Section 2.4, focuses on whether a registry should be implemented, pilot testing focuses on how it should be implemented. Piloting can range from testing a subset of the procedures, CRFs, or data capture systems, to a full launch of the registry at a limited subset of sites with a limited number of patients.

The key to effective pilot testing is to conduct it at a point where the results of the pilot can still be used to modify the registry implementation. Through pilot testing, one can assess comprehension, acceptance, feasibility, and other factors that influence how readily the patient registry processes will fit into patient lifestyles and the normal practices of the health care provider.

For example, some data sources may or may not be available for all patients. Chapter 4, Section 5 discusses pilot testing in more detail.

The data collection procedures for each registry should be clearly defined and described in a detailed manual. The term manual here refers to the reference information in any appropriate form, including hard copy, electronic, or via interactive Web or software-based systems. Although the detail of this manual may vary from registry to registry depending on the intended purpose, the required information generally includes protocols, policies, and procedures; the data collection instrument; and a listing of all the data elements and their full definitions. If the registry has optional fields (i.e., fields that do not have to be completed on every patient), these should be clearly specified.

In addition to patient inclusion and exclusion criteria, the screening process should be specified, as should any documentation to be retained at the site level and any plans for monitoring or auditing of screening practices. If sampling is to be performed, the method or systems used should be explained, and tools should be provided to simplify this process for the sites. The manual should clearly explain how patient identification numbers are created or assigned and how duplicate records should be prevented. Any required training for data collectors should also be described.

If paper CRFs are used, the manual should describe specifically how they are used and which parts of the forms (e.g., two-part or three-part no-carbon-required forms) should be retained, copied, submitted, or archived. If electronic CRFs are used, clear user manuals and instructions should be available. These procedures are an important resource for all personnel involved in the registry (and for external auditors who might be asked to assure the quality of the registry).

The importance of standardizing procedures to ensure that the registry uses uniform and systematic methods for collecting data cannot be overstated. At the same time, some level of customization of data entry methods may be required or permitted to enable the participation of particular sites or subgroups of patients within some practices. As discussed in Chapter 10, if the registry provides payments to sites for participation, then the specific requirements for site payments should be clearly documented, and this information should be provided with the registry documents.

All personnel involved in data collection should be identified, and their job descriptions and respective roles in data collection and processing should be described. Examples of such “roles” include patient, physician, data entry personnel, site coordinator, help desk, data manager, and monitor. The necessary documentation or qualification required for any role should be specified in the registry documentation. As an example, some registries require personnel documentation such as a curriculum vitae, protocol signoff, attestation of intent to follow registry procedures, or confirmation of completion of specified training.

The sources of data for a registry may include new information collected from the patient, new or existing information reported by or derived from the clinician and the medical record, and ancillary stores of patient information, such as laboratory results. Since registries for evaluating patient outcomes should employ uniform and systematic methods of data collection, all data-related procedures—including the permitted sources of data; the data elements and their definitions; and the validity, reliability, or other quality requirements for the data collected from each source—should be predetermined and defined for all collectors of data. As described in Section 3, “Quality Assurance,” below, data quality is dependent on the entire chain of data collection and processing. Therefore, the validity and quality of the registry data as a whole ultimately derive from the least, not the most, rigorous link.

In Chapter 6, data sources are classified as primary or secondary, based on the relationship of the data to the registry purpose and protocol. Primary data sources incorporate data collected for direct purposes of the registry (i.e., primarily for the registry). Secondary data sources consist of data originally collected for purposes other than the registry (e.g., standard medical care, insurance claims processing). The sections below incorporate and expand on these definitions.

Patient-reported data are data specifically collected from the patient for the purposes of the registry rather than interpreted through a clinician or an indirect data source (e.g., laboratory value, pharmacy records). Such data may range from basic demographic information to validated scales of patient-reported outcomes (PROs). From an operational perspective, a wide range of issues should be considered in obtaining data directly from patients. These range from presentation (e.g., font size, language, reading level) to technologies (e.g., paper-and-pencil questionnaires, computer inputs, telephone or voice inputs, or hand-held patient diaries). Mistakes at this level can inadvertently bias patient selection, invalidate certain outcomes, or significantly affect cost. Limiting the access for patient reporting to particular languages or technologies may limit participation. Patients with specific diagnoses may have difficulties with specific technologies (e.g., small font size for visually impaired, paper and pencil for those with rheumatoid arthritis). Other choices, such as providing a PRO instrument in a format or method of delivery that differs from how it was validated (e.g., questionnaire rather than interview), may invalidate the results. For more information on patient-reported outcome development and use, see Chapter 5.

Clinician-reported or -derived data can also be divided into primary and secondary subcategories. As an example, specific clinician rating scales (e.g., the National Institutes of Health Stroke Scale)3 may be required for the registry but not routinely captured in clinical encounters. Some variables might be collected directly by the clinician for the registry or obtained from the medical record. Data elements that the clinician must collect directly (e.g., because of a particular definition or need to assess a specific comorbidity that may or may not be routinely present in the medical record) should be specified. These designations are important because they determine who can collect the data for a particular registry or what changes must be made in the procedures the clinician follows in recording a medical record for a patient in a registry. Furthermore, the types of error that arise in registries (discussed in Section 3, “Quality Assurance”) will differ by the degree of use of primary and secondary sources, as well as other factors. As an example, registries that use medical chart abstracters, as discussed in Section 2.2.7 below, may be subject to more interpretive errors.4

Data abstraction is the process by which a data collector other than the clinician interacting with the patient extracts clinician-reported data. While physical examination findings, such as height and weight, or laboratory findings, such as white blood cell counts, are straightforward, abstraction usually involves varying degrees of judgment and interpretation.

Clarity of description and standardization of definitions are essential to the assurance of data quality and to the prevention of interpretive errors when using data abstraction. Knowledgeable registry personnel should be designated as resources for the data collectors in the field, and processes should be put in place to allow the data collectors in the field continuous access to these designated registry personnel for questions on specific definitions and clinical situations. Registries that span long periods, such as those intended for surveillance, might be well served by a structure that permits the review of definitions on a periodic basis to ensure the timeliness and completeness of data elements and definitions, and to add new data elements and definitions. A new product or procedure introduced after the start of a registry is a common reason for such an update.

Abstracting data from unformatted hard copy (e.g., a hospital chart) is often an arduous and tedious process, especially if free text is involved, and it usually requires a human reader. The reader, whose qualifications may range from a trained “medical record analyst” or other health professional to an untrained research assistant, may need to decipher illegible handwriting, translate obscure abbreviations and acronyms, and understand the clinical content to sufficiently extract the desired information. Registry personnel should develop formal chart abstraction guidelines, documentation of processes and practical definitions of terms, and coding forms for the analysts and reviewers to use.

Generally, the guidelines include instructions to search for particular types of data that will go into the registry (e.g., specific diagnoses or laboratory results). Often the analyst will be asked to code the data, using either standardized codes from a codebook (e.g., the ICD-9 [International Classification of Diseases, 9th Revision] code) corresponding to a text diagnosis in a chart, or codes that may be unique to the registry (e.g., a severity scale of 1 to 5).

All abstraction and coding instructions must be carefully documented and incorporated into a data dictionary for the registry. Because of the “noise” in unstructured, hard-copy documents (e.g., spurious marks or illegible writing) and the lack of precision in natural language, the clinical data abstracted by different abstracters from the same documents may differ. This is a potential source of error in a registry.

To reduce the potential for this source of error, registries should ensure proper training on the registry protocol and procedures, condition(s), data sources, data collection systems, and most importantly, data definitions and their interpretation. While training should be provided for all registry personnel, it is particularly important for nonclinician data abstracters. Training time depends on the nature of the source (charts or CRFs), complexity of the data, and number of data items. A variety of training methods, from live meetings to online meetings to interactive multimedia recordings, have all been used with success.5 Training often includes test abstractions using sample charts. For some purposes, it is best practice to train abstracters using standardized test charts. Such standardized tests can be further used both to obtain data on the inter-rater reliability of the CRFs, definitions, and coding instructions and to determine whether individual abstracters can perform up to a defined minimum standard for the registry. Registries that rely on medical chart abstraction should consider reporting on the performance characteristics associated with abstraction, such as inter-rater reliability.6 Examining and reporting on intra-rater reliability may also be useful. Some key considerations in standardizing medical chart abstractions are—

  • Standardized materials (e.g., definitions, instructions)

  • Standardized training

  • Testing with standardized charts

  • Reporting of inter-rater reliability
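One common way to quantify inter-rater reliability for coded items is Cohen's kappa, which compares the observed agreement between two abstracters with the agreement expected by chance. The chapter does not prescribe a particular statistic, so the choice of kappa here is an assumption, and the function name `cohen_kappa` and the sample codes are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Two abstracters coding the same four standardized test charts.
abstracter_1 = ["MI", "MI", "angina", "angina"]
abstracter_2 = ["MI", "MI", "angina", "MI"]
kappa = cohen_kappa(abstracter_1, abstracter_2)  # 0.5
```

Applying the same function to one abstracter's codes from two separate sessions on the same charts would give a corresponding measure of intra-rater reliability.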

An electronic medical record (EMR) is an electronic record of health-related information on an individual that can be created, gathered, managed, and consulted by authorized clinicians and staff within one health care organization. More complete than an EMR, an electronic health record (EHR) is an electronic record of health-related information on an individual that conforms to nationally recognized interoperability standards and that can be created, managed, and consulted by authorized clinicians and staff across more than one health care organization.7 For the purposes of this discussion, we will refer to the more limited capabilities of the EMR.

The EMR (and EHR) will play an increasingly important role as a source of clinical data for registries. The medical community is currently in a transition period in which the primary repository of a patient's medical record is changing from the traditional hard-copy chart to the EMR. The main function of the EMR is to aggregate all clinical electronic data about a patient into one database, in the same way that a hard-copy medical chart aggregates paper records from various personnel and departments responsible for the care of the patient. Depending on the extent of implementation, the EMR may include patient demographics, diagnoses, procedures, progress notes, orders, flow sheets, medications, and allergies. The primary sources of data for the EMR are the health care providers. Data may be entered into the EMR through keyboards or touch screens in medical offices or at the bedside. In addition, the EMR system is usually interfaced with ancillary systems (discussed below), such as laboratory, pharmacy, radiology, and pathology systems. Ancillary systems, which usually have their own databases, export relevant patient data to the EMR system, which imports the data into its database.

Since EMRs include the majority of clinical data available about a patient, they can be a major source of patient information for a registry. What an EMR usually does not include is registry-specific (primary source) data that are collected separately from hard-copy or electronic forms. In the next several years, suitable EMR system interfaces may be able to present data needed by registries in accordance with registry-specified requirements, either within the EMR (which then populates the registry) or in an electronic data capture system (which then populates the EMR). EMRs already serve as secondary data sources in some registries, and this practice will continue to grow as EMRs become more widely used. In these situations, data may be extracted from the EMR, transformed into registry format, and loaded into the registry, where they will reside in the registry database together with registry-specific data imported from other sources. In a sense, this is similar to medical chart abstraction except that it is performed electronically.

Electronic capture differs from manual medical chart abstraction in two key respects. First, the data are “abstracted” once for all records. In this context, abstraction refers to the mapping and other decisionmaking needed to bring the EMR data into the registry database. It does not eliminate the potential for interpretive errors, as described later in this chapter, but it centralizes that process, making the rules clear and easily reviewed. Second, the data are uploaded electronically, eliminating duplicative data entry, potential errors associated with data reentry, and the related cost of this redundant effort.

When the EMR is used as a data source for a registry, a significant problem occurs when the information needed by the registry is stored in the EMR as free text, rather than codified or structured data. Examples of structured data include ICD-9 diagnoses and laboratory results. In contrast, physician progress notes, consultations, radiology reports, et cetera, are usually dictated and transcribed as narrative free text. While data abstraction of free text derived from an EMR can be done by a medical record analyst, with the increasing use of EMRs, automated methods of data abstraction from free text have been developed. Natural language processing (NLP) is the term for this technology. It allows computers to process and extract information from human language. The goal of NLP is to parse free text into meaningful components based on a set of rules and a vocabulary that enable the software to recognize key words, understand grammatical constructions, and resolve word ambiguities. Those components can be extracted and delivered to the registry along with structured data extracted from the EMR, and both can be stored as structured data in the registry database.

An increasing number of NLP software packages are available (e.g., caTIES from the National Cancer Institute,8 i2b2 (Informatics for Integrating Biology and the Bedside),9 and a number of commercial products). However, NLP is still in an early phase of development and cannot yet be used for all-purpose chart abstraction. In general, NLP software operates in specific clinical domains (e.g., radiology, pathology), whose vocabularies have been included in the NLP software's database. Nevertheless, NLP has been used successfully to extract diagnoses and drug names from free text in various clinical settings.
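To make the idea of a domain vocabulary concrete, the toy sketch below recognizes drug names in free text by dictionary lookup. It is far simpler than the NLP packages named above: it does no grammatical analysis, does not detect negation, and cannot resolve ambiguous abbreviations. The vocabulary and the function name `extract_drugs` are hypothetical.

```python
import re

# Hypothetical domain vocabulary mapping surface forms to a normalized term.
DRUG_VOCABULARY = {
    "aspirin": "aspirin",
    "asa": "aspirin",
    "metoprolol": "metoprolol",
    "warfarin": "warfarin",
    "coumadin": "warfarin",
}


def extract_drugs(note: str) -> list:
    """Return normalized drug names found in a note, in order of first mention."""
    tokens = re.findall(r"[a-z]+", note.lower())
    found = []
    for token in tokens:
        term = DRUG_VOCABULARY.get(token)
        if term and term not in found:
            found.append(term)
    return found


drugs = extract_drugs("Pt started on ASA 81 mg daily; continue Coumadin.")
```

Here `drugs` is `["aspirin", "warfarin"]`. A note reading "patient denies taking aspirin" would also return aspirin, which illustrates why this kind of lookup alone is not adequate for chart abstraction.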

It is anticipated that EMR/EHR use will grow significantly with the incentives provided under the American Recovery and Reinvestment Act of 2009 health information technology provisions. Currently, only a minority of U.S. patients have their data stored in systems that are capable of retrieval at the level of a data element. Furthermore, only a small number of these systems currently store data in structured formats with standardized data definitions for those data elements that are common across different vendors. A significant amount of attention is currently focused on interchange formats between clinical and research systems (e.g., from Health Level Seven [HL-7]10 to Clinical Data Interchange Standards Consortium2 models). Attention is also focused on problems of data syntax and semantics. The adoption of common database structures and open interoperability standards will be critical for future interchange between EHRs and registries. This topic is discussed in depth in Chapter 15.

Some of the clinical data used to populate registries may be derived from repositories other than EMRs. Examples of other data sources include billing systems, laboratory databases, and other registries. Chapter 6 discusses the potential uses of other data sources in more detail.

Once the primary and any secondary data sources for a registry have been identified, the registry team can determine how data will be entered into the registry database. Many techniques and technologies exist for entering or moving data into the registry database, including paper CRFs, direct data entry, facsimile or scanning systems, interactive voice response systems, and electronic CRFs. There are also different models for how quickly those data reach a central repository for cleaning, reviewing, monitoring, or reporting. Each approach has advantages and limitations, and each registry must balance flexibility (the number of options available) with data availability (when the central repository is populated), data validity (whether all methods are equally able to produce clean data), and cost. Appropriate decisions depend on many factors, including the number of data elements, number of sites, location (local preferences that vary by country, language differences, and availability of different technologies), registry duration, followup frequency, and available resources.

With paper CRFs, the clinician enters clinical data on the paper form at the time of the clinical encounter, or other data collectors abstract the data from medical records after the clinical encounter. CRFs may include a wide variety of clinical data on each patient gathered from different sources (e.g., medical chart, laboratory, pharmacy) and from multiple patient encounters. Before the data on formatted paper forms are entered into a computer, the forms should be reviewed for completeness, accuracy, and validity. Paper CRFs can be entered into the database by either direct data entry or computerized data entry via scanning systems.

With direct data entry, a computer keyboard is used to enter data into a database. Key entry has a variable error rate depending on personnel, so an assessment of error rate is usually desirable, particularly when a high volume of data entry is performed. Double data entry is a method of increasing the accuracy of manually entered data by quantifying error rates as discrepancies between two different data entry personnel; data accuracy is improved by having two individuals enter the data independently and a third person review and resolve discrepancies. With upfront data validation checks on direct data entry, the likelihood of data entry errors significantly decreases. Therefore, the choice of single versus double data entry should be driven by the maximal error rate that the registry can tolerate in key measures and by the ability of each method to achieve that rate in the particular circumstance. Double data entry, while a standard of practice for registrational trials, may add significant cost. Its use should be guided by the need to reduce an error rate in key measures and the likelihood of accomplishing that by double data entry as opposed to other approaches. In some situations, assessing the data entry error rates by re-entering a sample of the data is sufficient for reporting purposes.
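The comparison at the heart of double data entry can be sketched in a few lines. The Python example below assumes each keyed form is available as a dictionary of field values; the function name `compare_entries` and the field names are hypothetical, and the discrepancy rate is computed per field for a single form.

```python
def compare_entries(first: dict, second: dict):
    """Compare two independent keyings of the same form.

    Returns the list of discrepant fields and the share of fields that differ.
    """
    fields = sorted(set(first) | set(second))
    discrepancies = [f for f in fields if first.get(f) != second.get(f)]
    rate = len(discrepancies) / len(fields) if fields else 0.0
    return discrepancies, rate


entry_1 = {"age": 54, "sex": "F", "sbp": 120, "dbp": 80}
entry_2 = {"age": 45, "sex": "F", "sbp": 120, "dbp": 80}  # transposed digits in age
fields_to_review, discrepancy_rate = compare_entries(entry_1, entry_2)
```

In this example `fields_to_review` is `["age"]` and `discrepancy_rate` is `0.25`. The discrepant fields would go to a reviewer for resolution against the source form, and the same comparison applied to a re-entered sample gives an estimate of the entry error rate.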

With hard-copy structured forms, entering data using a scanner and special software to extract the data from the scanned image is possible. If data are recorded on a form as marks in checkboxes, the scanning software enables the user to map the location of each checkbox to the value of a variable represented by the text item associated with the checkbox, and to determine whether the box is marked. The presence of a mark in a box is converted by the software to its corresponding value, which can then be transmitted to a database for storage. If the form contains hand-printed or typed text or numbers, optical character recognition software is often effective in extracting the printed data from the scanned image. However, the print font must be of high quality to avoid translation errors, and spurious marks on the page can cause errors. Error checking is based on automated parameters specified by the operator of the system for exception handling. The comments on assessing error rates in the section above are applicable for scanning systems as well.

An electronic CRF (eCRF) is defined as an auditable electronic form designed to record information required by the clinical trial protocol to be reported to the sponsor on each trial subject.11 An eCRF allows clinician-reported data to be entered directly into the electronic system by the data collector (the clinician or other data collector). Site personnel in many registries still commonly complete an intermediate hard-copy worksheet representing the CRF and subsequently enter the data into the eCRF. While this approach increases work effort and error rates, it is still in use because it is not yet practical for all electronic data entry to be performed at the bedside, during the clinical encounter, or in the midst of a busy clinical day.

An eCRF may originate on local systems (including those on an individual computer, a local area network server, or a hand-held device) or directly from a central database server via an Internet-based connection or a private network. For registries that exist beyond a single site, the local system must subsequently transmit its data to a central data system. An eCRF may be presented visually (e.g., computer screen) or aurally (e.g., telephonic data entry, such as interactive voice response systems). Specific circumstances will favor different presentations. For example, in one clozapine patient registry that is otherwise similar to Case Example 24, both pharmacists and physicians can obtain and enter data via a telephone-based interactive voice response system as well as a Web-based system. The option is successful in this scenario because telephone access is ubiquitous in pharmacies and the eCRF is very brief.

A common method of electronic data entry is to use Web-based data entry forms. Such forms may be used by patients, providers, and interviewers to enter data into a local repository. The forms reside on servers, which may be located at the site of the registry or co-located anywhere on the Internet. To access a data entry form, a user on a remote computer with an Internet connection opens a browser window and enters the address of the Web server. Typically, a login screen is displayed and the user enters a user identification and password, provided by personnel responsible for the Web site or repository. Once the server authenticates the user, the data entry form is displayed, and the user can begin entering data. As described in “Cleaning Data” (Section 2.5), many electronic systems can perform data validation checks or edits at the time of data entry. When data entry is complete, the user submits the form, which is sent over the Internet to the Web server.

Smart phones or other mobile devices may also be used to submit data to a server to the extent such transmissions can be done with appropriate information security controls. Mobility has recently become an important attribute for clinical data collection. Software has been developed that enables wireless devices to collect data and transmit them over the Internet to database servers in fixed locations. As wireless technology continues to evolve and data transmission rates increase, these devices will likely become increasingly important for data entry by patients and clinicians.

When the medical record or ancillary data are in electronic format, they may be abstracted to the CRF by a data collector or, in some cases, uploaded electronically to the registry database. The ease of extracting data from electronic systems for use in a registry depends on the design of the interfaces of ancillary and registry systems, and the ability of the EMR or ancillary system software to make the requested data accessible. However, as system vendors increasingly adopt open standards for interoperability, transferring data from one system to another will likely become easier. Many organizations are actively working toward improved standards, including HL7,10 the National eHealth Collaborative,12 the National Institute of Standards and Technology,13 and others. Chapter 15 describes standards and certifications specific to EHR systems.

Electronic interfaces are necessary to move data from one computer to another. If clinical data are entered into a local repository from an eCRF form or entered into an EMR, the data must be extracted from the source dataset in the local repository, transformed into the format required by the registry, and loaded into the registry database for permanent storage. This is called an “extract, transform, and load” process. Unless the local repository is designed to be consistent with the registry database in terms of the names of variables and their values, data mapping and transformation can be a complex task. In some cases, manual transfer of the data may be more efficient and less time-consuming than the effort to develop an electronic interface. Emerging open standards can enable data to be transferred from an EHR directly into the registry. This topic is discussed in more detail in Chapter 15.
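A minimal sketch of the extract, transform, and load steps is shown below, using Python's built-in `sqlite3` module as a stand-in for the registry database. The local field names, the variable mapping `FIELD_MAP`, the sex coding in `SEX_CODES`, and the `registry` table layout are all hypothetical; a real interface would be driven by the registry's data dictionary.

```python
import sqlite3

# Hypothetical mapping from local repository names to registry names.
FIELD_MAP = {"pt_id": "patient_id", "sex_cd": "sex", "sbp": "systolic_bp"}
# Hypothetical recoding of local numeric sex codes to registry values.
SEX_CODES = {1: "M", 2: "F"}


def transform(local_row: dict) -> dict:
    """Rename local variables and recode values into the registry format."""
    row = {target: local_row.get(source) for source, target in FIELD_MAP.items()}
    row["sex"] = SEX_CODES.get(row["sex"])  # unmapped codes become missing (None)
    return row


def load(conn, rows):
    """Store transformed rows in the registry table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS registry "
        "(patient_id TEXT PRIMARY KEY, sex TEXT, systolic_bp REAL)"
    )
    conn.executemany(
        "INSERT INTO registry (patient_id, sex, systolic_bp) "
        "VALUES (:patient_id, :sex, :systolic_bp)",
        rows,
    )
    conn.commit()


def run_etl(conn, extracted_rows):
    """Transform rows already extracted from the local repository, then load them."""
    load(conn, [transform(r) for r in extracted_rows])


registry_conn = sqlite3.connect(":memory:")
run_etl(registry_conn, [{"pt_id": "A1", "sex_cd": 2, "sbp": 130, "visit_note": "x"}])
```

Fields that are not in the mapping (here `visit_note`) are dropped, which is one reason the mapping itself should be reviewed as carefully as the code.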

If an interface between a local electronic system and registry system is developed, it is still necessary to communicate to the ancillary system the criteria for retrieval and transmission of a patient record. Typically, the ancillary data are maintained in a relational database, and the system needs to run an SQL (Structured Query Language) query against the database to retrieve the specified information. An SQL query may specify individual patients by an identifier (e.g., a medical record number) or by values or ranges of specific variables (e.g., all patients with hemoglobin A1c over 8 percent). The results of the query are usually stored as a file (e.g., XML, CSV, CDISC ODM) that can be transformed and transferred to the registry system across the interface. A variety of interface protocols may be used to transfer the data.
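The retrieval step might look like the following sketch, which runs a parameterized SQL query against a small in-memory SQLite database and writes the result as CSV text. The table `lab_results`, its columns, and the function name `export_high_hba1c` are hypothetical; an actual ancillary system would have its own schema and export format.

```python
import csv
import io
import sqlite3


def export_high_hba1c(conn, threshold: float) -> str:
    """Return CSV text for all lab results with hemoglobin A1c above the threshold."""
    cursor = conn.execute(
        "SELECT mrn, hba1c, result_date FROM lab_results "
        "WHERE hba1c > ? ORDER BY mrn",
        (threshold,),
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)
    return buffer.getvalue()


# Stand-in for the ancillary laboratory database.
lab_conn = sqlite3.connect(":memory:")
lab_conn.execute("CREATE TABLE lab_results (mrn TEXT, hba1c REAL, result_date TEXT)")
lab_conn.executemany(
    "INSERT INTO lab_results VALUES (?, ?, ?)",
    [("1001", 7.2, "2010-03-01"), ("1002", 9.1, "2010-03-02")],
)
csv_text = export_high_hba1c(lab_conn, 8.0)
```

Passing the threshold as a query parameter rather than building the SQL string by hand keeps the selection criteria explicit and avoids injection problems if the criteria come from a configuration file.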

Because data definitions and formats are not yet nationally standardized, transfer of data from an EMR or ancillary system to a registry database is prone to error. Careful evaluation of the transfer specifications for interpretive or mapping errors is a critical step that the registry coordinating center should verify. Furthermore, a series of test transfers and validation procedures should be performed and documented. Finally, error checking must be part of the transfer process because new formats or other errors not in the test databases may be introduced during actual practice, and these need to be identified and isolated from the registry itself. Even though each piece of data may be accurately transferred, the data may have different representations on the different systems (e.g., value discrepancies such as the meaning of “0” vs. “1,” fixed vs. floating point numbers, date format, integer length, and missing values). In summary, any system used to extract EMR records into registry databases should be validated and should include an interval sampling of transfers to ensure that uploading of this information is consistent over time.
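Differences in representation, such as date formats and missing-value conventions, are often handled by explicit normalization functions that reject anything they do not recognize. The sketch below is one possible approach in Python; the list of missing-value tokens and the accepted date formats are assumptions that would need to be agreed on with each source system.

```python
from datetime import datetime

# Hypothetical conventions used by source systems to represent a missing value.
MISSING_TOKENS = {"", "NA", "N/A", "NULL", "-999"}
# Accepted source date formats; month/day/year order is an assumption here.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def normalize_missing(value):
    """Map the various missing-value conventions to None."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.upper() in MISSING_TOKENS else text


def normalize_date(value):
    """Convert a source date to ISO 8601 (YYYY-MM-DD), or None if missing."""
    text = normalize_missing(value)
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    # Unknown formats are raised so they can be isolated from the registry.
    raise ValueError(f"Unrecognized date format: {text!r}")
```

Raising an error on an unrecognized format, rather than guessing, supports the point above that new formats appearing in actual practice need to be identified and kept out of the registry until they are reviewed.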

The ancillary system must also notify the registry when an error correction occurs in a record already transferred to the registry. Registry software must be able to receive that notification, flag the erroneous value as invalid, and insert the new, corrected value into its database. Finally, it is important to recognize that the use of an electronic-to-electronic interchange requires not only testing but also validation of the integrity and quality of the data transferred. Few ancillary systems or EMR systems are currently validated to a defined standard. For registries that intend to report data to FDA or to other sponsors or data recipients with similar requirements, including electronic signatures, audit trails, and rigorous system validation, the ways in which the registry interacts with these other systems must be carefully considered.

Data cleaning refers to the correction or amelioration of data problems, including missing values, incorrect or out-of-range values, responses that are logically inconsistent with other responses in the database, and duplicate patient records. While all registries strive for “clean data,” in reality, this is a relative term. How and to what level the data will be cleaned should be addressed upfront in a data management manual that identifies the data elements that are intended to be cleaned, describes the data validation rules or logical checks for out-of-range values, explains how missing values and values that are logically inconsistent will be handled, and discusses how duplicate patient records will be identified and managed.

Data managers should develop formal data review guidelines for the reviewers and data entry personnel to use. The guidelines should include information on how to handle missing data; invalid entries (e.g., multiple selections in a single-choice field, alphabetic data in a numeric field); erroneous entries (e.g., responses to sex-specific questions recorded for a patient of the other sex); and inconsistent data (e.g., an answer to one question contradicting the answer to another one). The guidelines should also include procedures to attempt to remediate these data problems. For example, with a data error on an interview form, it may be necessary to query the interviewer or the patient, or to refer to other data sources that may be able to resolve the problem. Documentation of any data review activity and remediation efforts, including dates, times, and results of the query, should be maintained.

Ideally, automated data checks are preprogrammed into the database for presentation at the time of data entry. These data checks are particularly useful for cleaning data at the site level while the patient or medical record is readily accessible. Even relatively simple edit checks, such as range values for laboratories, can have a significant effect on improving the quality of data. Many systems allow for the implementation of more complex data edit checks, and these checks can substantially reduce the amount of subsequent manual data cleaning. A variation of this method is to use data cleaning rules to deactivate certain data fields so that erroneous entries cannot even be made. A combination of these approaches can also be used. For paper-based entry methods, automated data checks are not available at the time the paper CRF is being completed but can be incorporated when the data are later entered into the database.
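
A minimal sketch of such edit checks in Python follows. The field names are hypothetical, and the two rules simply echo examples used in this chapter (a systolic blood pressure range check and a pregnancy test result that is inconsistent with recorded sex); a real registry would drive the rules from its data dictionary.

```python
def check_record(rec):
    """Return a list of query messages for one patient record (empty if clean)."""
    queries = []

    # Missing-value and range check on a single field.
    sbp = rec.get("systolic_bp_mmhg")
    if sbp is None:
        queries.append("systolic_bp_mmhg: missing")
    elif not 0 < sbp < 300:
        queries.append("systolic_bp_mmhg: out of range")

    # Logical consistency check across two fields of the same record.
    if rec.get("sex") == "M" and rec.get("pregnancy_test") == "positive":
        queries.append("pregnancy_test: inconsistent with sex")

    return queries


clean = check_record({"sex": "F", "systolic_bp_mmhg": 120, "pregnancy_test": "negative"})
flagged = check_record({"sex": "M", "systolic_bp_mmhg": 320, "pregnancy_test": "positive"})
incomplete = check_record({"sex": "F"})
```

Run at the time of data entry, the returned messages can be shown to the person entering the data; run later on stored data, the same function can feed query reports.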

Data managers perform manual data checks or queries to review data for unexpected discrepancies. This is the standard approach to cleaning data that are not entered into the database at the site (e.g., for paper CRFs entered via data entry or scanning). By carefully reviewing the data using both data extracts analyzed by algorithms and hand review, data managers identify discrepancies and generate “queries” to send to the sites to resolve. Even eCRF-based data entry with data validation rules may not be fully adequate to ensure data cleaning for certain purposes. Anticipating all potential data discrepancies at the time that the data management manual and edit checks are developed is very difficult. Therefore, even with the use of automated data validation parameters, some manual cleaning is often still performed.

The registry coordinating center should generate, on a periodic basis, query reports that relate to the quality of the data received, based on the data management manual and, for some purposes, additional concurrent review by a data manager. The content of these reports will differ depending on what type of data cleaning is required for the registry purpose and how much automated data cleaning has already been performed. Query reports may include missing data, “out-of-range” data, or data that appear to be inconsistent (e.g., positive pregnancy test for a male patient). They may also identify abnormal trends in data, such as sudden increases or decreases in laboratory tests compared with patient historical averages or clinically established normal ranges. Qualified registry personnel should be responsible for reviewing the abnormal trends with designated site personnel. The most effective approach is for sites to provide one contact representative for purposes of queries or concerns by registry personnel. Depending on the availability of the records and resources at the site to review and respond to queries, resolving all queries can sometimes be a challenge. Creating systematic approaches to maximizing site responsiveness is recommended.
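
Detection of abnormal trends against a patient's own history can be sketched as follows. The rule used here (flag a new value more than three standard deviations from the mean of the patient's earlier values, and require at least three earlier values) is an assumption chosen for illustration, not a recommended threshold.

```python
from statistics import mean, stdev


def flag_abnormal_change(history, new_value, n_sd=3.0):
    """Return True if new_value lies more than n_sd standard deviations
    from the mean of the patient's earlier values."""
    if len(history) < 3:
        return False  # too little history to judge
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return new_value != mu
    return abs(new_value - mu) > n_sd * sd


# Made-up hemoglobin values (g/dL) for one patient.
hemoglobin_history = [13.8, 14.1, 13.9, 14.0]
sudden_drop = flag_abnormal_change(hemoglobin_history, 9.5)
usual_variation = flag_abnormal_change(hemoglobin_history, 14.2)
```

A flagged value is not necessarily an error. It is a prompt for qualified registry personnel to review the value with the site.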

For most registry purposes, tracking of data received (paper CRFs), data entered, data cleaned, and other parameters is an important component of active registry management. By comparing indicators, such as expected to observed rates of patient enrollment, CRF completion, and query rates, the registry coordinating center can identify problems and potentially take corrective action, either at individual sites or across the registry as a whole.
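
The comparison of indicators can be sketched in a few lines of Python. The site names, counts, and follow-up thresholds below are invented for the example; each registry would set its own.

```python
def site_indicators(site):
    """Compute observed-to-expected ratios and the query rate for one site.
    Assumes the expected counts and the completed-CRF count are nonzero."""
    return {
        "enrollment_ratio": site["enrolled"] / site["expected_enrolled"],
        "crf_completion": site["crfs_complete"] / site["crfs_expected"],
        "queries_per_crf": site["queries"] / site["crfs_complete"],
    }


def sites_needing_followup(sites, min_enrollment=0.8, min_completion=0.9,
                           max_query_rate=0.5):
    """Return the names of sites that miss any of the (hypothetical) thresholds."""
    flagged = []
    for name, site in sorted(sites.items()):
        ind = site_indicators(site)
        if (ind["enrollment_ratio"] < min_enrollment
                or ind["crf_completion"] < min_completion
                or ind["queries_per_crf"] > max_query_rate):
            flagged.append(name)
    return flagged


sites = {
    "site_01": {"expected_enrolled": 50, "enrolled": 48,
                "crfs_expected": 96, "crfs_complete": 94, "queries": 10},
    "site_02": {"expected_enrolled": 50, "enrolled": 22,
                "crfs_expected": 44, "crfs_complete": 30, "queries": 25},
}
followup = sites_needing_followup(sites)
```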

As further described in Chapter 4, the use of standardized coding dictionaries is an increasingly important tool in the ability to aggregate registry data with other databases. As the health information community adopts standards, registries should routinely apply them unless there are specific reasons not to use such standard codes. While such codes should be implemented in the data dictionaries during registry planning, including all codes in the interface is not always possible. Some free text may be entered as a result. When free text data are entered into a registry, recoding these data using standardized dictionaries (e.g., MedDRA, WHODRUG, SNOMED®) may be worthwhile. There is cost associated with recoding, and in general, it should be limited to data elements that will be used in analysis or that need to be combined or reconciled with other datasets, such as when a common safety database is maintained across multiple registries and studies.
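
A simple form of recoding is a dictionary lookup after normalizing the free text, with unmatched entries left for manual coding. In the Python sketch below, the codes are placeholders and not real MedDRA, WHODRUG, or SNOMED codes, and the synonym list is invented for the example.

```python
# Placeholder dictionary: these codes are NOT real terminology codes.
CODING_DICTIONARY = {"headache": "X0001", "nausea": "X0002"}
# Hypothetical synonyms mapping common free-text variants to dictionary terms.
SYNONYMS = {"head ache": "headache", "felt sick": "nausea"}


def recode(free_text):
    """Return the dictionary code for a free-text entry, or None if unmatched."""
    term = " ".join(free_text.lower().split())  # lowercase, collapse whitespace
    term = SYNONYMS.get(term, term)
    return CODING_DICTIONARY.get(term)


def recode_all(entries):
    """Split entries into coded (text, code) pairs and texts needing manual coding."""
    coded, manual = [], []
    for text in entries:
        code = recode(text)
        if code is None:
            manual.append(text)
        else:
            coded.append((text, code))
    return coded, manual


coded, manual = recode_all(["Headache", "  Head  Ache ", "dizzy"])
```

The size of the manual list gives a rough indication of the recoding cost for a given data element.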

When data on a form are entered into a computer for inclusion in a registry, the form itself, as well as a log of the data entered, should be maintained for the regulatory archival period. Data errors may be discovered long after the data have been stored in the registry. The error may have been made by the patient or interviewer on the original form or during the data entry process. Examination of the original form and the data entry log should reveal the source of the error. If the error is on the form, correcting it may require reinterviewing the patient. If the error occurred during data entry, the corrected data should be entered and the registry updated. By then, the erroneous registry data may have been used to generate reports or create cohorts for population studies. Therefore, instead of simply replacing erroneous data with corrected data, the registry system should have the ability to flag data as erroneous without deleting them and to insert the corrected data for subsequent use.
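
The flag-and-insert behavior described above can be sketched as an append-only value history. The class and field names are hypothetical; a production registry would more likely implement this in its database together with an audit trail.

```python
from datetime import datetime, timezone


class ValueHistory:
    """Append-only store: corrections invalidate old values but never delete them."""

    def __init__(self):
        self._entries = []

    def record(self, patient_id, field, value, reason="initial entry"):
        """Store a value, flagging any earlier valid value for the field as invalid."""
        now = datetime.now(timezone.utc).isoformat()
        for e in self._entries:
            if e["patient_id"] == patient_id and e["field"] == field and e["valid"]:
                e["valid"] = False
                e["invalidated_at"] = now
        self._entries.append({
            "patient_id": patient_id, "field": field, "value": value,
            "reason": reason, "entered_at": now,
            "valid": True, "invalidated_at": None,
        })

    def current(self, patient_id, field):
        """Return the value to use for subsequent reports, or None if absent."""
        for e in reversed(self._entries):
            if e["patient_id"] == patient_id and e["field"] == field and e["valid"]:
                return e["value"]
        return None

    def history(self, patient_id, field):
        """Return every value ever recorded for the field, oldest first."""
        return [e for e in self._entries
                if e["patient_id"] == patient_id and e["field"] == field]


store = ValueHistory()
store.record("A001", "systolic_bp_mmhg", 180)
store.record("A001", "systolic_bp_mmhg", 108, reason="data entry error corrected")
```

Because the erroneous value is retained with its timestamps, reports or cohorts generated before the correction can still be reproduced and explained.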

Once data are entered into the registry, the registry must be backed up on a regular basis. There are two basic types of backup, and both types should be considered for use as best practice by the registry coordinating center. The first type is real-time disk backup, which is done by the disk storage hardware used by the registry server. The second is a regular (e.g., daily) backup of the registry to removable media (e.g., tape, CD-ROM, DVD). In the first case, as data are stored on disk in the registry server, they are automatically replicated to two or more physical hard drives. In the simplest example, called “mirroring,” registry data are stored on a primary disk and an exact replica is stored on the mirrored disk. If either disk fails, data continue to be stored on the surviving disk until the failed disk is replaced. This failure can be completely transparent to the user, who may continue entering and retrieving data from the registry database during the failure. More complex disk backup configurations exist, in which arrays of disks are used to provide protection from single disk failures.

The second type, periodic backup, is needed for disaster recovery. Ideally, a daily backup copy of the registry database stored on removable media should be maintained off site. In case of failure of the registry server or disaster that closes the data center, the backup copy can be brought to a functioning server and the registry database restored, with the only potential loss of data being for the interval between the regularly scheduled backups. The lost data can usually be reloaded from local data repositories or re-entered from hard copy. Other advanced and widely available database solutions and disaster recovery techniques may support a “standby” database that can be located at a remote data center. In case of a failure at the primary data center, the standby database can be used, minimizing downtime and preventing data loss.

As with all other registry processes, the extent of change management will depend on the types of data being collected, the source(s) of the data, and the overall timeframe of the registry. There are two major drivers behind the need for change during the conduct of a registry: internally driven change to refine or improve the registry or the quality of data collected, and externally driven change that comes as a result of changes in the environment in which the registry is being conducted.

Internally driven change is generally focused on changes to data elements or data validation parameters that arise from site feedback, queries, and query trends that may point to a question, definition, or CRF field that was poorly designed or missing. If this is the case, the registry can use the information coming back from sites or data managers to add, delete, or modify the database requirements, CRFs, definitions, or data management manual as required. At times, more substantive changes, such as the addition of new forms or changes to the registry workflow, may be desirable to examine new conditions or outcomes. Externally driven change generally arises in multiyear registries as new information about the disease and/or product under study becomes available, or as new therapies or products are introduced into clinical practice. Change and turnover in registry personnel is another type of change, and one that can be highly disruptive if procedures are not standardized and documented.

A more extensive form of change may occur when a registry either significantly changes its CRFs or changes the underlying database. Longstanding registries address this issue from time to time as information regarding the condition or procedure evolves and data collection forms and definitions require updating. Chapter 14 discusses in more detail the process for making significant modifications to a registry.

Proper management of change is crucial to the maintenance of the registry. A consistent approach to change management, including decisionmaking, documentation, data mapping, and validation, is an important aspect of maintaining the quality of the registry and the validity of the data. While the specific change management processes might depend on the type and nature of the registry, change management in registries that are designed to evaluate patient outcomes requires, at the very least, the following structures and processes:

  • Detailed manual of procedures: As described earlier, a detailed manual that is updated on a regular basis is vital for the functioning of a registry. It should contain all the registry policies, procedures, and protocols, as well as a complete data dictionary listing all the data elements and their definitions. The manual is also a crucial component for managing and documenting change in a registry.

  • Governing body: As described in Chapter 2, Section 6, registries require oversight and advisory bodies for a number of purposes. One of the most important is to manage change on a regular basis. Keeping the registry manual and data definitions up to date is one of the primary responsibilities of this governing body. Large prospective registries, such as the National Surgical Quality Improvement Program, have found it necessary to delegate the updating of data elements and definitions to a special definitions committee.

  • Infrastructure for ongoing training: As mentioned above, change in personnel is a common issue for registries. Specific processes and an infrastructure for training should be available at all times to account for any unanticipated changes and turnover of registry personnel or providers who regularly enter data into the registry.

  • Method to communicate change: Since registries frequently undergo change, there should be a standard approach and timeline for communicating to sites when changes will take place.

In addition to instituting these structures, registries should also plan for change from a budget perspective (Chapter 2) and from an analysis perspective (Chapter 13).

As registries increasingly collect data in electronic format, the time between care delivery and data collection is reduced. This shorter timeframe offers significant opportunities to use registry functionalities to improve care delivery at the patient and population levels. These functionalities (Table 11–1) include the generation of outputs that promote care delivery and coordination at the individual patient level (e.g., decision support, patient reports, reminders, notifications, lists for proactive care, educational content) and the provision of tools that assist with population management, quality improvement, and quality reporting (e.g., risk adjustment, population views, benchmarks, quality report transmissions). A number of registries are designed primarily for these purposes. Several large national registries1, 14-16 have shown large changes in performance during the course of hospital or practice participation in the registry. For example, in one head-to-head study that used hospital data from Hospital Compare, an online database created by the Centers for Medicare & Medicaid Services, patients in hospitals enrolled in the American Heart Association's Get With The Guidelines® Coronary Artery Disease registry, which includes evidence-based reminders and real-time performance measurement reports, fared significantly better in measures of guidelines compliance than those in hospitals not enrolled in the registry.17

A performance-linked access system (PLAS), also known as a restricted access or limited distribution system, is another application of a registry to serve more than an observational goal. Unlike a disease and exposure registry, a PLAS is part of a detailed risk-minimization action plan that sponsors develop as a commitment to enhance the risk-benefit balance of a product when approved for the market. The purpose of a PLAS is to mitigate a certain known drug-associated risk by ensuring that product access is linked to a specific performance measure. Examples include systems that monitor laboratory values, such as white blood cell counts during clozapine administration to prevent severe leukopenia, or routine pregnancy testing during thalidomide administration to prevent in utero exposure to this known teratogenic compound. Additional information on PLAS can be found in FDA's Guidance for Industry: Development and Use of Risk Minimization Action Plans.18