Code Sets in the Ed-Fi Data Standard
We live in a world surrounded by conceptual categories and classifications. In K12 education, such classifications are everywhere in school operations and processes, and so they are everywhere in our data too.
- Does the course count for “math”, “social sciences”, or “language arts” credits?
- Is the student in “first grade”, “second grade”, “third grade”, etc.?
- Is a student who doesn’t show up to school “absent”, “absent with medical documentation”, “homebound”, “absent for CTE program” or other?
- Is a school an “elementary”, “middle”, “high”, “credit recovery high” or other, or multiple of these?
Commonly these various lists are referred to as “code sets.” In many technical contexts, they are referred to as “enumerations.” In Ed-Fi, they are referred to as “descriptors.” Fundamentally, these code sets address the real-world problem of how to deal with and classify variance.
Change is coming to how Ed-Fi handles descriptors. We are excited about these changes as we see them creating some new possibilities and resolving some pain points for the community.
If you are in a hurry and just want to know about the changes, skip below to the final section, “What’s New.”
What follows between here and that final section is a long history of descriptors in Ed-Fi. It is certainly longer than it needs to be, but we hope it helps explain the why behind the upcoming changes. It is also something of a historical piece that might be useful to other efforts considering data standardization. Or perhaps it’s just a personal coping mechanism to account for the mountains of time others and I have spent over the past years thinking about education code sets!
The Importance and Difficulties of Code Sets
Code sets are powerful tools in data, and if you think about it, that’s largely because much of the time they are categorizations that capture the outcomes of some process or meaningful classifications of things.
If a student shows up to school late, we could simply give all of the raw facts about the event: at what time attendance was checked, who checked it, what time the student showed up, whether a note was provided, who verified the note, who signed the note, and so on. But there is an attendance process that does all that work, and simply issues a value: “absent with medical documentation.” The code set is the result of a classification process, often a complex one.
I’m sure no one reading this needs to be convinced of the value of these code sets – or descriptors (I’m going to default to the latter Ed-Fi term for this article). We took a slight detour to think about their values as the output of processes in order to shed some light on what makes descriptors difficult in data standardization: variation.
The key goal of any data standard is to provide a blueprint that captures community agreement on how to represent data.
Descriptors challenge that goal.
Let’s take an example of a descriptor that is fairly standardized: grade level. All across K12 education, we see schools use the concept of grade level to organize students and instruction, and the semantics of the values everyone uses are mostly very similar: “Pre-Kindergarten”, “Kindergarten”, “First Grade”, etc. The code values in that list may vary – maybe a school SIS uses numbers “-1”, “0”, “1”, “2”, “3”, etc. – but the meaning of the values is essentially the same (or close enough) between those lists (we would say it is easy to “map” the one list to the other).
However, even with grade levels, there are outliers. For grade level specifically, we typically see outliers at the beginning and end of that list: “Pre-kindergarten 1”, “Pre-kindergarten 2”, “Transitional kindergarten”, “Grade 13” and so on.
Grade level is a simple example: think about trying to classify attendance, behavioral incidents or academic subjects.
Sometimes code sets differ simply due to legacy decisions that were made in the absence of a data standard – as in the grade level example. Many times code sets differ because of fundamental differences in how a school district runs its business of education. For example, differences in attendance codes typically reflect the processes that the district has chosen to track and remediate absences.
When designing a data standard, it is tempting to try to add all possible values to your descriptor lists: be inclusive! That strategy results in madness: unending requests to add new values (“we need the value ‘Pre-transitional kindergarten level B’ please”) resulting in long lists that are hard to understand and are of questionable value for consumers of the data, who need to be ready for anything to appear in the data they get. That degrades the project and promise of standardization.
It is also possible to force everyone into the same list: “I’m sorry, the Ed-Fi values are ‘X’, ‘Y’, ‘Z’ – you have some very uncommon values, and you have to deal with that” (i.e., you must map them in a way that loses data fidelity, changes your processes, etc.). That could be a blocker, possibly large enough to undermine an organization’s willingness to use the data standard in the first place: what if the school district superintendent is asking specifically about the success of the “Pre-transitional kindergarten level B” program?
Ed-Fi’s standards and tools have responded to this challenge in a number of ways over the history of the Ed-Fi Data Standard, and we are iterating again.
Generations of Thinking on Descriptors
To learn how we arrived at the present, it is useful to look at the past. There have been 2 major generations of thought on descriptors, and we are entering a 3rd generation.
Note that this history isn’t really a history of descriptor implementation. There has been a steady stream of changes and tweaks over time to exactly how descriptors looked and behaved, both in Ed-Fi’s standards and in Ed-Fi’s tools. We will reference some, but not all, of these changes.
Rather, this is a history of the concepts behind descriptors, and a history of how the vision of Ed-Fi standards shaped and defined descriptors.
The First “Visionary” Generation
In the first generation, the concept was largely that – in Ed-Fi specifications – descriptors could and would be standardized across K12. This was a visionary and expansive view of code sets: if we shared code sets, it would simplify our capability to share tools and code as well, and sharing these types of artifacts can be huge accelerators to unblocking and unlocking data use in K12 at scale.
This vision was also sustained by looking around at other standards – those had pre-defined code sets (for the most part at least) and K12 standards like CEDS that were important to the design and structure of Ed-Fi had them as well (in fact, CEDS was the main source of the original default Ed-Fi descriptor values that are published in the standard).
The vision was also heavily influenced by the state reporting use case, where code sets for reporting were set by the SEA. Because many of the code sets were based upon Federal reporting needs, there was often code set commonality between states; for other code sets there were differences due to state-specific laws or policies.
There were various “implementation” phases to this early period, and in the main period descriptors were divided into 2 separate elements: Descriptors and Types.
- Types were the core, shared values of the Ed-Fi Data Standard. These allowed the Ed-Fi standards to say: “there will be 14 and only 14 grade levels in Ed-Fi, and they are …” These values would come pre-loaded into the Ed-Fi ODS platform, and you were not supposed to touch them.
- Descriptors were the values that sat in front of these Types. They played 2 roles: they allowed for the presence of local values, and they allowed the code value to vary from the official code value.
So, for example, you might see this structure in the data:
Descriptor (i.e. code set for the agency) | Mapped to Ed-Fi Type |
“1” | “First Grade” |
“PK-1” | “Pre-Kindergarten” |
“PK-2” | “Pre-Kindergarten” |
The goal was to allow some flexibility while holding onto strongly defined and shared code sets. This generation held the vision of broad code set standardization across K12 as a means of unlocking analytics and interoperability at scale.
The example above also illustrates the problem that when mapping code sets one may lose granularity, such as when PK-1 and PK-2 are both mapped to Pre-Kindergarten. This means that when viewing a student’s grade level in Ed-Fi, one cannot discern whether a Pre-Kindergarten student was operationally enrolled in PK-1 or PK-2. This also means that if only the mapped values are stored, the more granular values cannot be reconstructed.
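To make the loss of granularity concrete, here is a minimal sketch in Python. The dictionary simply restates the table above; the names are hypothetical and not part of any Ed-Fi tooling. Inverting the mapping shows which Types can no longer be traced back to a single local value.

```python
from collections import defaultdict

# Hypothetical agency descriptors mapped to Ed-Fi Types (values from the table above).
DESCRIPTOR_TO_TYPE = {
    "1": "First Grade",
    "PK-1": "Pre-Kindergarten",
    "PK-2": "Pre-Kindergarten",
}


def invert(mapping):
    """Group local descriptors by the Type they map to."""
    by_type = defaultdict(list)
    for descriptor, type_value in mapping.items():
        by_type[type_value].append(descriptor)
    return dict(by_type)


TYPE_TO_DESCRIPTORS = invert(DESCRIPTOR_TO_TYPE)

# A Type with more than one source descriptor cannot be mapped back unambiguously.
AMBIGUOUS_TYPES = {t: d for t, d in TYPE_TO_DESCRIPTORS.items() if len(d) > 1}
```

Here `AMBIGUOUS_TYPES` contains only “Pre-Kindergarten”, which is exactly the case where storing just the mapped value discards information.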
The Second “Tactical” Generation
The second generation approach was more tactical. This generation confronted the difficulties of the initial “visionary” generation and tried to preserve that vision, and to do so in pursuit of the same goal of scaling and sharing tools for data analytics.
In this generation, field usage of Ed-Fi slowly but surely revealed the difficulties of standardizing code set values across K12.
The thinking was now influenced by more district implementations and broader experiences with student information systems that allowed districts to control their code sets to reflect their local policies and practices.
First, the Descriptor/Type division really didn’t solve the core problem of variation – it just offered some flexibility. If Types were the enumeration sets that “really matter” and what we were using to “really drive” an ecosystem of shared tools and practice, then I want my local values to be in the list of Types…and the original issues appear again:
- Everyone wants their codes as part of the standard, so you get lots of requests for more and more Type values. This causes the values that are part of the standard to lose coherence and be difficult to consume.
- If the standard fails to add those values, then agencies might choose to add them anyhow or simply not use the standard, introducing fragmentation in either case.
What transpired was a slow irrelevance of Types – since those were not flexible they largely became ignored – and descriptors became strongly localized: use whatever descriptor values were needed. That unblocked field work in practical ways, but it didn’t help advance the goals of standardization. And sharing analytics tools was still difficult.
The budding realization was that it would never be possible to have the community agree sufficiently on the core values for descriptors, even for descriptors that look like they could be standardized, like grade level. The key implementation pivot of this generation was the removal of Types from the standard (in the move from Suite 2 to Suite 3) and the emergence of the concept of “operational context.”
The concept of operational context was essentially that data exchange is shaped by the exchange context. When I transmit data to the state, that is one context. When I move data from my SIS to my local data warehouse, that is another, more local context. Contexts obviously map strongly to governance spheres, and these spheres can be strongly governed by legislation and policy, or they can be ones I simply agree to (I won’t go into data governance and its role in shaping data, for fear of filling up my email inbox with corrections and commentary!).
So, it is impossible for K12 to share a single set of values for any code set, because the composition (including definitions/semantics) of any set is dependent on a particular operational context. There is never a single “Platonic” context for any set of descriptor values, just shared contexts.
Now, this can look like a disturbing realization: if Ed-Fi is designed to standardize data across K12, doesn’t it need to lock down descriptor values so we can jointly benefit by sending and receiving data that we can understand fully? Plus, if you look at other standards, don’t they have specific lists of allowed values? For example, OneRoster 1.0 defines strict enumeration values for a user’s “role”: “teacher”, “student”, “parent”, “guardian”, “relative”, “aide”, “administrator”. Shouldn’t Ed-Fi do the same?
The OneRoster example actually helps make the point about operational context. That list of values is one that implementers agree to when they use OneRoster 1.0. Why can’t Ed-Fi simply do the same?
We covered that a little bit ago when we talked about the problem with the first generation, but to understand why this works for OneRoster, you might think of standards as being wide or deep. The more you go wide, the more difficult it is to go deep, and vice versa: the more you go narrow, the deeper you can go. The rostering aspects of a standard like OneRoster 1.0 were focused on a narrow problem: getting lists of users into learning tools. By defining a narrower problem, it is easier to go deep. Or said another way: it is easier to generate community buy-in for a standard if it is narrowly scoped.
Ed-Fi is wide, because one of the core concepts is to unlock data at scale and to bring disparate data across the organization together. The concept of operational context didn’t mean you couldn’t share code sets – you could – but you would do this as an explicit governance choice.
This is why this generation of descriptor implementation introduced namespaces into the data exchange. Those namespaces give you a mechanism for reflecting the operational context and its ownership; that is, they provide an indicator of what the context is and who governs it. Descriptors in transit went from looking like this:
"academicSubjectDescriptor": "English Language Arts"
…to looking like this…
"academicSubjectDescriptor": "uri://ed-fi.org/AcademicSubjectDescriptor#English Language Arts"
Note the namespace – in this example it is Ed-Fi’s namespace, but others could publish their own values (such as states who do this today).
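As a rough illustration of how a receiver might use that structure, the sketch below splits a namespaced descriptor into its namespace and its code value. The helper name is hypothetical, and it assumes that everything after the first “#” is the code value.

```python
def split_descriptor(uri):
    """Split 'namespace#codeValue' into (namespace, code_value).

    Assumes the first '#' separates the namespace from the code value.
    """
    namespace, sep, code_value = uri.partition("#")
    if not sep or not namespace or not code_value:
        raise ValueError(f"not a namespaced descriptor: {uri!r}")
    return namespace, code_value


namespace, code_value = split_descriptor(
    "uri://ed-fi.org/AcademicSubjectDescriptor#English Language Arts"
)
```

With the namespace in hand, a consumer can tell whose list a value belongs to before deciding how to interpret it.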
Important to this second generation was also not giving up on the vision of a community sharing code sets and therefore removing significant friction to sharing tools for analytics. The hallmark of this second generation was recommendations like the Descriptor Guidance on TechDocs: the attempt to tease apart where descriptor values could be standard and where that standardization provided less value. I’d characterize those efforts as well-meaning but unsatisfying: there was still something missing from the picture of how to manage code sets.
The Third Generation
We are entering a third generation of thought and action, informed by the prior two generations. It is an exciting time. (If you are the kind of person who has read this article this far, then you might be the kind of person to find this exciting as well.)
This third generation of thinking is influenced by having greater clarity about the many operational contexts that naturally exist. We started with the state reporting use case – now known to be an operational context that requires code sets that comply with Federal and state laws and policies. A district’s operation of its schools is another operational context, where the code sets reflect the unique details of how the schools are run. We now know that there are many more operational contexts – different classifications that reflect the unique viewpoints of learning management systems, of executive oversight, of chronic absenteeism, or of equitable discipline policies, just to name a few.
This third generation of thought fully accepts the concept of operational context: that there are defined data exchange contexts with agreed-on descriptor set values. These agreements often go beyond descriptor sets as well; for example, these contexts often also define how the identity of data elements is handled (e.g., “when doing state reporting, you must use the state-assigned student IDs and the state-defined course codes”).
But this third generation of thought goes beyond that to think about the mappings of descriptor values (and possibly other elements in the future) from one context to another as data itself.
For example, consider student absence codes. In the school district student information system (SIS), there is a local code for a student absence. The value for a particular student on a particular day might be a value like “Absent with Medical Documentation.” But when reporting to the state, the local value “Absent with Medical Documentation” may not be an allowed state value; instead, the code value sent to the state is simply “Absent.”
 | School district (Grand Bend) | State |
Code | ABSENT-MED | ABS |
Definition | Absent with Medical Documentation | Absent for any reason |
Now the SIS knows about both of these codes, and it knows the mapping. So it is possible for the SIS to express and exchange that mapping as data.
Data Standard 4.0 (currently in version 4.0-a, an early access release) introduces a new element called DescriptorMapping that captures such mappings. For the API, the exchange listed above might look something like this:
HTTPS POST /ed-fi/descriptorMappings
{
  "Value": "ABSENT-MED",
  "Namespace": "uri://grandbend.edu/AttendanceEvent",
  "MappedValue": "ABS",
  "MappedNamespace": "uri://somestate.edu/AttendanceEvent",
  "MappingScope": []
}
This exchange is solving a key problem that emerged under the second generation of descriptors: the need to choose 1 and only 1 value for a descriptor, even when a source system has multiple values. Now systems can send a mapping that allows other systems to see the value of a descriptor in other operational contexts, and make it clear – via the namespace – what the context is.
This is a very important development in Ed-Fi’s data exchange capability as it allows multiple operational contexts for data elements to be reflected in the descriptors.
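For readers who think in code, here is a minimal sketch of how a source system might assemble that request body in Python. The function name is hypothetical, and the field names are copied from the illustrative example above; the property casing and endpoint in a given API release may differ.

```python
import json


def descriptor_mapping(value, namespace, mapped_value, mapped_namespace, mapping_scope=()):
    """Build a DescriptorMapping body using the field names from the example above."""
    return {
        "Value": value,
        "Namespace": namespace,
        "MappedValue": mapped_value,
        "MappedNamespace": mapped_namespace,
        "MappingScope": list(mapping_scope),
    }


# The Grand Bend local code mapped to the state code from the table above.
payload = descriptor_mapping(
    "ABSENT-MED",
    "uri://grandbend.edu/AttendanceEvent",
    "ABS",
    "uri://somestate.edu/AttendanceEvent",
)
body = json.dumps(payload, indent=2)  # JSON text ready to send as the request body
```

Sending the request is left out on purpose, since authentication and base URLs vary by deployment.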
Why might this be useful? Under the previous generation, an LEA interested in populating a local data warehouse to run analyses on attendance would have to choose whether the SIS sends the local value or the state value to a local API. But what if the LEA wanted to generate some reports that needed the local value and some reports that needed the state value?
Using this feature, an LEA will no longer have to choose whether it receives the local attendance code or the state attendance code: it can receive data that opens up both use cases. The usage pattern is also clear here: the SIS should send the local code with the original attendance event, and then the state descriptor mapping. In this way, the mapping is always from the specific/local values to the more generalized values.
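On the receiving side, that pattern might be sketched as follows. This is an assumption-laden illustration, not Ed-Fi tooling: the lookup table, the `in_context` helper, and the shape of the `event` dictionary are all hypothetical. The idea is only that the event carries the local code, and the mapping lets the receiver view it in the state context as well.

```python
LOCAL_NS = "uri://grandbend.edu/AttendanceEvent"
STATE_NS = "uri://somestate.edu/AttendanceEvent"

# Received mappings, keyed by (source namespace, source value, target namespace).
MAPPINGS = {
    (LOCAL_NS, "ABSENT-MED", STATE_NS): "ABS",
}


def in_context(namespace, value, target_namespace, mappings=MAPPINGS):
    """Return the code in target_namespace, or None if no mapping is known."""
    if namespace == target_namespace:
        return value
    return mappings.get((namespace, value, target_namespace))


# Hypothetical attendance event carrying the local (most specific) code.
event = {"namespace": LOCAL_NS, "value": "ABSENT-MED"}

local_code = in_context(event["namespace"], event["value"], LOCAL_NS)
state_code = in_context(event["namespace"], event["value"], STATE_NS)
```

Because the event keeps the local code, a report can use either `local_code` or `state_code` without asking the SIS to resend anything.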
What’s New
So, what’s new? There is a new technical feature as well as new recommendations, in the form of normative guidance and programmatic changes. These changes were previewed at the Ed-Fi Tech Congress 2022 where we hosted a fantastic community discussion (who knew you could conduct a 75-minute discussion on code sets and yet still have more to say? – see the session notes).
First, Data Standard 4.0 (currently in version 4.0-a) introduces a new element called DescriptorMapping that allows the capture and publication of descriptor mappings from source systems as data. The entity looks like this (this is pseudo code – you can see an example of a JSON binding in the previous section):
DescriptorMapping
- Value: The value being mapped
- Namespace: The namespace of the value being mapped
- MappedValue: The value to map to
- MappedNamespace: The namespace of the value mapped to
- MappingScope: The scope of the mapping; i.e. which entities or resources it applies to
This entity will allow a student information system or other system to send as data the mappings it contains, allowing a receiver of that data to understand both the local and other contexts.
For example, now a school district will be able to receive both the local attendance code for a student (e.g., “Student A is ‘Absent with Medical Documentation’”) and the state value via the mapping (“Absent with Medical Documentation” maps to the state value “Absent”).
Second, there is guidance to go along with this, and the Alliance is making some program changes to help push this forward. The guidance is that the normative behavior of a source system that has a mapping of descriptors to other contexts should be to send the data in the most local and specific context, and then send the mapping to the other context as mapping data. Doing so minimizes the potential loss of valuable information by the receiving agency.
The Alliance will roll this guidance into our certification testing. That testing will validate that systems with known mappings can provide them for use cases the community has identified where this is a priority. For example, the SIS certification will begin testing that local descriptors are provided, and that mappings to state values that have been made by district staff in the SIS are also provided. That opens up huge possibilities for both LEAs and for the ecosystem generally, as those “canonical” state mappings can also be propagated by API to other systems.
It’s been a long road to get this far, and we really appreciate all the input and feedback from the community that got us here.
As always – and especially with anything that is new – we want your feedback. Please reach out to me via Ed-Fi Slack or email, or stop me in the hallways at the 2022 Ed-Fi Summit.