2014 Dataset

Task 1 Dataset

The input data provided to participants consists of six carefully chosen cases from the CLEF eHealth 2013 tasks. The first case is mandatory for all participants; the other five cases are optional. Each case includes a discharge summary in which the disorder spans are marked and mapped to SNOMED-CT (Systematized Nomenclature of Medicine Clinical Terms, Concept Unique Identifiers) and the shorthand spans are marked and mapped to the UMLS (Unified Medical Language System). Each discharge summary is also associated with a profile that describes the patient (e.g., “A forty year old woman, who seeks information about her condition” for Case 2), a narrative that describes the patient's information need (e.g., “description of what type of disease hypothyreoidism is”), a query that addresses this information need by searching Internet documents, and the list of documents that were judged relevant to the query. Each query consists of a description (e.g., “What is hypothyreoidism”) and a title (e.g., “Hypothyreoidism”).


Cases

Case 1 (mandatory)

1. Patient profile: This 55-year old woman with a chronic pancreatitis is worried that her condition is getting worse. She wants to know more about jaundice and her condition

2. De-identified discharge summary

3. Information need: chronic alcoholic induced pancreatitis and jaundice in connection with it

4. Query: is jaundice an indication that the pancreatitis has advanced

a. Title: chronic alcoholic induced pancreatitis and jaundice


Case 2 (optional)

1. Patient profile: A forty year old woman, who seeks information about her condition

2. De-identified discharge summary

3. Information need: description of what type of disease hypothyreoidism is

4. Query: What is hypothyreoidism

a. Title: Hypothyreoidism


Case 3 (optional)

1. Patient profile: This 50-year old female is worried about what is MI, that her father has and is this condition hereditary. She does not want additional trouble on top of her current illness

2. De-identified discharge summary

3. Information need: description of what type of disease hypothyreoidism is

4. Query: MI

a. Title: MI and hereditary


Case 4 (optional)

1. Patient profile: This 87-year old female has had several incidences of abdominal pain with no clear reason. The family now wants to seek information about her bruises and raccoon eyes. Could they be a cause of some blood disease

2. De-identified discharge summary

3. Information need: can bruises and raccoon eyes be symptoms of blood disease

4. Query: bruises and raccoon eyes and blood disease

a. Title: bruises and raccoon eyes and blood disease


Case 5 (optional)

1. Patient profile: A 60-year-old male who knows that helicobacter pylori is causing cancer and now wants to know if his current abdominal pain could be a symptom of cancer

2. De-identified discharge summary

3. Information need: is abdominal pain due to helicobacter pylori a symptom of cancer

4. Query: cancer, helicobacter pylori and abdominal pain

a. Title: abnominal pain and helicobacter pylori and cancer


Case 6 (optional)

1. Patient profile: A 43-year old male with down Syndrome lives in an extended care facility. The personnel wants to know if they can avoid frothy sputum in connection with the patient's chronic aspiration and status post laryngectomy

2. De-identified discharge summary

3. Information need: how to avoid frothy sputum

4. Query: frothy sputum and how to avoid and care for this condition

a. Title: frothy sputum and care


Discharge Summaries

After completing the registration and data agreement steps, participants will receive the set of 6 de-identified discharge summaries, in which the disorder spans are marked and mapped to SNOMED-CT (Systematized Nomenclature of Medicine Clinical Terms, Concept Unique Identifiers) and the shorthand spans are marked and mapped to the UMLS (Unified Medical Language System).


Query Set

The CLEF eHealth 2013 Task 3 data set consisted of 50 real patient queries generated from discharge summaries, a collection of approximately 1 million health-related documents (web pages) against which the queries can be searched, and a list of the documents that were judged relevant to each query (the result set). In 2014, we use the 6 query cases described above.

The queries were generated manually by healthcare professionals from a manually extracted set of highlighted disorders in the discharge summaries. A mapping between each query and its matching discharge summary (from which the disorder was taken) is provided.

Queries are distributed in an extended TREC-style format, in which title, description and narrative are as in the classic format and the additional fields are as follows:


1. discharge_summary: matching discharge summary, and

2. profile: details about the patient extracted, or inferred, from the discharge summary (which is required for determining the information which is being sought by the patient).
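As an illustration of the fields listed above, the following minimal Python sketch holds one query in memory. It is not the official distribution format: the class name `Query`, the identifier `qtest2`, and the discharge summary file name are hypothetical placeholders, while the title, description, narrative and profile are taken from Case 2.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """One query with the classic TREC fields plus the two additional fields."""
    query_id: str            # hypothetical identifier, not taken from the data
    title: str               # classic TREC field
    description: str         # classic TREC field
    narrative: str           # classic TREC field
    discharge_summary: str   # additional field: matching discharge summary
    profile: str             # additional field: details about the patient


# Case 2, with placeholder values where the text gives none.
case2 = Query(
    query_id="qtest2",                        # assumption for illustration
    title="Hypothyreoidism",
    description="What is hypothyreoidism",
    narrative="description of what type of disease hypothyreoidism is",
    discharge_summary="case2_summary.txt",    # placeholder file name
    profile="A forty year old woman, who seeks information about her condition",
)

print(case2.title)
```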


Document Set

1. The web pages that were judged for relevance for each of the 6 queries are provided in a set of .dat files.

2. Each .dat file contains a collection of web pages and metadata, where the data for one web page is organised as follows:

a. a unique identifier (#UID) for a web page in this document collection,

b. the date of crawl in the form YYYYMM (#DATE),

c. the URL (#URL) to the original web page, and

d. the raw HTML content (#CONTENT) of the web page.


A short example illustrates the structure of a .dat file:


#UID:acidr1783_12_000001

#DATE:201204-06

#URL:http://www.acidreflux-heartburn-gerd.net

#CONTENT:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

.

<body>

.

<h2 class="graytext">Children's Reflux and Infant Reflux</h2>

<p class="tighterleading"><a href="/acidreflux/children.html"><strong>Children and Acid Reflux</strong></a><br /> Children experiencing reflux can exhibit typical symptoms, such as heartburn and regurgitation, or atypical symptoms...</p>

.

.

</body>

</html>

#EOR

#UID:acidr1783_12_000002

#DATE:201204-06

#URL:http://www.acidreflux-heartburn-gerd.net/News/beatheartburn.html

#CONTENT:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

.

.

<li><a href="/acidreflux/nighttimeacidreflux.html">nighttime acid reflux<br /> </a></li>

</ul>

<h3><a href="../heartburn/index.html">Heartburn</a></h3>

<ul class="menulist">

<li><a href="../heartburn/acidheartburn.html">acid heartburn</a></li>

<li><a href="../heartburn/heartburn_remedies.html">heartburn remedies</a></li>

.

.

</html>

#EOR
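The record layout shown above can be read with a few lines of Python. This is a minimal sketch, not an official reader: the function name `parse_dat` and the dictionary keys are our own, and it assumes that each of #UID, #DATE and #URL occupies a single line, that everything between #CONTENT: and #EOR is page content, and that no page contains a line consisting only of #EOR.

```python
from typing import Dict, Iterable, Iterator, List


def parse_dat(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per web page from the lines of a .dat file."""
    record: Dict[str, str] = {}
    content: List[str] = []
    in_content = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == "#EOR":
            # End of record: emit what was collected and reset the state.
            if record or content:
                record["content"] = "\n".join(content).strip()
                yield record
            record, content, in_content = {}, [], False
        elif in_content:
            content.append(line)
        elif line.startswith("#UID:"):
            record["uid"] = line[len("#UID:"):].strip()
        elif line.startswith("#DATE:"):
            record["date"] = line[len("#DATE:"):].strip()
        elif line.startswith("#URL:"):
            record["url"] = line[len("#URL:"):].strip()
        elif line.startswith("#CONTENT:"):
            in_content = True
            rest = line[len("#CONTENT:"):]
            if rest.strip():
                content.append(rest)


# Shortened stand-in for the example above (HTML bodies abbreviated).
sample = """#UID:acidr1783_12_000001
#DATE:201204-06
#URL:http://www.acidreflux-heartburn-gerd.net
#CONTENT:
<html><body>first page</body></html>
#EOR
#UID:acidr1783_12_000002
#DATE:201204-06
#URL:http://www.acidreflux-heartburn-gerd.net/News/beatheartburn.html
#CONTENT:
<html><body>second page</body></html>
#EOR
"""

pages = list(parse_dat(sample.splitlines()))
for page in pages:
    print(page["uid"], page["url"])
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, which avoids loading a large .dat file into memory at once.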


Result Set

Relevance assessment was performed by medical professionals. Relevance is provided on a 2-point scale: non-relevant (0) or relevant (1). The relevance assessments are provided in a file in the standard TREC qrel format. An extract from the provided file follows:

qtest1 0 atlas0954_12_001451 0

qtest1 0 atlas0954_12_001673 0

qtest1 0 atlas0954_12_001766 0

qtest1 0 atlas0954_12_002713 0

qtest1 0 atlas0954_12_002762 1

qtest1 0 atlas0954_12_002793 0

qtest1 0 atlas0954_12_002799 0

qtest1 0 clini0836_12_016941 0

qtest1 0 clini0836_12_016942 0

qtest1 0 clini0836_12_044473 0


For the purposes of Task 1, the first column is the query number, the third column is the document ID, and the fourth column indicates whether the document is relevant (1) or not relevant (0) to the query. The second column is the iteration field of the TREC qrel format and can be ignored.
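The column layout described above can be loaded into a lookup table as follows. This is a minimal sketch under the assumption that columns are separated by whitespace; the function name `parse_qrels` is our own, and the input lines are the extract shown above.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map each query ID to a {document ID: relevance} dictionary."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, doc_id, relevance = parts
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


extract = [
    "qtest1 0 atlas0954_12_001451 0",
    "qtest1 0 atlas0954_12_001673 0",
    "qtest1 0 atlas0954_12_001766 0",
    "qtest1 0 atlas0954_12_002713 0",
    "qtest1 0 atlas0954_12_002762 1",
    "qtest1 0 atlas0954_12_002793 0",
    "qtest1 0 atlas0954_12_002799 0",
    "qtest1 0 clini0836_12_016941 0",
    "qtest1 0 clini0836_12_016942 0",
    "qtest1 0 clini0836_12_044473 0",
]

qrels = parse_qrels(extract)
# Documents judged relevant (1) for query qtest1 in this extract.
relevant = [doc for doc, rel in qrels["qtest1"].items() if rel == 1]
print(relevant)
```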

 
Obtaining Task 1 Dataset

To participate, you must first register for CLEF 2014. The registration page will open in November 2013, and we will provide the link here (as well as on the registration page). Once we have received your registration, we will email you further guidelines on gaining access to the Task 1 data.