Structure, Parse, and Analyze Messages

Parse a collection of messages, then use pandas to do post-hoc analysis on the structured results.

In this example, we will structure a number of patient messages that describe side effects experienced after taking a medication. In a real-world setting, these messages may originate from chat conversations or emails between patients and the clinician, or from the clinician’s clinical notes for each patient. The patient messages seen on this page were created by us for the purposes of this example.

Step 1: Import Packages

First, import all of the packages you will need to successfully run the code. These include:

  • the ScienceIO library
  • pandas (under the pd alias)
  • tqdm (to create a progress bar for loops)

# Import all required packages
from scienceio import ScienceIO
import pandas as pd
from tqdm.notebook import tqdm

Step 2: Load Messages

Next, take a list of dictionaries with 10 messages sent by patients describing the side effects of the medication they took, load them into a pandas DataFrame, and examine the first few rows.

# Input patient messages
messages = [
    {"message_id": 1,
     "user_id": 25,
     "message_text": "I had severe headache, nausea, bodyache, decreased appetite from taking Cetirizine"},
    {"message_id": 2,
     "user_id": 26,
     "message_text": "I had severe diarrhea, insomnia, dry mouth, sexual dysfunction from taking Clozapine"},
    {"message_id": 3,
     "user_id": 29,
     "message_text": "I had severe constipation, drowsiness, nausea, decreased appetite from taking Doxorubicin"},
    {"message_id": 4,
     "user_id": 31,
     "message_text": "I had increased sweating, insomnia, bodyache, sexual dysfunction from taking Cetirizine"},
    {"message_id": 5,
     "user_id": 25,
     "message_text": "I had severe headache, dizziness, dry mouth, decreased appetite from taking Clozapine"},
    {"message_id": 6,
     "user_id": 12,
     "message_text": "I had severe insomnia, sexual dysfunction, dry mouth, dizziness from taking Cetirizine"},
    {"message_id": 7,
     "user_id": 123,
     "message_text": "I had severe nausea, constipation, insomnia, decreased appetite from taking Doxorubicin"},
    {"message_id": 8,
     "user_id": 23,
     "message_text": "I had headache, nausea, bodyache, dry mouth from taking Doxorubicin"},
    {"message_id": 9,
     "user_id": 122,
     "message_text": "I had diarrhea, insomnia, dry mouth, dizziness from taking Clozapine"},
    {"message_id": 10,
     "user_id": 5,
     "message_text": "I had symptoms of insomnia, nausea, bodyache, increased sweating from taking Clozapine"},
]
# Load to df
messages_df = pd.DataFrame(messages)
messages_df.head()

The resulting table has three columns: message_id, user_id, and message_text.
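Before structuring the messages, it can help to sanity-check the loaded table. The sketch below is one possible check, not part of the ScienceIO workflow itself: it rebuilds a three-row subset of the messages above, confirms that every message_id is unique, and lists users who sent more than one message. The variable names messages_per_user and repeat_users are our own, chosen for illustration.

```python
import pandas as pd

# A three-row subset of the example messages from this step
messages_df = pd.DataFrame([
    {"message_id": 1, "user_id": 25,
     "message_text": "I had severe headache, nausea, bodyache, decreased appetite from taking Cetirizine"},
    {"message_id": 2, "user_id": 26,
     "message_text": "I had severe diarrhea, insomnia, dry mouth, sexual dysfunction from taking Clozapine"},
    {"message_id": 5, "user_id": 25,
     "message_text": "I had severe headache, dizziness, dry mouth, decreased appetite from taking Clozapine"},
])

# Each message should appear exactly once
ids_are_unique = messages_df["message_id"].is_unique

# Count messages per user and keep the users with more than one message
messages_per_user = messages_df["user_id"].value_counts()
repeat_users = messages_per_user[messages_per_user > 1].index.tolist()

print(ids_are_unique)
print(repeat_users)
```

Knowing which users appear more than once is useful later, because their structured results will share a user_id across several message_id values.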

Step 3: Structure the Messages

Now we can structure the messages by calling the structure endpoint. To do this, use the following code to loop through the messages and perform the following tasks on each one:

  • Structure all healthcare concepts found by calling the endpoint
  • Add the structured results into a pandas DataFrame
  • Add columns containing the message_id and user_id to the DataFrame
  • Create a concatenated DataFrame containing all healthcare concepts that were identified

You may be asked for your API keys during this process.

# Create a ScienceIO client
scio = ScienceIO()

# Collect one DataFrame of structured results per message
frames = []
for _, message in tqdm(messages_df.iterrows(), total=len(messages_df)):
    # Structure each message separately using the ScienceIO API
    results = scio.structure(message["message_text"])
    if results:
        temp_df = pd.DataFrame(results["spans"])
        temp_df["message_id"] = message["message_id"]
        temp_df["user_id"] = message["user_id"]
        frames.append(temp_df)

# Concatenate the per-message results into a single DataFrame
structured_results = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Inspect the DataFrame
structured_results.head()

The resulting table includes the message_id and user_id columns as well as the columns returned by the API. Note that message_text is replaced by a text column, which holds each healthcare concept found within the original message.
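As a first piece of post-hoc analysis, you might count how many concepts were found in each message and join that count back to the original messages. The following is a minimal sketch that runs without an API key: the structured_results table here is a hand-written stand-in containing only the text, message_id, and user_id columns, so the real API output will have additional columns and may identify different concepts.

```python
import pandas as pd

# Two of the example messages from Step 2
messages_df = pd.DataFrame([
    {"message_id": 8, "user_id": 23,
     "message_text": "I had headache, nausea, bodyache, dry mouth from taking Doxorubicin"},
    {"message_id": 9, "user_id": 122,
     "message_text": "I had diarrhea, insomnia, dry mouth, dizziness from taking Clozapine"},
])

# Hand-written stand-in for structured_results (an assumption, not real API output)
structured_results = pd.DataFrame([
    {"text": "headache", "message_id": 8, "user_id": 23},
    {"text": "nausea", "message_id": 8, "user_id": 23},
    {"text": "Doxorubicin", "message_id": 8, "user_id": 23},
    {"text": "diarrhea", "message_id": 9, "user_id": 122},
    {"text": "Clozapine", "message_id": 9, "user_id": 122},
])

# Count the concepts identified in each message
concepts_per_message = (
    structured_results.groupby("message_id")["text"]
    .count()
    .rename("n_concepts")
    .reset_index()
)

# Join the counts back onto the original messages
summary = messages_df.merge(concepts_per_message, on="message_id", how="left")
print(summary[["message_id", "user_id", "n_concepts"]])
```

The left merge keeps every original message in the summary, so a message with no identified concepts would show a missing count rather than disappear.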

Next Steps

Want to do more? Try filtering these messages by concept type.
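One way to approach this is sketched below. It assumes the structured results include a concept_type column, and the values "Medical Conditions" and "Chemicals & Drugs" are used for illustration only; check the columns and labels in your own structured_results before relying on them. The sample rows are hand-written stand-ins, not real API output.

```python
import pandas as pd

# Hand-written stand-in for structured_results; the concept_type column
# and its values are assumptions made for this example
structured_results = pd.DataFrame([
    {"text": "headache", "concept_type": "Medical Conditions", "message_id": 1, "user_id": 25},
    {"text": "nausea", "concept_type": "Medical Conditions", "message_id": 1, "user_id": 25},
    {"text": "Cetirizine", "concept_type": "Chemicals & Drugs", "message_id": 1, "user_id": 25},
    {"text": "insomnia", "concept_type": "Medical Conditions", "message_id": 6, "user_id": 12},
    {"text": "Cetirizine", "concept_type": "Chemicals & Drugs", "message_id": 6, "user_id": 12},
])

# Filter the concepts by type
conditions = structured_results[structured_results["concept_type"] == "Medical Conditions"]
drugs = structured_results[structured_results["concept_type"] == "Chemicals & Drugs"]

# Pair each condition with the drug mentioned in the same message
pairs = conditions.merge(drugs, on="message_id", suffixes=("_condition", "_drug"))

# Count how often each side effect is reported for each drug
side_effect_counts = (
    pairs.groupby(["text_drug", "text_condition"])
    .size()
    .rename("mentions")
    .reset_index()
)
print(side_effect_counts)
```

Pairing on message_id works here because each example message mentions exactly one medication; messages that mention several drugs would need a more careful pairing rule.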

Questions?

If you need additional help, we’re standing by ready to assist! Contact support.