Ontology Construction
This step processes the NSDUH 2022 data and codebook to extract the class structures defined by domain experts. The extracted structures are then used for ontology construction, specifically for mapping entities and relationships within the data.
Class Structure Parsing from PDF into JSON
code_book_class_structure_extraction.py
This script processes the NSDUH-2022-DS0001-info-codebook.pdf file to extract the class and variable mappings defined by domain experts. It reads specific pages from the PDF, processes the text to extract relevant class structures, and generates a JSON file (category.json) that organizes these structures hierarchically.
The JSON output is used for ontology construction tasks, particularly for entity and relation mapping.
Installation
To install the necessary dependencies, run the following command:
pip install pymupdf pandas
Run script
python code_book_class_structure_extraction.py --pdf_file /path/to/NSDUH-2022-DS0001-info-codebook.pdf
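The page-reading and hierarchy-building steps described above can be sketched roughly as follows. This is a minimal illustration, not the script itself: the function names, the page range arguments, and above all the text layout (an all-caps line as a class heading, followed by lines of the form "VARNAME  description") are assumptions made for the example, and the real codebook pages may be laid out differently.

```python
import json
import re

# Hypothetical layout, assumed only for this sketch:
#   a class heading is an all-caps line, e.g. "SAMPLE CLASS"
#   a variable line is "VARNAME", two or more spaces, then a description
VARIABLE = re.compile(r"^([A-Z][A-Z0-9_]*)\s{2,}(.+)$")
HEADING = re.compile(r"^[A-Z][A-Z &/-]+$")


def read_pages(pdf_path, first_page, last_page):
    """Return the text lines of a 0-based, inclusive page range via PyMuPDF."""
    import fitz  # PyMuPDF; imported here so build_hierarchy works without it

    doc = fitz.open(pdf_path)
    try:
        lines = []
        for page_no in range(first_page, last_page + 1):
            lines.extend(doc[page_no].get_text().splitlines())
        return lines
    finally:
        doc.close()


def build_hierarchy(lines):
    """Group variable lines under the most recent class heading."""
    hierarchy = {}
    current = None
    for raw in lines:
        line = raw.strip()
        variable = VARIABLE.match(line)
        if variable and current is not None:
            hierarchy[current][variable.group(1)] = variable.group(2)
        elif HEADING.match(line):
            current = line
            hierarchy.setdefault(current, {})
    return hierarchy


# Made-up sample lines standing in for text read from the PDF.
sample_lines = [
    "SAMPLE CLASS",
    "VAR1  First sample variable",
    "VAR2  Second sample variable",
    "OTHER CLASS",
    "VAR3  Third sample variable",
]
print(json.dumps(build_hierarchy(sample_lines), indent=2))
```

Keeping the PDF reading separate from the parsing makes it possible to test the hierarchy logic on plain text before pointing it at the actual codebook pages.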
Entity and Relation Extraction Using LLMs
This section explains the process of entity and relation extraction using large language models (LLMs) based on the code provided. The process consists of four scripts, executed in a specific order, to extract entities and relationships from the NSDUH 2022 data codebook, convert them to a structured format, and remove duplicates.
Installation
Ensure the Python packages required by the scripts are installed. The json module is part of the Python standard library and cannot be installed with pip, so only pandas needs to be installed:
pip install pandas
1. Entity and Relation Extraction from the Codebook
Entity_Rel_Extract_Codebook.py
Purpose:
This script performs entity and relation extraction from the NSDUH 2022 codebook using a language model. It extracts entities (e.g., variables) and their relations from the text of the codebook.
Input:
- NSDUH-2022-DS0001-info-codebook.pdf: The PDF file containing the codebook for the NSDUH 2022 data.
Output:
- entity_rel.json: A JSON file containing the extracted entities and relationships.
Example Command:
python Entity_Rel_Extract_Codebook.py --pdf_file /path/to/NSDUH-2022-DS0001-info-codebook.pdf --output_file entity_rel.json
2. JSON Comment Removal
Entity_Rel_JSON_Extract.py
Purpose:
This script removes comments from the JSON file generated by the previous step. Standard JSON does not allow comments, so any comments left in the extracted output must be stripped before the file can be parsed in the following steps.
Input:
- entity_rel.json: The JSON file generated by the previous script, containing the extracted entities and relationships.
Output:
- clean_entity_rel.json: A cleaned JSON file with comments removed.
Example Command:
python Entity_Rel_JSON_Extract.py --input_file entity_rel.json --output_file clean_entity_rel.json
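One way to strip comments without damaging string values is a small character scanner, sketched below. It assumes the comments are of the `//` and `/* ... */` kind; the comment styles that Entity_Rel_JSON_Extract.py actually handles are not stated here, and the function name is hypothetical.

```python
import json


def strip_json_comments(text):
    """Remove // and /* */ comments that appear outside string literals."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                out.append(text[i + 1])  # keep the escaped character as-is
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept.
            while i < n and text[i] != "\n":
                i += 1
        elif text.startswith("/*", i):
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Made-up input: the URL contains "//" inside a string and must survive.
raw = '{\n  // a line comment\n  "url": "http://example.org", /* block */\n  "n": 1\n}'
print(json.loads(strip_json_comments(raw)))
```

Tracking whether the scanner is inside a string is the important part: a plain regular expression that deletes everything after `//` would also truncate URLs stored as values.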
3. Convert JSON to CSV
Entity_Rel_JSON_toCSV.py
Purpose:
This script converts the cleaned JSON file into a CSV format, which is more suitable for further analysis or machine learning tasks.
Input:
- clean_entity_rel.json: The cleaned JSON file from the previous step.
Output:
- entity_rel.csv: A CSV file containing the extracted entities and relationships in a structured format.
Example Command:
python Entity_Rel_JSON_toCSV.py --input_file clean_entity_rel.json --output_file entity_rel.csv
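The conversion can be illustrated with the standard library alone. The sketch below assumes the cleaned JSON is a list of records and uses the hypothetical column names head, relation, and tail; the real schema of clean_entity_rel.json may differ, and the actual script may use pandas instead of the csv module.

```python
import csv
import io
import json

# Hypothetical column names, chosen only for this example.
FIELDS = ["head", "relation", "tail"]


def json_to_csv_text(json_text):
    """Convert a JSON list of records into CSV text with a fixed header."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing keys become empty cells; unexpected keys are dropped.
        writer.writerow({field: record.get(field, "") for field in FIELDS})
    return buffer.getvalue()


# Made-up records standing in for the cleaned extraction output.
sample = '[{"head": "A", "relation": "relatedTo", "tail": "B"}, {"head": "C", "relation": "partOf"}]'
print(json_to_csv_text(sample))
```

Fixing the column list up front keeps the CSV header stable even when individual records produced by the language model omit or add fields.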
4. Remove Duplicate Entities
Entity_duplicate_drop.py
Purpose:
This script removes any duplicate entities from the CSV file generated in the previous step to ensure that each entity appears only once in the dataset.
Input:
- entity_rel.csv: The CSV file containing the entities and relationships.
Output:
- unique_entity_rel.csv: A cleaned CSV file with duplicate entities removed.
Example Command:
python Entity_duplicate_drop.py --input_file entity_rel.csv --output_file unique_entity_rel.csv
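A minimal sketch of the de-duplication step is shown below. It keeps the first occurrence of each row and treats values that differ only in case or surrounding whitespace as duplicates; that matching rule, the column names, and the function name are assumptions for illustration, and Entity_duplicate_drop.py may define duplicates differently.

```python
import csv
import io


def drop_duplicate_rows(rows, key_fields):
    """Keep the first row for each normalized combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field].strip().casefold() for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up CSV content; the second row repeats the first except for case/spacing.
sample_csv = "head,relation,tail\nA,relatedTo,B\na ,relatedTo,B\nA,relatedTo,C\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
unique_rows = drop_duplicate_rows(rows, ["head", "relation", "tail"])
print(len(rows), "->", len(unique_rows))
```

Passing only an entity column as `key_fields` would instead keep one row per entity, which may be closer to the intent if each entity should appear exactly once.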
Ontology TTL Generation from Extracted Entities and Relationships
generate_ontology_ttl.py
This script generates an ontology in TTL format based on the extracted entities and relationships from the NSDUH 2022 codebook. It uses the class structure provided in category.json and the relationships in a CSV file to create RDF triples and serialize them into a TTL file for ontology construction.
Input:
- category.json: A JSON file containing the class structure extracted from the NSDUH 2022 codebook.
- relation.csv: A CSV file containing the relationships between entities.
Output:
- entity_as_instance_with_relation.ttl: A TTL file representing the ontology with entities and their relationships.
Installation
To install the necessary dependencies, run the following command:
pip install pandas rdflib
Run script
python generate_ontology_ttl.py --category_file /path/to/category.json --relation_csv /path/to/relation.csv --output_file /path/to/output.ttl