API Reference
This section contains the complete API reference for chemsource.
Main Classes
chemsource: A tool for classifying novel drugs and health-related chemicals by origin.
This package provides functionality to classify chemical compounds using AI models and retrieve chemical information from various sources including PubMed and Wikipedia.
- Classes:
ChemSource: Main class for chemical compound classification and information retrieval.
- Version:
1.1.17
- class chemsource.ChemSource(model_api_key=None, model='gpt-4o', ncbi_key=None, prompt='You are a helpful scientist that will classify the provided compound COMPOUND_NAME using only the information provided as any combination of the following: MEDICAL, ENDOGENOUS, FOOD, PERSONAL CARE, INDUSTRIAL. Note that MEDICAL refers to compounds actively used as approved medications in humans or in late-stage clinical trials in humans. Note that ENDOGENOUS refers to compounds that are produced by the human body specifically. ENDOGENOUS excludes essential nutrients that cannot be synthesized by the human body. Note that FOOD refers to compounds present in natural food items or food additives. Note that PERSONAL CARE refers to non-medicated compounds typically used for activities such as skincare, beauty, and fitness. Note that INDUSTRIAL should be used only for synthetic compounds not used as a contributing ingredient in the medical, personal care, or food industries. Specify INFO instead if more information is needed. DO NOT MAKE ANY ASSUMPTIONS, USE ONLY THE INFORMATION PROVIDED AFTER THE COMPOUND NAME BY THE USER. A classification of INFO will also be rewarded when correctly applied and is strongly encouraged if information is of poor quality, if there is not enough information, or if you are not completely confident in your answer. Provide the output as a plain text separated by commas, and provide only the categories listed (either list a combination of INDUSTRIAL, ENDOGENOUS, PERSONAL CARE, MEDICAL, FOOD or list INFO), with no justification. Provided Information:\\n', temperature=0, top_p=1e-07, max_tokens=250000, clean_output=False, explanation=False, explanation_separator='EXPLANATION_COMPLETE', output_explanation=False, allowed_categories=None, custom_client=None)[source]
Bases: Config
Main class for chemical compound classification and information retrieval.
ChemSource combines configuration management, information retrieval from multiple sources (PubMed, Wikipedia), and AI-powered classification of chemical compounds. It extends the Config class to provide a complete solution for chemical information processing.
- Parameters:
model_api_key (str, optional) – API key for the language model service.
model (str, optional) – Name of the language model to use. Defaults to “gpt-4o”.
ncbi_key (str, optional) – API key for NCBI/PubMed access.
prompt (str, optional) – Custom prompt template. Defaults to BASE_PROMPT.
temperature (float, optional) – Temperature parameter for model creativity. Defaults to 0.
top_p (float, optional) – Top-p parameter for nucleus sampling. Defaults to 0.0000001.
max_tokens (int, optional) – Maximum number of tokens for model context. Defaults to 250000.
clean_output (bool, optional) – Whether to clean and validate output. Defaults to False.
explanation (bool, optional) – Whether to expect explanations in model responses. Only effective when clean_output=True. Requires a custom prompt that instructs the model to include the explanation_separator. Defaults to False.
explanation_separator (str, optional) – Delimiter separating explanation from classification. Only used when both clean_output and explanation are True. Defaults to “EXPLANATION_COMPLETE”.
allowed_categories (List[str], optional) – List of allowed categories for filtering. Defaults to None.
custom_client (Any, optional) – Custom OpenAI client instance. Defaults to None.
- Raises:
ValueError – If clean_output is True but allowed_categories is None or empty.
TypeError – If allowed_categories is not a list when clean_output is True.
- spell_checker
Spell checker instance for output correction (when clean_output is enabled).
- Type:
SpellChecker
- custom_client
The custom client instance.
- Type:
Any
Example
>>> chem = ChemSource(model_api_key="your_key")
>>> info, classification = chem.chemsource("aspirin")
>>> print(classification)
MEDICAL
>>> # Using the explanation feature with a custom prompt
>>> custom_prompt = "Explain your reasoning, then write EXPLANATION_COMPLETE, then provide categories..."
>>> chem = ChemSource(model_api_key="your_key", prompt=custom_prompt,
...                   clean_output=True, explanation=True,
...                   allowed_categories=["MEDICAL", "FOOD"])
>>> info, classification = chem.chemsource("aspirin")
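The Raises entries above describe two checks that tie clean_output to allowed_categories. A minimal sketch of equivalent checks follows; `validate_cleaning_options` is a hypothetical helper written for illustration, not part of chemsource, and the order of the checks is an assumption.

```python
from typing import List, Optional


def validate_cleaning_options(clean_output: bool,
                              allowed_categories: Optional[List[str]]) -> None:
    """Hypothetical helper mirroring the documented Raises rules."""
    if not clean_output:
        return  # No constraints apply when output cleaning is disabled.
    # A non-list value (for example a tuple) is rejected with TypeError.
    if allowed_categories is not None and not isinstance(allowed_categories, list):
        raise TypeError("allowed_categories must be a list when clean_output is True")
    # None or an empty list is rejected with ValueError.
    if not allowed_categories:
        raise ValueError("allowed_categories is required when clean_output is True")
```

Running a check like this before constructing the object surfaces configuration mistakes without making any API call.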
- Parameters:
output_explanation (bool, optional) – Whether to return the explanation text alongside classification. Defaults to False.
- chemsource(name, priority='WIKIPEDIA', single_source=False)[source]
Retrieve information and classify a chemical compound.
This is the main method that combines information retrieval and classification. It retrieves information about the compound from specified sources and then classifies it using the configured AI model.
- Parameters:
name (str) – The name of the chemical compound to look up and classify.
priority (str, optional) – Priority source for information retrieval. Options: “WIKIPEDIA”, “PUBMED”. Defaults to “WIKIPEDIA”.
single_source (bool, optional) – Whether to use only the priority source. Defaults to False.
- Returns:
A tuple containing:
- Information tuple: (source, content)
- Classification result (list or tuple depending on output_explanation)
- Explanation text (only if output_explanation=True)
- Return type:
Union[Tuple[Tuple[Optional[str], Optional[str]], Optional[str]], Tuple[Tuple[Optional[str], Optional[str]], Optional[str], Optional[str]]]
- Raises:
ValueError – If model_api_key is not provided.
Example
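>>> chem = ChemSource(model_api_key="your_key")
>>> info, classification = chem.chemsource("aspirin")
>>> print(info[0])  # Source
>>> print(info[1])  # Content
>>> print(classification)  # Classification result
>>> # With explanation output (needs clean_output=True, allowed_categories,
>>> # and a prompt that asks for the explanation_separator)
>>> custom_prompt = "Explain your reasoning, then write EXPLANATION_COMPLETE, then provide categories..."
>>> chem = ChemSource(model_api_key="your_key", prompt=custom_prompt,
...                   clean_output=True, explanation=True,
...                   output_explanation=True,
...                   allowed_categories=["MEDICAL", "FOOD"])
>>> info, (classification, explanation) = chem.chemsource("aspirin")
>>> print(classification)  # List of categories
>>> print(explanation)  # Explanation text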
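The documented return type lists a two-element and a three-element tuple, while the example unpacks a nested pair. A defensive caller can accept all of these shapes. The sketch below assumes only the shapes named in this section; `unpack_result` is a hypothetical helper, not part of chemsource.

```python
from typing import Any, Optional, Tuple

Info = Tuple[Optional[str], Optional[str]]


def unpack_result(result: tuple) -> Tuple[Info, Any, Optional[str]]:
    """Hypothetical helper: normalize to (info, classification, explanation)."""
    if len(result) == 3:
        # Flat form: (info, classification, explanation)
        info, classification, explanation = result
        return info, classification, explanation
    info, classification = result
    # Nested form: classification is itself a (categories, explanation) pair.
    if isinstance(classification, tuple) and len(classification) == 2:
        categories, explanation = classification
        return info, categories, explanation
    # Plain form: no explanation was returned.
    return info, classification, None
```

With this in place, calling code can always write `info, classification, explanation = unpack_result(chem.chemsource("aspirin"))` and treat a `None` explanation as "not requested".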
- classify(name, information)[source]
Classify a chemical compound based on provided information.
This method classifies a chemical compound using the provided information and the configured AI model.
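- Parameters:
name (str) – The name of the chemical compound to classify.
information (str) – Information about the compound on which to base the classification.
- Returns:
Classification result. Returns None if information is empty, otherwise returns a string (if clean_output=False) or list of strings (if clean_output=True).
- Return type:
Optional[Union[str, List[str]]]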
- Raises:
ValueError – If neither model_api_key nor custom_client is provided.
Example
>>> chem = ChemSource(model_api_key="your_key")
>>> result = chem.classify("aspirin", "pain relief medication")
>>> print(result)
MEDICAL
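Because the result may be None, a raw string, or a list of strings, downstream code often benefits from a single shape. The sketch below converts all three into a list. It assumes the raw string is comma-separated, as the default prompt requests; `to_category_list` is a hypothetical helper, not part of chemsource.

```python
from typing import List, Optional, Union


def to_category_list(result: Optional[Union[str, List[str]]]) -> List[str]:
    """Hypothetical helper: turn any documented classify() result into a list."""
    if result is None:
        return []  # No information was available to classify.
    if isinstance(result, str):
        # Raw model output: split on commas and drop empty fragments.
        return [part.strip() for part in result.split(",") if part.strip()]
    return list(result)  # Already a cleaned list of categories.
```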
- retrieve(name, priority='WIKIPEDIA', single_source=False)[source]
Retrieve information about a chemical compound from various sources.
This method retrieves information about a chemical compound from sources like Wikipedia and PubMed without performing classification.
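- Parameters:
name (str) – The name of the chemical compound to look up.
priority (str, optional) – Priority source for information retrieval. Options: “WIKIPEDIA”, “PUBMED”. Defaults to “WIKIPEDIA”.
single_source (bool, optional) – Whether to use only the priority source. Defaults to False.
- Returns:
A tuple containing (source, content).
- Return type:
Tuple[Optional[str], Optional[str]]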
Example
>>> chem = ChemSource()
>>> source, content = chem.retrieve("aspirin")
>>> print(f"Retrieved from {source}: {content[:100]}...")
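The information tuple is typed elsewhere in this reference as a pair of optional strings, so slicing content directly can fail when nothing was retrieved. A small guard avoids that. This is a sketch under that assumption; `preview` is a hypothetical helper, not part of chemsource.

```python
from typing import Optional


def preview(source: Optional[str], content: Optional[str], limit: int = 100) -> str:
    """Hypothetical helper: format a short preview without failing on None."""
    if not content:
        return "No content retrieved"
    label = source if source else "unknown source"
    # Only add an ellipsis when the content was actually truncated.
    suffix = "..." if len(content) > limit else ""
    return f"Retrieved from {label}: {content[:limit]}{suffix}"
```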
Configuration
Configuration module for chemsource.
This module contains configuration classes and constants used throughout the chemsource package.
- chemsource.config.BASE_PROMPT = 'You are a helpful scientist that will classify the provided compound COMPOUND_NAME using only the information provided as any combination of the following: MEDICAL, ENDOGENOUS, FOOD, PERSONAL CARE, INDUSTRIAL. Note that MEDICAL refers to compounds actively used as approved medications in humans or in late-stage clinical trials in humans. Note that ENDOGENOUS refers to compounds that are produced by the human body specifically. ENDOGENOUS excludes essential nutrients that cannot be synthesized by the human body. Note that FOOD refers to compounds present in natural food items or food additives. Note that PERSONAL CARE refers to non-medicated compounds typically used for activities such as skincare, beauty, and fitness. Note that INDUSTRIAL should be used only for synthetic compounds not used as a contributing ingredient in the medical, personal care, or food industries. Specify INFO instead if more information is needed. DO NOT MAKE ANY ASSUMPTIONS, USE ONLY THE INFORMATION PROVIDED AFTER THE COMPOUND NAME BY THE USER. A classification of INFO will also be rewarded when correctly applied and is strongly encouraged if information is of poor quality, if there is not enough information, or if you are not completely confident in your answer. Provide the output as a plain text separated by commas, and provide only the categories listed (either list a combination of INDUSTRIAL, ENDOGENOUS, PERSONAL CARE, MEDICAL, FOOD or list INFO), with no justification. Provided Information:\n'
Default prompt template for chemical compound classification
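The template contains the placeholder COMPOUND_NAME and ends with "Provided Information:" followed by a newline, which suggests that the compound name is substituted in and the retrieved text is appended. The sketch below shows that assembly under exactly that assumption; how chemsource actually builds the final prompt is not stated here, and `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names.

```python
# Shortened stand-in for BASE_PROMPT so the example stays self-contained.
PROMPT_TEMPLATE = (
    "You are a helpful scientist that will classify the provided compound "
    "COMPOUND_NAME using only the information provided. "
    "Provided Information:\n"
)


def build_prompt(template: str, name: str, information: str) -> str:
    """Assumed assembly: substitute the name, then append the information."""
    return template.replace("COMPOUND_NAME", name) + information


prompt = build_prompt(PROMPT_TEMPLATE, "aspirin", "pain relief medication")
```

A custom prompt that keeps the same placeholder and trailing marker would slot into this scheme unchanged.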
- class chemsource.config.Config(model_api_key=None, model='gpt-4o', temperature=0, top_p=1e-07, ncbi_key=None, prompt='You are a helpful scientist that will classify the provided compound COMPOUND_NAME using only the information provided as any combination of the following: MEDICAL, ENDOGENOUS, FOOD, PERSONAL CARE, INDUSTRIAL. Note that MEDICAL refers to compounds actively used as approved medications in humans or in late-stage clinical trials in humans. Note that ENDOGENOUS refers to compounds that are produced by the human body specifically. ENDOGENOUS excludes essential nutrients that cannot be synthesized by the human body. Note that FOOD refers to compounds present in natural food items or food additives. Note that PERSONAL CARE refers to non-medicated compounds typically used for activities such as skincare, beauty, and fitness. Note that INDUSTRIAL should be used only for synthetic compounds not used as a contributing ingredient in the medical, personal care, or food industries. Specify INFO instead if more information is needed. DO NOT MAKE ANY ASSUMPTIONS, USE ONLY THE INFORMATION PROVIDED AFTER THE COMPOUND NAME BY THE USER. A classification of INFO will also be rewarded when correctly applied and is strongly encouraged if information is of poor quality, if there is not enough information, or if you are not completely confident in your answer. Provide the output as a plain text separated by commas, and provide only the categories listed (either list a combination of INDUSTRIAL, ENDOGENOUS, PERSONAL CARE, MEDICAL, FOOD or list INFO), with no justification. Provided Information:\\n', max_tokens=250000, clean_output=False, explanation=False, explanation_separator='EXPLANATION_COMPLETE', output_explanation=False, allowed_categories=None, custom_client=None)[source]
Bases: object
Configuration class for chemsource parameters.
This class manages all configuration parameters for the chemsource system, including API keys, model settings, and output formatting options.
- Parameters:
model_api_key (str, optional) – API key for the language model service.
model (str, optional) – Name of the language model to use. Defaults to “gpt-4o”.
temperature (float, optional) – Temperature parameter for model creativity. Defaults to 0.
top_p (float, optional) – Top-p parameter for nucleus sampling. Defaults to 0.0000001.
ncbi_key (str, optional) – API key for NCBI/PubMed access.
prompt (str, optional) – Custom prompt template. Defaults to BASE_PROMPT.
max_tokens (int, optional) – Maximum number of tokens for model context. Defaults to 250000.
clean_output (bool, optional) – Whether to clean and validate output. Defaults to False.
explanation (bool, optional) – Whether to expect explanations in model responses. Only effective when clean_output=True. Defaults to False.
explanation_separator (str, optional) – Delimiter separating explanation from classification. Only used when both clean_output and explanation are True. Defaults to “EXPLANATION_COMPLETE”.
allowed_categories (List[str], optional) – List of allowed categories for filtering. Defaults to None.
custom_client (Any, optional) – Custom OpenAI client instance. Defaults to None.
- custom_client
The custom client instance.
- Type:
Any
- Parameters:
output_explanation (bool, optional) – Whether to return the explanation text alongside classification. Defaults to False.
- set_explanation_output(output_explanation)[source]
Set whether to output explanations along with classifications.
- set_allowed_categories(allowed_categories)[source]
Set the list of allowed categories for filtering.
- set_custom_client(custom_client)[source]
Set a custom OpenAI client instance.
- Parameters:
custom_client (Any, optional) – Custom OpenAI client instance.
- Return type:
None
- configure(ncbi_key=None, model_api_key=None, model='gpt-4o', temperature=0, top_p=0, prompt='You are a helpful scientist that will classify the provided compound COMPOUND_NAME using only the information provided as any combination of the following: MEDICAL, ENDOGENOUS, FOOD, PERSONAL CARE, INDUSTRIAL. Note that MEDICAL refers to compounds actively used as approved medications in humans or in late-stage clinical trials in humans. Note that ENDOGENOUS refers to compounds that are produced by the human body specifically. ENDOGENOUS excludes essential nutrients that cannot be synthesized by the human body. Note that FOOD refers to compounds present in natural food items or food additives. Note that PERSONAL CARE refers to non-medicated compounds typically used for activities such as skincare, beauty, and fitness. Note that INDUSTRIAL should be used only for synthetic compounds not used as a contributing ingredient in the medical, personal care, or food industries. Specify INFO instead if more information is needed. DO NOT MAKE ANY ASSUMPTIONS, USE ONLY THE INFORMATION PROVIDED AFTER THE COMPOUND NAME BY THE USER. A classification of INFO will also be rewarded when correctly applied and is strongly encouraged if information is of poor quality, if there is not enough information, or if you are not completely confident in your answer. Provide the output as a plain text separated by commas, and provide only the categories listed (either list a combination of INDUSTRIAL, ENDOGENOUS, PERSONAL CARE, MEDICAL, FOOD or list INFO), with no justification. Provided Information:\\n', max_tokens=250000, clean_output=False, explanation=False, explanation_separator='EXPLANATION_COMPLETE', output_explanation=False, allowed_categories=None, custom_client=None)[source]
Configure all parameters at once.
- Parameters:
ncbi_key (str, optional) – API key for NCBI/PubMed access.
model_api_key (str, optional) – API key for the language model service.
model (str, optional) – Name of the language model to use. Defaults to “gpt-4o”.
temperature (float, optional) – Temperature parameter for model creativity. Defaults to 0.
top_p (float, optional) – Top-p parameter for nucleus sampling. Defaults to 0.
prompt (str, optional) – Custom prompt template. Defaults to BASE_PROMPT.
max_tokens (int, optional) – Maximum number of tokens for model context. Defaults to 250000.
clean_output (bool, optional) – Whether to clean and validate output. Defaults to False.
explanation (bool, optional) – Whether to expect explanations in model responses. Defaults to False.
explanation_separator (str, optional) – Delimiter separating explanation from classification. Defaults to “EXPLANATION_COMPLETE”.
output_explanation (bool, optional) – Whether to return the explanation text alongside classification. Defaults to False.
allowed_categories (List[str], optional) – List of allowed categories for filtering. Defaults to None.
custom_client (Any, optional) – Custom OpenAI client instance. Defaults to None.
- Return type:
None
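Since configure accepts every setting as a keyword argument, the arguments can be collected in a dictionary and passed with `**`. The sketch below gathers API keys from environment variables; the variable names OPENAI_API_KEY and NCBI_API_KEY and the helper `config_kwargs_from_env` are assumptions made for illustration, not part of chemsource.

```python
import os
from typing import Any, Dict, Mapping, Optional


def config_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Collect keyword arguments for configure() from environment variables.

    The variable names used here are assumptions, not chemsource conventions.
    """
    env = os.environ if env is None else env
    kwargs: Dict[str, Any] = {}
    if env.get("OPENAI_API_KEY"):
        kwargs["model_api_key"] = env["OPENAI_API_KEY"]
    if env.get("NCBI_API_KEY"):
        kwargs["ncbi_key"] = env["NCBI_API_KEY"]
    return kwargs
```

A call such as `config.configure(**config_kwargs_from_env())` would then apply them. Any argument omitted from the call takes the default shown in the configure signature (for example top_p=0), which can differ from the constructor default.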
Classification
Chemical classification module for chemsource.
This module provides AI-powered classification functionality for chemical entities and compounds.
- chemsource.classifier.classify(name, input_text=None, api_key=None, baseprompt=None, model='gpt-4o', temperature=0, top_p=0, max_length=250000, clean_output=False, explanation=False, explanation_separator='EXPLANATION_COMPLETE', output_explanation=False, allowed_categories=None, custom_client=None, spell_checker=None)[source]
Classify a chemical compound using an AI language model.
This function takes a chemical compound name and additional information, then uses an AI model to classify it into predefined categories.
- Parameters:
name (str) – The name of the chemical compound to classify.
input_text (str, optional) – Additional information about the compound.
api_key (str, optional) – API key for the language model service.
baseprompt (str, optional) – Base prompt template for classification.
model (str, optional) – Name of the language model to use. Defaults to ‘gpt-4o’.
temperature (float, optional) – Temperature parameter for model creativity. Defaults to 0.
top_p (float, optional) – Top-p parameter for nucleus sampling. Defaults to 0.
max_length (int, optional) – Maximum length of the prompt in characters. Defaults to 250000.
clean_output (bool, optional) – Whether to clean and validate the output. Defaults to False.
explanation (bool, optional) – Whether to expect and extract explanations from the model response. Only used when clean_output=True. The model’s response should contain an explanation followed by the separator, then the classification. Defaults to False.
explanation_separator (str, optional) – The delimiter string that separates the explanation from the classification in the model’s response. Only used when both clean_output=True and explanation=True. Defaults to “EXPLANATION_COMPLETE”.
output_explanation (bool, optional) – Whether to return the explanation text alongside classification. When True, returns a tuple (classification_list, explanation_text). Only used when both explanation=True and clean_output=True. Defaults to False.
allowed_categories (List[str], optional) – List of allowed categories for filtering output.
custom_client (Any, optional) – Custom OpenAI client instance.
spell_checker (SpellChecker, optional) – Spell checker instance for output correction.
- Returns:
If clean_output=False: Raw model output string
If clean_output=True and output_explanation=False: List of categories
If clean_output=True, explanation=True, and output_explanation=True: Tuple of (category_list, explanation_text)
- Return type:
Union[str, List[str], Tuple[List[str], str]]
- Raises:
ValueError – If clean_output is True but allowed_categories is None, or if output_explanation=True but explanation=False.
IndexError – If explanation=True but the explanation_separator is not found in the response.
Example
>>> classify("aspirin", "pain relief medication", api_key="your_key")
'MEDICAL'
>>> classify("aspirin", "pain relief medication", api_key="your_key",
...          clean_output=True, allowed_categories=["MEDICAL", "FOOD"])
['MEDICAL']
>>> # Using the explanation feature
>>> custom_prompt = "Explain why, then say EXPLANATION_COMPLETE, then classify: ..."
>>> classify("aspirin", "pain relief", api_key="your_key", baseprompt=custom_prompt,
...          clean_output=True, explanation=True,
...          allowed_categories=["MEDICAL", "FOOD"])
['MEDICAL']
>>> # Getting both classification and explanation
>>> categories, explanation = classify("aspirin", "pain relief", api_key="your_key",
...                                    baseprompt=custom_prompt, clean_output=True,
...                                    explanation=True, output_explanation=True,
...                                    allowed_categories=["MEDICAL", "FOOD"])
>>> print(categories)   # ["MEDICAL"]
>>> print(explanation)  # "Aspirin is widely used as a pain reliever..."
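The explanation options describe a response laid out as explanation, then separator, then classification, with an IndexError when the separator is missing. The sketch below reproduces that parsing and a simple category filter. It is an approximation under those stated rules; `split_explanation` and `filter_categories` are hypothetical helpers, the case-insensitive match is an assumption, and the spell-checking step is omitted.

```python
from typing import List, Tuple


def split_explanation(response: str,
                      separator: str = "EXPLANATION_COMPLETE") -> Tuple[str, str]:
    """Return (explanation, classification_text).

    Raises IndexError when the separator does not occur in the response.
    """
    parts = response.split(separator, 1)
    explanation = parts[0].strip()
    classification_text = parts[1].strip()  # IndexError if separator is absent
    return explanation, classification_text


def filter_categories(classification_text: str,
                      allowed_categories: List[str]) -> List[str]:
    """Keep comma-separated entries that match an allowed category."""
    allowed = {category.upper() for category in allowed_categories}
    candidates = [item.strip().upper() for item in classification_text.split(",")]
    return [item for item in candidates if item in allowed]
```

Catching IndexError around the call is a reasonable way to detect a model response that ignored the custom prompt's formatting instructions.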
Information Retrieval
Information retrieval module for chemsource.
This module handles the retrieval of information from various sources such as PubMed and Wikipedia for chemical research purposes.
- chemsource.retriever.SEARCH_PARAMS = {'api_key': None, 'db': 'pubmed', 'retmax': '3', 'sort': 'relevance', 'term': '', 'usehistory': 'n'}
Default parameters for PubMed search API
- chemsource.retriever.XML_RETRIEVAL_PARAMS = {'WebEnv': '', 'api_key': None, 'db': 'pubmed', 'query_key': '1', 'retmax': '3', 'rettype': 'abstract'}
Default parameters for PubMed abstract retrieval API
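These dictionaries map directly onto query-string parameters of the NCBI E-utilities. The sketch below builds a search URL from a copy of the documented defaults without sending any request. That chemsource targets this exact endpoint is an assumption, and `ESEARCH_URL` and `build_search_url` are hypothetical names.

```python
from urllib.parse import urlencode

# Copy of the documented defaults; the module-level dict is left untouched.
SEARCH_PARAMS = {"api_key": None, "db": "pubmed", "retmax": "3",
                 "sort": "relevance", "term": "", "usehistory": "n"}

# NCBI E-utilities search endpoint (assumed target of these parameters).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, api_key=None):
    """Return a PubMed search URL for the given term."""
    params = dict(SEARCH_PARAMS, term=term, api_key=api_key)
    # Drop unset values so a missing key is not sent as the string "None".
    params = {key: value for key, value in params.items() if value is not None}
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

Building the parameters on a copy keeps the shared defaults unchanged between calls.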
- chemsource.retriever.retrieve(name, priority='WIKIPEDIA', single_source=False, ncbikey=None)[source]
Retrieve information about a chemical compound from various sources.
This function retrieves information about a chemical compound from multiple sources including Wikipedia and PubMed, with configurable priority and source selection.
- Parameters:
name (str) – The name of the chemical compound to look up.
priority (str, optional) – Priority source for information retrieval. Options: “WIKIPEDIA”, “PUBMED”. Defaults to “WIKIPEDIA”.
single_source (bool, optional) – Whether to use only the priority source. Defaults to False.
ncbikey (str, optional) – API key for NCBI/PubMed access.
- Returns:
A tuple containing (source, content), where source indicates the data source used and content contains the retrieved information.
- Return type:
Tuple[Optional[str], Optional[str]]
- Raises:
PubMedSearchXMLParseError – If PubMed search XML cannot be parsed.
PubMedSearchResultsError – If search results cannot be retrieved from PubMed.
PubMedAbstractXMLParseError – If PubMed abstract XML cannot be parsed.
PubMedAbstractRetrievalError – If abstracts cannot be retrieved from PubMed.
PubMedAbstractConcatenationError – If abstract texts cannot be concatenated.
WikipediaRetrievalError – If Wikipedia content cannot be retrieved.
Example
>>> source, content = retrieve("aspirin")
>>> print(f"Retrieved from {source}: {content[:100]}...")
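The priority and single_source parameters imply a try-the-preferred-source-first strategy. The sketch below illustrates one plausible reading with stand-in fetchers. The fallback order and the (None, None) result when nothing is found are assumptions, and `retrieve_with_fallback` is a hypothetical function, not the library's implementation.

```python
from typing import Callable, Dict, Optional, Tuple


def retrieve_with_fallback(
    name: str,
    fetchers: Dict[str, Callable[[str], Optional[str]]],
    priority: str = "WIKIPEDIA",
    single_source: bool = False,
) -> Tuple[Optional[str], Optional[str]]:
    """Try the priority source first, then the others unless single_source is set."""
    order = [priority]
    if not single_source:
        order += [source for source in fetchers if source != priority]
    for source in order:
        content = fetchers[source](name)
        if content:
            return source, content
    return None, None


# Stand-in fetchers: Wikipedia finds nothing, PubMed returns text.
fetchers = {
    "WIKIPEDIA": lambda name: None,
    "PUBMED": lambda name: f"abstracts about {name}",
}
```

Passing the fetchers as a dictionary keeps the sketch testable without network access.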
- chemsource.retriever.pubmed_retrieve(drug, ncbikey=None)[source]
Retrieve abstracts from PubMed for a given compound.
This function searches PubMed for articles related to a chemical compound and retrieves the abstracts of the most relevant articles.
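- Parameters:
drug (str) – The name of the compound to search for on PubMed.
ncbikey (str, optional) – API key for NCBI/PubMed access.
- Returns:
Concatenated abstract texts from PubMed articles, or ‘NO_RESULTS’ if no articles found.
- Return type:
str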
- Raises:
PubMedSearchXMLParseError – If the search XML response cannot be parsed.
PubMedSearchResultsError – If search results cannot be retrieved.
PubMedAbstractXMLParseError – If abstract XML cannot be parsed.
PubMedAbstractRetrievalError – If abstracts cannot be retrieved.
PubMedAbstractConcatenationError – If abstract texts cannot be concatenated.
Example
>>> abstracts = pubmed_retrieve("aspirin", ncbikey="your_ncbi_key")
>>> print(abstracts[:100])
- chemsource.retriever.wikipedia_retrieve(drug)[source]
Retrieve content from Wikipedia for a given compound.
This function fetches the Wikipedia page content for a chemical compound and processes it by removing newlines, tabs, and extra spaces.
- Parameters:
drug (str) – The name of the compound to look up on Wikipedia.
- Returns:
The processed Wikipedia content with cleaned formatting.
- Return type:
str
- Raises:
WikipediaRetrievalError – If Wikipedia content cannot be retrieved.
Example
>>> content = wikipedia_retrieve("aspirin")
>>> print(content[:100])
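The cleanup described above (removing newlines, tabs, and extra spaces) can be expressed in one line of standard Python. This is a sketch of the idea, not necessarily the exact routine chemsource uses; `clean_whitespace` is a hypothetical name.

```python
def clean_whitespace(text: str) -> str:
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining with single spaces normalizes the whole string.
    return " ".join(text.split())
```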
Constants
- chemsource.config.BASE_PROMPT
The default prompt template used for chemical compound classification. This prompt instructs the AI model to classify compounds into the categories MEDICAL, ENDOGENOUS, FOOD, PERSONAL CARE, and INDUSTRIAL, or to answer INFO when more information is needed.