API Reference

This section contains the complete API reference for chemsource.

Main Classes

chemsource: A tool for classifying novel drugs and health-related chemicals by origin.

This package provides functionality to classify chemical compounds using AI models and retrieve chemical information from various sources including PubMed and Wikipedia.

Classes:

ChemSource: Main class for chemical compound classification and information retrieval.

Version:

1.1.17

class chemsource.ChemSource(model_api_key=None, model='gpt-4o', ncbi_key=None, prompt=BASE_PROMPT, temperature=0, top_p=1e-07, max_tokens=250000, clean_output=False, explanation=False, explanation_separator='EXPLANATION_COMPLETE', output_explanation=False, allowed_categories=None, custom_client=None)

Bases: Config

Main class for chemical compound classification and information retrieval.

ChemSource combines configuration management, information retrieval from multiple sources (PubMed, Wikipedia), and AI-powered classification of chemical compounds. It extends the Config class to provide a complete solution for chemical information processing.

Parameters:
  • model_api_key (str, optional) – API key for the language model service.

  • model (str, optional) – Name of the language model to use. Defaults to “gpt-4o”.

  • ncbi_key (str, optional) – API key for NCBI/PubMed access.

  • prompt (str, optional) – Custom prompt template. Defaults to BASE_PROMPT.

  • temperature (float, optional) – Temperature parameter for model creativity. Defaults to 0.

  • top_p (float, optional) – Top-p parameter for nucleus sampling. Defaults to 0.0000001.

  • max_tokens (int, optional) – Maximum number of tokens for model context. Defaults to 250000.

  • clean_output (bool, optional) – Whether to clean and validate output. Defaults to False.

  • explanation (bool, optional) – Whether to expect explanations in model responses. Only effective when clean_output=True. Requires a custom prompt that instructs the model to include the explanation_separator. Defaults to False.

  • explanation_separator (str, optional) – Delimiter separating explanation from classification. Only used when both clean_output and explanation are True. Defaults to “EXPLANATION_COMPLETE”.

  • allowed_categories (List[str], optional) – List of allowed categories for filtering. Defaults to None.

  • custom_client (Any, optional) – Custom OpenAI client instance. Defaults to None.

Raises:
  • ValueError – If clean_output is True but allowed_categories is None or empty.

  • TypeError – If allowed_categories is not a list when clean_output is True.
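
The two error conditions above can be pictured with a small stand-alone check. This is only a sketch of the documented rule, not chemsource's own code: the helper name `check_cleaning_options` and the order of the checks are assumptions made for illustration.

```python
from typing import List, Optional


def check_cleaning_options(clean_output: bool,
                           allowed_categories: Optional[List[str]]) -> None:
    """Hypothetical helper mirroring the documented Raises rules."""
    if not clean_output:
        return  # nothing to validate when cleaning is disabled
    if allowed_categories is None:
        raise ValueError("allowed_categories is required when clean_output=True")
    if not isinstance(allowed_categories, list):
        raise TypeError("allowed_categories must be a list")
    if not allowed_categories:
        raise ValueError("allowed_categories must not be empty")


check_cleaning_options(False, None)                # passes: cleaning is off
check_cleaning_options(True, ["MEDICAL", "FOOD"])  # passes: non-empty list
```

In practice this means that enabling clean_output should always be paired with an explicit allowed_categories list.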

spell_checker

Spell checker instance for output correction (when clean_output is enabled).

Type:

SpellChecker

clean_output

Whether output cleaning is enabled.

Type:

bool

explanation

Whether to extract explanations from responses.

Type:

bool

explanation_separator

The delimiter for separating explanations.

Type:

str

allowed_categories

The allowed categories list.

Type:

List[str]

custom_client

The custom client instance.

Type:

Any

Example

>>> chem = ChemSource(model_api_key="your_key")
>>> info, classification = chem.chemsource("aspirin")
>>> print(classification)
MEDICAL
>>> # Using explanation feature with custom prompt
>>> custom_prompt = "Explain your reasoning, then write EXPLANATION_COMPLETE, then provide categories..."
>>> chem = ChemSource(model_api_key="your_key", prompt=custom_prompt,
...                   clean_output=True, explanation=True,
...                   allowed_categories=["MEDICAL", "FOOD"])
>>> info, classification = chem.chemsource("aspirin")

Parameters:
  • output_explanation (bool, optional) – Whether to return the explanation text alongside the classification. Defaults to False.

chemsource(name, priority='WIKIPEDIA', single_source=False)

Retrieve information and classify a chemical compound.

This is the main method that combines information retrieval and classification. It retrieves information about the compound from specified sources and then classifies it using the configured AI model.

Parameters:
  • name (str) – The name of the chemical compound to process.

  • priority (str, optional) – Priority source for information retrieval. Options: “WIKIPEDIA”, “PUBMED”. Defaults to “WIKIPEDIA”.

  • single_source (bool, optional) – Whether to use only the priority source. Defaults to False.

Return type:

Union[Tuple[Tuple[Optional[str], Optional[str]], Optional[str]], Tuple[Tuple[Optional[str], Optional[str]], Optional[str], Optional[str]]]

Returns:

A tuple containing:
  • Information tuple: (source, content)

  • Classification result (list or tuple depending on output_explanation)

  • Explanation text (only if output_explanation=True)

Raises:

ValueError – If model_api_key is not provided.
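
To illustrate how an explanation and a classification can share one model response, the sketch below splits a reply on the separator, with the explanation first and the categories after it, as in the custom prompt shown in the class example. The function `split_explanation` is hypothetical and is not part of chemsource; the library's actual parsing may differ.

```python
from typing import Optional, Tuple


def split_explanation(response: str,
                      separator: str = "EXPLANATION_COMPLETE"
                      ) -> Tuple[Optional[str], str]:
    """Hypothetical helper: return (explanation, classification_text)."""
    if separator not in response:
        # No separator found: treat the whole reply as the classification.
        return None, response.strip()
    explanation, _, remainder = response.partition(separator)
    return explanation.strip(), remainder.strip()


reply = "Aspirin is an approved analgesic. EXPLANATION_COMPLETE MEDICAL"
explanation, categories = split_explanation(reply)
print(explanation)  # Aspirin is an approved analgesic.
print(categories)   # MEDICAL
```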

Example

>>> chem = ChemSource(model_api_key="your_key")
>>> info, classification = chem.chemsource("aspirin")
>>> print(info[0])  # Source
>>> print(info[1])  # Content
>>> print(classification)  # Classification result
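>>> # With explanation output (requires clean_output=True and a custom prompt)
>>> custom_prompt = "Explain your reasoning, then write EXPLANATION_COMPLETE, then provide categories..."
>>> chem = ChemSource(model_api_key="your_key", prompt=custom_prompt,
...                   clean_output=True, explanation=True,
...                   output_explanation=True,
...                   allowed_categories=["MEDICAL", "FOOD"])
>>> info, (classification, explanation) = chem.chemsource("aspirin")
>>> print(classification)  # List of categories
>>> print(explanation)  # Explanation text

classify(name, information)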

Classify a chemical compound based on provided information.

This method classifies a chemical compound using the provided information and the configured AI model.

Parameters:
  • name (str) – The name of the chemical compound to classify.

  • information (str) – Information about the compound to use for classification.

Returns:

Classification result. Returns None if information is empty; otherwise returns a string (if clean_output=False) or a list of strings (if clean_output=True).

Return type:

Optional[Union[str, List[str]]]

Raises:

ValueError – If neither model_api_key nor custom_client is provided.
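
The difference between the raw string and the cleaned list can be sketched as follows. This is an illustration under stated assumptions: `parse_categories` is a hypothetical helper, it only splits on commas and filters against the allowed list, and it omits the spell correction that the spell_checker attribute suggests the library performs.

```python
from typing import List


def parse_categories(raw: str, allowed: List[str]) -> List[str]:
    """Hypothetical cleaner: split on commas, normalise, keep allowed labels."""
    allowed_upper = {label.upper() for label in allowed}
    cleaned: List[str] = []
    for part in raw.split(","):
        label = " ".join(part.split()).upper()  # collapse whitespace, upper-case
        if label in allowed_upper and label not in cleaned:
            cleaned.append(label)
    return cleaned


print(parse_categories("medical,  Food , UNKNOWN",
                       ["MEDICAL", "FOOD", "PERSONAL CARE"]))
# ['MEDICAL', 'FOOD']
```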

Example

>>> chem = ChemSource(model_api_key="your_key")
>>> result = chem.classify("aspirin", "pain relief medication")
>>> print(result)
MEDICAL

retrieve(name, priority='WIKIPEDIA', single_source=False)

Retrieve information about a chemical compound from various sources.

This method retrieves information about a chemical compound from sources like Wikipedia and PubMed without performing classification.

Parameters:
  • name (str) – The name of the chemical compound to look up.

  • priority (str, optional) – Priority source for information retrieval. Options: “WIKIPEDIA”, “PUBMED”. Defaults to “WIKIPEDIA”.

  • single_source (bool, optional) – Whether to use only the priority source. Defaults to False.

Returns:

A tuple containing (source, content).

Return type:

Tuple[str, str]
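
One way to read the priority and single_source parameters is as an ordered fallback: try the priority source first, then the remaining source unless single_source is set. The sketch below models that reading with stand-in fetchers. The function `retrieve_with_priority`, the `fetchers` mapping, and the (None, None) result on failure are all assumptions for illustration, not the library's implementation.

```python
from typing import Callable, Dict, Optional, Tuple

Fetcher = Callable[[str], Optional[str]]


def retrieve_with_priority(name: str,
                           fetchers: Dict[str, Fetcher],
                           priority: str = "WIKIPEDIA",
                           single_source: bool = False
                           ) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical sketch: try the priority source, then optionally the rest."""
    order = [priority]
    if not single_source:
        order += [source for source in fetchers if source != priority]
    for source in order:
        content = fetchers[source](name)
        if content:  # first non-empty result wins
            return source, content
    return None, None


# Stand-in fetchers; real lookups would query Wikipedia or PubMed.
fake_fetchers = {
    "WIKIPEDIA": lambda name: None,
    "PUBMED": lambda name: f"Abstracts mentioning {name}",
}
print(retrieve_with_priority("aspirin", fake_fetchers))
# ('PUBMED', 'Abstracts mentioning aspirin')
print(retrieve_with_priority("aspirin", fake_fetchers, single_source=True))
# (None, None)
```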

Example

>>> chem = ChemSource()
>>> source, content = chem.retrieve("aspirin")
>>> print(f"Retrieved from {source}: {content[:100]}...")

Configuration

Configuration module for chemsource.

This module contains configuration classes and constants used throughout the chemsource package.

chemsource.config.BASE_PROMPT = 'You are a helpful scientist that will classify the provided compound COMPOUND_NAME using only the information provided as any combination of the following: MEDICAL, ENDOGENOUS, FOOD, PERSONAL CARE, INDUSTRIAL. Note that MEDICAL refers to compounds actively used as approved medications in humans or in late-stage clinical trials in humans. Note that ENDOGENOUS refers to compounds that are produced by the human body specifically. ENDOGENOUS excludes essential nutrients that cannot be synthesized by the human body. Note that FOOD refers to compounds present in natural food items or food additives. Note that PERSONAL CARE refers to non-medicated compounds typically used for activities such as skincare, beauty, and fitness. Note that INDUSTRIAL should be used only for synthetic compounds not used as a contributing ingredient in the medical, personal care, or food industries. Specify INFO instead if more information is needed. DO NOT MAKE ANY ASSUMPTIONS, USE ONLY THE INFORMATION PROVIDED AFTER THE COMPOUND NAME BY THE USER. A classification of INFO will also be rewarded when correctly applied and is strongly encouraged if information is of poor quality, if there is not enough information, or if you are not completely confident in your answer.  Provide the output as a plain text separated by commas, and provide only the categories listed (either list a combination of INDUSTRIAL, ENDOGENOUS, PERSONAL CARE, MEDICAL, FOOD or list INFO), with no justification. Provided Information:\n'

Default prompt template for chemical compound classification
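
The template contains the literal placeholder COMPOUND_NAME and ends with "Provided Information:" followed by a newline, which suggests that the compound name is substituted in and the retrieved text is appended. The sketch below shows that reading on a shortened template. `build_prompt` is a hypothetical helper, and whether chemsource assembles its request exactly this way is an assumption.

```python
def build_prompt(template: str, name: str, information: str) -> str:
    """Hypothetical helper: fill the name placeholder, then append the information."""
    return template.replace("COMPOUND_NAME", name) + information


# Shortened stand-in for BASE_PROMPT, used only to keep the example readable.
template = "Classify the provided compound COMPOUND_NAME. Provided Information:\n"
print(build_prompt(template, "aspirin", "Aspirin is a pain relief medication."))
```

A custom prompt passed to ChemSource would presumably need to keep the same placeholder if it is meant to name the compound.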

class chemsource.config.Config(model_api_key=None, model='gpt-4o', temperature=0, top_p=1e-07, ncbi_key=None, prompt=BASE_PROMPT, max_tokens=250000, clean_output=False, explanation=False, explanation_separator='EXPLANATION_COMPLETE', output_explanation=False, allowed_categories=None, custom_client=None)

Bases: object

Configuration class for chemsource parameters.

This class manages all configuration parameters for the chemsource system, including API keys, model settings, and output formatting options.

Parameters:
  • model_api_key (str, optional) – API key for the language model service.

  • model (str, optional) – Name of the language model to use. Defaults to “gpt-4o”.

  • temperature (float, optional) – Temperature parameter for model creativity. Defaults to 0.

  • top_p (float, optional) – Top-p parameter for nucleus sampling. Defaults to 0.0000001.

  • ncbi_key (str, optional) – API key for NCBI/PubMed access.

  • prompt (str, optional) – Custom prompt template. Defaults to BASE_PROMPT.

  • max_tokens (int, optional) – Maximum number of tokens for model context. Defaults to 250000.

  • clean_output (bool, optional) – Whether to clean and validate output. Defaults to False.

  • explanation (bool, optional) – Whether to expect explanations in model responses. Only effective when clean_output=True. Defaults to False.

  • explanation_separator (str, optional) – Delimiter separating explanation from classification. Only used when both clean_output and explanation are True. Defaults to “EXPLANATION_COMPLETE”.

  • allowed_categories (List[str], optional) – List of allowed categories for filtering. Defaults to None.

  • custom_client (Any, optional) – Custom OpenAI client instance. Defaults to None.

model_api_key

The model API key.

Type:

str

model

The language model name.

Type:

str

temperature

The temperature parameter.

Type:

float

top_p

The top-p parameter.

Type:

float

ncbi_key

The NCBI API key.

Type:

str

prompt

The prompt template.

Type:

str

max_tokens

The maximum token limit.

Type:

int

clean_output

Whether output cleaning is enabled.

Type:

bool

explanation

Whether to extract explanations from responses.

Type:

bool

explanation_separator

The delimiter for separating explanations.

Type:

str

allowed_categories

The allowed categories list.

Type:

List[str]

custom_client

The custom client instance.

Type:

Any

Parameters:
  • output_explanation (bool, optional) – Whether to return the explanation text alongside the classification. Defaults to False.

set_ncbi_key(ncbi_key)

Set the NCBI API key.

Parameters:

ncbi_key (str, optional) – The NCBI API key to set.

Return type:

None

set_model_api_key(model_api_key)

Set the model API key.

Parameters:

model_api_key (str, optional) – The model API key to set.

Return type:

None

set_model(model)

Set the language model name.

Parameters:

model (str) – The name of the language model to use.

Return type:

None

set_prompt(prompt)

Set the prompt template.

Parameters:

prompt (str) – The prompt template to use for classification.

Return type:

None

set_token_limit(max_tokens)

Set the maximum token limit.

Parameters:

max_tokens (int) – The maximum number of tokens for model context.

Return type:

None

set_temperature(temperature)

Set the temperature parameter for model creativity.

Parameters:

temperature (float) – The temperature value (0.0 to 1.0).

Return type:

None

set_top_p(top_p)

Set the top-p parameter for nucleus sampling.

Parameters:

top_p (float) – The top-p value (0.0 to 1.0).

Return type:

None

set_clean_output(clean_output)

Set whether to enable output cleaning and validation.

Parameters:

clean_output (bool) – Whether to clean and validate output.

Return type:

None

set_explanation(explanation)

Set whether to expect explanations in model responses.

Parameters:

explanation (bool) – Whether to expect explanations in model responses.

Return type:

None

set_explanation_separator(explanation_separator)

Set the explanation separator string.

Parameters:

explanation_separator (str) – The string that separates explanations in the output.

Return type:

None

set_explanation_output(output_explanation)

Set whether to output explanations along with classifications.

Parameters:

output_explanation (bool) – Whether to output explanations.

Return type:

None

set_allowed_categories(allowed_categories)

Set the list of allowed categories for filtering.

Parameters:

allowed_categories (List[str], optional) – List of allowed categories.

Return type:

None

set_custom_client(custom_client)

Set a custom OpenAI client instance.

Parameters:

custom_client (Any, optional) – Custom OpenAI client instance.

Return type:

None

configure(ncbi_key=None, model_api_key=None, model='gpt-4o', temperature=0, top_p=0, prompt=BASE_PROMPT, max_tokens=250000, clean_output=False, explanation=False, explanation_separator='EXPLANATION_COMPLETE', output_explanation=False, allowed_categories=None, custom_client=None)

Configure all parameters at once.

Parameters:
  • ncbi_key (str, optional) – API key for NCBI/PubMed access.

  • model_api_key (str, optional) – API key for the language model service.

  • model (str, optional) – Name of the language model to use. Defaults to “gpt-4o”.

  • temperature (float, optional) – Temperature parameter for model creativity. Defaults to 0.

  • top_p (float, optional) – Top-p parameter for nucleus sampling. Defaults to 0.

  • prompt (str, optional) – Custom prompt template. Defaults to BASE_PROMPT.

  • max_tokens (int, optional) – Maximum number of tokens for model context. Defaults to 250000.

  • clean_output (bool, optional) – Whether to clean and validate output. Defaults to False.

  • explanation (bool, optional) – Whether to expect explanations in model responses. Defaults to False.

  • explanation_separator (str, optional) – Delimiter separating explanation from classification. Defaults to “EXPLANATION_COMPLETE”.

  • output_explanation (bool, optional) – Whether to return the explanation text alongside classification. Defaults to False.

  • allowed_categories (List[str], optional) – List of allowed categories for filtering. Defaults to None.

  • custom_client (Any, optional) – Custom OpenAI client instance. Defaults to None.

Return type:

None

configuration()[source]

Return the current configuration as a dictionary, with sensitive values masked.

Returns:

A dictionary containing all configuration parameters with API keys masked.

Return type:

dict
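
The reference does not say how the API keys are masked. As a minimal sketch, assuming a scheme that keeps only the last few characters visible, the idea could look like the following. The helpers `mask_key` and `masked_configuration` and the `visible` parameter are hypothetical and are not part of chemsource.

```python
def mask_key(key, visible=4):
    """Return a display-safe version of an API key (hypothetical helper)."""
    if key is None:
        return None
    if len(key) <= visible:
        return "*" * len(key)
    # Keep only the last `visible` characters readable.
    return "*" * (len(key) - visible) + key[-visible:]


def masked_configuration(settings, secret_fields=("model_api_key", "ncbi_key")):
    """Copy a settings dict and mask the fields that hold credentials."""
    masked = dict(settings)  # shallow copy, the original is left untouched
    for field in secret_fields:
        if field in masked:
            masked[field] = mask_key(masked[field])
    return masked


# Made-up values for illustration only.
settings = {
    "model": "gpt-4o",
    "temperature": 0,
    "model_api_key": "sk-example-1234",
    "ncbi_key": None,
}
safe = masked_configuration(settings)
print(safe["model_api_key"])  # ***********1234
```

Masking a copy rather than the stored values means the result can be logged or printed without affecting later API calls.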

Classification

Chemical classification module for chemsource.

This module provides AI-powered classification functionality for chemical entities and compounds.

chemsource.classifier.classify(name, input_text=None, api_key=None, baseprompt=None, model='gpt-4o', temperature=0, top_p=0, max_length=250000, clean_output=False, explanation=False, explanation_separator='EXPLANATION_COMPLETE', output_explanation=False, allowed_categories=None, custom_client=None, spell_checker=None)[source]

Classify a chemical compound using an AI language model.

This function takes a chemical compound name and additional information, then uses an AI model to classify it into predefined categories.

Parameters:
  • name (str) – The name of the chemical compound to classify.

  • input_text (str, optional) – Additional information about the compound.

  • api_key (str, optional) – API key for the language model service.

  • baseprompt (str, optional) – Base prompt template for classification.

  • model (str, optional) – Name of the language model to use. Defaults to ‘gpt-4o’.

  • temperature (float, optional) – Sampling temperature; lower values make the model output more deterministic. Defaults to 0.

  • top_p (float, optional) – Top-p parameter for nucleus sampling. Defaults to 0.

  • max_length (int, optional) – Maximum length of the prompt in characters. Defaults to 250000.

  • clean_output (bool, optional) – Whether to clean and validate the output. Defaults to False.

  • explanation (bool, optional) – Whether to expect and extract explanations from the model response. Only used when clean_output=True. The model’s response should contain an explanation followed by the separator, then the classification. Defaults to False.

  • explanation_separator (str, optional) – The delimiter string that separates the explanation from the classification in the model’s response. Only used when both clean_output=True and explanation=True. Defaults to “EXPLANATION_COMPLETE”.

  • output_explanation (bool, optional) – Whether to return the explanation text alongside classification. When True, returns a tuple (classification_list, explanation_text). Only used when both explanation=True and clean_output=True. Defaults to False.

  • allowed_categories (List[str], optional) – List of allowed categories for filtering output.

  • custom_client (Any, optional) – Custom OpenAI client instance.

  • spell_checker (SpellChecker, optional) – Spell checker instance for output correction.

Returns:

  • If clean_output=False: Raw model output string

  • If clean_output=True and output_explanation=False: List of categories

  • If clean_output=True, explanation=True, and output_explanation=True: Tuple of (category_list, explanation_text)

Return type:

Union[str, List[str], Tuple[List[str], str]]

Raises:
  • ValueError – If clean_output is True but allowed_categories is None, or if output_explanation=True but explanation=False.

  • IndexError – If explanation=True but the explanation_separator is not found in the response.

Example

>>> classify("aspirin", "pain relief medication", api_key="your_key")
'MEDICAL'
>>> classify("aspirin", "pain relief medication", api_key="your_key",
...          clean_output=True, allowed_categories=["MEDICAL", "FOOD"])
['MEDICAL']
>>> # Using explanation feature
>>> custom_prompt = "Explain why, then say EXPLANATION_COMPLETE, then classify: ..."
>>> classify("aspirin", "pain relief", api_key="your_key", baseprompt=custom_prompt,
...          clean_output=True, explanation=True,
...          allowed_categories=["MEDICAL", "FOOD"])
['MEDICAL']
>>> # Getting both classification and explanation
>>> categories, explanation = classify("aspirin", "pain relief", api_key="your_key",
...                                     baseprompt=custom_prompt, clean_output=True,
...                                     explanation=True, output_explanation=True,
...                                     allowed_categories=["MEDICAL", "FOOD"])
>>> print(categories)  # ["MEDICAL"]
>>> print(explanation)  # "Aspirin is widely used as a pain reliever..."
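
The interaction between clean_output, explanation, and output_explanation can be easier to follow as code. The sketch below is not chemsource's implementation; it is a stand-alone parser, written under the assumption that the response is split once on the separator and the remainder is treated as a comma-separated category list. The function name `parse_response` is hypothetical. It reproduces the documented return shapes and the documented ValueError and IndexError conditions.

```python
def parse_response(response, allowed_categories,
                   explanation=False,
                   explanation_separator="EXPLANATION_COMPLETE",
                   output_explanation=False):
    """Illustrative parser for a cleaned response (not the library's code)."""
    if allowed_categories is None:
        raise ValueError("allowed_categories is required when cleaning output")
    if output_explanation and not explanation:
        raise ValueError("output_explanation=True requires explanation=True")

    explanation_text = ""
    if explanation:
        parts = response.split(explanation_separator)
        explanation_text = parts[0].strip()
        # If the separator is missing, `parts` has one element and this
        # index raises IndexError.
        response = parts[1]

    allowed = {category.upper() for category in allowed_categories}
    categories = [item.strip().upper() for item in response.split(",")]
    categories = [item for item in categories if item in allowed]

    if output_explanation:
        return categories, explanation_text
    return categories


raw = "Aspirin is an approved analgesic. EXPLANATION_COMPLETE MEDICAL, UNKNOWN"
result = parse_response(raw, ["MEDICAL", "FOOD"],
                        explanation=True, output_explanation=True)
print(result)  # (['MEDICAL'], 'Aspirin is an approved analgesic.')
```

In this sketch, categories outside allowed_categories (here "UNKNOWN") are dropped silently; whether the library drops, corrects, or reports them is not stated in this reference.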

Information Retrieval

Information retrieval module for chemsource.

This module handles the retrieval of information from various sources such as PubMed and Wikipedia for chemical research purposes.

chemsource.retriever.SEARCH_PARAMS = {'api_key': None, 'db': 'pubmed', 'retmax': '3', 'sort': 'relevance', 'term': '', 'usehistory': 'n'}

Default parameters for PubMed search API

chemsource.retriever.XML_RETRIEVAL_PARAMS = {'WebEnv': '', 'api_key': None, 'db': 'pubmed', 'query_key': '1', 'retmax': '3', 'rettype': 'abstract'}

Default parameters for PubMed abstract retrieval API
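
To show how such a parameter dictionary might be turned into a request, here is a minimal sketch that builds a search URL with the standard library. The endpoint shown is the public NCBI E-utilities ESearch URL, which this reference does not name, so treat it as an assumption. The helper `build_search_url` is hypothetical, and how chemsource itself issues the request is not documented here.

```python
from urllib.parse import urlencode

# Values copied from the documented SEARCH_PARAMS defaults.
SEARCH_PARAMS = {"api_key": None, "db": "pubmed", "retmax": "3",
                 "sort": "relevance", "term": "", "usehistory": "n"}

# Assumed endpoint: the public NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, ncbi_key=None, defaults=SEARCH_PARAMS):
    """Build a search URL from the defaults (hypothetical helper)."""
    params = dict(defaults)  # copy so the module-level defaults stay unchanged
    params["term"] = term
    params["api_key"] = ncbi_key
    # Omit unset values so that no literal "None" ends up in the query string.
    params = {key: value for key, value in params.items() if value is not None}
    return ESEARCH_URL + "?" + urlencode(params)


url = build_search_url("acetylsalicylic acid")
print(url)
```

Copying the defaults before filling in term and api_key keeps one compound's search from leaking into the next call.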

chemsource.retriever.retrieve(name, priority='WIKIPEDIA', single_source=False, ncbikey=None)[source]

Retrieve information about a chemical compound from various sources.

This function retrieves information about a chemical compound from multiple sources including Wikipedia and PubMed, with configurable priority and source selection.

Parameters:
  • name (str) – The name of the chemical compound to look up.

  • priority (str, optional) – Priority source for information retrieval. Options: “WIKIPEDIA”, “PUBMED”. Defaults to “WIKIPEDIA”.

  • single_source (bool, optional) – Whether to use only the priority source. Defaults to False.

  • ncbikey (str, optional) – API key for NCBI/PubMed access.

Returns:

A tuple containing (source, content), where source indicates the data source used and content contains the retrieved information.

Return type:

Tuple[str, str]

Raises:
  • PubMedSearchXMLParseError – If PubMed search XML cannot be parsed.

  • PubMedSearchResultsError – If search results cannot be retrieved from PubMed.

  • PubMedAbstractXMLParseError – If PubMed abstract XML cannot be parsed.

  • PubMedAbstractRetrievalError – If abstracts cannot be retrieved from PubMed.

  • PubMedAbstractConcatenationError – If abstract texts cannot be concatenated.

  • WikipediaRetrievalError – If Wikipedia content cannot be retrieved.
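
The priority and single_source parameters suggest a "try the preferred source first" pattern. The sketch below illustrates that pattern with stub retrievers. It assumes that the other source is consulted only when the priority source yields nothing and single_source is False; the reference does not spell out the fallback rule. The function `retrieve_with_priority`, the `sources` mapping, and the ("NONE", "") placeholder result are all hypothetical.

```python
def retrieve_with_priority(name, sources, priority="WIKIPEDIA", single_source=False):
    """Try the priority source first, then the others unless single_source is set.

    `sources` maps a source label to a callable that takes the compound name.
    Illustrative only; the fallback rule is an assumption.
    """
    order = [priority]
    if not single_source:
        order += [label for label in sources if label != priority]
    for label in order:
        content = sources[label](name)
        if content and content != "NO_RESULTS":
            return label, content
    # Placeholder for "nothing found"; the library's actual value is not shown here.
    return "NONE", ""


# Stub retrievers standing in for the real network calls.
fake_sources = {
    "WIKIPEDIA": lambda name: "",
    "PUBMED": lambda name: f"Abstracts mentioning {name}.",
}

print(retrieve_with_priority("aspirin", fake_sources))
# ('PUBMED', 'Abstracts mentioning aspirin.')
print(retrieve_with_priority("aspirin", fake_sources, single_source=True))
# ('NONE', '')
```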

Example

>>> source, content = retrieve("aspirin")
>>> print(f"Retrieved from {source}: {content[:100]}...")

chemsource.retriever.pubmed_retrieve(drug, ncbikey=None)[source]

Retrieve abstracts from PubMed for a given compound.

This function searches PubMed for articles related to a chemical compound and retrieves the abstracts of the most relevant articles.

Parameters:
  • drug (str) – The name of the compound to search for in PubMed.

  • ncbikey (str, optional) – API key for NCBI/PubMed access for higher rate limits.

Returns:

Concatenated abstract texts from PubMed articles, or ‘NO_RESULTS’ if no articles found.

Return type:

str

Raises:
  • PubMedSearchXMLParseError – If the search XML response cannot be parsed.

  • PubMedSearchResultsError – If search results cannot be retrieved.

  • PubMedAbstractXMLParseError – If abstract XML cannot be parsed.

  • PubMedAbstractRetrievalError – If abstracts cannot be retrieved.

  • PubMedAbstractConcatenationError – If abstract texts cannot be concatenated.
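
The search step returns XML that has to be parsed before any abstracts can be fetched. As a rough sketch, assuming a response shaped like the ESearch output of NCBI E-utilities (an IdList element containing Id children), the article identifiers could be pulled out as shown below. The sample XML is hand-written, `extract_pmids` is a hypothetical helper, and the library's own parsing and error wrapping may differ.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an ESearch response (IDs are made up).
SAMPLE_SEARCH_XML = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""

EMPTY_SEARCH_XML = "<eSearchResult><Count>0</Count><IdList/></eSearchResult>"


def extract_pmids(search_xml):
    """Return the article IDs listed in a search response (hypothetical helper)."""
    root = ET.fromstring(search_xml)  # raises ET.ParseError on malformed XML
    return [node.text for node in root.findall("./IdList/Id")]


print(extract_pmids(SAMPLE_SEARCH_XML))  # ['11111111', '22222222']
print(extract_pmids(EMPTY_SEARCH_XML))   # []
```

An empty list corresponds to the situation in which a caller would expect the documented 'NO_RESULTS' return value rather than abstract text.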

Example

>>> abstracts = pubmed_retrieve("aspirin", ncbikey="your_ncbi_key")
>>> print(abstracts[:100])

chemsource.retriever.wikipedia_retrieve(drug)[source]

Retrieve content from Wikipedia for a given compound.

This function fetches the Wikipedia page content for a chemical compound and processes it by removing newlines, tabs, and extra spaces.

Parameters:

drug (str) – The name of the compound to look up on Wikipedia.

Returns:

The processed Wikipedia content with cleaned formatting.

Return type:

str

Raises:

WikipediaRetrievalError – If Wikipedia content cannot be retrieved.

Example

>>> content = wikipedia_retrieve("aspirin")
>>> print(content[:100])
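
The cleanup described above (removing newlines, tabs, and extra spaces) can be approximated with a single regular expression. This is a sketch of one way to do it, not necessarily how the library does it; `clean_text` is a hypothetical name.

```python
import re


def clean_text(raw):
    """Collapse every run of whitespace into one space (hypothetical helper)."""
    # \s+ matches consecutive spaces, tabs, and newlines.
    return re.sub(r"\s+", " ", raw).strip()


page = "Aspirin\n\tis a  nonsteroidal\n\nanti-inflammatory drug. "
print(clean_text(page))
# Aspirin is a nonsteroidal anti-inflammatory drug.
```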

Constants

chemsource.config.BASE_PROMPT = Default classification prompt template

The default prompt template used for chemical compound classification. This prompt instructs the AI model to classify compounds into categories such as MEDICAL, ENDOGENOUS, FOOD, PERSONAL CARE, and INDUSTRIAL.
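
The default prompt contains the literal token COMPOUND_NAME and ends with "Provided Information:", which suggests that the compound name is substituted in and the retrieved text appended. The sketch below shows that idea, together with a character limit in the spirit of the max_length parameter of classify. Both the substitution and the truncation strategy are assumptions, `build_prompt` is a hypothetical helper, and the template used here is a shortened stand-in for BASE_PROMPT.

```python
def build_prompt(name, info, template, max_length=250000):
    """Fill in the compound name, append the information, and cap the length.

    Illustrative only; chemsource may assemble its prompt differently.
    """
    prompt = template.replace("COMPOUND_NAME", name) + info
    return prompt[:max_length]  # simple character-based truncation


# Shortened stand-in for the real BASE_PROMPT.
template = "Classify the compound COMPOUND_NAME. Provided Information:\n"

prompt = build_prompt("aspirin", "Aspirin is a pain reliever.", template)
print(prompt)
```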