peptidedigest

LLM for summarization of scientific articles related to computational peptides.

Submodules

Package Contents

Functions

create_database(name)

Create a SQLite database with the given name.

get_article(database[, doi, pmc_id])

Get the article information and model responses for a given DOI or PMC ID.

get_articles(database)

Get all articles from the database.

check_article_exists(database, value, column)

Check if an article with the given value in the specified column exists in the database.

delete_article(database[, doi, pmc_id])

Delete an article and its model responses from the database.

insert_article(database, article_info[, model_responses])

Insert an article and its model responses into the database.

update_article(database, doi, model_responses)

Update the model responses for an article in the database.

process_scidir_article(database, tokenizer, model, api_key)

Process a ScienceDirect article, summarize the article using the model, and store the information in the database.

process_multiple_scidir_articles(database, tokenizer, ...)

Process multiple ScienceDirect articles, summarize the articles using the model, and store the information in the database.

process_pmc_article(database, tokenizer, model, pmc_id)

Process a PubMed Central article, summarize the article using the model, and store the information in the database.

process_multiple_pmc_articles(database, tokenizer, ...)

Process multiple PubMed Central articles, summarize the articles using the model, and store the information in the database.

summarize_article_segments(fulltext, tokenizer, model)

Summarize a scientific article into bullet points and a concise summary.

summarize_article_meta(fulltext, tokenizer, model)

score_texts_peptide_research(texts_to_score, summary, ...)

peptidedigest.create_database(name)

Create a SQLite database with the given name.

Parameters:

name (str) – The name of the database to create.

Returns:

The database is created in the current working directory.

Return type:

None
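
To illustrate what creating a SQLite database by name can involve, here is a minimal sketch built on Python's standard `sqlite3` module. The helper name `create_database_sketch`, its `directory` argument, and the `articles` table schema are assumptions made for this example; the package's actual schema is not documented on this page.

```python
import os
import sqlite3
import tempfile


def create_database_sketch(name, directory="."):
    """Create a SQLite file called `name` in `directory` (hypothetical helper)."""
    path = os.path.join(directory, name)
    # sqlite3.connect opens the database; the file is written on first change.
    conn = sqlite3.connect(path)
    try:
        # Hypothetical schema: the real table layout may differ.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS articles ("
            "id INTEGER PRIMARY KEY, doi TEXT UNIQUE, "
            "pmc_id TEXT UNIQUE, title TEXT)"
        )
        conn.commit()
    finally:
        conn.close()
    return path


# Usage: create the database in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_path = create_database_sketch("peptides.db", demo_dir)
```

Using `CREATE TABLE IF NOT EXISTS` keeps the sketch safe to call twice with the same name.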

peptidedigest.get_article(database, doi=None, pmc_id=None)

Get the article information and model responses for a given DOI or PMC ID.

Parameters:
  • database (str) – The name of the database to retrieve the article from.

  • doi (str) – The DOI of the article to retrieve.

  • pmc_id (str) – The PMC ID of the article to retrieve.

Returns:

A dictionary containing the article information and model responses.

Return type:

dict
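
A lookup that accepts either identifier and returns a dictionary could look like the sketch below. It is not the package's implementation: the function name `get_article_sketch`, the `articles` table, its columns, and the choice to prefer the DOI when both identifiers are given are all assumptions, and the identifiers are made up.

```python
import sqlite3


def get_article_sketch(conn, doi=None, pmc_id=None):
    """Return the matching row as a dict, or None if nothing matches."""
    if doi is None and pmc_id is None:
        raise ValueError("provide doi or pmc_id")
    # Prefer the DOI when both identifiers are given (an arbitrary choice).
    column, value = ("doi", doi) if doi is not None else ("pmc_id", pmc_id)
    # sqlite3.Row lets each result row be converted to a dict by column name.
    conn.row_factory = sqlite3.Row
    row = conn.execute(
        f"SELECT * FROM articles WHERE {column} = ?", (value,)
    ).fetchone()
    return dict(row) if row is not None else None


# Usage with an in-memory database and made-up identifiers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (doi TEXT, pmc_id TEXT, summary TEXT)")
conn.execute(
    "INSERT INTO articles VALUES (?, ?, ?)",
    ("10.1000/example", "PMC0000001", "Demo summary"),
)
article = get_article_sketch(conn, pmc_id="PMC0000001")
no_match = get_article_sketch(conn, doi="10.1000/missing")
```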

peptidedigest.get_articles(database)

Get all articles from the database.

Parameters:

database (str) – The name of the database to retrieve the articles from.

Returns:

A list of dictionaries containing the article information and model responses.

Return type:

list

peptidedigest.check_article_exists(database, value, column)

Check if an article with the given value in the specified column exists in the database.

Parameters:
  • database (str) – The name of the database to check for the article.

  • value (str) – The value to check for in the specified column.

  • column (str) – The column to check for the value.

Returns:

True if the article exists, False otherwise.

Return type:

bool
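
Checking for a value in a caller-specified column raises one practical point: SQL placeholders can bind values but not column names. The sketch below shows one way to handle that with an allow-list. The names `article_exists_sketch` and `ALLOWED_COLUMNS`, the `articles` table, and the identifiers are hypothetical; how the package itself validates `column` is not stated here.

```python
import sqlite3

# Hypothetical set of searchable columns; the real schema may differ.
ALLOWED_COLUMNS = {"doi", "pmc_id", "title"}


def article_exists_sketch(conn, value, column):
    """Return True if any row in `articles` has `value` in `column`."""
    # Column names cannot be bound as SQL parameters, so validate them first.
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unsupported column: {column!r}")
    query = f"SELECT 1 FROM articles WHERE {column} = ? LIMIT 1"
    return conn.execute(query, (value,)).fetchone() is not None


# Usage with an in-memory database and made-up identifiers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (doi TEXT, pmc_id TEXT, title TEXT)")
conn.execute(
    "INSERT INTO articles VALUES ('10.1000/example', 'PMC0000001', 'Demo')"
)
found = article_exists_sketch(conn, "10.1000/example", "doi")
missing = article_exists_sketch(conn, "PMC9999999", "pmc_id")
```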

peptidedigest.delete_article(database, doi=None, pmc_id=None)

Delete an article and its model responses from the database.

Parameters:
  • database (str) – The name of the database to delete the article from.

  • doi (str) – The DOI of the article to delete.

  • pmc_id (str) – The PMC ID of the article to delete.

Returns:

The article and model responses are deleted from the database.

Return type:

None

peptidedigest.insert_article(database, article_info, model_responses=None)

Insert an article and its model responses into the database.

Parameters:
  • database (str) – The name of the database to insert the article into.

  • article_info (dict) – A dictionary containing the article information.

  • model_responses (dict) – A dictionary containing the model responses for the article.

Returns:

The article and model responses are inserted into the database.

Return type:

None

peptidedigest.update_article(database, doi, model_responses)

Update the model responses for an article in the database.

Parameters:
  • database (str) – The name of the database to update the article in.

  • doi (str) – The DOI of the article to update.

  • model_responses (dict) – A dictionary containing the updated model responses.

Returns:

The model responses for the article are updated in the database.

Return type:

None

peptidedigest.process_scidir_article(database, tokenizer, model, api_key, doi=None, pii=None, url=None, chunk_size=4200, update=False)

Process a ScienceDirect article, summarize the article using the model, and store the information in the database.

Parameters:
  • database (str) – The database to store the processed article information.

  • tokenizer (transformers.PreTrainedTokenizer) – The tokenizer to use for the model.

  • model (transformers.PreTrainedModel) – The model to use to process the article.

  • api_key (str) – The API key for the ScienceDirect API. API keys can be obtained by creating an account at https://dev.elsevier.com/.

  • doi (str, optional) – The DOI of the article to be processed.

  • pii (str, optional) – The PII of the article to be processed.

  • url (str, optional) – The URL of the article to be processed.

  • chunk_size (int, optional) – The size of the chunks to split the full text into. Default is 4200.

  • update (bool, optional) – If True, the article will be updated in the database if it already exists. Default is False.

Returns:

The processed article information is stored in the database.

Return type:

None
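
The `chunk_size` parameter controls how the full text is split before summarization. This page does not say whether the size is measured in characters or tokens, so the sketch below assumes characters and uses a hypothetical helper, `split_into_chunks`, purely to show the idea.

```python
def split_into_chunks(text, chunk_size=4200):
    """Split `text` into consecutive pieces of at most `chunk_size` characters.

    Assumption: sizes are counted in characters; the package may count tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Usage: a 10,000-character text yields two full chunks and a remainder.
chunks = split_into_chunks("a" * 10000)
lengths = [len(c) for c in chunks]  # [4200, 4200, 1600]
```

A fixed-width split can cut sentences in half; a production splitter would more likely break on sentence or paragraph boundaries.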

peptidedigest.process_multiple_scidir_articles(database, tokenizer, model, api_key, dois=None, piis=None, urls=None, chunk_size=4200, update=False)

Process multiple ScienceDirect articles, summarize the articles using the model, and store the information in the database.

Parameters:
  • database (str) – The database to store the processed article information.

  • tokenizer (transformers.PreTrainedTokenizer) – The tokenizer to use for the model.

  • model (transformers.PreTrainedModel) – The model to use to process the articles.

  • api_key (str) – The API key for the ScienceDirect API. API keys can be obtained by creating an account at https://dev.elsevier.com/.

  • dois (list of str, optional) – The DOIs of the articles to be processed.

  • piis (list of str, optional) – The PIIs of the articles to be processed.

  • urls (list of str, optional) – The URLs of the articles to be processed.

  • chunk_size (int, optional) – The size of the chunks to split the full text into. Default is 4200.

  • update (bool, optional) – If True, the articles will be updated in the database if they already exist. Default is False.

Returns:

The processed article information is stored in the database.

Return type:

None

peptidedigest.process_pmc_article(database, tokenizer, model, pmc_id, chunk_size=4200, update=False)

Process a PubMed Central article, summarize the article using the model, and store the information in the database.

Parameters:
  • database (str) – The database to store the processed article information.

  • tokenizer (transformers.PreTrainedTokenizer) – The tokenizer to use for the model.

  • model (transformers.PreTrainedModel) – The model to use to process the article.

  • pmc_id (str) – The PMC ID of the article to be processed.

  • chunk_size (int, optional) – The size of the chunks to split the full text into. Default is 4200.

  • update (bool, optional) – If True, the article will be updated in the database if it already exists. Default is False.

Returns:

The processed article information is stored in the database.

Return type:

None
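
The `update` flag decides what happens when the article is already stored. The sketch below models one plausible reading, using a plain dictionary in place of the database: an existing entry is left alone unless `update=True`. The function `store_summary` and the skip-on-duplicate behavior for `update=False` are assumptions, since this page only documents the `True` case.

```python
def store_summary(db, key, summary, update=False):
    """Store `summary` under `key` and report what happened."""
    if key in db and not update:
        # Assumed behavior: leave the existing entry untouched.
        return "skipped"
    action = "updated" if key in db else "inserted"
    db[key] = summary
    return action


# Usage with a dict standing in for the database and a made-up PMC ID.
db = {}
first = store_summary(db, "PMC0000001", "v1")
second = store_summary(db, "PMC0000001", "v2")
third = store_summary(db, "PMC0000001", "v3", update=True)
```

Skipping existing entries by default avoids re-running the model on articles that were already summarized.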

peptidedigest.process_multiple_pmc_articles(database, tokenizer, model, pmc_ids, chunk_size=4200, update=False)

Process multiple PubMed Central articles, summarize the articles using the model, and store the information in the database.

Parameters:
  • database (str) – The database to store the processed article information.

  • tokenizer (transformers.PreTrainedTokenizer) – The tokenizer to use for the model.

  • model (transformers.PreTrainedModel) – The model to use to process the articles.

  • pmc_ids (list of str) – The PMC IDs of the articles to be processed.

  • chunk_size (int, optional) – The size of the chunks to split the full text into. Default is 4200.

  • update (bool, optional) – If True, the articles will be updated in the database if they already exist. Default is False.

Returns:

The processed article information is stored in the database.

Return type:

None

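peptidedigest.summarize_article_segments(fulltext, tokenizer, model)

Summarize a scientific article into bullet points and a concise summary.

Parameters:
  • fulltext (list of str) – A list of text chunks from a scientific article.

  • tokenizer (transformers.PreTrainedTokenizer) – The tokenizer to use for the model.

  • model (transformers.PreTrainedModel) – The model to use to summarize the article.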

Returns:

  • final_summary (str) – A concise summary of the scientific article.

  • bullet_points (str) – Bullet points summarizing the scientific article.
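
One common way to turn a list of text chunks into bullet points plus a final summary is a two-stage "summarize each chunk, then summarize the summaries" pass. The sketch below shows that pattern with a toy stand-in for the language model. It is an assumption about the general approach, not a description of how this function is implemented; `summarize_segments_sketch` and `first_sentence` are hypothetical names.

```python
def summarize_segments_sketch(chunks, summarize):
    """Summarize each chunk, then summarize the combined bullet points."""
    # First stage: one bullet point per chunk.
    bullets = ["- " + summarize(chunk) for chunk in chunks]
    bullet_points = "\n".join(bullets)
    # Second stage: condense the bullet points into a final summary.
    final_summary = summarize(bullet_points)
    return final_summary, bullet_points


def first_sentence(text):
    """Toy stand-in for a language model: keep text up to the first period."""
    return text.split(".")[0].strip() + "."


# Usage with two made-up chunks.
final_summary, bullet_points = summarize_segments_sketch(
    ["Peptides fold quickly. More detail.", "Docking scores improved. Extra."],
    first_sentence,
)
```

Passing the summarizer in as a callable keeps the sketch runnable without loading a tokenizer or model.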

peptidedigest.summarize_article_meta(fulltext, tokenizer, model)

peptidedigest.score_texts_peptide_research(texts_to_score, summary, bullet_points, metadata, tokenizer, model)