peptidedigest.clean_text

Functions to clean text data.

Module Contents

Functions

split_into_chunks(text, chunk_size)

Splits a given text into chunks of approximately ‘chunk_size’ words.

clean_summary(summary_text)

Cleans a summary text by removing unwanted patterns and phrases.

extract_metadata(metadata_text)

Extracts peptides, proteins, domains of interest, chemistry discussed, biology discussed, and computational methods discussed from the model metadata text.

peptidedigest.clean_text.split_into_chunks(text, chunk_size)

Splits a given text into chunks of approximately ‘chunk_size’ words.

Parameters:
  • text (str) – The text to split into chunks.

  • chunk_size (int) – The approximate number of words to include in each chunk.

Returns:

chunks – A list of text chunks, each containing approximately ‘chunk_size’ words.

Return type:

list of str
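
The chunking behaviour described above can be illustrated with a minimal sketch. This is not the library's implementation: split_into_chunks_sketch is a hypothetical stand-in that assumes words are separated by whitespace and that every chunk except possibly the last holds exactly chunk_size words. The real function may treat sentence boundaries or punctuation differently, which would explain the word "approximately".

```python
def split_into_chunks_sketch(text, chunk_size):
    """Hypothetical stand-in: group whitespace-separated words into chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    words = text.split()  # assumption: words are delimited by whitespace
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


chunks = split_into_chunks_sketch("one two three four five", 2)
print(chunks)  # ['one two', 'three four', 'five']
```

The last chunk may be shorter than chunk_size when the word count is not an exact multiple of it.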

peptidedigest.clean_text.clean_summary(summary_text)

Cleans a summary text by removing unwanted patterns and phrases.

Parameters:

summary_text (str) – The summary text to clean.

Returns:

cleaned_summary – The cleaned summary text.

Return type:

str
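
This page does not list which patterns and phrases are removed, so the sketch below only illustrates the general approach under stated assumptions. The name clean_summary_sketch and the entries in UNWANTED_PATTERNS are hypothetical; here they strip a leading "Summary:" style label and collapse repeated whitespace, which may not match what the real function does.

```python
import re

# Hypothetical patterns: the ones the real function removes are not documented here.
UNWANTED_PATTERNS = [
    r"^\s*(?:here is (?:a|the) summary[^:]*:|summary:)\s*",
]


def clean_summary_sketch(summary_text):
    """Hypothetical stand-in: strip assumed lead-in phrases and tidy whitespace."""
    cleaned = summary_text
    for pattern in UNWANTED_PATTERNS:
        cleaned = re.sub(pattern, "", cleaned, flags=re.IGNORECASE)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", cleaned).strip()


print(clean_summary_sketch("Summary:  The peptide binds\n the receptor. "))
# The peptide binds the receptor.
```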

peptidedigest.clean_text.extract_metadata(metadata_text)

Extracts peptides, proteins, domains of interest, chemistry discussed, biology discussed, and computational methods discussed from the model metadata text.

Parameters:

metadata_text (str) – The model metadata text to be parsed.

Returns:

A dictionary containing the extracted metadata as lists.

Return type:

dict
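
The layout of the model metadata text and the keys of the returned dictionary are not specified on this page. The sketch below therefore assumes a simple "Label: item, item" line format and hypothetical key names, purely to show how a dictionary of lists could be built from such text. extract_metadata_sketch and FIELDS are illustrative names, not part of the library.

```python
# Hypothetical mapping from dictionary keys to the labels assumed in the text.
FIELDS = {
    "peptides": "Peptides",
    "proteins": "Proteins",
    "domains": "Domains of interest",
    "chemistry": "Chemistry discussed",
    "biology": "Biology discussed",
    "computational_methods": "Computational methods discussed",
}


def extract_metadata_sketch(metadata_text):
    """Hypothetical stand-in: parse 'Label: a, b' lines into a dict of lists."""
    metadata = {key: [] for key in FIELDS}
    labels = {label.lower(): key for key, label in FIELDS.items()}
    for line in metadata_text.splitlines():
        label, sep, values = line.partition(":")
        key = labels.get(label.strip().lower())
        if sep and key is not None:
            # Assumption: items on a line are separated by commas.
            metadata[key] = [v.strip() for v in values.split(",") if v.strip()]
    return metadata


example = "Peptides: GLP-1, insulin\nProteins: GLP1R\nNotes: ignored"
result = extract_metadata_sketch(example)
print(result["peptides"])  # ['GLP-1', 'insulin']
```

Categories missing from the input keep an empty list, so callers can iterate over every key without checking for its presence.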