sciencescraper.pmc.pmc_clean

Functions to clean the data extracted from PMC.

Module Contents

Functions

clean_full_text(pmc_article, chunk_size)

Returns the full text of the article, excluding figures, tables, and supplementary information sections, with reference numbers removed.

split_into_chunks(text, chunk_size)

Splits a given text into chunks of approximately 'chunk_size' words.

sciencescraper.pmc.pmc_clean.clean_full_text(pmc_article, chunk_size)

Returns the full text of the article with reference numbers removed, excluding figures, tables, and supplementary information sections.

Parameters:
  • pmc_article (BeautifulSoup) – The article as a BeautifulSoup object.

  • chunk_size (int) – The size of the chunks to split the full text into.

Returns:

full_text – The full text of the article, excluding figures, tables, supplementary information sections, and reference numbers.

Return type:

str
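The reference-number removal described for clean_full_text can be illustrated with a small text-level sketch. This is not the library's implementation, which works on a BeautifulSoup object and may remove citation elements from the markup instead. The helper name strip_reference_numbers and the assumption that citations appear as bracketed numbers such as [1] or [2, 3] are hypothetical.

```python
import re

# Assumption: citations appear in the text as bracketed numbers such as
# [1], [2, 3], or [4-6]. Whitespace directly before the bracket is removed too.
_CITATION = re.compile(r"\s*\[\d+(?:\s*[,\-–]\s*\d+)*\]")


def strip_reference_numbers(text: str) -> str:
    """Hypothetical helper: remove bracketed numeric citations from text."""
    return _CITATION.sub("", text)


sample = "Cells divide [1]. Growth slows [2, 3] over time [4-6]."
print(strip_reference_numbers(sample))
# prints: Cells divide. Growth slows over time.
```

Other citation styles, such as superscript numbers or author-year citations, would need a different pattern.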

sciencescraper.pmc.pmc_clean.split_into_chunks(text, chunk_size)

Splits a given text into chunks of approximately ‘chunk_size’ words.

Parameters:
  • text (str) – The text to split into chunks.

  • chunk_size (int) – The size of the chunks to split the text into.

Returns:

List of the text split into chunks.

Return type:

list of str
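A minimal sketch of word-based chunking, written to match the documented signature and return type of split_into_chunks. It is an assumption about the behaviour, not the library's source: the name split_into_chunks_sketch is hypothetical, and the real function produces chunks of only approximately chunk_size words, so it may break at different points (for example at sentence boundaries).

```python
from typing import List


def split_into_chunks_sketch(text: str, chunk_size: int) -> List[str]:
    """Hypothetical stand-in: split text into chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    # Split on any whitespace, then regroup the words chunk_size at a time.
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


print(split_into_chunks_sketch("one two three four five", 2))
# prints: ['one two', 'three four', 'five']
```

In this sketch the last chunk is shorter than chunk_size whenever the word count is not an exact multiple, and an empty string yields an empty list.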