sciencescraper.sciencedirect.scidir_clean

Functions to clean the text extracted from ScienceDirect articles.

Module Contents

Functions

clean_fulltext(xml_text, chunk_size)

Clean the raw XML text of a ScienceDirect article to remove unnecessary information, leaving only the full text of the article.

split_into_chunks(text, chunk_size)

Splits a given text into chunks of approximately 'chunk_size' words.

sciencescraper.sciencedirect.scidir_clean.clean_fulltext(xml_text, chunk_size)

Clean the raw XML text of a ScienceDirect article to remove unnecessary information, leaving only the full text of the article.

Parameters:
  • xml_text (str) – The raw XML text of an article.

  • chunk_size (int) – The size of the chunks to split the full text into.

Returns:

The full text of the article, split into chunks.

Return type:

list of str
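
The cleaning step can be illustrated with a minimal sketch. It is not the library's implementation: `clean_fulltext_sketch` is a hypothetical stand-in that assumes "cleaning" means dropping all XML markup and keeping only text nodes, and the sample XML below is invented for illustration. The actual function may filter out metadata, references, or other elements that this sketch keeps.

```python
import xml.etree.ElementTree as ET


def clean_fulltext_sketch(xml_text, chunk_size):
    """Hypothetical stand-in: strip XML markup, then chunk the text by words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    # Parse the XML string and collect every text node in document order.
    root = ET.fromstring(xml_text)
    text = " ".join(root.itertext())
    # Splitting on whitespace also collapses newlines and repeated spaces.
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Invented sample document, not real ScienceDirect XML.
sample = (
    "<article><title>Sample</title>"
    "<body><para>First sentence here.</para><para>Second one.</para></body>"
    "</article>"
)
print(clean_fulltext_sketch(sample, 4))
# ['Sample First sentence here.', 'Second one.']
```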

sciencescraper.sciencedirect.scidir_clean.split_into_chunks(text, chunk_size)

Splits a given text into chunks of approximately ‘chunk_size’ words.

Parameters:
  • text (str) – The text to split into chunks.

  • chunk_size (int) – The approximate number of words in each chunk.

Returns:

List of the text split into chunks.

Return type:

list of str
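
Word-based chunking of this kind can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `split_into_chunks_sketch` is a hypothetical name, words are assumed to be whitespace-separated tokens, and every chunk except possibly the last holds exactly `chunk_size` words. The real function is documented only as producing chunks of "approximately" that size, so its boundaries may differ (for example, by respecting sentence ends).

```python
def split_into_chunks_sketch(text, chunk_size):
    """Hypothetical sketch: group whitespace-separated words into chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    words = text.split()
    # Step through the word list chunk_size words at a time.
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


print(split_into_chunks_sketch("one two three four five", 2))
# ['one two', 'three four', 'five']
```

An empty input yields an empty list in this sketch, since there are no words to group.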