
RemoveRepetitiveElements


default​

Defined in: packages/ai-research-agent/src/extractor/pdf-to-html/transformations/line-item/RemoveRepetitiveElements.js:21

Removes elements that have similar content and occupy the same position across pages, such as page numbers and license information.

Extends​

Constructors​

Constructor​

new default(): default;

Defined in: packages/ai-research-agent/src/extractor/pdf-to-html/transformations/line-item/RemoveRepetitiveElements.js:22

Returns​

default

Overrides​

default.constructor

Properties​

name​

name: any;

Defined in: packages/ai-research-agent/src/extractor/pdf-to-html/transformations/Transformation.js:11

Inherited from​

default.name

itemType​

itemType: any;

Defined in: packages/ai-research-agent/src/extractor/pdf-to-html/transformations/Transformation.js:12

Inherited from​

default.itemType

Methods​

transform()​

transform(parseResult: any): default;

Defined in: packages/ai-research-agent/src/extractor/pdf-to-html/transformations/line-item/RemoveRepetitiveElements.js:30

The idea is the following:

  • For each page, collect all items of the first line and all items of the last line.
  • Count how often these items occur across all pages, using a hash that ignores numbers, whitespace, and letter case.
  • Delete items that occur on more than 2/3 of all pages.
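
The steps above can be sketched roughly as follows. This is a minimal illustration, not the library's code: the function names `normalize` and `removeRepetitiveLines` and the page shape `{ lines: string[] }` are assumptions made for the example. It treats each first or last line as a single string rather than a list of items, and it uses the normalized string itself as the lookup key in place of a hash.

```javascript
// Hypothetical helpers; the page shape ({ lines: string[] }) is an assumption.

// Normalize text so that "Page 3" and "page 12" compare equal:
// drop digits and whitespace, then lowercase.
function normalize(text) {
  return text.replace(/[\d\s]/g, '').toLowerCase();
}

// Remove first/last lines whose normalized text appears on more than
// two thirds of all pages.
function removeRepetitiveLines(pages) {
  const counts = new Map();
  for (const page of pages) {
    if (page.lines.length === 0) continue;
    const first = normalize(page.lines[0]);
    const last = normalize(page.lines[page.lines.length - 1]);
    // Count each key once per page, even if first and last line match.
    for (const key of new Set([first, last])) {
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  }

  const threshold = (pages.length * 2) / 3;
  const isRepetitive = (line) => (counts.get(normalize(line)) || 0) > threshold;

  // Only the first and last line of each page are candidates for removal.
  return pages.map((page) => {
    const lastIndex = page.lines.length - 1;
    return {
      ...page,
      lines: page.lines.filter(
        (line, i) => !((i === 0 || i === lastIndex) && isRepetitive(line))
      ),
    };
  });
}
```

Because digits are stripped before counting, a footer such as "Page 1", "Page 2", "Page 3" collapses to one key and is counted as the same element on every page.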

Parameters

Parameter    Type
parseResult  any

Returns​

default

Overrides​

default.transform

completeTransform()​

completeTransform(parseResult: any): any;

Defined in: packages/ai-research-agent/src/extractor/pdf-to-html/transformations/ToLineItemTransformation.js:19

Sometimes transform() only visualizes a change. This method then performs the actual change.
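
One way to picture this split is a two-phase flow in which the first phase only marks items and the second phase applies the marks. The sketch below is an assumption for illustration, not the library's implementation: the `REMOVED` marker, the `annotation` field, the `{ pages: [{ items: [...] }] }` shape, and both function names are hypothetical.

```javascript
// Hypothetical annotation marker and item shape; not taken from the library.
const REMOVED = 'REMOVED';

// Phase 1: mark items for removal without deleting them, so that a
// viewer can still display what is about to change.
function markForRemoval(parseResult, shouldRemove) {
  for (const page of parseResult.pages) {
    for (const item of page.items) {
      if (shouldRemove(item)) item.annotation = REMOVED;
    }
  }
  return parseResult;
}

// Phase 2: apply the change by dropping the marked items and clearing
// any annotations left on the remaining ones.
function applyRemoval(parseResult) {
  for (const page of parseResult.pages) {
    page.items = page.items.filter((item) => item.annotation !== REMOVED);
    for (const item of page.items) delete item.annotation;
  }
  return parseResult;
}
```

Keeping the marked items around until the second phase lets an intermediate result be inspected before anything is discarded.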

Parameters

Parameter    Type
parseResult  any

Returns​

any

Inherited from​

default.completeTransform