


def recursive_get_id(values_to_unpack: Union[dict, list],
tmpl: Optional[set] = None)

Pull ID values out of the LIST/NSMI results from Spot.
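A minimal sketch of the idea, assuming (hypothetically) that Spot results arrive as nested dicts and lists in which each label stores its identifier under an `id` key. The helper name `collect_ids` and the sample payload are illustrative only, not the library's implementation.

```python
from typing import Optional, Union


def collect_ids(values: Union[dict, list], found: Optional[set] = None) -> set:
    """Walk nested dicts/lists and gather every value stored under an "id" key."""
    if found is None:
        found = set()
    if isinstance(values, dict):
        for key, value in values.items():
            if key == "id":
                found.add(value)
            elif isinstance(value, (dict, list)):
                collect_ids(value, found)
    elif isinstance(values, list):
        for item in values:
            if isinstance(item, (dict, list)):
                collect_ids(item, found)
    return found


# Hypothetical payload shaped like a nested label tree.
sample = {"labels": [{"id": "HO-00-00-00-00", "children": [{"id": "HO-02-00-00-00"}]}]}
ids = collect_ids(sample)
```

Passing the accumulator set down the recursion avoids building and merging a new set at every level.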


def spot(text: str,
lower: float = 0.25,
pred: float = 0.5,
upper: float = 0.6,
verbose: float = 0,
token: str = "")

Call the Spot API to classify the text of a PDF using the NSMIv2/LIST taxonomy, but return only the IDs of issues found in the text.


def re_case(text: str) -> str

Capture PascalCase, snake_case, and kebab-case terms and add spaces to separate the joined words.
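One way to do this kind of splitting is with a few regular expressions. The sketch below is an assumption about the approach, not the library's code, and `split_joined_words` is a hypothetical name.

```python
import re


def split_joined_words(text: str) -> str:
    """Insert spaces into PascalCase/camelCase terms and replace _ and - joiners."""
    # Break "lowerUpper" boundaries, e.g. "userName" -> "user Name".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Break "ABCWord" boundaries, e.g. "PDFFile" -> "PDF File".
    spaced = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", spaced)
    # Treat underscores and hyphens between word characters as separators.
    spaced = re.sub(r"(?<=\w)[_-]+(?=\w)", " ", spaced)
    return spaced
```

The lookbehind and lookahead assertions match positions rather than characters, so no letters are consumed when the space is inserted.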


def regex_norm_field(text: str)

Apply some heuristics to a field name to see if we can get it to match AssemblyLine conventions.


def reformat_field(text: str, max_length: int = 30, tools_token=None)

Transform a string of text into a snake_case variable name close in length to max_length by summarizing the string and stitching the summary together in snake_case.


def norm(row)

Normalize a word vector.
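The entry does not say which norm is used. The sketch below assumes the common choice of L2 (unit-length) normalization and uses plain Python lists rather than any particular array library; `l2_normalize` is a hypothetical name.

```python
import math
from typing import List, Sequence


def l2_normalize(row: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; return it unchanged if it is all zeros."""
    length = math.sqrt(sum(x * x for x in row))
    if length == 0:
        return list(row)
    return [x / length for x in row]


unit = l2_normalize([3.0, 4.0])
```

Unit-length vectors make the dot product of two vectors equal to their cosine similarity, which is convenient when comparing word vectors.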


def vectorize(text: Union[List[str], str], tools_token: Optional[str] = None)

Vectorize a string of text.


  • text - a string of multiple words to vectorize
  • tools_token - the token for the vectorization micro-service, which reduces the amount of memory you need on your machine. If not passed, you need to have en_core_web_lg installed.


def normalize_name(jur: str,
group: str,
n: int,
last_field: str,
this_field: str,
tools_token: Optional[str] = None) -> Tuple[str, float]

Normalize a field name. Open design note: add hard-coded conversions, perhaps by calling a function; if it returns 0, fail over to ML, or the other way around (a poor probability triggers a check of the hard-coded conversions). Returns the new name and a confidence value between 0 and 1.
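The "hard-coded first, then fail over" ordering described above could look roughly like this. Everything here is hypothetical: the function name, the conversion table, and the stand-in for the ML step are illustrative assumptions, not the library's behavior.

```python
from typing import Callable, Dict, Tuple


def normalize_with_fallback(
    field_name: str,
    hard_coded: Dict[str, str],
    ml_guess: Callable[[str], Tuple[str, float]],
) -> Tuple[str, float]:
    """Try an exact hard-coded conversion first, then fall back to a model."""
    key = field_name.strip().lower()
    if key in hard_coded:
        # A table hit is treated as certain in this sketch.
        return hard_coded[key], 1.0
    return ml_guess(field_name)


# Hypothetical conversion table and a stand-in for the ML step.
conversions = {"name of court": "court_name"}


def fake_model(name: str) -> Tuple[str, float]:
    return name.lower().replace(" ", "_"), 0.4


hit = normalize_with_fallback("Name of Court", conversions, fake_model)
miss = normalize_with_fallback("Docket No", conversions, fake_model)
```

The reverse ordering mentioned in the note would call the model first and consult the table only when the returned confidence falls below some threshold.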


def cluster_screens(fields: List[str] = [],
damping: float = 0.7,
tools_token: Optional[str] = None) -> Dict[str, List[str]]

Groups the given fields into screens based on how much they are related.


  • fields - a list of field names
  • damping - a value >= 0.5 and < 1. Tunes how related screens should be
  • tools_token - the token needed if doing micro-service vectorization
  • returns - a suggested screen grouping, each screen name mapped to the list of fields on it

InputType Objects

class InputType(Enum)

Input type maps onto the type of input the PDF author chose for the field. We only handle text, checkbox, and signature fields.


def field_types_and_sizes(
fields: Optional[Iterable[FormField]]) -> List[FieldInfo]

Transform the fields provided by get_existing_pdf_fields into a summary format. Result will look like: [ { "var_name": var_name, "type": "text | checkbox | signature", "max_length": n } ]
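To make the summary shape concrete, here is a small sketch of the structure described above. The enum values and field names are assumptions for illustration; only the three keys and the three supported input types come from the descriptions on this page.

```python
from enum import Enum
from typing import List, TypedDict


class InputType(Enum):
    # Member values are illustrative; the page only names the three kinds.
    TEXT = "text"
    CHECKBOX = "checkbox"
    SIGNATURE = "signature"


class FieldInfo(TypedDict):
    var_name: str
    type: InputType
    max_length: int


# Hypothetical summary for a three-field form.
summary: List[FieldInfo] = [
    {"var_name": "users_name", "type": InputType.TEXT, "max_length": 60},
    {"var_name": "is_tenant", "type": InputType.CHECKBOX, "max_length": 0},
    {"var_name": "users_signature", "type": InputType.SIGNATURE, "max_length": 0},
]

text_fields = [f["var_name"] for f in summary if f["type"] is InputType.TEXT]
```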

AnswerType Objects

class AnswerType(Enum)

Answer type describes the effort the user answering the form will require. "Slot-in" answers are a matter of almost instantaneous recall, e.g., name, address, etc. "Gathered" answers require looking around one's desk, e.g., for a health insurance number. "Third party" answers require picking up the phone to call someone else who is the keeper of the information. "Created" answers don't exist before the user is presented with the question. They may include a choice, creating a narrative, or even applying legal reasoning. "Affidavits" are a special form of created answers. See Jarrett and Gaffney, Forms That Work (2008).


def classify_field(field: FieldInfo, new_name: str) -> AnswerType

Apply heuristics to the field's original and "normalized" name to classify it as either a "slot-in", "gathered", "third party" or "created" field type.


def get_adjusted_character_count(field: FieldInfo) -> float

Determines the bracketed length of an input field based on its max_length attribute, returning a float representing the approximate length of the field content.

The function chunks the answers into 5 different lengths (checkboxes, 2 words, short, medium, and long) instead of directly using the character count, as forms can allocate different spaces for the same data without considering the space the user actually needs.


  • field FieldInfo - An object containing information about the input field, including the "max_length" attribute.


  • float - The approximate length of the field content, categorized into checkboxes, 2 words, short, medium, or long based on the max_length attribute.


>>> get_adjusted_character_count({"type": InputType.CHECKBOX})
4.7
>>> get_adjusted_character_count({"max_length": 100})
9.4
>>> get_adjusted_character_count({"max_length": 300})
230
>>> get_adjusted_character_count({"max_length": 600})
115
>>> get_adjusted_character_count({"max_length": 1200})
1150


def time_to_answer_field(field: FieldInfo,
new_name: str,
cpm: int = 40,
cpm_std_dev: int = 17) -> Callable[[int], np.ndarray]

Apply a heuristic for the time it takes to answer the given field, in minutes. It is hand-written for now. It will factor in the input type, the answer type (slot in, gathered, third party or created), and the amount of input text allowed in the field. The return value is a function that can return N samples of how long it will take to answer the field (in minutes)
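The "return a function that produces N samples" pattern can be sketched as a closure. This is a simplified illustration under stated assumptions: `cpm` is read as typing speed in characters per minute, typing speed is drawn from a normal distribution, a plain list stands in for `np.ndarray`, and `char_count`, `thinking_minutes`, and `seed` are hypothetical parameters that are not part of the documented signature.

```python
import random
from typing import Callable, List


def make_time_sampler(
    char_count: float,
    cpm: int = 40,
    cpm_std_dev: int = 17,
    thinking_minutes: float = 0.0,
    seed: int = 0,
) -> Callable[[int], List[float]]:
    """Return a function that draws n samples of answer time in minutes."""
    rng = random.Random(seed)

    def sample(n: int) -> List[float]:
        times = []
        for _ in range(n):
            # Draw a typing speed, clamped so it can never be zero or negative.
            speed = max(rng.gauss(cpm, cpm_std_dev), 1.0)
            times.append(thinking_minutes + char_count / speed)
        return times

    return sample


sampler = make_time_sampler(char_count=230, thinking_minutes=0.5)
samples = sampler(1000)
```

Returning a sampler instead of a single number lets the caller decide how many draws to take and combine fields before summarizing.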


def time_to_answer_form(processed_fields,
normalized_fields) -> Tuple[float, float]

Provide an estimate of how long it would take an average user to respond to the questions on the provided form. We use signals such as the field type, name, and space provided for the response to come up with a rough estimate, based on whether the field is:

  1. fill in the blank
  2. gathered - e.g., an id number, case number, etc.
  3. third party: the user needs to actually ask someone for the information - e.g., the income of someone other than the user, anything else?
  4. created:
     a. short created (3 lines or so?)
     b. long created (anything over 3 lines)
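A rough sketch of how per-field time samples might be combined into a form-level estimate. It assumes the returned pair is a mean and a standard deviation, which the signature suggests but the description does not state. `summarize_form_time` and the stand-in samplers are hypothetical.

```python
import statistics
from typing import Callable, List, Sequence, Tuple


def summarize_form_time(
    samplers: Sequence[Callable[[int], List[float]]], runs: int = 5
) -> Tuple[float, float]:
    """Sum per-field samples run by run, then return (mean, standard deviation)."""
    per_field = [sampler(runs) for sampler in samplers]
    totals = [sum(run) for run in zip(*per_field)]
    return statistics.mean(totals), statistics.pstdev(totals)


# Deterministic stand-in samplers so the arithmetic is easy to follow.
def name_field(n: int) -> List[float]:
    return [0.25] * n  # fill in the blank: quick recall


def case_number(n: int) -> List[float]:
    return [1.0, 2.0, 3.0, 2.0, 2.0][:n]  # gathered: varies between runs


mean_minutes, std_minutes = summarize_form_time([name_field, case_number])
```

Summing within each simulated run before averaging keeps the spread of the total, which would be lost if only per-field means were added together.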


def cleanup_text(text: str, fields_to_sentences: bool = False) -> str

Apply cleanup routines to text to provide more accurate readability statistics.


def complete_with_command(text,
creds: Optional[OpenAiCreds] = None) -> str

Combines some text with a command to send to OpenAI.


def needs_calculations(text: Union[str, Doc]) -> bool

A conservative guess at whether a given form needs the filler to make math calculations, something that should be avoided.


def get_passive_sentences(
text: Union[List, str]) -> List[Tuple[str, List[Tuple[int, int]]]]

Return a list of tuples, where each tuple represents a sentence in which passive voice was detected along with a list of the starting and ending position of each fragment that is phrased in the passive voice. The combination of the two can be used in the PDFStats frontend to highlight the passive text in an individual sentence.

Text can either be a string or a list of strings. If provided a single string, it will be tokenized with NLTK, and sentences containing fewer than 2 words will be ignored.


def get_citations(text: str, tokenized_sentences: List[str]) -> List[str]

Get citations, plus some extra surrounding context (the full sentence) if the citation is fewer than 5 characters long. Often eyecite only captures a section symbol for state-level short citation formats.


def substitute_phrases(
input_string: str,
substitution_phrases: Dict[str,
str]) -> Tuple[str, List[Tuple[int, int]]]

Substitute phrases in the input string and return the new string and positions of substituted phrases.


  • input_string str - The input string containing phrases to be replaced.
  • substitution_phrases Dict[str, str] - A dictionary mapping original phrases to their replacement phrases.


Tuple[str, List[Tuple[int, int]]]: A tuple containing the new string with substituted phrases and a list of tuples, each containing the start and end positions of the substituted phrases in the new string.


>>> input_string = "The quick brown fox jumped over the lazy dog."
>>> substitution_phrases = {"quick brown": "swift reddish", "lazy dog": "sleepy canine"}
>>> new_string, positions = substitute_phrases(input_string, substitution_phrases)
>>> print(new_string)
The swift reddish fox jumped over the sleepy canine.
>>> print(positions)
[(4, 17), (38, 51)]
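For readers who want to see how positions in the new string can be tracked while substituting, here is a minimal sketch. It is an assumption about one workable approach (whole-word, case-insensitive regex matches applied left to right), not the library's implementation, and `substitute_phrases_sketch` is a hypothetical name.

```python
import re
from typing import Dict, List, Tuple


def substitute_phrases_sketch(
    input_string: str, substitution_phrases: Dict[str, str]
) -> Tuple[str, List[Tuple[int, int]]]:
    """Replace whole-word phrase matches and record spans in the new string."""
    # Collect longer phrases first; the stable sort by start position below
    # then lets the longer phrase win when two matches begin at the same place.
    matches = []
    for original in sorted(substitution_phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(original) + r"\b"
        for m in re.finditer(pattern, input_string, re.IGNORECASE):
            matches.append((m.start(), m.end(), substitution_phrases[original]))
    matches.sort(key=lambda item: item[0])

    pieces: List[str] = []
    positions: List[Tuple[int, int]] = []
    cursor = 0      # how much of the input has been consumed
    new_length = 0  # length of the output built so far
    for start, end, replacement in matches:
        if start < cursor:
            continue  # overlaps an earlier replacement; skip it
        untouched = input_string[cursor:start]
        pieces.append(untouched + replacement)
        new_length += len(untouched)
        positions.append((new_length, new_length + len(replacement)))
        new_length += len(replacement)
        cursor = end
    pieces.append(input_string[cursor:])
    return "".join(pieces), positions
```

Recording each span as the output is built means the offsets already account for earlier replacements that were longer or shorter than the text they replaced.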


def substitute_neutral_gender(
input_string: str) -> Tuple[str, List[Tuple[int, int]]]

Substitute gendered phrases with neutral phrases in the input string.


def substitute_plain_language(
input_string: str) -> Tuple[str, List[Tuple[int, int]]]

Substitute complex phrases with simpler alternatives.


def transformed_sentences(
sentence_list: List[str],
fun: Callable) -> List[Tuple[str, str, List[Tuple[int, int]]]]

Apply a function to a list of sentences and return only the sentences with changed terms. The result is a tuple of the original sentence, new sentence, and the starting and ending position of each changed fragment in the sentence.
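The filtering behavior described above can be sketched in a few lines. The helper `changed_sentences` and the toy transformer `simplify_utilize` are hypothetical; any function returning a new string plus a list of spans would fit the same pattern.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]


def changed_sentences(
    sentence_list: List[str], fun: Callable[[str], Tuple[str, List[Span]]]
) -> List[Tuple[str, str, List[Span]]]:
    """Apply fun to each sentence and keep only those it actually changed."""
    results = []
    for sentence in sentence_list:
        new_sentence, spans = fun(sentence)
        if new_sentence != sentence:
            results.append((sentence, new_sentence, spans))
    return results


def simplify_utilize(sentence: str) -> Tuple[str, List[Span]]:
    """Toy transformer: replace the first "utilize" with "use"."""
    start = sentence.find("utilize")
    if start == -1:
        return sentence, []
    new_sentence = sentence[:start] + "use" + sentence[start + len("utilize"):]
    return new_sentence, [(start, start + len("use"))]


changed = changed_sentences(
    ["Please utilize black ink.", "Sign below."], simplify_utilize
)
```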


def parse_form(in_file: str,
title: Optional[str] = None,
jur: Optional[str] = None,
cat: Optional[str] = None,
normalize: bool = True,
spot_token: Optional[str] = None,
tools_token: Optional[str] = None,
openai_creds: Optional[OpenAiCreds] = None,
rewrite: bool = False,
debug: bool = False)

Read in a PDF, pull out basic stats, attempt to normalize its form fields, and rewrite the in_file with the new fields (if rewrite=True). If you pass a Spot token, we will guess the NSMI code. If you pass OpenAI creds, we will give suggestions for the title and description.


def form_complexity(stats)

Gets a single number of how hard the form is to complete. Higher is harder.