psykoda.preprocess package

Module contents

Preprocessing

class psykoda.preprocess.FastRoundDatetime(time_unit: str)

Bases: object

gen_roundfunc()

class psykoda.preprocess.RoundDatetime(time_unit: str)

Bases: object

class psykoda.preprocess.ScreeningConfig(min: int, max: int = 100000000)

Bases: object

Log screening settings.

max: int = 100000000
min: int

psykoda.preprocess.addr_in_subnets(sub_networks: list) -> Callable[[str], bool]

Build a predicate that tests whether an IP address belongs to any of the given subnets.

Returns

predicate for IP addresses

Return type

in_subnets(addr)

Warning

Optimized for IPv4. Does not support IPv6.
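
The closure pattern described above can be illustrated with the standard-library ipaddress module. This is only a sketch of the idea, not psykoda's implementation: the name make_subnet_predicate is hypothetical, and the real function may use a faster IPv4-specific representation.

```python
import ipaddress
from typing import Callable, List


def make_subnet_predicate(sub_networks: List[str]) -> Callable[[str], bool]:
    """Hypothetical stand-in for addr_in_subnets (IPv4 only)."""
    # Parse each CIDR string once, so the returned predicate does no re-parsing.
    networks = [ipaddress.IPv4Network(net) for net in sub_networks]

    def in_subnets(addr: str) -> bool:
        # True if the address belongs to at least one of the networks.
        ip = ipaddress.IPv4Address(addr)
        return any(ip in net for net in networks)

    return in_subnets


in_subnets = make_subnet_predicate(["10.25.148.0/24", "192.168.0.0/16"])
inside = in_subnets("10.25.148.7")    # True
outside = in_subnets("172.16.0.1")    # False
```

Building the network objects up front keeps the per-address check cheap when the predicate is applied to many log rows.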

psykoda.preprocess.drop_null(df: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame

psykoda.preprocess.exclude_log(log: pandas.core.frame.DataFrame, exclusion: Iterable[dict]) -> pandas.core.frame.DataFrame

psykoda.preprocess.extract_log(log: pandas.core.frame.DataFrame, subnets: Optional[List[str]], include_ports: Optional[List[int]] = None, exclude_ports: Optional[List[int]] = None) -> pandas.core.frame.DataFrame

Extract log rows by subnet and service_dport.

Parameters
  • subnets – List of subnets, in CIDR format, to which the IP addresses to be extracted belong, e.g. ["10.25.148.0/24", "192.168.0.0/16"]. None to extract all IP addresses.

  • include_ports – List of port numbers to extract, e.g. [22, 3389]. None to extract all port numbers.

  • exclude_ports – List of port numbers not to extract, e.g. [22, 3389]. Empty or None to exclude no port numbers. Exclusion takes precedence over inclusion.
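
The port rules above (None includes everything, exclusion wins over inclusion) can be sketched as a plain predicate. The helper name port_selected is hypothetical and is not part of psykoda; treating an empty include_ports list as "select nothing" is an assumption, since only the None case is documented.

```python
from typing import List, Optional


def port_selected(
    port: int,
    include_ports: Optional[List[int]] = None,
    exclude_ports: Optional[List[int]] = None,
) -> bool:
    """Hypothetical helper mirroring the documented port rules."""
    # Exclusion is checked first, so it takes precedence over inclusion.
    if exclude_ports and port in exclude_ports:
        return False
    # None means "no restriction": every remaining port is extracted.
    if include_ports is None:
        return True
    # Assumption: an empty include list selects nothing.
    return port in include_ports


both_listed = port_selected(22, include_ports=[22, 3389], exclude_ports=[22])      # False
only_included = port_selected(3389, include_ports=[22, 3389], exclude_ports=[22])  # True
unrestricted = port_selected(80)                                                   # True
```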

psykoda.preprocess.filter_out(log: pandas.core.frame.DataFrame, column_name: str, filter_patterns: pandas.core.indexes.base.Index) -> pandas.core.frame.DataFrame

Filter out rows according to patterns of column values.

Parameters
  • log

  • column_name – name of data or index column to match patterns against.

  • filter_patterns – patterns to filter out matching rows. If column_name is col.SRC_IP or col.DEST_IP, a pattern is in CIDR notation (anything ipaddress.ip_network() accepts). Otherwise, a pattern is a string that must match the value exactly.
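
The two matching modes described above can be sketched on plain dictionaries instead of a DataFrame. Everything here is illustrative: the names IP_COLUMNS, matches_pattern and filter_out_rows are hypothetical, and the column names "src_ip" and "dest_ip" are assumed stand-ins for col.SRC_IP and col.DEST_IP.

```python
import ipaddress
from typing import Dict, Iterable, List

# Assumption: placeholder names for the col.SRC_IP / col.DEST_IP columns.
IP_COLUMNS = {"src_ip", "dest_ip"}


def matches_pattern(column_name: str, value: str, pattern: str) -> bool:
    """Hypothetical helper: CIDR match for IP columns, exact match otherwise."""
    if column_name in IP_COLUMNS:
        return ipaddress.ip_address(value) in ipaddress.ip_network(pattern)
    return value == pattern


def filter_out_rows(
    rows: List[Dict[str, str]], column_name: str, patterns: Iterable[str]
) -> List[Dict[str, str]]:
    """Keep only the rows whose column value matches none of the patterns."""
    patterns = list(patterns)
    return [
        row
        for row in rows
        if not any(matches_pattern(column_name, row[column_name], p) for p in patterns)
    ]


rows = [
    {"src_ip": "10.0.0.5", "user": "alice"},
    {"src_ip": "192.168.1.9", "user": "bob"},
]
by_subnet = filter_out_rows(rows, "src_ip", ["10.0.0.0/8"])  # drops the 10.0.0.5 row
by_name = filter_out_rows(rows, "user", ["bob"])             # drops the "bob" row
```

Note that non-IP columns use exact equality, so a partial pattern such as "bo" would not remove the "bob" row.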

psykoda.preprocess.round_datetime(dt: datetime.datetime, time_unit: str)

psykoda.preprocess.screening_numlog(log: pandas.core.frame.DataFrame, config: psykoda.preprocess.ScreeningConfig) -> pandas.core.frame.DataFrame

Exclude IP addresses whose number of log entries falls outside [config.min, config.max].

Parameters
  • log – Source log.

  • config – Settings for screening.

Returns

Screened log.

Return type

log
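
The screening rule can be sketched with a per-address count over plain records. This is an illustration under assumptions, not the library's code: ScreeningSketchConfig and screen_by_count are hypothetical names, and counting on a "src_ip" field is a guess, since the documentation does not say which address column is counted.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScreeningSketchConfig:
    """Hypothetical mirror of ScreeningConfig's two fields."""
    min: int
    max: int = 100000000


def screen_by_count(
    records: List[Dict[str, str]],
    config: ScreeningSketchConfig,
    key: str = "src_ip",  # assumption: the counted address column
) -> List[Dict[str, str]]:
    # Count how many log records each address has.
    counts = Counter(record[key] for record in records)
    # Keep records whose address count lies in the closed range [min, max].
    return [r for r in records if config.min <= counts[r[key]] <= config.max]


records = [{"src_ip": "10.0.0.1"}] * 3 + [{"src_ip": "10.0.0.2"}]
frequent = screen_by_count(records, ScreeningSketchConfig(min=2))        # three 10.0.0.1 records
rare = screen_by_count(records, ScreeningSketchConfig(min=1, max=2))     # the single 10.0.0.2 record
```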

psykoda.preprocess.set_index(df: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame